The impact of rewriting on coverage constraint satisfaction

Chiara Accinelli, Barbara Catania, Giovanna Guerrini, Simone Minisi
DIBRIS - University of Genoa, Genoa - Italy
name.surname@dibris.unige.it

ABSTRACT
Due to the impact of analytical processes on our life, an increasing effort is being devoted to the design of technological solutions that help humans in measuring the bias introduced by such processes and in understanding its causes. Existing solutions can refer to either back-end or front-end stages of the data processing pipeline and usually represent bias in terms of some given diversity or fairness constraint. In our previous work [1], we proposed an approach for rewriting filtering and merge operations in pre-processing pipelines into the "closest" operations such that protected groups are adequately represented (i.e., covered) in the result. This is relevant because any under-represented category in an initial or intermediate dataset might lead to an under-representation of that category in any subsequent analytical process. Since many potential rewritings exist, the proposed approach is approximate and relies on a sample-based cardinality estimation, thus introducing a trade-off between the accuracy and the efficiency of the process. In this paper, we investigate this trade-off by first presenting various measures quantifying the error introduced by the rewriting, due to the applied approximation and the selected sample. We then preliminarily evaluate such measures experimentally on a real-world dataset.

© 2021 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2021 Joint Conference (March 23–26, 2021, Nicosia, Cyprus) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION
The impact of data on our society is getting higher and higher, with data about people being more and more often exploited as the basis for decisions that might impact people's lives. Thus, it becomes crucial to ensure that, in the systems enabling such data-based decisions, data are dealt with in a responsible and non-discriminating way [23], in all the steps from acquisition to analysis.

Non-discrimination can be addressed by considering specific diversity and fairness constraints. Diversity allows us to capture the quality of a collection of items with respect to the variety of its constituent elements. Fairness, on the other hand, can be broadly defined as the impartial treatment of individuals and of demographic groups inside data processing tasks.

Among all data processing steps, data pre-processing plays a relevant role when considering non-discrimination issues, since it can introduce technical bias by exacerbating pre-existing bias in society, with an impact on the whole data lifecycle.

When considering pre-processing tasks, an often considered non-discrimination constraint is coverage. Coverage constraints guarantee that the input, or training, dataset includes enough examples for each (protected) category of interest, thus increasing diversity with the aim of limiting the introduction of bias during the next analytical steps. Coverage is quite relevant in the first data processing tasks, like data transformation, since the applied transformations might change the, possibly initially satisfied, coverage of protected categories.

In this paper, we are interested in investigating solutions for detecting bias in data pre-processing steps, defined in terms of coverage constraints, with special reference to filtering and merge data transformations, for mitigating it, and for checking whether the mitigation was effective. Specifically, we focus on classical data transformation operations, often defined in terms of Selection-Projection-Join (SPJ) operations over tabular data, that can reduce the number of records related to some protected or disadvantaged groups, defined in terms of some sensitive attributes, even if such attributes are not directly used in the specification of the data transformation operation.

In this frame, the approach we developed [1] aims at supporting the user by minimally rewriting the transformation operation so that input coverage constraints are guaranteed to be satisfied in the transformation result. Through rewriting, the revised process is traced for further processing, thus guaranteeing transparency. Since many potential rewritings exist, we proposed a sample-based two-step approach for detecting, in an approximate way, the minimal (i.e., the optimal) rewriting of the original query. After transforming the SPJ query into a canonical form, first (in a pre-processing step) the search space of potential rewritings is discretized, so that an approximation of the optimal solution can be detected in the next processing step by looking at the resulting finite set of points. The coverage-based rewriting of the input query is then obtained by visiting the grid produced by the pre-processing step, according to an order that guarantees the fast detection of the minimal rewriting, and by verifying constraint satisfaction through a sample-based approach. The coverage-based rewriting is approximate both because of the discretization of the search space and because of the error in estimating cardinalities and constraint satisfaction on the sample.

In this paper, we start from this approach and investigate its effectiveness in the detection of the optimal coverage-based solution. To this aim, we introduce three main groups of measures quantifying the error that can be generated by the approximate, sample-based approach. The first group of measures deals with the accuracy of the discretization applied in the pre-processing step. The second group deals with the error due to the usage of the sample for grid generation and cardinality estimation, while the third group helps the user in quantitatively evaluating specific solutions obtained through the rewriting.

The generated rewritings are then preliminarily analyzed, in experimental terms, according to the proposed measures, on a real-world dataset. The obtained results provide hints on how to tune approximation and sample related parameters to achieve a good trade-off between accuracy and performance in the context of coverage-based rewriting, by combining the new results related to accuracy presented in this paper with the results related to performance previously presented in [1].

Even if the proposed measures have been defined in the context of a specific coverage-based approach, we believe that they can be of more general value in understanding the role of approximation and sampling during data pre-processing.

The remainder of the paper is structured as follows. In Section 2, we present the overall approach to coverage-based rewriting.
In Section 3, we introduce new measures to quantify, in terms of accuracy, the impact of rewriting. In Section 4, we experimentally analyze the impact of the rewriting according to the introduced measures. Section 5 discusses related work, while Section 6 concludes the paper and outlines future work directions.

2 COVERAGE-BASED REWRITING
We focus on data to be transformed for further data analytical tasks. Data can refer to specific protected (minority or historically disadvantaged) groups, and we aim at guaranteeing that each transformation step of the pre-processing pipeline, based on filtering or merge operations, produces a new dataset containing enough entries for each protected group of interest. In the following, we briefly describe the input data and the proposed technique. Additional details can be found in [1, 3].

Datasets. Our rewriting approach can be applied over a collection of tabular datasets (e.g., relations in a relational database, DataFrames in the Pandas analytical environment) I ≡ I_1, ..., I_r. Among the attributes A_1, ..., A_m of each input dataset, we assume that some discrete-valued attributes S_1, ..., S_n are of particular concern, since they allow the identification of protected groups, and we call them sensitive attributes. Examples of sensitive attributes are gender (with values in {female, male}) and race (with values in {asian, black, hispanic, white}).

Data pre-processing operations. The pre-processing operations we are interested in correspond to monotonic Select-Project-Join (SPJ) queries over input tabular data that might alter the representation (i.e., the coverage) of specific groups of interest, defined in terms of sensitive attribute values. To this aim, we focus on SPJ queries that return, among the others, at least one sensitive attribute (called sensitive SPJ operations or queries). For the sake of simplicity, we assume that selection conditions are defined over numeric attributes, even if the proposed approach can easily be extended to any other ordered domain. Thus, under the considered assumptions, sensitive attributes are not included in selection conditions (a typical assumption in data processing). In the following, when needed, we denote Q by Q⟨v_1, ..., v_d⟩ or Q⟨v⟩, with v ≡ (v_1, ..., v_d), where v_1, ..., v_d are the constant values appearing in the selection conditions sel_i ≡ A_i θ_i v_i in Q.
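As a concrete illustration of this setting, the following minimal sketch represents a tabular dataset as a Pandas DataFrame and evaluates a sensitive SPJ query Q⟨v⟩ as a conjunction of selection conditions. The toy data, the helper name, and its signature are ours, not part of [1]; the attribute names anticipate the running example of Example 2.1 below.

```python
import pandas as pd

# Toy stand-in for a tabular dataset with a sensitive attribute; the
# attribute names follow the running example, the values are invented.
diabetes = pd.DataFrame({
    "num_lab_procedures": [12, 45, 28, 70, 25, 33],
    "num_medications":    [5, 18, 9, 22, 7, 11],
    "gender":             ["female", "male", "female", "male", "female", "female"],
})

def sensitive_spj(df, v):
    """Evaluate Q<v>: a conjunction of selections A_i < v_i over df,
    returning (among the others) the sensitive attribute."""
    mask = pd.Series(True, index=df.index)
    for attr, v_i in v.items():
        mask &= df[attr] < v_i
    return df[mask]

# Q<30, 10>: num_lab_procedures < 30 AND num_medications < 10
result = sensitive_spj(diabetes, {"num_lab_procedures": 30, "num_medications": 10})
```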
Coverage constraints. Conditions over the number of entries, belonging to a given protected group of interest, returned by the execution of SPJ queries can be specified in terms of coverage constraints [6, 27]. Given a sensitive SPJ query Q, with reference to a sensitive attribute S_i and a value s_i belonging to the domain of S_i, i ∈ {1, ..., n}, a coverage constraint with respect to S_i and s_i is denoted by Q↓^{S_i}_{s_i} ≥ k_i and is satisfied by Q over the input dataset I when card(σ_{S_i=s_i}(Q(I))) ≥ k_i holds. For example, choosing gender as a sensitive attribute, a coverage constraint could be Q↓^{gender}_{female} ≥ 10, specifying that the result of Q must contain data related to at least 10 female individuals.

Coverage constraints can be provided together with Q, or they can already be available in the system, as any other integrity constraint. The latter option is useful when they represent generally valid non-discrimination rules that must be satisfied by any query execution.
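Checking a coverage constraint on a materialized query result reduces to a count; the following minimal sketch, continuing the DataFrame example above (function name is ours), verifies Q↓^{gender}_{female} ≥ 10:

```python
def satisfies_coverage(query_result, sensitive_attr, value, k):
    """Check the coverage constraint Q|_{S=s} >= k, i.e.,
    card(sigma_{S=s}(Q(I))) >= k, on a materialized query result."""
    return (query_result[sensitive_attr] == value).sum() >= k

# At least 10 females must appear in the result of Q.
ok = satisfies_coverage(result, "gender", "female", 10)
```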
The approach. Given a dataset I and a set of coverage constraints CC, each selected sensitive SPJ query Q can be rewritten into another query Q^opt_{I,CC}, according to what is presented in [1], so that Q^opt_{I,CC} is the minimal query relaxing Q that guarantees coverage constraint satisfaction when evaluated over the input dataset. Relaxation is reasonable when the user is happy with the specified transformation and wants to keep the result set of the original query after the rewriting. Q^opt_{I,CC} must therefore satisfy the following properties: (i) Q^opt_{I,CC} ≡ Q⟨u⟩, thus Q^opt_{I,CC} is obtained from Q by only changing the selection constants; (ii) Q ⊆ Q^opt_{I,CC}, thus Q^opt_{I,CC} always contains the result of the input query; (iii) all coverage constraints associated with Q are satisfied by Q^opt_{I,CC}(I). The rewriting should also be optimal, i.e., the new query has to satisfy specific minimality properties with respect to the input query Q. In order to make the definition of the minimality properties homogeneous with respect to all the selection attributes A_i in Q, we define them in a transformed unit space, in which the values of each attribute A_i in I are normalized between 0 and 1; underlined symbols denote a query, a dataset, an attribute, and a vector of values in the unit space. Notice that properties (i), (ii), and (iii) are satisfied in the original space if and only if they are satisfied in the normalized one. Minimality can now be stated according to the following two properties: (iv) there is no other query Q′ satisfying conditions (i), (ii), and (iii) such that Q′(I) ⊂ Q^opt_{I,CC}(I) (thus, Q^opt_{I,CC} is the minimal query on I satisfying (i), (ii), and (iii)); (v) Q^opt_{I,CC} ≡ Q⟨u⟩ is the closest query to Q⟨v⟩ according to the Euclidean distance between v and u in the unit space, among those satisfying (i), (ii), (iii), and (iv) (thus, Q^opt_{I,CC} is the coverage-based rewriting syntactically closest to the input query, maximizing proximity and, potentially, user satisfaction).

In order to compute the optimal coverage-based rewriting of an SPJ query Q⟨v⟩, given a set of coverage constraints CC and an instance I, we follow the approach presented in [1], consisting of the three steps shortly discussed in what follows. For simplicity of notation, in presenting the approach and the related examples we do not underline symbols referring to the unit space, even if proximity is always considered in that space.

Canonical form generation. We first translate the selected SPJ queries into a canonical form, in which each selection condition containing the operators >, ≥, or = is translated into one or more equivalent conditions defined in terms of the operator <. For example, any predicate of the form A_i > v_i can be transformed into the predicate −A_i < −v_i. When considering canonical forms, an optimal coverage-based rewriting query is obtained from the input query by replacing one or more selection predicates sel_i ≡ A_i < v_i with a predicate sel′_i ≡ A_i < u_i, with u_i ≥ v_i; sel′_i is called a relaxation of sel_i. Relaxed queries generated through coverage-based rewriting starting from Q⟨v⟩, I, and CC have the form Q⟨u⟩, with u ≥ v, and can be represented as points u in the d-dimensional space defined over the selection attributes. Taking into account the features of the canonical form and property (ii) of the optimal rewriting, it is simple to show that the query point corresponding to the optimal rewriting must be contained in the upper right region of the reference space with respect to the point represented by the input query.

Example 2.1. Consider the Diabetes US dataset (available at https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008) and let gender be the sensitive attribute. Suppose we are interested in finding people whose number of medications is less than 10 and whose number of performed lab tests is less than 30. Additionally, suppose we would like to guarantee that at least 15 females are present in the query result (coverage constraint). The corresponding SPJ query is Q⟨30, 10⟩, defined in SQL as SELECT * FROM Diabetes WHERE num_lab_procedures < 30 AND num_medications < 10. Figure 1(a) shows the data distribution of a small sample of size 100 taken from the Diabetes relation, projected over the attributes referenced in the query selection conditions (points are colored and shaped according to the sensitive attribute values: blue crosses for females and black dots for males). The query corresponds to point (30, 10) in such a space, and the region boxed at the bottom left of point (30, 10) contains the result of Q. The query is already in canonical form, so no preliminary rewriting is required. The search space for detecting coverage-based rewritings of the input query corresponds to the grey region in Figure 1(b).

Figure 1: Data representation and processing. (a) Data distribution; (b) data discretization (4 bins); (c) multi-dimensional grid; (d) multi-dimensional grid visit.

Pre-processing. According to property (ii) of the coverage-based rewriting, given an input query Q⟨v⟩, any coverage-based rewriting is located in the upper right portion of the space defined by v. Thus, the (unit) search space contains infinitely many possible coverage-based rewritings, among which the optimal one should be identified. During the pre-processing step, such a search space is discretized, so that an approximation of the optimal solution can be detected in the next processing step by looking at the resulting finite set of points. To this aim, we first organize the search space as a multi-dimensional grid. The grid has d axes, one for each selection attribute in the canonical form of Q⟨v⟩, and each axis, starting from the query values, is discretized into a fixed set of bins, by using a binning approach typical of histogram generation (e.g., equi-width, dividing each axis into a fixed number of bins of equal size, or equi-depth, in which each bin contains an equal number of instances). Each point at the intersection of the hyperplanes corresponding to bin values corresponds to a sensitive SPJ query containing Q⟨v⟩, thus satisfying condition (ii) of the reference problem. The set of grid points identified in this way is called the discretized search space. The approach is approximate because a query smaller, in terms of minimality and proximity, than the one identified by the algorithm, and corresponding to a coverage-based rewriting of the input query, might exist; but, if it lies inside one grid cell, it cannot be discovered by the algorithm. Notice that the grid is computed starting from I and Q (CC is not used).

Example 2.2. During the pre-processing step, by applying, as an example, the equi-depth binning approach to each axis of our reference example, we obtain the grid represented in Figure 1(b), considering 4 bins for each axis. The points of the resulting discretized search space correspond to the grid intersection points. Each point corresponds to a sensitive SPJ query, obtained from the input one by replacing the selection constants with the grid point coordinates (see Figure 1(c)).
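The per-axis discretization can be sketched with NumPy as follows. This is a simplified illustration under our assumptions (edges start at the query value and end at the attribute maximum; function and variable names are ours), not the exact construction of [1]:

```python
import numpy as np

def axis_bins(values, v_i, n_bins, approach="equi-depth"):
    """Discretize one axis of the search space, from the query value v_i
    to the attribute maximum, into n_bins bins; returns the bin edges."""
    if approach == "equi-width":
        return np.linspace(v_i, values.max(), n_bins + 1)  # equal-size bins
    # equi-depth: edges at quantiles of the values above v_i, so that
    # each bin contains (roughly) the same number of instances
    region = values[values >= v_i]
    return np.quantile(region, np.linspace(0, 1, n_bins + 1))

# One set of edges per selection attribute; the discretized search space
# is the cross product of the edges (the grid of Figure 1(c)).
edges_lab  = axis_bins(diabetes["num_lab_procedures"], 30, 4)
edges_meds = axis_bins(diabetes["num_medications"], 10, 4)
grid_points = [(x, y) for x in edges_lab for y in edges_meds]
```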
Processing. During the processing step, we visit the discretized search space returned by the pre-processing step, starting from the grid point corresponding to the input query. The discretized search space is visited one point after the other, at increasing distance from Q. For each point u, we check whether the associated query Q⟨u⟩ is a coverage-based rewriting of Q⟨v⟩ by estimating the query result cardinality for each protected group referenced by the coverage constraints CC, as well as the query cardinality card(Q(I)). The properties of the discretized search space and of the used canonical form are exploited to prune cells that cannot contain the solution and to further improve the efficiency of the process, by iteratively refining the size and the number of cells during the visit and, as a consequence, by working with a discretized search space of varying granularity [1].

Example 2.3. Figure 1(d) illustrates the processing approach for the considered example. Starting from the grid point corresponding to the input query Q, we estimate the cardinality of Q↓^{gender}_{female} on I, needed for checking constraint satisfaction, and of Q(I), by relying on a sample-based approach, obtaining (2, 8). Since the constraint is not satisfied, we further visit the other points of the discretized search space at increasing distance from Q, checking constraint satisfaction and looking for the minimal rewriting (property (iv)). The visit proceeds as pointed out in Figure 1(d) (the order of the visit is represented by the blue numbers associated with the top right vertex of each cell). Shaded cells are not visited thanks to the pruning effect: if Q⟨u⟩(I) satisfies the coverage constraints, all the points in the upper right portion of the space defined by u cannot satisfy conditions (iv) and (v) of the reference problem, and thus can be discarded. The optimal coverage-based rewriting corresponds to the query Q⟨43, 20⟩ (big blue dot in Figure 1(d)), since (43, 20) is the point at minimum distance from the query point whose corresponding query satisfies the considered coverage constraint (Q↓^{gender}_{female} ≥ 15). The input query is thus rewritten into SELECT * FROM Diabetes WHERE num_lab_procedures < 43 AND num_medications < 20.
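The visit-and-prune strategy can be sketched as follows. This is a simplified version of the processing step: it ignores the iterative cell refinement of [1], and the satisfies callback is assumed to wrap the sample-based constraint check described next.

```python
from math import dist  # Euclidean distance (Python 3.8+)

def visit_grid(grid_points, v, satisfies):
    """Visit candidate rewritings at increasing Euclidean distance from
    the query point v, pruning every point that lies in the upper right
    region of a point already known to satisfy the constraints."""
    found = []
    for u in sorted(grid_points, key=lambda p: dist(p, v)):
        if any(all(u_i >= f_i for u_i, f_i in zip(u, f)) for f in found):
            continue  # dominated by a satisfying point: pruned
        if satisfies(u):
            found.append(u)
    # the closest satisfying point approximates the optimal rewriting
    return min(found, key=lambda p: dist(p, v)) if found else None
```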
Sample-based estimation. The processing step, as well as coverage constraint checking, requires fast and accurate cardinality estimates. To make the processing more efficient, similarly to [16], we rely on sampling-based estimators over samples (uniform, independent, and without replacement) of the input dataset [9], dynamically constructed during the rewriting phase, and on the approach in [18] for generating the sample of joined tables. As is well known, a sample of a given size can be generated so that query selectivity is estimated with an error e and a confidence δ [4]. As an example, if the error is 1% and the confidence is 95%, the sample size should be 9604: this means that 95 out of 100 samples of size 9604 will lead to an estimation error of at most 1%.
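The required sample size follows from the usual normal-approximation bound under the conservative worst case of selectivity 0.5. The derivation below is ours (not spelled out in [4]), but it reproduces, up to rounding, the sample sizes used in Section 4:

```python
from math import ceil
from statistics import NormalDist

def sample_size(error, confidence):
    """Sample size so that a selectivity is estimated within +/- error
    at the given confidence, assuming the worst case p = 0.5."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)  # two-sided z-score
    return ceil(z * z * 0.25 / (error * error))

sample_size(0.01, 0.95)  # 9604, as in the example above
```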
Performance evaluation. The analysis of the efficiency of the proposed coverage-based query rewriting approach was carried out in a previous work [1]. The experiments demonstrated that the time complexity of the proposed algorithms depends on: the number of bins, the used binning approach, the number of selection conditions in the query, the coverage constraint threshold, and the sample size. Specifically, by increasing the number of selection conditions or the number of bins, and therefore the dimensionality of the multi-dimensional grid and the size of the discretized search space, the execution time rapidly increases; the impact of the curse of dimensionality can however be reduced by applying the designed optimizations. Equi-depth binning approaches often lead to a more efficient processing than equi-width approaches, since the distribution of the points in the discretized search space follows the data distribution, thus reducing the number of grid cells with no dataset points and, as a consequence, the number of cardinality estimations. The execution time also depends on the chosen coverage constraint thresholds and on the number of coverage constraints. Finally, the sample size influences the cardinality estimation time and therefore the total execution time.

3 IMPACT EVALUATION
The coverage-based rewriting of the input query can be obtained by visiting the grid, produced as the result of the pre-processing step, and by verifying constraint satisfaction relying on a sample-based approach. The optimal coverage-based rewriting is therefore approximate since: (i) the grid, which corresponds to the discretized search space, might have an impact on the accuracy of the selected coverage-based rewriting; (ii) the estimation error related to the sample usage has an impact on query cardinality estimation and on constraint satisfaction.

It is therefore important to introduce some measures quantifying the error that can be generated. To this aim, in the following we discuss three groups of measures: the first group deals with the approximation error related to the usage of the grid for the discretization of the query search space; the second group deals with the approximation error related to the usage of a sample during the pre-processing and processing phases; the third concerns the error related to the detected optimal rewriting.

3.1 Grid-based accuracy
Due to the usage of a discretized search space, the optimal coverage-based rewriting identified by the proposed approach is the best approximation of an optimal rewriting, given the considered grid. The accuracy related to the usage of a given grid for the detection of the optimal rewriting thus corresponds to the error introduced by the discretization process.

As pointed out before, given a query Q⟨v⟩, the visit proceeds at increasing distance from the query point v. When an optimal rewriting Q⟨u⟩ is reached, this means that the neighbours of u in the discretized search space cannot be optimal rewritings (otherwise the search would have stopped before). On the other hand, another point z might exist that is not included in the search space but such that Q⟨z⟩(I) satisfies CC and either z is closer to the query point v than u, or Q⟨z⟩(I) ⊆ Q⟨u⟩(I). Such a point is an optimal rewriting but, due to the approximation, it cannot be identified by the proposed approach. The approximation error has an impact on the minimality property; in addition, it might lead to a wrong assessment of constraint satisfaction. The error for the identified approximate optimal rewriting Q⟨u⟩, also called grid-based accuracy, can therefore be defined as the maximum distance between u and all its neighbours on the grid preceding it in the search; it is therefore lower than or equal to the diagonal of the grid cells having u as a vertex and closer to v than u. By considering the entire grid, we can quantify the maximum and minimum grid-based accuracy in terms of the maximum and minimum diagonal length of the grid cells (see Figure 2 for a graphical explanation).

Figure 2: Grid-based accuracy (dashed blue line for the maximum diagonal, dotted blue line for the minimum diagonal).

Definition 3.1 (Grid-based accuracy). Let G_b be the grid generated from a dataset I and a query Q using a certain binning approach b. The minimum/maximum grid-based accuracy of G_b, denoted by diag^min_{G_b} and diag^max_{G_b}, is defined as the minimum/maximum diagonal length of the grid cells in G_b, normalized between 0 and 1. □

Example 3.2. Figure 2 shows the normalized cell diagonal lengths for the grid created as discussed in Example 2.2. The grid-based accuracy for this grid varies between 0.05 (dotted blue line), corresponding to queries Q⟨30, 15⟩ and Q⟨30, 20⟩, and 0.73 (dashed blue line), corresponding to query Q⟨90, 75⟩.

Different binning approaches might lead to different grid-based accuracies. In particular, when fixing the reference data interval over each axis, binning approaches based on the data distribution, like equi-depth, lead to a higher variability of the grid-based accuracy, since they generate smaller buckets for dense regions and larger buckets for sparse ones, as stated by the following proposition.

Proposition 1. Let G_w be the grid generated from a dataset I and a query Q using the equi-width approach, and G_d the one generated using the equi-depth approach, with n bins for each axis in both cases. Then: (i) diag^min_{G_d} ≤ diag^min_{G_w}; (ii) diag^max_{G_d} ≥ diag^max_{G_w}. □
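Given per-axis bin edges, the two bounds of Definition 3.1 can be computed axis by axis, since every grid cell combines one bin per axis. A minimal sketch, continuing the binning example of Section 2 (the normalization to the unit space is simplified here to the grid extent):

```python
import numpy as np

def grid_accuracy(edges_per_axis):
    """Min/max grid-based accuracy (Definition 3.1): the shortest and
    the longest cell diagonal, with each axis rescaled to [0, 1]."""
    min_sq = max_sq = 0.0
    for edges in edges_per_axis:
        e = np.asarray(edges, dtype=float)
        widths = np.diff((e - e.min()) / (e.max() - e.min()))
        min_sq += widths.min() ** 2  # extremes add up axis-wise because
        max_sq += widths.max() ** 2  # cells combine one bin per axis
    return np.sqrt(min_sq), np.sqrt(max_sq)

diag_min, diag_max = grid_accuracy([edges_lab, edges_meds])
```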
3.2 Sample-based accuracy
The accuracy of the query rewriting approach depends on the sample data distribution for two main reasons: (i) different sample distributions might lead to the generation of different buckets and therefore of different grids, with an impact on both the pre-processing and the processing steps; (ii) the sample is used for query cardinality estimation and, as a consequence, the estimation error has an impact on the assessment of constraint satisfaction.

Impact on (pre-)processing. In order to evaluate the impact of the sample usage on grid generation, we compare the data distribution of the dataset I with the data distribution of the sample S, both projected over the attributes appearing in the considered query Q. The more similar the two distributions, the lower the impact of the sample selection on the detection of the optimal coverage-based rewriting.

Several metrics have been proposed for quantifying the distance between two multivariate datasets. Many of them, e.g., the Wasserstein metric [17], the Kullback-Leibler divergence [13, 14], and the Jensen-Shannon divergence [10], quantify the distance between the corresponding probability distributions and sometimes, as for the Wasserstein metric, the result might tend to infinity [17]. Probability distributions are not directly available under the considered scenarios and, even if computed, they would introduce a further level of approximation. The Kolmogorov-Smirnov (KS) distance, defined for arbitrary univariate distributions [7, 22], measures the maximum distance, between 0 and 1, between the cumulative distributions of two datasets. Due to its generality, it is widely used, and many, rather complex, extensions to the multivariate case exist.

In our work, we consider a very simple extension of the univariate KS metric, obtained by averaging the KS distances computed for each query attribute. The distance computed between two datasets I and S is denoted by d_KS(I, S). This approach is suitable since, similarly to what we have proposed for the search space construction, where we rely on an aggregation of unidimensional histograms instead of a multidimensional histogram, it aggregates distances defined on each single attribute.

In order to investigate the impact of the KS distance of a sample S on the identification of the optimal rewriting, it is useful to introduce some metrics quantifying the difference in using S instead of the initial dataset I for the detection of the optimal coverage-based rewriting. Such metrics compute the average difference, in terms of minimality, proximity, and solution distance, between the optimal rewritings Q^opt_{S,CC} and Q^opt_{I,CC}, obtained by processing the sample S and the original dataset I, respectively, on a random set of queries QS, uniformly distributed in the query search space:

• Average minimality difference (property (iv) of the optimal rewriting). It quantifies how different Q^opt_{S,CC} and Q^opt_{I,CC} are, on average, with respect to their result set cardinalities when they are executed over the input dataset I (thus, it quantifies, on average, the difference in the relaxation of the original query over the initial dataset):
  m(I, S) ≡ avg_{Q ∈ QS} |card(Q^opt_{S,CC}(I)) − card(Q^opt_{I,CC}(I))| / card(Q(I)).

• Average proximity difference (property (v) of the optimal rewriting). It quantifies how different Q^opt_{S,CC} and Q^opt_{I,CC} are, on average, with respect to their Euclidean distance d() from the input query Q in the unit space, further normalized between 0 and 1 (thus, it quantifies, on average, how far the two optimal rewritings are with respect to the original query):
  p(I, S) ≡ avg_{Q ∈ QS} |d(Q^opt_{S,CC}, Q) − d(Q^opt_{I,CC}, Q)|.

• Average solution distance. It quantifies how different Q^opt_{S,CC} and Q^opt_{I,CC} are, on average, in terms of their Euclidean distance in the unit space, further normalized between 0 and 1 (thus, it quantifies, on average, how far the two optimal rewritings are from each other, without taking the input dataset and the original query into account):
  sd(I, S) ≡ avg_{Q ∈ QS} d(Q^opt_{S,CC}, Q^opt_{I,CC}).
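The averaged KS distance d_KS(I, S) defined above maps directly onto SciPy's two-sample KS statistic, computed attribute by attribute. A minimal sketch (the use of SciPy and the sampling call are ours, for illustration only):

```python
from scipy.stats import ks_2samp

def avg_ks_distance(I, S, attributes):
    """d_KS(I, S): average of the univariate two-sample KS statistics
    computed over each query attribute."""
    stats = [ks_2samp(I[a], S[a]).statistic for a in attributes]
    return sum(stats) / len(stats)

sample = diabetes.sample(frac=0.5, random_state=0)
d_ks = avg_ks_distance(diabetes, sample,
                       ["num_lab_procedures", "num_medications"])
```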
Impact on constraint satisfaction. The error and the confidence related to the considered sample (see Section 2) also have an impact on coverage constraint satisfaction. A constraint CC_i ≡ Q↓^{S_i}_{s_i} ≥ k_i is satisfied on I if card(σ_{S_i=s_i}(Q(I))) ≥ k_i. Since the sample-based estimation of card(σ_{S_i=s_i}(Q(I))) might lead to an error of e × card(I), in order to guarantee that the constraint is also satisfied by the input dataset I, we can tighten the constraint, when evaluated over the sample, as Q↓^{S_i}_{s_i} ≥ k_i + (e × card(I)). This consideration makes the proposed sample-based approach reasonable for coverage constraints in which k_i / card(I) has the same order of magnitude as e. Notice that, by changing the constraint as proposed above, we increase the probability of constraint satisfaction on the input dataset at the price of reducing the proximity of the optimal rewriting with respect to the input query (since the optimal query will be further away from the initial one).
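A minimal sketch of the tightened sample-side check; scaling the sample count up to an estimate over I is our simplification of the sample-based estimator:

```python
def satisfied_on_sample(sample, card_I, sensitive_attr, value, k, error):
    """Check the tightened constraint Q|_{S=s} >= k + e * card(I) using
    a sample-based cardinality estimate."""
    scale = card_I / len(sample)  # uniform sample: scale counts up to I
    estimate = (sample[sensitive_attr] == value).sum() * scale
    return estimate >= k + error * card_I
```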
3.3 Solution-based accuracy
Let S be a sample of a dataset I and Q^opt_{S,CC} = Q⟨u⟩ be the optimal solution obtained from S, given a query Q⟨v⟩ and a set of coverage constraints CC. It is useful to introduce some additional measures to evaluate the quality of the obtained optimal rewriting Q^opt_{S,CC} with respect to the discretized search space identified by the chosen dataset S, taking into account both the applied relaxation with respect to the original query and the approximation error, in line with what we have proposed for the grid-based and sample-based accuracy. To this aim, we propose the following three measures:

• Grid-based accuracy of Q^opt_{S,CC}. According to what was discussed in Subsection 3.1, it can be computed as the maximum distance between u and all its neighbours on the grid preceding it in the search (thus, the maximum diagonal of the grid cells having u as a vertex and closer to v than u).

• Relaxation degree. Similarly to [16], the relaxation degree, first proposed in [1], quantifies, through estimations over the initial dataset I, how much the optimal coverage-based rewriting Q^opt_{S,CC} relaxes the original query Q, as the percentage of newly added tuples with respect to those contained in the original query result:
  (|Q^opt_{S,CC}(I)| − |Q(I)|) / |Q(I)|.

• Proximity. It can be computed as the Euclidean distance between the optimal coverage-based rewriting Q^opt_{S,CC} and Q in the unit space, further normalized between 0 and 1, thus indicating how far the optimal rewriting is from the original query.

Example 3.3. The optimal coverage-based rewriting of the running example corresponds to the query Q⟨43, 20⟩ (see Example 2.3). The grid-based accuracy of this solution is 0.16 (see Figure 2), since this is the maximum length between the solution and its neighbours on the grid. Since card(Q⟨43, 20⟩(I)) = 38 and card(Q(I)) = 8 (see Figure 2), the relaxation degree is 3.75 (to guarantee constraint satisfaction, the cardinality of the rewritten query is about 4 times that of the original query). Finally, the proximity, i.e., the Euclidean distance between (30, 10) and (43, 20) in the unit space, normalized between 0 and 1, is 0.19, thus corresponding to a relaxation of 19% with respect to the maximum query, returning the whole dataset.
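A minimal sketch of the last two measures, reproducing the relaxation degree of Example 3.3; normalizing proximity by the unit-space diameter is our reading of "normalized between 0 and 1":

```python
from math import dist, sqrt

def relaxation_degree(card_rewritten, card_original):
    """Percentage of newly added tuples w.r.t. the original result."""
    return (card_rewritten - card_original) / card_original

def proximity(u, v):
    """Euclidean distance between rewriting u and query v, both already
    mapped to the unit space, normalized by the maximum distance."""
    return dist(u, v) / sqrt(len(u))

relaxation_degree(38, 8)  # 3.75, as in Example 3.3
```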
4 EXPERIMENTAL RESULTS
In this section, we present some preliminary experimental results with the aim of analyzing the accuracy of the proposed query rewriting approach.

4.1 Experimental Setup
All the experiments were conducted on a PC with an Intel Core i5-8300H 2.30 GHz CPU and 16 GB of main memory, running Microsoft Windows 10. All programs were coded in Python 3.8.

The experiments refer to a real dataset, stored in PostgreSQL: Diabetes US (cf. Example 2.1), representing 10 years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks (100,000 instances). It includes over 50 features representing patient and hospital outcomes. For our experiments, we use gender as the sensitive attribute and we add a coverage constraint on female.

For this dataset, we generated many random samples of increasing size, guaranteeing a variable percentage of error (1% or 3%) and a variable level of confidence (95% or 99%). More precisely, we considered the following sample sizes: 1067 (3% error and 95% confidence), 1843 (3% error and 99% confidence), 9604 (1% error and 95% confidence), and 16588 (1% error and 99% confidence). For each sample size, we generated 5 random samples, for a total of 20 samples. The samples are all small enough to be stored in main memory.

In order to evaluate the accuracy of the proposed approach, we randomly generated a set of 1000 queries, with different selectivities, and we compared the impact of the grid and of the considered sample on the detection of the optimal coverage-based rewriting, by considering the measures presented in Section 3. All the selected queries contain three selection conditions (thus leading to a three-dimensional grid) over the three numerical attributes with the highest number of distinct values, namely number_emergency, num_lab_procedures, and num_medications; the selection values are picked at random from the attribute value ranges in the dataset. For the sake of simplicity, join queries are not considered in these preliminary experiments.

For each query, we then defined the coverage constraint by taking into account the considered query and in such a way that it is satisfied neither on the dataset nor on the considered samples.

The experiments were performed by considering two distinct binning approaches during the pre-processing step: (i) equi-width∗, corresponding to an equi-width approach, dividing each axis into a fixed number of bins of equal size, set as the minimum between a selected number (denoted by #bins in the experimental results) and the number of distinct values of the considered attribute in the dataset; (ii) equi-depth∗, in which each bin, defined as for the equi-width∗ approach, contains a constant number of instances. A variable number of bins, namely 4, 8, 16, 32, and 64, has been considered for specific experiments.

We then performed three groups of experiments, aimed at analyzing the grid-based, the sample-based, and the solution-based accuracy of the optimal coverage-based rewriting.

Figure 3: Grid-based accuracy, when varying the number of bins. (a) Maximal grid-based accuracy; (b) minimal grid-based accuracy.

Figure 4: KS distance from the samples to the reference dataset.

  Sample     KS Distance      Sample     KS Distance
  1067_1     0.0185           1843_1     0.0156
  1067_2     0.0185           1843_2     0.0183
  1067_3     0.0185           1843_3     0.0130
  1067_4     0.0199           1843_4     0.0130
  1067_5     0.0198           1843_5     0.0119
  Mean       0.0190           Mean       0.0144
  Variance   4.38e-7          Variance   5.36e-6

  Sample     KS Distance      Sample     KS Distance
  9604_1     0.0058           16588_1    0.0034
  9604_2     0.0064           16588_2    0.0044
  9604_3     0.0026           16588_3    0.0032
  9604_4     0.0071           16588_4    0.0055
  9604_5     0.0134           16588_5    0.0035
  Mean       0.0071           Mean       0.0040
  Variance   1.24e-5          Variance   7.32e-7

Table 1: Sample-based accuracy measures.

  Sample   KS distance   avg min. difference   avg prox. difference   avg sol. distance
  9604_3   0.0026        0.0275                0.1072                 0.1403
  9604_5   0.0134        0.1350                0.0995                 0.1291
4.2 Experimental evaluation
4.2.1 Grid-based accuracy. The first group of experiments aims at analyzing the maximum and minimum grid-based accuracy, as presented in Subsection 3.1, by varying the number of bins, with respect to the selected binning approach, namely equi-depth∗ and equi-width∗. To this aim, we selected a sample of size 16588 (sample 16588_3 in Figure 4) and we considered all the generated 1000 random queries. The obtained results are independent of the query selectivity; therefore, in the following we discuss those obtained with a selectivity equal to 2.5%.

Figure 3 shows that both the maximum and the minimum grid-based accuracy decrease when increasing the number of bins, independently of the chosen binning approach. This is because, by increasing the number of bins, the space is discretized into a higher number of cells, thus obtaining cells of smaller size.

From Figure 3 we also observe the behaviour described by Proposition 1: the maximal grid-based accuracy obtained with the equi-depth∗ approach is always higher than that obtained with the equi-width∗ approach; the minimal accuracy behaves in the opposite way.

4.2.2 Sample-based accuracy. The second group of experiments deals with the analysis of the sample-based accuracy, according to the measures introduced in Subsection 3.2. To this aim, we considered all the samples described in Subsection 4.1 and, for each of them, we computed the KS distance with respect to the input dataset.

The results, presented in Figure 4, show that the obtained values are very similar and quite small: for most sample sizes, the greatest differences concern the third decimal digit. As expected, by increasing the sample size, the KS distance tends to decrease.

In order to evaluate the impact of the KS distance on the sample-based accuracy measures, we considered the sample size leading to the highest variance (9604) and we selected the samples with the highest difference between the corresponding KS distances (namely, 9604_3 and 9604_5). We then computed the sample-based accuracy measures on the set of 1000 random queries. Table 1 shows the results obtained by considering the equi-depth∗ approach and 32 bins (similar values have been obtained for the equi-width∗ approach and other numbers of bins). As can be seen, there is no clear ordering between the two samples when considering such measures. In particular, sample 9604_3 is better than sample 9604_5 with respect to minimality; however, for proximity and solution distance the opposite holds. Thus, it seems that the KS measure is not good enough for discriminating between samples with a different behaviour with respect to the detection of the optimal coverage-based rewriting. Additional work is therefore needed for investigating or defining alternative functions able to discriminate between samples under the considered scenario.

4.2.3 Solution-based accuracy. The last group of experiments aims at analyzing the solution-based accuracy, according to the measures introduced in Subsection 3.3, by varying the number of bins, with respect to the selected binning approach, namely equi-depth∗ and equi-width∗. To this aim, we selected sample 16588_3 (i.e., the biggest sample with the smallest KS distance with respect to the initial dataset) and we considered all the generated 1000 random queries. The obtained results show a different behaviour depending on the region in which the optimal solution is located, either dense or sparse. In particular, Figure 5 shows that, as expected, by increasing the number of bins, the grid-based accuracy of the solution decreases as well. However, depending on where the optimal solution is located, the equi-depth∗ and the equi-width∗ approaches behave in different ways. More precisely, when the solution region is dense (Figure 5(a)), the accuracy is lower with the equi-depth∗ approach, since in this case the higher density leads to smaller bins.
On the other hand, when the solution region is sparse (Figure 5(b)), the accuracy is usually lower with the equi-width∗ approach, since in this case the lower density leads to longer bins under the equi-depth∗ approach.

Figure 5: Grid-based accuracy of the solution. (a) Solution in a dense region; (b) solution in a sparse region.

Figure 6: Relaxation degree. (a) Solution in a dense region; (b) solution in a sparse region.

Figure 7: Proximity. (a) Solution in a dense region; (b) solution in a sparse region.

A similar behaviour can be observed for the relaxation degree (Figure 6) and the proximity (Figure 7): by increasing the number of bins, they decrease for both binning approaches. When the solution is contained in a dense region, equi-depth∗ behaves better than equi-width∗, especially for low numbers of bins. We further notice that, for very low numbers of bins, proximity and grid-based accuracy coincide (the farthest neighbour of the solution is the query point itself).

Finally, notice that dense regions have a greater impact on query cardinalities and, as a consequence, on the relaxation degree, whose values are usually higher in this case (starting from 8 bins), independently of the selected binning approach. On the other hand, dense regions tend to generate optimal solutions with lower proximity (more evident with the equi-depth∗ approach, which is more sensitive to the data distribution), since constraint satisfaction can be obtained on query points closer to the initial one. In general, for uniformly distributed datasets, in which no sparse regions can be detected, equi-depth∗ can be considered the best option. By combining the obtained results with those presented in [1, 3], related to performance, a number of bins equal to 16 can be considered a good compromise between effectiveness and efficiency.

5 RELATED WORK
The interest in coverage constraints was introduced in [5, 6], drawing inspiration from the literature on diversity [11]. The problem of evaluating the coverage of a given dataset has been considered in the context of the MithraLabel system [12, 24], in which the lack of coverage is modeled as a widget in the nutritional label [26] of a dataset. Once the lack of coverage has been identified, the smallest number of data points needed to hit all the "large uncovered spaces" is identified, with the aim of helping data owners achieve coverage through a data repairing approach. When protected categories are defined in terms of many attributes, the identification of the attribute patterns associated with coverage problems might lead to performance issues, due to the combinatorial explosion of such patterns. Efficient techniques addressing this problem, inspired by set enumeration and association rule mining, have been proposed in [6]. To fix coverage unsatisfaction, additional data can be acquired. Since data acquisition has a cost in terms of data processing, techniques have been presented in [6] for determining the patterns that can be covered given a certain maximum cost. An efficient approach for coverage analysis, given a set of attributes across multiple tables, is presented in [27]. As pointed out by the previous discussion, most existing approaches chase coverage through data repair and focus on efficiency issues. By contrast, we consider accuracy for coverage-based query rewriting during data transformation, thus complementing existing approaches.

The technique considered in this paper relies on rewriting. Other fairness-aware rewriting approaches have been proposed for OLAP queries [19, 20]. There, bias is defined in terms of causal fairness (checking for causal relationships from the sensitive attributes to the outcome) and is detected, explained, and resolved through rewriting. We focus, instead, on data transformations in the presence of coverage constraints.

Impact evaluation is quite relevant in the design and execution of non-discriminating pipelines, which are usually very complex in real-world scenarios. Various systems have been designed for supporting the user during this activity. Among them, we recall: Fair-DAGs [25], an open-source library aiming at representing data processing pipelines in terms of a directed acyclic graph (DAG) and at identifying distortions with respect to protected groups as the data flows through the pipeline; FairPrep [21], an environment for investigating the impact of fairness-enhancing interventions inside data processing pipelines; and AI Fairness 360 [8], an open-source Python toolkit for algorithmic fairness, aimed at facilitating the transition of fairness-aware research algorithms to an industrial setting and at providing a common framework to share and evaluate algorithms.
6 CONCLUDING REMARKS
Rewriting approaches have been recognized as an interesting means for enforcing specific non-discrimination properties while guaranteeing transparency. In this paper, we started from the approach proposed in [1] with the aim of investigating the impact of rewriting on coverage-constraint satisfaction. The approach is approximate and relies on a sample for both the construction of the solution search space and the detection of the optimal rewriting. Three different groups of measures have been proposed for quantifying the accuracy induced by the approximation and the impact of the sample on the detection of the optimal solution.

Preliminary experimental results show that: (i) different binning approaches lead to different grid-based accuracy degrees; (ii) common measures for computing the distance between distributions are not effective for analyzing their behaviour in the detection of the optimal coverage-based solution; (iii) the number of bins used in the generation of the search space has an impact on the accuracy of the detected solution, and not only on the performance [1], while the optimal binning approach depends on the position of the optimal rewriting in the search space; (iv) a number of bins equal to 16 represents a good compromise between accuracy and efficiency.

The proposed approach is at the basis of covRew [2], a Python toolkit for rewriting slicing operations in pre-processing pipelines, ensuring coverage constraint satisfaction. covRew takes as input a two-dimensional tabular dataset I, with the related sensitive attribute specification for the identification of protected groups, a processing pipeline represented as a Pandas script [15], and a set of coverage constraints. It includes three main components: (i) a pipeline analyzer, which identifies candidate operations for rewriting; (ii) a pipeline rewriter, which transforms the operations selected by the user according to the input coverage constraints; and (iii) an impact evaluator, assessing the impact of the rewriting through the grid-based and solution-based measures presented in Section 3. Such measures could lead the user to reconsider some of the choices made in the selection of the operations to be rewritten, thus producing a new annotated script.

Future work is needed in order to understand how to define new distance functions between distributions, focusing on their behaviour during coverage-based rewriting. An additional issue concerns the definition and analysis of further solution-based measures, evaluating the quality of the detected solution with respect to the solution that would have been obtained without considering a sample, as well as their integration in the covRew prototype system.
REFERENCES
[1] C. Accinelli, S. Minisi, B. Catania. Coverage-based rewriting for data preparation. In Proc. of the EDBT/ICDT Workshops, CEUR-WS.org, 2020.
[2] C. Accinelli, B. Catania, G. Guerrini, S. Minisi. covRew: a Python toolkit for pre-processing pipeline rewriting ensuring coverage constraint satisfaction. In Proc. of the 24th Int. Conf. on Extending Database Technology (EDBT), 2021.
[3] C. Accinelli, B. Catania, G. Guerrini, S. Minisi. A coverage-based approach to data transformations. In preparation.
[4] S. Agarwal et al. Knowing when you're wrong: building fast and reliable approximate query processing systems. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 481–492, 2014.
[5] A. Asudeh, H. V. Jagadish, J. Stoyanovich. Towards responsible data-driven decision making in score-based systems. IEEE Data Eng. Bull., 42(3):76–87, 2019.
[6] A. Asudeh, Z. Jin, H. V. Jagadish. Assessing and remedying coverage for a given dataset. In Proc. of the 35th IEEE Int. Conf. on Data Engineering (ICDE), pages 554–565, 2019.
[7] M. Basseville. Divergence measures for statistical data processing. 2010.
[8] R. K. Bellamy et al. AI Fairness 360: An extensible toolkit for detecting and mitigating algorithmic bias. IBM Journal of Research and Development, 63(4/5), 2019.
[9] G. Cormode, M. N. Garofalakis, P. J. Haas, C. Jermaine. Synopses for massive data: Samples, histograms, wavelets, sketches. Found. Trends Databases, 4(1–3):1–294, 2012.
[10] I. Dagan, L. Lee, F. Pereira. Similarity-based methods for word sense disambiguation. arXiv preprint cmp-lg/9708010, 1997.
[11] M. Drosou, H. V. Jagadish, E. Pitoura, J. Stoyanovich. Diversity in big data: A review. Big Data, 5(2):73–84, 2017.
[12] Z. Jin et al. MithraCoverage: A system for investigating population bias for intersectional fairness. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 2721–2724, 2020.
[13] S. Kullback. Information theory and statistics. Courier Corporation, 1997.
[14] S. Kullback, R. A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, 1951.
[15] W. McKinney. Python for data analysis: Data wrangling with Pandas, NumPy, and IPython. O'Reilly Media, Inc., 2012.
[16] C. Mishra, N. Koudas. Interactive query refinement. In Proc. of the 12th Int. Conf. on Extending Database Technology (EDBT), pages 862–873, 2009.
[17] I. Olkin, F. Pukelsheim. The distance between two random vectors with given dispersion matrices. Linear Algebra and its Applications, 48:257–263, 1982.
[18] M. Riondato et al. The VC-dimension of SQL queries and selectivity estimation through sampling. In Machine Learning and Knowledge Discovery in Databases - European Conference (ECML PKDD), Proceedings, Part II, pages 661–676, 2011.
[19] B. Salimi et al. HypDB: A demonstration of detecting, explaining and resolving bias in OLAP queries. Proc. VLDB Endow., 11(12):2062–2065, 2018.
[20] B. Salimi, J. Gehrke, D. Suciu. Bias in OLAP queries: Detection, explanation, and removal. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 1021–1035, 2018.
[21] S. Schelter, Y. He, J. Khilnani, J. Stoyanovich. FairPrep: Promoting data to a first-class citizen in studies on fairness-enhancing interventions. In Proc. of the 23rd Int. Conf. on Extending Database Technology (EDBT), pages 395–398, 2020.
[22] M. A. Stephens. EDF statistics for goodness of fit and some comparisons. Journal of the American Statistical Association, 69(347):730–737, 1974.
[23] J. Stoyanovich, B. Howe, H. V. Jagadish. Responsible data management. Proc. VLDB Endow., 13(12):3474–3488, 2020.
[24] C. Sun et al. MithraLabel: Flexible dataset nutritional labels for responsible data science. In Proc. of the 28th Int. Conf. on Information and Knowledge Management (CIKM), pages 2893–2896, 2019.
[25] K. Yang, B. Huang, J. Stoyanovich, S. Schelter. Fairness-aware instrumentation of preprocessing pipelines for machine learning. In Workshop on Human-In-the-Loop Data Analytics (HILDA), 2020.
[26] K. Yang et al. A nutritional label for rankings. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 1773–1776, 2018.
[27] Y. Lin, Y. Guan, A. Asudeh, H. V. Jagadish. Identifying insufficient data coverage in databases with multiple relations. Proc. VLDB Endow., pages 2229–2242, 2020.