=Paper= {{Paper |id=Vol-2841/PIE+Q_2 |storemode=property |title=The impact of rewriting on coverage constraint satisfaction |pdfUrl=https://ceur-ws.org/Vol-2841/PIE+Q_2.pdf |volume=Vol-2841 |authors=Chiara Accinelli,Barbara Catania,Giovanna Guerrini,Simone Minisi |dblpUrl=https://dblp.org/rec/conf/edbt/AccinelliCGM21a }} ==The impact of rewriting on coverage constraint satisfaction== https://ceur-ws.org/Vol-2841/PIE+Q_2.pdf
    The impact of rewriting on coverage constraint satisfaction
                           Chiara Accinelli, Barbara Catania, Giovanna Guerrini, Simone Minisi
                                                        DIBRIS - University of Genoa, Genoa - Italy
                                                             name.surname@dibris.unige.it

ABSTRACT
Due to the impact of analytical processes on our lives, an increasing effort is being devoted to the design of technological solutions that help humans measure the bias introduced by such processes and understand its causes. Existing solutions can refer to either back-end or front-end stages of the data processing pipeline and usually represent bias in terms of some given diversity or fairness constraint. In our previous work [1], we proposed an approach for rewriting filtering and merge operations in pre-processing pipelines into the “closest” operations so that protected groups are adequately represented (i.e., covered) in the result. This is relevant because any under-represented category in an initial or intermediate dataset might lead to an under-representation of that category in any subsequent analytical process. Since many potential rewritings exist, the proposed approach is approximate and relies on sample-based cardinality estimation, thus introducing a trade-off between the accuracy and the efficiency of the process. In this paper, we investigate this trade-off by first presenting various measures quantifying the error introduced by the rewriting, due to the applied approximation and the selected sample. Then, we (preliminarily) evaluate such measures experimentally on a real-world dataset.

1 INTRODUCTION
The impact of data on our society is getting higher and higher, with data about people being more and more often exploited as the basis for decisions that might affect people’s lives. Thus, it becomes crucial to ensure that, in the systems enabling such data-based decisions, data are dealt with in a responsible and non-discriminating way [23], in all the steps from acquisition to analysis.

Non-discrimination can be addressed by considering specific diversity and fairness constraints. Diversity allows us to capture the quality of a collection of items with respect to the variety of its constituent elements. Fairness, on the other hand, can be broadly defined as the impartial treatment of individuals and of demographic groups inside data processing tasks.

Among all data processing steps, data pre-processing plays a relevant role when considering non-discrimination issues, since it can introduce technical bias by exacerbating bias that pre-exists in society, with an impact on the whole data lifecycle.

When considering pre-processing tasks, an often considered non-discrimination constraint is coverage. Coverage constraints guarantee that the input, or training, dataset includes enough examples for each (protected) category of interest, thus increasing diversity with the aim of limiting the introduction of bias during the next analytical steps. Coverage is quite relevant in the first data processing tasks, like data transformation, since the applied transformations might change the, possibly initially satisfied, coverage of protected categories.

In this paper, we are interested in investigating solutions for detecting bias in data pre-processing steps, defined in terms of coverage constraints, with special reference to filtering and merge data transformations, for mitigating it, and for checking whether the mitigation was effective. Specifically, we focus on classical data transformation operations, often defined in terms of Selection-Projection-Join (SPJ) operations over tabular data, that can reduce the number of records related to some protected or disadvantaged groups, defined in terms of some sensitive attributes, even if such attributes are not directly used in the specification of the data transformation operation.

In this frame, the approach we developed [1] aims at supporting the user by minimally rewriting the transformation operation so that the input coverage constraints are guaranteed to be satisfied in the transformation result. Through rewriting, the revised process is traced for further processing, thus guaranteeing transparency. Since many potential rewritings exist, we proposed a sample-based two-step approach for detecting, in an approximate way, the minimal (i.e., the optimal) rewriting of the original query. After transforming the SPJ query into a canonical form, first (in a pre-processing step) the search space of potential rewritings is discretized, so that an approximation of the optimal solution can be detected in the next processing step by looking at the resulting finite set of points. The coverage-based rewriting of the input query is then obtained by visiting the grid produced by the pre-processing step, according to an order that guarantees the fast detection of the minimal rewriting, and by verifying constraint satisfaction through a sample-based approach. The coverage-based rewriting is approximate both because of the discretization of the search space and because of the error in estimating cardinalities and constraint satisfaction on the sample.

In this paper, we start from this approach and investigate its effectiveness in detecting the optimal coverage-based solution. To this aim, we introduce three main groups of measures quantifying the error that can be generated by the approximate, sample-based approach. The first group of measures deals with the accuracy of the discretization applied in the pre-processing step. The second group deals with the error due to the usage of the sample for grid generation and cardinality estimation, while the third group helps the user in quantitatively evaluating specific solutions obtained through the rewriting.

The generated rewritings are then (preliminarily) analyzed experimentally in terms of the proposed measures, on a real-world dataset. The obtained results provide hints on how to tune approximation and sample-related parameters to achieve a good trade-off between accuracy and performance in the context of coverage-based rewriting, by combining the new results on accuracy presented in this paper with the results on performance previously presented in [1].

Even if the proposed measures have been defined in the context of a specific coverage-based approach, we believe that they can be of more general value in understanding the role of approximation and sampling during data pre-processing.

© 2021 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2021 Joint Conference (March 23–26, 2021, Nicosia, Cyprus) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
The remainder of the paper is structured as follows. In Section 2, we present the overall approach to coverage-based rewriting. In Section 3, we introduce new measures to quantify, in terms of accuracy, the impact of rewriting. In Section 4, we experimentally analyze the impact of the rewriting according to the introduced measures. Section 5 discusses related work, while Section 6 concludes the paper and outlines future work directions.

2 COVERAGE-BASED REWRITING
We focus on data to be transformed for further data analytical tasks. Data can refer to specific protected (minority or historically disadvantaged) groups, and we aim at guaranteeing that each transformation step in the pre-processing pipeline, based on filtering or merge operations, produces a new dataset containing enough entries for each protected group of interest. In the following, we briefly describe the input data and the proposed technique. Additional details can be found in [1, 3].

Datasets. Our rewriting approach can be applied over a collection of tabular datasets (e.g., relations in a relational database, Data Frames in the Pandas analytical environment) 𝐼 ≡ 𝐼_1, ..., 𝐼_𝑟. Among the attributes 𝐴_1, ..., 𝐴_𝑚 of each input dataset, we assume that some discrete-valued attributes 𝑆_1, ..., 𝑆_𝑛 are of particular concern, since they allow the identification of protected groups, and we call them sensitive attributes. Examples of sensitive attributes are gender (with values in {𝑓𝑒𝑚𝑎𝑙𝑒, 𝑚𝑎𝑙𝑒}) and race (with values in {𝑎𝑠𝑖𝑎𝑛, 𝑏𝑙𝑎𝑐𝑘, ℎ𝑖𝑠𝑝𝑎𝑛𝑖𝑐, 𝑤ℎ𝑖𝑡𝑒}).

Data pre-processing operations. The pre-processing operations we are interested in correspond to monotonic Select-Project-Join (SPJ) queries over input tabular data that might alter the representation (i.e., the coverage) of specific groups of interest, defined in terms of sensitive attribute values. To this aim, we focus on SPJ queries that return, among the others, at least one sensitive attribute (called sensitive SPJ operations or queries). For the sake of simplicity, we assume that selection conditions are defined over numeric attributes, even if the proposed approach can be easily extended to any other ordered domain. Under the considered assumptions, sensitive attributes are not included in selection conditions (a typical assumption in data processing).

In the following, when needed, we denote 𝑄 by 𝑄⟨𝑣_1, ..., 𝑣_𝑑⟩ or 𝑄⟨𝑣⟩, 𝑣 ≡ (𝑣_1, ..., 𝑣_𝑑), where 𝑣_1, ..., 𝑣_𝑑 are the constant values appearing in the selection conditions 𝑠𝑒𝑙_𝑖 ≡ 𝐴_𝑖 𝜃_𝑖 𝑣_𝑖 in 𝑄.

Coverage constraints. Conditions over the number of entries belonging to a given protected group of interest returned by the execution of SPJ queries can be specified in terms of coverage constraints [6, 27]. Given a sensitive SPJ query 𝑄, with reference to a sensitive attribute 𝑆_𝑖, and a value 𝑠_𝑖 belonging to the domain of 𝑆_𝑖, 𝑖 ∈ {1, ..., 𝑛}, a coverage constraint with respect to 𝑆_𝑖 and 𝑠_𝑖 is denoted by 𝑄 ↓^{𝑆_𝑖}_{𝑠_𝑖} ≥ 𝑘_𝑖 and it is satisfied by 𝑄 over the input dataset 𝐼 when 𝑐𝑎𝑟𝑑(𝜎_{𝑆_𝑖 = 𝑠_𝑖}(𝑄(𝐼))) ≥ 𝑘_𝑖 holds. For example, choosing gender as a sensitive attribute, a coverage constraint could be 𝑄 ↓^{𝑔𝑒𝑛𝑑𝑒𝑟}_{𝑓𝑒𝑚𝑎𝑙𝑒} ≥ 10, specifying that the result of 𝑄 must contain data related to at least 10 female individuals.

Coverage constraints can be provided together with 𝑄 or they can already be available in the system, as any other integrity constraint. This could be useful when they represent generally valid non-discrimination rules that must be satisfied by any query execution.
The approach. Given a dataset 𝐼 and a set of coverage constraints 𝐶𝐶, each selected sensitive SPJ query 𝑄 can be rewritten into another query 𝑄^{𝑜𝑝𝑡}_{𝐼,𝐶𝐶}, according to what is presented in [1], so that 𝑄^{𝑜𝑝𝑡}_{𝐼,𝐶𝐶} is the minimal query relaxing 𝑄 that guarantees coverage constraint satisfaction when evaluated over the input dataset. Relaxation is reasonable when the user is happy with the specified transformation and wants to keep the result set of the original query after the rewriting. 𝑄^{𝑜𝑝𝑡}_{𝐼,𝐶𝐶} must therefore satisfy the following properties: (i) 𝑄^{𝑜𝑝𝑡}_{𝐼,𝐶𝐶} ≡ 𝑄⟨𝑢⟩, thus 𝑄^{𝑜𝑝𝑡}_{𝐼,𝐶𝐶} is obtained from 𝑄 by only changing the selection constants; (ii) 𝑄 ⊆ 𝑄^{𝑜𝑝𝑡}_{𝐼,𝐶𝐶}, thus 𝑄^{𝑜𝑝𝑡}_{𝐼,𝐶𝐶} always contains the result of the input query; (iii) all coverage constraints associated with 𝑄 are satisfied by 𝑄^{𝑜𝑝𝑡}_{𝐼,𝐶𝐶}(𝐼). The rewriting should be optimal, i.e., the new query has to satisfy specific minimality properties with respect to the input query 𝑄. In order to make the definition of the minimality properties homogeneous with respect to all the selection attributes 𝐴_𝑖 in 𝑄, we define them in a transformed unit space, in which the values of each attribute 𝐴_𝑖 in 𝐼 are normalized between 0 and 1. We denote with underlined symbols 𝑄, 𝐼, 𝐴_𝑖, 𝑣 a query, a dataset, an attribute, and a vector of values, respectively, in the unit space. Notice that properties (i), (ii), and (iii) are satisfied in the original space if and only if they are satisfied in the normalized one. Minimality can now be stated according to the following two properties: (iv) there is no other query 𝑄′ satisfying conditions (i), (ii), and (iii) such that 𝑄′(𝐼) ⊂ 𝑄^{𝑜𝑝𝑡}_{𝐼,𝐶𝐶}(𝐼) (thus, 𝑄^{𝑜𝑝𝑡}_{𝐼,𝐶𝐶} is the minimal query on 𝐼 satisfying (i), (ii), and (iii)); (v) 𝑄^{𝑜𝑝𝑡}_{𝐼,𝐶𝐶} ≡ 𝑄⟨𝑢⟩ is the closest query to 𝑄⟨𝑣⟩, according to the Euclidean distance between 𝑣 and 𝑢 in the unit space, among those satisfying (i), (ii), (iii), and (iv) (thus, 𝑄^{𝑜𝑝𝑡}_{𝐼,𝐶𝐶} is the coverage-based rewriting syntactically closest to the input query, maximizing proximity and, potentially, user satisfaction).

In order to compute the optimal coverage-based rewriting of an SPJ query 𝑄⟨𝑣⟩, given a set of coverage constraints 𝐶𝐶 and an instance 𝐼, we follow the approach presented in [1], consisting of three steps, shortly discussed in what follows. For simplicity of notation, in presenting the approach and the related examples we do not underline symbols referring to the unit space, even if proximity is always considered in that space.

Canonical form generation. We first translate the selected SPJ queries into a canonical form, in which each selection condition containing one of the operators >, ≥, = is translated into one or more equivalent conditions defined in terms of the operator <. For example, any predicate of the form 𝐴_𝑖 > 𝑣_𝑖 can be transformed into the predicate −𝐴_𝑖 < −𝑣_𝑖. When considering canonical forms, an optimal coverage-based rewriting query is obtained from the input query by replacing one or more selection predicates 𝑠𝑒𝑙_𝑖 ≡ 𝐴_𝑖 < 𝑣_𝑖 with a predicate 𝑠𝑒𝑙′_𝑖 ≡ 𝐴_𝑖 < 𝑢_𝑖, with 𝑢_𝑖 ≥ 𝑣_𝑖; 𝑠𝑒𝑙′_𝑖 is called a relaxation of 𝑠𝑒𝑙_𝑖. Relaxed queries generated through coverage-based rewriting starting from 𝑄⟨𝑣⟩, 𝐼, and 𝐶𝐶 thus have the form 𝑄⟨𝑢⟩, with 𝑢 ≥ 𝑣, and can be represented as points 𝑢 in the 𝑑-dimensional space defined over the selection attributes. Taking into account the features of the canonical form and property (ii) of the optimal rewriting, it is simple to show that the query point corresponding to the optimal rewriting must be contained in the upper right region of the reference space with respect to the point representing the input query.
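To make the canonical form concrete, the following sketch (our illustration; the triple-based predicate representation is an assumption, not a data structure from [1]) rewrites >-predicates in terms of <, so that relaxing a predicate always means raising its right-hand constant:

```python
def to_canonical(attr: str, op: str, v: float):
    """Represent a predicate as (coefficient, attr, constant), meaning
    coefficient * attr < constant; relaxation = raising the constant."""
    if op == "<":
        return (1, attr, v)       # A_i < v_i is already canonical
    if op == ">":
        return (-1, attr, -v)     # A_i > v_i  <=>  -A_i < -v_i
    raise NotImplementedError("'>=' and '=' translate analogously "
                              "into one or two '<' conditions")

print(to_canonical("num_medications", ">", 10))  # (-1, 'num_medications', -10)
```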
[Figure 1: Data representation and processing. Panels: (a) Data distribution; (b) Data discretization (4 bins); (c) Multi-dimensional grid; (d) Multi-dimensional grid visit.]

Example 2.1. Consider the Diabetes US dataset¹ and let 𝑔𝑒𝑛𝑑𝑒𝑟 be the sensitive attribute. Suppose we are interested in finding people whose number of medications is less than 10 and whose number of performed lab tests is less than 30. Additionally, suppose we would like to guarantee that at least 15 females are present in the query result (coverage constraint). The corresponding SPJ query is 𝑄⟨30, 10⟩, defined in SQL as SELECT * FROM Diabetes WHERE num_lab_procedures < 30 AND num_medications < 10. Figure 1(a) shows the data distribution corresponding to a small sample of size 100 taken from the Diabetes relation, projected over the attributes referenced in the query selection conditions (points are colored and shaped according to the sensitive attribute values: blue crosses for females and black dots for males). The query corresponds to point (30, 10) in such a space and the region boxed at the bottom left of point (30, 10) contains the result of 𝑄.

The query is already in canonical form, so no preliminary rewriting is required. The search space for detecting coverage-based rewritings of the input query corresponds to the grey region in Figure 1(b). ^

¹ https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008

Pre-processing. According to property (ii) of the coverage-based rewriting, given an input query 𝑄⟨𝑣⟩, any coverage-based rewriting is located in the upper right portion of the space defined by 𝑣. Thus, the (unit) search space contains infinitely many possible coverage-based rewritings, among which the optimal one should be identified. During the pre-processing step, such a search space is discretized, so that an approximation of the optimal solution can be detected in the next processing step by looking at the resulting finite set of points. To this aim, we first organize the search space as a multi-dimensional grid. The grid has 𝑑 axes, one for each selection attribute in the canonical form of 𝑄⟨𝑣⟩, and each axis, starting from the query values, is discretized into a fixed set of bins, by using a binning approach typical of histogram generation (e.g., equi-width, dividing each axis into a fixed number of bins of equal size, or equi-depth, in which each bin contains an equal number of instances). Each point 𝑢 at the intersection of the hyperplanes corresponding to bin values corresponds to a sensitive SPJ query containing 𝑄⟨𝑣⟩, thus satisfying condition (ii) of the reference problem. The set of grid points identified in this way is called the discretized search space. The approach is approximate because a query that is smaller, in terms of minimality and proximity, than the one identified by the algorithm, and that corresponds to a coverage-based rewriting of the input query, might exist; however, if it lies inside a grid cell, it cannot be discovered by the algorithm. Notice that the grid is computed starting from 𝐼 and 𝑄 (𝐶𝐶 is not used).
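A minimal sketch of the per-axis discretization (ours; numpy-based, and only one of many possible renderings of the binning strategies named above) might look as follows; the grid points are then the Cartesian product of the per-axis boundaries:

```python
import numpy as np

def axis_bins(values: np.ndarray, v_q: float, n_bins: int,
              equi_depth: bool) -> np.ndarray:
    """Bin boundaries for one axis, from the query constant v_q up to the
    attribute maximum. Equi-width splits [v_q, max] into equal intervals;
    equi-depth places boundaries at quantiles of the data above v_q."""
    if equi_depth:
        upper = values[values >= v_q]
        edges = np.quantile(upper, np.linspace(0, 1, n_bins + 1))
        return np.unique(np.concatenate(([v_q], edges)))
    return np.linspace(v_q, values.max(), n_bins + 1)
```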
Example 2.2. During the pre-processing step, by applying, as an example, the equi-depth binning approach to each axis of our reference example, we obtain the grid represented in Figure 1(b), considering 4 bins for each axis. The points of the resulting discretized search space correspond to the grid intersection points. Each point corresponds to a sensitive SPJ query, obtained from the input one by replacing the selection constants with the grid point coordinates (see Figure 1(c)). ^

Processing. During the processing step, we visit the discretized search space returned by the pre-processing step, starting from the grid point corresponding to the input query. The discretized search space is visited one point after the other, at increasing distance from 𝑄. For each point 𝑢, we check whether the associated query 𝑄⟨𝑢⟩ is a coverage-based rewriting of 𝑄⟨𝑣⟩ by estimating the query result cardinality for each protected group referenced by the coverage constraints 𝐶𝐶, as well as the query cardinality 𝑐𝑎𝑟𝑑(𝑄(𝐼)).

The properties of the discretized search space and of the used canonical form are taken into account for pruning cells that cannot contain the solution and for further improving the efficiency of the process, by iteratively refining the size and the number of cells during the visit and, as a consequence, by working with a discretized search space of varying granularity [1].

Example 2.3. Figure 1(d) illustrates the processing approach for the considered example. Starting from the grid point corresponding to the input query 𝑄, we estimate the cardinality of 𝑄 ↓^{𝑔𝑒𝑛𝑑𝑒𝑟}_{𝑓𝑒𝑚𝑎𝑙𝑒} on 𝐼, needed for checking constraint satisfaction, and of 𝑄(𝐼), by relying on a sample-based approach, obtaining (2, 8). Since the constraint is not satisfied, we further visit the other points of the discretized search space at increasing distance from 𝑄, checking constraint satisfaction and looking for the minimal rewriting (property (iv)). The visit proceeds as pointed out in Figure 1(d) (the order of the visit is represented by the blue numbers associated with the top right vertex of each cell). Shaded cells are not visited thanks to the pruning effect: if 𝑄⟨𝑢⟩(𝐼) satisfies the coverage constraints, all the points in the upper right portion of the space defined by 𝑢 will not satisfy conditions (iv) and (v) of the reference problem, thus they can be discarded. The optimal coverage-based rewriting corresponds to query 𝑄⟨43, 20⟩ (big blue dot in Figure 1(d)), since (43, 20) is the point at the minimum distance from the query point such that the corresponding query satisfies the considered coverage constraint (𝑄 ↓^{𝑔𝑒𝑛𝑑𝑒𝑟}_{𝑓𝑒𝑚𝑎𝑙𝑒} ≥ 15). The input query is thus rewritten into SELECT * FROM Diabetes WHERE num_lab_procedures < 43 AND num_medications < 20. ^
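The visit can be sketched as follows (our simplification of the algorithm in [1]: all grid points dominating the query point are ranked by Euclidean distance from it in the unit space, and the first point whose associated query satisfies the constraints is returned; the iterative grid refinement and the cell pruning of the actual algorithm are omitted):

```python
import itertools, math

def visit_grid(axes_bins, v, satisfies):
    """axes_bins: per-axis bin boundaries (unit space); v: query point;
    satisfies(u): sample-based check of the coverage constraints for Q<u>.
    Returns the closest grid point u >= v whose query satisfies CC."""
    candidates = (u for u in itertools.product(*axes_bins)
                  if all(ui >= vi for ui, vi in zip(u, v)))    # property (ii)
    for u in sorted(candidates, key=lambda u: math.dist(u, v)):
        if satisfies(u):       # first hit is distance-minimal (property (v))
            return u
    return None
```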
Sample-based estimation. The processing step, as well as coverage constraint checking, requires fast and accurate cardinality estimates. To make the processing more efficient, similarly to [16], we rely on sampling-based estimators over samples of the input dataset (uniform, independent, and without replacement) [9], dynamically constructed during the rewriting phase, and on the approach in [18] for generating the sample of joined tables. As is well known, a sample of a given size can be generated so that query selectivity can be estimated with an error 𝑒 and confidence 𝛿 [4]. As an example, if the error is 1% and the confidence is 95%, the sample size should be 9604: this means that, for 95 samples of size 9604 out of 100, the estimation error will be at most 1%.
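The sample sizes quoted here and in Section 4 are consistent with the classical worst-case bound for estimating a proportion, 𝑛 = 𝑧²/(4𝑒²), with 𝑧 the two-sided normal quantile for confidence 𝛿 (our rendering; the paper itself only cites [4] for the guarantee). A sketch reproducing them:

```python
from math import ceil
from statistics import NormalDist

def sample_size(e: float, confidence: float) -> int:
    """Worst-case (p = 0.5) size so that a selectivity estimate is within
    error e with the given confidence: n = z^2 * 0.25 / e^2."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided quantile
    return ceil(z * z * 0.25 / (e * e))

print(sample_size(0.01, 0.95))  # 9604
print(sample_size(0.01, 0.99))  # 16588
print(sample_size(0.03, 0.95))  # 1068 (Section 4 reports 1067, rounding down)
```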
Performance evaluation. The analysis of the efficiency of the proposed coverage-based query rewriting approach has been investigated in previous work [1]. The experiments demonstrated that the time complexity of the proposed algorithms depends on: the number of bins, the used binning approach, the number of selection conditions in the query, the coverage constraint threshold, and the sample size. Specifically, by varying the number of selection conditions or the number of bins, and therefore the dimensionality of the multi-dimensional grid and the size of the discretized search space, the execution time rapidly increases; the impact of the curse of dimensionality can however be reduced by applying the designed optimizations. Equi-depth binning approaches often lead to a more efficient processing than equi-width approaches, since the distribution of points in the discretized search space follows the data distribution, thus reducing the number of grid cells with no dataset points and, as a consequence, the number of cardinality estimations. The execution time also depends on the chosen coverage constraint thresholds and the number of coverage constraints. Finally, the sample size influences the cardinality estimation time and therefore the total execution time.

3 IMPACT EVALUATION
The coverage-based rewriting of the input query can be obtained by visiting the grid, produced as the result of the pre-processing step, and by verifying constraint satisfaction relying on a sample-based approach. The optimal coverage-based rewriting is therefore approximate since: (i) the grid, which corresponds to the discretized search space, might have an impact on the accuracy of the selected coverage-based rewriting; (ii) the estimation error related to the sample usage has an impact on query cardinality estimation and on constraint satisfaction.

It is therefore important to introduce some measures quantifying the error that can be generated. To this aim, in the following we discuss three groups of measures: the first group deals with the approximation error related to the usage of the grid for the discretization of the query search space; the second group deals with the approximation error related to the usage of a sample during the pre-processing and processing phases; the third concerns the error related to the detected optimal rewriting.

3.1 Grid-based accuracy
Due to the usage of a discretized search space, the optimal coverage-based rewriting identified by the proposed approach is the best approximation of an optimal rewriting, given the considered grid. The accuracy related to the usage of a given grid for the detection of the optimal rewriting thus corresponds to the error introduced by the discretization process.

As pointed out before, given a query 𝑄⟨𝑣⟩, the visit proceeds at increasing distance from the query point 𝑣. When an optimal rewriting 𝑄⟨𝑢⟩ is reached, this means that the neighbours of 𝑢 in the discretized search space cannot be optimal rewritings (otherwise the search would have stopped before). On the other hand, another point 𝑧 might exist that is not included in the search space but such that 𝑄⟨𝑧⟩(𝐼) satisfies 𝐶𝐶 and either it is closer to the query point 𝑣 than 𝑢 or 𝑄⟨𝑧⟩(𝐼) ⊆ 𝑄⟨𝑢⟩(𝐼). Such a point is an optimal rewriting but, due to the approximation, it cannot be identified by the proposed approach. The approximation error for the identified approximate optimal rewriting 𝑄⟨𝑢⟩, also called grid-based accuracy, can therefore be defined as the maximum distance between 𝑢 and all its neighbours on the grid preceding it in the search; the accuracy is therefore lower than or equal to the diagonal of the grid cells having 𝑢 as a vertex and closer to 𝑣 than 𝑢. By considering the entire grid, we can quantify the maximum and minimum grid-based accuracy in terms of the maximum and the minimum diagonal length of grid cells (see Figure 2 for a graphical explanation).

[Figure 2: Grid-based accuracy (dashed blue line for the maximum diagonal, dotted blue line for the minimum diagonal).]

Definition 3.1 (Grid-based accuracy). Let 𝐺^𝑏 be the grid generated from a dataset 𝐼 and a query 𝑄 using a certain binning approach 𝑏. The minimum/maximum grid-based accuracy of 𝐺^𝑏, denoted by 𝑑𝑖𝑎𝑔^{𝑚𝑖𝑛}_{𝐺^𝑏} and 𝑑𝑖𝑎𝑔^{𝑚𝑎𝑥}_{𝐺^𝑏}, is defined as the minimum/maximum diagonal length of the grid cells in 𝐺^𝑏, normalized between 0 and 1. □

Example 3.2. Figure 2 shows the normalized cell diagonal lengths for the grid created as discussed in Example 2.2. The grid-based accuracy for this grid varies between 0.05 (dotted blue line), corresponding to queries 𝑄⟨30, 15⟩ and 𝑄⟨30, 20⟩, and 0.73 (dashed blue line), corresponding to query 𝑄⟨90, 75⟩. ^

Different binning approaches might lead to a different grid-based accuracy. In particular, when fixing the reference data interval over each axis, binning approaches based on the data distribution, like equi-depth, lead to a higher variability of grid-based accuracy, since they generate smaller buckets for dense regions and larger buckets for sparse ones, as stated by the following proposition.

Proposition 1. Let 𝐺^𝑤 be the grid generated from a dataset 𝐼 and a query 𝑄 using the equi-width approach and 𝐺^𝑑 the one generated using the equi-depth approach, with 𝑛 bins for each axis in both cases. Then: (i) 𝑑𝑖𝑎𝑔^{𝑚𝑖𝑛}_{𝐺^𝑑} ≤ 𝑑𝑖𝑎𝑔^{𝑚𝑖𝑛}_{𝐺^𝑤}; (ii) 𝑑𝑖𝑎𝑔^{𝑚𝑎𝑥}_{𝐺^𝑑} ≥ 𝑑𝑖𝑎𝑔^{𝑚𝑎𝑥}_{𝐺^𝑤}. □
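Since every grid cell is a box combining one bin per axis, the two bounds of Definition 3.1 can be computed directly from the per-axis bin widths. A small numpy sketch (ours; the division by the unit-hypercube diagonal √𝑑 is our assumption for the normalization between 0 and 1):

```python
import numpy as np

def diag_min_max(axes_bins):
    """axes_bins: list of sorted boundary arrays, one per axis, already in
    the unit space. Cell diagonal = sqrt(sum of squared per-axis widths),
    so min/max combine the narrowest/widest bin of every axis."""
    widths = [np.diff(np.asarray(b)) for b in axes_bins]
    d_min = np.sqrt(sum(w.min() ** 2 for w in widths))
    d_max = np.sqrt(sum(w.max() ** 2 for w in widths))
    norm = np.sqrt(len(axes_bins))      # normalize by the space diagonal
    return d_min / norm, d_max / norm
```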
3.2 Sample-based accuracy
The accuracy of the query rewriting approach depends on the sample data distribution for two main reasons: (i) different sample distributions might lead to the generation of different buckets and therefore of different grids, with an impact on both the pre-processing and processing steps; (ii) the sample is used for query cardinality estimation and, as a consequence, the estimation error has an impact on the minimality property; in addition, it might lead to a wrong assessment of constraint satisfaction.

Impact on (pre-)processing. In order to evaluate the impact of the sample usage on grid generation, we compare the data distribution of the dataset 𝐼 with the data distribution of the sample 𝑆, both projected over the attributes appearing in the considered query 𝑄. The more similar the two distributions, the lower the impact of the sample selection on the detection of the optimal coverage-based rewriting.

Several metrics have been proposed for quantifying the distance between two multivariate datasets. Many of them, e.g., the Wasserstein metric [17], the Kullback-Leibler divergence [13, 14], and the Jensen-Shannon divergence [10], quantify the distance between the corresponding probability distributions and sometimes, as for the Wasserstein metric, the result might tend to infinity [17]. Probability distributions are not directly available under the considered scenarios and, even if computed, they introduce a further level of approximation in the computation. The Kolmogorov-Smirnov (KS) distance, defined for arbitrary univariate distributions [7, 22], measures the maximum distance, between 0 and 1, between the cumulative distributions of two datasets. Due to its generality, it is widely used and many, rather complex, extensions to the multivariate case exist.

In our work, we consider a very simple extension of the univariate KS metric, obtained by averaging the KS distance computed for each query attribute. The distance computed between two datasets 𝐼 and 𝑆 is denoted by 𝑑_{𝐾𝑆}(𝐼, 𝑆). This approach is suitable since, similarly to what we have proposed for the search space construction, where we rely on an aggregation of unidimensional histograms instead of a multidimensional histogram, it aggregates distances defined on each single attribute.
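This averaged KS distance is straightforward to compute with scipy (a sketch of our metric; `ks_2samp` computes the univariate two-sample KS statistic):

```python
import pandas as pd
from scipy.stats import ks_2samp

def d_ks(I: pd.DataFrame, S: pd.DataFrame, query_attrs) -> float:
    """d_KS(I, S): average of the univariate two-sample KS statistics
    over the attributes appearing in the query selection conditions."""
    stats = [ks_2samp(I[a], S[a]).statistic for a in query_attrs]
    return sum(stats) / len(stats)
```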
In order to investigate the impact of the KS distance for a sample 𝑆 on the identification of the optimal rewriting, it is useful to introduce some metrics quantifying the difference in using 𝑆 instead of the initial dataset 𝐼 for the detection of the optimal coverage-based rewriting. Such metrics compute the average difference, in terms of minimality, proximity, and solution distance, between the optimal rewritings 𝑄^{𝑜𝑝𝑡}_{𝑆,𝐶𝐶} and 𝑄^{𝑜𝑝𝑡}_{𝐼,𝐶𝐶}, obtained by processing the sample 𝑆 and the original dataset 𝐼, respectively, on a random set of queries 𝑄𝑆, uniformly distributed in the query search space (a computational sketch follows the list):

• Average minimality difference (property (iv) of the optimal rewriting). It quantifies how different 𝑄^{𝑜𝑝𝑡}_{𝑆,𝐶𝐶} and 𝑄^{𝑜𝑝𝑡}_{𝐼,𝐶𝐶} are, on average, with respect to their result set cardinalities when they are executed over the input dataset 𝐼 (thus, it quantifies, on average, the difference in the relaxation of the original query over the initial dataset): 𝑚(𝐼, 𝑆) ≡ 𝑎𝑣𝑔_{𝑄∈𝑄𝑆} |𝑐𝑎𝑟𝑑(𝑄^{𝑜𝑝𝑡}_{𝑆,𝐶𝐶}(𝐼)) − 𝑐𝑎𝑟𝑑(𝑄^{𝑜𝑝𝑡}_{𝐼,𝐶𝐶}(𝐼))| / 𝑐𝑎𝑟𝑑(𝑄^{𝑜𝑝𝑡}_{𝐼,𝐶𝐶}(𝐼)).

• Average proximity difference (property (v) of the optimal rewriting). It quantifies how different 𝑄^{𝑜𝑝𝑡}_{𝑆,𝐶𝐶} and 𝑄^{𝑜𝑝𝑡}_{𝐼,𝐶𝐶} are, on average, with respect to their Euclidean distance 𝑑() from the input query 𝑄, in the unit space, further normalized between 0 and 1 (thus, it quantifies, on average, how far the two optimal rewritings are with respect to the original query): 𝑝(𝐼, 𝑆) ≡ 𝑎𝑣𝑔_{𝑄∈𝑄𝑆} |𝑑(𝑄^{𝑜𝑝𝑡}_{𝑆,𝐶𝐶}, 𝑄) − 𝑑(𝑄^{𝑜𝑝𝑡}_{𝐼,𝐶𝐶}, 𝑄)|.

• Average solution distance. It quantifies how different 𝑄^{𝑜𝑝𝑡}_{𝑆,𝐶𝐶} and 𝑄^{𝑜𝑝𝑡}_{𝐼,𝐶𝐶} are, on average, in terms of their Euclidean distance in the unit space, further normalized between 0 and 1 (thus, it quantifies, on average, how far the two optimal rewritings are from each other, without taking the input dataset and the original query into account): 𝑠𝑑(𝐼, 𝑆) ≡ 𝑎𝑣𝑔_{𝑄∈𝑄𝑆} 𝑑(𝑄^{𝑜𝑝𝑡}_{𝑆,𝐶𝐶}, 𝑄^{𝑜𝑝𝑡}_{𝐼,𝐶𝐶}).
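Given, for each query in 𝑄𝑆, the two optimal rewritings computed on 𝑆 and on 𝐼, the three measures can be sketched as follows (our rendering; `card_I` evaluates |𝑄⟨𝑢⟩(𝐼)| and `dist` is the normalized Euclidean distance in the unit space):

```python
def sample_accuracy(results, card_I, dist):
    """results: list of (v, u_S, u_I) triples -- the query point and the
    optimal rewritings computed on the sample S and on the dataset I."""
    n = len(results)
    m  = sum(abs(card_I(uS) - card_I(uI)) / card_I(uI)
             for _, uS, uI in results) / n
    p  = sum(abs(dist(uS, v) - dist(uI, v)) for v, uS, uI in results) / n
    sd = sum(dist(uS, uI) for _, uS, uI in results) / n
    return m, p, sd
```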
Impact on constraint satisfaction. The error and the confidence related to the considered sample (see Section 2) have an impact also on coverage constraint satisfaction. A constraint 𝐶𝐶_𝑖 ≡ 𝑄 ↓^{𝑆_𝑖}_{𝑠_𝑖} ≥ 𝑘_𝑖 is satisfied on 𝐼 if 𝑐𝑎𝑟𝑑(𝜎_{𝑆_𝑖 = 𝑠_𝑖}(𝑄(𝐼))) ≥ 𝑘_𝑖. Since the sample-based estimation of 𝑐𝑎𝑟𝑑(𝜎_{𝑆_𝑖 = 𝑠_𝑖}(𝑄(𝐼))) might lead to an error of 𝑒 × 𝑐𝑎𝑟𝑑(𝐼), in order to guarantee that the constraint is also satisfied by the input dataset 𝐼, we can modify the constraint, when evaluated over the sample, as: 𝑄 ↓^{𝑆_𝑖}_{𝑠_𝑖} ≥ 𝑘_𝑖 + (𝑒 × 𝑐𝑎𝑟𝑑(𝐼)). This consideration makes the proposed sample-based approach reasonable for coverage constraints in which 𝑘_𝑖/𝑐𝑎𝑟𝑑(𝐼) has the same order of magnitude as 𝑒. Notice that, by changing the constraint as proposed above, we increase the probability of constraint satisfaction on the input dataset at the price of reducing the proximity of the optimal rewriting with respect to the input query (since the optimal query will be further away from the initial one).
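In code, this correction is a one-line strengthening of the threshold before the constraint is checked on the sample (a sketch; the use of `ceil` to keep the threshold integer is our choice):

```python
from math import ceil

def corrected_threshold(k: int, e: float, card_I: int) -> int:
    """Strengthen k_i to k_i + e*card(I) when checking the constraint on
    the sample, so that satisfaction there implies satisfaction on I
    (up to the confidence of the sample)."""
    return k + ceil(e * card_I)
```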
3.3 Solution-based accuracy
Let 𝑆 be a sample of dataset 𝐼 and 𝑄^{𝑜𝑝𝑡}_{𝑆,𝐶𝐶} = 𝑄⟨𝑢⟩ be the optimal solution obtained from 𝑆, given a query 𝑄⟨𝑣⟩ and a set of coverage constraints 𝐶𝐶. It could be useful to introduce some additional measures to evaluate the quality of the obtained optimal rewriting 𝑄^{𝑜𝑝𝑡}_{𝑆,𝐶𝐶} with respect to the discretized search space identified by the chosen dataset 𝑆, taking into account both the applied relaxation with respect to the original query and the approximation error, in line with what we have proposed for grid-based and sample-based accuracy. To this aim, we propose the following three measures (the last two are sketched in code after the list):

• Grid-based accuracy of 𝑄^{𝑜𝑝𝑡}_{𝑆,𝐶𝐶}. According to what was discussed in Subsection 3.1, it can be computed as the maximum distance between 𝑢 and all its neighbours on the grid preceding it in the search (thus, the maximum diagonal of the grid cells having 𝑢 as a vertex and closer to 𝑣 than 𝑢).

• Relaxation degree. Similarly to [16], the relaxation degree, first proposed in [1], quantifies, through estimations over the initial dataset 𝐼, how much the optimal coverage-based rewriting 𝑄^{𝑜𝑝𝑡}_{𝑆,𝐶𝐶} relaxes the original query 𝑄, as the percentage of newly added tuples with respect to those contained in the original query result: (|𝑄^{𝑜𝑝𝑡}_{𝑆,𝐶𝐶}(𝐼)| − |𝑄(𝐼)|) / |𝑄(𝐼)|.

• Proximity. It can be computed as the Euclidean distance between the optimal coverage-based rewriting 𝑄^{𝑜𝑝𝑡}_{𝑆,𝐶𝐶} and 𝑄 in the unit space, further normalized between 0 and 1, thus indicating how far the optimal rewriting is from the original query.
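Relaxation degree and proximity are immediate to compute once the rewriting point 𝑢 is known; a sketch (ours, with query points expressed in the unit space):

```python
import math

def relaxation_degree(card_Qu: int, card_Q: int) -> float:
    """(|Q<u>(I)| - |Q(I)|) / |Q(I)|: fraction of newly added tuples."""
    return (card_Qu - card_Q) / card_Q

def proximity(u, v) -> float:
    """Euclidean distance between rewriting and query in the unit space;
    dividing by sqrt(d) (our assumption) normalizes it into [0, 1]."""
    return math.dist(u, v) / math.sqrt(len(v))

print(relaxation_degree(38, 8))  # 3.75, as in Example 3.3
```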
                                                                         instances. A variable number of bins, namely 4, 8, 16, 32, 64, has
   Example 3.3. The optimal coverage-based rewriting of the              been considered for specific experiments.
running example corresponds to query 𝑄 ⟨43, 20⟩ (see Example                 We then performed three groups of experiments aimed at
2.3). The grid-based accuracy of such solution is 0.16 (see Figure       analyzing the grid-based, the sample-based, and the solution-
2), since this is the maximum length between the solution and            based accuracy of the optimal coverage-based rewriting.
its neighbours on the grid. Since 𝑐𝑎𝑟𝑑 (𝑄 ⟨43, 20⟩(𝐼 )) = 38 and         2 https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+
𝑐𝑎𝑟𝑑 (𝑄 (𝐼 )) = 8 (see Figure 2), the relaxation degree is 3.75 (to      1999-2008
[Figure 3: Grid-based accuracy, when varying the number of bins. Panels: (a) Maximal grid-based accuracy; (b) Minimal grid-based accuracy.]

Sample     KS Distance      Sample      KS Distance
1067_1     0.0185           1843_1      0.0156
1067_2     0.0185           1843_2      0.0183
1067_3     0.0185           1843_3      0.0130
1067_4     0.0199           1843_4      0.0130
1067_5     0.0198           1843_5      0.0119
Mean       0.0190           Mean        0.0144
Variance   4.38e-7          Variance    5.36e-6

Sample     KS Distance      Sample      KS Distance
9604_1     0.0058           16588_1     0.0034
9604_2     0.0064           16588_2     0.0044
9604_3     0.0026           16588_3     0.0032
9604_4     0.0071           16588_4     0.0055
9604_5     0.0134           16588_5     0.0035
Mean       0.0071           Mean        0.0040
Variance   1.24e-5          Variance    7.32e-7

Figure 4: KS distance from the samples to the reference dataset

Sample     KS distance    avg min. difference    avg prox. difference    avg sol. distance
9604_3     0.0026         0.0275                 0.1072                  0.1403
9604_5     0.0134         0.1350                 0.0995                  0.1291

Table 1: Sample-based accuracy measures

4.2    Experimental evaluation
    4.2.1 Grid-based accuracy. The first group of experiments         (namely, 9604_3 and 9604_5). We then computed the sample-
aims at analyzing the maximum and minimum grid-based ac-              based accuracy measures on the set of 1000 random queries.
curacy, as presented in Subsection 3.1, by varying the number            Table 1 shows the results obtained by considering the equi-
of bins, with respect to the selected binning approach, namely        depth∗ approach and 32 bins (similar values has been obtained
equi-depth∗ and equi-width∗ . To this aim, we selected a sample       for the equi-width∗ and other numbers of bins). As you can see,
with size 16588 (sample 16588_3 in Figure 4) and we considered        there is no clear ordering between the two samples when consid-
all the generated 1000 random queries. The obtained results are       ering such measures. In particular, sample 9604_3 is better than
independent from the query selectivity, therefore in the following    sample 9604_5 with respect to minimality; however, for prox-
we discuss those obtained with selectivity equal to 2.5%.             imity and solution distance the opposite result holds. Thus, it
    Figure 3 shows that both the maximum and the minimum              seems that the KS measure is not good enough for discriminating
grid-based accuracy decrease by increasing the number of bins,        between samples with a different behaviour with respect to the
independently from the chosen binning approach. This is because,      detection of the optimal coverage-based rewriting. Additional
by increasing the number of bins, the space is discretized into a     work is therefore needed for investigating or defining alterna-
higher number of cells, thus obtaining cells of smaller size.         tive functions able to discriminate between samples under the
    From Figure 3 we also observe the behaviour described by          considered scenario.
Proposition 1: the maximal grid-based accuracy obtained with
the equi-depth∗ approach is always higher than the grid-based             4.2.3 Solution-based accuracy. The last group of experiments
accuracy obtained with the equi-width∗ approach; minimal accu-        aims at analyzing the solution-based accuracy, according to the
racy behaves in the opposite way.                                     measures introduced in Subsection 3.3, by varying the number of
4.2.2 Sample-based accuracy. The second group of experiments deals with the
analysis of the sample-based accuracy, according to the measures introduced in
Subsection 3.2. To this aim, we considered all the samples described in
Subsection 4.1 and, for each of them, we computed the KS distance with respect
to the input dataset.
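The KS (Kolmogorov-Smirnov) distance between a sample and the dataset it is
drawn from can be computed, for instance, with SciPy; the data below is
synthetic and serves only to show the call:

    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(0)
    dataset = rng.normal(size=100_000)  # stand-in for the input dataset
    sample = rng.choice(dataset, size=9604, replace=False)

    res = ks_2samp(dataset, sample)  # max gap between the two empirical CDFs
    print(f"KS distance: {res.statistic:.4f}")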
Results, presented in Figure 4, show that the obtained values are very similar
and quite small: for most sample sizes, the greatest differences concern the
third decimal digit. As expected, by increasing the sample size, the KS
distance tends to decrease.

In order to evaluate the impact of the KS distance on sample-based accuracy
measures, we considered the sample size leading to the highest variance (9604)
and we selected the samples with the highest difference between the
corresponding KS distances (namely, 9604_3 and 9604_5). We then computed the
sample-based accuracy measures on the set of 1000 random queries.

Table 1 shows the results obtained by considering the equi-depth∗ approach and
32 bins (similar values have been obtained for the equi-width∗ approach and
other numbers of bins). As the table shows, there is no clear ordering between
the two samples with respect to such measures. In particular, sample 9604_3 is
better than sample 9604_5 with respect to minimality; however, for proximity
and solution distance the opposite holds. Thus, the KS measure does not seem
good enough to discriminate between samples with a different behaviour in the
detection of the optimal coverage-based rewriting. Additional work is
therefore needed to investigate or define alternative functions able to
discriminate between samples in the considered scenario.

4.2.3 Solution-based accuracy. The last group of experiments aims at analyzing
the solution-based accuracy, according to the measures introduced in
Subsection 3.3, by varying the number of bins with respect to the selected
binning approach, namely equi-depth∗ and equi-width∗. To this aim, we selected
sample 16588_3 (i.e., the biggest sample with the smallest KS distance with
respect to the initial dataset) and we considered all the 1000 generated
random queries. The obtained results show a different behaviour depending on
the region in which the optimal solution is located, either dense or sparse.
In particular, Figure 5 shows that, as expected, by increasing the number of
bins, the grid-based accuracy of the solution decreases as well. However,
depending on where the optimal solution is located, the equi-depth∗ and the
equi-width∗ approaches behave differently. More precisely, when the solution
region is dense (Figure 5(a)), the accuracy is lower with the equi-depth∗
approach, since in this case the higher density leads to smaller bins. On the
other hand, when the solution region is sparse (Figure 5(b)), the accuracy is
usually lower with the equi-width∗ approach, since in this case the lower
density leads to longer bins under the equi-depth∗ approach.
[Figure 5: Grid-based accuracy of the solution. (a) Solution in a dense region; (b) Solution in a sparse region]

[Figure 6: Relaxation degree. (a) Solution in a dense region; (b) Solution in a sparse region]

[Figure 7: Proximity. (a) Solution in a dense region; (b) Solution in a sparse region]


A similar behaviour can be observed for the relaxation degree (Figure 6) and
the proximity (Figure 7): by increasing the number of bins, they decrease for
both binning approaches. When the solution is contained in a dense region,
equi-depth∗ behaves better than equi-width∗, especially for low numbers of
bins. We further notice that, for very low numbers of bins, proximity and
grid-based accuracy coincide (the farthest neighbour of the solution is the
query point itself).

Finally, notice that dense regions have a greater impact on query
cardinalities and, as a consequence, on the relaxation degree, whose values
are usually higher in this case (starting from 8 bins), independently of the
selected binning approach. On the other hand, dense regions tend to generate
optimal solutions with lower proximity (more evident with the equi-depth∗
approach, which is more sensitive to the data distribution), since constraint
satisfaction can be obtained on query points closer to the initial one. In
general, for uniformly distributed datasets, in which no sparse regions can be
detected, equi-depth∗ can be considered the best option. By combining the
obtained results with those presented in [1, 3], related to performance, a
number of bins equal to 16 can be considered a good compromise between
effectiveness and efficiency.

5 RELATED WORK

The interest in coverage constraints has been introduced in [5, 6], drawing
inspiration from the literature on diversity [11]. The problem of evaluating
the coverage of a given dataset has been considered in the context of the
MithraLabel system [12, 24], in which the lack of coverage is modeled as a
widget in the nutritional label [26] of a dataset. Once the lack of coverage
has been identified, the smallest number of data points needed to hit all the
“large uncovered spaces” is identified, with the aim of helping data owners
achieve coverage through a data repairing approach. When protected categories
are defined in terms of many attributes, the identification of attribute
patterns associated with coverage problems might lead to performance issues,
due to the combinatorial explosion of such patterns. Efficient techniques
addressing this problem, inspired by set enumeration and association rule
mining, have been proposed in [6]. To fix the lack of coverage, additional
data can be acquired. Since data acquisition has a cost in terms of data
processing, techniques have been presented in [6] for determining the patterns
that can be covered given a certain maximum cost. An efficient approach for
coverage analysis, given a set of attributes across multiple tables, is
presented in [27]. As pointed out by the previous discussion,
most existing approaches chase coverage through data repair and focus on
efficiency issues. By contrast, we consider accuracy for coverage-based query
rewriting during data transformation, thus complementing existing approaches.

The technique considered in this paper relies on rewriting. Other
fairness-aware rewriting approaches have been proposed for OLAP queries
[19, 20]: there, bias is defined in terms of causal fairness (checking for
causal relationships from the sensitive attributes to the outcome) and is
detected, explained, and resolved through rewriting. We focus instead on data
transformations in the presence of coverage constraints.
Impact evaluation is quite relevant in the design and the execution of
non-discriminating pipelines, which are usually very complex in real-world
scenarios. Various systems have been designed to support the user in this
activity. Among them, we recall: Fair-DAGs [25], an open-source library aiming
at representing data processing pipelines in terms of a directed acyclic graph
(DAG) and identifying distortions with respect to protected groups as the data
flows through the pipeline; FairPrep [21], an environment for investigating
the impact of fairness-enhancing interventions inside data processing
pipelines; and AI Fairness 360 [8], an open-source Python toolkit for
algorithmic fairness, aimed at facilitating the transition of fairness-aware
research algorithms to an industrial setting and at providing a common
framework to share and evaluate algorithms.

6 CONCLUDING REMARKS

Rewriting approaches have been recognized as an interesting means for
enforcing specific non-discriminating properties while guaranteeing
transparency. In this paper, we started from the approach proposed in [1] with
the aim of investigating the impact of rewriting on coverage-constraint
satisfaction. The approach is approximate and relies on a sample for both the
construction of the solution search space and the detection of the optimal
rewriting. Three different groups of measures have been proposed for
quantifying the accuracy induced by the approximation and the impact of the
sample on the detection of the optimal solution.

Preliminary experimental results show that: (i) different binning approaches
lead to different grid-based accuracy degrees; (ii) common measures for
computing the distance between distributions are not effective for analyzing
their behaviour in the detection of the optimal coverage-based solution;
(iii) the number of bins used in the generation of the search space has an
impact on the accuracy of the detected solution, and not only on the
performance [1], while the optimal binning approach depends on the position of
the optimal rewriting in the search space; (iv) a number of bins equal to 16
represents a good compromise between accuracy and efficiency.

The proposed approach is at the basis of covRew [2], a Python toolkit for
rewriting slicing operations in pre-processing pipelines, ensuring coverage
constraint satisfaction. covRew takes as input a two-dimensional tabular
dataset 𝐼, with the related sensitive attribute specification for the
identification of protected groups, a processing pipeline represented as a
Pandas script [15], and a set of coverage constraints. It includes three main
components: (i) a pipeline analyzer, which identifies candidate operations for
rewriting; (ii) a pipeline rewriter, which transforms the operations selected
by the user according to the input coverage constraints; and (iii) an impact
evaluator, assessing the impact of the rewriting through the grid-based and
solution-based measures presented in Section 3. Such measures could lead the
user to reconsider some of the choices made in the selection of the operations
to be rewritten, thus producing a new annotated script.
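As a toy illustration of the kind of check such an impact evaluator builds on
(this is not the covRew API: the function, data, and thresholds below are
invented for the example), a coverage constraint on the result of a slicing
operation can be verified as follows:

    import pandas as pd

    def coverage_satisfied(df, sensitive_attr, constraints):
        # A coverage constraint requires each protected group to keep at
        # least a minimum number of records after the transformation.
        counts = df[sensitive_attr].value_counts()
        return all(counts.get(g, 0) >= k for g, k in constraints.items())

    data = pd.DataFrame({"gender": ["F", "M", "M", "F", "M"],
                         "score":  [7, 9, 8, 5, 6]})
    filtered = data[data["score"] >= 7]  # the slicing operation to be checked
    print(coverage_satisfied(filtered, "gender", {"F": 2, "M": 2}))  # False: one F left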
Future work is needed in order to understand how to define new distance
functions between distributions, focusing on their behaviour during
coverage-based rewriting. An additional issue concerns the definition and the
analysis of further solution-based measures, evaluating the quality of the
detected solution with respect to the solution that would have been obtained
without considering a sample, as well as their integration in the covRew
prototype system.

REFERENCES
[1] C. Accinelli, S. Minisi, B. Catania. Coverage-based rewriting for data preparation. In Proc. of the EDBT/ICDT Workshops, CEUR-WS.org, 2020.
[2] C. Accinelli, B. Catania, G. Guerrini, S. Minisi. covRew: a Python toolkit for pre-processing pipeline rewriting ensuring coverage constraint satisfaction. In Proc. of the 24th Int. Conf. on Extending Database Technology (EDBT), 2021.
[3] C. Accinelli, B. Catania, G. Guerrini, S. Minisi. A coverage-based approach to data transformations. In preparation.
[4] S. Agarwal et al. Knowing when you're wrong: building fast and reliable approximate query processing systems. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 481–492, 2014.
[5] A. Asudeh, H. V. Jagadish, J. Stoyanovich. Towards responsible data-driven decision making in score-based systems. IEEE Data Eng. Bull., 42(3):76–87, 2019.
[6] A. Asudeh, Z. Jin, H. V. Jagadish. Assessing and remedying coverage for a given dataset. In Proc. of the 35th IEEE Int. Conf. on Data Engineering (ICDE), pages 554–565, 2019.
[7] M. Basseville. Divergence measures for statistical data processing. 2010.
[8] R. K. Bellamy et al. AI Fairness 360: An extensible toolkit for detecting and mitigating algorithmic bias. IBM Journal of Research and Development, 63(4/5):4:1–4:15, 2019.
[9] G. Cormode, M. N. Garofalakis, P. J. Haas, C. Jermaine. Synopses for massive data: Samples, histograms, wavelets, sketches. Found. Trends Databases, 4(1-3):1–294, 2012.
[10] I. Dagan, L. Lee, F. Pereira. Similarity-based methods for word sense disambiguation. arXiv preprint cmp-lg/9708010, 1997.
[11] M. Drosou, H. V. Jagadish, E. Pitoura, J. Stoyanovich. Diversity in big data: A review. Big Data, 5(2):73–84, 2017.
[12] Z. Jin et al. MithraCoverage: A system for investigating population bias for intersectional fairness. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 2721–2724, 2020.
[13] S. Kullback. Information Theory and Statistics. Courier Corporation, 1997.
[14] S. Kullback, R. A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, 1951.
[15] W. McKinney. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. O'Reilly Media, Inc., 2012.
[16] C. Mishra, N. Koudas. Interactive query refinement. In Proc. of the 12th Int. Conf. on Extending Database Technology (EDBT), pages 862–873, 2009.
[17] I. Olkin, F. Pukelsheim. The distance between two random vectors with given dispersion matrices. Linear Algebra and its Applications, 48:257–263, 1982.
[18] M. Riondato et al. The VC-dimension of SQL queries and selectivity estimation through sampling. In Machine Learning and Knowledge Discovery in Databases, European Conference (ECML PKDD), Part II, pages 661–676, 2011.
[19] B. Salimi et al. HypDB: A demonstration of detecting, explaining and resolving bias in OLAP queries. Proc. VLDB Endow., 11(12):2062–2065, 2018.
[20] B. Salimi, J. Gehrke, D. Suciu. Bias in OLAP queries: Detection, explanation, and removal. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 1021–1035, 2018.
[21] S. Schelter, Y. He, J. Khilnani, J. Stoyanovich. FairPrep: Promoting data to a first-class citizen in studies on fairness-enhancing interventions. In Proc. of the 23rd Int. Conf. on Extending Database Technology (EDBT), pages 395–398, 2020.
[22] M. A. Stephens. EDF statistics for goodness of fit and some comparisons. Journal of the American Statistical Association, 69(347):730–737, 1974.
[23] J. Stoyanovich, B. Howe, H. V. Jagadish. Responsible data management. Proc. VLDB Endow., 13(12):3474–3488, 2020.
[24] C. Sun et al. MithraLabel: Flexible dataset nutritional labels for responsible data science. In Proc. of the 28th Int. Conf. on Information and Knowledge Management (CIKM), pages 2893–2896, 2019.
[25] K. Yang, B. Huang, J. Stoyanovich, S. Schelter. Fairness-aware instrumentation of preprocessing pipelines for machine learning. In Workshop on Human-In-the-Loop Data Analytics (HILDA), 2020.
[26] K. Yang et al. A nutritional label for rankings. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 1773–1776, 2018.
[27] Y. Lin, Y. Guan, A. Asudeh, H. V. Jagadish. Identifying insufficient data coverage in databases with multiple relations. Proc. VLDB Endow., pages 2229–2242, 2020.