=Paper= {{Paper |id=Vol-2841/PIE+Q_2 |storemode=property |title=The impact of rewriting on coverage constraint satisfaction |pdfUrl=https://ceur-ws.org/Vol-2841/PIE+Q_2.pdf |volume=Vol-2841 |authors=Chiara Accinelli,Barbara Catania,Giovanna Guerrini,Simone Minisi |dblpUrl=https://dblp.org/rec/conf/edbt/AccinelliCGM21a }} ==The impact of rewriting on coverage constraint satisfaction== https://ceur-ws.org/Vol-2841/PIE+Q_2.pdf
    The impact of rewriting on coverage constraint satisfaction
                           Chiara Accinelli, Barbara Catania, Giovanna Guerrini, Simone Minisi
                                                        DIBRIS - University of Genoa, Genoa - Italy
                                                             name.surname@dibris.unige.it

ABSTRACT
Due to the impact of analytical processes on our lives, an increasing effort is being devoted to the design of technological solutions that help humans measure the bias introduced by such processes and understand its causes. Existing solutions can refer to either back-end or front-end stages of the data processing pipeline and usually represent bias in terms of some given diversity or fairness constraint. In our previous work [1], we proposed an approach for rewriting filtering and merge operations in pre-processing pipelines into the “closest” operations so that protected groups are adequately represented (i.e., covered) in the result. This is relevant because any under-represented category in an initial or intermediate dataset might lead to an under-representation of that category in any subsequent analytical process. Since many potential rewritings exist, the proposed approach is approximate and relies on sample-based cardinality estimation, thus introducing a trade-off between the accuracy and the efficiency of the process. In this paper, we investigate this trade-off by first presenting various measures quantifying the error introduced by the rewriting, due to the applied approximation and the selected sample. Then, we (preliminarily) evaluate such measures experimentally on a real-world dataset.

1 INTRODUCTION
The impact of data on our society is getting higher and higher, with data about people being more and more often exploited as the basis for decisions that might affect people’s lives. Thus, it becomes crucial to ensure that, in the systems enabling such data-based decisions, data are dealt with in a responsible and non-discriminating way [23], in all the steps from acquisition to analysis.

Non-discrimination can be addressed by considering specific diversity and fairness constraints. Diversity allows us to capture the quality of a collection of items with respect to the variety of its constituent elements. Fairness, on the other hand, can be broadly defined as the impartial treatment of individuals and of demographic groups inside data processing tasks.

Among all data processing steps, data pre-processing plays a relevant role when considering non-discrimination issues, since it can introduce technical bias by exacerbating bias that pre-exists in society, with an impact on the whole data lifecycle.

When considering pre-processing tasks, an often considered non-discrimination constraint is coverage. Coverage constraints guarantee that the input, or training, dataset includes enough examples for each (protected) category of interest, thus increasing diversity with the aim of limiting the introduction of bias during the next analytical steps. Coverage is quite relevant in the first data processing tasks, like data transformation, since the applied transformations might change the, possibly initially satisfied, coverage of protected categories.

In this paper, we are interested in investigating solutions for detecting bias in data pre-processing steps, defined in terms of coverage constraints, with special reference to filtering and merge data transformations, for mitigating it, and for checking whether the mitigation was effective. Specifically, we focus on classical data transformation operations, often defined in terms of Selection-Projection-Join (SPJ) operations over tabular data, that can reduce the number of records related to some protected or disadvantaged groups, defined in terms of some sensitive attributes, even if such attributes are not directly used in the specification of the data transformation operation.

In this frame, the approach we developed [1] aims at supporting the user by minimally rewriting the transformation operation so that the input coverage constraints are guaranteed to be satisfied in the transformation result. Through rewriting, the revised process is traced for further processing, thus guaranteeing transparency. Since many potential rewritings exist, we proposed a sample-based two-step approach for detecting, in an approximate way, the minimal (i.e., the optimal) rewriting of the original query. After transforming the SPJ query into a canonical form, first (in a pre-processing step) the search space of potential rewritings is discretized, so that an approximation of the optimal solution can be detected in the next processing step by looking at the resulting finite set of points. The coverage-based rewriting of the input query is then obtained by visiting the grid produced by the pre-processing step, according to an order that guarantees the fast detection of the minimal rewriting, and by verifying constraint satisfaction through a sample-based approach. The coverage-based rewriting is approximate both because of the discretization of the search space and because of the error in estimating cardinalities and constraint satisfaction on the sample.

In this paper, we start from this approach and investigate its effectiveness in detecting the optimal coverage-based solution. To this aim, we introduce three main groups of measures quantifying the error that can be generated by the approximate, sample-based approach. The first group of measures deals with the accuracy of the discretization applied in the pre-processing step. The second group deals with the error due to the usage of the sample for grid generation and cardinality estimation, while the third group helps the user in quantitatively evaluating specific solutions obtained through the rewriting.

The generated rewritings are then (preliminarily) analyzed experimentally in terms of the proposed measures, on a real-world dataset. The obtained results provide hints on how to tune approximation and sample-related parameters to achieve a good trade-off between accuracy and performance in the context of coverage-based rewriting, by combining the new results on accuracy presented in this paper with the results on performance previously presented in [1].

Even if the proposed measures have been defined in the context of a specific coverage-based approach, we believe that they can be of more general value in understanding the role of approximation and sampling during data pre-processing.

© 2021 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2021 Joint Conference (March 23–26, 2021, Nicosia, Cyprus) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
The remainder of the paper is structured as follows. In Section 2, we present the overall approach to coverage-based rewriting. In Section 3, we introduce new measures to quantify, in terms of accuracy, the impact of rewriting. In Section 4, we experimentally analyze the impact of the rewriting according to the introduced measures. Section 5 discusses related work, while Section 6 concludes the paper and outlines future work directions.

2 COVERAGE-BASED REWRITING
We focus on data to be transformed for further data analytical tasks. Data can refer to specific protected (minority or historically disadvantaged) groups, and we aim at guaranteeing that each transformation step in the pre-processing pipeline, based on filtering or merge operations, produces a new dataset containing enough entries for each protected group of interest. In the following, we briefly describe the input data and the proposed technique. Additional details can be found in [1, 3].

Datasets. Our rewriting approach can be applied over a collection of tabular datasets (e.g., relations in a relational database, Data Frames in the Pandas analytical environment) 𝐼 ≡ 𝐼_1, ..., 𝐼_𝑟. Among the attributes 𝐴_1, ..., 𝐴_𝑚 of each input dataset, we assume that some discrete-valued attributes 𝑆_1, ..., 𝑆_𝑛 are of particular concern, since they allow the identification of protected groups, and we call them sensitive attributes. Examples of sensitive attributes are gender (with values in {𝑓𝑒𝑚𝑎𝑙𝑒, 𝑚𝑎𝑙𝑒}) and race (with values in {𝑎𝑠𝑖𝑎𝑛, 𝑏𝑙𝑎𝑐𝑘, ℎ𝑖𝑠𝑝𝑎𝑛𝑖𝑐, 𝑤ℎ𝑖𝑡𝑒}).

Data pre-processing operations. The pre-processing operations we are interested in correspond to monotonic Select-Project-Join (SPJ) queries over input tabular data that might alter the representation (i.e., the coverage) of specific groups of interest, defined in terms of sensitive attribute values. To this aim, we focus on SPJ queries that return, among the others, at least one sensitive attribute (called sensitive SPJ operations or queries). For the sake of simplicity, we assume that selection conditions are defined over numeric attributes, even if the proposed approach can be easily extended to any other ordered domain. Under the considered assumptions, sensitive attributes are not included in selection conditions (a typical assumption in data processing).

In the following, when needed, we denote 𝑄 by 𝑄⟨𝑣_1, ..., 𝑣_𝑑⟩ or 𝑄⟨𝑣⟩, 𝑣 ≡ (𝑣_1, ..., 𝑣_𝑑), where 𝑣_1, ..., 𝑣_𝑑 are the constant values appearing in the selection conditions 𝑠𝑒𝑙_𝑖 ≡ 𝐴_𝑖 𝜃_𝑖 𝑣_𝑖 in 𝑄.

Coverage constraints. Conditions over the number of entries belonging to a given protected group of interest returned by the execution of SPJ queries can be specified in terms of coverage constraints [6, 27]. Given a sensitive SPJ query 𝑄, with reference to a sensitive attribute 𝑆_𝑖, and a value 𝑠_𝑖 belonging to the domain of 𝑆_𝑖, 𝑖 ∈ {1, ..., 𝑛}, a coverage constraint with respect to 𝑆_𝑖 and 𝑠_𝑖 is denoted by 𝑄 ↓^{𝑆_𝑖}_{𝑠_𝑖} ≥ 𝑘_𝑖 and it is satisfied by 𝑄 over the input dataset 𝐼 when 𝑐𝑎𝑟𝑑(𝜎_{𝑆_𝑖 = 𝑠_𝑖}(𝑄(𝐼))) ≥ 𝑘_𝑖 holds. For example, choosing gender as a sensitive attribute, a coverage constraint could be 𝑄 ↓^{𝑔𝑒𝑛𝑑𝑒𝑟}_{𝑓𝑒𝑚𝑎𝑙𝑒} ≥ 10, specifying that the result of 𝑄 must contain data related to at least 10 female individuals.

Coverage constraints can be provided together with 𝑄 or they can already be available in the system, as any other integrity constraint. This could be useful when they represent generally valid non-discrimination rules that must be satisfied by any query execution.
The approach. Given a dataset 𝐼 and a set of coverage constraints 𝐶𝐶, each selected sensitive SPJ query 𝑄 can be rewritten into another query 𝑄^{𝑜𝑝𝑡}_{𝐼,𝐶𝐶}, according to what is presented in [1], so that 𝑄^{𝑜𝑝𝑡}_{𝐼,𝐶𝐶} is the minimal query relaxing 𝑄 that guarantees coverage constraint satisfaction when evaluated over the input dataset. Relaxation is reasonable when the user is happy with the specified transformation and wants to keep the result set of the original query after the rewriting. 𝑄^{𝑜𝑝𝑡}_{𝐼,𝐶𝐶} must therefore satisfy the following properties: (i) 𝑄^{𝑜𝑝𝑡}_{𝐼,𝐶𝐶} ≡ 𝑄⟨𝑢⟩, thus 𝑄^{𝑜𝑝𝑡}_{𝐼,𝐶𝐶} is obtained from 𝑄 by only changing the selection constants; (ii) 𝑄 ⊆ 𝑄^{𝑜𝑝𝑡}_{𝐼,𝐶𝐶}, thus 𝑄^{𝑜𝑝𝑡}_{𝐼,𝐶𝐶} always contains the result of the input query; (iii) all coverage constraints associated with 𝑄 are satisfied by 𝑄^{𝑜𝑝𝑡}_{𝐼,𝐶𝐶}(𝐼). The rewriting should be optimal, i.e., the new query has to satisfy specific minimality properties with respect to the input query 𝑄. In order to make the definition of the minimality properties homogeneous with respect to all the selection attributes 𝐴_𝑖 in 𝑄, we define them in a transformed unit space, in which the values of each attribute 𝐴_𝑖 in 𝐼 are normalized between 0 and 1. We denote with underlined symbols 𝑄, 𝐼, 𝐴_𝑖, 𝑣 a query, a dataset, an attribute, and a vector of values, respectively, in the unit space. Notice that properties (i), (ii), and (iii) are satisfied in the original space if and only if they are satisfied in the normalized one. Minimality can now be stated according to the following two properties: (iv) there is no other query 𝑄′ satisfying conditions (i), (ii), and (iii) such that 𝑄′(𝐼) ⊂ 𝑄^{𝑜𝑝𝑡}_{𝐼,𝐶𝐶}(𝐼) (thus, 𝑄^{𝑜𝑝𝑡}_{𝐼,𝐶𝐶} is the minimal query on 𝐼 satisfying (i), (ii), and (iii)); (v) 𝑄^{𝑜𝑝𝑡}_{𝐼,𝐶𝐶} ≡ 𝑄⟨𝑢⟩ is the closest query to 𝑄⟨𝑣⟩, according to the Euclidean distance between 𝑣 and 𝑢 in the unit space, among those satisfying (i), (ii), (iii), and (iv) (thus, 𝑄^{𝑜𝑝𝑡}_{𝐼,𝐶𝐶} is the coverage-based rewriting syntactically closest to the input query, maximizing proximity and, potentially, user satisfaction).

In order to compute the optimal coverage-based rewriting of an SPJ query 𝑄⟨𝑣⟩, given a set of coverage constraints 𝐶𝐶 and an instance 𝐼, we follow the approach presented in [1], consisting of three steps, shortly discussed in what follows. For simplicity of notation, in presenting the approach and the related examples we do not underline symbols referring to the unit space, even if proximity is always considered in that space.

Canonical form generation. We first translate the selected SPJ queries into a canonical form, in which each selection condition containing one of the operators >, ≥, = is translated into one or more equivalent conditions defined in terms of the operator <. For example, any predicate of the form 𝐴_𝑖 > 𝑣_𝑖 can be transformed into the predicate −𝐴_𝑖 < −𝑣_𝑖. When considering canonical forms, an optimal coverage-based rewriting query is obtained from the input query by replacing one or more selection predicates 𝑠𝑒𝑙_𝑖 ≡ 𝐴_𝑖 < 𝑣_𝑖 with a predicate 𝑠𝑒𝑙′_𝑖 ≡ 𝐴_𝑖 < 𝑢_𝑖, with 𝑢_𝑖 ≥ 𝑣_𝑖; 𝑠𝑒𝑙′_𝑖 is called a relaxation of 𝑠𝑒𝑙_𝑖. Relaxed queries generated through coverage-based rewriting starting from 𝑄⟨𝑣⟩, 𝐼, and 𝐶𝐶 thus have the form 𝑄⟨𝑢⟩, with 𝑢 ≥ 𝑣, and can be represented as points 𝑢 in the 𝑑-dimensional space defined over the selection attributes. Taking into account the features of the canonical form and property (ii) of the optimal rewriting, it is simple to show that the query point corresponding to the optimal rewriting must be contained in the upper right region of the reference space with respect to the point representing the input query.
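To make the canonical form concrete, the following sketch (our illustration; the triple-based predicate representation is an assumption, not a data structure from [1]) rewrites >-predicates in terms of <, so that relaxing a predicate always means raising its right-hand constant:

```python
def to_canonical(attr: str, op: str, v: float):
    """Represent a predicate as (coefficient, attr, constant), meaning
    coefficient * attr < constant; relaxation = raising the constant."""
    if op == "<":
        return (1, attr, v)       # A_i < v_i is already canonical
    if op == ">":
        return (-1, attr, -v)     # A_i > v_i  <=>  -A_i < -v_i
    raise NotImplementedError("'>=' and '=' translate analogously "
                              "into one or two '<' conditions")

print(to_canonical("num_medications", ">", 10))  # (-1, 'num_medications', -10)
```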
[Figure 1: Data representation and processing. Panels: (a) Data distribution; (b) Data discretization (4 bins); (c) Multi-dimensional grid; (d) Multi-dimensional grid visit.]

Example 2.1. Consider the Diabetes US dataset¹ and let 𝑔𝑒𝑛𝑑𝑒𝑟 be the sensitive attribute. Suppose we are interested in finding people whose number of medications is less than 10 and whose number of performed lab tests is less than 30. Additionally, suppose we would like to guarantee that at least 15 females are present in the query result (coverage constraint). The corresponding SPJ query is 𝑄⟨30, 10⟩, defined in SQL as SELECT * FROM Diabetes WHERE num_lab_procedures < 30 AND num_medications < 10. Figure 1(a) shows the data distribution corresponding to a small sample of size 100 taken from the Diabetes relation, projected over the attributes referenced in the query selection conditions (points are colored and shaped according to the sensitive attribute values: blue crosses for females and black dots for males). The query corresponds to point (30, 10) in such a space and the region boxed at the bottom left of point (30, 10) contains the result of 𝑄.

The query is already in canonical form, so no preliminary rewriting is required. The search space for detecting coverage-based rewritings of the input query corresponds to the grey region in Figure 1(b). ^

¹ https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008

Pre-processing. According to property (ii) of the coverage-based rewriting, given an input query 𝑄⟨𝑣⟩, any coverage-based rewriting is located in the upper right portion of the space defined by 𝑣. Thus, the (unit) search space contains infinitely many possible coverage-based rewritings, among which the optimal one should be identified. During the pre-processing step, such a search space is discretized, so that an approximation of the optimal solution can be detected in the next processing step by looking at the resulting finite set of points. To this aim, we first organize the search space as a multi-dimensional grid. The grid has 𝑑 axes, one for each selection attribute in the canonical form of 𝑄⟨𝑣⟩, and each axis, starting from the query values, is discretized into a fixed set of bins, by using a binning approach typical of histogram generation (e.g., equi-width, dividing each axis into a fixed number of bins of equal size, or equi-depth, in which each bin contains an equal number of instances). Each point 𝑢 at the intersection of the hyperplanes corresponding to bin values corresponds to a sensitive SPJ query containing 𝑄⟨𝑣⟩, thus satisfying condition (ii) of the reference problem. The set of grid points identified in this way is called the discretized search space. The approach is approximate because a query that is smaller, in terms of minimality and proximity, than the one identified by the algorithm, and that corresponds to a coverage-based rewriting of the input query, might exist; however, if it lies inside a grid cell, it cannot be discovered by the algorithm. Notice that the grid is computed starting from 𝐼 and 𝑄 (𝐶𝐶 is not used).
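A minimal sketch of the per-axis discretization (ours; numpy-based, and only one of many possible renderings of the binning strategies named above) might look as follows; the grid points are then the Cartesian product of the per-axis boundaries:

```python
import numpy as np

def axis_bins(values: np.ndarray, v_q: float, n_bins: int,
              equi_depth: bool) -> np.ndarray:
    """Bin boundaries for one axis, from the query constant v_q up to the
    attribute maximum. Equi-width splits [v_q, max] into equal intervals;
    equi-depth places boundaries at quantiles of the data above v_q."""
    if equi_depth:
        upper = values[values >= v_q]
        edges = np.quantile(upper, np.linspace(0, 1, n_bins + 1))
        return np.unique(np.concatenate(([v_q], edges)))
    return np.linspace(v_q, values.max(), n_bins + 1)
```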
Example 2.2. During the pre-processing step, by applying, as an example, the equi-depth binning approach to each axis of our reference example, we obtain the grid represented in Figure 1(b), considering 4 bins for each axis. The points of the resulting discretized search space correspond to the grid intersection points. Each point corresponds to a sensitive SPJ query, obtained from the input one by replacing the selection constants with the grid point coordinates (see Figure 1(c)). ^

Processing. During the processing step, we visit the discretized search space returned by the pre-processing step, starting from the grid point corresponding to the input query. The discretized search space is visited one point after the other, at increasing distance from 𝑄. For each point 𝑢, we check whether the associated query 𝑄⟨𝑢⟩ is a coverage-based rewriting of 𝑄⟨𝑣⟩ by estimating the query result cardinality for each protected group referenced by the coverage constraints 𝐶𝐶, as well as the query cardinality 𝑐𝑎𝑟𝑑(𝑄(𝐼)).

The properties of the discretized search space and of the used canonical form are taken into account for pruning cells that cannot contain the solution and for further improving the efficiency of the process, by iteratively refining the size and the number of cells during the visit and, as a consequence, by working with a discretized search space of varying granularity [1].

Example 2.3. Figure 1(d) illustrates the processing approach for the considered example. Starting from the grid point corresponding to the input query 𝑄, we estimate the cardinality of 𝑄 ↓^{𝑔𝑒𝑛𝑑𝑒𝑟}_{𝑓𝑒𝑚𝑎𝑙𝑒} on 𝐼, needed for checking constraint satisfaction, and of 𝑄(𝐼), by relying on a sample-based approach, obtaining (2, 8). Since the constraint is not satisfied, we further visit the other points of the discretized search space at increasing distance from 𝑄, checking constraint satisfaction and looking for the minimal rewriting (property (iv)). The visit proceeds as pointed out in Figure 1(d) (the order of the visit is represented by the blue numbers associated with the top right vertex of each cell). Shaded cells are not visited thanks to the pruning effect: if 𝑄⟨𝑢⟩(𝐼) satisfies the coverage constraints, all the points in the upper right portion of the space defined by 𝑢 will not satisfy conditions (iv) and (v) of the reference problem, thus they can be discarded. The optimal coverage-based rewriting corresponds to query 𝑄⟨43, 20⟩ (big blue dot in Figure 1(d)), since (43, 20) is the point at the minimum distance from the query point such that the corresponding query satisfies the considered coverage constraint (𝑄 ↓^{𝑔𝑒𝑛𝑑𝑒𝑟}_{𝑓𝑒𝑚𝑎𝑙𝑒} ≥ 15). The input query is thus rewritten into SELECT * FROM Diabetes WHERE num_lab_procedures < 43 AND num_medications < 20. ^
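The visit can be sketched as follows (our simplification of the algorithm in [1]: all grid points dominating the query point are ranked by Euclidean distance from it in the unit space, and the first point whose associated query satisfies the constraints is returned; the iterative grid refinement and the cell pruning of the actual algorithm are omitted):

```python
import itertools, math

def visit_grid(axes_bins, v, satisfies):
    """axes_bins: per-axis bin boundaries (unit space); v: query point;
    satisfies(u): sample-based check of the coverage constraints for Q<u>.
    Returns the closest grid point u >= v whose query satisfies CC."""
    candidates = (u for u in itertools.product(*axes_bins)
                  if all(ui >= vi for ui, vi in zip(u, v)))    # property (ii)
    for u in sorted(candidates, key=lambda u: math.dist(u, v)):
        if satisfies(u):       # first hit is distance-minimal (property (v))
            return u
    return None
```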
Sample-based estimation. The processing step, as well as coverage constraint checking, requires fast and accurate cardinality estimates. To make the processing more efficient, similarly to [16], we rely on sampling-based estimators over samples of the input dataset (uniform, independent, and without replacement) [9], dynamically constructed during the rewriting phase, and on the approach in [18] for generating the sample of joined tables. As is well known, a sample of a given size can be generated so that query selectivity can be estimated with an error 𝑒 and confidence 𝛿 [4]. As an example, if the error is 1% and the confidence is 95%, the sample size should be 9604: this means that, for 95 samples of size 9604 out of 100, the estimation error will be at most 1%.
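The sample sizes quoted here and in Section 4 are consistent with the classical worst-case bound for estimating a proportion, 𝑛 = 𝑧²/(4𝑒²), with 𝑧 the two-sided normal quantile for confidence 𝛿 (our rendering; the paper itself only cites [4] for the guarantee). A sketch reproducing them:

```python
from math import ceil
from statistics import NormalDist

def sample_size(e: float, confidence: float) -> int:
    """Worst-case (p = 0.5) size so that a selectivity estimate is within
    error e with the given confidence: n = z^2 * 0.25 / e^2."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided quantile
    return ceil(z * z * 0.25 / (e * e))

print(sample_size(0.01, 0.95))  # 9604
print(sample_size(0.01, 0.99))  # 16588
print(sample_size(0.03, 0.95))  # 1068 (Section 4 reports 1067, rounding down)
```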
Performance evaluation. The analysis of the efficiency of the proposed coverage-based query rewriting approach has been investigated in previous work [1]. The experiments demonstrated that the time complexity of the proposed algorithms depends on: the number of bins, the used binning approach, the number of selection conditions in the query, the coverage constraint threshold, and the sample size. Specifically, by varying the number of selection conditions or the number of bins, and therefore the dimensionality of the multi-dimensional grid and the size of the discretized search space, the execution time rapidly increases; the impact of the curse of dimensionality can however be reduced by applying the designed optimizations. Equi-depth binning approaches often lead to a more efficient processing than equi-width approaches, since the distribution of points in the discretized search space follows the data distribution, thus reducing the number of grid cells with no dataset points and, as a consequence, the number of cardinality estimations. The execution time also depends on the chosen coverage constraint thresholds and the number of coverage constraints. Finally, the sample size influences the cardinality estimation time and therefore the total execution time.

3 IMPACT EVALUATION
The coverage-based rewriting of the input query can be obtained by visiting the grid, produced as the result of the pre-processing step, and by verifying constraint satisfaction relying on a sample-based approach. The optimal coverage-based rewriting is therefore approximate since: (i) the grid, which corresponds to the discretized search space, might have an impact on the accuracy of the selected coverage-based rewriting; (ii) the estimation error related to the sample usage has an impact on query cardinality estimation and on constraint satisfaction.

It is therefore important to introduce some measures quantifying the error that can be generated. To this aim, in the following we discuss three groups of measures: the first group deals with the approximation error related to the usage of the grid for the discretization of the query search space; the second group deals with the approximation error related to the usage of a sample during the pre-processing and processing phases; the third concerns the error related to the detected optimal rewriting.

3.1 Grid-based accuracy
Due to the usage of a discretized search space, the optimal coverage-based rewriting identified by the proposed approach is the best approximation of an optimal rewriting, given the considered grid. The accuracy related to the usage of a given grid for the detection of the optimal rewriting thus corresponds to the error introduced by the discretization process.

As pointed out before, given a query 𝑄⟨𝑣⟩, the visit proceeds at increasing distance from the query point 𝑣. When an optimal rewriting 𝑄⟨𝑢⟩ is reached, this means that the neighbours of 𝑢 in the discretized search space cannot be optimal rewritings (otherwise the search would have stopped before). On the other hand, another point 𝑧 might exist that is not included in the search space but such that 𝑄⟨𝑧⟩(𝐼) satisfies 𝐶𝐶 and either it is closer to the query point 𝑣 than 𝑢 or 𝑄⟨𝑧⟩(𝐼) ⊆ 𝑄⟨𝑢⟩(𝐼). Such a point is an optimal rewriting but, due to the approximation, it cannot be identified by the proposed approach. The approximation error for the identified approximate optimal rewriting 𝑄⟨𝑢⟩, also called grid-based accuracy, can therefore be defined as the maximum distance between 𝑢 and all its neighbours on the grid preceding it in the search; the accuracy is therefore lower than or equal to the diagonal of the grid cells having 𝑢 as a vertex and closer to 𝑣 than 𝑢. By considering the entire grid, we can quantify the maximum and minimum grid-based accuracy in terms of the maximum and the minimum diagonal length of grid cells (see Figure 2 for a graphical explanation).

[Figure 2: Grid-based accuracy (dashed blue line for the maximum diagonal, dotted blue line for the minimum diagonal).]

Definition 3.1 (Grid-based accuracy). Let 𝐺^𝑏 be the grid generated from a dataset 𝐼 and a query 𝑄 using a certain binning approach 𝑏. The minimum/maximum grid-based accuracy of 𝐺^𝑏, denoted by 𝑑𝑖𝑎𝑔^{𝑚𝑖𝑛}_{𝐺^𝑏} and 𝑑𝑖𝑎𝑔^{𝑚𝑎𝑥}_{𝐺^𝑏}, is defined as the minimum/maximum diagonal length of the grid cells in 𝐺^𝑏, normalized between 0 and 1. □

Example 3.2. Figure 2 shows the normalized cell diagonal lengths for the grid created as discussed in Example 2.2. The grid-based accuracy for this grid varies between 0.05 (dotted blue line), corresponding to queries 𝑄⟨30, 15⟩ and 𝑄⟨30, 20⟩, and 0.73 (dashed blue line), corresponding to query 𝑄⟨90, 75⟩. ^

Different binning approaches might lead to a different grid-based accuracy. In particular, when fixing the reference data interval over each axis, binning approaches based on the data distribution, like equi-depth, lead to a higher variability of grid-based accuracy, since they generate smaller buckets for dense regions and larger buckets for sparse ones, as stated by the following proposition.

Proposition 1. Let 𝐺^𝑤 be the grid generated from a dataset 𝐼 and a query 𝑄 using the equi-width approach and 𝐺^𝑑 the one generated using the equi-depth approach, with 𝑛 bins for each axis in both cases. Then: (i) 𝑑𝑖𝑎𝑔^{𝑚𝑖𝑛}_{𝐺^𝑑} ≤ 𝑑𝑖𝑎𝑔^{𝑚𝑖𝑛}_{𝐺^𝑤}; (ii) 𝑑𝑖𝑎𝑔^{𝑚𝑎𝑥}_{𝐺^𝑑} ≥ 𝑑𝑖𝑎𝑔^{𝑚𝑎𝑥}_{𝐺^𝑤}. □
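Since every grid cell is a box combining one bin per axis, the two bounds of Definition 3.1 can be computed directly from the per-axis bin widths. A small numpy sketch (ours; the division by the unit-hypercube diagonal √𝑑 is our assumption for the normalization between 0 and 1):

```python
import numpy as np

def diag_min_max(axes_bins):
    """axes_bins: list of sorted boundary arrays, one per axis, already in
    the unit space. Cell diagonal = sqrt(sum of squared per-axis widths),
    so min/max combine the narrowest/widest bin of every axis."""
    widths = [np.diff(np.asarray(b)) for b in axes_bins]
    d_min = np.sqrt(sum(w.min() ** 2 for w in widths))
    d_max = np.sqrt(sum(w.max() ** 2 for w in widths))
    norm = np.sqrt(len(axes_bins))      # normalize by the space diagonal
    return d_min / norm, d_max / norm
```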
3.2 Sample-based accuracy
The accuracy of the query rewriting approach depends on the sample data distribution for two main reasons: (i) different sample distributions might lead to the generation of different buckets and therefore of different grids, with an impact on both the pre-processing and processing steps; (ii) the sample is used for query cardinality estimation and, as a consequence, the estimation error has an impact on the minimality property; in addition, it might lead to a wrong assessment of constraint satisfaction.

Impact on (pre-)processing. In order to evaluate the impact of the sample usage on grid generation, we compare the data distribution of the dataset 𝐼 with the data distribution of the sample 𝑆, both projected over the attributes appearing in the considered query 𝑄. The more similar the two distributions, the lower the impact of the sample selection on the detection of the optimal coverage-based rewriting.

Several metrics have been proposed for quantifying the distance between two multivariate datasets. Many of them, e.g., the Wasserstein metric [17], the Kullback-Leibler divergence [13, 14], and the Jensen-Shannon divergence [10], quantify the distance between the corresponding probability distributions and sometimes, as for the Wasserstein metric, the result might tend to infinity [17]. Probability distributions are not directly available under the considered scenarios and, even if computed, they introduce a further level of approximation in the computation. The Kolmogorov-Smirnov (KS) distance, defined for arbitrary univariate distributions [7, 22], measures the maximum distance, between 0 and 1, between the cumulative distributions of two datasets. Due to its generality, it is widely used and many, rather complex, extensions to the multivariate case exist.

In our work, we consider a very simple extension of the univariate KS metric, obtained by averaging the KS distance computed for each query attribute. The distance computed between two datasets 𝐼 and 𝑆 is denoted by 𝑑_{𝐾𝑆}(𝐼, 𝑆). This approach is suitable since, similarly to what we have proposed for the search space construction, where we rely on an aggregation of unidimensional histograms instead of a multidimensional histogram, it aggregates distances defined on each single attribute.
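This averaged KS distance is straightforward to compute with scipy (a sketch of our metric; `ks_2samp` computes the univariate two-sample KS statistic):

```python
import pandas as pd
from scipy.stats import ks_2samp

def d_ks(I: pd.DataFrame, S: pd.DataFrame, query_attrs) -> float:
    """d_KS(I, S): average of the univariate two-sample KS statistics
    over the attributes appearing in the query selection conditions."""
    stats = [ks_2samp(I[a], S[a]).statistic for a in query_attrs]
    return sum(stats) / len(stats)
```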
In order to investigate the impact of the KS distance for a sample 𝑆 on the identification of the optimal rewriting, it is useful to introduce some metrics quantifying the difference in using 𝑆 instead of the initial dataset 𝐼 for the detection of the optimal coverage-based rewriting. Such metrics compute the average difference, in terms of minimality, proximity, and solution distance, between the optimal rewritings 𝑄^{𝑜𝑝𝑡}_{𝑆,𝐶𝐶} and 𝑄^{𝑜𝑝𝑡}_{𝐼,𝐶𝐶}, obtained by processing the sample 𝑆 and the original dataset 𝐼, respectively, on a random set of queries 𝑄𝑆, uniformly distributed in the query search space (a computational sketch follows the list):

• Average minimality difference (property (iv) of the optimal rewriting). It quantifies how different 𝑄^{𝑜𝑝𝑡}_{𝑆,𝐶𝐶} and 𝑄^{𝑜𝑝𝑡}_{𝐼,𝐶𝐶} are, on average, with respect to their result set cardinalities when they are executed over the input dataset 𝐼 (thus, it quantifies, on average, the difference in the relaxation of the original query over the initial dataset): 𝑚(𝐼, 𝑆) ≡ 𝑎𝑣𝑔_{𝑄∈𝑄𝑆} |𝑐𝑎𝑟𝑑(𝑄^{𝑜𝑝𝑡}_{𝑆,𝐶𝐶}(𝐼)) − 𝑐𝑎𝑟𝑑(𝑄^{𝑜𝑝𝑡}_{𝐼,𝐶𝐶}(𝐼))| / 𝑐𝑎𝑟𝑑(𝑄^{𝑜𝑝𝑡}_{𝐼,𝐶𝐶}(𝐼)).

• Average proximity difference (property (v) of the optimal rewriting). It quantifies how different 𝑄^{𝑜𝑝𝑡}_{𝑆,𝐶𝐶} and 𝑄^{𝑜𝑝𝑡}_{𝐼,𝐶𝐶} are, on average, with respect to their Euclidean distance 𝑑() from the input query 𝑄, in the unit space, further normalized between 0 and 1 (thus, it quantifies, on average, how far the two optimal rewritings are with respect to the original query): 𝑝(𝐼, 𝑆) ≡ 𝑎𝑣𝑔_{𝑄∈𝑄𝑆} |𝑑(𝑄^{𝑜𝑝𝑡}_{𝑆,𝐶𝐶}, 𝑄) − 𝑑(𝑄^{𝑜𝑝𝑡}_{𝐼,𝐶𝐶}, 𝑄)|.

• Average solution distance. It quantifies how different 𝑄^{𝑜𝑝𝑡}_{𝑆,𝐶𝐶} and 𝑄^{𝑜𝑝𝑡}_{𝐼,𝐶𝐶} are, on average, in terms of their Euclidean distance in the unit space, further normalized between 0 and 1 (thus, it quantifies, on average, how far the two optimal rewritings are from each other, without taking the input dataset and the original query into account): 𝑠𝑑(𝐼, 𝑆) ≡ 𝑎𝑣𝑔_{𝑄∈𝑄𝑆} 𝑑(𝑄^{𝑜𝑝𝑡}_{𝑆,𝐶𝐶}, 𝑄^{𝑜𝑝𝑡}_{𝐼,𝐶𝐶}).
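Given, for each query in 𝑄𝑆, the two optimal rewritings computed on 𝑆 and on 𝐼, the three measures can be sketched as follows (our rendering; `card_I` evaluates |𝑄⟨𝑢⟩(𝐼)| and `dist` is the normalized Euclidean distance in the unit space):

```python
def sample_accuracy(results, card_I, dist):
    """results: list of (v, u_S, u_I) triples -- the query point and the
    optimal rewritings computed on the sample S and on the dataset I."""
    n = len(results)
    m  = sum(abs(card_I(uS) - card_I(uI)) / card_I(uI)
             for _, uS, uI in results) / n
    p  = sum(abs(dist(uS, v) - dist(uI, v)) for v, uS, uI in results) / n
    sd = sum(dist(uS, uI) for _, uS, uI in results) / n
    return m, p, sd
```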
Impact on constraint satisfaction. The error and the confidence related to the considered sample (see Section 2) have an impact also on coverage constraint satisfaction. A constraint 𝐶𝐶_𝑖 ≡ 𝑄 ↓^{𝑆_𝑖}_{𝑠_𝑖} ≥ 𝑘_𝑖 is satisfied on 𝐼 if 𝑐𝑎𝑟𝑑(𝜎_{𝑆_𝑖 = 𝑠_𝑖}(𝑄(𝐼))) ≥ 𝑘_𝑖. Since the sample-based estimation of 𝑐𝑎𝑟𝑑(𝜎_{𝑆_𝑖 = 𝑠_𝑖}(𝑄(𝐼))) might lead to an error of 𝑒 × 𝑐𝑎𝑟𝑑(𝐼), in order to guarantee that the constraint is also satisfied by the input dataset 𝐼, we can modify the constraint, when evaluated over the sample, as: 𝑄 ↓^{𝑆_𝑖}_{𝑠_𝑖} ≥ 𝑘_𝑖 + (𝑒 × 𝑐𝑎𝑟𝑑(𝐼)). This consideration makes the proposed sample-based approach reasonable for coverage constraints in which 𝑘_𝑖/𝑐𝑎𝑟𝑑(𝐼) has the same order of magnitude as 𝑒. Notice that, by changing the constraint as proposed above, we increase the probability of constraint satisfaction on the input dataset at the price of reducing the proximity of the optimal rewriting with respect to the input query (since the optimal query will be further away from the initial one).
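In code, this correction is a one-line strengthening of the threshold before the constraint is checked on the sample (a sketch; the use of `ceil` to keep the threshold integer is our choice):

```python
from math import ceil

def corrected_threshold(k: int, e: float, card_I: int) -> int:
    """Strengthen k_i to k_i + e*card(I) when checking the constraint on
    the sample, so that satisfaction there implies satisfaction on I
    (up to the confidence of the sample)."""
    return k + ceil(e * card_I)
```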
3.3 Solution-based accuracy
Let 𝑆 be a sample of dataset 𝐼 and 𝑄^{𝑜𝑝𝑡}_{𝑆,𝐶𝐶} = 𝑄⟨𝑢⟩ be the optimal solution obtained from 𝑆, given a query 𝑄⟨𝑣⟩ and a set of coverage constraints 𝐶𝐶. It could be useful to introduce some additional measures to evaluate the quality of the obtained optimal rewriting 𝑄^{𝑜𝑝𝑡}_{𝑆,𝐶𝐶} with respect to the discretized search space identified by the chosen dataset 𝑆, taking into account both the applied relaxation with respect to the original query and the approximation error, in line with what we have proposed for grid-based and sample-based accuracy. To this aim, we propose the following three measures (the last two are sketched in code after the list):

• Grid-based accuracy of 𝑄^{𝑜𝑝𝑡}_{𝑆,𝐶𝐶}. According to what was discussed in Subsection 3.1, it can be computed as the maximum distance between 𝑢 and all its neighbours on the grid preceding it in the search (thus, the maximum diagonal of the grid cells having 𝑢 as a vertex and closer to 𝑣 than 𝑢).

• Relaxation degree. Similarly to [16], the relaxation degree, first proposed in [1], quantifies, through estimations over the initial dataset 𝐼, how much the optimal coverage-based rewriting 𝑄^{𝑜𝑝𝑡}_{𝑆,𝐶𝐶} relaxes the original query 𝑄, as the percentage of newly added tuples with respect to those contained in the original query result: (|𝑄^{𝑜𝑝𝑡}_{𝑆,𝐶𝐶}(𝐼)| − |𝑄(𝐼)|) / |𝑄(𝐼)|.

• Proximity. It can be computed as the Euclidean distance between the optimal coverage-based rewriting 𝑄^{𝑜𝑝𝑡}_{𝑆,𝐶𝐶} and 𝑄 in the unit space, further normalized between 0 and 1, thus indicating how far the optimal rewriting is from the original query.
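Relaxation degree and proximity are immediate to compute once the rewriting point 𝑢 is known; a sketch (ours, with query points expressed in the unit space):

```python
import math

def relaxation_degree(card_Qu: int, card_Q: int) -> float:
    """(|Q<u>(I)| - |Q(I)|) / |Q(I)|: fraction of newly added tuples."""
    return (card_Qu - card_Q) / card_Q

def proximity(u, v) -> float:
    """Euclidean distance between rewriting and query in the unit space;
    dividing by sqrt(d) (our assumption) normalizes it into [0, 1]."""
    return math.dist(u, v) / math.sqrt(len(v))

print(relaxation_degree(38, 8))  # 3.75, as in Example 3.3
```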
                                                                         instances. A variable number of bins, namely 4, 8, 16, 32, 64, has
   Example 3.3. The optimal coverage-based rewriting of the              been considered for specific experiments.
running example corresponds to query 𝑄 ⟨43, 20⟩ (see Example                 We then performed three groups of experiments aimed at
2.3). The grid-based accuracy of such solution is 0.16 (see Figure       analyzing the grid-based, the sample-based, and the solution-
2), since this is the maximum length between the solution and            based accuracy of the optimal coverage-based rewriting.
its neighbours on the grid. Since 𝑐𝑎𝑟𝑑 (𝑄 ⟨43, 20⟩(𝐼 )) = 38 and         2 https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+
𝑐𝑎𝑟𝑑 (𝑄 (𝐼 )) = 8 (see Figure 2), the relaxation degree is 3.75 (to      1999-2008
[Figure 3: Grid-based accuracy, when varying the number of bins. Panels: (a) Maximal grid-based accuracy; (b) Minimal grid-based accuracy.]

Sample     KS Distance      Sample      KS Distance
1067_1     0.0185           1843_1      0.0156
1067_2     0.0185           1843_2      0.0183
1067_3     0.0185           1843_3      0.0130
1067_4     0.0199           1843_4      0.0130
1067_5     0.0198           1843_5      0.0119
Mean       0.0190           Mean        0.0144
Variance   4.38e-7          Variance    5.36e-6

Sample     KS Distance      Sample      KS Distance
9604_1     0.0058           16588_1     0.0034
9604_2     0.0064           16588_2     0.0044
9604_3     0.0026           16588_3     0.0032
9604_4     0.0071           16588_4     0.0055
9604_5     0.0134           16588_5     0.0035
Mean       0.0071           Mean        0.0040
Variance   1.24e-5          Variance    7.32e-7

Figure 4: KS distance from the samples to the reference dataset

Sample     KS distance    avg min. difference    avg prox. difference    avg sol. distance
9604_3     0.0026         0.0275                 0.1072                  0.1403
9604_5     0.0134         0.1350                 0.0995                  0.1291

Table 1: Sample-based accuracy measures

4.2    Experimental evaluation
    4.2.1 Grid-based accuracy. The first group of experiments         (namely, 9604_3 and 9604_5). We then computed the sample-
aims at analyzing the maximum and minimum grid-based ac-              based accuracy measures on the set of 1000 random queries.
curacy, as presented in Subsection 3.1, by varying the number            Table 1 shows the results obtained by considering the equi-
of bins, with respect to the selected binning approach, namely        depth∗ approach and 32 bins (similar values has been obtained
equi-depth∗ and equi-width∗ . To this aim, we selected a sample       for the equi-width∗ and other numbers of bins). As you can see,
with size 16588 (sample 16588_3 in Figure 4) and we considered        there is no clear ordering between the two samples when consid-
all the generated 1000 random queries. The obtained results are       ering such measures. In particular, sample 9604_3 is better than
independent from the query selectivity, therefore in the following    sample 9604_5 with respect to minimality; however, for prox-
we discuss those obtained with selectivity equal to 2.5%.             imity and solution distance the opposite result holds. Thus, it
    Figure 3 shows that both the maximum and the minimum              seems that the KS measure is not good enough for discriminating
grid-based accuracy decrease by increasing the number of bins,        between samples with a different behaviour with respect to the
independently from the chosen binning approach. This is because,      detection of the optimal coverage-based rewriting. Additional
by increasing the number of bins, the space is discretized into a     work is therefore needed for investigating or defining alterna-
higher number of cells, thus obtaining cells of smaller size.         tive functions able to discriminate between samples under the
    From Figure 3 we also observe the behaviour described by          considered scenario.
Proposition 1: the maximal grid-based accuracy obtained with
the equi-depth∗ approach is always higher than the grid-based             4.2.3 Solution-based accuracy. The last group of experiments
accuracy obtained with the equi-width∗ approach; minimal accu-        aims at analyzing the solution-based accuracy, according to the
racy behaves in the opposite way.                                     measures introduced in Subsection 3.3, by varying the number of
4.2.2 Sample-based accuracy. The second group of experiments deals with the
analysis of the sample-based accuracy, according to the measures introduced in
Subsection 3.2. To this aim, we considered all the samples described in
Subsection 4.1 and, for each of them, we computed the KS distance with respect
to the input dataset.
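The KS (Kolmogorov-Smirnov) distance between a sample and the dataset it is
drawn from can be computed, for instance, with SciPy; the data below is
synthetic and serves only to show the call:

    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(0)
    dataset = rng.normal(size=100_000)  # stand-in for the input dataset
    sample = rng.choice(dataset, size=9604, replace=False)

    res = ks_2samp(dataset, sample)  # max gap between the two empirical CDFs
    print(f"KS distance: {res.statistic:.4f}")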
Results, presented in Figure 4, show that the obtained values are very similar
and quite small: for most sample sizes, the greatest differences concern the
third decimal digit. As expected, by increasing the sample size, the KS
distance tends to decrease.

In order to evaluate the impact of the KS distance on sample-based accuracy
measures, we considered the sample size leading to the highest variance (9604)
and we selected the samples with the highest difference between the
corresponding KS distances (namely, 9604_3 and 9604_5). We then computed the
sample-based accuracy measures on the set of 1000 random queries.

Table 1 shows the results obtained by considering the equi-depth∗ approach and
32 bins (similar values have been obtained for the equi-width∗ approach and
other numbers of bins). As the table shows, there is no clear ordering between
the two samples with respect to such measures. In particular, sample 9604_3 is
better than sample 9604_5 with respect to minimality; however, for proximity
and solution distance the opposite holds. Thus, the KS measure does not seem
good enough to discriminate between samples with a different behaviour in the
detection of the optimal coverage-based rewriting. Additional work is
therefore needed to investigate or define alternative functions able to
discriminate between samples in the considered scenario.

4.2.3 Solution-based accuracy. The last group of experiments aims at analyzing
the solution-based accuracy, according to the measures introduced in
Subsection 3.3, by varying the number of bins with respect to the selected
binning approach, namely equi-depth∗ and equi-width∗. To this aim, we selected
sample 16588_3 (i.e., the biggest sample with the smallest KS distance with
respect to the initial dataset) and we considered all the 1000 generated
random queries. The obtained results show a different behaviour depending on
the region in which the optimal solution is located, either dense or sparse.
In particular, Figure 5 shows that, as expected, by increasing the number of
bins, the grid-based accuracy of the solution decreases as well. However,
depending on where the optimal solution is located, the equi-depth∗ and the
equi-width∗ approaches behave differently. More precisely, when the solution
region is dense (Figure 5(a)), the accuracy is lower with the equi-depth∗
approach, since in this case the higher density leads to smaller bins. On the
other hand, when the solution region is sparse (Figure 5(b)), the accuracy is
usually lower with the equi-width∗ approach, since in this case the lower
density leads to longer bins under the equi-depth∗ approach.
[Figure 5: Grid-based accuracy of the solution. (a) Solution in a dense region; (b) Solution in a sparse region]

[Figure 6: Relaxation degree. (a) Solution in a dense region; (b) Solution in a sparse region]

[Figure 7: Proximity. (a) Solution in a dense region; (b) Solution in a sparse region]


A similar behaviour can be observed for the relaxation degree (Figure 6) and
the proximity (Figure 7): by increasing the number of bins, they decrease for
both binning approaches. When the solution is contained in a dense region,
equi-depth∗ behaves better than equi-width∗, especially for low numbers of
bins. We further notice that, for very low numbers of bins, proximity and
grid-based accuracy coincide (the farthest neighbour of the solution is the
query point itself).

Finally, notice that dense regions have a greater impact on query
cardinalities and, as a consequence, on the relaxation degree, whose values
are usually higher in this case (starting from 8 bins), independently of the
selected binning approach. On the other hand, dense regions tend to generate
optimal solutions with lower proximity (more evident with the equi-depth∗
approach, which is more sensitive to the data distribution), since constraint
satisfaction can be obtained on query points closer to the initial one. In
general, for uniformly distributed datasets, in which no sparse regions can be
detected, equi-depth∗ can be considered the best option. By combining the
obtained results with those presented in [1, 3], related to performance, a
number of bins equal to 16 can be considered a good compromise between
effectiveness and efficiency.

5 RELATED WORK

The interest in coverage constraints has been introduced in [5, 6], drawing
inspiration from the literature on diversity [11]. The problem of evaluating
the coverage of a given dataset has been considered in the context of the
MithraLabel system [12, 24], in which the lack of coverage is modeled as a
widget in the nutritional label [26] of a dataset. Once the lack of coverage
has been identified, the smallest number of data points needed to hit all the
“large uncovered spaces” is identified, with the aim of helping data owners
achieve coverage through a data repairing approach. When protected categories
are defined in terms of many attributes, the identification of attribute
patterns associated with coverage problems might lead to performance issues,
due to the combinatorial explosion of such patterns. Efficient techniques
addressing this problem, inspired by set enumeration and association rule
mining, have been proposed in [6]. To fix the lack of coverage, additional
data can be acquired. Since data acquisition has a cost in terms of data
processing, techniques have been presented in [6] for determining the patterns
that can be covered given a certain maximum cost. An efficient approach for
coverage analysis, given a set of attributes across multiple tables, is
presented in [27]. As pointed out by the previous discussion,
most existing approaches chase coverage through data repair and focus on
efficiency issues. By contrast, we consider accuracy for coverage-based query
rewriting during data transformation, thus complementing existing approaches.

The technique considered in this paper relies on rewriting. Other
fairness-aware rewriting approaches have been proposed for OLAP queries
[19, 20]: there, bias is defined in terms of causal fairness (checking for
causal relationships from the sensitive attributes to the outcome) and is
detected, explained, and resolved through rewriting. We focus instead on data
transformations in the presence of coverage constraints.
Impact evaluation is quite relevant in the design and the execution of
non-discriminating pipelines, which are usually very complex in real-world
scenarios. Various systems have been designed to support the user in this
activity. Among them, we recall: Fair-DAGs [25], an open-source library aiming
at representing data processing pipelines in terms of a directed acyclic graph
(DAG) and identifying distortions with respect to protected groups as the data
flows through the pipeline; FairPrep [21], an environment for investigating
the impact of fairness-enhancing interventions inside data processing
pipelines; and AI Fairness 360 [8], an open-source Python toolkit for
algorithmic fairness, aimed at facilitating the transition of fairness-aware
research algorithms to an industrial setting and at providing a common
framework to share and evaluate algorithms.

6 CONCLUDING REMARKS

Rewriting approaches have been recognized as an interesting means for
enforcing specific non-discriminating properties while guaranteeing
transparency. In this paper, we started from the approach proposed in [1] with
the aim of investigating the impact of rewriting on coverage-constraint
satisfaction. The approach is approximate and relies on a sample for both the
construction of the solution search space and the detection of the optimal
rewriting. Three different groups of measures have been proposed for
quantifying the accuracy induced by the approximation and the impact of the
sample on the detection of the optimal solution.

Preliminary experimental results show that: (i) different binning approaches
lead to different grid-based accuracy degrees; (ii) common measures for
computing the distance between distributions are not effective for analyzing
their behaviour in the detection of the optimal coverage-based solution;
(iii) the number of bins used in the generation of the search space has an
impact on the accuracy of the detected solution, and not only on the
performance [1], while the optimal binning approach depends on the position of
the optimal rewriting in the search space; (iv) a number of bins equal to 16
represents a good compromise between accuracy and efficiency.

The proposed approach is at the basis of covRew [2], a Python toolkit for
rewriting slicing operations in pre-processing pipelines, ensuring coverage
constraint satisfaction. covRew takes as input a two-dimensional tabular
dataset 𝐼, with the related sensitive attribute specification for the
identification of protected groups, a processing pipeline represented as a
Pandas script [15], and a set of coverage constraints. It includes three main
components: (i) a pipeline analyzer, which identifies candidate operations for
rewriting; (ii) a pipeline rewriter, which transforms the operations selected
by the user according to the input coverage constraints; and (iii) an impact
evaluator, assessing the impact of the rewriting through the grid-based and
solution-based measures presented in Section 3. Such measures could lead the
user to reconsider some of the choices made in the selection of the operations
to be rewritten, thus producing a new annotated script.
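As a toy illustration of the kind of check such an impact evaluator builds on
(this is not the covRew API: the function, data, and thresholds below are
invented for the example), a coverage constraint on the result of a slicing
operation can be verified as follows:

    import pandas as pd

    def coverage_satisfied(df, sensitive_attr, constraints):
        # A coverage constraint requires each protected group to keep at
        # least a minimum number of records after the transformation.
        counts = df[sensitive_attr].value_counts()
        return all(counts.get(g, 0) >= k for g, k in constraints.items())

    data = pd.DataFrame({"gender": ["F", "M", "M", "F", "M"],
                         "score":  [7, 9, 8, 5, 6]})
    filtered = data[data["score"] >= 7]  # the slicing operation to be checked
    print(coverage_satisfied(filtered, "gender", {"F": 2, "M": 2}))  # False: one F left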
Future work is needed in order to understand how to define new distance
functions between distributions, focusing on their behaviour during
coverage-based rewriting. An additional issue concerns the definition and the
analysis of further solution-based measures, evaluating the quality of the
detected solution with respect to the solution that would have been obtained
without considering a sample, as well as their integration in the covRew
prototype system.

REFERENCES
[1] C. Accinelli, S. Minisi, B. Catania. Coverage-based rewriting for data preparation. In Proc. of the EDBT/ICDT Workshops, CEUR-WS.org, 2020.
[2] C. Accinelli, B. Catania, G. Guerrini, S. Minisi. covRew: a Python toolkit for pre-processing pipeline rewriting ensuring coverage constraint satisfaction. In Proc. of the 24th Int. Conf. on Extending Database Technology (EDBT), 2021.
[3] C. Accinelli, B. Catania, G. Guerrini, S. Minisi. A coverage-based approach to data transformations. In preparation.
[4] S. Agarwal et al. Knowing when you're wrong: building fast and reliable approximate query processing systems. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 481–492, 2014.
[5] A. Asudeh, H. V. Jagadish, J. Stoyanovich. Towards responsible data-driven decision making in score-based systems. IEEE Data Eng. Bull., 42(3):76–87, 2019.
[6] A. Asudeh, Z. Jin, H. V. Jagadish. Assessing and remedying coverage for a given dataset. In Proc. of the 35th IEEE Int. Conf. on Data Engineering (ICDE), pages 554–565, 2019.
[7] M. Basseville. Divergence measures for statistical data processing. 2010.
[8] R. K. Bellamy et al. AI Fairness 360: An extensible toolkit for detecting and mitigating algorithmic bias. IBM Journal of Research and Development, 63(4/5):4:1–4:15, 2019.
[9] G. Cormode, M. N. Garofalakis, P. J. Haas, C. Jermaine. Synopses for massive data: Samples, histograms, wavelets, sketches. Found. Trends Databases, 4(1-3):1–294, 2012.
[10] I. Dagan, L. Lee, F. Pereira. Similarity-based methods for word sense disambiguation. arXiv preprint cmp-lg/9708010, 1997.
[11] M. Drosou, H. V. Jagadish, E. Pitoura, J. Stoyanovich. Diversity in big data: A review. Big Data, 5(2):73–84, 2017.
[12] Z. Jin et al. MithraCoverage: A system for investigating population bias for intersectional fairness. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 2721–2724, 2020.
[13] S. Kullback. Information Theory and Statistics. Courier Corporation, 1997.
[14] S. Kullback, R. A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, 1951.
[15] W. McKinney. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. O'Reilly Media, Inc., 2012.
[16] C. Mishra, N. Koudas. Interactive query refinement. In Proc. of the 12th Int. Conf. on Extending Database Technology (EDBT), pages 862–873, 2009.
[17] I. Olkin, F. Pukelsheim. The distance between two random vectors with given dispersion matrices. Linear Algebra and its Applications, 48:257–263, 1982.
[18] M. Riondato et al. The VC-dimension of SQL queries and selectivity estimation through sampling. In Machine Learning and Knowledge Discovery in Databases, European Conference (ECML PKDD), Part II, pages 661–676, 2011.
[19] B. Salimi et al. HypDB: A demonstration of detecting, explaining and resolving bias in OLAP queries. Proc. VLDB Endow., 11(12):2062–2065, 2018.
[20] B. Salimi, J. Gehrke, D. Suciu. Bias in OLAP queries: Detection, explanation, and removal. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 1021–1035, 2018.
[21] S. Schelter, Y. He, J. Khilnani, J. Stoyanovich. FairPrep: Promoting data to a first-class citizen in studies on fairness-enhancing interventions. In Proc. of the 23rd Int. Conf. on Extending Database Technology (EDBT), pages 395–398, 2020.
[22] M. A. Stephens. EDF statistics for goodness of fit and some comparisons. Journal of the American Statistical Association, 69(347):730–737, 1974.
[23] J. Stoyanovich, B. Howe, H. V. Jagadish. Responsible data management. Proc. VLDB Endow., 13(12):3474–3488, 2020.
[24] C. Sun et al. MithraLabel: Flexible dataset nutritional labels for responsible data science. In Proc. of the 28th Int. Conf. on Information and Knowledge Management (CIKM), pages 2893–2896, 2019.
[25] K. Yang, B. Huang, J. Stoyanovich, S. Schelter. Fairness-aware instrumentation of preprocessing pipelines for machine learning. In Workshop on Human-In-the-Loop Data Analytics (HILDA), 2020.
[26] K. Yang et al. A nutritional label for rankings. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 1773–1776, 2018.
[27] Y. Lin, Y. Guan, A. Asudeh, H. V. Jagadish. Identifying insufficient data coverage in databases with multiple relations. Proc. VLDB Endow., pages 2229–2242, 2020.