<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The impact of rewriting on coverage constraint satisfaction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Chiara Accinelli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Barbara Catania</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanna Guerrini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simone Minisi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DIBRIS - University of Genoa</institution>
          ,
          <addr-line>Genoa</addr-line>
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Due to the impact of analytical processes on our life, an increasing effort is being devoted to the design of technological solutions that help humans in measuring the bias introduced by such processes and understanding its causes. Existing solutions can refer to either back-end or front-end stages of the data processing pipeline and usually represent bias in terms of some given diversity or fairness constraint. In our previous work [1], we proposed an approach for rewriting filtering and merge operations in pre-processing pipelines into the “closest” operations so that protected groups are adequately represented (i.e., covered) in the result. This is relevant because any under-represented category in an initial or intermediate dataset might lead to an under-representation of that category in any subsequent analytical process. Since many potential rewritings exist, the proposed approach is approximate and relies on a sample-based cardinality estimation, thus introducing a trade-off between the accuracy and the efficiency of the process. In this paper, we investigate this trade-off by first presenting various measures quantifying the error introduced by the rewriting, due to the applied approximation and the selected sample. Then, we (preliminarily) experimentally evaluate such measures on a real-world dataset.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 INTRODUCTION</title>
      <p>
        The impact of data on our society keeps growing, with
data about people being more and more often exploited
as the basis for decisions that might impact people’s lives.
Thus, it becomes crucial to ensure that, in the systems enabling
such data-based decisions, data are dealt with in a responsible and
non-discriminating way [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], in all the steps, from acquisition to
analysis.
      </p>
      <p>Non-discrimination can be addressed by considering specific
diversity and fairness constraints. Diversity allows us to capture
the quality of a collection of items with respect to the variety
of its constituent elements. On the other hand, fairness can be
broadly defined as the impartial treatment of individuals and of
demographic groups inside data processing tasks.</p>
      <p>Among all data processing steps, data pre-processing plays a
relevant role when considering non-discrimination issues since it
can introduce technical bias by exacerbating pre-existing societal
bias, with an impact on the whole data lifecycle.</p>
      <p>When considering pre-processing tasks, an often considered
non-discrimination constraint is coverage. Coverage constraints
guarantee that the input, or training, dataset includes enough
examples for each (protected) category of interest, thus increasing
diversity with the aim of limiting the introduction of bias during
the next analytical steps. Coverage is quite relevant in the first
data processing tasks, like data transformation, since the applied
transformations might violate the coverage of protected categories,
even when it is initially satisfied.</p>
      <p>In this paper, we are interested in investigating solutions for
detecting bias in data pre-processing steps, defined in terms of
coverage constraints, with special reference to filtering and merge
data transformations, for mitigating it, and for checking whether the
mitigation was effective. Specifically, we focus on classical data
transformation operations, often defined in terms of
Selection-Projection-Join (SPJ) operations over tabular data, that can reduce
the number of records related to some protected or disadvantaged
groups, defined in terms of some sensitive attributes, even if such
attributes are not directly used in the specification of the data
transformation operation.</p>
      <p>
        In this frame, the approach we developed [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] aims at
supporting the user by minimally rewriting the transformation operation
so that input coverage constraints are guaranteed to be satisfied in
the transformation result. Through rewriting, the revised process
is traced for further processing, thus guaranteeing transparency.
Since many potential rewritings exist, we proposed a
sample-based two-step approach for detecting, in an approximate
way, the minimal (i.e., the optimal) rewriting of the
original query. After transforming the SPJ query into a canonical
form, first (in a pre-processing step), the search space of potential
rewritings is discretized, so that an approximation of the optimal
solution can be detected in the next processing step, by looking at
the resulting finite set of points. The coverage-based rewriting of
the input query can be obtained by visiting the grid, produced as
the result of the pre-processing step, according to an order that
guarantees the fast detection of the minimal rewriting, and by
verifying constraint satisfaction through a sample-based approach.
The coverage-based rewriting is approximate both because of the
discretization of the search space and because of the error in estimating
cardinalities and constraint satisfaction on the sample.
      </p>
      <p>In this paper, we start from this approach and we investigate
its effectiveness in the detection of the optimal coverage-based
solution. To this aim, we introduce three main groups of
measures quantifying the error that can be generated by the
approximate and sample-based approach used. The first group of
measures deals with the accuracy of the discretization applied
in the pre-processing step. The second group deals with the
error due to the usage of the sample for the grid generation and
cardinality estimations, while the third group helps the user in
quantitatively evaluating specific solutions obtained through the
rewriting.</p>
      <p>
        The generated rewritings are then (preliminarily)
experimentally analyzed in terms of the proposed measures, on a real-world
dataset. The obtained results provide hints on how to tune
approximation and sample related parameters to achieve a good
trade-off between accuracy and performance in the context of
coverage-based rewriting, by combining new results related to
accuracy presented in this paper with results related to
performance, previously presented in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>Even if the proposed measures have been defined in the context
of a specific coverage-based approach, we believe that they can be
of more general value in understanding the role of approximation
and sampling during data pre-processing.</p>
      <p>The remainder of the paper is structured as follows. In
Section 2, we present the overall approach to coverage-based
rewriting. In Section 3, we introduce new measures to quantify, in
terms of accuracy, the impact of rewriting. In Section 4, we
experimentally analyze the impact of the rewriting according to
the introduced measures. Section 5 discusses related work while
Section 6 concludes the paper and outlines future work directions.
</p>
    </sec>
    <sec id="sec-2">
      <title>2 COVERAGE-BASED REWRITING</title>
      <p>
        We focus on data to be transformed for further data analytical
tasks. Data can refer to specific protected (minorities or historically
disadvantaged) groups and we aim at guaranteeing that each
transformation step during the pre-processing pipeline, based
on filtering or merge operations, produces a new dataset
containing enough entries for each protected group of interest. In
the following, we briefly describe input data and the proposed
technique. Additional details can be found in [
        <xref ref-type="bibr" rid="ref1 ref3">1, 3</xref>
        ].
Datasets. Our rewriting approach can be applied over a
collection of tabular datasets (e.g., relations in a relational database,
Data Frames in the Pandas analytical environment) D ≡ D1, ..., Dm.
Among the attributes A1, ..., An of each input dataset, we assume
that some discrete valued attributes S1, ..., Sh are of particular
concern, since they allow the identification of protected groups,
and we call them sensitive attributes. Examples of sensitive
attributes are gender (with values in {female, male}) and
race (with values in {asian, black, hispanic, white}).
      </p>
      <sec id="sec-2-1">
        <title>Data pre-processing operations</title>
        <p>The pre-processing operations we are interested in correspond to monotonic
Select-Project-Join (SPJ) queries over input tabular data that might alter the
representation (i.e., the coverage) of specific groups of interest,
defined in terms of sensitive attribute values. To this aim, we
focus on SPJ queries that return, among the others, at least one
sensitive attribute (called sensitive SPJ operations or queries). For
the sake of simplicity, we assume that selection conditions are
defined over numeric attributes, even if the proposed approach
can be easily extended to any other ordered domain. Thus, under
the considered assumptions, sensitive attributes are not included
in selection conditions (a typical assumption in data processing).</p>
        <p>
          In the following, when needed, we denote Q by Q⟨w1, ..., wn⟩
or Q⟨w⟩, w ≡ (w1, ..., wn), where w1, ..., wn are the constant values
appearing in the selection conditions of Q.
Coverage constraints. Conditions over the number of entries
belonging to a given protected group of interest returned by the
execution of SPJ queries can be specified in terms of coverage
constraints [
          <xref ref-type="bibr" rid="ref27 ref6">6, 27</xref>
          ]. Given a sensitive SPJ query Q, with reference
to a sensitive attribute S ∈ {S1, ..., Sh}, and a value v belonging to
the domain of S, a coverage constraint with respect to S and v is
denoted by Q↓(S=v) ≥ k and it is satisfied by Q over the
input dataset I when count(σ(S=v)(Q(I))) ≥ k holds. For example,
choosing gender as a sensitive attribute, a coverage constraint
could be Q↓(gender=female) ≥ 10, specifying that the result of Q must
contain data related to at least 10 female individuals.
        </p>
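        <p>To make the constraint concrete, the following minimal sketch (our
illustration, not the authors' code; the helper name coverage_satisfied is
hypothetical) checks a coverage constraint Q↓(S=v) ≥ k on a query result
materialized as a Pandas DataFrame:</p>
        <preformat>
import pandas as pd

def coverage_satisfied(result: pd.DataFrame, sensitive: str, value, k: int) -&gt; bool:
    # Coverage constraint Q↓(sensitive=value) ≥ k: the query result
    # must contain at least k tuples of the protected group.
    return (result[sensitive] == value).sum() &gt;= k

# Toy data standing in for the result of a filtering query Q.
df = pd.DataFrame({"gender": ["female", "male", "female", "male"],
                   "num_medications": [5, 3, 8, 2]})
result = df[df["num_medications"] &lt; 10]
print(coverage_satisfied(result, "gender", "female", 2))  # True
        </preformat>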
      <p>Coverage constraints can be provided together with Q or they
can already be available in the system, as any other integrity
constraint. This could be useful when they represent generally
valid non-discrimination rules that must be satisfied by any query
execution.</p>
        <p>
          The approach. Given a dataset I and a set of coverage
constraints C, each selected sensitive SPJ query Q can be rewritten
into another query Q_{C,I}, according to what is presented in [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], so
that Q_{C,I} is the minimal query relaxing Q that guarantees
coverage constraint satisfaction when evaluated over the input dataset.
Relaxation is reasonable when the user is happy with the
specified transformation and wants to keep the result set of the
original query after the rewriting. Q_{C,I} must therefore satisfy
the following properties: (i) Q_{C,I} ≡ Q⟨t⟩, thus Q_{C,I} is obtained
from Q by only changing the selection constants; (ii) Q ⊆ Q_{C,I},
thus Q_{C,I} always contains the result of the input query; (iii) all
coverage constraints associated with Q are satisfied by Q_{C,I}(I).
The rewriting should be optimal, i.e., the new query has to satisfy
specific minimality properties with respect to the input query
Q. In order to make the definition of minimality properties
homogeneous with respect to all the selection attributes A in Q,
we define them in a transformed unit space, in which the values
for each attribute A in Q are normalized between 0 and 1. We
use underlined symbols to denote a query, a dataset, an attribute,
and a vector of values, respectively, in the unit space. Notice that
properties (i), (ii), and (iii) are satisfied in the original space if and
only if they are satisfied in the normalized one. Minimality can
now be stated according to the following two properties: (iv) there
is no other query Q′ satisfying conditions (i), (ii), and (iii) such
that Q′(I) ⊂ Q_{C,I}(I) (thus, Q_{C,I} is the minimal query on I
satisfying (i), (ii), and (iii)); (v) Q_{C,I} ≡ Q⟨t⟩ is the closest query
to Q⟨w⟩ according to the Euclidean distance between t and w in
the unit space, satisfying (i), (ii), (iii), and (iv) (thus, Q_{C,I} is the
coverage-based rewriting syntactically closest to the input query,
thus maximizing proximity and potentially user satisfaction).</p>
        <p>
          In order to compute the optimal coverage-based rewriting of
an SPJ query  ⟨ ⟩, given a set of coverage constraints  and an
instance  , we follow the approach presented in [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], consisting
of three steps briefly discussed in what follows. For notational
simplicity, in presenting the approach and the
related examples, we do not underline symbols referring to the
unit space, even if proximity is always considered in that space.
        </p>
        <p>Canonical form generation. We first translate the selected SPJ
queries into a canonical form, in which each selection condition
containing operators (&gt;, ≥, =) is translated into one or more
equivalent conditions defined in terms of operator &lt;. For example,
any predicate of the form A &gt; v can be transformed into the
predicate −A &lt; −v. When considering canonical forms, an
optimal coverage-based rewriting query is obtained from the
input query by replacing one or more selection predicates p ≡
A &lt; w with a predicate p′ ≡ A &lt; t with t ≥ w. p′ is
called a relaxation of p. Relaxed queries generated through
coverage-based rewriting starting from Q⟨w⟩, C, and I have the
form Q⟨t⟩, with t ≥ w, and can be represented as points t in the
n-dimensional space defined over the selection attributes. Taking
into account the features of the canonical form and property (ii)
of the optimal rewriting, it is simple to show that the query point
corresponding to the optimal rewriting must be contained in
the upper right region of the reference space with respect to the
point represented by the input query.
[Figure 1: (a) data distribution; (b) data discretization (4 bins);
(c) multi-dimensional grid; (d) multi-dimensional grid visit.]</p>
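        <p>To illustrate the canonical form, the following sketch (our illustration,
assuming integer-valued attributes; to_canonical is a hypothetical helper)
encodes each predicate as a triple (sign, attribute, constant) standing for
sign·A &lt; constant:</p>
        <preformat>
def to_canonical(attr, op, v, step=1):
    # Rewrite a selection predicate into equivalent predicates using
    # only the &lt; operator, assuming an ordered domain with
    # granularity `step` (e.g., step=1 for integer attributes).
    if op == "&lt;":
        return [(+1, attr, v)]
    if op == "&lt;=":                          # A ≤ v  ≡  A &lt; v+step
        return [(+1, attr, v + step)]
    if op == "&gt;":                           # A &gt; v  ≡  -A &lt; -v
        return [(-1, attr, -v)]
    if op == "&gt;=":                          # A ≥ v  ≡  -A &lt; -(v-step)
        return [(-1, attr, -(v - step))]
    if op == "==":                          # A = v  ≡  both bounds
        return [(+1, attr, v + step), (-1, attr, -(v - step))]
    raise ValueError(op)

print(to_canonical("num_medications", "&gt;", 10))  # [(-1, 'num_medications', -10)]
        </preformat>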
        <p>Example 2.1. Consider the Diabetes US dataset¹ and let gender
be the sensitive attribute. Suppose we are interested in finding
people whose number of medications is less than 10 and whose
number of performed lab tests is less than 30. Additionally, suppose we
would like to guarantee that at least 15 females are present in the
query result (coverage constraint). The corresponding SPJ query
is Q⟨30, 10⟩, defined in SQL as SELECT * FROM Diabetes WHERE
num_lab_procedures &lt; 30 AND num_medications &lt; 10.
Figure 1(a) shows the data distribution corresponding to a small
sample of size 100 taken from the Diabetes relation, projected
over the attributes referenced in the query selection conditions
(points are colored and shaped according to the sensitive attribute
values: blue crosses for females and black dots for males). The
query corresponds to point (30, 10) in such a space and the region
boxed at the bottom left of point (30, 10) contains the result of Q.</p>
        <p>The query is already in canonical form, so no preliminary
rewriting is required. The search space for detecting coverage-based
rewritings of the input query corresponds to the grey
region in Figure 1(b). ⋄
1 https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008</p>
        <p>Pre-processing. According to property (ii) of the coverage-based
rewriting, given an input query Q⟨w⟩, any coverage-based
rewriting is located in the upper right portion of the space defined
by w. Thus, the (unit) search space contains infinitely many possible
coverage-based rewritings among which the optimal one should
be identified. During the pre-processing step, such a search space is
discretized, so that an approximation of the optimal solution can
be detected in the next processing step by looking at the
resulting finite set of points. To this aim, we first organize the search
space as a multi-dimensional grid. The grid has n axes, one for
each selection attribute in the canonical form of Q⟨w⟩, and each
axis, starting from the query values, is discretized into a fixed set
of bins, by using a binning approach (e.g., equi-width, dividing
each axis into a fixed number of bins of equal size, or equi-depth,
in which each bin contains an equal number of instances),
typical of histogram generation. Each point t at the intersection
of hyperplanes corresponding to bin values corresponds to a
sensitive SPJ query containing Q⟨w⟩, thus satisfying condition
(ii) of the reference problem. The set of grid points identified
in this way is called the discretized search space. The approach is
approximate because a query smaller, in terms of minimality and
proximity, than that identified by the algorithm, corresponding
to a coverage-based rewriting of the input query, might exist
but, if it lies inside one grid cell, it cannot be discovered by the
algorithm. Notice that the grid is computed starting from Q and
the sample S (the full dataset I is not used).</p>
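        <p>A minimal sketch of the grid construction (our illustration; it omits the
optimizations of the original algorithm and assumes the sample is a Pandas
DataFrame) discretizes each axis from the query value upward and takes the
Cartesian product of the per-axis bin edges:</p>
        <preformat>
import numpy as np

def axis_bins(values, w, nbins, strategy="equi-depth"):
    # Discretize one axis of the search space, from the query value w
    # upward, into nbins bins (equi-width or equi-depth).
    region = values[values &gt;= w]
    if strategy == "equi-width":
        edges = np.linspace(w, region.max(), nbins + 1)
    else:  # equi-depth: quantile-based edges follow the data distribution
        edges = np.quantile(region, np.linspace(0, 1, nbins + 1))
    return np.unique(edges)

def grid_points(sample, attrs, w, nbins, strategy="equi-depth"):
    # Each grid point t is a candidate rewriting Q⟨t⟩ containing Q⟨w⟩.
    axes = [axis_bins(sample[a].to_numpy(), wi, nbins, strategy)
            for a, wi in zip(attrs, w)]
    mesh = np.meshgrid(*axes, indexing="ij")
    return np.stack([m.ravel() for m in mesh], axis=1)
        </preformat>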
        <p>Example 2.2. During the pre-processing step, by applying, as
an example, the equi-depth binning approach to each axis of our
reference example, we obtain the grid represented in Figure 1(b),
considering 4 bins for each axis. The points of the resulting
discretized search space correspond to the grid intersection points.
Each point corresponds to a sensitive SPJ query, obtained from
the input one by replacing selection constants with the grid point
coordinates (see Figure 1(c)). ⋄</p>
        <p>Processing. During the processing step, we visit the discretized
search space returned by the pre-processing step starting from
the grid point corresponding to the input query. The discretized
search space is visited one point after the other, at increasing
distance from w. For each point t, we check whether the
associated query Q⟨t⟩ is a coverage-based rewriting of Q⟨w⟩ by
estimating the query result cardinality, for each protected group
referenced by the coverage constraints in C, and the query cardinality
count(Q⟨t⟩(S)).</p>
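        <p>The visit can be rendered as follows (our simplification: it returns the
satisfying grid point closest to the query point and ignores the pruning and
iterative grid refinement discussed below):</p>
        <preformat>
import numpy as np

def find_rewriting(points, q, satisfies):
    # Visit the discretized search space at increasing distance from
    # the query point q; satisfies(t) estimates, on the sample, whether
    # the query Q⟨t⟩ meets all coverage constraints in C.
    order = np.argsort(np.linalg.norm(points - q, axis=1))
    for t in points[order]:
        if satisfies(t):
            return t      # first hit: closest satisfying candidate
    return None           # no rewriting within the discretized space
        </preformat>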
        <p>
          The properties of the discretized search space and of the used
canonical form are taken into account for pruning cells that
cannot contain the solution and for further improving the efficiency
of the process, by iteratively refining the size and the number of
cells during the visit and, as a consequence, by working with a
discretized search space at varying granularity [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
        <p>
          Example 2.3. Figure 1(d) illustrates the processing approach
for the considered example. Starting from the grid point
corresponding to the input query Q, we estimate the cardinality of
σ(gender=female)(Q(S)), needed for checking constraint satisfaction,
and of Q(S), by relying on a sample-based approach, obtaining (2, 8).
Since the constraint is not satisfied, we further visit the other
points of the discretized search space at increasing distance from
Q, checking constraint satisfaction and looking for the minimal
rewriting (property (iv)). The visit proceeds as pointed out in
Figure 1(d) (the order of the visit is represented by the blue numbers
associated with the top right vertex of each cell). Shaded cells
are not visited thanks to the pruning effect: if Q⟨t⟩(S) satisfies
the coverage constraints, all the points in the upper right portion
of space defined by t will not satisfy conditions (iv) and (v) of
the reference problem, thus they can be discarded. The optimal
coverage-based rewriting corresponds to query Q⟨43, 20⟩ (big
blue dot in Figure 1(d)) since (43, 20) is the point at the minimum
distance from the origin such that the corresponding query
satisfies the considered coverage constraint (Q↓(gender=female) ≥ 15). The
input query is thus rewritten into SELECT * FROM Diabetes WHERE
num_lab_procedures &lt; 43 AND num_medications &lt; 20. ⋄
Sample-based estimation. The processing step, as well as
coverage constraint checking, requires fast and accurate cardinality
estimates. To make the processing more efficient, similarly to
[
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], we rely on sampling-based estimators based on samples
(uniform, independent, and without replacement) of the input
dataset [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], dynamically constructed during the rewriting phase,
and on the approach in [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] for generating the sample of joined
tables. As is well known, a sample of a given size can be generated
so that query selectivity can be estimated with an error ε and
confidence γ [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. As an example, if the error is 1% and the
confidence is 95%, the sample size should be 9604: this means that 95
samples of size 9604 out of 100 will lead to an estimation error
of at most 1%.
        </p>
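        <p>The quoted sizes follow from the standard worst-case bound n = z²/(4ε²)
for estimating a proportion within absolute error ε at confidence γ, with z the
two-sided normal quantile; the following check (our computation) reproduces,
up to rounding, the sizes used in Section 4:</p>
        <preformat>
from math import ceil
from statistics import NormalDist

def sample_size(eps, conf):
    # Worst-case sample size for estimating a selectivity within
    # absolute error eps at the given confidence level.
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # two-sided quantile
    return ceil(z * z * 0.25 / (eps * eps))

print(sample_size(0.01, 0.95))  # 9604
print(sample_size(0.01, 0.99))  # 16588
print(sample_size(0.03, 0.95))  # ≈ 1067
print(sample_size(0.03, 0.99))  # ≈ 1843
        </preformat>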
        <p>
          Performance evaluation. The analysis of the efficiency of the
proposed coverage-based query rewriting approach has been
investigated in a previous work [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. The experiments demonstrated
that the time complexity of the proposed algorithms depends
on: the number of bins, the used binning approach, the
number of selection conditions in the query, the coverage constraint
threshold, and the sample size. Specifically, by varying the
number of selection conditions or the number of bins, and therefore
the dimensionality of the multi-dimensional grid and the size
of the discretized search space, the execution time rapidly
increases; the impact of the curse of dimensionality can however
be reduced by applying the designed optimizations. Equi-depth
binning approaches often lead to a more efficient processing than
equi-width approaches since the distribution of points in the
discretized search space follows the data distribution, thus reducing
the number of grid cells with no dataset points and, as a
consequence, the number of cardinality estimations. The execution
time also depends on the chosen coverage constraint thresholds
and the number of coverage constraints. Finally, the sample size
influences the cardinality estimation time and therefore the total
execution time.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 IMPACT EVALUATION</title>
      <p>The coverage-based rewriting of the input query can be obtained
by visiting the grid, produced as the result of the pre-processing
step, and by verifying constraint satisfaction relying on a
sample-based approach. The optimal coverage-based rewriting is
therefore approximate since: (i) the grid, which corresponds to the
discretized search space, might have an impact on the accuracy of
the selected coverage-based rewriting; (ii) the estimation error
related to the sample usage has an impact on query cardinality
estimation and on constraint satisfaction.</p>
      <p>It is therefore important to introduce some measures
quantifying the error that can be generated. To this aim, in the following
we discuss three groups of measures: the first group deals with
the approximation error related to the usage of the grid for the
discretization of the query search space; the second group deals
with the approximation error related to the usage of a sample
during the pre-processing and processing phases; the third concerns
the error related to the detected optimal rewriting.
</p>
    </sec>
    <sec id="sec-4">
      <title>3.1 Grid-based accuracy</title>
      <p>Due to the usage of a discretized search space, the optimal
coverage-based rewriting identified by the proposed approach is the
best approximation of an optimal rewriting, given the considered
grid. The accuracy related to the usage of a given grid for the
detection of the optimal rewriting thus corresponds to the error
introduced by the discretization process.</p>
      <p>As pointed out before, given a query Q⟨w⟩, the visit proceeds
at increasing distance from the query point w. When an optimal
rewriting Q⟨t⟩ is reached, this means that the neighbours of
t in the discretized search space cannot be optimal rewritings
(otherwise the search would have stopped before). On the other
hand, another point t′ might exist that is not included in the
search space but Q⟨t′⟩(I) satisfies C and either it is closer to
the query point w than t or Q⟨t′⟩(I) ⊆ Q⟨t⟩(I). Such a point is an
optimal rewriting but, due to the approximation, it cannot be
identified by the proposed approach. The approximation error
for the identified approximate optimal rewriting Q⟨t⟩, also called
grid-based accuracy, can therefore be defined as the maximum
distance between t and all its neighbours on the grid preceding
it in the search; the accuracy is therefore lower than or equal
to the diagonal of the grid cells having t as a vertex and closer
to w than t. By considering the entire grid, we can quantify the
maximum and minimum grid-based accuracy in terms of the
maximum and the minimum diagonal length of grid cells (see
Figure 2 for a graphical explanation).</p>
      <p>Definition 3.1 (Grid-based accuracy). Let G be the grid
generated from a dataset D and a query Q using a certain binning
approach B. The minimum/maximum grid-based accuracy of G,
denoted by acc_min(G) and acc_max(G), is defined as the minimum/
maximum diagonal length of grid cells in G, normalized between
0 and 1. □</p>
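      <p>Operationally (our sketch), given the per-axis bin edges normalized to
the unit space, the two bounds of Definition 3.1 are the shortest and the
longest cell diagonals:</p>
      <preformat>
import numpy as np
from itertools import product

def grid_accuracy(axes_edges):
    # axes_edges: one sorted array of bin edges per axis, normalized
    # to [0, 1].  A cell diagonal is the norm of its per-axis widths.
    widths = [np.diff(e) for e in axes_edges]
    diags = [float(np.sqrt(sum(w * w for w in cell)))
             for cell in product(*widths)]
    return min(diags), max(diags)

edges = [np.array([0.0, 0.1, 0.4, 1.0]), np.array([0.0, 0.2, 0.5, 1.0])]
print(grid_accuracy(edges))  # (acc_min(G), acc_max(G))
      </preformat>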
      <p>Example 3.2. Figure 2 shows the normalized cell diagonal
lengths for the grid created as discussed in Example 2.2. The
grid-based accuracy for this grid varies between 0.05 (dotted blue
line), corresponding to queries Q⟨30, 15⟩ and Q⟨30, 20⟩, and 0.73
(dashed blue line), corresponding to query Q⟨90, 75⟩. ⋄</p>
      <p>Different binning approaches might lead to a different grid-based
accuracy. In particular, when fixing the reference data
interval over each axis, binning approaches based on data
distribution, like equi-depth, lead to a higher variability of grid-based
accuracy, since they generate smaller buckets for dense regions
and larger buckets for sparse ones, as stated by the following
proposition.</p>
      <p>Proposition 1. Let G_ew be the grid generated from a dataset D
and a query Q using the equi-width approach and G_ed the one
generated using the equi-depth approach, with the same number of
bins for each axis in both cases. Then: (i) acc_min(G_ed) ≤
acc_min(G_ew); (ii) acc_max(G_ed) ≥ acc_max(G_ew). □</p>
    </sec>
    <sec id="sec-5">
      <title>3.2 Sample-based accuracy</title>
      <p>The accuracy of the query rewriting approach depends on the
sample data distribution for two main reasons: (i) different
sample distributions might lead to the generation of different buckets
and therefore of different grids, with an impact on both the
pre-processing and processing steps; (ii) the sample is used for query
cardinality estimation and, as a consequence, the estimation error
has an impact on the minimality property; in addition, it might
lead to a wrong assessment of constraint satisfaction.
Impact on (pre-)processing. In order to evaluate the impact
of the sample usage on grid generation, we compare the data
distribution of the dataset I with the data distribution of the
sample S, both projected over the attributes appearing in the
considered query Q. The more similar the two distributions, the
lower the impact of the sample selection on the detection of
the optimal coverage-based rewriting.</p>
      <p>
        Several metrics have been proposed for quantifying the
distance between two multivariate datasets. Many of them, e.g., the
Wasserstein metric [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], the Kullback-Leibler [
        <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
        ] and the
Jensen-Shannon divergence [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], quantify the distance between
the corresponding probability distributions and sometimes, as
for the Wasserstein metric, the result might tend to infinity [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
Probability distributions are not directly available under the
considered scenarios and, even if computed, they introduce a further
level of approximation in the computation. The Kolmogorov-Smirnov
(KS) distance, defined for arbitrary univariate
distributions [
        <xref ref-type="bibr" rid="ref22 ref7">7, 22</xref>
        ], measures the maximum distance, between 0 and 1,
between the cumulative distributions of two datasets. Due to its
generality, it is widely used and many, rather complex, extensions
to the multivariate case exist.
      </p>
      <p>In our work, we consider a very simple extension of the
univariate KS metric, obtained by averaging the KS distance
computed for each query attribute. The distance computed between
two datasets D1 and D2 is denoted by KS(D1, D2). This approach is
suitable since, similarly to what we have proposed for the search
space construction, where we rely on an aggregation of
unidimensional histograms instead of a multidimensional histogram,
it aggregates distances defined on each single attribute.</p>
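      <p>In code, the averaged distance is immediate (our sketch; ks_avg is a
hypothetical name, relying on SciPy's two-sample KS statistic):</p>
      <preformat>
import numpy as np
from scipy.stats import ks_2samp

def ks_avg(dataset, sample, attrs):
    # Average of the univariate two-sample Kolmogorov-Smirnov
    # statistics over the query attributes.
    return float(np.mean([ks_2samp(dataset[a], sample[a]).statistic
                          for a in attrs]))
      </preformat>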
      <p>In order to investigate the impact of the KS distance for a
sample S on the identification of the optimal rewriting, it is useful
to introduce some metrics quantifying the difference in using
S instead of the initial dataset I for the detection of the optimal
coverage-based rewriting. Such metrics compute the average
difference, in terms of minimality, proximity, and solution distance,
between the optimal rewritings Q_{C,S} and Q_{C,I}, obtained by
processing the sample S and the original dataset I, respectively,
on a random set of queries QS, uniformly distributed in the query
search space (a code sketch of the three measures follows the list):
• Average minimality difference (property (iv) of the optimal
rewriting). It quantifies how different Q_{C,S} and
Q_{C,I} are, on average, with respect to their result set
cardinalities when they are executed over the input dataset I
(thus, it quantifies, on average, the difference in the
relaxation of the original query over the initial dataset):
m(S, I) ≡ avg_{Q ∈ QS} |count(Q_{C,S}(I)) − count(Q_{C,I}(I))| / count(Q(I)).
• Average proximity difference (property (v) of the optimal
rewriting). It quantifies how different Q_{C,S} and
Q_{C,I} are, on average, with respect to their Euclidean
distance d(·) from the input query Q, in the unit space, further
normalized between 0 and 1 (thus, it quantifies, on average,
how far the two optimal rewritings are with respect to the
original query):
p(S, I) ≡ avg_{Q ∈ QS} |d(Q, Q_{C,S}) − d(Q, Q_{C,I})|.
• Average solution distance. It quantifies how different
Q_{C,S} and Q_{C,I} are, on average, in terms of their
Euclidean distance in the unit space, further normalized
between 0 and 1 (thus, it quantifies, on average, how far
the two optimal rewritings are, without taking the input
dataset and the original query into account):
s(S, I) ≡ avg_{Q ∈ QS} d(Q_{C,S}, Q_{C,I}).</p>
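      <p>The three measures translate directly into code (our sketch; the
functions card and dist and the per-query rewritings are assumed to be
provided by the rewriting machinery described above):</p>
      <preformat>
import numpy as np

def sample_accuracy(queries, rewr_S, rewr_I, card, dist):
    # queries: random query points Q; rewr_S[q] / rewr_I[q]: rewriting
    # obtained on the sample / on the full dataset; card(t): cardinality
    # of Q⟨t⟩ over I; dist: normalized Euclidean distance in unit space.
    m = np.mean([abs(card(rewr_S[q]) - card(rewr_I[q])) / card(q)
                 for q in queries])
    p = np.mean([abs(dist(q, rewr_S[q]) - dist(q, rewr_I[q]))
                 for q in queries])
    s = np.mean([dist(rewr_S[q], rewr_I[q]) for q in queries])
    return m, p, s
      </preformat>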
      <sec id="sec-5-1">
        <title>Impact on constraint satisfaction</title>
        <p>The error and the confidence related to the considered sample
(see Section 2) have an impact also on coverage constraint
satisfaction. A constraint cc ≡ Q↓(S=v) ≥ k is satisfied on I if
count(σ(S=v)(Q(I))) ≥ k.
Since the sample-based estimation of count(σ(S=v)(Q(I))) might
lead to an error of ε × count(I), in order to guarantee that the
constraint is also satisfied by the input dataset I, we can
modify the constraint, when evaluated over the sample, as: Q↓(S=v) ≥
k + (ε × count(I)). This consideration makes the proposed
sample-based approach reasonable for coverage constraints in which
ε × count(I) has the same order of magnitude as k. Notice that, by
changing the constraint as proposed above, we increase the
probability of constraint satisfaction on the input dataset at the price
of reducing the proximity of the optimal rewriting with respect to
the input query (since the optimal query will be further away
from the initial one).</p>
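        <p>In code, the correction is a one-liner (our sketch): the threshold
checked on the sample is inflated by the worst-case absolute estimation
error.</p>
        <preformat>
from math import ceil

def adjusted_threshold(k, eps, card_I):
    # Threshold to check on the sample so that Q↓(S=v) ≥ k still holds
    # (with the sample's confidence) on the full dataset: the estimate
    # may be off by up to eps × count(I).
    return k + ceil(eps * card_I)

print(adjusted_threshold(15, 0.01, 100_000))  # 1015
        </preformat>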
      </sec>
    </sec>
    <sec id="sec-6">
      <title>3.3 Solution-based accuracy</title>
      <p>
        Let S be a sample of dataset I and Q_{C,S} ≡ Q⟨t⟩ be the
optimal solution obtained from S, given a query Q⟨w⟩ and a set of
coverage constraints C. It could be useful to introduce some
additional measures to evaluate the quality of the obtained
optimal rewriting Q_{C,S} with respect to the discretized search space
identified by the chosen dataset S, taking into account both the
applied relaxation with respect to the original query and the
approximation error, in line with what we have proposed for
grid-based and sample-based accuracy. To this aim, we propose
the following three measures:
• Grid-based accuracy of Q_{C,S}. According to what discussed
in Subsection 3.1, it can be computed as the maximum
distance between t and all its neighbours on the grid,
preceding it in the search (thus, the maximum diagonal
of the cells of the grid having t as a vertex and closer to w
than t).
• Relaxation degree. Similarly to [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], the relaxation degree,
first proposed in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], quantifies, through estimations over
the initial dataset I, how much the optimal coverage-based
rewriting Q_{C,S} relaxes the original query Q, as the
percentage of new tuples added with respect to those
contained in the original query result:
(|Q_{C,S}(I)| − |Q(I)|) / |Q(I)|.
• Proximity. It can be computed as the Euclidean distance
between the optimal coverage-based rewriting Q_{C,S} and
Q in the unit space, further normalized between 0 and
1, thus indicating how far the optimal rewriting is with
respect to the original query.</p>
      <p>Example 3.3. The optimal coverage-based rewriting of the
running example corresponds to query Q⟨43, 20⟩ (see Example
2.3). The grid-based accuracy of such a solution is 0.16 (see Figure
2), since this is the maximum distance between the solution and
its neighbours on the grid. Since count(Q⟨43, 20⟩(I)) = 38 and
count(Q(I)) = 8 (see Figure 2), the relaxation degree is 3.75 (to
guarantee constraint satisfaction, the cardinality of the
rewritten query is about 4 times that of the original query). Finally,
the proximity, i.e., the Euclidean distance between (30, 10) and
(43, 20) in the unit space, normalized between 0 and 1, is 0.19,
thus corresponding to a relaxation of 19% with respect to the
maximum query, returning the whole dataset. ⋄</p>
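      <p>The relaxation degree of Example 3.3 can be reproduced directly, while
the proximity depends on the attribute ranges used for the unit-space
normalization, which we do not reproduce here; hence the generic sketch
(ours):</p>
      <preformat>
import math

# Relaxation degree (Example 3.3): fraction of newly added tuples.
card_rewr, card_q = 38, 8
print((card_rewr - card_q) / card_q)  # 3.75

def proximity(q, t):
    # Euclidean distance between unit-space points, further normalized
    # by the maximum possible distance sqrt(len(q)).
    return math.dist(q, t) / math.sqrt(len(q))
      </preformat>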
    </sec>
    <sec id="sec-7">
      <title>4 EXPERIMENTAL RESULTS</title>
      <p>In this section, we present some preliminary experimental results
with the aim of analyzing the accuracy of the proposed query
rewriting approach.
</p>
    </sec>
    <sec id="sec-8">
      <title>4.1 Experimental Setup</title>
      <p>All experiments were conducted on a PC with an Intel Core
i5-8300H 2.30 GHz CPU and 16GB main memory, running Microsoft
Windows 10. All programs were coded in Python 3.8.</p>
      <p>The experiments refer to a real dataset, stored in PostgreSQL:
Diabetes US², representing 10 years (1999-2008) of clinical care
at 130 US hospitals and integrated delivery networks (100,000
instances). It includes over 50 features representing patient and
hospital outcomes. For our experiments, we use gender as the
sensitive attribute and we add a coverage constraint on the value
female.</p>
      <p>For this dataset, we generated many random samples with an
increasing size, guaranteeing a variable percentage of error (1% or
3%) and a variable level of confidence (95% or 99%). More precisely,
we considered the following sample sizes: 1067 (3% error and
95% confidence), 1843 (3% error and 99% confidence), 9604 (1%
error and 95% confidence), 16588 (1% error and 99% confidence).
For each sample size, we generated 5 random samples, for a total
of 20 samples. The samples are all small enough to be stored in
main memory.</p>
      <p>In order to evaluate the accuracy of the proposed approach, we
randomly generated a set of 1000 queries, with different
selectivities, and we compared the impact of the grid and of the
considered sample in detecting the optimal coverage-based rewriting,
by considering the measures presented in Section 3. All the
selected queries contain three selection conditions (thus leading
to a three-dimensional grid), with respect to the three numerical
attributes with the highest number of distinct values, namely
(number_emergency,num_lab_procedures,num_medications);
selection values are picked at random from the attribute value
range in the dataset. For the sake of simplicity, join queries are
not considered for these preliminary experiments.</p>
      <p>For each query, we then defined the coverage constraint taking
into account the considered query and in such a way that it is
neither satisfied on the dataset nor on the considered samples.</p>
      <p>The experiments were performed by considering two
distinct binning approaches during the pre-processing step: (i)
equi-width∗, corresponding to an equi-width approach, dividing each
axis into a fixed number of bins of equal size, set as the minimum
between a selected number (denoted by #bins in the experimental
results) and the number of distinct values for the considered
attribute in the dataset; (ii) equi-depth∗, in which each bin, defined
as for the equi-width∗ approach, contains a constant number of
instances. A variable number of bins, namely 4, 8, 16, 32, 64, has
been considered for specific experiments.</p>
      <p>We then performed three groups of experiments aimed at
analyzing the grid-based, the sample-based, and the solution-based
accuracy of the optimal coverage-based rewriting.
2 https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008
[Figure 3: (a) maximal grid-based accuracy; (b) minimal grid-based accuracy.]</p>
    </sec>
    <sec id="sec-8-1">
      <title>4.2 Experimental Evaluation</title>
      <p>4.2.1 Grid-based accuracy. The first group of experiments
aims at analyzing the maximum and minimum grid-based
accuracy, as presented in Subsection 3.1, by varying the number
of bins, with respect to the selected binning approach, namely
equi-depth∗ and equi-width∗. To this aim, we selected a sample
with size 16588 (sample 16588_3 in Figure 4) and we considered
all the 1000 generated random queries. The obtained results are
independent of the query selectivity; therefore, in the following
we discuss those obtained with selectivity equal to 2.5%.</p>
      <p>Figure 3 shows that both the maximum and the minimum
grid-based accuracy decrease by increasing the number of bins,
independently of the chosen binning approach. This is because,
by increasing the number of bins, the space is discretized into a
higher number of cells, thus obtaining cells of smaller size.</p>
      <p>From Figure 3 we also observe the behaviour described by
Proposition 1: the maximal grid-based accuracy obtained with
the equi-depth∗ approach is always higher than the grid-based
accuracy obtained with the equi-width∗ approach; minimal
accuracy behaves in the opposite way.</p>
      <p>4.2.2 Sample-based accuracy. The second group of
experiments deals with the analysis of the sample-based accuracy,
according to the measures introduced in Subsection 3.2. To this
aim, we considered all the samples described in Subsection 4.1
and, for each of them, we computed the KS distance with respect
to the input dataset.</p>
      <p>Results, presented in Figure 4, show that the obtained values
are very similar and quite small: for most sample sizes, the
greatest differences refer to the third decimal digit. As expected, by
increasing the sample size, the KS distance tends to decrease.</p>
      <p>In order to evaluate the impact of the KS distance on
sample-based accuracy measures, we considered the sample size leading
to the highest variance (9604) and we selected the samples with
the highest difference between the corresponding KS distances
(namely, 9604_3 and 9604_5). We then computed the sample-based
accuracy measures on the set of 1000 random queries.</p>
      <p>[Figure 4: KS distance of each sample with respect to the input
dataset. Samples 9604_1 to 9604_5: 0.0058, 0.0064, 0.0026, 0.0071,
0.0134 (mean 0.0071, variance 1.24e-5). Samples 16588_1 to 16588_5:
0.0156, 0.0183, 0.0130, 0.0130, 0.0119 (mean 0.0144, variance
5.36e-6).]</p>
      <p>Table 1 shows the results obtained by considering the
equi-depth∗ approach and 32 bins (similar values have been obtained
for the equi-width∗ approach and other numbers of bins). As can be
seen, there is no clear ordering between the two samples when
considering such measures. In particular, sample 9604_3 is better than
sample 9604_5 with respect to minimality; however, for
proximity and solution distance the opposite result holds. Thus, it
seems that the KS measure is not good enough for discriminating
between samples with a different behaviour with respect to the
detection of the optimal coverage-based rewriting. Additional
work is therefore needed for investigating or defining
alternative functions able to discriminate between samples under the
considered scenario.</p>
      <p>4.2.3 Solution-based accuracy. The last group of experiments
aims at analyzing the solution-based accuracy, according to the
measures introduced in Subsection 3.3, by varying the number of
bins, with respect to the selected binning approach, namely
equi-depth∗ and equi-width∗. To this aim, we selected sample 16588_3
(i.e., the biggest sample with the smallest KS distance with respect
to the initial dataset) and we considered all the 1000 generated
random queries. The obtained results show a different behaviour
with respect to the region in which the optimal solution is
located, either dense or sparse. In particular, Figure 5 shows that,
as expected, by increasing the number of bins, the grid-based
accuracy of the solution decreases as well. However,
depending on where the optimal solution is located, the equi-depth∗
and the equi-width∗ approaches behave in a different way. More
precisely, when the solution region is dense (Figure 5(a)), the
accuracy is lower with the equi-depth∗ approach since in this case
higher density will lead to smaller bins. On the other hand, when
the solution region is sparse (Figure 5(b)), the accuracy is usually
lower with the equi-width∗ approach since in this case lower
density will lead to longer bins under the equi-depth∗ approach.
[Figure 5: solution-based accuracy with the solution (a) in a dense
region and (b) in a sparse region.]</p>
      <p>A similar behavior can be observed for the relaxation degree
(Figure 6) and the proximity (Figure 7): by increasing the number
of bins, they decrease for both binning approaches. When
the solution is contained in a dense region, equi-depth∗ behaves
better than equi-width∗, especially for low numbers of bins. We
further notice that, for very low numbers of bins, proximity
and grid-based accuracy coincide (the farthest neighbour of the
solution is the query point itself).</p>
      <p>
        Finally, notice that dense regions have a greater impact on
query cardinalities and, as a consequence, on the relaxation
degree, whose values are usually higher in this case (starting from
8 bins), independently of the selected binning approach. On the
other hand, dense regions tend to generate optimal solutions with
lower proximity (more evident with the equi-depth∗ approach,
which is more sensitive to the data distribution), since constraint
satisfaction can be obtained on query points closer to the initial one.
In general, for uniformly distributed datasets, in which no sparse
regions can be detected, equi-depth∗ can be considered the best
option. By combining the obtained results with those presented
in [
        <xref ref-type="bibr" rid="ref1 ref3">1, 3</xref>
        ], related to performance, a number of bins equal to 16
can be considered a good compromise between effectiveness and
efficiency.
      </p>
    </sec>
    <sec id="sec-9">
      <title>5 RELATED WORK</title>
      <p>
        The interest in coverage constraints has been introduced in [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ],
drawing inspiration from the literature on diversity [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The
problem of evaluating the coverage of a given dataset has been
considered in the context of the MithraLabel system [
        <xref ref-type="bibr" rid="ref12 ref24">12, 24</xref>
        ], in
which the lack of coverage is modeled as a widget in the
nutritional label [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] of a dataset. Once the lack of coverage has been
identified, the smallest number of data points needed to hit all
the “large uncovered spaces” is identified with the aim of helping
data owners in achieving coverage through a data repairing
approach. When protected categories are defined in terms of many
attributes, the identification of attribute patterns associated with
coverage problems might lead to performance issues, due to the
combinatorial explosion of such patterns. Eficient techniques,
inspired from set enumeration and association rule mining,
addressing this problem have been proposed in [
        <xref ref-type="bibr" rid="ref6">6</xref>
          ]. To fix coverage
violations, additional data can be acquired. Since data
acquisition has a cost in terms of data processing, techniques have
been presented in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] for determining the patterns that can be
covered given a certain maximum cost. An efficient approach for
coverage analysis, given a set of attributes across multiple tables,
is presented in [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]. As pointed out by the previous discussion,
most existing approaches chase coverage through data repair
and focus on efficiency issues. By contrast, we consider accuracy
for coverage-based query rewriting during data transformation,
thus complementing existing approaches.
      </p>
      <p>
        The technique considered in this paper relies on rewriting.
Other fairness-aware rewriting approaches have been proposed
for OLAP queries [
        <xref ref-type="bibr" rid="ref19 ref20">19, 20</xref>
        ]. Bias is defined in terms of causal
fairness (checking for causal relationships from the sensitive
attributes to the outcome) and detected, explained, and resolved
through rewriting. On the other hand, we focus on data
transformations in the presence of coverage constraints.
      </p>
      <p>
        Impact evaluation is quite relevant in the design and the
execution of non-discriminating pipelines, usually very complex
in real-world scenarios. Various systems have been designed
for supporting the user during this activity. Among them, we
recall: Fair-DAGs [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ], an open-source library aiming at
representing data processing pipelines in terms of a directed acyclic
graph (DAG) and identifying distortions with respect to protected
groups as the data flows through the pipeline; FairPrep [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], an
environment for investigating the impact of fairness-enhancing
interventions inside data processing pipelines; AI Fairness 360 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ],
an open-source Python toolkit for algorithmic fairness, aimed at
facilitating the transition of fairness-aware research algorithms
to usage in an industrial setting and at providing a common
framework to share and evaluate algorithms.
      </p>
    </sec>
    <sec id="sec-10">
      <title>6 CONCLUDING REMARKS</title>
      <p>
        Rewriting approaches have been recognized as an interesting
means for enforcing specific non-discrimination properties while
guaranteeing transparency. In this paper, we started from the
approach proposed in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] with the aim of investigating the impact
of rewriting on coverage-constraint satisfaction. The approach is
approximate and relies on a sample for both the construction of
the solution search space and the detection of the optimal
rewriting. Three different groups of measures have been proposed for
quantifying the error induced by the approximation and the
impact of the sample in the detection of the optimal solution.
      </p>
      <p>
        Preliminary experimental results show that: (i) different
binning approaches lead to different grid-based accuracy degrees;
(ii) common measures for computing the distance between
distributions are not effective for analyzing their behaviour in the
detection of the optimal coverage-based solution; (iii) the number
of bins used in the generation of the search space has an impact
on the accuracy of the detected solution, and not only on the
performance [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], while the optimal binning approach depends
on the position of the optimal rewriting in the search space; (iv)
a number of bins equal to 16 represents a good compromise
between accuracy and efficiency.
      </p>
      <p>
        The proposed approach is at the basis of covRew [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], a Python
toolkit for rewriting slicing operations in pre-processing pipelines,
ensuring coverage constraint satisfaction. covRew takes as input
a two-dimensional tabular dataset D, with the related sensitive
attribute specification, for the identification of protected groups,
a processing pipeline represented as a Pandas script [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], and a
set of coverage constraints. It includes three main components:
(i) a pipeline analyzer, which identifies candidate operations for
rewriting, (ii) a pipeline rewriter, which transforms operations
that are selected by the user according to the input coverage
constraints, and (iii) an impact evaluator, assessing the impact of the
rewriting through the usage of the grid-based and solution-based
measures presented in Section 3. Such measures could lead the
user to reconsider some of the choices made in the selection of
the operations to be rewritten, thus producing a new annotated
script.
      </p>
      <p>Future work is needed in order to understand how to define
new distance functions between distributions, focusing on their
behaviour during coverage-based rewriting. An additional issue
concerns the definition and the analysis of further solution-based
measures, evaluating the quality of the detected solution with
respect to the solution that would have been obtained without
considering a sample, as well as their integration in the covRew
prototype system.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Accinelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Minisi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Catania</surname>
          </string-name>
          .
          <article-title>Coverage-based rewriting for data preparation</article-title>
          .
          <source>In Proc. of the EDBT/ICDT Workshops, CEUR-WS.org</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Accinelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Catania</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Guerrini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Minisi</surname>
          </string-name>
          .
          <article-title>covRew: a Python toolkit for pre-processing pipeline rewriting ensuring coverage constraint satisfaction</article-title>
          .
          <source>In EDBT, Proc. of the 24th Int. Conf. on Extending Database Technology</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Accinelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Catania</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Guerrini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Minisi</surname>
          </string-name>
          .
          <article-title>A coverage-based approach to data transformations</article-title>
          .
          <source>In preparation.</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          et al.
          <article-title>Knowing when you're wrong: building fast and reliable approximate query processing systems</article-title>
          .
          <source>In Proc. of the ACM SIGMOD, Int. Conf. on Management of Data</source>
          , pages
          <fpage>481</fpage>
          -
          <lpage>492</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Asudeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. V.</given-names>
            <surname>Jagadish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Stoyanovich</surname>
          </string-name>
          .
          <article-title>Towards responsible data-driven decision making in score-based systems</article-title>
          .
          <source>IEEE Data Eng. Bull.</source>
          ,
          <volume>42</volume>
          (
          <issue>3</issue>
          ):
          <fpage>76</fpage>
          -
          <lpage>87</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Asudeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. V.</given-names>
            <surname>Jagadish</surname>
          </string-name>
          .
          <article-title>Assessing and remedying coverage for a given dataset</article-title>
          .
          <source>In ICDE, Proc. of the 35th IEEE Int. Conf. on Data Engineering</source>
          , pages
          <fpage>554</fpage>
          -
          <lpage>565</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Basseville</surname>
          </string-name>
          .
          <article-title>Divergence measures for statistical data processing</article-title>
          .
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R. K.</given-names>
            <surname>Bellamy</surname>
          </string-name>
          et al.
          <article-title>AI Fairness 360: An extensible toolkit for detecting and mitigating algorithmic bias</article-title>
          .
          <source>IBM Journal of Research and Development</source>
          ,
          <volume>63</volume>
          (
          <issue>4/5</issue>
          ):
          <fpage>4:1</fpage>
          -
          <lpage>4:15</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Cormode</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. N.</given-names>
            <surname>Garofalakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Haas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Jermaine</surname>
          </string-name>
          .
          <article-title>Synopses for massive data: Samples, histograms, wavelets, sketches</article-title>
          .
          <source>Found. Trends Databases</source>
          ,
          <volume>4</volume>
          (
          <issue>1-3</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>294</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>I.</given-names>
            <surname>Dagan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pereira</surname>
          </string-name>
          .
          <article-title>Similarity-based methods for word sense disambiguation</article-title>
          .
          <source>arXiv preprint cmp-lg/9708010</source>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Drosou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. V.</given-names>
            <surname>Jagadish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Pitoura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Stoyanovich</surname>
          </string-name>
          .
          <article-title>Diversity in big data: A review</article-title>
          .
          <source>Big Data</source>
          ,
          <volume>5</volume>
          (
          <issue>2</issue>
          ):
          <fpage>73</fpage>
          -
          <lpage>84</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jin</surname>
          </string-name>
          et al.
          <article-title>MithraCoverage: A system for investigating population bias for intersectional fairness</article-title>
          .
          <source>In Proc. of the ACM SIGMOD Int. Conf. on Management of Data</source>
          , pages
          <fpage>2721</fpage>
          -
          <lpage>2724</lpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kullback</surname>
          </string-name>
          .
          <source>Information theory and statistics</source>
          . Courier Corporation,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kullback</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Leibler</surname>
          </string-name>
          .
          <article-title>On information and sufficiency</article-title>
          .
          <source>The Annals of Mathematical Statistics</source>
          ,
          <volume>22</volume>
          (
          <issue>1</issue>
          ):
          <fpage>79</fpage>
          -
          <lpage>86</lpage>
          ,
          <year>1951</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>W.</given-names>
            <surname>McKinney</surname>
          </string-name>
          .
          <source>Python for data analysis: Data wrangling with Pandas, NumPy, and IPython</source>
          . O'Reilly Media, Inc.,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>C.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Koudas</surname>
          </string-name>
          .
          <article-title>Interactive query refinement</article-title>
          .
          <source>In EDBT, Proc. of the 12th Int. Conf. on Extending Database Technology</source>
          , pages
          <fpage>862</fpage>
          -
          <lpage>873</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>I.</given-names>
            <surname>Olkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pukelsheim</surname>
          </string-name>
          .
          <article-title>The distance between two random vectors with given dispersion matrices</article-title>
          .
          <source>Linear Algebra and its Applications</source>
          ,
          <volume>48</volume>
          :
          <fpage>257</fpage>
          -
          <lpage>263</lpage>
          ,
          <year>1982</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>M.</given-names>
            <surname>Riondato</surname>
          </string-name>
          et al.
          <article-title>The VC-dimension of SQL queries and selectivity estimation through sampling</article-title>
          .
          <source>In Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD, Proceedings, Part II</source>
          , pages
          <fpage>661</fpage>
          -
          <lpage>676</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>B.</given-names>
            <surname>Salimi</surname>
          </string-name>
          et al.
          <article-title>Hypdb: A demonstration of detecting, explaining and resolving bias in OLAP queries</article-title>
          .
          <source>Proc. VLDB Endow.</source>
          ,
          <volume>11</volume>
          (
          <issue>12</issue>
          ):
          <fpage>2062</fpage>
          -
          <lpage>2065</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>B.</given-names>
            <surname>Salimi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gehrke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Suciu</surname>
          </string-name>
          .
          <article-title>Bias in OLAP queries: Detection, explanation, and removal</article-title>
          .
          <source>In Proc. of the ACM SIGMOD Int. Conf. on Management of Data</source>
          , pages
          <fpage>1021</fpage>
          -
          <lpage>1035</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>S.</given-names>
            <surname>Schelter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Khilnani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Stoyanovich</surname>
          </string-name>
          .
          <article-title>FairPrep: Promoting data to a first-class citizen in studies on fairness-enhancing interventions</article-title>
          .
          <source>In EDBT, Proc. of the 23rd Int. Conf. on Extending Database Technology</source>
          , pages
          <fpage>395</fpage>
          -
          <lpage>398</lpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Stephens</surname>
          </string-name>
          .
          <article-title>EDF statistics for goodness of fit and some comparisons</article-title>
          .
          <source>Journal of the American Statistical Association</source>
          ,
          <volume>69</volume>
          (
          <issue>347</issue>
          ):
          <fpage>730</fpage>
          -
          <lpage>737</lpage>
          ,
          <year>1974</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>J.</given-names>
            <surname>Stoyanovich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Howe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. V.</given-names>
            <surname>Jagadish</surname>
          </string-name>
          .
          <article-title>Responsible Data Management</article-title>
          .
          <source>Proc. VLDB Endow.</source>
          ,
          <volume>13</volume>
          (
          <issue>12</issue>
          ):
          <fpage>3474</fpage>
          -
          <lpage>3488</lpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>C.</given-names>
            <surname>Sun</surname>
          </string-name>
          et al.
          <article-title>MithraLabel: Flexible dataset nutritional labels for responsible data science</article-title>
          .
          <source>In CIKM, Proc. of the 28th Int. Conf. on Information and Knowledge Management</source>
          , pages
          <fpage>2893</fpage>
          -
          <lpage>2896</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>K.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Stoyanovich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schelter</surname>
          </string-name>
          .
          <article-title>Fairness-aware instrumentation of preprocessing pipelines for machine learning</article-title>
          .
          <source>In Workshop on Human-In-the-Loop Data Analytics (HILDA 2020)</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>K.</given-names>
            <surname>Yang</surname>
          </string-name>
          et al.
          <article-title>A nutritional label for rankings</article-title>
          .
          <source>In Proc. of the ACM SIGMOD Int. Conf. on Management of Data</source>
          , pages
          <fpage>1773</fpage>
          -
          <lpage>1776</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Guan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Asudeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. V.</given-names>
            <surname>Jagadish</surname>
          </string-name>
          .
          <article-title>Identifying insufficient data coverage in databases with multiple relations</article-title>
          .
          <source>Proc. VLDB Endow.</source>
          , pages
          <fpage>2229</fpage>
          -
          <lpage>2242</lpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>