<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Italian Conference on Big Data and Data Science, September</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Queries: Nondiscrimination Awareness in Data Transformation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Chiara Accinelli</string-name>
          <email>chiara.accinelli@dibris.unige.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Barbara Catania</string-name>
          <email>barbara.catania@dibris.unige.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanna Guerrini</string-name>
          <email>giovanna.guerrini@dibris.unige.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Genoa</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>2</volume>
      <fpage>0</fpage>
      <lpage>21</lpage>
      <abstract>
        <p>When people-related data are used in automated decision making processes, social inequities can be amplified. Thus, the development of technological solutions satisfying nondiscrimination requirements, in terms of, e.g., fairness, diversity, and coverage, is currently one of the main challenges for the data management and data analytics communities. In particular, coverage constraints guarantee that a dataset includes enough items for each (protected) category of interest, thus increasing diversity with the aim of limiting the introduction of bias during the next analytical steps. While coverage constraints have been mainly used for designing data repair solutions, in our study we investigate their efects on data processing pipelines, with a special reference to data transformation. To this aim, we first introduce coverage-based queries as a means for ensuring coverage constraint satisfaction on a selection-based query result, through rewriting. We then present two approximate algorithms for coverage-based query processing: the first, covRew, initially introduced in [3], relies on data discretization and sampling; the second, covKnn is a novel contribution and relies on a nearest neighbour approach for coveragebased query processing. The algorithms are experimentally compared with respect to eficiency and efectiveness, on a real dataset.</p>
      </abstract>
      <kwd-group>
        <kwd>nondiscrimination</kwd>
        <kwd>data transformation</kwd>
        <kwd>coverage</kwd>
        <kwd>rewriting</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Today, we are surrounded by information that is increasingly being used to make decisions that
could have an impact on people’s lives. It is therefore very important to understand the nature
of this social impact and take responsibility for it. To this aim, the development of responsible
and nondiscrimination-aware technological solutions is one of the main challenges in data
management and analytics and the satisfaction of ethical requirements is becoming a crucial
element for asserting the quality of a dataset [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. More precisely, nondiscrimination-aware data
decision systems should take humans in the loop in all the data processing steps, from acquisition
to wrangling and analysis: from one hand, they must ensure transparency and interpretability,
for tracing the whole process; from the other, they should guarantee nondiscrimination with
respect to protected groups of individuals. This is achieved by characterizing groups of interests
in terms of sensitive attributes, like, e.g., gender or economical status, and relying on the
specification of properties, like fairness, i.e., lack of bias [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], diversity, i.e., the degree to which
diferent kinds of objects are represented in a dataset [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], and coverage [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], guaranteeing a
suficient representation of any category of interest in a dataset.
      </p>
      <p>
        A significant ongoing research efort [
        <xref ref-type="bibr" rid="ref18 ref8">8, 18</xref>
        ] aims at a holistic treatment of nondiscrimination
along the stages of the data processing life-cycle, rather than as a constraint on the final result:
the sooner you spot the problem fewer problems you will get in the last analytical steps of the
chain. Approaches have been proposed both with reference to front-end (e.g., OLAP queries [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ],
set selection [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], ranking [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]) and back-end stages (e.g, dataset repair during data acquisition,
with a special focus on coverage in [
        <xref ref-type="bibr" rid="ref12 ref4 ref5">4, 5, 12</xref>
        ] and causal fairness [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]). We refer the reader to
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] for a short survey.
      </p>
      <p>
        In our work, we are interested in nondiscrimination constraints defined in terms of coverage.
Indeed, it is well known that the quality of a classifier depends not only on the used algorithm
but also on the number of instances, i.e., the coverage, of each group in the dataset [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. While
coverage constraints have been mainly used for repairing raw datasets before any transformation
[
        <xref ref-type="bibr" rid="ref12 ref4 ref5">4, 5, 12</xref>
        ], we investigate their efects on the data processing pipeline, with a special reference
to data transformed through query execution.
      </p>
      <p>Example 1. Consider the StudentPerformance dataset1 and suppose we want to predict who,
among the best students, has parents with a high level of education (e.g. from graduation upwards),
through a classification task. The information about the lunch at school allows us to deduce
information about the economical status of the student family, thus lunch can be taken as sensitive
attribute. Suppose we would like to train a model which accuracy does not deeply depend on the
economical status of the student family, assuming that the lower the accuracy diference between the
considered student groups the lower the model discriminates against them. Now suppose to qualify
the best students with two selection conditions over math and reading subjects: math_score ≥ 80
AND reading_score ≥ 80. This query returns just 13 students with free/reduced lunch out of
143 individuals. When training a Random Forest classifier on the query result, we get a model with
an accuracy equal to 0.71 and an accuracy diference equal to 0.28 (accuracy for standard lunch
= 0.78, accuracy for reduced lunch = 0.5). Thus, the insuficient number of students with reduced
lunch in the query result afects the accuracy diference of the model. Now suppose to modify
the initial query with the following one: math_score ≥ 71 AND reading_score ≥ 68. This
query returns 74 free/reduced instances out of 350 individuals. The accuracy of the classification
model trained on this new result slightly changes (0.69) but the accuracy diference is improved
and corresponds to 0.006. The improvement is due to the higher number, i.e., the higher coverage, of
protected instances in the query result. The price to pay is a relaxation of the original query. ♦</p>
      <p>
        In this paper, we present the main results we achieved for ensuring coverage constraint
satisfaction during data transformation represented in terms of selection-based queries. In order
to avoid disparate treatment discrimination [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and similarly to what has been done in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] for
OLAP queries and causal fairness and, more recently, in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] for range queries and diferent types
of associational fairness, the input selection-based query, when executed over a given dataset
 , is modified through rewriting, obtaining a new query that, when executed over  , satisfies
the given constraints and is still close to the initial request. As far as we know and according
1It is available at https://www.kaggle.com/spscientist/students-performance-in-exams.
to [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], no other solutions addressing coverage-based rewriting have been proposed so far. More
precisely, we first introduce coverage-based queries as a means for ensuring coverage constraint
satisfaction in selection-based query results. We then present two approximate algorithms for
coverage-based query processing: the first, covRew, initially introduced in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], relies on data
discretization and sampling; the second, covKnn, presented here for the first time, relies on a
nearest neighbour approach for coverage-based query processing. The approaches are then
experimentally compared with respect to eficiency and efectiveness, on a real dataset.
      </p>
      <p>The remainder of this paper is organized as follows. Section 2 presents preliminary definitions
and the reference problem. covRew is briefly described in Section 3 while Section 4 presents
covKnn, an alternative nearest neighbour approximate approach for coverage-based query
processing. Experimental results are presented in Section 5. Finally, Section 6 discusses related
work and Section 7 concludes and outlines future work directions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Problem Definition</title>
      <p>Dataset. We assume data are represented as a collection of tabular datasets (e.g., relations in
a relational database, data frames in the Pandas analytical environment, called tables in the
following)  ≡ { 1, ...,   }, with disjoint schemas, for the sake of simplicity. Let  1, ...,   be the
attributes of  ,   
denote the domain of   . A set of discrete-valued attributes  = {
1, ...,   }
are of particular concern since they allow the identification of protected groups and are called
sensitive attributes. Examples of sensitive attributes are gender and race.
thus denoted by ⟨
the selection-based query.</p>
      <p>Data transformation operations.</p>
      <p>
        We focus on selection-based data transformations (or
queries) over stored or computed datasets, in data processing pipelines that might alter the
representation (i.e., the coverage) of specific groups of interests, defined in terms of sensitive
attribute values. Examples are data slicing operations in Pandas2 or ColumnTransformers in
Scikit-Learn3. We consider boolean combinations of atomic selection conditions   ≡     ,
  ∈    ,  ∈ {&lt;, ≤, &gt;, ≥} ,   numeric attribute,4  = 1, … ,  ,   ∉  . A selection-based query  is
1, ...,   ⟩ or ⟨ ⟩ ,  ≡ ( 1, ...,   ). Attributes  1, ...,   are called dimensions of
Coverage constraints. Conditions over the number of entries belonging to a given protected
group of interest returned by the execution of a selection-based query can be specified in terms
of coverage constraints [
        <xref ref-type="bibr" rid="ref12 ref4">4, 12</xref>
        ]. A coverage constraint  has the form ↓
ℎ ≥  and specifies
that the minimum number of instances with sensitive attribute   equal to 



query result has to be  . As an example, ↓fgeemnadleer≥ 10 specifies that a query result should include
at least 10 female individuals. The group a coverage constraint refers to is called protected group
,  = 1, ..., ℎ , in a

and is characterized by the selection condition   ≡ 
1 =   1 ∧ ... ∧  

ℎ =   ℎ. Even if coverage
thresholds are usually application specific and should be determined through statistical analysis,
the central limit theorem suggests that the number of representative should be around 30 and,
according to [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], at least 20 to 50 instances for each group are required.
2https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/03_subset_data.html
3https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html
4For the sake of simplicity, selection attributes are assumed numeric. The approach can be easily extended to deal
with any other ordered domain.
      </p>
      <p>(a) Input and induced queries
(b) Skyline points
sensitive attributes  and let ⟨ ⟩ be a selection-based query. A coverage-based query    for
 and  is a selection-based query that, given a dataset  , stretches the result ( )
as little as
possible so that constraints in  are satisfied. More precisely, for each dataset  : (i) a tuple 
exists such that  
 ( ) = ⟨ ⟩( ) ; (ii)  ⊆    ; (iii) all coverage constraints are satisfied by 

 ( ) ;
(iv)    is the closest query to  in terms of minimality, i.e., result cardinality, and proximity,
i.e., syntactic closeness. Minimality means that no other query  ′ satisfying conditions (i)–(iii)
⟨ ⟩ , according to the Euclidean distance (defined in a unit space) between  and  , satisfying
′( )) &lt;  (</p>
      <p>( )) ; proximity means that ⟨ ⟩ is the closest query to
Properties. It has been proved that a coverage-based query  
 satisfies the following properties:
(P1) It can be represented in a canonical form in which each selection condition has the form
  ≤   or   &lt;   ; ⟨ ⟩ can thus be represented as point  in the  -dimensional space
defined by selection attributes (see  in Figure 1(a)).
(P2) Let</p>
      <p>( ) ≡ ⟨ ⟩( ) . It can be proved that  coincides with the upper right vertex of the
minimum bounding box of at most  distinct points in  ,  , and the origin of the space
(see the green triangle in Figure 1(a)). Such vertices are called induced points and the set
of all induced points corresponds to the search space for coverage-based queries.
(P3) There is a relationship between  and the skyline of induced points corresponding to
queries that, when executed over  , satisfy  . The dominance relation, needed for the
skyline computation, is defined over selection attributes, assuming the lower the better
(see Figure 1(b)). It can be proved that  coincides with the skyline point corresponding
to the query with the minimal cardinality at the lowest distance from  .


The problem. The problem we address concerns the design of eficient algorithms for processing
 . The designed algorithms improve the naïf approach derived from properties P1, P2, and P3,
which is inherently ineficient due to the size of the search space and the skyline computation.
We do not rely on any index data structure, so that both stored and computed datasets can be
considered. The proposed approximate algorithms are briefly described in the next sections
while an optimized precise one is currently under evaluation.</p>
      <p>(a) Data distribution
(b) Multi-dimensional grid
(c) Cardinality estimates and
solution</p>
    </sec>
    <sec id="sec-3">
      <title>3. Approximate Coverage-based Query Processing through</title>
    </sec>
    <sec id="sec-4">
      <title>Discretization and Sampling</title>
      <p>
        In order to improve the eficiency of the naïf approach, the approach proposed in [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ]
relies on a discretized search space instead of generating and visiting the whole set of induced
points. Additionally, cardinality estimation, needed for constraint and minimality checking, is
performed through a sample-based approach. The proposed technique, called covRew, can be
applied over any dataset for which a sample is available or can be easily computed on the fly,
and relies on two main steps:
• Pre-processing. The approximate search space is generated by considering the
intersection points of a grid obtained by discretizing each axis (one for each selection attribute  
in ⟨ ⟩ , from the query value   to the maximum value of   in  ), using standard binning
approaches (e.g., equi-width and equi-depth). Each point on the grid corresponds to a
selection-based query of type ⟨ ⟩ , thus satisfying conditions (i) and (ii) of the reference
problem.
• Processing. Given  , ⟨ ⟩ such that
      </p>
      <p>
        ( ) = ⟨ ⟩( ) is then computed by visiting the
discretized search space starting from  , one point after the other, at increasing distance
from  , checking minimality and proximity in the approximated space. The properties
of the discretized search space and the canonical form are considered for pruning the
space (algorithm covRewP in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]), possibly increasing the number of points to be visited
at diferent iterations (algorithms covRewI and covRewIP in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], depending on whether
pruning is considered together with iterations). The trade-of between accuracy and
eficiency of the proposed algorithms has been evaluated on both synthetic and real-world
datasets (see [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] for details).
      </p>
      <p>Example 2. Consider the scenario introduced in Example 1.

can be
rewritten
Figure
in
canonical
form
as</p>
      <p>-math_score ≤ -80 AND -reading_score ≤ -80.
2(a)
shows
the
dataset
projected
over
attributes
-math_score
and
-reading_score.</p>
      <p>Points are colored and shaped according to the sensitive attribute
values: red crosses for free/reduced lunches and black dots for standard lunches. The result of 
in canonical form corresponds to the region at the bottom left side of the query point (-80,-80). The
space to be discretized corresponds to the green rectangle. By applying the equi-depth approach to
each axis with 4 bins, we obtain the grid in Figure 2(b). The discretized search space corresponds to
the grid intersection points: each point represents a selection-based query containing  . During the
processing step, the points of the grid are visited, under diferent strategies, starting from (-80, -80),
at increasing distance from it (blue numbers represent the visiting order). For each visited point
⟨ ⟩ , we estimate the cardinality of  ℎ= /
(⟨ ⟩)( ) , needed for checking constraint
satisfaction, and of ⟨ ⟩( ) , by relying on a sample-based approach (see the pairs of numbers
associated with each point in Figure 2(c)). The result corresponds to point  = (-71,-65) (see Figure
2(c)): it satisfies the coverage constraint (  ℎ= /
(⟨ ⟩)( ) =
79, property (iii)), has the
minimum cardinality (370), at the lowest distance from  (property (iv)). Distances are computed
in the reference unit space (normalized values are shown in red in Figure 2(c)). Query  is thus
rewritten into -math_score ≤ -71 AND -reading_score ≤ -65.
♦</p>
      <p>
        The number of cardinality estimations in covRew is in (  ), where  is the number of bins
used for the discretization of the axis and  is the number of selection conditions. As shown
in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], pruning strategies help in reducing such exponential complexity in real cases. On the
other hand, thanks to the sample-based approach, covRew performance does not depend on the
dataset cardinality.
      </p>
    </sec>
    <sec id="sec-5">
      <title>4. A Nearest Neighbour Approach to Approximate</title>
    </sec>
    <sec id="sec-6">
      <title>Coverage-based Query Processing</title>
      <p>based query  
Similarly to the naïf approach derived from properties P1, P2, and P3 in Section 2, covRew works
in a space of query points to identify an approximation of ⟨ ⟩ , needed to process the
coverage ( ) . An alternative approach could be that of computing an approximation of
⟨ ⟩ directly from the data instances. To this aim, we first look for the missing number of
protected group instances closest to the query point, with respect to an input distance function
 , and then use them to compute the induced query representing an approximation of ⟨ ⟩ .
More precisely, given one coverage constraint  ≡↓ 
ℎ ≥  , a selection-based query ⟨ ⟩ , and
a dataset  , the proposed approach, called covKnn, relies on four main steps:5

1. compute the set of protected group instances that are not included in ( ) , corresponding
to  
≡   ( − ( ))</p>
      <p>;
2. determine the number of missing protected instances ℎ =  −  (
3. identify the ℎ nearest neighbours to  , with respect to  , in   ;
4. return the query induced by the ℎ nearest neighbours as an approximation of ⟨ ⟩ .
 (( ))) ;</p>
      <p>It is easy to show that covKnn guarantees the satisfaction of properties (i)-(iii) of
coveragebased queries; however, minimality and proximity (property (iv)) are not necessarily satisfied by
the induced query even if they are implicitly taken into account during the nearest neighbour
computation.
5covKnn can be easily extended to deal with a set of constraints by considering the query induced by the union of
the ℎ missing instances for each coverage constraint   in the input set.</p>
      <p>
        One additional consideration is needed for the distance function to be used. As observed
in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], those based on the general theory of vector p-norms, like the Euclidean or Manhattan
distances, are distances between points. They are therefore not suitable when considering
distances from boxes corresponding to a query space, as in our case, since it is not obvious
which point inside the box should be treated as the reference point. In such a context, the user
would usually prefer answers that are close to the periphery of the box. The box query distance
proposed in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] achieves this goal. Given an input query ⟨ ⟩ and a database instance point 
that does not belong to ( ) , when considering selection-based queries in canonical form, it is
computed as follows:
√
√√ 
√
√
√√∑(  (  ,   ))2
where   (  ,   ) = {
  −   if   &gt;  
0 otherwise.
      </p>
      <p>(1)
√</p>
      <p>
        The box query distance can be customized to user needs by using weights. In our case, the
idea is to use weights that make the distance of a point  to each query axis closer to the area
corresponding to the contribution of  to the induced query, for that axis. To this aim, we use
the Inverse Aspect-Ratio weights presented in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>Example 3. Consider the scenario introduced in Example 1. Under covKnn, the processing proceeds
in this way: (i) the search space   is first computed (see the white space in Figure 3(a)): it contains
342 points out of the 1000 points of the whole dataset; (ii) assuming that ( ) contains 13 protected
group instances, the number of missing instances is determined as ℎ = 70-13 = 57; (iii) the 57 nearest
neighbour points to the query point (-80, -80) in   are computed, using the box query distance
(green points in Figure 3(a)); (iv) the induced query of the detected points is then computed (see
Figure 3(b)): it corresponds to the query -math_score ≤ -70 AND -reading_score ≤ -69. ♦</p>
      <p>In covKnn, no cardinality estimate is required (minimality is not checked and the coverage
constraint is satisfied by the induced query by construction). The complexity is () , where
 is the cardinality of   since, for each instance in   , the distance from  is computed
according to Eq. (1). As a final remark, we notice that the obtained result is diferent from the
one obtained with covRew (see Example 2): they represent two possible and potentially distinct
approximations of the coverage-based query result.</p>
    </sec>
    <sec id="sec-7">
      <title>5. Experimental Evaluation</title>
      <sec id="sec-7-1">
        <title>5.1. Experimental setup</title>
        <p>All experiments were conducted on a machine with an Intel(R) Core(TM) i99900K 3.60GHz CPU
and 16GB main memory, running Ubuntu 20.04. All programs were coded in Python 3.8 and
the dataset is stored on PostgreSQL 9.6.19.1.</p>
        <p>Dataset. The experiments refer to the Adult dataset,6 exploited for assessing many
nondiscrimination data management techniques. It contains 48,842 individuals, described by six
6https://archive.ics.uci.edu/ml/datasets/Adult
(a) Query nearest neighbour points
(b) Induced query
numerical attributes and eight categorical attributes. For our experiments, we use attributes
sex, race, and marital_status as sensitive attributes.</p>
        <p>Queries and coverage constraints. We identified 9 selection-based queries  , with a variable
number  of selection conditions and selectivity %sel (see Table 1):  corresponds to the number
of selection conditions and  points out the selectivity (ℎ corresponds to %sel ≤ 15%,  to 15% &lt;
%sel &lt; 40%, and  to %sel ≥ 40%). We then considered many problem instances by combining
the selection-based queries with various coverage constraints, requiring a diferent number of
protected group instances. Table 2 presents the considered instances pointing out the cardinality
of the initial query result  (())
query result  ( ↓
 
())</p>
        <p>, the cardinality of each protected group in the initial
, the missing number of protected group instances, according
to the selected coverage constraint, and the cardinality of  
(#space).</p>
        <p>
          Algorithms. The experiments compare covKnn and covRewIP; the precise algorithm derived
from the properties presented in Section 2 is used as a baseline. covRewIP is the grid-based
algorithm guaranteeing the best tradeof between accuracy and eficiency according to [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ],
executed using the equi-depth discretization, with 32 bins, applied over a random sample of size
9,604 of the
        </p>
        <p>
          dataset. In the covKnn implementation, we rely on a naïf top-sort approach,
in which we keep in a bufer the best  points seen so far, for nearest neighbour computation.7
Distance function. We performed many experiments, considering many diferent weights
for the box query distance proposed in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. The reported experimental results refer to the box
query distance with squared Inverse Aspect-Ratio weights (see [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] for details).
Eficiency and efectiveness indicators
        </p>
        <p>
          . Eficiency is analyzed in terms of execution time for
determining the new query to be executed (averaged over 10 executions). For better pointing out
the impact of the dataset size, we include the time to load the dataset for covKnn (0.19 s) and to
generate and load the sample for covRewIP (0.071 s).8 Efectiveness is evaluated by comparing:
(i) the achieved semantic relaxation, called relaxation degree in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], as the rate between the
(i.e, the number of new tuples returned) and of ()
;

cardinality of
        </p>
        <p>() − ()
(ii) the syntactical relaxation, i.e., the proximity of ⟨ ⟩ with respect to the initial query ⟨ ⟩ .
The baseline approach is used only for the analysis of the accuracy since experiments, that are
7Notice that, since the box query distance is not a metric function and since, in general, the dataset might not be
stored, we cannot rely on usual data structures, like k-d trees.
8The time for computing the query result is not included since it is equal for both the approaches.
not reported here for space constraints, show that its average performance is worst than those
of the approximate approaches, especially for large-scale datasets.</p>
      </sec>
      <sec id="sec-7-2">
        <title>5.2. Experimental results</title>
        <p>Impact of the search space size on performance. From Figure 4(a), we observe that, as
expected, the time increases while increasing the size of the search space  , for a fixed number
of selection conditions. The search space size can increase by either considering queries with the
same or similar selectivity and varying the considered protected group (e.g., I3 &lt; I2 &lt; I4; I10 &lt;I9;
I17 &lt; I16 &lt; I15) or keeping the same protected group for queries with diferent selectivity (e.g.,
I5 &lt; I1; I12 &lt; I9; I18 &lt; I14). This behaviour does not hold for covRewIP since its performance
does not depend on the number of protected instances.</p>
        <p>Impact of the search space dimension on performance. As expected, covKnn time increases
also when increasing the number of selection conditions (i.e, the dimension of the search space)
(a) Execution time
(b) Relaxation degree
(c) Proximity
for a given dataset size (e.g., I1 &lt; I9 &lt; I14). The increase however is limited and is due to
the distance computation on points with a higher number of coordinates. On the other hand,
covRewIP performance, depending on the applied pruning, might exponentially increase while
increasing  and, as a consequence, the size of the discretized search space (see, e.g., I15, I16, I17).
covKnn thus performs better with higher dimensions; for a low number of selection conditions
this behaviour is less evident, due to the high cost for reading the whole dataset in covKnn.
Impact of coverage constraints on performance. Figure 4 shows that the coverage constraint
threshold, and thus the number of nearest neighbours to be detected, has a limited impact on
covKnn performance. This can be observed by changing the threshold while keeping the same
query and the same protected group (see, e.g., I5-I8). An opposite behaviour can be observed
for covRewIP: performance decreases while increasing the threshold since a larger subset of
the discretized search space has to be visited. Experiments that are not reported here for space
constraints show that covKnn performance is sensible to the number of coverage constraints
since, for each of them, a specific search has to be computed on a distinct search space (see, e.g.,
I3 and I4). This does not hold for covRewIP.</p>
        <p>Accuracy. Figures 4(b) and (c) compare covKnn, covRewIP, and the precise solution   ,
computed using the baseline approach, with respect to relaxation degree and proximity. Notice
that, since covRewIP relies on sampling for estimations, the rewritten query might not satisfy
the constraint (see I3 and I14). In general, the covKnn and covRewIP accuracy is quite good and
often the generated solutions coincide with the optimal ones (see I1, I2, I4, I19), which obviously
always achieve the lowest relaxation degree (and the lower proximity to equal relaxation, see
Section 2). When the solutions do not coincide, covRewIP often achieves the lowest relaxation
degree, with an average 2.2% additional relaxation with respect to the precise solution (8.1% for
covKnn). This is because, diferently from covKnn, it takes minimality into account. Situations
for which the covRewIP relaxation degree is higher than the covKnn one probably correspond
to estimation errors due to the sample usage and the search space discretization.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>6. Related Work</title>
      <p>
        Coverage constraints have been first introduced in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] in the context of data repair. When
protected categories are defined in terms of many attributes, the identification of attribute patterns
associated with coverage problems might lead to performance issues, due to their combinatorial
explosion. Eficient techniques, inspired from set enumeration and association rule mining, have
therefore been presented. Once the lack of coverage has been identified, the smallest number
of data points needed to hit all the “uncovered spaces” has to be identified and additional data
acquired. Since data acquisition has a cost, techniques have been presented for determining
the patterns that can be covered given a certain maximum cost [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. An eficient approach for
coverage analysis, given a set of attributes across multiple tables, is presented in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. The
approaches in [
        <xref ref-type="bibr" rid="ref12 ref4">4, 12</xref>
        ] deal with categorical attributes with low cardinality; the coverage-based
data repair problem has been extended to ordinal and continuous-valued attributes in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The
detection of groups, defined by the intersection of multiple attributes, with a low coverage is
also at the basis of the MithraCoverage web application [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>
        Coverage-based queries, presented in this paper, rely on rewriting to avoid disparate treatment
discrimination during selection-based query execution. Another non-discrimination aware
technique based on rewriting has been proposed in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] for OLAP queries. Here, the bias
is defined in terms of causal fairness (checking for causal relationships from the sensitive
attributes to the outcome). A fairness-aware rewriting approach for range queries has been
recently proposed [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], where fairness constraints correspond to the desired distribution for the
groups of interest. The approach relies on binary sensitive attributes and a specialized index
data structure. As far as we know and according to [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], no further approaches have been
proposed so far for coverage-based rewriting of data transformations.
      </p>
    </sec>
    <sec id="sec-9">
      <title>7. Concluding remarks</title>
      <p>
        In this paper, we summarize our work on nondiscrimination-aware data transformation. We
ifrst introduce coverage-based queries as a means for guaranteeing the satisfaction of a set of
coverage constraints on a selection-based query result, through rewriting. We then present two
approximate algorithms for coverage-based query processing: the first, covRew, was already
presented in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]; the second, covKnn, relies on a nearest neighbour approach and is proposed
in this paper with the aim of overcoming some limitations of covRew. The techniques are
experimentally compared with respect to eficiency and efectiveness. The obtained results,
some of which are not reported in this paper for space constraints, show that, for small or medium
size datasets, covKnn can generate approximate solutions faster than covRew, at the price of
a higher relaxation. On the other hand, covRew could be the right choice for huge datasets,
thanks to the sample-based approach, and when we want to minimize the achieved relaxation.
We are currently working on the design and evaluation of algorithms for the computation
of the optimal rewriting, in terms of minimality and proximity, and on the integration of
coverage-based queries in a relational DBMS.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Accinelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Catania</surname>
          </string-name>
          , G. Guerrini, and
          <string-name>
            <given-names>S.</given-names>
            <surname>Minisi</surname>
          </string-name>
          .
          <article-title>The impact of rewriting on coverage constraint satisfaction</article-title>
          .
          <source>In Proc. EDBT/ICDT Workshops</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Accinelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Catania</surname>
          </string-name>
          , G. Guerrini, and
          <string-name>
            <given-names>S.</given-names>
            <surname>Minisi</surname>
          </string-name>
          .
          <article-title>A coverage-based approach to nondiscrimination-aware data transformation</article-title>
          .
          <source>ACM J. Data Inf. Qual.</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Accinelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Minisi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Catania</surname>
          </string-name>
          .
          <article-title>Coverage-based rewriting for data preparation</article-title>
          .
          <source>In Proc. EDBT/ICDT Workshops</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Asudeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H. V.</given-names>
            <surname>Jagadish</surname>
          </string-name>
          .
          <article-title>Assessing and remedying coverage for a given dataset</article-title>
          .
          <source>In Proc. ICDE</source>
          , pages
          <fpage>554</fpage>
          -
          <lpage>565</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Asudeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shahbazi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H. V.</given-names>
            <surname>Jagadish</surname>
          </string-name>
          .
          <article-title>Identifying insuficient data coverage for ordinal continuous-valued attributes</article-title>
          .
          <source>In Proc. SIGMOD</source>
          , pages
          <fpage>129</fpage>
          -
          <lpage>141</lpage>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Barocas</surname>
          </string-name>
          and
          <string-name>
            <given-names>A. D.</given-names>
            <surname>Selbst</surname>
          </string-name>
          .
          <article-title>Big data's disparate impact</article-title>
          . Calif. L. Rev.,
          <volume>104</volume>
          :
          <fpage>671</fpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>B.</given-names>
            <surname>Catania</surname>
          </string-name>
          , G. Guerrini, and
          <string-name>
            <given-names>C.</given-names>
            <surname>Accinelli</surname>
          </string-name>
          .
          <article-title>Fairness &amp; friends in the data science era</article-title>
          .
          <source>AI &amp; Society</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Drosou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. V.</given-names>
            <surname>Jagadish</surname>
          </string-name>
          , E. Pitoura, and
          <string-name>
            <given-names>J.</given-names>
            <surname>Stoyanovich</surname>
          </string-name>
          .
          <article-title>Diversity in big data: A review</article-title>
          .
          <source>Big Data</source>
          ,
          <volume>5</volume>
          (
          <issue>2</issue>
          ):
          <fpage>73</fpage>
          -
          <lpage>84</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Firmani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Tanca</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Torlone</surname>
          </string-name>
          .
          <article-title>Ethical dimensions for data quality</article-title>
          .
          <source>ACM J. Data Inf. Qual.</source>
          ,
          <volume>12</volume>
          (
          <issue>1</issue>
          ):2:
          <fpage>1</fpage>
          -
          <issue>2</issue>
          :
          <fpage>5</fpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Asudeh</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H. V.</given-names>
            <surname>Jagadish</surname>
          </string-name>
          .
          <article-title>MithraCoverage: A system for investigating population bias for intersectional fairness</article-title>
          .
          <source>In Proc. SIGMOD</source>
          , pages
          <fpage>2721</fpage>
          -
          <lpage>2724</lpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kadlag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. V.</given-names>
            <surname>Wanjari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Freire</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Haritsa</surname>
          </string-name>
          .
          <article-title>Supporting exploratory queries in databases</article-title>
          .
          <source>In Proc. DASFAA</source>
          , pages
          <fpage>594</fpage>
          -
          <lpage>605</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Guan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Asudeh</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H. V.</given-names>
            <surname>Jagadish</surname>
          </string-name>
          .
          <article-title>Identifying insuficient data coverage in databases with multiple relations</article-title>
          .
          <source>Proc. VLDB Endow</source>
          .,
          <volume>13</volume>
          (
          <issue>11</issue>
          ):
          <fpage>2229</fpage>
          -
          <lpage>2242</lpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>B.</given-names>
            <surname>Salimi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gehrke</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Suciu</surname>
          </string-name>
          .
          <article-title>Bias in OLAP queries: Detection, explanation, and removal</article-title>
          .
          <source>In Proc. SIGMOD</source>
          , pages
          <fpage>1021</fpage>
          -
          <lpage>1035</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>B.</given-names>
            <surname>Salimi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Howe</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Suciu</surname>
          </string-name>
          .
          <article-title>Data management for causal algorithmic fairness</article-title>
          .
          <source>IEEE Data Eng. Bull.</source>
          ,
          <volume>42</volume>
          (
          <issue>3</issue>
          ):
          <fpage>24</fpage>
          -
          <lpage>35</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>B.</given-names>
            <surname>Salimi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Rodriguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Howe</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Suciu</surname>
          </string-name>
          .
          <article-title>Interventional fairness: Causal database repair for algorithmic fairness</article-title>
          .
          <source>In Proc. SIGMOD</source>
          , pages
          <fpage>793</fpage>
          -
          <lpage>810</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>N.</given-names>
            <surname>Shahbazi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Asudeh</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H. V.</given-names>
            <surname>Jagadish</surname>
          </string-name>
          .
          <article-title>A survey on techniques for identifying and resolving representation bias in data</article-title>
          .
          <source>CoRR, abs/2203.11852</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S.</given-names>
            <surname>Shetiya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. P.</given-names>
            <surname>Swift</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Asudeh</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Das</surname>
          </string-name>
          .
          <article-title>Fairness-aware range queries for selecting unbiased data</article-title>
          .
          <source>In Proc. ICDE</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>J.</given-names>
            <surname>Stoyanovich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Howe</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H. V.</given-names>
            <surname>Jagadish</surname>
          </string-name>
          .
          <article-title>Responsible data management</article-title>
          .
          <source>Proc. VLDB Endow</source>
          .,
          <volume>13</volume>
          (
          <issue>12</issue>
          ):
          <fpage>3474</fpage>
          -
          <lpage>3488</lpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>J.</given-names>
            <surname>Stoyanovich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Yang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H. V.</given-names>
            <surname>Jagadish</surname>
          </string-name>
          .
          <article-title>Online set selection with fairness and diversity constraints</article-title>
          .
          <source>In Proc. EDBT</source>
          , pages
          <fpage>241</fpage>
          -
          <lpage>252</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>S.</given-names>
            <surname>Sudman</surname>
          </string-name>
          . Applied Sampling. Academic Press, inc.,
          <year>1976</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zehlike</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Yang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Stoyanovich</surname>
          </string-name>
          .
          <article-title>Fairness in ranking: A survey</article-title>
          .
          <source>CoRR, abs/2103.14000</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>