Coverage-based Rewriting for Data Preparation

Chiara Accinelli, Simone Minisi, Barbara Catania
University of Genoa, Italy
chiara.accinelli@dibris.unige.it, simone.minisi@dibris.unige.it, barbara.catania@dibris.unige.it

ABSTRACT
The development of technological solutions satisfying non-discriminating requirements is currently one of the main challenges for data processing. Concepts like fairness, i.e., lack of bias, and diversity, i.e., the degree to which different kinds of objects are represented in a dataset, have recently been taken into account in designing non-discriminating set selection, ranking, and OLAP approaches. Information extraction is, however, also at the basis of back-end data processing, for preparing, i.e., extracting and transforming, data, usually through SQL queries, before loading them inside a data warehouse for further front-end processing. An unfair data preparation process might have a relevant impact on front-end analysis: as an example, an under-represented category in the warehouse might lead to an under-representation of that category in most of the following processes. In this paper, we start from this consideration and propose an approach for automatically rewriting back-end queries, whose results do not guarantee some coverage constraints, into the "closest" queries satisfying those constraints. Through rewriting, coverage-based modifications of data preparation steps are traced for further processing. We also present some preliminary experimental results and identify some directions for future work.

1 INTRODUCTION
Background. Nowadays, large-scale technologies for the management and analysis of big data have a relevant and positive impact: they can improve people's lives and enhance scientific exploration. At the same time, it becomes increasingly important to understand the nature of these impacts at the social level and to take responsibility for them, especially when they deal with human-related data [7].

The development of technological solutions satisfying non-discriminating requirements is currently one of the main challenges in data processing and, in particular, in data management. Due to the above-mentioned social relevance, themes such as diversity, non-discrimination, fairness, protection of minorities, and transparency are becoming increasingly crucial when dealing with any data processing step. More concretely, consider a population upon which a data processing (either operational or analytical) task has to be applied, and suppose that a subset of this population shares some characteristics that should not be employed for discrimination (e.g., race, gender, disability status). It is important to guarantee that the result of the processing task does not discriminate with respect to the considered sensitive attributes. This may include ensuring a fair probability of selection, not giving undue relevance to individuals sharing these properties, or other related constraints. This type of requirement is made desirable by the ethical need to take responsibility; it is also made mandatory by the recent General Data Protection Regulation of the European Union [5], which imposes that this type of guarantee is provided by design, i.e., intrinsically embedded in the mechanisms of the data processing workflow.

Β© 2020 Copyright held by the owner/author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2020 Joint Conference, March 30-April 2, 2020 on CEUR-WS.org. Distribution of this paper is permitted under the terms of the Creative Commons license CC BY 4.0.

Among the relevant themes from a social point of view, various non-discriminating constraints, like fairness, i.e., lack of bias, and diversity, i.e., the degree to which different kinds of objects are represented in a dataset, have recently received attention from the data management community. As pointed out in [6, 8], an important direction is developing a holistic treatment of non-discriminating constraints through the different stages of the data management and analysis life-cycle, from data cleaning, integration, and preprocessing, through selection and ranking, to result interpretation. In this context, a key issue concerns enforcing non-discriminating constraints incrementally, through individual independent choices, rather than as a constraint on the set of final results. This kind of guarantee is known as coverage.

Problem. Recently, various approaches to cope with non-discriminating constraints during front-end information extraction, aiming at presenting the users with data organized so as to satisfy their needs, have been proposed in the context of set selection [20], ranking [2, 3, 22], and OLAP [14, 15, 17].

Information extraction is, however, also at the basis of back-end processing in analytical systems, in which source data have to be prepared, namely extracted, transformed, and loaded into an analytical repository - a data warehouse - for further front-end processing. This kind of process is usually simpler, from the point of view of data manipulation, than the relevant front-end search functionalities, and can often be represented in terms of Select-Project-Join (SPJ) queries. However, an unfair data preparation process might have a huge impact on many front-end processes: as an example, an under-represented category in the warehouse might lead to an under-representation of that category in any following process. As pointed out in [16], this motivates the study of fairness-aware data transformations, where the idea is to minimally rewrite the transformation query so that certain fairness constraints are guaranteed to be satisfied in the result of the transformation.

In this paper, we start from this consideration and focus on fairness constraints guaranteeing that there are enough entries in the dataset for each object category before starting front-end processes. Such a property has been called coverage in [4] and has relationships with both fairness and diversity issues [22]. More precisely, consider a (source) dataset 𝐼 in which some items are characterized by a sensitive attribute (e.g., gender, race, age) and an SPJ query 𝑄 over 𝐼 including the sensitive attribute in the projection list. Now suppose that some coverage constraints are given with respect to the result of 𝑄 when executed over 𝐼 (e.g., we would like the number of females to be greater than a given threshold) and that 𝑄, when executed over 𝐼, does not satisfy them. We are interested in rewriting 𝑄 into a new query 𝑄′ such that 𝑄′ relaxes predicates in 𝑄 (thus, 𝑄 βŠ† 𝑄′) and 𝑄′ is the smallest query containing 𝑄 that satisfies the coverage constraints when executed over 𝐼.

Notice that we are interested in rewriting 𝑄 into 𝑄′ in order to achieve transparency: through rewriting, the coverage-aware relaxation process is traced for further analysis or processing. In this respect, the approach we propose can be seen as a revision of existing query refinement techniques addressing the empty or few answer problem (see, e.g., [12]) to take care of coverage and fairness issues.

Contributions. The main contributions of the paper can be summarized as follows:
β€’ We formally define a query rewriting problem based on coverage constraints.
β€’ We present a baseline algorithm for query rewriting, relaxing the query with the aim of satisfying the constraints at hand, and we propose two optimizations based on pruning and incremental refinement.
β€’ We present some preliminary experimental results showing that the proposed approach is effective and efficient when applied over some real datasets.

Organization. The remainder of this paper is organized as follows. Section 2 discusses related work about non-discrimination in information extraction and query relaxation. The problem we consider is then presented in Section 3. The proposed grid-based algorithm is introduced in Section 4, while experimental results are presented in Section 5. Finally, Section 6 outlines future work and presents some concluding remarks.

2 RELATED WORK
There are several areas of research related to our work, which we discuss below.

Non-discrimination in information extraction. Information extraction aims at presenting the users with data organized so as to satisfy their needs. Non-discriminating information extraction takes into account non-discriminating properties, like fairness and diversity, in generating the results to be returned to the user. Fairness and diversity constraints have been taken into account for designing set selection [20] and ranking [2, 3, 22] approaches. Fairness has also been considered in fairness-aware rewriting of OLAP, i.e., aggregation-based, queries [14, 15] and in database repair approaches [17, 18] that provide provable fairness guarantees about classifiers trained on database training labels. The notion of causal fairness and the design of data management techniques for it have been investigated in [16]. Coverage over multiple categorical attributes has been introduced in [4], together with efficient techniques for determining the least amount of additional data that must be obtained to guarantee coverage satisfaction. A web application made up of a collection of visual widgets that implement the latest research results on fairness, stability, and transparency for rankings, and that communicate details of the applied methodology to the end-user, is presented in [19, 21]. In [1], an interaction model for explain-and-repair data transformation systems is proposed, in which users interactively define constraints on the transformation code and the resultant data; the system satisfies these constraints as much as possible and provides an explanation for any encountered problem. In this paper, we consider coverage constraints, as proposed in [4], over a single sensitive attribute. The problem we address is closely related to [1]. However, differently from [1], we consider transformations corresponding to SPJ queries, with the aim of satisfying coverage constraints through an approach that can be easily integrated inside an SQL engine.

Query relaxation. Query relaxation is a well-studied problem in the database literature, in which the goal is to help users revise query specification or execution so that the generated results better match the users' desired results. Previous work has exploited query logs [9] and the data being queried [23]. In order to address the empty or few answer problem, a skyline-based refined query execution has been proposed in [10], while an interactive framework for query relaxation through rewriting has been presented in [12]. The approach we propose can be seen as a reinterpretation of this last query relaxation framework to take care of coverage and fairness issues without any required user interaction.

3 PROBLEM DEFINITION
Let 𝐷 = {𝑅1, ..., 𝑅𝑛} be a relational database schema and let 𝑆 ∈ π‘…π‘˜, π‘˜ ∈ {1, ..., 𝑛}, be an attribute, called the sensitive attribute, with domain 𝐷𝑆 = {𝑠1, ..., π‘ β„Ž}. Let 𝑄 be an arbitrary SQL query and let 𝑠𝑒𝑙1, ..., 𝑠𝑒𝑙𝑑 be the selection conditions in 𝑄, with 𝑠𝑒𝑙𝑖 ≑ 𝑆𝑖 πœƒ 𝑣𝑖, where attribute 𝑆𝑖 has domain 𝐷𝑆𝑖, belongs to the schema of a relation 𝑅𝑗 in 𝐷, and 𝑣𝑖 ∈ 𝐷𝑆𝑖. When needed, we denote 𝑄 by 𝑄 βŸ¨π‘£1, ..., π‘£π‘‘βŸ© or 𝑄 βŸ¨π‘£βŸ©, 𝑣 ≑ (𝑣1, ..., 𝑣𝑑). For the sake of simplicity, we assume that each 𝑆𝑖, 𝑖 = 1, ..., 𝑑, is a numeric attribute; the proposed approach can however be easily extended to deal with any ordered domain.

We restrict our attention to SQL queries 𝑄 in which: (i) the sensitive attribute 𝑆 belongs to the projection list; (ii) 𝑄 βŸ¨π‘£βŸ© βŠ† 𝑄 βŸ¨π‘’βŸ© when 𝑒𝑖 β‰₯ 𝑣𝑖, 𝑖 = 1, ..., 𝑑 (denoted by 𝑒 β‰₯ 𝑣). Such queries are called sensitive selection monotone queries.

We then denote with 𝑄 ↓𝑠𝑗 the query πœŽπ‘†=𝑠𝑗 (𝑄), 𝑗 ∈ {1, ..., β„Ž}, and with |𝑄 (𝐼)| the cardinality of the result of 𝑄 when executed over a database instance 𝐼 of 𝐷. For each value 𝑠𝑖 of the sensitive attribute 𝑆, a constraint 𝐢𝐢𝑖 ≑ |𝑄 ↓𝑠𝑖| β‰₯ π‘˜π‘– is called a coverage constraint with respect to 𝑠𝑖, or simply a coverage constraint.

The problem we want to address is the following: given a database instance 𝐼, a sensitive selection monotone query 𝑄 βŸ¨π‘£βŸ©, and a set of coverage constraints 𝐢𝐢 = {𝐢𝐢1, ..., 𝐢𝐢𝑑}, 𝑑 ≀ β„Ž, not satisfied by 𝑄 (𝐼), rewrite 𝑄 βŸ¨π‘£βŸ© into a new relaxed query 𝑄′ ≑ 𝑄 βŸ¨π‘’βŸ© so that: (i) 𝑄 βŠ† 𝑄′; (ii) for each 𝑗 ∈ {1, ..., 𝑑}, 𝐢𝐢𝑗 is satisfied by 𝑄′(𝐼); (iii) there is no other query 𝑄′′ satisfying conditions (i) and (ii) such that 𝑄′′ βŠ‚ 𝑄′ (thus, 𝑄′ is the smallest query satisfying (i) and (ii)); (iv) 𝑄′ ≑ 𝑄 βŸ¨π‘’βŸ© is the closest query to 𝑄 βŸ¨π‘£βŸ© according to the Euclidean distance between 𝑣 and 𝑒. Query 𝑄′ is called a coverage-based rewriting of 𝑄 with respect to 𝐢𝐢 and 𝐼.

Coverage constraints can be provided together with query 𝑄, or they can already be available in the system. The latter is useful when they represent generally valid non-discrimination rules that must be satisfied by any query execution. Since the user might not be aware of the available constraints, we relax only selection predicates appearing in 𝑄, with the aim of generating relaxed queries that are syntactically close to 𝑄 (thus, potentially increasing user satisfaction). Additionally, differently from [12], we assume the user is not involved in the relaxation process, to make the rewriting process "lighter" from the user's point of view.

Consider a numeric selection predicate 𝑠𝑒𝑙𝑖 ≑ 𝑆𝑖 < 𝑣𝑖. A relaxation of 𝑠𝑒𝑙𝑖 is any predicate 𝑠𝑒𝑙𝑖′ ≑ 𝑆𝑖 < 𝑣𝑖′ such that 𝑣𝑖′ β‰₯ 𝑣𝑖. We can convert any predicate on a numeric domain to a predicate of the form 𝑆𝑖 < 𝑣𝑖. For instance, a predicate 𝑆𝑖 > 𝑣𝑖 can be transformed into βˆ’π‘†π‘– < βˆ’π‘£π‘–; a range predicate of the form 𝑣𝑖𝑙 < 𝑆𝑖 < 𝑣𝑖𝑒 can be transformed into the two separate predicates βˆ’π‘†π‘– < βˆ’π‘£π‘–π‘™ and 𝑆𝑖 < 𝑣𝑖𝑒; the equality predicate 𝑆𝑖 = 𝑣𝑖 is rewritten as the conjunction of 𝑆𝑖 β‰₯ 𝑣𝑖 and 𝑆𝑖 ≀ 𝑣𝑖. In the rest of the paper, for ease of exposition, we assume that numeric predicates have been appropriately transformed into predicates of the form 𝑆𝑖 < 𝑣𝑖.
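To make the definitions above concrete, the following toy example (our own illustration, not code from the paper) enumerates candidate relaxations 𝑒 β‰₯ 𝑣 of a two-predicate query over a small instance and picks the minimum-cardinality, closest one satisfying a single coverage constraint; all attribute names and data are hypothetical:

```python
from itertools import product
import math

# Toy instance I: two (already normalized) selection attributes and a
# sensitive attribute. Attribute and value names are illustrative only.
I = [
    {"a1": 1, "a2": 1, "sex": "F"}, {"a1": 2, "a2": 1, "sex": "M"},
    {"a1": 3, "a2": 2, "sex": "F"}, {"a1": 4, "a2": 2, "sex": "M"},
    {"a1": 5, "a2": 3, "sex": "F"}, {"a1": 6, "a2": 4, "sex": "M"},
]

def run(v):
    """Evaluate Q<v>: the conjunction of predicates a_i < v_i."""
    return [t for t in I if t["a1"] < v[0] and t["a2"] < v[1]]

def covered(result, s, k):
    """Coverage constraint |Q ↓s| >= k over a query result."""
    return sum(1 for t in result if t["sex"] == s) >= k

def rewrite(v, s, k, candidates):
    """Among relaxations u >= v satisfying the constraint, return the one
    with minimum cardinality, breaking ties by Euclidean distance from v."""
    best_key, best_u = None, None
    for u in candidates:
        if all(ui >= vi for ui, vi in zip(u, v)) and covered(run(u), s, k):
            key = (len(run(u)), math.dist(u, v))
            if best_key is None or key < best_key:
                best_key, best_u = key, u
    return best_u

# Q<(3, 3)> returns two tuples, only one female: the constraint
# "at least 2 females" fails, so we look for the closest relaxation.
u = rewrite((3, 3), "F", 2, product(range(1, 8), range(1, 6)))  # -> (4, 3)
```

Brute-force enumeration is only feasible on toy data; the grid-based algorithms of Section 4 exist precisely to avoid it.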
4 A GRID-BASED APPROACH FOR COVERAGE-BASED REWRITING
Let 𝑄 βŸ¨π‘£βŸ© be a sensitive selection monotone query and 𝐢𝐢 ≑ {𝐢𝐢1, ..., 𝐢𝐢𝑑} be a set of coverage constraints. Relaxed queries generated through coverage-based rewriting starting from 𝑄 βŸ¨π‘£βŸ© and 𝐢𝐢 have the form 𝑄 βŸ¨π‘’βŸ©, with 𝑒 β‰₯ 𝑣, and can be represented as points 𝑒 in a 𝑑-dimensional space.

Let 𝐼 be an instance of 𝐷 and suppose that each selection attribute 𝑆𝑗 assumes in 𝐼 values in 𝐷𝑆𝑗 ≑ [π‘Žπ‘—π‘šπ‘–π‘›, π‘Žπ‘—π‘šπ‘Žπ‘₯], 𝑗 = 1, ..., 𝑑. In order to navigate the 𝑑-dimensional space in a discrete way, we discretize each 𝐷𝑆𝑗 into a fixed number 𝑛 of bins, where 𝑛 = 2^π‘š for some integer π‘š. The space can thus be represented as a 𝑑-dimensional matrix with 𝑛 cells per dimension, called the cumulative multi-dimensional grid for 𝑄 and 𝐢𝐢 in 𝐼 (𝐢𝑀𝐺𝑄,𝐢𝐢,𝐼, or 𝐢𝑀𝐺 for short when no ambiguity arises). Each cell of the 𝐢𝑀𝐺 with index (𝑖1, ..., 𝑖𝑑), 𝑖𝑗 ∈ {0, ..., 𝑛 βˆ’ 1}, corresponds to the query 𝑄 βŸ¨π‘£1 + (π‘Ž1π‘šπ‘Žπ‘₯ βˆ’ 𝑣1)/𝑛 βˆ— 𝑖1, ..., 𝑣𝑑 + (π‘Žπ‘‘π‘šπ‘Žπ‘₯ βˆ’ 𝑣𝑑)/𝑛 βˆ— π‘–π‘‘βŸ©. For the sake of simplicity, we denote such a query by 𝑄 (𝑖1,...,𝑖𝑑). Notice that 𝑄 (0,...,0) ≑ 𝑄 βŸ¨π‘£βŸ©.

In order to detect a coverage-based rewriting for 𝑄 with respect to 𝐢𝐢 and 𝐼, we store in each cell (𝑖1, ..., 𝑖𝑑) of the 𝐢𝑀𝐺 the estimated cardinality |𝑄 (𝑖1,...,𝑖𝑑) (𝐼)| (in the following denoted by 𝐢𝑀𝐺 ((𝑖1, ..., 𝑖𝑑)).1) and a list of estimated cardinalities, one for each value of the sensitive attribute for which a coverage constraint has been specified, namely (|𝑄 (𝑖1,...,𝑖𝑑) ↓𝑠1 (𝐼)|, ..., |𝑄 (𝑖1,...,𝑖𝑑) ↓𝑠𝑑 (𝐼)|) (denoted by 𝐢𝑀𝐺 ((𝑖1, ..., 𝑖𝑑)).2).

Given an index π‘Ž, we denote with π‘’π‘Ÿ (π‘Ž) (upper right elements of π‘Ž) the set {𝑏 | 𝑏 β‰₯ π‘Ž} and with 𝑙𝑙 (π‘Ž) (lower left elements of π‘Ž) the set {𝑏 | 𝑏 ≀ π‘Ž}. Since we deal with sensitive selection monotone queries, it is easy to show that the following properties hold:
(a) 𝐢𝑀𝐺 (π‘Ž).1 β‰₯ 𝐢𝑀𝐺 (𝑏).1 and 𝐢𝑀𝐺 (π‘Ž).2 β‰₯ 𝐢𝑀𝐺 (𝑏).2 when π‘Ž β‰₯ 𝑏;
(b) 𝑄 π‘Ž βŠ† 𝑄 𝑏 for all 𝑏 ∈ π‘’π‘Ÿ (π‘Ž);
(c) 𝑄 π‘Ž βŠ‡ 𝑄 𝑏 for all 𝑏 ∈ 𝑙𝑙 (π‘Ž).

As an example, Figure 1(b) shows the matrix 𝐢𝑀𝐺 for the dataset in Figure 1(a), when only one coverage constraint is provided. For the sake of simplicity, each cell π‘Ž reports only 𝐢𝑀𝐺 (π‘Ž).2. Grey cells correspond to π‘’π‘Ÿ ((1, 1)), while light grey cells correspond to 𝑙𝑙 ((1, 1)); cell (1, 1) is included in both.

Figure 1: Data representation - (a) Data distribution, (b) 𝐢𝑀𝐺𝑄,𝐢𝐢,𝐼, (c) SFC

In order to identify a coverage-based rewriting for 𝑄, we propose three algorithms. The first is a baseline algorithm that identifies the solution by navigating the 𝐢𝑀𝐺, one cell after the other, at increasing distance from 𝑄. The second algorithm is obtained from the first one by pruning the space, thus reducing the number of visited cells with a limited overhead. Finally, the third algorithm visits the space, after pruning, by iteratively refining the size and the number of matrix cells to be visited, thus converging to the solution faster.

4.1 The Baseline Approach
Algorithm πΆπ‘…π΅π‘Žπ‘ π‘’ (Coverage-based Rewriting Baseline algorithm) visits the 𝐢𝑀𝐺 according to a Space Filling Curve (SFC) in which cells are visited at increasing distance from cell (0, ..., 0) (see Figure 1(c)), by relying on a function 𝑁𝑒π‘₯𝑑 (). During the visit, we look for the cell corresponding to the query with the minimum cardinality that satisfies the coverage constraints 𝐢𝐢. In particular, let π‘†π‘π‘Žπ‘π‘’ be the set containing all the 𝐢𝑀𝐺 cell indexes. We initialize π‘šπ‘–π‘› with the index of the most distant query from 𝑄, corresponding to cell (𝑛 βˆ’ 1, ..., 𝑛 βˆ’ 1). Notice that in 𝑄 π‘šπ‘–π‘› each selection condition contains the maximum value for the selection attribute, thus all conditions are always satisfied in 𝐼 and 𝑄 π‘šπ‘–π‘› (𝐼) returns all the input tuples. As a consequence, if 𝑄 π‘šπ‘–π‘› (𝐼) does not satisfy 𝐢𝐢 (line 7), no coverage-based rewriting can be generated and a void result is returned. Otherwise, we navigate the space one cell after the other according to the SFC (line 10). For each visited cell with index π‘Ž, the coverage constraints 𝐢𝐢 are checked over 𝑄 π‘Ž (𝐼) through the predicate πΆβ„Žπ‘’π‘π‘˜ (𝑄 π‘Ž (𝐼), 𝐢𝐢). If they are satisfied (line 11), or if they are not satisfied but the cardinality of 𝑄 π‘Ž (𝐼) is higher than the current minimum cardinality (line 16), we remove π‘’π‘Ÿ (π‘Ž) from π‘†π‘π‘Žπ‘π‘’ (lines 12 and 17), since all cells in π‘’π‘Ÿ (π‘Ž) correspond to queries with higher cardinalities with respect to the currently identified minimum. When the constraints are satisfied, the index of the current minimum query is updated accordingly (line 13).

Algorithm 1 CRBase
1: function CRBase(𝑄, 𝐢𝐢, 𝐼)
2:   Input 𝑄: a sensitive selection monotone query; 𝐢𝐢: a set of coverage constraints; 𝐼: a database instance
3:   Output π‘šπ‘–π‘›: index of a coverage-based rewriting of 𝑄 with respect to 𝐢𝐢 and 𝐼
4:   π‘†π‘π‘Žπ‘π‘’ = {all the cell indexes in the CMG}
5:   π‘šπ‘–π‘› = (𝑛 βˆ’ 1, ..., 𝑛 βˆ’ 1)
6:   π‘Ž = (0, ..., 0)
7:   if not Check(𝑄 π‘šπ‘–π‘› (𝐼), 𝐢𝐢) then return βŠ₯
8:   else
9:     while π‘†π‘π‘Žπ‘π‘’ β‰  βˆ… do
10:      π‘Ž = Next(π‘Ž, π‘†π‘π‘Žπ‘π‘’)
11:      if Check(𝑄 π‘Ž (𝐼), 𝐢𝐢) then
12:        π‘†π‘π‘Žπ‘π‘’ = π‘†π‘π‘Žπ‘π‘’ \ ur(π‘Ž)
13:        if |𝑄 π‘Ž (𝐼)| < |𝑄 π‘šπ‘–π‘› (𝐼)| then π‘šπ‘–π‘› = π‘Ž
14:        end if
15:      else
16:        if |𝑄 π‘šπ‘–π‘› (𝐼)| < |𝑄 π‘Ž (𝐼)| then
17:          π‘†π‘π‘Žπ‘π‘’ = π‘†π‘π‘Žπ‘π‘’ \ ur(π‘Ž)
18:        end if
19:      end if
20:    end while
21:    return π‘šπ‘–π‘›
22:  end if
23: end function

4.2 Adding Pruning
Algorithm πΆπ‘…π΅π‘Žπ‘ π‘’ can be optimized, leading to algorithm πΆπ‘…π΅π‘Žπ‘ π‘’π‘ƒ (πΆπ‘…π΅π‘Žπ‘ π‘’ with Pruning), by adding some pruning rules. Such rules generalize the pruning already applied by algorithm πΆπ‘…π΅π‘Žπ‘ π‘’ at lines 12 and 17, with the aim of further reducing the space to be visited and coping with the well-known curse of dimensionality problem. Each pruning rule generates two sets of cells:
a) Locking set (𝐿𝑆). For each π‘Ž ∈ 𝐿𝑆, query 𝑄 π‘Ž (𝐼) satisfies the coverage constraints. Based on properties (a) and (b), all the queries in π‘’π‘Ÿ (π‘Ž) can therefore be removed from π‘†π‘π‘Žπ‘π‘’: they satisfy the coverage constraints, but they are not the closest to the initial query and their cardinality might not be the minimum one. We denote with 𝐿𝑆* the set βˆͺπ‘ŽβˆˆπΏπ‘† π‘’π‘Ÿ (π‘Ž).
b) No Solution set (𝑁𝑆). For each π‘Ž ∈ 𝑁𝑆, query 𝑄 π‘Ž (𝐼) does not satisfy the coverage constraints. Based on properties (a) and (c), all the queries in 𝑙𝑙 (π‘Ž) can be removed from π‘†π‘π‘Žπ‘π‘’, since they cannot satisfy the coverage constraints. We denote with 𝑁𝑆* the set βˆͺπ‘Žβˆˆπ‘π‘† 𝑙𝑙 (π‘Ž).

We consider two pruning rules:
β€’ Diagonal pruning. Using binary search, we look for the cell (𝑙, ..., 𝑙), 𝑙 ∈ {0, ..., 𝑛 βˆ’ 1}, closest to (0, ..., 0) such that 𝑄 (𝑙,...,𝑙) (𝐼) satisfies the coverage constraints. We add (𝑙, ..., 𝑙) to 𝐿𝑆 and (𝑙 βˆ’ 1, ..., 𝑙 βˆ’ 1) to 𝑁𝑆. The number of cardinality estimations performed by diagonal pruning is 𝑂 (log 𝑛), where 𝑛 is the number of bins.
β€’ Dimensional pruning. For the sake of simplicity, we explain dimensional pruning in a 2-dimensional space; it can however be easily extended to any 𝑑-dimensional matrix. Suppose that the horizontal axis corresponds to the selection attribute 𝑆1 and the vertical one to the selection attribute 𝑆2. For each position 𝑖, 𝑖 = 0, ..., 𝑛 βˆ’ 1, on the horizontal axis, we look for the first cell (𝑖, 𝑗), 𝑗 ∈ {0, ..., 𝑛 βˆ’ 1}, that satisfies the coverage constraints. It is easy to show that (𝑖, 𝑗) ∈ 𝐿𝑆 and (𝑖, 𝑧) ∈ 𝑁𝑆 for all 𝑧 ≀ 𝑗 βˆ’ 1. A similar computation is then applied to the vertical axis. Assuming binary search is used to locate the first cell satisfying the constraints for each direction, the number of cardinality estimations performed by dimensional pruning is 𝑂 (𝑑𝑛 log 𝑛), where 𝑛 is the number of bins and 𝑑 the number of axes, i.e., the number of selection conditions in the input query 𝑄.

Figure 2 shows an example of the considered pruning rules, with a single coverage constraint 𝑄 ↓𝑠𝑖 β‰₯ 7.

Figure 2: Pruning in 2D with πΏπ‘œπ‘π‘˜π‘–π‘›π‘” and π‘π‘œπ‘†π‘œπ‘™π‘’π‘‘π‘–π‘œπ‘› sets - (a) Diagonal pruning, (b) Dimensional pruning

The pruning rules can be easily integrated in the πΆπ‘…π΅π‘Žπ‘ π‘’ algorithm as follows:
β€’ sets 𝐿𝑆 and 𝑁𝑆 are computed before line 7, according to diagonal and dimensional pruning;
β€’ at line 4, π‘†π‘π‘Žπ‘π‘’ is initialized with all the cell indexes in the 𝐢𝑀𝐺 minus those identified by the pruning phase, i.e., π‘†π‘π‘Žπ‘π‘’ = {all the cell indexes in the CMG} \ (𝐿𝑆* βˆͺ 𝑁𝑆*).

4.3 Adding Iteration
πΆπ‘…π΅π‘Žπ‘ π‘’π‘ƒ can be further optimized by iteratively increasing the number of bins during the search, up to a given maximum 𝑛 = 2^π‘š. The new algorithm is denoted by πΆπ‘…π΅π‘Žπ‘ π‘’π‘ƒπΌ (πΆπ‘…π΅π‘Žπ‘ π‘’π‘ƒ with Iteration). Each iteration, by increasing the number of bins, increases the number of matrix cells to be visited for pruning and search and, as a consequence, the precision by which we refine the query and compute cardinalities. More precisely, at each iteration 𝑖 β‰₯ 1, the number of bins is set to 2^𝑖 and algorithm πΆπ‘…π΅π‘Žπ‘ π‘’π‘ƒ is used to prune the space, navigate it, and update the index π‘šπ‘–π‘›. Only when we reach the maximum precision, i.e., when 𝑖 = π‘š and the number of bins is equal to 𝑛, is the computed π‘šπ‘–π‘› returned as the result.

4.4 Cardinality Estimation
Our query refinement framework requires fast and accurate cardinality estimates for the queries corresponding to the visited cells of the 𝐢𝑀𝐺. These estimates could be obtained from the cardinality estimation component of the database system. However, such estimates are often incorrect, especially for queries with multiple joins and selection predicates. Since the accuracy of the proposed coverage-based rewriting depends on the cardinality estimation approach used, similarly to [12], we rely on sampling-based estimators. In order to avoid sampling repeatedly for each visited cell, we compute an πœ–-approximation of the base tables and rely on the approach in [13] for generating the sample of joined tables. We compute just one random sample for each query at hand (uniformly, independently, and without replacement). According to [13], such a sample can be reused for all structurally equivalent queries, i.e., queries containing the same number of selection conditions and the same number of joins.
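To summarize the machinery of this section, the following is a compact, in-memory sketch (our own simplified illustration, not the paper's implementation) of the grid navigation with the monotonicity-based pruning of π‘’π‘Ÿ(Β·) sets; cardinality estimation and constraint checking are passed in as callbacks, and function names are ours:

```python
from itertools import product
import math

def cr_base(n, d, card, check):
    # Visit grid cells at increasing distance from (0,...,0); keep the
    # satisfying cell with minimum cardinality; prune ur(a) (all b >= a)
    # whenever it cannot contain a better solution (monotonicity).
    space = set(product(range(n), repeat=d))
    top = (n - 1,) * d
    if not check(top):            # Q_min returns every tuple: no rewriting
        return None
    best = top
    origin = (0,) * d
    for a in sorted(space, key=lambda c: math.dist(c, origin)):
        if a not in space:        # already pruned away
            continue
        if check(a):
            # cells in ur(a) satisfy the constraints but cannot beat a
            space -= {b for b in space if all(x >= y for x, y in zip(b, a))}
            if card(a) < card(best):
                best = a
        elif card(best) < card(a):
            # cells in ur(a) are all larger than the current minimum
            space -= {b for b in space if all(x >= y for x, y in zip(b, a))}
    return best
```

πΆπ‘…π΅π‘Žπ‘ π‘’π‘ƒ would additionally seed the removal of 𝐿𝑆* and 𝑁𝑆* before the loop via diagonal and dimensional pruning, and πΆπ‘…π΅π‘Žπ‘ π‘’π‘ƒπΌ would rerun this search with 2, 4, ..., 𝑛 bins; both are omitted here for brevity.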
5 EXPERIMENTAL EVALUATION
5.1 Experimental Setup
All experiments were conducted on a PC with an Intel Core i5-8300H 2.30 GHz CPU and 16 GB main memory, running Microsoft Windows 10. All programs were coded in Python 3.8.

We considered two real datasets of different sizes, stored in PostgreSQL: the Adult dataset [11] and the US Forbes Richest People dataset.1 The Adult dataset was originally extracted from the 1994 Census database and has been used in assessing fairness-aware data management techniques (see, e.g., [16]). It contains 48842 individuals, described by six numerical attributes (age, fnlwgt, education-num, capital-gain, capital-loss, hours-per-week) and eight categorical attributes (workclass, education, marital-status, occupation, relationship, race, sex, native-country). For our experiments, we use sex as the sensitive attribute; it presents an unbalanced proportion of males (67%) and females (33%). For this dataset, we generated a random sample with 9604 tuples, guaranteeing query cardinality estimation with 0.1% error and 99% confidence. The sample is small enough to be stored in memory. The US Forbes Richest People dataset contains information about the richest people in the US in 2016 (about 400 individuals), described by six numerical attributes (worthchange, age, realtimeworth, realtimerank, timestamp, realtimeposition) and nine categorical attributes (name, lastname, uri, imageuri, source, industry, gender, headquarters, state). For our experiments, we use gender as the sensitive attribute; it presents an unbalanced proportion of males (87%) and females (13%).

1 https://www.forbes.com/forbes-400/#ec679fa7e2ff

To measure the performance of the algorithms, we consider both the execution time and the number of query cardinality estimations, corresponding to the number of visited cells. To measure the accuracy of the algorithms, similarly to [12], given an initial query 𝑄 over a database instance 𝐼, rewritten into a new query 𝑄′, we compute the relaxation degree as the ratio (|𝑄′(𝐼)| βˆ’ |𝑄 (𝐼)|) / |𝑄 (𝐼)|.

We then performed four groups of experiments aiming at: (i) comparing the three proposed algorithms; (ii) analyzing their performance and accuracy with respect to the number of bins and a single coverage constraint; (iii) analyzing their performance while changing the reference dataset; (iv) analyzing their performance and accuracy while changing the number of coverage constraints. These preliminary experimental results refer to selection queries; we leave the experimental analysis of join queries to future work.

5.2 Experimental Results
Impact of pruning and iteration. In the first group of experiments, we compare the three proposed algorithms to understand the impact of pruning and iteration on performance and relaxation degree. To this aim, we consider a query 𝑄4 with four selection conditions, selecting individuals who are over 20 years old, have an education number greater than or equal to 13, work more than 20 hours per week, and have a capital gain higher than 5500:

Q4: SELECT * FROM adult_data
    WHERE age > 20 AND education_num >= 13
    AND hours_per_week > 20 AND capital_gain > 5500

Consider the coverage constraint 𝑄 β†“πΉπ‘’π‘šπ‘Žπ‘™π‘’ β‰₯ 250. Query 𝑄4 returns 1242 tuples (selectivity equal to 2.5%), out of which 200 are females (about 16%); thus, the constraint is not satisfied.

Figure 3 (Comparison of coverage rewriting algorithms: (a) Execution time, (b) Visited cells, (c) Relaxation degree) shows the execution time (in seconds) of the three algorithms by varying the number of bins 𝑛. As expected, the number of visited cells increases (and, as a consequence, performance decreases) while increasing 𝑛. πΆπ‘…π΅π‘Žπ‘ π‘’π‘ƒπΌ exhibits the best performance, showing the benefits of relying on pruning and iteration. On the other hand, all techniques return the same coverage-based rewritten query, whose relaxation degree decreases while increasing the number of bins (see Figure 3(c)). Similar results have been obtained with other queries. Therefore, in the following, we present experimental results only for πΆπ‘…π΅π‘Žπ‘ π‘’π‘ƒπΌ. Additionally, since the number of visited cells and the execution time are correlated, in the following we report only execution time results.

Impact of the number of selection conditions. In the second group of experiments, we analyze in more detail the execution time and relaxation degree achieved by πΆπ‘…π΅π‘Žπ‘ π‘’π‘ƒπΌ, varying the number of bins 𝑛 and the threshold of the coverage constraint. The considered queries (𝑄2 and 𝑄3, listed below, and 𝑄4, presented above) contain a different number of selection conditions, but all of them have a selectivity between 2% and 4%, and the percentage of returned females is about 16% of the result. In this way, we can analyze πΆπ‘…π΅π‘Žπ‘ π‘’π‘ƒπΌ performance independently of differences in query selectivity (similar results have been obtained with queries with higher selectivities).

Q2: SELECT * FROM adult_data
    WHERE hours_per_week > 20 AND capital_gain > 5500
Q3: SELECT * FROM adult_data
    WHERE age > 20 AND hours_per_week > 20
    AND capital_gain > 5500

For each query, we consider a coverage constraint 𝑄 β†“πΉπ‘’π‘šπ‘Žπ‘™π‘’ β‰₯ π‘žπ‘“ + 25%, where π‘žπ‘“ denotes the number of females returned by the initial query. From Figure 4(a), we observe that, by varying the number of selection conditions, and therefore the dimensionality of the 𝐢𝑀𝐺, the execution time rapidly increases starting from 32 bins. This is a well-known problem, due to the curse of dimensionality. From Figure 4(b), we notice that, starting from 16 bins, we get a relaxation degree lower than 0.4.

Figure 4: πΆπ‘…π΅π‘Žπ‘ π‘’π‘ƒπΌ performance and accuracy with respect to the number of bins - (a) Execution time, (b) Relaxation degree
Figure 5: πΆπ‘…π΅π‘Žπ‘ π‘’π‘ƒπΌ pruning and search time with respect to the number of bins - (a) Pruning time, (b) Search time
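The relaxation degree reported in these plots is the ratio defined in Section 5.1; as a trivial helper (our own illustration, with made-up cardinalities):

```python
def relaxation_degree(card_q, card_q_prime):
    # (|Q'(I)| - |Q(I)|) / |Q(I)|, as defined in Section 5.1
    return (card_q_prime - card_q) / card_q

# e.g., a rewriting growing a 1000-tuple result to 1350 tuples
deg = relaxation_degree(1000, 1350)  # -> 0.35
```

A degree of 0 means the rewriting did not enlarge the result at all; a degree of 0.4, as above, means the relaxed query returns 40% more tuples than the original one.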
Due to space constraints, we consider only queries with two and three selection predicates, keeping query selectivity in the range considered for the Adult dataset (about 3%): (a) Execution time (b) Relaxation degree Q2_f: SELECT * FROM forbes WHERE realtimeworth > 10000 AND age < 65 Q3_f: SELECT * FROM forbes Figure 6: πΆπ‘…π΅π‘Žπ‘ π‘’π‘ƒπΌ performance and accuracy, with re- WHERE realtimeworth > 10000 AND age < 65 spect to coverage threshold AND age > 45 Figure 7 compares execution time of corresponding queries degree is almost stable starting from 32 bins. By combining to- executed over the Adult and Forbes datasets. As expected, the gether results shown in Figures 4 (a) and (b), we can consider a dataset size influences the cardinality query estimation time and, number of bins equal to 32 a reasonable choice for getting a good as a consequence, the total execution time. A similar result holds compromise between performance and accuracy. From Figure while increasing the female threshold (Figure 8). In both cases, 5(a), we notice that the pruning time is quite low (lower than half differences in times are more evident for higher number of selec- a second) even with 𝑑 = 4, thus it does not deeply suffer from the tion conditions (𝑑 = 3). curse of dimensionality while the search time does (Figure 5(b)). As a final experiment, we analyze πΆπ‘…π΅π‘Žπ‘ π‘’π‘ƒπΌ performance by fix- Impact of the number of coverage constraints. In order to analyze ing the number of bins to 32 and varying the coverage constraint, the impact of the number of coverage constraints, we consider setting the threshold to π‘žπ‘“ plus a given percentage (25%, 33%, query 𝑄2 and the following sets of coverage constraints: 50%, 66%). 
From Figure 6(a), we observe that, by increasing the 𝐢𝐢 1 = 𝑄2 ↓𝐹 π‘’π‘šπ‘Žπ‘™π‘’ β‰₯ 456 female coverage threshold, the execution time increases (quite 𝐢𝐢 2 = 𝑄2 ↓𝐹 π‘’π‘šπ‘Žπ‘™π‘’ β‰₯ 456 AND 𝑄2 β†“π‘€π‘Žπ‘™π‘’ β‰₯ 1800 rapidly for 𝑑 = 4 due to the curve of dimensionality problem) 𝐢𝐢 3 = 𝑄2 ↓𝐹 π‘’π‘šπ‘Žπ‘™π‘’ β‰₯ 456 AND 𝑄2 β†“π‘€π‘Žπ‘™π‘’ β‰₯ 2400 since more cells are visited by the algorithm (the distance be- 𝐢𝐢 1 is the constraint we considered in the previous experiments. tween the original query and the final one is higher). On the 𝐢𝐢 2 specifies an additional coverage constraint on males, which other hand, the relaxation degree almost linearly increases (see is however satisfied by the coverage-based rewriting generated Figure 6(b)) with respect to the additional percentage of females when considering 𝐢𝐢 1 . Thus, we expect that the solution returned required by the coverage constraint since, by increasing females, with 𝐢𝐢 1 is the same than that returned with 𝐢𝐢 2 . On the other males increase as well. hand, with 𝐢𝐢 3 , we expect a different rewriting for increasing the number of males, as required by the constraint. [6] Marina Drosou, H. V. Jagadish, Evaggelia Pitoura, and Julia Stoyanovich. 2017. Diversity in Big Data: A Review. Big Data 5, 2 (2017), 73–84. [7] Serge Abiteboul et Al. 2018. Research Directions for Principles of Data Man- agement (Dagstuhl Perspectives Workshop 16151). Dagstuhl Manifestos 7, 1 (2018), 1–29. [8] Donatella Firmani, Letizia Tanca, and Riccardo Torlone. 2019. Data Processing: Reflections on Ethics. In Proceedings of the 1st International Workshop on Processing Information Ethically co-located with 31st International Conference on Advanced Information Systems Engineering, PIE@CAiSE 2019, Rome, Italy, June 4, 2019. [9] Nodira Khoussainova, YongChul Kwon, Magdalena Balazinska, and Dan Suciu. (a) Execution time (b) Relaxation degree 2010. SnipSuggest: Context-Aware Autocompletion for SQL. 
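The relaxation behavior evaluated above can be illustrated with a small self-contained sketch. The toy data, function names, and the exhaustive enumeration of relaxed bounds below are our own simplifications for illustration only: the paper's CRBasePI algorithm instead prunes and searches a binned cardinality grid (CMG) rather than enumerating candidates exhaustively, and uses estimated rather than exact cardinalities.

```python
# Hypothetical, brute-force sketch of coverage-based query relaxation.
# A conjunctive query is a set of lower bounds on numeric attributes;
# a coverage constraint requires at least k result tuples with a given
# value of a sensitive attribute (in the spirit of Q2 ↓Female β‰₯ 456).
from itertools import product

def satisfies(row, preds):
    """preds: {attribute: lower_bound}; a row qualifies if it meets every bound."""
    return all(row[a] >= b for a, b in preds.items())

def coverage(rows, preds, sensitive, value):
    """Number of qualifying rows whose sensitive attribute equals `value`."""
    return sum(1 for r in rows if satisfies(r, preds) and r[sensitive] == value)

def relax(rows, preds, constraints, bins=4, step=5):
    """Return relaxed bounds closest to `preds` (fewest lowering steps,
    a stand-in for the relaxation degree) satisfying every constraint."""
    best, best_cost = None, None
    # Each bound may be lowered by 0..bins-1 steps (a coarse 1-D grid per attribute).
    for moves in product(range(bins), repeat=len(preds)):
        cand = {a: b - m * step for (a, b), m in zip(preds.items(), moves)}
        if all(coverage(rows, cand, s, v) >= k for (s, v, k) in constraints):
            cost = sum(moves)
            if best_cost is None or cost < best_cost:
                best, best_cost = cand, cost
    return best

people = [
    {"age": 25, "income": 40, "gender": "F"},
    {"age": 52, "income": 90, "gender": "M"},
    {"age": 61, "income": 75, "gender": "M"},
    {"age": 33, "income": 55, "gender": "F"},
    {"age": 47, "income": 20, "gender": "F"},
]

# Original query (age >= 40 AND income >= 50) returns no females;
# require at least 2 females, analogous to constraint CC1.
relaxed = relax(people, {"age": 40, "income": 50}, [("gender", "F", 2)])
```

On this toy data, the cheapest relaxation lowers the age bound to 25 and the income bound to 40, which admits two females; note that, as observed in the experiments, relaxing the query to admit more females admits more males as well.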
[Figure 9: CRBasePI performance and accuracy, with respect to different coverage constraints — (a) execution time; (b) relaxation degree]

Figure 9(a) shows the execution time of query Q2 with respect to the considered coverage constraints. When considering CC2, the execution time slightly increases since, for each visited cell, one more query cardinality estimation is needed. However, since, by construction, when CC1 is satisfied CC2 is satisfied as well, the number of visited cells in the two cases coincides. On the other hand, when considering CC3, more cells are visited and, as a consequence, the execution time increases as well. From Figure 9(b), we see that, as expected, the relaxation degrees for CC1 and CC2 coincide, since in both cases the returned relaxed queries coincide. Relaxation is higher for CC3, since a higher coverage threshold is considered.

6 CONCLUDING REMARKS

In this paper, we have presented a preliminary approach for coverage-based rewriting of SQL queries, suitable for addressing non-discrimination in early data processing stages. The proposed techniques revise and extend existing query refinement approaches, defined for addressing the empty or few answers problem, to take care of coverage issues, without the need of any user interaction. Experimental results show that the proposed approach is effective and efficient. Future work includes: (i) further optimizations of the CRBasePI algorithm to cope with the curse of dimensionality, by exploiting CMG sparsity and additional heuristics; (ii) considering more than one sensitive attribute and automating the identification of the sensitive attribute to analyze first; (iii) dealing with categorical attributes in the selection conditions and with more complex fairness constraints; (iv) taking into account data summaries of the input dataset, thus making the rewriting reusable for datasets characterized by the same synopses; (v) since data preparation is an iterative process, treating coverage-based rewriting as a new relational operation to be taken into account during all query processing steps, including query optimization.

REFERENCES
[1] Dolan Antenucci and Michael J. Cafarella. 2018. Constraint-based Explanation and Repair of Filter-Based Transformations. PVLDB 11, 9 (2018), 947–960.
[2] Abolfazl Asudeh, H. V. Jagadish, Gerome Miklau, and Julia Stoyanovich. 2018. On Obtaining Stable Rankings. PVLDB 12, 3 (2018), 237–250.
[3] Abolfazl Asudeh, H. V. Jagadish, Julia Stoyanovich, and Gautam Das. 2019. Designing Fair Ranking Schemes. In Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference 2019, Amsterdam, The Netherlands, June 30 - July 5, 2019. 1259–1276.
[4] Abolfazl Asudeh, Zhongjun Jin, and H. V. Jagadish. 2019. Assessing and Remedying Coverage for a Given Dataset. In 35th IEEE International Conference on Data Engineering, ICDE 2019, Macao, China, April 8-11, 2019. 554–565.
[5] Piero A. Bonatti and Sabrina Kirrane. 2019. Big Data and Analytics in the Age of the GDPR. In 2019 IEEE International Congress on Big Data, BigData Congress 2019, Milan, Italy, July 8-13, 2019. 7–16.
[6] Marina Drosou, H. V. Jagadish, Evaggelia Pitoura, and Julia Stoyanovich. 2017. Diversity in Big Data: A Review. Big Data 5, 2 (2017), 73–84.
[7] Serge Abiteboul et al. 2018. Research Directions for Principles of Data Management (Dagstuhl Perspectives Workshop 16151). Dagstuhl Manifestos 7, 1 (2018), 1–29.
[8] Donatella Firmani, Letizia Tanca, and Riccardo Torlone. 2019. Data Processing: Reflections on Ethics. In Proceedings of the 1st International Workshop on Processing Information Ethically, co-located with the 31st International Conference on Advanced Information Systems Engineering, PIE@CAiSE 2019, Rome, Italy, June 4, 2019.
[9] Nodira Khoussainova, YongChul Kwon, Magdalena Balazinska, and Dan Suciu. 2010. SnipSuggest: Context-Aware Autocompletion for SQL. PVLDB 4, 1 (2010), 22–33.
[10] Nick Koudas, Chen Li, Anthony K. H. Tung, and Rares Vernica. 2006. Relaxing Join and Selection Queries. In Proceedings of the 32nd International Conference on Very Large Data Bases, Seoul, Korea, September 12-15, 2006. 199–210.
[11] Moshe Lichman. 2013. UCI Machine Learning Repository. Technical Report.
[12] Chaitanya Mishra and Nick Koudas. 2009. Interactive Query Refinement. In EDBT 2009, 12th International Conference on Extending Database Technology, Saint Petersburg, Russia, March 24-26, 2009, Proceedings. 862–873.
[13] Matteo Riondato, Mert Akdere, Ugur Γ‡etintemel, Stanley B. Zdonik, and Eli Upfal. 2011. The VC-Dimension of SQL Queries and Selectivity Estimation through Sampling. In Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2011, Athens, Greece, September 5-9, 2011, Proceedings, Part II. 661–676.
[14] Babak Salimi, Corey Cole, Peter Li, Johannes Gehrke, and Dan Suciu. 2018. HypDB: A Demonstration of Detecting, Explaining and Resolving Bias in OLAP Queries. PVLDB 11, 12 (2018), 2062–2065.
[15] Babak Salimi, Johannes Gehrke, and Dan Suciu. 2018. Bias in OLAP Queries: Detection, Explanation, and Removal. In Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference 2018, Houston, TX, USA, June 10-15, 2018. 1021–1035.
[16] Babak Salimi, Bill Howe, and Dan Suciu. 2019. Data Management for Causal Algorithmic Fairness. IEEE Data Eng. Bull. 42, 3 (2019), 24–35.
[17] Babak Salimi, Luke Rodriguez, Bill Howe, and Dan Suciu. 2019. Capuchin: Causal Database Repair for Algorithmic Fairness. CoRR abs/1902.08283 (2019).
[18] Babak Salimi, Luke Rodriguez, Bill Howe, and Dan Suciu. 2019. Interventional Fairness: Causal Database Repair for Algorithmic Fairness. In Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference 2019, Amsterdam, The Netherlands, June 30 - July 5, 2019. 793–810.
[19] Julia Stoyanovich and Bill Howe. 2019. Nutritional Labels for Data and Models. IEEE Data Eng. Bull. 42, 3 (2019), 13–23.
[20] Julia Stoyanovich, Ke Yang, and H. V. Jagadish. 2018. Online Set Selection with Fairness and Diversity Constraints. In Proceedings of the 21st International Conference on Extending Database Technology, EDBT 2018, Vienna, Austria, March 26-29, 2018. 241–252.
[21] Chenkai Sun, Abolfazl Asudeh, H. V. Jagadish, Bill Howe, and Julia Stoyanovich. 2019. MithraLabel: Flexible Dataset Nutritional Labels for Responsible Data Science. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM 2019, Beijing, China, November 3-7, 2019. 2893–2896.
[22] Ke Yang, Vasilis Gkatzelis, and Julia Stoyanovich. 2019. Balanced Ranking with Diversity Constraints. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019. 6035–6042.
[23] Junjie Yao, Bin Cui, Liansheng Hua, and Yuxin Huang. 2012. Keyword Query Reformulation on Structured Data. In IEEE 28th International Conference on Data Engineering (ICDE 2012), Washington, DC, USA (Arlington, Virginia), 1-5 April, 2012. 953–964.