Data Wrangling for Fair Classification

Lacramioara Mazilu, University of Manchester, Manchester, United Kingdom, lacramioaramazilu@gmail.com
Norman W. Paton, University of Manchester, Manchester, United Kingdom, npaton@manchester.ac.uk
Nikolaos Konstantinou, University of Manchester, Manchester, United Kingdom, nkons@live.com
Alvaro A. A. Fernandes, University of Manchester, Manchester, United Kingdom, fernandesaaa@gmail.com

ABSTRACT
Whenever decisions that affect people are informed by classifiers, there is a risk that the decisions can discriminate against certain groups as a result of bias in the training data. There has been significant work to address this, based on pre-processing the inputs to the classifier, changing the classifier itself, or post-processing the results of the classifier. However, upstream from these steps, there may be a variety of data wrangling processes that select and integrate the data that is used to train the classifier, and these steps could themselves lead to bias. In this paper, we propose an approach that generates schema mappings in ways that take into account bias observed in classifiers trained on the results of different mappings. The approach searches a space of candidate interventions in the mapping generation process, which change how these mappings are generated, informed by a bias-aware fitness function. The resulting approach is evaluated using the Adult Census and German Credit data sets.

KEYWORDS
fairness, bias, classification, data preparation, data wrangling

© 2021 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2021 Joint Conference (March 23–26, 2021, Nicosia, Cyprus) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION
Fairness in machine learning is important because machine learning supports decision making; problems resulting from bias have been widely recognised, and there are many results on fairness-enhancing interventions in machine learning [2]. Proposals have been made for taking fairness into account before, during and after learning.

Although proposals that focus on interventions before learning have considered a variety of techniques for selecting or modifying training data [7], most of this work assumes that the training data is already available. As such, the interventions take place after the data preparation steps that select and integrate data sets for analysis. As data scientists spend a considerable portion of their time on such steps, an opportunity seems to exist for making interventions earlier in the data processing pipeline.

In our previous work [12], we investigated how data wrangling processes could be adjusted to address dataset properties that can give rise to bias, specifically sample size disparity and proxy attributes. Such interventions are promising, in that they can be applied to unlabelled data, but there is no direct guarantee that addressing the underlying property will in fact reduce classifier bias.

In this paper, we investigate interventions in the data wrangling process that directly target classifier bias, when labels are available. Specifically, we assume a situation in which data preparation selects and combines datasets for use in training, and we intervene in data preparation to ensure that these selection and combination decisions are informed by the fairness of the resulting classifier. Similarly to other work, our focus is on the pre-training step; however, we do not rely on the assumption that the data is already gathered and that fairness is achieved through operations on the integrated data source. Our focus is on creating the integrated data source in a fashion that takes fairness measurements into consideration.

The contributions are as follows:
(1) The proposal of a strategy for fairness-aware data preparation for classification.
(2) The realisation of the strategy as a search for fair data preparation plans.
(3) An evaluation of (2) for benchmark data sets that shows how the interventions can improve a specific fairness metric, namely demographic parity. Also, we show how the accuracy of the trained classifier is impacted by the interventions.

The remainder of this paper is structured as follows. Section 2 provides the context for this work by reviewing related results. Section 3 describes some technical background on data preparation required for later sections. Section 4 details the approach, in which a space of alternative wrangling strategies is explored. Section 5 presents an empirical evaluation, and Section 6 reviews progress against the objectives.

2 RELATED WORK
The problem considered in this paper is as follows: Assume we have access to data sets that can be used to populate a target schema that is suitable for training a classifier. Combine the data sets in such a way as to reduce the bias in the generated classifier.

Previous related work has tended to assume that there is an existing data set that is suitable for training the classifier. In such a case, the data is considered to have a sensitive attribute (such as gender or race), and the objective is to train a classifier that reduces bias with respect to some outcome, represented by labels in the data (such as a decision to hire or promote an individual).

This section reviews existing work on fairness in machine learning, looking at interventions designed to reduce bias that take place before, during and after training. The emphasis is on before training, as wrangling falls there.

Before. Approaches to reducing bias that influence the data used for training may remove attributes that correlate with the sensitive attribute, change the labels to reduce bias, choose samples of
the available data, or assign weights to different rows that impact their influence on the classifier [7]. Alternatively, pre-processing techniques may directly tackle properties of the underlying data set that may be problematic, such as coverage [10].

There has been less work on data wrangling, the focus of this paper; although some papers explicitly discuss data preparation, such work has generally addressed later stages of the pipeline than this paper. Valentim et al. [15] empirically compare several interventions, specifically the removal of the sensitive attribute, feature discretisation and sampling techniques. Accinelli et al. [1] consider coverage within data preparation, though primarily by selecting data rather than determining how the data should be wrangled.

In our previous work [12], we steer the generation of data preparation plans in ways that reflect risk factors for bias. The current paper follows a similar approach, in the sense that plan generation is steered, but directly calls the classifier on candidate plans, thereby steering the plan generation based on direct evidence of bias.

During. Approaches to reducing bias within the learning algorithm are often associated with metrics that quantify some notion of bias, and the learning algorithm then explicitly trades off bias with accuracy [13]. For example, Hajian et al. [5] provide techniques for constraining the levels of bias exhibited by classification rules to a specified level.

After. Approaches to reducing bias after learning operate by selectively changing labels at prediction time (e.g., [6, 8]).

Work on bias-aware data preparation can be seen as complementary to most other work on fair classification; benefits that arise with respect to fairness in upstream processes simply reduce the scale of the problem for later interventions.

Figure 1: Data wrangling pipeline

3 TECHNICAL BACKGROUND
The approach to fair data preparation depends on identifying interventions during data wrangling that affect the behaviour of a wrangling pipeline. We build on the VADA data wrangling system [9], the relevant steps of which are illustrated in Figure 1. In VADA, given a target schema definition and the data sources, the system can generate schema mappings that populate the target. We particularly exploit the automatic generation of data preparation plans, to generate new ways of integrating the data. The steps that are relevant to this paper are:

Matching: identify relationships between source and target attributes, where the former may be suitable for providing data for the latter.
Profiling: identify candidate keys for the source relations, and (partial) inclusion dependencies that identify overlaps between attribute values in different sources.
Mapping: informed by the results of matching and profiling, generate candidate mappings in the form of views that can be used to populate the target table from the sources [11].
Selection: given the results of the mappings, select the top k results from the mappings that best satisfy some criterion. In this paper, the criterion used is completeness – prefer the mappings with the fewest missing values.
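To make the artefacts produced by these steps more concrete, the following sketch shows minimal representations of matches and (partial) inclusion dependencies, together with a completeness-based selection of the top-k mapping results. It is illustrative only, not the VADA implementation; all names, and the representation of a mapping result as a list of dictionaries, are assumptions.

```python
# A minimal sketch (not the VADA implementation) of the artefacts produced by
# matching, profiling, and selection; names and representations are illustrative.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class Match:
    source_table: str      # e.g. "S1"
    source_attr: str       # e.g. "name"
    target_attr: str       # attribute of the target schema, e.g. "name"

@dataclass(frozen=True)
class InclusionDependency:
    left: Tuple[str, str]   # (table, attribute), e.g. ("S1", "name")
    right: Tuple[str, str]  # (table, attribute), e.g. ("S2", "name")
    overlap: float          # theta: degree of value overlap between the attributes

def completeness(rows: List[dict]) -> float:
    """Fraction of non-null cells in the result of a mapping."""
    cells = [v for row in rows for v in row.values()]
    return sum(v is not None for v in cells) / len(cells) if cells else 0.0

def select_top_k(mapping_results: List[List[dict]], k: int) -> List[List[dict]]:
    """Selection step: prefer the mappings with the fewest missing values."""
    return sorted(mapping_results, key=completeness, reverse=True)[:k]
```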
4 APPROACH
4.1 Overview of Approach
The approach to fair data preparation for classification involves exploring a space of candidate data preparation plans, and assessing the fairness of the classifiers built using the results of these plans. The steps in the approach are illustrated in Figure 2, and are discussed below.

In the approach, a Candidate Solution captures a collection of interventions into the wrangling process that are deemed to be helpful for reducing bias. Specifically, the interventions involve removing matches or inclusion dependencies from consideration during data wrangling. For example, assume that m1 is a match and i1 is an inclusion dependency. It is possible that the data resulting from the wrangling pipeline in Figure 1 produces a fairer output data set when: (i) m1 is available but i1 is not; (ii) m1 is not available but i1 is available; (iii) m1 and i1 are both available; or (iv) neither m1 nor i1 is available. These different environments in which to wrangle constitute a search space of interventions that lead to the production of different data sets that may lead to more or less biased classifiers.

Assume we have a match m : S.a → T.a' relating attribute a from source S to attribute a' in target T. The removal of m from the wrangling process can have one of the following effects: (i) m is replaced by another match involving the same table, so that now S.a'' → T.a' and as a result T.a' is populated differently; (ii) m is replaced by another match involving a different table, so that now P.a → T.a' and as a result T.a' is populated differently; or (iii) no alternative match is found, so T.a' is populated with null values.

Assume we have an inclusion dependency I = P.a ⊆θ S.b between attributes in the sources P and S, where θ is the degree of overlap between the attributes a and b. The removal of I can have one of the following effects: (i) there is another inclusion dependency between P and S, which leads to the same tables being considered for joining, but on different attributes; (ii) there is another, indirect way in which P and S can be joined, through an intermediate table that can be used instead by the mapping generator; or (iii) the tables are no longer joined with each other.
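As an illustration of how a Candidate Solution constrains the wrangling process, the following sketch filters the matches and inclusion dependencies before mapping generation. It is an interpretation, not the paper's implementation; generate_mappings is an assumed callable standing in for the pipeline of Figure 1.

```python
from typing import Callable, FrozenSet, Iterable, List

# An intervention is a match or an inclusion dependency that is withheld from the
# pipeline; interventions are treated here as opaque, hashable objects.
CandidateSolution = FrozenSet[object]

def wrangle(matches: Iterable, inds: Iterable,
            solution: CandidateSolution,
            generate_mappings: Callable[[List, List], List]) -> List:
    """Run mapping generation (a callable standing in for the Figure 1 pipeline)
    with the matches and inclusion dependencies in `solution` excluded."""
    usable_matches = [m for m in matches if m not in solution]
    usable_inds = [i for i in inds if i not in solution]
    return generate_mappings(usable_matches, usable_inds)

# The four wrangling "environments" for a match m1 and an inclusion dependency i1
# described above correspond to the candidate solutions (sets of exclusions):
# frozenset(), frozenset({"i1"}), frozenset({"m1"}), frozenset({"m1", "i1"})
```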
To consider different ways of preparing the data, the approach is to explore the space of possible interventions, with a view to identifying combinations that lead to less biased results. The space of alternative interventions is explored using Tabu search [4], a local search algorithm that employs a diversity heuristic to avoid becoming stuck in local optima. When describing the approach, we will periodically refer to a running example.

Example 1. Assume an example where a model is trained to predict which individuals will be hired, with the sensitive binary attribute gender, which can have the values male or female. Consider a target schema T(name, age, qualification, gender, hire), which
indicates the hiring decision of the person with the given name, age, qualification and gender.

Figure 2: Steps in approach.

4.2 The Steps
In this section, each of the steps in Figure 2 is described in turn.

Stopping condition. Each iteration of the search uses the data wrangling pipeline in Figure 1 to create new data preparation plans, the data from which is used to train a classifier. As such, each iteration can be considered to be quite expensive, and certainly more expensive than evaluating a typical fitness function. As a result, the search cannot be allowed to carry out arbitrary numbers of iterations. The search terminates with its best solution when it has completed a specified number of wrangles (the number of executions of the data wrangling pipeline in Figure 1 during the search from Figure 2). The number is an input parameter that is usually chosen by considering the trade-off between runtime and bias reduction (an analysis is reported in our previous work [12]).
Create neighborhood. A Candidate Solution is a set of interventions. An intervention is a match or inclusion dependency that is to be excluded from consideration when generating an integration using the wrangling pipeline in Figure 1. The Current Candidate Solution is the starting point for the creation of a neighbourhood. The neighbourhood is a set of Candidate Solutions, each of which is obtained by extending the Current Candidate Solution with a single intervention from a set of Candidate Interventions. The Candidate Interventions are the matches and inclusion dependencies used in the mappings created by running the wrangling pipeline in the context of the interventions from the Current Candidate Solution.

Example 2. Following on from Example 1, assume that there are sources S1(name, age, height) and S2(name, hire), and that there are matches m1 from S1.name to T.name, m2 from S1.age to T.age, m3 from S2.name to T.name, and m4 from S2.hire to T.hire. Assume that a view has been produced for populating the target that contains the join of S1 and S2 on S1.name = S2.name, informed by an inclusion dependency i1 between S1.name and S2.name. In addition, assume that the Candidate Interventions are {m1, m2, m3, m4, i1}. Further, assume that the Current Candidate Solution is {m1}. Then the neighbourhood will consist of the Candidate Solutions {m1, m2}, {m1, m3}, {m1, m4} and {m1, i1}.
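A minimal sketch of this step follows, using string stand-ins for the matches and the inclusion dependency of Example 2; the Tabu list filtering mentioned under Update Tabu list below is included as a parameter.

```python
# Create Neighborhood: each neighbour extends the Current Candidate Solution with
# one further candidate intervention, skipping solutions already on the Tabu list.
def create_neighbourhood(current: frozenset,
                         candidate_interventions: set,
                         tabu_list: list) -> list:
    neighbourhood = []
    for intervention in candidate_interventions - set(current):
        neighbour = current | {intervention}
        if neighbour not in tabu_list:          # avoid points already visited
            neighbourhood.append(neighbour)
    return neighbourhood

# Example 2 rendered with string stand-ins:
current = frozenset({"m1"})
candidates = {"m1", "m2", "m3", "m4", "i1"}
print(create_neighbourhood(current, candidates, tabu_list=[]))
# -> the four candidate solutions {m1,m2}, {m1,m3}, {m1,m4}, {m1,i1} (in some order)
```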
Evaluate candidate solutions. Each of the Candidate Solutions in the neighbourhood is passed in turn to the data wrangling pipeline in Figure 1. The pipeline generates mappings, using automatically-derived matches and inclusion dependencies, but excluding those from the interventions of the Candidate Solution. The resulting mappings are evaluated, thereby associating each Candidate Solution with a data set that populates the target schema.

Train classifier. Having generated a data set for each candidate solution, the next step is to train a classifier on each of the data sets, and compute the bias of each of the resulting classifiers. The overall approach is independent of the type of classifier used, but in the experiments we use the J48 implementation of the C4.5 decision tree learning algorithm [14].

For each data set, we carry out k-fold cross validation. As such, the classifier is trained k times using k−1 folds and evaluated on the remaining fold, each time with a different evaluation fold. Then, its fairness is assessed by averaging the bias over the evaluation folds. The overall approach is independent of the notion of bias used, but in the experiments we use demographic parity [3]. Demographic parity measures group fairness. Where the Positive Rate (PR) for a group is the fraction of the items in the group that have the given outcome (e.g., the fraction of the people in the group that are hired), the demographic parity for a fold i is computed as:

    dp_i = abs(PR_i(G_1) - PR_i(G_2))    (1)

where G_1 is one group (e.g., females) and G_2 is the other (e.g., males). Where dp = 0, there is no bias.

The overall demographic parity for a Candidate Solution is computed as:

    dp_sol = (Σ_{i=1}^{k} dp_i) / k    (2)

Note that the role of training during the execution of the steps in Figure 2 is to obtain evidence to inform the search for fairer plans, and not to produce a final classifier.
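The following sketch shows how Equations (1) and (2) could be computed. It is illustrative only: scikit-learn's DecisionTreeClassifier and StratifiedKFold are used as stand-ins for the Weka J48 classifier and the cross-validation setup of the experiments, and the feature matrix X, labels y and sensitive attribute vector are assumed to be NumPy arrays with binary (0/1) encodings.

```python
# Per-fold demographic parity (Eq. 1) averaged over k folds (Eq. 2); a sketch,
# assuming binary labels/groups encoded as 0 and 1.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

def positive_rate(y_pred, sensitive, group, positive_label=1):
    mask = sensitive == group
    return (y_pred[mask] == positive_label).mean() if mask.any() else 0.0

def demographic_parity(X, y, sensitive, groups=(0, 1), k=5, seed=0):
    """dp_sol: the mean over folds of |PR_i(G1) - PR_i(G2)|."""
    dps = []
    for train_idx, eval_idx in StratifiedKFold(n_splits=k, shuffle=True,
                                               random_state=seed).split(X, y):
        # DecisionTreeClassifier stands in for the J48 (C4.5) classifier used in the paper.
        clf = DecisionTreeClassifier(random_state=seed).fit(X[train_idx], y[train_idx])
        y_pred = clf.predict(X[eval_idx])
        s_eval = sensitive[eval_idx]
        dp_i = abs(positive_rate(y_pred, s_eval, groups[0])
                   - positive_rate(y_pred, s_eval, groups[1]))
        dps.append(dp_i)
    return float(np.mean(dps))
```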
Choose best solution. The best solution should have low bias. However, as the approach explores a space of interventions that are liable to lead to increasingly sparse results (due to the deletion of matches and inclusion dependencies), it is important not to end up with a solution that has low bias but poor accuracy as a result of the provision of sparse training data. So, to prefer solutions that retain more data that can be used as evidence by the classifier, the objective function for the search is:

    obj = (w_1 * dp_sol) + (w_2 * r_nulls)    (3)

where r_nulls is the ratio of nulls in the data set, and w_1 and w_2 are weights, which are both set to 0.5 in the experiments, i.e., completeness and fairness are equally important in the end result. The search thus prefers solutions with low bias and high completeness.
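Equation (3) rendered as code; a minimal sketch that assumes the wrangled result is represented as a list of dictionaries so that the null ratio can be computed directly.

```python
# Objective of Equation (3): a weighted sum of bias and the ratio of nulls,
# with w1 = w2 = 0.5 as in the paper's experiments; lower is better.
def null_ratio(rows):
    cells = [v for row in rows for v in row.values()]
    return sum(v is None for v in cells) / len(cells) if cells else 1.0

def objective(dp_sol, r_nulls, w1=0.5, w2=0.5):
    return w1 * dp_sol + w2 * r_nulls
```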
Update Tabu list. The Tabu search maintains a collection of points in the search space that have already been visited. This list is updated here to include all solutions that have been evaluated in the current iteration.

The Tabu List is also used during the Create Neighborhood step, to avoid considering parts of the search space that have been explored before.

Example 3. Following on from Example 2, assume that the current Tabu list is [{m1}]. Then, after the exploration of the Candidate Solutions in the neighbourhood, i.e., {m1, m2}, {m1, m3}, {m1, m4} and {m1, i1}, these will be added to the Tabu list, so that the search keeps track of the plans that have been explored and need not explore them again.

Tabu diversity mechanism. Tabu search is essentially a greedy local search algorithm, with a single Current Candidate Solution, which is normally replaced at each iteration by the best solution in its neighbourhood in accordance with the objective function. To avoid being trapped in local optima, a diversity mechanism can be used to jump to a new location in the search space. In our implementation of Tabu search, when the best solution has not been improved for several iterations, the new Current Candidate Solution is set to the highest scoring of a subset of plans from the Tabu List.
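Bringing the steps together, the following sketch gives one interpretation of the loop in Figure 2 (it is not the authors' code). It reuses the create_neighbourhood function sketched earlier; wrangle_and_score is an assumed callable that runs the pipeline for a given solution and returns its objective value from Equation (3), and the patience and sample-size parameters of the diversification step are assumptions.

```python
# A sketch of Tabu search over candidate solutions, bounded by a wrangle budget.
import random

def tabu_search(candidate_interventions, wrangle_and_score,
                max_wrangles=80, patience=3, sample_size=3, seed=0):
    rng = random.Random(seed)
    current = frozenset()                        # start with no interventions
    scores = {current: wrangle_and_score(current)}
    best = current
    tabu_list = [current]
    wrangles, stale = 1, 0

    while wrangles < max_wrangles:               # stopping condition: wrangle budget
        neighbourhood = create_neighbourhood(current, candidate_interventions, tabu_list)
        if not neighbourhood:
            break
        evaluated = []
        for solution in neighbourhood:           # evaluate candidate solutions
            if wrangles >= max_wrangles:
                break
            scores[solution] = wrangle_and_score(solution)
            tabu_list.append(solution)           # update Tabu list
            evaluated.append(solution)
            wrangles += 1
        if not evaluated:
            break
        current = min(evaluated, key=scores.get) # choose best solution in the neighbourhood
        if scores[current] < scores[best]:
            best, stale = current, 0
        else:
            stale += 1
            if stale >= patience:                # diversity mechanism: jump elsewhere
                subset = rng.sample(tabu_list, min(sample_size, len(tabu_list)))
                current, stale = min(subset, key=scores.get), 0
    return best, scores[best]
```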
5 EVALUATION
5.1 Experimental Setup
Data Sets. To investigate the effectiveness of the approach described in Section 4, we have evaluated it in the context of data preparation scenarios derived from the following benchmark data sets:

• German Credit^1: The role of the classifier is to determine whether or not it is risky to give credit to a person. The dataset contains 1,000 tuples, each with 21 attributes. The sensitive attribute is Gender, which contains the values male (69%) and female (31%). The class attribute is Risky, which contains the values Yes or No, stating whether or not it is risky to give someone a loan.
• Adult Census^2: The role of the classifier is to predict if a person makes over $50K a year. The dataset contains 10,000 tuples^3, each with 14 attributes. The sensitive attribute is Gender, which contains the values male (66%) and female (33%). The class attribute is Salary, which contains the values >50K and <=50K, stating whether someone is predicted to earn more than 50K per year or not.

These are benchmark data sets that have been used in other studies on fairness, e.g., [15–18]. However, each is supplied as a single table, whereas to experiment on data wrangling, several tables are required. So, for the experiment, each of the data sets is partitioned horizontally into groups of rows, which are then vertically partitioned. This subdivision of the tables creates different ways in which the target can be populated, and allows interventions to take place that apply to subsets of the data.

For Adult Census, we horizontally divided the dataset using the 42 country values it contains, and then divided each horizontal partition into 2 vertical partitions, amounting to 84 initial sources. For German Credit, we divided the initial dataset into 4 horizontal partitions, each of which was then divided into 3 vertical partitions, resulting in 12 initial sources.

In the experiments, the number of sources is varied; adding additional sources provides more ways of populating the target, and in turn more potential interventions to improve fairness. In order to create opportunities for different intervention plans, we created alternative sources with synthetic constant values for non-sensitive attributes. Thus, for each benchmark dataset, we created five scenarios, each with an increase of 25% alternative sources over the previous scenario; e.g., for Adult Census, we created five scenarios with 84, 105, 127, 147, and 168 sources, each with a 0, 25, 50, 75, or 100% increase over the initial number of sources (84). The target schema is that of the original benchmark data sets.
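As an illustration of the kind of partitioning described above, the following sketch splits a single table, represented as a list of dictionaries, into horizontal groups and then into vertical fragments. It is illustrative only; the function name, the grouping attribute, and the handling of keys are assumptions rather than the exact procedure used in the experiments.

```python
from typing import Dict, List

def partition_into_sources(rows: List[dict], group_by: str,
                           vertical_splits: List[List[str]]) -> Dict[str, List[dict]]:
    """Split one table into horizontal groups on `group_by` (e.g. the country
    attribute for Adult Census), then split each group vertically; each fragment
    becomes one source."""
    groups: Dict[object, List[dict]] = {}
    for row in rows:                                  # horizontal partitioning
        groups.setdefault(row[group_by], []).append(row)
    sources: Dict[str, List[dict]] = {}
    for g, g_rows in groups.items():
        for v, attrs in enumerate(vertical_splits):   # vertical partitioning
            # in practice each vertical fragment would keep a key attribute so
            # that the fragments can later be re-joined by the mapping generator
            sources[f"src_{g}_{v}"] = [{a: r.get(a) for a in attrs} for r in g_rows]
    return sources
```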
Configuring the search. There are a number of parameters that control the behaviour of the search. The maximum number of wrangles for both data sets is set to 80; this is to obtain manageable experiment run times, while also allowing enough searching to generate plausible results. The value was obtained after a sensitivity analysis. In addition, up to the top 130 mappings are obtained from mapping generation; in practice, this is more than sufficient for the scenarios used, and often the mapping generator will produce significantly fewer mappings than this.

For the German Credit data set, the maximum number of tuples in the result of the wrangling process is set to 620 (the original data set contained 1,000 rows). This is to ensure that the results of the fair wrangling can be selective, leaving out results that are associated with more bias. For the same reason, the maximum number of tuples for the Adult Census dataset is 6,616 out of 10,000.

The results of the classifiers are obtained using 5-fold cross validation.

Classifier. The reported results are for the J48 decision tree classifier from Weka^4. The experiments were run with other classifiers, such as Logistic Regression, obtaining negligible variation between the experimental results; thus, we report only one set of results. Although the experiments involve binary classification tasks, the overall approach does not depend upon this.

Baseline. The baseline in the experiments is obtained by running the wrangling process without the approach from Section 4. Thus we are able to obtain a direct indication of the effectiveness of the interventions made. Note that there is no direct competitor in the literature with which to compare.

Experimental setup. The experiments were run on an Intel Core i5 with 2×2.7 GHz, and 8 GB RAM.

^1 https://www.kaggle.com/uciml/german-credit
^2 https://archive.ics.uci.edu/ml/datasets/adult
^3 Due to limited resources, we used stratified sampling to choose 10,000 tuples out of the 48,842 tuples in the original dataset. The sample maintains the properties of the initial dataset.
^4 https://www.cs.waikato.ac.nz/ml/weka/

5.2 Results
German Credit Data. Figures 3 and 4 show the results of the experiment with the German Credit data set. For both figures, the horizontal axis reports the number of sources in each scenario. In Figure 3, the vertical axis represents the demographic parity bias (computed using Equation 2), while, in Figure 4, the vertical axis represents the accuracy of the classifier trained and evaluated on the output datasets. Both bias and accuracy are averages computed through cross validation. The following can be observed: (i) The fair wrangling approach performs as well as or better than the baseline for demographic parity in all scenarios. (ii) Both methods sometimes have the same demographic parity for different numbers of sources; this is because the same plan may be selected even though the number of available sources grows. (iii) The fairness interventions do not always provide improving results as more sources become available; this is because the search is not exhaustive, and thus, although the larger numbers of sources provide more opportunities for effective interventions to be discovered, there is no guarantee that the best intervention plans will be obtained. As mentioned in Section 5.1, the parameter for the amount of explored search space was set based on a sensitivity analysis similar to the one reported for
the experiments presented in our previous work [12]. Furthermore, the additional sources may not always provide good opportunities to increase fairness. (iv) Accuracy did not suffer as a result of the interventions.

Figure 3: Demographic parity for the German Credit data set for different numbers of sources.

Figure 4: Classifier accuracy for the German Credit data set for different numbers of sources.

Adult Census Data. Figures 5 and 6 show the results of the experiment with the Adult Census data set. The following can be observed: (i) The results are similar to those for the German Credit data set; we do not repeat the recurring explanations here. (ii) The overall bias levels are smaller for the Adult Census data set than for the German Credit one, and in several cases bias is almost completely removed. (iii) The accuracy for the Adult Census data set is higher than that in the German Credit scenarios. This may be due to the difference in the amount of data that is used to train the classifier, i.e., Adult Census contains significantly more rows than the German Credit training set, which may lead to an underfit model for the German Credit scenarios.

Figure 5: Demographic parity for the Adult Census data set for different numbers of sources.

Figure 6: Classifier accuracy for the Adult Census data set for different numbers of sources.

                   Runtime (in seconds) for scenarios with percent of synthetic sources
    Dataset             0%       25%      50%      75%      100%
    German Credit       427      562      780      1,002    1,309
    Adult Census        1,416    48       1,691    1,605    41

Table 1: Runtime for varying number of sources on each dataset.

Runtime. Table 1 shows the runtime for each dataset considering its different scenarios, i.e., with varying percentages of synthetic data sources (as explained in Section 5.1). The runtime is expressed in seconds. Note that for the same percentage, the number of sources for each dataset differs, e.g., for the 50% case, the German Credit dataset scenario contains 18 sources, while the Adult Census scenario contains 127 sources.

For the German Credit dataset, the runtime increases as the number of sources increases. This is due to the fact that i) the Tabu search space increases and there are more intervention plans to be explored, and ii) the wrangling process is longer in the cases with more sources, because each of the wrangling components described in Section 3 depends on the number of input data sources; e.g., the mapping component considers combinations between all subsets of sources, and thus the number of subsets increases with the number of initial sources [11].

For the Adult Census dataset, it is interesting to notice that there is no clear pattern in the runtime values. This is due to the fact that for some of the scenarios, e.g., 25% and 100%, the search ends early when the bias is below a set threshold. Such a threshold is needed because, in the case of demographic parity, one cannot expect exactly equal positive rates for the male and female values of the sensitive attribute, so an acceptable difference needs to be allowed. In these scenarios, we set the threshold at 0.05, i.e., dp_sol < 0.05 (see Equation 2). This threshold is set according to the fairness measure used, e.g., demographic parity in our case. For the Adult Census scenarios with 25% and 100% synthetic sources, in both cases the wrangling process stops after
the initial wrangle, as it detects that the output dataset presents an acceptable bias value; thus, it does not prepare further interventions from the initial wrangle.
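A minimal sketch of this early-termination check, assuming dp_sol is computed as in Equation (2); the function name is an assumption.

```python
# Early stop: the search ends as soon as the wrangled dataset's demographic
# parity falls below the acceptable threshold (0.05 in these scenarios).
DP_THRESHOLD = 0.05

def acceptable_bias(dp_sol: float, threshold: float = DP_THRESHOLD) -> bool:
    return dp_sol < threshold
```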
6 CONCLUSIONS
Our previous work tackled underlying causes of bias for unlabelled data [12]. This paper complements that work for labelled data, by wrangling in a way that directly responds to the bias observed when wrangling with different mappings. This work focuses on creating fair datasets by using demographic parity as the way to measure bias. However, other metrics may be investigated, e.g., equalized odds, equal opportunity, etc., depending on the properties that are desired of the output dataset.

We now revisit the claimed contributions from the introduction:
(1) The proposal of a strategy for fairness-aware data preparation for classification. The proposed strategy searches a space of interventions that impact on how data is integrated for analysis. The approach builds on the ability to automate aspects of the data preparation process. Interventions are carried out, integrations generated, and the resulting data is used to train classifiers. The classifiers are then tested for bias, and the most promising interventions investigated further.
(2) The realisation of the strategy as a search for fair data preparation plans. The strategy has been implemented as a Tabu search, over a space of interventions that include the exclusion of matches and inclusion dependencies. Removing selected matches and inclusion dependencies may reduce bias by removing problematic mappings, or by prioritising alternative ways of preparing the data.
(3) An evaluation of (2) for benchmark data sets that shows how the interventions can improve a specific fairness metric, namely demographic parity. Also, we show how the accuracy of the trained classifier is impacted by the interventions. Results using two evaluation scenarios show that the approach reduces bias in comparison with a non-fairness-aware base case. Also, the evaluation shows that, compared to the baseline case, the accuracy of the classifier trained as a result of the interventions did not suffer.


ACKNOWLEDGMENTS
We are pleased to acknowledge the support of the UK Engineer-
ing and Physical Sciences Research Council, through the VADA
Programme Grant (EP/M025268/1).

REFERENCES
[1] Chiara Accinelli, Simone Minisi, and Barbara Catania. 2020. Coverage-based Rewriting for Data Preparation. In Proceedings of the Workshops of the EDBT/ICDT 2020 Joint Conference, Copenhagen, Denmark, March 30, 2020 (CEUR Workshop Proceedings, Vol. 2578), Alexandra Poulovassilis et al. (Eds.). CEUR-WS.org. http://ceur-ws.org/Vol-2578/PIE4.pdf
[2] Sorelle A. Friedler, Carlos Scheidegger, Suresh Venkatasubramanian, Sonam Choudhary, Evan P. Hamilton, and Derek Roth. 2019. A comparative study of fairness-enhancing interventions in machine learning. In FAT*. ACM, 329–338. https://doi.org/10.1145/3287560.3287589
[3] Pratik Gajane. 2017. On formalizing fairness in prediction with machine learning. CoRR abs/1710.03184 (2017). arXiv:1710.03184 http://arxiv.org/abs/1710.03184
[4] Fred Glover. 1986. Future paths for integer programming and links to artificial intelligence. Computers & Operations Research 13, 5 (1986), 533–549. https://doi.org/10.1016/0305-0548(86)90048-1
[5] Sara Hajian and Josep Domingo-Ferrer. 2013. A Methodology for Direct and Indirect Discrimination Prevention in Data Mining. IEEE Trans. Knowl. Data Eng. 25, 7 (2013), 1445–1459. https://doi.org/10.1109/TKDE.2012.72
[6] Moritz Hardt, Eric Price, and Nathan Srebro. 2016. Equality of Opportunity in Supervised Learning. In NIPS (Barcelona, Spain). Red Hook, NY, USA, 3323–3331.
[7] Faisal Kamiran and Toon Calders. 2011. Data preprocessing techniques for classification without discrimination. KAIS 33 (2011), 1–33.
[8] Faisal Kamiran, Asim Karim, and Xiangliang Zhang. 2012. Decision Theory for Discrimination-Aware Classification. In ICDM. USA, 924–929. https://doi.org/10.1109/ICDM.2012.45
[9] Nikolaos Konstantinou, Edward Abel, Luigi Bellomarini, Alex Bogatu, Cristina Civili, Endri Irfanie, Martin Koehler, Lacramioara Mazilu, Emanuel Sallinger, Alvaro Fernandes, Georg Gottlob, John Keane, and Norman Paton. 2019. VADA: An Architecture for End User Informed Data Preparation. Journal of Big Data 6:74 (2019). https://doi.org/10.1186/s40537-019-0237-9
[10] Yin Lin, Yifan Guan, Abolfazl Asudeh, and H. V. Jagadish. 2020. Identifying Insufficient Data Coverage in Databases with Multiple Relations. Proc. VLDB Endow. 13, 11 (2020), 2229–2242. http://www.vldb.org/pvldb/vol13/p2229-lin.pdf
[11] Lacramioara Mazilu, Norman W. Paton, Alvaro A. A. Fernandes, and Martin Koehler. 2019. Dynamap: Schema Mapping Generation in the Wild. In Proceedings of the 31st International Conference on Scientific and Statistical Database Management (Santa Cruz, CA, USA) (SSDBM '19). Association for Computing Machinery, New York, NY, USA, 37–48. https://doi.org/10.1145/3335783.3335785
[12] Lacramioara Mazilu, Norman W. Paton, Nikolaos Konstantinou, and Alvaro A. A. Fernandes. 2020. Fairness in Data Wrangling. In 21st International Conference on Information Reuse and Integration for Data Science, IRI 2020, Las Vegas, NV, USA, August 11-13, 2020. IEEE, 341–348. https://doi.org/10.1109/IRI49571.2020.00056
[13] Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. 2019. A Survey on Bias and Fairness in Machine Learning. CoRR abs/1908.09635 (2019). arXiv:1908.09635 http://arxiv.org/abs/1908.09635
[14] J. Ross Quinlan. 1996. Bagging, Boosting, and C4.5. In Proceedings of the Thirteenth National Conference on Artificial Intelligence and Eighth Innovative Applications of Artificial Intelligence Conference, AAAI, William J. Clancey and Daniel S. Weld (Eds.). AAAI Press / The MIT Press, 725–730. http://www.aaai.org/Library/AAAI/1996/aaai96-108.php
[15] Inês Valentim, Nuno Lourenço, and Nuno Antunes. 2019. The Impact of Data Preparation on the Fairness of Software Systems. CoRR abs/1910.02321 (2019). arXiv:1910.02321 http://arxiv.org/abs/1910.02321
[16] Vladimiro Zelaya, Paolo Missier, and Dennis Prangle. 2019. Parametrised Data Sampling for Fairness Optimisation. In KDD XAI.
[17] Richard Zemel, Yu Wu, Kevin Swersky, Toniann Pitassi, and Cynthia Dwork. 2013. Learning Fair Representations. In ICML (Atlanta, GA, USA) (ICML '13). III-325–III-333.
[18] Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell. 2018. Mitigating Unwanted Biases with Adversarial Learning. In AAAI (New Orleans, LA, USA). NY, USA, 335–340. https://doi.org/10.1145/3278721.3278779