Data Wrangling for Fair Classification

Lacramioara Mazilu, University of Manchester, Manchester, United Kingdom, lacramioaramazilu@gmail.com
Norman W. Paton, University of Manchester, Manchester, United Kingdom, npaton@manchester.ac.uk
Nikolaos Konstantinou, University of Manchester, Manchester, United Kingdom, nkons@live.com
Alvaro A. A. Fernandes, University of Manchester, Manchester, United Kingdom, fernandesaaa@gmail.com

ABSTRACT
Whenever decisions that affect people are informed by classifiers, there is a risk that the decisions can discriminate against certain groups as a result of bias in the training data. There has been significant work to address this, based on pre-processing the inputs to the classifier, changing the classifier itself, or post-processing the results of the classifier. However, upstream from these steps, there may be a variety of data wrangling processes that select and integrate the data that is used to train the classifier, and these steps could themselves lead to bias. In this paper, we propose an approach that generates schema mappings in ways that take into account bias observed in classifiers trained on the results of different mappings. The approach searches a space of candidate interventions in the mapping generation process, which change how these mappings are generated, informed by a bias-aware fitness function. The resulting approach is evaluated using the Adult Census and German Credit data sets.

KEYWORDS
fairness, bias, classification, data preparation, data wrangling

© 2021 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2021 Joint Conference (March 23–26, 2021, Nicosia, Cyprus) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION
Fairness in machine learning is important because machine learning supports decision making; problems resulting from bias have been widely recognised, and there are many results on fairness-enhancing interventions in machine learning [2]. Proposals have been made for taking fairness into account before, during and after learning.

Although proposals that focus on interventions before learning have considered a variety of techniques for selecting or modifying training data [7], most of this work assumes that the training data is already available. As such, the interventions take place after the data preparation steps that select and integrate data sets for analysis. As data scientists spend a considerable portion of their time on such steps, an opportunity seems to exist for making interventions earlier in the data processing pipeline.

In our previous work [12], we investigated how data wrangling processes could be adjusted to address dataset properties that can give rise to bias, specifically sample size disparity and proxy attributes. Such interventions are promising, in that they can be applied to unlabelled data, but there is no direct guarantee that addressing the underlying property will in fact reduce classifier bias.

In this paper, we investigate interventions in the data wrangling process that directly target classifier bias, when labels are available. Specifically, we assume a situation in which data preparation selects and combines datasets for use in training, and we intervene in data preparation to ensure that these selection and combination decisions are informed by the fairness of the resulting classifier. Similarly to other work, our focus is on the pre-training step; however, we do not rely on the assumption that the data is already gathered and that fairness is achieved through operations on the integrated data source. Our focus is on creating the integrated data source in a fashion that takes fairness measurements into consideration.

The contributions are as follows:
(1) The proposal of a strategy for fairness-aware data preparation for classification.
(2) The realisation of the strategy as a search for fair data preparation plans.
(3) An evaluation of (2) for benchmark data sets that shows how the interventions can improve a specific fairness metric, namely demographic parity. Also, we show how the accuracy of the trained classifier is impacted by the interventions.

The remainder of this paper is structured as follows. Section 2 provides the context for this work by reviewing related results. Section 3 describes some technical background on data preparation required for later sections. Section 4 details the approach, in which a space of alternative wrangling strategies is explored. Section 5 presents an empirical evaluation, and Section 6 reviews progress against the objectives.

2 RELATED WORK
The problem considered in this paper is as follows: Assume we have access to data sets that can be used to populate a target schema that is suitable for training a classifier. Combine the data sets in such a way as to reduce the bias in the generated classifier.

Previous related work has tended to assume that there is an existing data set that is suitable for training the classifier. In such a case, the data is considered to have a sensitive attribute (such as gender or race), and the objective is to train a classifier that reduces bias with respect to some outcome, represented by labels in the data (such as a decision to hire or promote an individual).

This section reviews existing work on fairness in machine learning, looking at interventions designed to reduce bias that take place before, during and after training. The emphasis is on before training, as wrangling falls there.

Before. Approaches to reducing bias that influence the data used for training may remove attributes that correlate with the sensitive attribute, change the labels to reduce bias, choose samples of
the available data, or assign weights to different rows that impact their influence on the classifier [7]. Alternatively, pre-processing techniques may directly tackle properties of the underlying data set that may be problematic, such as coverage [10].

There has been less work on data wrangling, the focus of this paper; although some papers explicitly discuss data preparation, such work has generally addressed later stages of the pipeline than this paper. Valentim et al. [15] empirically compare several interventions, specifically the removal of the sensitive attribute, feature discretisation and sampling techniques. Accinelli et al. [1] consider coverage within data preparation, though primarily by selecting data rather than determining how the data should be wrangled.

In our previous work [12], we steer the generation of data preparation plans in ways that reflect risk factors for bias. The current paper follows a similar approach, in the sense that plan generation is steered, but directly calls the classifier on candidate plans, thereby steering the plan generation based on direct evidence of bias.

During. Approaches to reducing bias within the learning algorithm are often associated with metrics that quantify some notion of bias, and the learning algorithm then explicitly trades off bias with accuracy [13]. For example, Hajian et al. [5] provide techniques for constraining the levels of bias exhibited by classification rules to a specified level.

After. Approaches to reducing bias after learning operate by selectively changing labels at prediction time (e.g., [6, 8]).

Work on bias-aware data preparation can be seen as complementary to most other work on fair classification; benefits that arise with respect to fairness in upstream processes simply reduce the scale of the problem for later interventions.

Figure 1: Data wrangling pipeline

3 TECHNICAL BACKGROUND
The approach to fair data preparation depends on identifying interventions during data wrangling that affect the behaviour of a wrangling pipeline. We build on the VADA data wrangling system [9], the relevant steps of which are illustrated in Figure 1. In VADA, given a target schema definition and the data sources, the system can generate schema mappings that populate the target. We particularly exploit the automatic generation of data preparation plans, to generate new ways of integrating the data. The steps that are relevant to this paper are:

Matching: identify relationships between source and target attributes, where the former may be suitable for providing data for the latter.
Profiling: identify candidate keys for the source relations, and (partial) inclusion dependencies that identify overlaps between attribute values in different sources.
Mapping: informed by the results of matching and profiling, generate candidate mappings in the form of views that can be used to populate the target table from the sources [11].
Selection: given the results of the mappings, select the top k results from the mappings that best satisfy some criterion. In this paper, the criterion used is completeness – prefer the mappings with the fewest missing values.
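To make the artefacts produced by these steps more concrete, the following sketch shows minimal representations of matches and (partial) inclusion dependencies, together with a completeness-based selection of the top-k mapping results. It is illustrative only, not the VADA implementation; all names, and the representation of a mapping result as a list of dictionaries, are assumptions.

```python
# A minimal sketch (not the VADA implementation) of the artefacts produced by
# matching, profiling, and selection; names and representations are illustrative.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class Match:
    source_table: str      # e.g. "S1"
    source_attr: str       # e.g. "name"
    target_attr: str       # attribute of the target schema, e.g. "name"

@dataclass(frozen=True)
class InclusionDependency:
    left: Tuple[str, str]   # (table, attribute), e.g. ("S1", "name")
    right: Tuple[str, str]  # (table, attribute), e.g. ("S2", "name")
    overlap: float          # theta: degree of value overlap between the attributes

def completeness(rows: List[dict]) -> float:
    """Fraction of non-null cells in the result of a mapping."""
    cells = [v for row in rows for v in row.values()]
    return sum(v is not None for v in cells) / len(cells) if cells else 0.0

def select_top_k(mapping_results: List[List[dict]], k: int) -> List[List[dict]]:
    """Selection step: prefer the mappings with the fewest missing values."""
    return sorted(mapping_results, key=completeness, reverse=True)[:k]
```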
4 APPROACH
4.1 Overview of Approach
The approach to fair data preparation for classification involves exploring a space of candidate data preparation plans, and assessing the fairness of the classifiers built using the results of these plans. The steps in the approach are illustrated in Figure 2, and are discussed below.

In the approach, a Candidate Solution captures a collection of interventions into the wrangling process that are deemed to be helpful for reducing bias. Specifically, the interventions involve removing matches or inclusion dependencies from consideration during data wrangling. For example, assume that m1 is a match and i1 is an inclusion dependency. It is possible that the data resulting from the wrangling pipeline in Figure 1 produces a fairer output data set when: (i) m1 is available but i1 is not; (ii) m1 is not available but i1 is available; (iii) m1 and i1 are both available; or (iv) neither m1 nor i1 is available. These different environments in which to wrangle constitute a search space of interventions that lead to the production of different data sets that may lead to more or less biased classifiers.

Assume we have a match m : S.a → T.a' relating attribute a from source S to attribute a' in target T. The removal of m from the wrangling process can have one of the following effects: (i) m is replaced by another match involving the same table, so that now S.a'' → T.a' and as a result T.a' is populated differently; (ii) m is replaced by another match involving a different table, so that now P.a → T.a' and as a result T.a' is populated differently; or (iii) no alternative match is found, so T.a' is populated with null values.

Assume we have an inclusion dependency I = P.a ⊆θ S.b between attributes in the sources P and S, where θ is the degree of overlap between the attributes a and b. The removal of I can have one of the following effects: (i) there is another inclusion dependency between P and S, which leads to the same tables being considered for joining, but on different attributes; (ii) there is another, indirect way in which P and S can be joined, through an intermediate table that can be used instead by the mapping generator; or (iii) the tables are no longer joined with each other.
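As an illustration of how a Candidate Solution constrains the wrangling process, the following sketch filters the matches and inclusion dependencies before mapping generation. It is an interpretation, not the paper's implementation; generate_mappings is an assumed callable standing in for the pipeline of Figure 1.

```python
from typing import Callable, FrozenSet, Iterable, List

# An intervention is a match or an inclusion dependency that is withheld from the
# pipeline; interventions are treated here as opaque, hashable objects.
CandidateSolution = FrozenSet[object]

def wrangle(matches: Iterable, inds: Iterable,
            solution: CandidateSolution,
            generate_mappings: Callable[[List, List], List]) -> List:
    """Run mapping generation (a callable standing in for the Figure 1 pipeline)
    with the matches and inclusion dependencies in `solution` excluded."""
    usable_matches = [m for m in matches if m not in solution]
    usable_inds = [i for i in inds if i not in solution]
    return generate_mappings(usable_matches, usable_inds)

# The four wrangling "environments" for a match m1 and an inclusion dependency i1
# described above correspond to the candidate solutions (sets of exclusions):
# frozenset(), frozenset({"i1"}), frozenset({"m1"}), frozenset({"m1", "i1"})
```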
To consider different ways of preparing the data, the approach is to explore the space of possible interventions, with a view to identifying combinations that lead to less biased results. The space of alternative interventions is explored using Tabu search [4], a local search algorithm that employs a diversity heuristic to avoid becoming stuck in local optima. When describing the approach, we will periodically refer to a running example.

Example 1. Assume an example where a model is trained to predict which individuals will be hired, with the sensitive binary attribute gender, which can have the values male or female. Consider a target schema T(name, age, qualification, gender, hire), which
indicates the hiring decision of the person with the given name, age, qualification and gender.

Figure 2: Steps in approach.

4.2 The Steps
In this section, each of the steps in Figure 2 is described in turn.

Stopping condition. Each iteration of the search uses the data wrangling pipeline in Figure 1 to create new data preparation plans, the data from which is used to train a classifier. As such, each iteration can be considered to be quite expensive, and certainly more expensive than evaluating a typical fitness function. As a result, the search cannot be allowed to carry out arbitrary numbers of iterations. The search terminates with its best solution when it has completed a specified number of wrangles (the number of executions of the data wrangling pipeline in Figure 1 during the search from Figure 2). The number is an input parameter that is usually chosen by considering the trade-off between runtime and bias reduction (an analysis is reported in our previous work [12]).
Create neighborhood. A Candidate Solution is a set of interventions. An intervention is a match or inclusion dependency that is to be excluded from consideration when generating an integration using the wrangling pipeline in Figure 1. The Current Candidate Solution is the starting point for the creation of a neighbourhood. The neighbourhood is a set of Candidate Solutions, each of which is obtained by extending the Current Candidate Solution with a single intervention from a set of Candidate Interventions. The Candidate Interventions are the matches and inclusion dependencies used in the mappings created by running the wrangling pipeline in the context of the interventions from the Current Candidate Solution.

Example 2. Following on from Example 1, assume that there are sources S1(name, age, height) and S2(name, hire), and that there are matches m1 from S1.name to T.name, m2 from S1.age to T.age, m3 from S2.name to T.name, and m4 from S2.hire to T.hire. Assume that a view has been produced for populating the target that contains the join of S1 and S2 on S1.name = S2.name, informed by an inclusion dependency i1 between S1.name and S2.name. In addition, assume that the Candidate Interventions are {m1, m2, m3, m4, i1}. Further, assume that the Current Candidate Solution is {m1}. Then the neighbourhood will consist of the Candidate Solutions {m1, m2}, {m1, m3}, {m1, m4} and {m1, i1}.
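A minimal sketch of this step follows, using string stand-ins for the matches and the inclusion dependency of Example 2; the Tabu list filtering mentioned under Update Tabu list below is included as a parameter.

```python
# Create Neighborhood: each neighbour extends the Current Candidate Solution with
# one further candidate intervention, skipping solutions already on the Tabu list.
def create_neighbourhood(current: frozenset,
                         candidate_interventions: set,
                         tabu_list: list) -> list:
    neighbourhood = []
    for intervention in candidate_interventions - set(current):
        neighbour = current | {intervention}
        if neighbour not in tabu_list:          # avoid points already visited
            neighbourhood.append(neighbour)
    return neighbourhood

# Example 2 rendered with string stand-ins:
current = frozenset({"m1"})
candidates = {"m1", "m2", "m3", "m4", "i1"}
print(create_neighbourhood(current, candidates, tabu_list=[]))
# -> the four candidate solutions {m1,m2}, {m1,m3}, {m1,m4}, {m1,i1} (in some order)
```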
Evaluate candidate solutions. Each of the Candidate Solutions in the neighbourhood is passed in turn to the data wrangling pipeline in Figure 1. The pipeline generates mappings, using automatically-derived matches and inclusion dependencies, but excluding those from the interventions of the Candidate Solution. The resulting mappings are evaluated, thereby associating each Candidate Solution with a data set that populates the target schema.

Train classifier. Having generated a data set for each candidate solution, the next step is to train a classifier on each of the data sets, and compute the bias of each of the resulting classifiers. The overall approach is independent of the type of classifier used, but in the experiments we use the J48 implementation of the C4.5 decision tree learning algorithm [14].

For each data set, we carry out k-fold cross validation. As such, the classifier is trained k times using k−1 folds and evaluated on the remaining fold, each time with a different evaluation fold. Then, its fairness is assessed by averaging the bias over the evaluation folds. The overall approach is independent of the notion of bias used, but in the experiments we use demographic parity [3]. Demographic parity measures group fairness. Where the Positive Rate (PR) for a group is the fraction of the items in the group that have the given outcome (e.g., the fraction of the people in the group that are hired), the demographic parity for a fold i is computed as:

    dp_i = abs(PR_i(G_1) - PR_i(G_2))    (1)

where G_1 is one group (e.g., females) and G_2 is the other (e.g., males). Where dp = 0, there is no bias.

The overall demographic parity for a Candidate Solution is computed as:

    dp_sol = (Σ_{i=1}^{k} dp_i) / k    (2)

Note that the role of training during the execution of the steps in Figure 2 is to obtain evidence to inform the search for fairer plans, and not to produce a final classifier.
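The following sketch shows how Equations (1) and (2) could be computed. It is illustrative only: scikit-learn's DecisionTreeClassifier and StratifiedKFold are used as stand-ins for the Weka J48 classifier and the cross-validation setup of the experiments, and the feature matrix X, labels y and sensitive attribute vector are assumed to be NumPy arrays with binary (0/1) encodings.

```python
# Per-fold demographic parity (Eq. 1) averaged over k folds (Eq. 2); a sketch,
# assuming binary labels/groups encoded as 0 and 1.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

def positive_rate(y_pred, sensitive, group, positive_label=1):
    mask = sensitive == group
    return (y_pred[mask] == positive_label).mean() if mask.any() else 0.0

def demographic_parity(X, y, sensitive, groups=(0, 1), k=5, seed=0):
    """dp_sol: the mean over folds of |PR_i(G1) - PR_i(G2)|."""
    dps = []
    for train_idx, eval_idx in StratifiedKFold(n_splits=k, shuffle=True,
                                               random_state=seed).split(X, y):
        # DecisionTreeClassifier stands in for the J48 (C4.5) classifier used in the paper.
        clf = DecisionTreeClassifier(random_state=seed).fit(X[train_idx], y[train_idx])
        y_pred = clf.predict(X[eval_idx])
        s_eval = sensitive[eval_idx]
        dp_i = abs(positive_rate(y_pred, s_eval, groups[0])
                   - positive_rate(y_pred, s_eval, groups[1]))
        dps.append(dp_i)
    return float(np.mean(dps))
```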
Choose best solution. The best solution should have low bias. However, as the approach explores a space of interventions that are liable to lead to increasingly sparse results (due to the deletion of matches and inclusion dependencies), it is important not to end up with a solution that has low bias but poor accuracy as a result of the provision of sparse training data. So, to prefer solutions that retain more data that can be used as evidence by the classifier, the objective function for the search is:

    obj = (w_1 * dp_sol) + (w_2 * r_nulls)    (3)

where r_nulls is the ratio of nulls in the data set, and w_1 and w_2 are weights, which are both set to 0.5 in the experiments, i.e., completeness and fairness are equally important in the end result. The search thus prefers solutions with low bias and high completeness.
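Equation (3) rendered as code; a minimal sketch that assumes the wrangled result is represented as a list of dictionaries so that the null ratio can be computed directly.

```python
# Objective of Equation (3): a weighted sum of bias and the ratio of nulls,
# with w1 = w2 = 0.5 as in the paper's experiments; lower is better.
def null_ratio(rows):
    cells = [v for row in rows for v in row.values()]
    return sum(v is None for v in cells) / len(cells) if cells else 1.0

def objective(dp_sol, r_nulls, w1=0.5, w2=0.5):
    return w1 * dp_sol + w2 * r_nulls
```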
Update Tabu list. The Tabu search maintains a collection of points in the search space that have already been visited. This list is updated here to include all solutions that have been evaluated in the current iteration.

The Tabu List is also used during the Create Neighborhood step, to avoid considering parts of the search space that have been explored before.

Example 3. Following on from Example 2, assume that the current Tabu list is [{m1}]. Then, after the exploration of the Candidate Solutions in the neighbourhood, i.e., {m1, m2}, {m1, m3}, {m1, m4} and {m1, i1}, these will be added to the Tabu list, so that the search keeps track of the plans that have been explored and need not explore them again.

Tabu diversity mechanism. Tabu search is essentially a greedy local search algorithm, with a single Current Candidate Solution, which is normally replaced at each iteration by the best solution in its neighbourhood in accordance with the objective function. To avoid being trapped in local optima, a diversity mechanism can be used to jump to a new location in the search space. In our implementation of Tabu search, when the best solution has not been improved for several iterations, the new Current Candidate Solution is set to the highest scoring of a subset of plans from the Tabu List.
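Bringing the steps together, the following sketch gives one interpretation of the loop in Figure 2 (it is not the authors' code). It reuses the create_neighbourhood function sketched earlier; wrangle_and_score is an assumed callable that runs the pipeline for a given solution and returns its objective value from Equation (3), and the patience and sample-size parameters of the diversification step are assumptions.

```python
# A sketch of Tabu search over candidate solutions, bounded by a wrangle budget.
import random

def tabu_search(candidate_interventions, wrangle_and_score,
                max_wrangles=80, patience=3, sample_size=3, seed=0):
    rng = random.Random(seed)
    current = frozenset()                        # start with no interventions
    scores = {current: wrangle_and_score(current)}
    best = current
    tabu_list = [current]
    wrangles, stale = 1, 0

    while wrangles < max_wrangles:               # stopping condition: wrangle budget
        neighbourhood = create_neighbourhood(current, candidate_interventions, tabu_list)
        if not neighbourhood:
            break
        evaluated = []
        for solution in neighbourhood:           # evaluate candidate solutions
            if wrangles >= max_wrangles:
                break
            scores[solution] = wrangle_and_score(solution)
            tabu_list.append(solution)           # update Tabu list
            evaluated.append(solution)
            wrangles += 1
        if not evaluated:
            break
        current = min(evaluated, key=scores.get) # choose best solution in the neighbourhood
        if scores[current] < scores[best]:
            best, stale = current, 0
        else:
            stale += 1
            if stale >= patience:                # diversity mechanism: jump elsewhere
                subset = rng.sample(tabu_list, min(sample_size, len(tabu_list)))
                current, stale = min(subset, key=scores.get), 0
    return best, scores[best]
```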
5 EVALUATION
5.1 Experimental Setup
Data Sets. To investigate the effectiveness of the approach described in Section 4, we have evaluated it in the context of data preparation scenarios derived from the following benchmark data sets:

• German Credit^1: The role of the classifier is to determine whether or not it is risky to give credit to a person. The dataset contains 1,000 tuples, each with 21 attributes. The sensitive attribute is Gender, which contains the values male (69%) and female (31%). The class attribute is Risky, which contains the values Yes or No, stating whether or not it is risky to give someone a loan.
• Adult Census^2: The role of the classifier is to predict if a person makes over $50K a year. The dataset contains 10,000 tuples^3, each with 14 attributes. The sensitive attribute is Gender, which contains the values male (66%) and female (33%). The class attribute is Salary, which contains the values >50K and <=50K, stating whether someone is predicted to earn more than 50K per year or not.

These are benchmark data sets that have been used in other studies on fairness, e.g., [15–18]. However, each is supplied as a single table, whereas to experiment on data wrangling, several tables are required. So, for the experiment, each of the data sets is partitioned horizontally into groups of rows, which are then vertically partitioned. This subdivision of the tables creates different ways in which the target can be populated, and allows interventions to take place that apply to subsets of the data.

For Adult Census, we horizontally divided the dataset using the 42 country values it contains, and then divided each horizontal partition into 2 vertical partitions, amounting to 84 initial sources. For German Credit, we divided the initial dataset into 4 horizontal partitions, each of which was then divided into 3 vertical partitions, resulting in 12 initial sources.

In the experiments, the number of sources is varied; adding additional sources provides more ways of populating the target, and in turn more potential interventions to improve fairness. In order to create opportunities for different intervention plans, we created alternative sources with synthetic constant values for non-sensitive attributes. Thus, for each benchmark dataset, we created five scenarios, each with an increase of 25% alternative sources over the previous scenario; e.g., for Adult Census, we created five scenarios with 84, 105, 127, 147, and 168 sources, each with a 0, 25, 50, 75, or 100% increase over the initial number of sources (84). The target schema is that of the original benchmark data sets.
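As an illustration of the kind of partitioning described above, the following sketch splits a single table, represented as a list of dictionaries, into horizontal groups and then into vertical fragments. It is illustrative only; the function name, the grouping attribute, and the handling of keys are assumptions rather than the exact procedure used in the experiments.

```python
from typing import Dict, List

def partition_into_sources(rows: List[dict], group_by: str,
                           vertical_splits: List[List[str]]) -> Dict[str, List[dict]]:
    """Split one table into horizontal groups on `group_by` (e.g. the country
    attribute for Adult Census), then split each group vertically; each fragment
    becomes one source."""
    groups: Dict[object, List[dict]] = {}
    for row in rows:                                  # horizontal partitioning
        groups.setdefault(row[group_by], []).append(row)
    sources: Dict[str, List[dict]] = {}
    for g, g_rows in groups.items():
        for v, attrs in enumerate(vertical_splits):   # vertical partitioning
            # in practice each vertical fragment would keep a key attribute so
            # that the fragments can later be re-joined by the mapping generator
            sources[f"src_{g}_{v}"] = [{a: r.get(a) for a in attrs} for r in g_rows]
    return sources
```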
Configuring the search. There are a number of parameters that control the behaviour of the search. The maximum number of wrangles for both data sets is set to 80; this is to obtain manageable experiment run times, while also allowing enough searching to generate plausible results. The value was obtained after a sensitivity analysis. In addition, up to the top 130 mappings are obtained from mapping generation; in practice, this is more than sufficient for the scenarios used, and often the mapping generator will produce significantly fewer mappings than this.

For the German Credit data set, the maximum number of tuples in the result of the wrangling process is set to 620 (the original data set contained 1,000 rows). This is to ensure that the results of the fair wrangling can be selective, leaving out results that are associated with more bias. For the same reason, the maximum number of tuples for the Adult Census dataset is 6,616 out of 10,000.

The results of the classifiers are obtained using 5-fold cross validation.

Classifier. The reported results are for the J48 decision tree classifier from Weka^4. The experiments were run with other classifiers, such as Logistic Regression, obtaining negligible variation between the experimental results; thus, we report only one set of results. Although the experiments involve binary classification tasks, the overall approach does not depend upon this.

Baseline. The baseline in the experiments is obtained by running the wrangling process without the approach from Section 4. Thus we are able to obtain a direct indication of the effectiveness of the interventions made. Note that there is no direct competitor in the literature with which to compare.

Experimental setup. The experiments were run on an Intel Core i5 with 2×2.7 GHz, and 8 GB RAM.

^1 https://www.kaggle.com/uciml/german-credit
^2 https://archive.ics.uci.edu/ml/datasets/adult
^3 Due to limited resources, we used stratified sampling to choose 10,000 tuples out of the 48,842 tuples in the original dataset. The sample maintains the properties of the initial dataset.
^4 https://www.cs.waikato.ac.nz/ml/weka/

5.2 Results
German Credit Data. Figures 3 and 4 show the results of the experiment with the German Credit data set. For both figures, the horizontal axis reports the number of sources in each scenario. In Figure 3, the vertical axis represents the demographic parity bias (computed using Equation 2), while, in Figure 4, the vertical axis represents the accuracy of the classifier trained and evaluated on the output datasets. Both bias and accuracy are averages computed through cross validation. The following can be observed: (i) The fair wrangling approach performs as well as or better than the baseline for demographic parity in all scenarios. (ii) Both methods sometimes have the same demographic parity for different numbers of sources; this is because the same plan may be selected even though the number of available sources grows. (iii) The fairness interventions do not always provide improving results as more sources become available; this is because the search is not exhaustive, and thus, although the larger numbers of sources provide more opportunities for effective interventions to be discovered, there is no guarantee that the best intervention plans will be obtained. As mentioned in Section 5.1, the parameter for the amount of explored search space was set based on a sensitivity analysis similar to the one reported for
the experiments presented in our previous work [12]. Furthermore, the additional sources may not always provide good opportunities to increase fairness. (iv) Accuracy did not suffer as a result of the interventions.

Figure 3: Demographic parity for the German Credit data set for different numbers of sources.

Figure 4: Classifier accuracy for the German Credit data set for different numbers of sources.

Adult Census Data. Figures 5 and 6 show the results of the experiment with the Adult Census data set. The following can be observed: (i) The results are similar to those for the German Credit data set; we do not repeat the recurring explanations here. (ii) The overall bias levels are smaller for the Adult Census data set than for the German Credit one, and in several cases bias is almost completely removed. (iii) The accuracy for the Adult Census data set is higher than that in the German Credit scenarios. This may be due to the difference in the amount of data that is used to train the classifier, i.e., Adult Census contains significantly more rows than the German Credit training set, which may lead to an underfit model for the German Credit scenarios.

Figure 5: Demographic parity for the Adult Census data set for different numbers of sources.

Figure 6: Classifier accuracy for the Adult Census data set for different numbers of sources.

                   Runtime (in seconds) for scenarios with percent of synthetic sources
    Dataset             0%       25%      50%      75%      100%
    German Credit       427      562      780      1,002    1,309
    Adult Census        1,416    48       1,691    1,605    41

Table 1: Runtime for varying number of sources on each dataset.

Runtime. Table 1 shows the runtime for each dataset considering its different scenarios, i.e., with varying percentages of synthetic data sources (as explained in Section 5.1). The runtime is expressed in seconds. Note that for the same percentage, the number of sources for each dataset differs, e.g., for the 50% case, the German Credit dataset scenario contains 18 sources, while the Adult Census scenario contains 127 sources.

For the German Credit dataset, the runtime increases as the number of sources increases. This is due to the fact that i) the Tabu search space increases and there are more intervention plans to be explored, and ii) the wrangling process is longer in the cases with more sources, because each of the wrangling components described in Section 3 depends on the number of input data sources; e.g., the mapping component considers combinations between all subsets of sources, and thus the number of subsets increases with the number of initial sources [11].

For the Adult Census dataset, it is interesting to notice that there is no clear pattern in the runtime values. This is due to the fact that for some of the scenarios, e.g., 25% and 100%, the search ends early when the bias is below a set threshold. Such a threshold is needed because, in the case of demographic parity, one cannot expect exactly equal positive rates for the male and female values of the sensitive attribute, so an acceptable difference needs to be allowed. In these scenarios, we set the threshold at 0.05, i.e., dp_sol < 0.05 (see Equation 2). This threshold is set according to the fairness measure used, e.g., demographic parity in our case. For the Adult Census scenarios with 25% and 100% synthetic sources, in both cases the wrangling process stops after
the initial wrangle, as it detects that the output dataset presents an acceptable bias value; thus, it does not prepare further interventions from the initial wrangle.
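A minimal sketch of this early-termination check, assuming dp_sol is computed as in Equation (2); the function name is an assumption.

```python
# Early stop: the search ends as soon as the wrangled dataset's demographic
# parity falls below the acceptable threshold (0.05 in these scenarios).
DP_THRESHOLD = 0.05

def acceptable_bias(dp_sol: float, threshold: float = DP_THRESHOLD) -> bool:
    return dp_sol < threshold
```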
6 CONCLUSIONS
Our previous work tackled underlying causes of bias for unlabelled data [12]. This paper complements that work for labelled data, by wrangling in a way that directly responds to the bias observed when wrangling with different mappings. This work focuses on creating fair datasets by using demographic parity as the way to measure bias. However, other metrics may be investigated, e.g., equalized odds, equal opportunity, etc., depending on the properties that are desired of the output dataset.

We now revisit the claimed contributions from the introduction:
(1) The proposal of a strategy for fairness-aware data preparation for classification. The proposed strategy searches a space of interventions that impact on how data is integrated for analysis. The approach builds on the ability to automate aspects of the data preparation process. Interventions are carried out, integrations generated, and the resulting data is used to train classifiers. The classifiers are then tested for bias, and the most promising interventions investigated further.
(2) The realisation of the strategy as a search for fair data preparation plans. The strategy has been implemented as a Tabu search, over a space of interventions that include the exclusion of matches and inclusion dependencies. Removing selected matches and inclusion dependencies may reduce bias by removing problematic mappings, or by prioritising alternative ways of preparing the data.
(3) An evaluation of (2) for benchmark data sets that shows how the interventions can improve a specific fairness metric, namely demographic parity. Also, we show how the accuracy of the trained classifier is impacted by the interventions. Results using two evaluation scenarios show that the approach reduces bias in comparison with a non-fairness-aware base case. Also, the evaluation shows that, compared to the baseline case, the accuracy of the classifier trained as a result of the interventions did not suffer.


ACKNOWLEDGMENTS
We are pleased to acknowledge the support of the UK Engineer-
ing and Physical Sciences Research Council, through the VADA
Programme Grant (EP/M025268/1).

REFERENCES
[1] Chiara Accinelli, Simone Minisi, and Barbara Catania. 2020. Coverage-based Rewriting for Data Preparation. In Proceedings of the Workshops of the EDBT/ICDT 2020 Joint Conference, Copenhagen, Denmark, March 30, 2020 (CEUR Workshop Proceedings, Vol. 2578), Alexandra Poulovassilis et al. (Eds.). CEUR-WS.org. http://ceur-ws.org/Vol-2578/PIE4.pdf
[2] Sorelle A. Friedler, Carlos Scheidegger, Suresh Venkatasubramanian, Sonam Choudhary, Evan P. Hamilton, and Derek Roth. 2019. A comparative study of fairness-enhancing interventions in machine learning. In FAT*. ACM, 329–338. https://doi.org/10.1145/3287560.3287589
[3] Pratik Gajane. 2017. On formalizing fairness in prediction with machine learning. CoRR abs/1710.03184 (2017). arXiv:1710.03184 http://arxiv.org/abs/1710.03184
[4] Fred Glover. 1986. Future paths for integer programming and links to artificial intelligence. Computers & Operations Research 13, 5 (1986), 533–549. https://doi.org/10.1016/0305-0548(86)90048-1
[5] Sara Hajian and Josep Domingo-Ferrer. 2013. A Methodology for Direct and Indirect Discrimination Prevention in Data Mining. IEEE Trans. Knowl. Data Eng. 25, 7 (2013), 1445–1459. https://doi.org/10.1109/TKDE.2012.72
[6] Moritz Hardt, Eric Price, and Nathan Srebro. 2016. Equality of Opportunity in Supervised Learning. In NIPS (Barcelona, Spain). Red Hook, NY, USA, 3323–3331.
[7] Faisal Kamiran and Toon Calders. 2011. Data preprocessing techniques for classification without discrimination. KAIS 33 (2011), 1–33.
[8] Faisal Kamiran, Asim Karim, and Xiangliang Zhang. 2012. Decision Theory for Discrimination-Aware Classification. In ICDM. USA, 924–929. https://doi.org/10.1109/ICDM.2012.45
[9] Nikolaos Konstantinou, Edward Abel, Luigi Bellomarini, Alex Bogatu, Cristina Civili, Endri Irfanie, Martin Koehler, Lacramioara Mazilu, Emanuel Sallinger, Alvaro Fernandes, Georg Gottlob, John Keane, and Norman Paton. 2019. VADA: An Architecture for End User Informed Data Preparation. Journal of Big Data 6:74 (2019). https://doi.org/10.1186/s40537-019-0237-9
[10] Yin Lin, Yifan Guan, Abolfazl Asudeh, and H. V. Jagadish. 2020. Identifying Insufficient Data Coverage in Databases with Multiple Relations. Proc. VLDB Endow. 13, 11 (2020), 2229–2242. http://www.vldb.org/pvldb/vol13/p2229-lin.pdf
[11] Lacramioara Mazilu, Norman W. Paton, Alvaro A. A. Fernandes, and Martin Koehler. 2019. Dynamap: Schema Mapping Generation in the Wild. In Proceedings of the 31st International Conference on Scientific and Statistical Database Management (Santa Cruz, CA, USA) (SSDBM '19). Association for Computing Machinery, New York, NY, USA, 37–48. https://doi.org/10.1145/3335783.3335785
[12] Lacramioara Mazilu, Norman W. Paton, Nikolaos Konstantinou, and Alvaro A. A. Fernandes. 2020. Fairness in Data Wrangling. In 21st International Conference on Information Reuse and Integration for Data Science, IRI 2020, Las Vegas, NV, USA, August 11-13, 2020. IEEE, 341–348. https://doi.org/10.1109/IRI49571.2020.00056
[13] Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. 2019. A Survey on Bias and Fairness in Machine Learning. CoRR abs/1908.09635 (2019). arXiv:1908.09635 http://arxiv.org/abs/1908.09635
[14] J. Ross Quinlan. 1996. Bagging, Boosting, and C4.5. In Proceedings of the Thirteenth National Conference on Artificial Intelligence and Eighth Innovative Applications of Artificial Intelligence Conference, AAAI, William J. Clancey and Daniel S. Weld (Eds.). AAAI Press / The MIT Press, 725–730. http://www.aaai.org/Library/AAAI/1996/aaai96-108.php
[15] Inês Valentim, Nuno Lourenço, and Nuno Antunes. 2019. The Impact of Data Preparation on the Fairness of Software Systems. CoRR abs/1910.02321 (2019). arXiv:1910.02321 http://arxiv.org/abs/1910.02321
[16] Vladimiro Zelaya, Paolo Missier, and Dennis Prangle. 2019. Parametrised Data Sampling for Fairness Optimisation. In KDD XAI.
[17] Richard Zemel, Yu Wu, Kevin Swersky, Toniann Pitassi, and Cynthia Dwork. 2013. Learning Fair Representations. In ICML (Atlanta, GA, USA) (ICML '13). III-325–III-333.
[18] Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell. 2018. Mitigating Unwanted Biases with Adversarial Learning. In AAAI (New Orleans, LA, USA). NY, USA, 335–340. https://doi.org/10.1145/3278721.3278779