=Paper=
{{Paper
|id=Vol-2324/Paper06-NKonstantinou
|storemode=property
|title=Feedback Driven Improvement of Data Preparation Pipelines
|pdfUrl=https://ceur-ws.org/Vol-2324/Paper06-NKonstantinou.pdf
|volume=Vol-2324
|authors=Nikolaos Konstantinou,Norman Paton
|dblpUrl=https://dblp.org/rec/conf/dolap/KonstantinouP19
}}
==Feedback Driven Improvement of Data Preparation Pipelines==
Nikolaos Konstantinou and Norman W. Paton
School of Computer Science, University of Manchester, Manchester, UK
nikolaos.konstantinou@manchester.ac.uk, norman.paton@manchester.ac.uk
ABSTRACT

Data preparation, whether for populating enterprise data warehouses or as a precursor to more exploratory analyses, is recognised as being laborious, and as a result is a barrier to cost-effective data analysis. Several steps that recur within data preparation pipelines are amenable to automation, but it seems important that automated decisions can be refined in the light of user feedback on data products. There has been significant work on how individual data preparation steps can be refined in the light of feedback. This paper goes further, by proposing an approach in which feedback on the correctness of values in a data product can be used to revise the results of diverse data preparation components. Furthermore, the approach takes into account all the available feedback in determining which actions should be applied to refine the data preparation process. The approach has been implemented to refine the results of matching, mapping and data repair components in the VADA data preparation system, and is evaluated using deep web and open government data sets from the real estate domain. The experiments have shown how the approach enables feedback to be assimilated effectively for use with individual data preparation components, and furthermore that synergies result from applying the feedback to several data preparation components.

© 2019 Copyright held by the author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2019 Joint Conference (March 26, 2019, Lisbon, Portugal) on CEUR-WS.org.

1 INTRODUCTION

Data preparation is the process of transforming data from its original form into a representation that is more appropriate for analysis. In data warehouses, data preparation tends to be referred to as involving an Extract Transform Load (ETL) process [34], and for more ad hoc analyses carried out by data scientists it may be referred to as data wrangling [27]. In both cases, similar steps tend to be involved in data preparation, such as: discovery of relevant sources; profiling of these sources to better understand their individual properties and the potential relationships between them; matching to identify the relationships between source attributes; mapping to combine the data from multiple sources; format transformation to revise the representations of attribute values; and entity resolution to identify and remove duplicate records representing the same real world object.

This is a long list of steps, each of which can potentially involve data engineers: (i) deciding which data integration and cleaning operations to apply to which sources; (ii) deciding the order of application of the operations; and (iii) either configuring the individual operation applications or writing the rules that express the behaviour to be exhibited. Although there are many data preparation products, and the market for data preparation tools is estimated to be $2.9 billion [24], most of these products are essentially visual programming platforms, in which users make many, fine-grained decisions. The consequence of this is that data preparation is typically quoted as taking 80% of the time of data scientists, who would prefer to be spending their time on analysing and interpreting results¹.

¹ https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/33d344256f63

The high cost of data preparation has been recognised for a considerable period. For example, research into dataspaces [12] proposed a pay-as-you-go approach to data integration, in which there was an initial and automated bootstrapping phase, which was followed by an incremental improvement phase in which the user provided feedback on the data product. This gave rise to a collection of proposals for pay-as-you-go data integration and cleaning platforms [17], which in turn led to proposals for the use of crowds as a possible source of feedback [9]. This research provided experience with pay-as-you-go data management, without leading to many end-to-end systems; for the most part, feedback was obtained for a particular task (e.g. mapping selection, entity resolution) and used for that task alone. This is workable, but collecting feedback on lots of individual components is itself expensive, and thus not especially helpful for the complete, many-step data preparation process.

This, therefore, leaves open the question as to how to make a multi-step data preparation process much more cost effective, for example through automation and widespread use of feedback on data products. There are now some results on automating comprehensive data preparation pipelines. For example, in Data Tamer [29], machine learning is used to support activities including the alignment of data sets and instance level integration through entity resolution and fusion. In some respects Data Tamer follows a pay-as-you-go approach, as the training data used by the learning components is revised in the light of experience. Furthermore, in VADA [21], a collection of components (for matching, mapping generation, source selection, format transformation and data repair) are orchestrated automatically over data sources, informed by supplementary instance data drawn from the domain of the target schema [20]. However, to date, feedback has only been applied in VADA to inform the selection of mappings.

In this paper we investigate how feedback on the data product that results from the multi-component data preparation process in VADA can be used to revise the results of several of these wrangling components in a well-informed way. In particular, given feedback on the correctness of tuples in the data product, a feedback assimilation strategy explores a set of hypotheses about the reason for problems with the result. The statistical significance of these hypotheses is then tested, giving rise to the generation of a revised data integration process. The proposed approach thus uses the same feedback to inform changes to many different data preparation components, thereby seeking to maximise the return on the investment made in the provision of feedback.

The contributions of the paper are as follows:
Figure 1: Data Sources and Reference Data in a simple data preparation scenario. Specific values are coloured as follows: red font – incorrect value w.r.t. conditional functional dependencies (CFDs); green font – correct value w.r.t. CFDs; blue font – data used in mining CFDs.
(1) A technique for applying feedback on a data product across a multi-step data preparation process that both identifies statistically significant issues and provides a mechanism for exploring the actions that may resolve these issues.

(2) A realisation of the technique from (1) in a specific data preparation platform, where feedback is used to change the matches used in an integration, change which mappings are used, and change which data quality rules are applied.

(3) An empirical evaluation of the implementation of the approach from (2) that investigates the effectiveness of the proposed approach both for individual data preparation constructs (matches, mappings and CFDs) and for applying feedback across all these constructs together.

The remainder of the paper is structured as follows. Section 2 outlines the data preparation pipeline on which we build, and provides a running example that will be used in the later sections. Section 3 provides a problem statement and an overview of the approach. Section 4 details the individual components in the realisation of the approach, and presents the feedback assimilation algorithm. Section 5 evaluates the technique in a real estate application. Section 6 reviews other approaches to increasing the cost-effectiveness of data preparation, and Section 7 concludes.

Figure 2: Using feedback to improve the end product. The shaded red background denotes false positive feedback obtained on the initial end product, in the light of which the system is able to refine the data preparation process to yield the revised data product without the problematic values.

2 A DATA PREPARATION PIPELINE

This section provides an overview of the aspects of the VADA data preparation architecture that are relevant to the feedback assimilation approach that is the focus of the paper. The VADA architecture is described in more detail in earlier publications [20, 21].

2.1 Defining a Data Preparation Task

In VADA, instead of handcrafting a data preparation workflow, the user focuses on expressing their requirements, and then the system automatically populates the end data product. In particular, the user provides:

Input Data Sources: A collection of data sources that can be used to populate the result. Figure 1 illustrates the data sources in our running scenario. These sources include real estate property data (sources s1 to s3) and open government data (source s4).

Target Schema: A schema definition for the end data product. In the running example, the target schema consists of one table, property, with six attributes, namely price, postcode, income, bedroom_no, street_name, and location.

User Context: The desired characteristics of the end product; as the end data product is obtained by an automated process, many candidate solutions can be generated. The user context allows the system to select solutions that meet the specific requirements of the user [1]. The user's requirements are modelled as a weighted set of criteria, with the sum of their weights equal to one. Although in general different quality criteria can be used (such as completeness, consistency or relevance), in our running example we consider 6 criteria, each one on the correctness of a target schema attribute, and each with a weight of 1/6.

Data Context: Supplementary instance data associated with the target schema, which can be used as additional evidence to inform the automated data preparation process [20]. For example, in Figure 1, reference data is provided that supplies definitive address data.

Given the above information, and a user-defined targeted size for the end product of 6 tuples, from the data sources in Figure 1, the system can automatically produce the first end data product
in Figure 2. In the absence of feedback or any other user-defined characteristics, the system will select tuples that are as complete as possible to populate the end data product. However, as illustrated in Figure 2, feedback on the correctness of the result can be used to revise how the data product is produced, and thus improve its overall quality.

Figure 3: A Data Preparation Pipeline

2.2 Data Preparation Process and Components

Figure 3 illustrates the basic flow of events for the data processing pipeline in this paper, where the plus precedes and follows parallel tasks. First, the system is initialised using the sources and data context that the user has provided. Then, CFD Miner, Data Profiler and Matching components are run on the sources and data context. Given the matches and profiling data, the Mapping component generates a set of candidate mappings, over which Mapping Selection evaluates the user criteria to select the most suitable mappings for contributing to the end product. Subsequently, the Data Repair component repairs constraint violations that are detected on the end product. The components are now outlined in more detail:

Initialise: This component sets up the system Knowledge Base with metadata about the data available in the system, the target schema, user preferences and component configurations.

Matching: Given a set of sources and the target schema T, the matching component produces a set of matches between their attributes and the target schema attributes. Each match has a similarity score, with 0 ≤ score ≤ 1.

Profiling: This component analyses the available sources and produces a set of candidate keys and a set of inclusion dependencies among the sources.

Mapping Generation: Using the results of the two previous components as inputs, mapping generation searches the space of possible ways of combining the sources using unions or joins, producing a set of candidate mappings.

Mapping Selection: Given a set of user criteria, a set of candidate mappings, the target schema and a targeted size, mapping selection establishes how many tuples to obtain from each of the candidate mappings to populate a target schema instance, thus creating an end data product [1].

Data Repair: Repair takes place in the presence of reference data; reference data is considered to provide complete coverage of the domains for certain target columns. Data repair involves the cooperation of 2 components. First, CFD Miner is trained on the data context [11]. The component is configured by a support size parameter. In the example illustrated in Figure 2, with a support size of 2, it will produce a number of CFDs, including the following:

property([postcode] → [streetname], (M1 5BY || Cambridge Street))
property([postcode] → [locality], (M9 8QB || Manchester))
property([postcode] → [streetname], (OX2 9DU || Crabtree Road))
property([postcode] → [locality], (OX28 4GE || Witney))
property([postcode] → [streetname], (OX4 2DU || Oxford Road))

Given a set of CFDs and a dataset, the Data Repair component will identify violations of these CFDs, which can then be repaired, in the sense that violations are removed [11]. In Figure 2, rules are learnt from the repeated presence of the italicised tuples highlighted in blue in the reference data in Figure 2. Thus, the incorrect values Market Street, Crabtree Rd, etc. have been corrected to Lakeside Rise, Crabtree Road, etc., resulting in an end product that is consistent with respect to the reference data.

3 PROBLEM STATEMENT

This section provides more details on the problem to be solved, along with an overview of the approach to be followed. The problem can be described as follows.

Assume we have a data preparation pipeline P that orchestrates a collection of data preparation steps {s1, ..., sn}, to produce an end data product E that consists of a set of tuples. The problem is, given a set of feedback instances F on tuples from E, to re-orchestrate some or all of the data preparation steps si, revised in the light of the feedback, in a way that produces an improved end data product.

In this paper, we assume that the feedback takes the form of true positive (TP) or false positive (FP) annotations on tuples or attribute values from E. Where a tuple is labelled as a TP, this means that it is considered to belong in the result; and where a tuple is labelled as an FP, this means that it should not have been included in the result. When an attribute value is labelled as a TP, this means that the given value is a legitimate member of the domain of the column, and when it is labelled as an FP, this value should not have appeared in that column².

Given such annotations, the approach consists of the following steps:

(1) Given feedback F, identify a collection of hypotheses H that could explain the feedback. For example, if an attribute value is incorrect, the following are possible hypotheses: (i) a match that was used to associate that value in a source with this attribute in the target is incorrect; (ii) a mapping that was used to populate that value in the target is incorrect, for example joining two tables that should not have been joined; and (iii) a format transformation has introduced an error into the value.

(2) Given a hypothesis h ∈ H, review all the evidence pertaining to h from the feedback to establish if the confidence in the hypothesis is sufficient to suggest that it should be investigated further. For example, if the hypothesis is that a match is incorrect, all the feedback on data derived from that match should be considered together, with a view

² These annotations lend themselves to propagation as follows. If a tuple is marked as TP, all of its attribute values are marked as TP. If an attribute value is marked as FP, all tuples containing any of these attribute values are marked as FP.
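(The propagation rule described in footnote 2 can be sketched as follows. This is an illustrative reconstruction rather than the VADA implementation; the data structures, tuples keyed by an id with labels given as "TP"/"FP" strings, are our own assumptions.)

```python
def propagate(tuple_feedback, value_feedback, tuples):
    """Sketch of footnote 2's propagation rule: a tuple marked TP marks
    all of its attribute values TP; an attribute value marked FP marks
    every tuple containing that value FP."""
    value_labels = dict(value_feedback)  # (attribute, value) -> "TP" | "FP"
    tuple_labels = dict(tuple_feedback)  # tuple id -> "TP" | "FP"

    # A TP tuple propagates TP to each of its attribute values.
    for tid, label in tuple_feedback:
        if label == "TP":
            for attr_value in tuples[tid].items():
                value_labels.setdefault(attr_value, "TP")

    # An FP attribute value propagates FP to every tuple containing it.
    fp_values = {av for av, label in value_labels.items() if label == "FP"}
    for tid, record in tuples.items():
        if any(av in fp_values for av in record.items()):
            tuple_labels[tid] = "FP"

    return tuple_labels, value_labels
```

In the running example, marking the street_name value Market Street as FP would, under this rule, mark every end-product tuple containing that value as FP.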
to determining whether the match should be considered problematic.

(3) Given a hypothesis h ∈ H in which there is some confidence from feedback, identify actions that could be taken in the pipeline P. For example, if the hypothesis is that a match is incorrect, possible actions would be to drop that match or to drop all the mappings that use the match.

(4) Given that evidence from feedback F may support several different hypotheses, there is a space of possible actions that could be carried out, each leading to a different collection of data preparation steps. As a result, we must explore the space of candidate integrations that implement the different actions.

4 SOLUTION

This section provides additional details on how the steps from Section 3 are carried out in practice, and includes details on how the feedback can be used to inform actions on matches, mappings and CFDs. In particular, Section 4.1 identifies hypotheses that may be suggested by the available feedback; Section 4.2 describes how the statistical significance of such hypotheses can be ascertained; Section 4.3 identifies some actions that may be taken in response to a hypothesis; and Section 4.4 describes how the different aspects of the solution are brought together in an algorithm.

4.1 Hypotheses

Given an end data product on which some feedback has been collected, an obvious question is what the feedback can tell us about the way in which the data product has been produced. More specifically, given an end data product and some feedback that identifies problems with the data product, an obvious question is what went wrong in the production of this data product. For example, in Figure 1, the street_name Market Street is not consistent with the postcode M9 8QB, and thus could have been labelled as an FP by the user. It is straightforward to identify possible reasons for this problem, which here we term hypotheses. In this case, possible reasons include:

Matching: An incorrect match was used to relate the source to the target.
Mapping Generation: An incorrect mapping combined sources in an inappropriate way.
Data Repair: An incorrect data repair replaced the correct value with the wrong one.

In this paper, all these hypotheses are considered as possible explanations for the identified FP. The question then is when we should give credence to a hypothesis. Given the evidence available in the form of feedback, hypotheses are tested as follows:

Matching: If the data that results from a match is significantly worse than that produced from other matches for the same target schema attribute, then the match is suspicious.
Mapping Generation: If the result of a mapping is significantly worse than the results from the other mappings, then the mapping is suspicious.
Data Repair: If the repaired values from a repair rule are significantly worse than the other values for the repaired attribute, then the rule is suspicious.

The following section uses statistical techniques to compare matches, mappings and CFDs using evidence from feedback.

4.2 Hypothesis Testing

This section describes how hypotheses are tested with respect to the evidence from feedback.

As stated in Section 2, the user can define the desired characteristics of the end product in the form of a set of criteria. Some of these criteria can be determined without feedback (such as the completeness with respect to the data context of the values in a column), but some can be informed by feedback (such as relevance and correctness). In the examples and experiments in this paper, feedback is on correctness. For criteria that are based on feedback, Equation 1 is used, in which ĉ_s is the criterion evaluation on a source s, tp (resp. fp) the numbers of tuples marked by the user as true (resp. false) positives, and |s| is the source size.

ĉ_s = (1/2) (1 + (tp − fp) / |s|)    (1)

Thus, in the absence of feedback, the default value for a criterion is 0.5. However, if we have feedback on the correctness of an attribute to the effect that there are 6 tp values and 3 fp values, then if there are 100 tuples, the estimated correctness for the attribute in s is now (1/2)(1 + (6 − 3)/100) = 0.515.

We now proceed to establish when criteria estimates are significantly different using standard statistical techniques [7]³. The estimated values of a criterion ĉ on two sources s2 and s1 are considered to be significantly different when Equation 2 holds:

ĉ_s2 − ĉ_s1 > z √(se_s2² + se_s1²)    (2)

where ĉ_s2 (resp. ĉ_s1) is the evaluation of criterion ĉ on s2 (resp. s1), se_s2 and se_s1 are the standard errors for sources s2 and s1 respectively, calculated by Equation 3, and z is a statistical term measuring the relationship between a value and the mean of a group of values. For instance, a z-score of zero indicates that the value and the mean are identical. In our experiments we use the z-score that corresponds to a confidence level of 80%, i.e. 1.282.

Standard error is calculated by Equation 3 below:

se_s = √(ĉ_s (1 − ĉ_s) / L_s)    (3)

where s is a source, ĉ_s is the evaluated feedback-based criterion on source s, and L_s is the amount of feedback collected on source s.

Given the above, our estimation of a data criterion ĉ_s on a source s is ĉ_s ± e_s, where e_s is the margin of error for the data criterion evaluation on s, and 0 ≤ ĉ_s ± e_s ≤ 1. A source s can be either a set of values or a set of tuples. The formula for the margin of error is given in Equation 4:

e_s = z · se_s · √((|s| − L_s) / (|s| − 1))    (4)

where |s| is the number of elements in s (source size), and L_s the amount of feedback collected for source s. This is feedback either provided directly on the data product (if it is the end data product) or propagated to it. We only consider attribute-level feedback instances when evaluating L_s on a set of values, and we only consider tuple-level feedback instances when evaluating L_s on a set of tuples.

³ Such statistical techniques have been used before to compare mappings on the basis of feedback, with a view to targeting feedback collection [28]. In that work, where there is uncertainty as to which mappings should be preferred in mapping selection, additional feedback is obtained to reduce the uncertainty.

Figure 4: Detecting significantly different matches

Figure 5: Detecting significantly different mappings
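Equations 1 to 4, together with the no-feedback case, can be captured in a short sketch. The function names and overall framing are ours, not code from the paper; the no-feedback standard error follows from setting e_s = 1/2 and L_s = 0 in Equation 4.

```python
import math

Z = 1.282  # z-score for the 80% confidence level used in the paper

def criterion(tp, fp, size):
    """Equation 1: feedback-based estimate of a criterion on a source."""
    return 0.5 * (1 + (tp - fp) / size)

def standard_error(c_hat, feedback_count, size):
    """Equation 3, falling back to the no-feedback form (obtained by
    setting e_s = 1/2 and L_s = 0 in Equation 4) when no feedback exists."""
    if feedback_count == 0:
        return (1 / (2 * Z)) * math.sqrt((size - 1) / size)
    return math.sqrt(c_hat * (1 - c_hat) / feedback_count)

def margin_of_error(se, feedback_count, size):
    """Equation 4: margin of error with a finite-population correction."""
    return Z * se * math.sqrt((size - feedback_count) / (size - 1))

def significantly_worse(c1, se1, c2, se2):
    """Equation 2: true if the estimate on s1 is significantly below s2."""
    return c2 - c1 > Z * math.sqrt(se1 ** 2 + se2 ** 2)
```

For the worked example above, criterion(6, 3, 100) gives the 0.515 of the text, and with no feedback the margin of error computed from the fallback standard error recovers the default 1/2.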
In the absence of feedback on a set of tuples or values, given that ĉ_s is unknown, hence ĉ_s = 1/2 ± 1/2, we can solve Equation 4 by setting e_s = 1/2 and L_s = 0, and evaluate the standard error se_s for the source s (needed for evaluating significant difference using Equation 2) in the absence of feedback as:

se_s = (1 / (2z)) √((|s| − 1) / |s|)    (5)

Next, we discuss the comparisons we used when testing the different components in the system. The focus is on what is considered as s1 and s2 when evaluating Equation 2.

Testing matches for significant difference. Figure 4 shows the target schema T and a source s, with a match between attribute s.d and the target schema attribute T.d: m1: s.d ∼ T.d. When testing match m1 for significant difference, Equation 2 is evaluated on the projections on the matching attributes. Specifically, to test the significance of match m1, for the evaluation of Equation 2, we use value sets s1 and s2 (i.e., s.d and T.d), as illustrated in Figure 4. Note that the set s1 is the complete set of values for attribute s.d, regardless of whether the values are selected to populate the end product or not, while s2 is the greyed part of attribute T.d.

Consider the example in Figure 1, for which some feedback has been obtained on the initial end product in Figure 2, to the effect that the first two tuples have fp annotations for their bedroom_no. In Figure 2, the first 3 tuples have been obtained from Source 1, but involve an incorrect match of Source1.bathrooms with Target.bedroom_no. The correctness estimated for this match using Equation 1 is (1/2)(1 + (0 − 2)/3) = 0.167. The correctness for the values obtained from other matches on the same column, on which no feedback has been collected, is estimated using Equation 1 to be 0.5. Although with this small illustrative example statistical significance is difficult to establish, we can see that the correctness estimate associated with the match from Source1.bathrooms with Target.bedroom_no is lower than that for the other values in the column (obtained using matches involving source attributes Source2.bedroom_no and Source3.bed_num), and thus the match involving Source1.bathrooms with Target.bedroom_no may be identified as being suspicious.

Testing mappings for significant difference. Figure 5 shows an example of a target schema T populated with tuples selected from 5 candidate mappings. Candidate mappings m1 to m4 contribute to the solution, while m5 does not. For each of the candidate mappings that contributes to the solution, we evaluate the participating tuples against the rest of the tuples of the end product, using Equation 2. For instance, to evaluate whether mapping m1 is significantly different, we use s1 and s2 as illustrated in Figure 5.

It is important to mention that for mappings, before evaluation, we propagate to the mapping the feedback that the user has provided on the end product, and on the mappings themselves. Furthermore, the Mapping Selection component ensures that the tuples that are marked as true (resp. false) positives are selected (resp. not selected).

Testing CFDs for significant difference. In Figure 6 we see the result of a CFD on the end product. We mark in green the 2 attribute values for column c that are found to be correct, and in red the 3 values in column d found to be violating the CFD, and that were thus repaired. As before, we consider tuple sets s1 and s2 in Figure 6 as inputs to Equation 2. Note that the correct values are not considered in s1, as they have no effect on the end product. We only consider the tuples that were modified.

Figure 6: Detecting significantly different CFDs

4.3 Actions

Each of the hypotheses is associated with a respective action, typically set in this paper to ruling out the suspicious item before rerunning the data preparation pipeline. An item here is a match, a candidate mapping, or a CFD rule. The hypotheses and respective actions per component are thus summarised below:
Algorithm 1 Apply collected feedback set of CFDs. The resulting end data product can then be provided
1: function ApplyFeedback to the user for analysis or further feedback collection. Newly
2: for all m ∈ Matches do collected feedback instances are added to the existing ones, and
3: if significantlyWorse(m.a, T .a \ m.a) then propagated to the end product, and the candidate mappings. In
4: Matches.remove(m) addition, the feedback can periodically be applied to the whole
5: end if process using Algorithm 1 to generate a revised end product. The
6: end for process continues until the user finds the result fit for purpose
7: Mappinдs ← MappingGeneration(Matches, pro f ileData) and terminates data preparation pipeline.
8: for all map ∈ Mappinдs do Function significantlyWorse (lines 3, 9 and 16) is essentially
9: if significantlyWorse(map, T \ map) then testing Equation 2 with s 1 and s 2 as arguments, and returns true
10: Mappinдs.remove(map) if it holds, or false otherwise. The arguments for this function are
11: end if the ones illustrated as s 1 and s 2 in Figure 4 when detecting sig-
12: end for nificantly worse matches, Figure 5 when detecting significantly
13: endProduct ← MappingSelection(Mappinдs) worse mappings, and Figure 6 when detecting significantly worse
14: for all c f d ∈ CF Ds do rules. Next, using the same approach we detect and remove suspi-
15: s ← ViolatingTuples(T , c f d) cious mappings (lines 12–16) and suspicious CFDs (lines 18–23).
16: if significantlyWorse(s, T \ s) then As s in line 15 we define the set of tuples in T violating a given
17: CF Ds.remove(c f d) c f d.
18: end if
19: end for 5 EVALUATION
20: endProduct ← DataRepair(CF Ds)
21: end function
5.1 Experimental Setup
For the evaluation, we used as sources the following datasets:
(a) forty datasets with real-estate properties extracted from the
• Matching. If a match is suspicious, discard the match. web using OXpath [13], with contents similar to sources s 1 to s 3
This means that mapping generation (or any of the other in Figure 1, (b) English indices of deprivation data, downloaded
components down the line) will not take this match into from www.gov.uk, as shown in s 4 in Figure 1. As reference data,
account. we used open address data from openaddressesuk.org.
• Mapping Generation. If a mapping is suspicious, discard Then, to enable the automatic evaluation of correctness on
the mapping. This means that Mapping Selection will not the end product and throughout the pipeline, we constructed
take this mapping into account. a ground truth based on the available data, which we used for
• Data Repair. If a repair rule is suspicious, discard the our experiments, as follows: we manually matched, mapped,
rule. deduplicated, and then repaired the end product. This gave us a
dataset consisting of approximately 4.5k tuples.
We need not take actions for items that are found to be significantly better than others, as they are retained by default. As such, we only focus on removing suspicious items.

4.4 Algorithm

Algorithm 1 provides the basic sequence of events while assimilating feedback. It is assumed that the feedback assimilation takes place in the context of an integration, in which Matches is the set of existing matches, profileData contains the existing inclusion dependency and primary key data, and CFDs is a set of existing conditional functional dependencies, as produced in the upper part of Figure 3.

Then, the algorithm considers the hypotheses on matches, mappings and CFDs in turn; this order is chosen because changes to matches give rise to changes to mappings, and CFDs are applied to the results of mappings.

First, we iterate over the available matches (lines 2–6) and test whether any of these produces a result that is significantly worse than the results of other matches. T.a (line 3) is a target schema attribute that is the target of a match m with a source attribute m.a. Any match that yields a significantly worse result is removed.

The remaining matches are then used for mapping generation (line 7), and any generated mappings that perform significantly worse than the others are removed (lines 8–12).

A similar process is followed with CFDs, removing CFDs that are found to be problematic in the light of the feedback (lines 14–19). The end product is then repaired (line 20) using the reduced set of CFDs.

For the experiments, the match threshold was set to 0.6, the top 100 mappings are retained from mapping generation, and the support size for the data repair component was set to 5.

Each of the criteria in the user context was set to be the correctness of an attribute from the target schema. Thus, the user criteria are: correctness(price), correctness(postcode), correctness(income), correctness(bedroom_no), correctness(street_name), and correctness(location). They all have the same weight, 1/6. The correctness of each of these properties is estimated based on feedback. Mapping selection is configured to select what are predicted to be the best 1000 tuples from a subset of the generated mappings.

The workflow for the experiments is illustrated in Figure 7. We essentially automatically reran the workflow of Figure 3, collecting 25 feedback instances in each iteration, until 500 feedback instances on the end product had been collected and propagated. Feedback is generated randomly in the experiments, based on the correctness of the respective tuple with respect to the ground truth. In each iteration, half of the feedback is given at tuple level, and half at attribute level.

We then configured the system to test:

• Matching. This means running Algorithm 1 without lines 8–12, and without lines 14–19.
• Mapping generation. This means running Algorithm 1 without lines 2–6, and without lines 14–19.
• Data repair. This means running Algorithm 1 without lines 2–6, and without lines 8–12.
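The assimilation sequence of Algorithm 1 (Section 4.4) can be sketched in Python. This is a minimal illustration under stated assumptions, not the system's implementation: the encoding of feedback counts for an item versus the remaining items (tp, fp, rest_tp, rest_fp), the injected `ops` helpers (`counts`, `generate_mappings`, `mapping_selection`, `repair`), and the choice of a one-sided two-proportion z-test for "significantly worse" are all hypothetical stand-ins for the platform's components and the paper's actual hypothesis test.

```python
import math

def significantly_worse(tp, fp, rest_tp, rest_fp, z_crit=1.645):
    """Is this item's feedback-derived precision significantly below that
    of the remaining items? Illustrated here with a one-sided
    two-proportion z-test at roughly the 0.05 level (an assumption;
    the paper only requires a test of statistical significance)."""
    n1, n2 = tp + fp, rest_tp + rest_fp
    if n1 == 0 or n2 == 0:
        return False  # too little feedback to draw a conclusion
    p1, p2 = tp / n1, rest_tp / n2
    pooled = (tp + rest_tp) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return se > 0 and (p1 - p2) / se < -z_crit

def assimilate(matches, profile_data, cfds, feedback, ops):
    """Sketch of the Algorithm 1 sequence; the component operations are
    injected via `ops` because they belong to the platform."""
    # lines 2-6: remove matches whose results are significantly worse
    matches = [m for m in matches
               if not significantly_worse(*ops.counts(m, feedback))]
    # line 7: regenerate mappings from the surviving matches
    mappings = ops.generate_mappings(matches, profile_data)
    # lines 8-12: remove mappings that perform significantly worse
    mappings = [mp for mp in mappings
                if not significantly_worse(*ops.counts(mp, feedback))]
    # lines 14-19: remove CFDs found problematic in the light of feedback
    cfds = [c for c in cfds
            if not significantly_worse(*ops.counts(c, feedback))]
    # lines 13 and 20: select the best tuples, then repair with the
    # reduced set of CFDs
    return ops.repair(ops.mapping_selection(mappings, feedback), cfds)
```

Note that items that are significantly better than their peers need no action: as in the text, only suspicious items are removed, and everything else is retained by default.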
Figure 7: Experiments
Figure 8: Precision for different configurations
• All of the components. This means running Algorithm 1 as defined.
• None of the components. This means running Algorithm 1 without lines 2–6, without lines 8–12, and without lines 14–19. In this case, although no actions are taken to remove matches, mappings or CFDs, the data product is still improved in the light of feedback, as MappingSelection in line 13 is able to benefit from improved estimates of mapping correctness.

The none of the components case forms the baseline, as we know of no direct competitor that is seeking to apply feedback on the result of an integration to revise the behaviour of several integration components.

5.2 Results

The precision values obtained for different configurations are illustrated in Figure 8⁴. Each of the lines in Figure 8 is the average of 5 runs from 0 to 500 randomly collected feedback instances, using the configurations from Section 5.1. Note that, as the collection of different feedback instances leads to different possible actions, which leads to different results being made available for feedback collection, there is no realistic way to provide identical feedback instances to different runs of the experiment. This is reflected in the uneven curves in Figure 8.

⁴ In a setting where mapping selection retains the top 1000 tuples from a ground truth of around 4500 tuples, the recall tends to increase broadly in line with the precision.

As usual, the precision of the end product is evaluated as: precision = tp / (tp + fp), where a tuple is considered a true positive (tp) if all of its values appear in the ground truth, and a false positive (fp) otherwise.

Notice in Figure 8 that the none case (i.e., when there is no discarding of matches, mappings or rules along the way) still leads to an increase in the precision of the end product. This is because, throughout the experiments, mapping selection (as outlined in Section 2.2) is informed by correctness estimates, and feedback is used to improve the correctness estimates. So, before any feedback is collected, all mappings have the same correctness (estimated at 0.5), and thus mapping selection considers all mappings to be of equal quality. However, as feedback is gathered, different mappings will come to have different levels of correctness, and mapping selection will make increasingly well informed selection decisions. As these selection decisions relate to correctness, this in turn leads to more true positives in the results and a higher precision.

As such, there might be considered to be two baselines in this experiment: the case in which there is no feedback, and the precision is around 0.2, and one in which the feedback informs mapping selection, but the approach from Section 4 is not applied.

The actions that have been taken by the matching and mapping components, i.e., the removal of suspicious matches or mappings, have had a positive impact on the precision of the end product, as shown in Figure 8. It is unclear from this experiment which one has more impact.

Taking action on the CFDs has had little impact on the end product precision. This can be understood as follows:

• The rules being learnt are numerous, e.g., 3526 rules were learnt with a support size of 5. As a result, each rule applies to quite few rows, and it can be difficult to obtain enough feedback to draw statistically significant conclusions as to the quality of an individual rule.
• Each of the rules typically has a small effect on the end product. As such, even when a problematic rule is removed, this tends not to have much of an effect on overall quality.

However, it is still clear that identifying CFDs that introduced errors to the data, and discarding them, has the potential to have a positive effect on the resulting quality.

The most important result for the overall proposal, however, is that when the actions across all the components are combined, this provides much the greatest improvement compared with any of the individual means of applying the feedback. Furthermore, much of this improvement is obtained with modest amounts of feedback.

We noticed in the experiments that discarding suspicious matches, mappings or CFDs, even in the case of significant issues with the quality of their associated results, did not always guarantee that the precision of the end product would increase. This happens because the contents of the tuples that replace the discarded ones are not yet known, as feedback on them is not always present. As such, the new tuples are not guaranteed to be of better quality. The trend, though, is that the overall correctness can be expected to increase, even if not monotonically.

Next, we report on the number of suspicious items discovered in each of the experiments. Each line in Figure 9 corresponds to the average of the same 5 runs as reported in Figure 8. Figures 9a to c, respectively, show the numbers of suspicious matches,
Figure 9: Detecting Suspicious Products (a. Matching; b. Mapping Generation; c. Data Repair; d. All Components)
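The tuple-level precision measure used in Section 5.2 can be made concrete with a short Python sketch. The encoding is illustrative rather than the system's actual representation: tuples are modelled as dicts, and the ground truth as a map from each target attribute to its set of correct values.

```python
def precision(result, ground_truth):
    """Tuple-level precision: a tuple is a true positive iff all of its
    values appear in the ground truth, and a false positive otherwise,
    so precision = tp / (tp + fp)."""
    if not result:
        return 0.0
    tp = sum(1 for t in result
             if all(v in ground_truth.get(a, set()) for a, v in t.items()))
    return tp / len(result)  # fp = len(result) - tp

# One fully correct tuple and one with a wrong bedroom_no (invented values)
rows = [
    {"postcode": "M1 7DN", "bedroom_no": 2},
    {"postcode": "M13 9PL", "bedroom_no": 9},
]
truth = {"postcode": {"M1 7DN", "M13 9PL"}, "bedroom_no": {2, 3}}
p = precision(rows, truth)  # 1 true positive out of 2 tuples -> 0.5
```

A single incorrect attribute value is enough to make the whole tuple a false positive, which is why attribute-level feedback is informative even under this tuple-level measure.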
mappings and CFDs considered, when only the respective component is being tested. Figure 9d shows the overall number of suspicious items discovered and discarded when all components were being tested.

We can observe the following: (i) Rather few suspicious matches have been detected, so the benefit obtained from the removal of each such match has been quite substantial. We note that, as matches relate to individual columns, obtaining sufficient FP feedback on the data deriving from a match can require quite a lot of feedback. (ii) More suspicious mappings are identified, and they are identified from early in the process. (iii) Although quite a few suspicious CFDs are identified, this is a small fraction of the overall number.

From Figures 8 and 9 it is clear that taking feedback into account and acting on all components increases the return on the investment from feedback. When considering all possible actions together, the overall benefit is greater, and the benefit accrues with smaller amounts of feedback.

6 RELATED WORK

In this section, we consider related work under three headings: pay-as-you-go data preparation, applying feedback to multiple activities, and reducing manual effort in data preparation.

In pay-as-you-go data preparation, following from the vision of dataspaces [12], a best-effort integration is produced through automated bootstrapping, which is refined in the light of user feedback. There have been many proposals for pay-as-you-go data preparation components, for example relating to data extraction [10], matching [18, 36], mapping [5, 6, 30] and entity resolution [14, 25]. Such proposals have addressed issues such as targeting the most appropriate feedback [10, 28, 35], and accommodating unreliable feedback, in particular in the context of crowdsourcing [9, 23]. However, such work has primarily used a single type of feedback to inform a single data preparation step.

In relation to applying feedback to multiple activities, it has been recognised that the same feedback may be able to inform different aspects of the same step. For example, in [5], feedback on the correctness of results is used to inform both mapping selection and mapping refinement. In contrast with the work here, these are two essentially independent proposals that happen to use the same feedback, where the feedback has been obtained directly on the mapping results. In Corleone [14], several different aspects of entity resolution are informed by feedback on matching pairs; in particular, the feedback is used to inform the learning of comparison rules and to identify and resolve problem cases. This may be the closest work to ours, though the focus is on a single data integration step, for which custom feedback has been obtained. Also for entity resolution, feedback on matching pairs is used in [25] as part of a single optimisation step that configures together blocking, detailed comparison and clustering. This contrasts with the current paper in focusing on different aspects of the same integration step. To the best of our knowledge, there is no prior work in which several distinct steps in the data preparation process are considered together when responding to feedback.

In terms of reducing manual effort in data preparation, there are a variety of different approaches. For example, again in the context of individual steps, tools can be used to support data engineers in developing transformations. For example, in relation to format transformation, Wrangler [19] can suggest potential transformation programs, and FlashFill [16] can synthesize transformation programs from examples. There are also proposals in which mappings are discovered based on data instances (e.g., [2, 15, 26]). In ETL, there are also a variety of approaches that seek to reduce the amount of manual labour required. For example, this includes the provision of language features [4, 32] or patterns [31, 33] that support recurring data preparation behaviours, techniques for managing evolution of ETL programs [8], and
development of ETL processes that abstract over more concrete implementation details [3, 22]. However, although such work focuses on raising the abstraction levels at which data engineers engage in data preparation tasks, we are not aware of prior results that use feedback on data products to make changes across complete data preparation processes.

7 CONCLUSIONS

The development of data preparation processes is laborious, requiring sustained attention to detail from data engineers across a variety of tasks. Many of these tasks involve activities that are amenable to automation. However, automated approaches have partial knowledge, and thus cannot necessarily be relied upon to make the best integration decisions. When automation falls short, one way to improve the situation is through feedback on the candidate end data product. The successful combination of automation and feedback provides a route to data preparation without programming, which was considered to be important by 90% of participants in a survey on end user data preparation⁵.

⁵ https://www.datawatch.com/2017-end-user-data-preparation-market-study-2/

Towards this goal of reducing the burden of data preparation, we now revisit and elaborate on the contributions from the introduction.

(1) A technique for applying feedback on a data product across a multi-step data preparation process that both identifies statistically significant issues and provides a mechanism for exploring the actions that may resolve these issues. We have described an approach in which hypotheses about the problems with an integration are tested for statistical significance with respect to user feedback on the candidate end data product, giving rise to actions that seek to resolve issues with the feedback. The same approach is potentially applicable to different types of feedback, diverse data preparation components, and a variety of actions.

(2) A realisation of the technique from (1) in a specific data preparation platform, where feedback is used to change the matches used in an integration, change which mappings are selected, and change which data quality rules are applied. We have indicated how the technique can be applied to matching, mapping and repair steps, on the basis of true/false positive annotations on data product tuples, in the VADA data preparation system.

(3) An empirical evaluation of the implementation of the approach from (2) that investigates the effectiveness of the proposed approach both for individual data preparation constructs (matches, mappings and CFDs) and for applying feedback across all these constructs together. An experimental evaluation with real estate data has shown how the approach can identify actions that can improve data product quality on the basis of changes to individual data preparation steps, and can coordinate changes across multiple such steps, with particularly significant benefits from the combined approach.

There are several promising directions for further investigation, which include: (a) extending the data preparation components to which the approach is applied, for example to include source selection or data format transformation; (b) considering alternative actions, such as changing match scores, mapping algorithm thresholds, or rule learning parameters; and (c) more selective or incremental application of actions, with a view to identifying the subset of the candidate actions that together are the most effective.

Acknowledgement: This work is funded by the UK Engineering and Physical Sciences Research Council (EPSRC) through the VADA Programme Grant.

REFERENCES
[1] Edward Abel, John Keane, Norman W. Paton, Alvaro A.A. Fernandes, Martin Koehler, Nikolaos Konstantinou, Julio Cesar Cortes Rios, Nurzety A. Azuan, and Suzanne M. Embury. 2018. User driven multi-criteria source selection. Information Sciences 430-431 (2018), 179–199. https://doi.org/10.1016/j.ins.2017.11.019
[2] Bogdan Alexe, Balder ten Cate, Phokion G. Kolaitis, and Wang Chiew Tan. 2011. Designing and refining schema mappings via data examples. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD. 133–144. https://doi.org/10.1145/1989323.1989338
[3] Syed Muhammad Fawad Ali and Robert Wrembel. 2017. From conceptual design to performance optimization of ETL workflows: current state of research and open problems. VLDB J. 26, 6 (2017), 777–801. https://doi.org/10.1007/s00778-017-0477-2
[4] Ove Andersen, Christian Thomsen, and Kristian Torp. 2018. SimpleETL: ETL Processing by Simple Specifications. In Proceedings of the 20th International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data co-located with 10th EDBT/ICDT Joint Conference (EDBT/ICDT 2018), Vienna, Austria, March 26-29, 2018. http://ceur-ws.org/Vol-2062/paper10.pdf
[5] Khalid Belhajjame, Norman W. Paton, Suzanne M. Embury, Alvaro A. A. Fernandes, and Cornelia Hedeler. 2010. Feedback-based annotation, selection and refinement of schema mappings for dataspaces. In EDBT. 573–584. https://doi.org/10.1145/1739041.1739110
[6] Angela Bonifati, Radu Ciucanu, and Slawek Staworko. 2014. Interactive Inference of Join Queries. In 17th International Conference on Extending Database Technology, EDBT. 451–462. https://doi.org/10.5441/002/edbt.2014.41
[7] Michael G. Bulmer. 1979. Principles of Statistics. Dover Publications.
[8] Darius Butkevicius, Philipp D. Freiberger, and Frederik M. Halberg. 2017. MAIME: A Maintenance Manager for ETL Processes. In Proceedings of the Workshops of the EDBT/ICDT 2017 Joint Conference (EDBT/ICDT 2017), Venice, Italy, March 21-24, 2017. http://ceur-ws.org/Vol-1810/DOLAP_paper_08.pdf
[9] Valter Crescenzi, Alvaro A. A. Fernandes, Paolo Merialdo, and Norman W. Paton. 2017. Crowdsourcing for data management. Knowl. Inf. Syst. 53, 1 (2017), 1–41. https://doi.org/10.1007/s10115-017-1057-x
[10] Valter Crescenzi, Paolo Merialdo, and Disheng Qiu. 2015. Crowdsourcing large scale wrapper inference. Distributed and Parallel Databases 33, 1 (2015), 95–122. https://doi.org/10.1007/s10619-014-7163-9
[11] Wenfei Fan, Floris Geerts, Laks V. S. Lakshmanan, and Ming Xiong. 2011. Discovering conditional functional dependencies. Proceedings - International Conference on Data Engineering 23, 5 (2011). https://doi.org/10.1109/ICDE.2009.208
[12] Michael J. Franklin, Alon Y. Halevy, and David Maier. 2005. From databases to dataspaces: a new abstraction for information management. SIGMOD Record 34, 4 (2005), 27–33. https://doi.org/10.1145/1107499.1107502
[13] Tim Furche, Georg Gottlob, Giovanni Grasso, Christian Schallhart, and Andrew Sellers. 2013. OXPath: A language for scalable data extraction, automation, and crawling on the deep web. The VLDB Journal 22, 1 (01 Feb 2013), 47–72. https://doi.org/10.1007/s00778-012-0286-6
[14] Chaitanya Gokhale, Sanjib Das, AnHai Doan, Jeffrey F. Naughton, Narasimhan Rampalli, Jude W. Shavlik, and Xiaojin Zhu. 2014. Corleone: hands-off crowdsourcing for entity matching. In SIGMOD. 601–612. https://doi.org/10.1145/2588555.2588576
[15] Georg Gottlob and Pierre Senellart. 2010. Schema Mapping Discovery from Data Instances. JACM 57, 2 (2010), 6:1–6:37. https://doi.org/10.1145/1667053.1667055
[16] Sumit Gulwani, William R. Harris, and Rishabh Singh. 2012. Spreadsheet data manipulation using examples. Commun. ACM 55, 8 (2012), 97–105. https://doi.org/10.1145/2240236.2240260
[17] Cornelia Hedeler, Khalid Belhajjame, Norman W. Paton, Alessandro Campi, Alvaro A.A. Fernandes, and Suzanne M. Embury. 2010. Dataspaces. In Search Computing. LNCS, Vol. 5950. Springer Berlin Heidelberg, 114–134. http://dx.doi.org/10.1007/978-3-642-12310-8_7
[18] Nguyen Quoc Viet Hung, Nguyen Thanh Tam, Vinh Tuan Chau, Tri Kurniawan Wijaya, Zoltán Miklós, Karl Aberer, Avigdor Gal, and Matthias Weidlich. 2015. SMART: A tool for analyzing and reconciling schema matching networks. In 31st IEEE International Conference on Data Engineering, ICDE 2015, Seoul, South Korea, April 13-17, 2015. 1488–1491. https://doi.org/10.1109/ICDE.2015.7113408
[19] Sean Kandel, Andreas Paepcke, Joseph M. Hellerstein, and Jeffrey Heer. 2011. Wrangler: Interactive Visual Specification of Data Transformation Scripts. In CHI. 3363–3372.
[20] Martin Koehler, Alex Bogatu, Cristina Civili, Nikolaos Konstantinou, Edward Abel, Alvaro A. A. Fernandes, John A. Keane, Leonid Libkin, and Norman W. Paton. 2017. Data context informed data wrangling. In 2017 IEEE International Conference on Big Data, BigData 2017. 956–963. https://doi.org/10.1109/BigData.2017.8258015
[21] Nikolaos Konstantinou, Martin Koehler, Edward Abel, Cristina Civili, Bernd
Neumayr, Emanuel Sallinger, Alvaro A. A. Fernandes, Georg Gottlob, John A.
Keane, Leonid Libkin, and Norman W. Paton. 2017. The VADA Architecture
for Cost-Effective Data Wrangling. In ACM SIGMOD. 1599–1602. https://doi.
org/10.1145/3035918.3058730
[22] Georgia Kougka, Anastasios Gounaris, and Alkis Simitsis. 2018. The many
faces of data-centric workflow optimization: a survey. I. J. Data Science and
Analytics 6, 2 (2018), 81–107. https://doi.org/10.1007/s41060-018-0107-0
[23] Guoliang Li, Jiannan Wang, Yudian Zheng, and Michael J. Franklin. 2016.
Crowdsourced Data Management: A Survey. IEEE Trans. Knowl. Data Eng. 28,
9 (2016), 2296–2319. https://doi.org/10.1109/TKDE.2016.2535242
[24] Mark A. Beyer, Eric Thoo, and Ehtisham Zaidi. 2018. Magic Quadrant for Data Integration Tools. Technical Report. Gartner. G00340493.
[25] Ruhaila Maskat, Norman W. Paton, and Suzanne M. Embury. 2016. Pay-as-you-
go Configuration of Entity Resolution. T. Large-Scale Data- and Knowledge-
Centered Systems 29 (2016), 40–65. https://doi.org/10.1007/978-3-662-54037-4_
2
[26] Fotis Psallidas, Bolin Ding, Kaushik Chakrabarti, and Surajit Chaudhuri.
2015. S4: Top-k Spreadsheet-Style Search for Query Discovery. In Pro-
ceedings of the 2015 ACM SIGMOD International Conference on Management
of Data, Melbourne, Victoria, Australia, May 31 - June 4, 2015. 2001–2016.
https://doi.org/10.1145/2723372.2749452
[27] Tye Rattenbury, Joe Hellerstein, Jeffery Heer, Sean Kandel, and Conner Car-
reras. 2017. Principles of Data Wrangling. O’Reilly.
[28] Julio César Cortés Ríos, Norman W. Paton, Alvaro A. A. Fernandes, and Khalid
Belhajjame. 2016. Efficient Feedback Collection for Pay-as-you-go Source
Selection. In Proceedings of the 28th International Conference on Scientific and
Statistical Database Management, SSDBM 2016, Budapest, Hungary, July 18-20,
2016. 1:1–1:12. https://doi.org/10.1145/2949689.2949690
[29] Michael Stonebraker, Daniel Bruckner, Ihab F. Ilyas, George Beskales, Mitch
Cherniack, Stanley B. Zdonik, Alexander Pagan, and Shan Xu. 2013. Data Cura-
tion at Scale: The Data Tamer System. In CIDR 2013, Sixth Biennial Conference
on Innovative Data Systems Research, Asilomar, CA, USA, January 6-9, 2013, On-
line Proceedings. http://www.cidrdb.org/cidr2013/Papers/CIDR13_Paper28.pdf
[30] Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby
Crammer, Zachary G. Ives, Fernando C. N. Pereira, and Sudipto Guha. 2008.
Learning to create data-integrating queries. PVLDB 1, 1 (2008), 785–796.
https://doi.org/10.14778/1453856.1453941
[31] Vasileios Theodorou, Alberto Abelló, Maik Thiele, and Wolfgang Lehner. 2017.
Frequent patterns in ETL workflows: An empirical approach. Data Knowl.
Eng. 112 (2017), 1–16. https://doi.org/10.1016/j.datak.2017.08.004
[32] Christian Thomsen and Torben Bach Pedersen. 2009. pygrametl: a powerful
programming framework for extract-transform-load programmers. In DOLAP
2009, ACM 12th International Workshop on Data Warehousing and OLAP, Hong
Kong, China, November 6, 2009, Proceedings. 49–56. https://doi.org/10.1145/
1651291.1651301
[33] Kalle Tomingas, Margus Kliimask, and Tanel Tammet. 2014. Data Integration
Patterns for Data Warehouse Automation. In 18th East European Conference
on Advances in Databases and Information Systems, ADBIS. 41–55. https:
//doi.org/10.1007/978-3-319-10518-5_4
[34] P. Vassiliadis. 2011. A Survey of Extract-Transform-Load Technology. IJDWM
5, 3 (2011), 1–27.
[35] Zhepeng Yan, Nan Zheng, Zachary G. Ives, Partha Pratim Talukdar, and Cong
Yu. 2015. Active learning in keyword search-based data integration. VLDB J.
24, 5 (2015), 611–631. https://doi.org/10.1007/s00778-014-0374-x
[36] Chen Jason Zhang, Lei Chen, H. V. Jagadish, and Caleb Chen Cao. 2013.
Reducing Uncertainty of Schema Matching via Crowdsourcing. PVLDB 6, 9
(2013), 757–768. https://doi.org/10.14778/2536360.2536374