<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Joint Conference (March</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Feedback Driven Improvement of Data Preparation Pipelines</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nikolaos Konstantinou</string-name>
          <email>nikolaos.konstantinou@manchester.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Norman W. Paton</string-name>
          <email>norman.paton@manchester.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Computer Science, University of Manchester</institution>
          ,
          <addr-line>Manchester</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <volume>26</volume>
      <issue>2019</issue>
      <abstract>
        <p>Data preparation, whether for populating enterprise data warehouses or as a precursor to more exploratory analyses, is recognised as being laborious, and as a result is a barrier to cost-effective data analysis. Several steps that recur within data preparation pipelines are amenable to automation, but it seems important that automated decisions can be refined in the light of user feedback on data products. There has been significant work on how individual data preparation steps can be refined in the light of feedback. This paper goes further by proposing an approach in which feedback on the correctness of values in a data product can be used to revise the results of diverse data preparation components. Furthermore, the approach takes into account all the available feedback in determining which actions should be applied to refine the data preparation process. The approach has been implemented to refine the results of matching, mapping and data repair components in the VADA data preparation system, and is evaluated using deep web and open government data sets from the real estate domain. The experiments have shown how the approach enables feedback to be assimilated effectively for use with individual data preparation components, and furthermore that synergies result from applying the feedback to several data preparation components.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>Data preparation is the process of transforming data from its
original form into a representation that is more appropriate for
analysis. In data warehouses, data preparation tends to be referred
to as involving an Extract Transform Load (ETL) process [34], and
for more ad hoc analyses carried out by data scientists may be
referred to as data wrangling [27]. In both cases, similar steps
tend to be involved in data preparation, such as: discovery of
relevant sources; profiling of these sources to better understand their
individual properties and the potential relationships between
them; matching to identify the relationships between source
attributes; mapping to combine the data from multiple sources;
format transformation to revise the representations of attribute
values; and entity resolution to identify and remove duplicate
records representing the same real world object.</p>
      <p>This is a long list of steps, each of which can potentially involve
data engineers: (i) deciding which data integration and cleaning
operations to apply to which sources; (ii) deciding the order
of application of the operations; and (iii) either configuring the
individual operation applications or writing the rules that express
the behaviour to be exhibited. Although there are many data
preparation products, and the market for data preparation tools
is estimated to be $2.9 billion [24], most of these products are
essentially visual programming platforms, in which users make
many, fine-grained decisions. The consequence of this is that
data preparation is typically quoted as taking 80% of the time of
data scientists, who would prefer to be spending their time on
analysing and interpreting results¹.
¹ https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-timeconsuming-least-enjoyable-data-science-task-survey-says/33d344256f63</p>
      <p>
        The high cost of data preparation has been recognised for a
considerable period. For example, research into dataspaces [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]
proposed a pay-as-you-go approach to data integration, in which
there was an initial and automated bootstrapping phase, which
was followed by an incremental improvement phase in which
the user provided feedback on the data product. This gave rise to
a collection of proposals for pay-as-you-go data integration and
cleaning platforms [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], which in turn led to proposals for the use
of crowds as a possible source of feedback [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. This research
provided experience with pay-as-you-go data management, without
leading to many end-to-end systems; for the most part, feedback
was obtained for a particular task (e.g. mapping selection,
entity resolution) and used for that task alone. This is workable,
but collecting feedback on many individual components is
itself expensive, and thus not especially helpful for the complete,
many-step data preparation process.
      </p>
      <p>
        This, therefore, leaves open the question as to how to make
a multi-step data preparation process much more cost effective,
for example through automation and widespread use of feedback
on data products. There are now some results on automating
comprehensive data preparation pipelines. For example, in Data
Tamer [29], machine learning is used to support activities
including the alignment of data sets and instance level
integration through entity resolution and fusion. In some respects Data
Tamer follows a pay-as-you-go approach, as the training data
used by the learning components is revised in the light of
experience. Furthermore, in VADA [21], a collection of components (for
matching, mapping generation, source selection, format
transformation and data repair) are orchestrated automatically over
data sources, informed by supplementary instance data drawn
from the domain of the target schema [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. However, to date,
feedback has only been applied in VADA to inform the selection
of mappings.
      </p>
      <p>In this paper we investigate how feedback on the data product
that results from the multi-component data preparation process
in VADA can be used to revise the results of several of these
wrangling components in a well-informed way. In particular,
given feedback on the correctness of tuples in the data product,
a feedback assimilation strategy explores a set of hypotheses
about the reason for problems with the result. The statistical
significance of these hypotheses is then tested, giving rise to the
generation of a revised data integration process. The proposed
approach thus uses the same feedback to inform changes to
many different data preparation components, thereby seeking to
maximise the return on the investment made in the provision of
feedback.</p>
      <p>The contributions of the paper are as follows:</p>
      <p>(1) A technique for applying feedback on a data product across
a multi-step data preparation process that both identifies
statistically significant issues and provides a mechanism
for exploring the actions that may resolve these issues.
(2) A realisation of the technique from (1) in a specific data
preparation platform, where feedback is used to change
the matches used in an integration, change which
mappings are used, and change which data quality rules are
applied.
(3) An empirical evaluation of the implementation of the
approach from (2) that investigates the effectiveness of the
proposed approach both for individual data preparation
constructs (matches, mappings and CFDs) and for
applying feedback across all these constructs together.</p>
      <p>The remainder of the paper is structured as follows. Section
2 outlines the data preparation pipeline on which we build, and
provides a running example that will be used in the later sections.
Section 3 provides a problem statement and an overview of the
approach. Section 4 details the individual components in the
realisation of the approach, and presents the feedback assimilation
algorithm. Section 5 evaluates the technique in a real estate
application. Section 6 reviews other approaches to increasing the
cost-effectiveness of data preparation, and Section 7 concludes.</p>
    </sec>
    <sec id="sec-2">
      <title>A DATA PREPARATION PIPELINE</title>
      <p>
        This section provides an overview of the aspects of the VADA
data preparation architecture that are relevant to the feedback
assimilation approach that is the focus of the paper. The VADA
architecture is described in more detail in earlier publications [
        <xref ref-type="bibr" rid="ref20">20,
21</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>Defining a Data Preparation Task</title>
      <p>In VADA, instead of handcrafting a data preparation workflow,
the user focuses on expressing their requirements, and then the
system automatically populates the end data product. In
particular, the user provides:</p>
      <p>Input Data Sources: A collection of data sources that can
be used to populate the result. Figure 1 illustrates the
data sources in our running scenario. These sources
include real estate property data (sources s1 to s3) and open
government data (source s4).</p>
      <p>
        Target Schema: A schema definition for the end data
product. In the running example, the target schema consists
of one table, property, with six attributes, namely price,
postcode, income, bedroom_no, street_name, and location.
User Context: The desired characteristics of the end
product; as the end data product is obtained by an automated
process, many candidate solutions can be generated. The
user context allows the system to select solutions that meet
the specific requirements of the user [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The user’s
requirements are modelled as a weighted set of criteria, with
the sum of their weights equal to one. Although in general
different quality criteria can be used (such as
completeness, consistency or relevance), in our running example,
we consider 6 criteria, each one on the correctness of a
target schema attribute, each with a weight of 1/6.
      </p>
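      <p>For illustration, the user context of the running example can be written down as a weighted set of criteria as in the following minimal Python sketch; the dictionary representation is ours, not the system's configuration format, and the criterion names mirror those listed in the experiments.</p>
      <preformat><![CDATA[
# Illustrative user context: six correctness criteria over the target schema
# attributes, with weights summing to one (this encoding is an assumption).
user_context = {
    "correctness(price)": 1 / 6,
    "correctness(postcode)": 1 / 6,
    "correctness(income)": 1 / 6,
    "correctness(bedroom_no)": 1 / 6,
    "correctness(street_name)": 1 / 6,
    "correctness(location)": 1 / 6,
}
assert abs(sum(user_context.values()) - 1.0) < 1e-9
]]></preformat>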
      <p>
        Data Context: Supplementary instance data associated with
the target schema, which can be used as additional
evidence to inform the automated data preparation
process [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. For example, in Figure 1, reference data is
provided that gives definitive address data.
      </p>
      <p>Given the above information, and a user-defined targeted size
for the end product of 6 tuples, from the data sources in Figure 1,
the system can automatically produce the first end data product
in Figure 2. In the absence of feedback or any other user-defined
characteristics, the system will select tuples that are as complete
as possible to populate the end data product. However, as
illustrated in Figure 2, feedback on the correctness of the result can
be used to revise how the data product is produced, and thus
improve its overall quality.
</p>
    </sec>
    <sec id="sec-4">
      <title>Data Preparation Process and Components</title>
      <p>Initialise: This component sets up the system Knowledge
Base with metadata about the data available in the
system, the target schema, user preferences and component
configurations.</p>
      <p>Matching: Given a set of sources and the target schema
T, the matching component produces a set of matches
between their attributes and the target schema attributes.
Each match has a similarity score, with 0 ≤ score ≤ 1.</p>
      <p>Profiling: This component analyses the available sources
and produces a set of candidate keys and a set of inclusion
dependencies among the sources.</p>
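      <p>For illustration only, matches of this form can be represented as small records and filtered by score, as in the Python sketch below; the class and function names are ours (not the VADA API), and the 0.6 threshold is the one used in the experiments of Section 5.</p>
      <preformat><![CDATA[
# Minimal, assumed representation of the matches produced by the matching component.
from dataclasses import dataclass

@dataclass(frozen=True)
class Match:
    source_attribute: str   # e.g. "s1.bathrooms"
    target_attribute: str   # e.g. "property.bedroom_no"
    score: float            # similarity, with 0 <= score <= 1

def above_threshold(matches, threshold=0.6):
    """Keep matches whose similarity meets the threshold (0.6 in the experiments)."""
    return [m for m in matches if m.score >= threshold]
]]></preformat>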
      <p>
        Mapping Generation: Using the results of the two
previous components as inputs, mapping generation searches
the space of possible ways of combining the sources using
unions or joins, producing a set of candidate mappings.
Mapping Selection: Given a set of user criteria, a set of
candidate mappings, the target schema and a targeted size,
mapping selection establishes how many tuples to obtain
from each of the candidate mappings to populate a target
schema instance, thus creating an end data product [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
Data Repair: Repair takes place in the presence of reference
data; reference data is considered to provide complete
coverage of the domains for certain target columns. Data
repair involves the cooperation of two components. First,
CFD Miner is trained on the data context [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The
component is configured by a support size parameter. In the
example illustrated in Figure 2, with a support size of 2, it
will produce a number of CFDs, including the following:
property([postcode] → [streetname], (M1 5BY || Cambridge Street))
property([postcode] → [locality], (M9 8QB || Manchester))
property([postcode] → [streetname], (OX2 9DU || Crabtree Road))
property([postcode] → [locality], (OX28 4GE || Witney))
property([postcode] → [streetname], (OX4 2DU || Oxford Road))
Given a set of CFDs and a dataset, the Data Repair
component will identify violations of these CFDs, which can then
be repaired, in the sense that violations are removed [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
In Figure 2, rules are learnt from the repeated presence
of the italicised tuples highlighted in blue in the
reference data. Thus, the incorrect values Market
Street, Crabtree Rd, etc. have been corrected to Lakeside
Rise, Crabtree Road, etc., resulting in an end product that
is consistent with respect to the reference data.
      </p>
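      <p>As a concrete illustration of how rules of this form act on the data, the following minimal Python sketch represents the constant CFDs listed above and removes violations; it is a simplification of ours, not the CFD Miner or repair code of [11].</p>
      <preformat><![CDATA[
# Illustrative (assumed) representation of constant CFDs and their repair.
from dataclasses import dataclass

@dataclass(frozen=True)
class CFD:
    lhs: str          # determining attribute, e.g. "postcode"
    rhs: str          # determined attribute, e.g. "streetname"
    lhs_value: str    # pattern constant for the lhs
    rhs_value: str    # value the rhs must take when the lhs matches

rules = [
    CFD("postcode", "streetname", "M1 5BY", "Cambridge Street"),
    CFD("postcode", "locality", "M9 8QB", "Manchester"),
    CFD("postcode", "streetname", "OX2 9DU", "Crabtree Road"),
]

def repair(tuples, cfds):
    """Replace values that violate a CFD with the value the rule dictates."""
    for t in tuples:
        for r in cfds:
            if t.get(r.lhs) == r.lhs_value and t.get(r.rhs) != r.rhs_value:
                t[r.rhs] = r.rhs_value   # violation removed
    return tuples

rows = [{"postcode": "OX2 9DU", "streetname": "Crabtree Rd"}]
repair(rows, rules)
print(rows[0]["streetname"])   # "Crabtree Road", as in the corrected end product
]]></preformat>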
    </sec>
    <sec id="sec-5">
      <title>PROBLEM STATEMENT</title>
      <p>This section provides more details on the problem to be solved,
along with an overview of the approach to be followed. The
problem can be described as follows.</p>
      <p>Assume we have a data preparation pipeline P, that
orchestrates a collection of data preparation steps
{s1, ..., sn }, to produce an end data product E that
consists of a set of tuples. The problem is, given
a set of feedback instances F on tuples from E, to
re-orchestrate some or all of the data preparation
steps si , revised in the light of the feedback, in a
way that produces an improved end data product.</p>
      <p>In this paper, we assume that the feedback takes the form of
true positive (TP) or false positive (FP) annotations on tuples or
attribute values from E. Where a tuple is labelled as a TP, this
means that it is considered to belong in the result; where a
tuple is labelled as an FP, this means that it should not have been
included in the result. When an attribute value is labelled as a
TP, this means that the given value is a legitimate member of the
domain of the column, and when it is labelled as an FP, this value
should not have appeared in that column².
² These annotations lend themselves to propagation as follows. If a tuple is marked
as TP, all of its attribute values are marked as TP. If an attribute value is marked as
FP, all tuples containing any of these attribute values are marked as FP.</p>
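      <p>One reading of these annotations and of the propagation rule in the footnote is sketched below in Python; the data structures are ours and are only meant to make the propagation concrete.</p>
      <preformat><![CDATA[
# Sketch: TP/FP labels on tuples and attribute values, with propagation
# per footnote 2 (a TP tuple marks its values TP; an FP value marks the
# tuple containing it FP). The encoding is an assumption for illustration.
def propagate(tuples, tuple_labels, value_labels):
    """tuples: list of dicts; tuple_labels: {index: 'TP'|'FP'};
    value_labels: {(index, attribute): 'TP'|'FP'}."""
    for idx, label in list(tuple_labels.items()):
        if label == "TP":
            for attribute in tuples[idx]:
                value_labels.setdefault((idx, attribute), "TP")
    for (idx, _attribute), label in list(value_labels.items()):
        if label == "FP":
            tuple_labels[idx] = "FP"
    return tuple_labels, value_labels
]]></preformat>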
      <p>Given such annotations, the approach consists of the following
steps:
(1) Given feedback F, identify a collection of hypotheses H
that could explain the feedback. For example, if an attribute
value is incorrect, the following are possible hypotheses:
(i) a match that was used to associate that value in a source
with this attribute in the target is incorrect; (ii) a mapping
that was used to populate that value in the target is
incorrect, for example joining two tables that should not
have been joined; and (iii) a format transformation has
introduced an error into the value.
(2) Given a hypothesis h ∈ H, review all the evidence
pertaining to h from the feedback to establish if the confidence
in the hypothesis is sufficient to suggest that it should be
investigated further. For example, if the hypothesis is that
a match is incorrect, all the feedback on data derived from
that match should be considered together, with a view
to determining whether the match should be considered
problematic.
(3) Given a hypothesis h ∈ H in which there is some
confidence from feedback, identify actions that could be taken
in the pipeline P. For example, if the hypothesis is that a
match is incorrect, possible actions would be to drop that
match or to drop all the mappings that use the match.
(4) Given that evidence from feedback F may support several
different hypotheses, there is a space of possible actions
that could be carried out, each leading to a different
collection of data preparation steps. As a result, we must explore
the space of candidate integrations that implement the
different actions.</p>
    </sec>
    <sec id="sec-6">
      <title>SOLUTION</title>
      <p>This section provides additional details on how the steps from
Section 3 are carried out in practice, and includes details on how
the feedback can be used to inform actions on matches,
mappings and CFDs. In particular, Section 4.1 identifies hypotheses
that may be suggested by the available feedback; Section 4.2
describes how the statistical significance of such hypotheses can
be ascertained; Section 4.3 identifies some actions that may be
taken in response to a hypothesis; and Section 4.4 describes how
the different aspects of the solution are brought together in an
algorithm.</p>
    </sec>
    <sec id="sec-7">
      <title>Hypotheses</title>
      <p>Given an end data product on which some feedback has been
collected, an obvious question is what can the feedback tell us
about the way in which the data product has been produced. More
specifically, given an end data product and some feedback that
identifies problems with the data product, an obvious question
is what went wrong in the production of this data product. For
example, in Figure 1, the street_name Market Street is not
consistent with the postcode M9 8QB, and thus could have been labelled
as an FP by the user. It is straightforward to identify possible
reasons for this problem, which here we term hypotheses. In this
case, possible reasons include:</p>
      <p>Matching: Incorrect match used to relate the source to the
target.</p>
      <p>Mapping Generation: Incorrect mapping combined sources
in an inappropriate way.</p>
      <p>Data Repair: Incorrect data repair replaced the correct value
with the wrong one.</p>
      <p>In this paper, all these hypotheses are considered as possible
explanations for the identified FP. The question then is when to
give credence to a hypothesis. Given the evidence available
in the form of feedback, hypotheses are tested as follows:
Matching: If the data that results from a match is
significantly worse than that produced from other matches for
the same target schema attribute, then the match is
suspicious.</p>
      <p>Mapping Generation: If the result of a mapping is
significantly worse than the results from the other mappings,
then the mapping is suspicious.</p>
      <p>Data Repair: If the repaired values from a repair rule are
significantly worse than the other values for the repaired
attribute, then the rule is suspicious.</p>
      <p>The following section uses statistical techniques to compare
matches, mappings and CFDs using evidence from feedback.</p>
    </sec>
    <sec id="sec-7b">
      <title>Hypothesis Testing</title>
      <p>This section describes how hypotheses are tested with respect to
the evidence from feedback.</p>
      <p>As stated in Section 2, the user can define the desired
characteristics of the end product in the form of a set of criteria. Some
of these criteria can be determined without feedback (such as
the completeness with respect to the data context of the values
in a column), but some can be informed by feedback (such as
relevance and correctness). In the examples and experiments in
this paper, feedback is on correctness. For criteria that are based
on feedback, Equation 1 is used, in which cˆs is the criterion
evaluation on a source s, tp (resp. fp) is the number of tuples marked
by the user as true (resp. false) positives, and |s| is the source
size.</p>
      <p>cˆs = (1/2) · (1 + (tp − fp) / |s|)   (1)</p>
      <p>Thus, in the absence of feedback, the default value for a
criterion is 0.5. However, if we have feedback on the correctness
of an attribute to the effect that there are 6 tp values and 3 fp values, then if
there are 100 tuples, the estimated correctness for the attribute
in s is now (1/2) · (1 + (6 − 3)/100) = 0.515.</p>
      <p>
        We now proceed to establish when criteria estimates are
significantly different using standard statistical techniques [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]³. The
estimated values of a criterion cˆ on two sources s2 and s1 are
considered to be significantly different when Equation 2 holds:</p>
      <p>cˆs2 − cˆs1 &gt; z · sqrt( ses2² + ses1² )   (2)</p>
      <p>where cˆs2 (resp. cˆs1) is the evaluation of criterion cˆ on s2 (resp.
s1), ses2 and ses1 are the standard errors for sources s2 and s1
respectively, calculated by Equation 3, and z is a statistical term
measuring the relationship between a value and the mean of a
group of values. For instance, a z-score of zero indicates that the
value and the mean are identical. In our experiments we use the
z-score that corresponds to a confidence level of 80%, i.e. 1.282.
Standard error is calculated by Equation 3 below.
³ Such statistical techniques have been used before to compare mappings on the
basis of feedback, with a view to targeting feedback collection [28]. In that work,
where there is uncertainty as to which mappings should be preferred in mapping
selection, additional feedback is obtained to reduce the uncertainty.</p>
      <p>ses = sqrt( cˆs · (1 − cˆs) / Ls )   (3)</p>
      <p>where s is a source, cˆs is the evaluated feedback-based criterion
on source s, and Ls is the amount of feedback collected on source
s.</p>
      <p>Given the above, our estimation of a data criterion cˆs on
a source s is cˆs ± es where es is the margin of error for the data
criterion evaluation on s, and 0 ≤ cˆs ± es ≤ 1. A source s can
be either a set of values, or a set of tuples. The formula for the
margin of error is given in Equation 4.</p>
      <p>es = z · ses · sqrt( (|s| − Ls) / (|s| − 1) )   (4)</p>
      <p>where |s| is the number of elements in s (source size), and Ls
the amount of feedback collected for source s. This is feedback
either provided directly on the data product (if it is the end data
product) or propagated to it. We only consider attribute-level
feedback instances when evaluating Ls on a set of values, and
we only consider tuple-level feedback instances when evaluating
Ls on a set of tuples.</p>
      <p>In the absence of feedback on a set of tuples or values, given
that cˆs is unknown, hence cˆs = 1/2 ± 1/2, we can solve Equation 4
by setting es = 1/2 and Ls = 0, and evaluate the standard error
ses for the source s (needed for evaluating significant difference
using Equation 2) in the absence of feedback as:</p>
      <p>ses = (1 / (2 · z)) · sqrt( (|s| − 1) / |s| )   (5)</p>
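      <p>For concreteness, the following minimal Python sketch (an illustration of ours, not code from the VADA implementation) puts Equations 1–5 together: it estimates a feedback-based criterion, its standard error and margin of error, and applies the significance test of Equation 2 with the z-score for an 80% confidence level.</p>
      <preformat><![CDATA[
from math import sqrt

Z = 1.282  # z-score for an 80% confidence level, as used in the experiments

def correctness(tp, fp, size):
    """Equation 1: feedback-based correctness estimate for a source s."""
    return 0.5 * (1 + (tp - fp) / size)

def standard_error(c_hat, feedback_count, size):
    """Equation 3, falling back to Equation 5 when no feedback is available."""
    if feedback_count == 0:
        return (1.0 / (2 * Z)) * sqrt((size - 1) / size)
    return sqrt(c_hat * (1 - c_hat) / feedback_count)

def margin_of_error(se, feedback_count, size):
    """Equation 4: margin of error with a finite population correction."""
    return Z * se * sqrt((size - feedback_count) / (size - 1))

def significantly_worse(c1, se1, c2, se2):
    """Equation 2: source 1 is significantly worse than source 2."""
    return c2 - c1 > Z * sqrt(se1 ** 2 + se2 ** 2)
]]></preformat>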
      <p>Next, we discuss the comparisons we used when testing the
different components in the system. The focus is on what is
considered as s1, s2, when evaluating Equation 2.</p>
      <p>Testing matches for significant difference. Figure 4 shows
the target schema T and a source s, with a match between
attribute s.d and the target schema attribute T.d: m1: s.d ∼ T.d.</p>
      <p>When testing match m1 for significant difference, Equation 2
is evaluated on the projections on the matching attributes.
Specifically, to test the significance of match m1, for the evaluation
of Equation 2, we use value sets s1 and s2 (i.e., s.d and T.d), as
illustrated in Figure 4.</p>
      <p>Note that the set s1 is the complete set of values for attribute
s.d, regardless of whether the values are selected to populate the
end product or not, while s2 is the greyed part of attribute T.d.</p>
      <p>Consider the example in Figure 1, for which some feedback has
been obtained on the initial end product in Figure 2, to the effect
that the first two tuples have fp annotations for their bedroom_no.
In Figure 2, the first 3 tuples have been obtained from Source 1,
but involve an incorrect match of Source1.bathrooms with
Target.bedroom_no. The correctness estimated for this match using
Equation 1 is (1/2) · (1 + (0 − 2)/3) = 0.167. The correctness for the values
obtained from other matches on the same column, on which no
feedback has been collected, is estimated using Equation 1 to
be 0.5. Although with this small illustrative example statistical
significance is difficult to establish, we can see that the
correctness estimate associated with the match from Source1.bathrooms
with Target.bedroom_no is lower than that for the other values in
the column (obtained using matches involving source attributes
Source 2.bedroom_no and Source 3.bed_num), and thus the match
involving Source1.bathrooms with Target.bedroom_no may be
identified as being suspicious.</p>
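      <p>Running this example through the sketch given earlier shows the same picture: the match derived from Source1.bathrooms scores well below the default 0.5, but with so little feedback the difference is not statistically significant (the size of 3 for the other values is our assumption for this small example).</p>
      <preformat><![CDATA[
# Running example: the match Source1.bathrooms ~ Target.bedroom_no,
# with fp feedback on 2 of its 3 values and no feedback elsewhere.
c_match = correctness(tp=0, fp=2, size=3)                    # 0.167
se_match = standard_error(c_match, feedback_count=2, size=3)

c_rest = 0.5                                                 # no feedback yet
se_rest = standard_error(c_rest, feedback_count=0, size=3)   # Equation 5

print(round(c_match, 3))                                     # 0.167
print(significantly_worse(c_match, se_match, c_rest, se_rest))
# False: as noted above, significance is hard to establish at this scale.
]]></preformat>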
      <p>Testing mappings for significant difference. Figure 5 shows
an example of a target schema T populated with tuples selected
from 5 candidate mappings. Candidate mappings m1 to m4
contribute to the solution, while m5 does not. For each of the
candidate mappings that contributes to the solution, we evaluate
the participating tuples against the rest of the tuples of the end
product, using Equation 2. For instance, to evaluate whether
mapping m1 is significantly different, we use s1 and s2 as
illustrated in Figure 5.</p>
      <p>It is important to mention that for mappings, before evaluation,
we propagate to the mapping the feedback that the user has
provided on the end product, and on the mappings themselves.
Furthermore, the Mapping Selection component ensures that the
tuples that are marked as true (resp. false) positives are selected
(resp. not selected).</p>
      <p>Testing CFDs for significant difference. In Figure 6 we see
the result of a CFD on the end product. We mark in green the 2
attribute values for column c that are found to be correct, and
in red the 3 values in column d found to be violating the CFD,
and that were thus repaired. As before, we consider tuple sets s1
and s2 in Figure 6 as inputs to Equation 2. Note that the correct
values are not considered in s1, as they have no effect on the end
product. We only consider the tuples that were modified.</p>
    </sec>
    <sec id="sec-8">
      <title>Actions</title>
      <p>Each of the hypotheses is associated with a respective action,
typically set in this paper to ruling out the suspicious item
before rerunning the data preparation pipeline. An item here is a
match, a candidate mapping, or a CFD rule. The hypotheses and
respective actions per component are thus summarised below:
• Matching. If a match is suspicious, discard the match.</p>
      <p>This means that mapping generation (or any of the other
components down the line) will not take this match into
account.
• Mapping Generation. If a mapping is suspicious, discard
the mapping. This means that Mapping Selection will not
take this mapping into account.
• Data Repair. If a repair rule is suspicious, discard the
rule.</p>
      <p>We need not take actions for items that are found to be
significantly better than others, as they are retained by default. As
such, we only focus on removing suspicious items.
</p>
    </sec>
    <sec id="sec-9">
      <title>Algorithm</title>
      <p>Algorithm 1 provides the basic sequence of events while
assimilating feedback. It is assumed that the feedback assimilation takes
place in the context of an integration, in which Matches is the set
of existing matches, profileData contains the existing inclusion
dependency and primary key data, and CFDs is a set of existing
conditional functional dependencies, as produced in the upper
part of Figure 3.</p>
      <p>Then, the algorithm considers the hypotheses on matches,
mappings and CFDs in turn; this order is chosen because changes
to matches give rise to changes to mappings, and CFDs are applied
to the results of mappings.</p>
      <p>First, we iterate over the available matches (lines 2–6) and
test whether any of these produces a result that is significantly
worse than the results of other matches. T.a (line 3) is a target
schema attribute that is the target of a match m with a source
attribute m.a. Any match that yields a significantly worse result
is removed.</p>
      <p>The remaining matches are then used for mapping generation
(line 7), and any generated mappings that perform significantly
worse than the others are removed (lines 8–12).</p>
      <p>A similar process is followed with CFDs, removing CDFs that
are found to be problematic in the light of the feedback (lines 14–
19). The end product is then repaired (line 20) using the reduced
set of CFDs. The resulting end data product can then be provided
to the user for analysis or further feedback collection. Newly
collected feedback instances are added to the existing ones, and
propagated to the end product, and the candidate mappings. In
addition, the feedback can periodically be applied to the whole
process using Algorithm 1 to generate a revised end product. The
process continues until the user finds the result fit for purpose
and terminates the data preparation pipeline.</p>
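      <p>Algorithm 1 itself is not reproduced here; the following Python-style sketch is our reconstruction of the sequence just described, with the component callables standing in for VADA's matching, mapping, selection and repair components rather than the system's actual API.</p>
      <preformat><![CDATA[
# Assumed reconstruction of the assimilation loop; `components` supplies
# placeholder callables for the data preparation components.
def assimilate_feedback(matches, profile_data, cfds, feedback, components):
    # Lines 2-6: drop matches whose derived data is significantly worse.
    matches = [m for m in matches
               if not components["worse_match"](m, feedback)]
    # Line 7: regenerate candidate mappings from the surviving matches.
    mappings = components["generate_mappings"](matches, profile_data)
    # Lines 8-12: drop mappings that perform significantly worse.
    mappings = [mp for mp in mappings
                if not components["worse_mapping"](mp, feedback)]
    # Line 13: select tuples (TP tuples kept, FP tuples excluded).
    end_product = components["select_mappings"](mappings, feedback)
    # Lines 14-19: drop CFDs whose repairs are significantly worse.
    cfds = [r for r in cfds
            if not components["worse_cfd"](r, end_product, feedback)]
    # Line 20: repair the end product with the reduced set of CFDs.
    return components["repair"](end_product, cfds)
]]></preformat>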
      <p>
        Function significantlyWorse (lines 3, 9 and 16) essentially
tests Equation 2 with s1 and s2 as arguments, and returns true
if it holds, or false otherwise. The arguments for this function are
the ones illustrated as s1 and s2 in Figure 4 when detecting
significantly worse matches, Figure 5 when detecting significantly
worse mappings, and Figure 6 when detecting significantly worse
rules. Next, using the same approach we detect and remove
suspicious mappings (lines 12–16) and suspicious CFDs (lines 18–23).
As s in line 15 we use the set of tuples in T violating a given
cfd.
      </p>
    </sec>
    <sec id="sec-9b">
      <title>EVALUATION</title>
      <p>
        For the evaluation, we used as sources the following datasets:
(a) forty datasets with real-estate properties extracted from the
web using OXpath [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], with contents similar to sources s1 to s3
in Figure 1, and (b) English indices of deprivation data, downloaded
from www.gov.uk, as shown in s4 in Figure 1. As reference data,
we used open address data from openaddressesuk.org.
      </p>
      <p>Then, to enable the automatic evaluation of correctness on
the end product and throughout the pipeline, we constructed
a ground truth based on the available data, which we used for
our experiments, as follows: we manually matched, mapped,
deduplicated, and then repaired the end product. This gave us a
dataset consisting of approximately 4.5k tuples.</p>
      <p>For the experiments, the match threshold was set to 0.6, the
top 100 mappings were retained from mapping generation, and the
support size for the data repair component was set to 5.</p>
      <p>Each of the criteria in the user context was set to be the
correctness of an attribute from the target schema. Thus, the user
criteria are: correctness(price), correctness(postcode),
correctness(income), correctness(bedroom_no), correctness(street_name),
and correctness(location). They all have the same weight, 1/6. The
correctness of each of these properties is estimated based on
feedback. Mapping selection is configured to select what are
predicted to be the best 1000 tuples from a subset of the generated
mappings.</p>
      <p>The workflow for the experiments is illustrated in Figure 7.
We essentially automatically reran the workflow of Figure 3,
collecting 25 feedback instances in each iteration, until 500 feedback
instances on the end product had been collected and propagated.
Feedback is generated randomly in the experiments, based on
the correctness of the respective tuple with respect to the ground
truth. In each iteration, half of the feedback is given at tuple level,
and half at attribute level.</p>
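      <p>As we understand the description above, the simulated feedback can be sketched as follows in Python; the per-attribute encoding of the ground truth and details such as sampling without replacement are our assumptions.</p>
      <preformat><![CDATA[
import random

def feedback_batch(end_product, ground_truth_values, batch_size=25):
    """Simulated feedback: half tuple-level, half attribute-level labels.
    end_product: list of dict tuples;
    ground_truth_values: {attribute: set of acceptable values} (assumed encoding)."""
    chosen = random.sample(range(len(end_product)), batch_size)
    batch = []
    for i, idx in enumerate(chosen):
        t = end_product[idx]
        correct = {a: v in ground_truth_values.get(a, set()) for a, v in t.items()}
        if i < batch_size // 2:                     # tuple-level feedback
            label = "TP" if all(correct.values()) else "FP"
            batch.append(("tuple", idx, label))
        else:                                       # attribute-level feedback
            attribute = random.choice(list(t))
            batch.append(("value", idx, attribute,
                          "TP" if correct[attribute] else "FP"))
    return batch
]]></preformat>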
      <p>We then configured the system to test:
• Matching. This means running Algorithm 1 without lines
8–12, and without lines 14–19.
• Mapping generation. This means running Algorithm 1
without lines 2–6, and without lines 14–19.
• Data repair. This means running Algorithm 1 without lines
2–6, and without lines 8–12.
• All of the components. This means running Algorithm 1
as defined.
• None of the components. This means running Algorithm
1 without lines 2–6, without lines 8–12, and without lines
14–19. In this case, although no actions are taken to
remove matches, mappings or CFDs, the data product is still
improved in the light of feedback, as MappingSelection
in line 13 is able to benefit from improved estimates of
mapping correctness.</p>
      <p>The none of the components case forms the baseline, as we know
of no direct competitor that is seeking to apply feedback on
the result of an integration to revise the behaviour of several
integration components.
</p>
    </sec>
    <sec id="sec-10">
      <title>Results</title>
      <p>The precision values obtained for different configurations are
illustrated in Figure 8⁴. Each of the lines in Figure 8 is the average
of 5 runs from 0 to 500 randomly collected feedback instances,
using the configurations from Section 5.1. Note that, as the
collection of different feedback instances leads to different possible
actions, which lead to different results being made available for
feedback collection, there is no realistic way to provide identical
feedback instances to different runs of the experiment. This is
reflected in the uneven curves in Figure 8.
⁴ In a setting where mapping selection retains the top 1000 tuples from a ground
truth of around 4500 tuples, the recall tends to increase broadly in line with the
precision.</p>
      <p>As usual, the precision of the end product is evaluated as:
precision = tp / (tp + fp), where a tuple is considered a true positive
(tp) if all of its values appear in the ground truth, and as a false
positive (fp) otherwise.</p>
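      <p>This measure is straightforward to compute; the sketch below uses the same assumed per-attribute encoding of the ground truth as the earlier feedback sketch.</p>
      <preformat><![CDATA[
def precision(end_product, ground_truth_values):
    """Precision of the end product: a tuple is a true positive when all of
    its values appear in the ground truth, and a false positive otherwise."""
    tp = sum(1 for t in end_product
             if all(v in ground_truth_values.get(a, set()) for a, v in t.items()))
    fp = len(end_product) - tp
    return tp / (tp + fp) if end_product else 0.0
]]></preformat>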
      <p>Notice in Figure 8 that the none case (i.e., when there is no
discarding of matches, mappings or rules along the way) still
leads to an increase of the precision of the end product. This
is because, throughout the experiments, mapping selection (as
outlined in Section 2.2) is informed by correctness estimates, and
feedback is used to improve the correctness estimates. So, before
any feedback is collected, all mappings have the same
correctness (estimated at 0.5) and thus mapping selection considers all
mappings to be of equal quality. However, as feedback is
gathered, different mappings will come to have different levels of
correctness, and mapping selection will make increasingly well
informed selection decisions. As these selection decisions relate
to correctness, this in turn leads to more true positives in the
results and a higher precision.</p>
      <p>As such, there might be considered to be two baselines in this
experiment: the case in which there is no feedback, where the
precision is around 0.2, and the none case, in which the feedback informs
mapping selection, but the approach from Section 4 is not applied.</p>
      <p>The actions that have been taken by the matching and mapping
components, i.e., the removal of suspicious matches or mappings,
have had a positive impact on the precision of the end product,
as shown in Figure 8. It is unclear from this experiment which
one has more impact.</p>
      <p>Taking action on the CFDs has had little impact on the end
product precision. This can be understood as follows:
• The rules being learnt are numerous, e.g., 3526 rules were
learnt with a support size of 5. As a result, each rule applies
to quite few rows, and it can be dificult to obtain enough
feedback to draw statistically significant conclusions as to
the quality of an individual rule.
• Each of the rules typically has a small efect to the end
product. As such, even when a problematic rule is removed,
this tends not to have much of an efect on overall quality.
However, it is still clear that identifying CFDs that introduced
errors to the data, and discarding them, has the potential to have
a positive efect on the resulting quality.</p>
      <p>The most important result for the overall proposal, however,
is that when the actions across all the components are combined,
this provides much the greatest improvement compared with any
of the individual means of applying the feedback. Furthermore,
much of this improvement is obtained with modest amounts of
feedback.</p>
      <p>We noticed in the experiments that discarding suspicious
matches, mappings or CFDs, even in the case of significant
issues with the quality of their associated results, did not always
guarantee that the precision of the end product would increase.
This happens because the contents of the tuples that replace the
discarded ones are as yet unknown, as feedback on them is not
always present. As such, the new tuples are not guaranteed to be of
better quality. The trend is, though, that the overall correctness
can be expected to increase, even if not monotonically.</p>
      <p>Next, we report on the number of suspicious items discovered
in each of the experiments. Each line in Figure 9 corresponds to
the average of the same 5 runs as reported in Figure 8. Figures
9a to 9c respectively show the numbers of suspicious matches,
mappings and CFDs considered, when only the respective
component is being tested. Figure 9d shows the overall number of
suspicious items discovered and discarded when all components
were being tested.
(Figure 9 panels: a. Matching; b. Mapping Generation; c. Data Repair; d. All Components.)</p>
      <p>We can observe the following: (i) Rather few suspicious matches
have been detected, so the benefit obtained from the removal
of each such match has been quite substantial. We note that
as matches relate to individual columns, obtaining sufficient FP
feedback on the data deriving from a match can require quite a
lot of feedback. (ii) More suspicious mappings are identified, and
they are identified from early in the process. (iii) Although quite
a few suspicious CFDs are identified, this is a small fraction of
the overall number.</p>
      <p>From Figures 8 and 9 it is clear that taking feedback into
account and acting on all components increases the return on the
investment from feedback. When considering all possible actions
together, the overall benefit is greater, and the benefit accrues
with smaller amounts of feedback.
</p>
    </sec>
    <sec id="sec-11">
      <title>RELATED WORK</title>
      <p>In this section, we consider related work under three headings:
pay-as-you-go data preparation, applying feedback to multiple
activities, and reducing manual effort in data preparation.</p>
      <p>
        In pay-as-you-go data preparation, following from the vision
of dataspaces [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], a best-effort integration is produced through
automated bootstrapping, which is refined in the light of user
feedback. There have been many proposals for pay-as-you-go
data preparation components, for example relating to data
extraction [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], matching [
        <xref ref-type="bibr" rid="ref18">18, 36</xref>
        ], mapping [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6, 30</xref>
        ] and entity
resolution [
        <xref ref-type="bibr" rid="ref14">14, 25</xref>
        ]. Such proposals have addressed issues such
as targeting the most appropriate feedback [
        <xref ref-type="bibr" rid="ref10">10, 28, 35</xref>
        ], and
accommodating unreliable feedback, in particular in the context of
crowdsourcing [
        <xref ref-type="bibr" rid="ref9">9, 23</xref>
        ]. However, such work has primarily used a
single type of feedback to inform a single data preparation step.
      </p>
      <p>
        In relation to applying feedback to multiple activities, it has
been recognised that the same feedback may be able to inform
different aspects of the same step. For example, in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], feedback
on the correctness of results is used to inform both mapping
selection and mapping refinement. In contrast with the work here,
these are two essentially independent proposals that happen to
use the same feedback, where the feedback has been obtained
directly on the mapping results. In Corleone [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], several
different aspects of entity resolution are informed by feedback on
matching pairs; in particular the feedback is used to inform the
learning of comparison rules and to identify and resolve problem
cases. This may be the closest work to ours, though the focus is
on a single data integration step, for which custom feedback has
been obtained. Also for entity resolution, feedback on matching
pairs is used in [25] as part of a single optimisation step that
configures together blocking, detailed comparison and clustering.
This contrasts with the current paper in focusing on different
aspects of the same integration step. To the best of our knowledge,
there is no prior work in which several distinct steps in the data
preparation process are considered together when responding to
feedback.
      </p>
      <p>
        In terms of reducing manual effort in data preparation, there
are a variety of different approaches. For example, again in the
context of individual steps, tools can be used to support data
engineers in developing transformations. For example, in relation
to format transformation, Wrangler [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] can suggest potential
transformation programs, and FlashFill [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] can synthesize
transformation programs from examples. There are also proposals
in which mappings are discovered based on data instances (e.g.,
[
        <xref ref-type="bibr" rid="ref15 ref2">2, 15, 26</xref>
        ]). In ETL, there are also a variety of approaches that seek
to reduce the amount of manual labour required. For example,
this includes the provision of language features [
        <xref ref-type="bibr" rid="ref4">4, 32</xref>
        ] or
patterns [31, 33] that support recurring data preparation behaviours,
techniques for managing evolution of ETL programs [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], and
development of ETL processes that abstract over more concrete
implementation details [
        <xref ref-type="bibr" rid="ref3">3, 22</xref>
        ]. However, although such work
focuses on raising the abstraction levels at which data engineers
engage in data preparation tasks, we are not aware of prior
results that use feedback on data products to make changes across
complete data preparation processes.
      </p>
    </sec>
    <sec id="sec-12">
      <title>CONCLUSIONS</title>
      <p>The development of data preparation processes is laborious,
requiring sustained attention to detail from data engineers across
a variety of tasks. Many of these tasks involve activities that are
amenable to automation. However, automated approaches have
partial knowledge, and thus cannot necessarily be relied upon
to make the best integration decisions. When automation falls
short, one way to improve the situation is through feedback on
the candidate end data product. The successful combination of
automation and feedback provides a route to data preparation
without programming, which was considered to be important by
90% of participants in a survey on end user data preparation⁵.
⁵ https://www.datawatch.com/2017-end-user-data-preparation-market-study-2/</p>
      <p>Towards this goal of reducing the burden of data
preparation, we now revisit and elaborate on the contributions from the
introduction.</p>
      <p>(1) A technique for applying feedback on a data product across
a multi-step data preparation process that both identifies
statistically significant issues and provides a mechanism
for exploring the actions that may resolve these issues. We
have described an approach in which hypotheses about
the problems with an integration are tested for statistical
significance with respect to user feedback on the
candidate end data product, giving rise to actions that seek to
resolve issues with the feedback. The same approach is
potentially applicable to different types of feedback, diverse
data preparation components, and a variety of actions.
(2) A realisation of the technique from (1) in a specific data
preparation platform, where feedback is used to change the
matches used in an integration, change which mappings
are selected, and change which data quality rules are
applied. We have indicated how the technique can be applied
to matching, mapping and repair steps, on the basis of
true/false positive annotations on data product tuples, in
the VADA data preparation system.
(3) An empirical evaluation of the implementation of the
approach from (2) that investigates the efectiveness of the
proposed approach both for individual data preparation
constructs (matches, mappings and CFDs) and for applying
feedback across all these constructs together. An experimental
evaluation with real estate data has shown how the
approach can identify actions that can improve data product
quality on the basis of changes to individual data
preparation steps, and can coordinate changes across multiple
such steps, with particularly significant benefits from the
combined approach.</p>
      <p>There are several promising directions for further
investigation, which include: (a) extending the data preparation
components to which the approach is applied, for example to include
source selection or data format transformation; (b) considering
alternative actions, such as changing match scores, mapping
algorithm thresholds, or rule learning parameters; and (c) more
selective or incremental application of actions, with a view to
identifying the subset of the candidate actions that together are
the most effective.</p>
      <p>Acknowledgement: This work is funded by the UK Engineering
and Physical Sciences Research Council (EPSRC) through the
VADA Programme Grant.
[21] Nikolaos Konstantinou, Martin Koehler, Edward Abel, Cristina Civili, Bernd
Neumayr, Emanuel Sallinger, Alvaro A. A. Fernandes, Georg Gottlob, John A.
Keane, Leonid Libkin, and Norman W. Paton. 2017. The VADA Architecture
for Cost-Effective Data Wrangling. In ACM SIGMOD. 1599–1602. https://doi.
org/10.1145/3035918.3058730
[22] Georgia Kougka, Anastasios Gounaris, and Alkis Simitsis. 2018. The many
faces of data-centric workflow optimization: a survey. I. J. Data Science and
Analytics 6, 2 (2018), 81–107. https://doi.org/10.1007/s41060-018-0107-0
[23] Guoliang Li, Jiannan Wang, Yudian Zheng, and Michael J. Franklin. 2016.</p>
      <p>Crowdsourced Data Management: A Survey. IEEE Trans. Knowl. Data Eng. 28,
9 (2016), 2296–2319. https://doi.org/10.1109/TKDE.2016.2535242
[24] Ehtisham Zaidi Mark A. Beyer, Eric Thoo. 2018. Magic Quadrant for Data</p>
      <p>Integration Tools. Technical Report. Gartner. G00340493.
[25] Ruhaila Maskat, Norman W. Paton, and Suzanne M. Embury. 2016.
Pay-as-yougo Configuration of Entity Resolution. T. Large-Scale Data- and
KnowledgeCentered Systems 29 (2016), 40–65. https://doi.org/10.1007/978-3-662-54037-4_
2
[26] Fotis Psallidas, Bolin Ding, Kaushik Chakrabarti, and Surajit Chaudhuri.
2015. S4: Top-k Spreadsheet-Style Search for Query Discovery. In
Proceedings of the 2015 ACM SIGMOD International Conference on Management
of Data, Melbourne, Victoria, Australia, May 31 - June 4, 2015. 2001–2016.
https://doi.org/10.1145/2723372.2749452
[27] Tye Rattenbury, Joe Hellerstein, Jeffrey Heer, Sean Kandel, and Conner
Carreras. 2017. Principles of Data Wrangling. O’Reilly.
[28] Julio César Cortés Ríos, Norman W. Paton, Alvaro A. A. Fernandes, and Khalid
Belhajjame. 2016. Efficient Feedback Collection for Pay-as-you-go Source
Selection. In Proceedings of the 28th International Conference on Scientific and
Statistical Database Management, SSDBM 2016, Budapest, Hungary, July 18-20,
2016. 1:1–1:12. https://doi.org/10.1145/2949689.2949690
[29] Michael Stonebraker, Daniel Bruckner, Ihab F. Ilyas, George Beskales, Mitch
Cherniack, Stanley B. Zdonik, Alexander Pagan, and Shan Xu. 2013. Data
Curation at Scale: The Data Tamer System. In CIDR 2013, Sixth Biennial Conference
on Innovative Data Systems Research, Asilomar, CA, USA, January 6-9, 2013,
Online Proceedings. http://www.cidrdb.org/cidr2013/Papers/CIDR13_Paper28.pdf
[30] Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby
Crammer, Zachary G. Ives, Fernando C. N. Pereira, and Sudipto Guha. 2008.
Learning to create data-integrating queries. PVLDB 1, 1 (2008), 785–796.
https://doi.org/10.14778/1453856.1453941
[31] Vasileios Theodorou, Alberto Abelló, Maik Thiele, and Wolfgang Lehner. 2017.</p>
      <p>Frequent patterns in ETL workflows: An empirical approach. Data Knowl.</p>
      <p>Eng. 112 (2017), 1–16. https://doi.org/10.1016/j.datak.2017.08.004
[32] Christian Thomsen and Torben Bach Pedersen. 2009. pygrametl: a powerful
programming framework for extract-transform-load programmers. In DOLAP
2009, ACM 12th International Workshop on Data Warehousing and OLAP, Hong
Kong, China, November 6, 2009, Proceedings. 49–56. https://doi.org/10.1145/
1651291.1651301
[33] Kalle Tomingas, Margus Kliimask, and Tanel Tammet. 2014. Data Integration
Patterns for Data Warehouse Automation. In 18th East European Conference
on Advances in Databases and Information Systems, ADBIS. 41–55. https:
//doi.org/10.1007/978-3-319-10518-5_4
[34] P. Vassiliadis. 2011. A Survey of Extract-Transform-Load Technology. IJDWM
5, 3 (2011), 1–27.
[35] Zhepeng Yan, Nan Zheng, Zachary G. Ives, Partha Pratim Talukdar, and Cong
Yu. 2015. Active learning in keyword search-based data integration. VLDB J.
24, 5 (2015), 611–631. https://doi.org/10.1007/s00778-014-0374-x
[36] Chen Jason Zhang, Lei Chen, H. V. Jagadish, and Caleb Chen Cao. 2013.</p>
      <p>Reducing Uncertainty of Schema Matching via Crowdsourcing. PVLDB 6, 9
(2013), 757–768. https://doi.org/10.14778/2536360.2536374</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Edward</given-names>
            <surname>Abel</surname>
          </string-name>
          , John Keane, Norman W. Paton,
          <string-name>
            <given-names>Alvaro A.A.</given-names>
            <surname>Fernandes</surname>
          </string-name>
          , Martin Koehler, Nikolaos Konstantinou, Julio Cesar Cortes Rios, Nurzety A.
          <string-name>
            <surname>Azuan</surname>
          </string-name>
          , and
          <string-name>
            <surname>Suzanne</surname>
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Embury</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>User driven multi-criteria source selection</article-title>
          .
          <source>Information Sciences 430-431</source>
          (
          <year>2018</year>
          ),
          <fpage>179</fpage>
          -
          <lpage>199</lpage>
          . https://doi.org/10.1016/j.ins.
          <year>2017</year>
          .
          <volume>11</volume>
          .019
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Bogdan</given-names>
            <surname>Alexe</surname>
          </string-name>
          , Balder ten Cate, Phokion G. Kolaitis, and Wang Chiew Tan.
          <year>2011</year>
          .
          <article-title>Designing and refining schema mappings via data examples</article-title>
          .
          <source>In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD</source>
          .
          <fpage>133</fpage>
          -
          <lpage>144</lpage>
          . https://doi.org/10.1145/1989323.1989338
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Syed</given-names>
            <surname>Muhammad Fawad Ali</surname>
          </string-name>
          and
          <string-name>
            <given-names>Robert</given-names>
            <surname>Wrembel</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>From conceptual design to performance optimization of ETL workflows: current state of research and open problems</article-title>
          .
          <source>VLDB J</source>
          .
          <volume>26</volume>
          ,
          <issue>6</issue>
          (
          <year>2017</year>
          ),
          <fpage>777</fpage>
          -
          <lpage>801</lpage>
          . https: //doi.org/10.1007/s00778-017-0477-2
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Ove</given-names>
            <surname>Andersen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Christian</given-names>
            <surname>Thomsen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Kristian</given-names>
            <surname>Torp</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>SimpleETL: ETL Processing by Simple Specifications</article-title>
          .
          <source>In Proceedings of the 20th International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data co-located with 10th EDBT/ICDT Joint Conference (EDBT/ICDT</source>
          <year>2018</year>
          ), Vienna, Austria, March
          <volume>26</volume>
          -29,
          <year>2018</year>
          . http://ceur-ws.
          <source>org/</source>
          Vol-2062/paper10.pdf
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Khalid</given-names>
            <surname>Belhajjame</surname>
          </string-name>
          , Norman W. Paton,
          <string-name>
            <surname>Suzanne M. Embury</surname>
            ,
            <given-names>Alvaro A. A.</given-names>
          </string-name>
          <string-name>
            <surname>Fernandes</surname>
            , and
            <given-names>Cornelia</given-names>
          </string-name>
          <string-name>
            <surname>Hedeler</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Feedback-based annotation, selection and refinement of schema mappings for dataspaces</article-title>
          .
          <source>In EDBT</source>
          .
          <volume>573</volume>
          -
          <fpage>584</fpage>
          . https: //doi.org/10.1145/1739041.1739110
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Angela</given-names>
            <surname>Bonifati</surname>
          </string-name>
          , Radu Ciucanu, and
          <string-name>
            <given-names>Slawek</given-names>
            <surname>Staworko</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Interactive Inference of Join Queries</article-title>
          .
          <source>In 17th International Conference on Extending Database Technology, EDBT</source>
          .
          <fpage>451</fpage>
          -
          <lpage>462</lpage>
          . https://doi.org/10.5441/002/edbt.2014.41
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Michael G.</given-names>
            <surname>Bulmer</surname>
          </string-name>
          .
          <year>1979</year>
          .
          <source>Principles of Statistics</source>
          . Dover Publications.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Darius</given-names>
            <surname>Butkevicius</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Philipp D.</given-names>
            <surname>Freiberger</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Frederik M.</given-names>
            <surname>Halberg</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>MAIME: A Maintenance Manager for ETL Processes</article-title>
          .
          <source>In Proceedings of the Workshops of the EDBT/ICDT 2017 Joint Conference (EDBT/ICDT 2017), Venice, Italy, March 21-24, 2017</source>
          . http://ceur-ws.org/Vol-1810/DOLAP_paper_08.pdf
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Valter</given-names>
            <surname>Crescenzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Alvaro A. A.</given-names>
            <surname>Fernandes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Paolo</given-names>
            <surname>Merialdo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Norman W.</given-names>
            <surname>Paton</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Crowdsourcing for data management</article-title>
          .
          <source>Knowl. Inf. Syst</source>
          .
          <volume>53</volume>
          ,
          <issue>1</issue>
          (
          <year>2017</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>41</lpage>
          . https://doi.org/10.1007/s10115-017-1057-x
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Valter</given-names>
            <surname>Crescenzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Paolo</given-names>
            <surname>Merialdo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Disheng</given-names>
            <surname>Qiu</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Crowdsourcing large scale wrapper inference</article-title>
          .
          <source>Distributed and Parallel Databases</source>
          <volume>33</volume>
          ,
          <issue>1</issue>
          (
          <year>2015</year>
          ),
          <fpage>95</fpage>
          -
          <lpage>122</lpage>
          . https://doi.org/10.1007/s10619-014-7163-9
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Wenfei</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Floris</given-names>
            <surname>Geerts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Laks V. S.</given-names>
            <surname>Lakshmanan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ming</given-names>
            <surname>Xiong</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Discovering conditional functional dependencies</article-title>
          .
          <source>Proceedings - International Conference on Data Engineering</source>
          <volume>23</volume>
          ,
          <issue>5</issue>
          (
          <year>2011</year>
          ). https://doi.org/10.1109/ICDE.2009.208
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Michael J.</given-names>
            <surname>Franklin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Alon Y.</given-names>
            <surname>Halevy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>David</given-names>
            <surname>Maier</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>From databases to dataspaces: a new abstraction for information management</article-title>
          .
          <source>SIGMOD Record</source>
          <volume>34</volume>
          ,
          <issue>4</issue>
          (
          <year>2005</year>
          ),
          <fpage>27</fpage>
          -
          <lpage>33</lpage>
          . https://doi.org/10.1145/1107499.1107502
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Tim</given-names>
            <surname>Furche</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Georg</given-names>
            <surname>Gottlob</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Giovanni</given-names>
            <surname>Grasso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Christian</given-names>
            <surname>Schallhart</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Sellers</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>OXPath: A language for scalable data extraction, automation, and crawling on the deep web</article-title>
          .
          <source>The VLDB Journal</source>
          <volume>22</volume>
          ,
          <issue>1</issue>
          (
          <year>Feb 2013</year>
          ),
          <fpage>47</fpage>
          -
          <lpage>72</lpage>
          . https://doi.org/10.1007/s00778-012-0286-6
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Chaitanya</given-names>
            <surname>Gokhale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Sanjib</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>AnHai</given-names>
            <surname>Doan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jeffrey F.</given-names>
            <surname>Naughton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Narasimhan</given-names>
            <surname>Rampalli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jude W.</given-names>
            <surname>Shavlik</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Xiaojin</given-names>
            <surname>Zhu</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Corleone: hands-off crowdsourcing for entity matching</article-title>
          .
          <source>In SIGMOD</source>
          .
          <fpage>601</fpage>
          -
          <lpage>612</lpage>
          . https://doi.org/10.1145/2588555.2588576
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Georg</given-names>
            <surname>Gottlob</surname>
          </string-name>
          and
          <string-name>
            <given-names>Pierre</given-names>
            <surname>Senellart</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Schema Mapping Discovery from Data Instances</article-title>
          .
          <source>JACM</source>
          <volume>57</volume>
          ,
          <issue>2</issue>
          (
          <year>2010</year>
          ),
          <fpage>6:1</fpage>
          -
          <lpage>6:37</lpage>
          . https://doi.org/10.1145/1667053.1667055
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Sumit</given-names>
            <surname>Gulwani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>William R.</given-names>
            <surname>Harris</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Rishabh</given-names>
            <surname>Singh</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Spreadsheet data manipulation using examples</article-title>
          .
          <source>Commun. ACM</source>
          <volume>55</volume>
          ,
          <issue>8</issue>
          (
          <year>2012</year>
          ),
          <fpage>97</fpage>
          -
          <lpage>105</lpage>
          . https://doi.org/10.1145/2240236.2240260
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Cornelia</given-names>
            <surname>Hedeler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Khalid</given-names>
            <surname>Belhajjame</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Norman W.</given-names>
            <surname>Paton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Alessandro</given-names>
            <surname>Campi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Alvaro A. A.</given-names>
            <surname>Fernandes</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Suzanne M.</given-names>
            <surname>Embury</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Dataspaces</article-title>
          .
          <source>In Search Computing. LNCS</source>
          , Vol.
          <volume>5950</volume>
          . Springer Berlin Heidelberg,
          <fpage>114</fpage>
          -
          <lpage>134</lpage>
          . http://dx.doi.org/10.1007/978-3-642-12310-8_7
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Nguyen</given-names>
            <surname>Quoc Viet Hung</surname>
          </string-name>
          , Nguyen Thanh Tam, Vinh Tuan Chau, Tri Kurniawan Wijaya, Zoltán Miklós, Karl Aberer, Avigdor Gal, and
          <string-name>
            <given-names>Matthias</given-names>
            <surname>Weidlich</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>SMART: A tool for analyzing and reconciling schema matching networks</article-title>
          .
          <source>In 31st IEEE International Conference on Data Engineering, ICDE 2015, Seoul, South Korea, April 13-17, 2015</source>
          .
          <fpage>1488</fpage>
          -
          <lpage>1491</lpage>
          . https://doi.org/10.1109/ICDE.2015.7113408
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Sean</given-names>
            <surname>Kandel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Andreas</given-names>
            <surname>Paepcke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Joseph M.</given-names>
            <surname>Hellerstein</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Heer</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Wrangler: Interactive Visual Specification of Data Transformation Scripts</article-title>
          .
          <source>In CHI</source>
          .
          <fpage>3363</fpage>
          -
          <lpage>3372</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Martin</given-names>
            <surname>Koehler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Alex</given-names>
            <surname>Bogatu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Cristina</given-names>
            <surname>Civili</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Nikolaos</given-names>
            <surname>Konstantinou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Edward</given-names>
            <surname>Abel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Alvaro A. A.</given-names>
            <surname>Fernandes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>John A.</given-names>
            <surname>Keane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Leonid</given-names>
            <surname>Libkin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Norman W.</given-names>
            <surname>Paton</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Data context informed data wrangling</article-title>
          .
          <source>In 2017 IEEE International Conference on Big Data, BigData 2017</source>
          .
          <fpage>956</fpage>
          -
          <lpage>963</lpage>
          . https://doi.org/10.1109/BigData.2017.8258015
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>