=Paper= {{Paper |id=Vol-2324/Paper06-NKonstantinou |storemode=property |title=Feedback Driven Improvement of Data Preparation Pipelines |pdfUrl=https://ceur-ws.org/Vol-2324/Paper06-NKonstantinou.pdf |volume=Vol-2324 |authors=Nikolaos Konstantinou,Norman Paton |dblpUrl=https://dblp.org/rec/conf/dolap/KonstantinouP19 }} ==Feedback Driven Improvement of Data Preparation Pipelines== https://ceur-ws.org/Vol-2324/Paper06-NKonstantinou.pdf
Feedback Driven Improvement of Data Preparation Pipelines

Nikolaos Konstantinou and Norman W. Paton
School of Computer Science, University of Manchester, Manchester, UK
nikolaos.konstantinou@manchester.ac.uk, norman.paton@manchester.ac.uk

ABSTRACT
Data preparation, whether for populating enterprise data warehouses or as a precursor to more exploratory analyses, is recognised as being laborious, and as a result is a barrier to cost-effective data analysis. Several steps that recur within data preparation pipelines are amenable to automation, but it seems important that automated decisions can be refined in the light of user feedback on data products. There has been significant work on how individual data preparation steps can be refined in the light of feedback. This paper goes further, by proposing an approach in which feedback on the correctness of values in a data product can be used to revise the results of diverse data preparation components. Furthermore, the approach takes into account all the available feedback in determining which actions should be applied to refine the data preparation process. The approach has been implemented to refine the results of matching, mapping and data repair components in the VADA data preparation system, and is evaluated using deep web and open government data sets from the real estate domain. The experiments have shown how the approach enables feedback to be assimilated effectively for use with individual data preparation components, and furthermore that synergies result from applying the feedback to several data preparation components.

1 INTRODUCTION
Data preparation is the process of transforming data from its original form into a representation that is more appropriate for analysis. In data warehouses, data preparation tends to be referred to as involving an Extract Transform Load (ETL) process [34], and for more ad hoc analyses carried out by data scientists it may be referred to as data wrangling [27]. In both cases, similar steps tend to be involved in data preparation, such as: discovery of relevant sources; profiling of these sources to better understand their individual properties and the potential relationships between them; matching to identify the relationships between source attributes; mapping to combine the data from multiple sources; format transformation to revise the representations of attribute values; and entity resolution to identify and remove duplicate records representing the same real world object.

This is a long list of steps, each of which can potentially involve data engineers: (i) deciding which data integration and cleaning operations to apply to which sources; (ii) deciding the order of application of the operations; and (iii) either configuring the individual operation applications or writing the rules that express the behaviour to be exhibited. Although there are many data preparation products, and the market for data preparation tools is estimated to be $2.9 billion [24], most of these products are essentially visual programming platforms, in which users make many, fine-grained decisions. The consequence of this is that data preparation is typically quoted as taking 80% of the time of data scientists, who would prefer to be spending their time on analysing and interpreting results¹.

The high cost of data preparation has been recognised for a considerable period. For example, research into dataspaces [12] proposed a pay-as-you-go approach to data integration, in which there was an initial and automated bootstrapping phase, which was followed by an incremental improvement phase in which the user provided feedback on the data product. This gave rise to a collection of proposals for pay-as-you-go data integration and cleaning platforms [17], which in turn led to proposals for the use of crowds as a possible source of feedback [9]. This research provided experience with pay-as-you-go data management, without leading to many end-to-end systems; for the most part, feedback was obtained for a particular task (e.g. mapping selection, entity resolution) and used for that task alone. This was a reasonable starting point, but collecting feedback on lots of individual components is itself expensive, and thus not especially helpful for the complete, many-step data preparation process.

This, therefore, leaves open the question as to how to make a multi-step data preparation process much more cost effective, for example through automation and widespread use of feedback on data products. There are now some results on automating comprehensive data preparation pipelines. For example, in Data Tamer [29], machine learning is used to support activities including the alignment of data sets and instance level integration through entity resolution and fusion. In some respects Data Tamer follows a pay-as-you-go approach, as the training data used by the learning components is revised in the light of experience. Furthermore, in VADA [21], a collection of components (for matching, mapping generation, source selection, format transformation and data repair) are orchestrated automatically over data sources, informed by supplementary instance data drawn from the domain of the target schema [20]. However, to date, feedback has only been applied in VADA to inform the selection of mappings.

In this paper we investigate how feedback on the data product that results from the multi-component data preparation process in VADA can be used to revise the results of several of these wrangling components in a well-informed way. In particular, given feedback on the correctness of tuples in the data product, a feedback assimilation strategy explores a set of hypotheses about the reasons for problems with the result. The statistical significance of these hypotheses is then tested, giving rise to the generation of a revised data integration process. The proposed approach thus uses the same feedback to inform changes to many different data preparation components, thereby seeking to maximise the return on the investment made in the provision of feedback. The contributions of the paper are as follows:

¹ https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/33d344256f63

© 2019 Copyright held by the author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2019 Joint Conference (March 26, 2019, Lisbon, Portugal) on CEUR-WS.org.
Figure 1: Data Sources and Reference Data in a simple data preparation scenario. Specific values are coloured as follows: red font – incorrect value w.r.t. conditional functional dependencies (CFDs); green font – correct value w.r.t. CFDs; blue font – data used in mining CFDs.


    (1) A technique for applying feedback on a data product across a multi-step data preparation process that both identifies statistically significant issues and provides a mechanism for exploring the actions that may resolve these issues.
    (2) A realisation of the technique from (1) in a specific data preparation platform, where feedback is used to change the matches used in an integration, change which mappings are used, and change which data quality rules are applied.
    (3) An empirical evaluation of the implementation of the approach from (2) that investigates the effectiveness of the proposed approach both for individual data preparation constructs (matches, mappings and CFDs) and for applying feedback across all these constructs together.

   The remainder of the paper is structured as follows. Section 2 outlines the data preparation pipeline on which we build, and provides a running example that will be used in the later sections. Section 3 provides a problem statement and an overview of the approach. Section 4 details the individual components in the realisation of the approach, and presents the feedback assimilation algorithm. Section 5 evaluates the technique in a real estate application. Section 6 reviews other approaches to increasing the cost-effectiveness of data preparation, and Section 7 concludes.

Figure 2: Using feedback to improve the end product. The shaded red background denotes false positive feedback obtained on the initial end product, in the light of which the system is able to refine the data preparation process to yield the revised data product without the problematic values.

2 A DATA PREPARATION PIPELINE
This section provides an overview of the aspects of the VADA data preparation architecture that are relevant to the feedback assimilation approach that is the focus of the paper. The VADA architecture is described in more detail in earlier publications [20, 21].

2.1 Defining a Data Preparation Task
In VADA, instead of handcrafting a data preparation workflow, the user focuses on expressing their requirements, and then the system automatically populates the end data product. In particular, the user provides:

    Input Data Sources: A collection of data sources that can be used to populate the result. Figure 1 illustrates the data sources in our running scenario. These sources include real estate property data (sources s1 to s3) and open government data (source s4).
    Target Schema: A schema definition for the end data product. In the running example, the target schema consists of one table, property, with six attributes, namely price, postcode, income, bedroom_no, street_name, and location.
    User Context: The desired characteristics of the end product; as the end data product is obtained by an automated process, many candidate solutions can be generated. The user context allows the system to select solutions that meet the specific requirements of the user [1]. The user's requirements are modelled as a weighted set of criteria, with the sum of their weights equal to one. Although in general different quality criteria can be used (such as completeness, consistency or relevance), in our running example we consider 6 criteria, each one on the correctness of a target schema attribute, each with a weight of 1/6.
    Data Context: Supplementary instance data associated with the target schema, which can be used as additional evidence to inform the automated data preparation process [20]. For example, in Figure 1, reference data is provided that supplies definitive address data.

   Given the above information, and a user-defined targeted size for the end product of 6 tuples, from the data sources in Figure 1, the system can automatically produce the first end data product in Figure 2. In the absence of feedback or any other user-defined characteristics, the system will select tuples that are as complete as possible to populate the end data product. However, as illustrated in Figure 2, feedback on the correctness of the result can be used to revise how the data product is produced, and thus improve its overall quality.

2.2 Data Preparation Process and Components
Figure 3 illustrates the basic flow of events for the data processing pipeline in this paper, where the plus precedes and follows parallel tasks. First, the system is initialised using the sources and data context that the user has provided. Then, CFD Miner, Data Profiler and Matching components are run on the sources and data context. Given the matches and profiling data, the Mapping component generates a set of candidate mappings, over which Mapping Selection evaluates the user criteria to select the most suitable mappings for contributing to the end product. Subsequently, the Data Repair component repairs constraint violations that are detected on the end product.

Figure 3: A Data Preparation Pipeline

The components are now outlined in more detail:

    Initialise: This component sets up the system Knowledge Base with metadata about the data available in the system, the target schema, user preferences and component configurations.
    Matching: Given a set of sources and the target schema T, the matching component produces a set of matches between their attributes and the target schema attributes. Each match has a similarity score, with 0 ≤ score ≤ 1.
    Profiling: This component analyses the available sources and produces a set of candidate keys and a set of inclusion dependencies among the sources.
    Mapping Generation: Using the results of the two previous components as inputs, mapping generation searches the space of possible ways of combining the sources using unions or joins, producing a set of candidate mappings.
    Mapping Selection: Given a set of user criteria, a set of candidate mappings, the target schema and a targeted size, mapping selection establishes how many tuples to obtain from each of the candidate mappings to populate a target schema instance, thus creating an end data product [1].
    Data Repair: Repair takes place in the presence of reference data; reference data is considered to provide complete coverage of the domains for certain target columns. Data repair involves the cooperation of two components. First, CFD Miner is trained on the data context [11]. The component is configured by a support size parameter. In the example illustrated in Figure 2, with a support size of 2, it will produce a number of CFDs, including the following:

        property([postcode] → [streetname], (M1 5BY || Cambridge Street))
        property([postcode] → [locality], (M9 8QB || Manchester))
        property([postcode] → [streetname], (OX2 9DU || Crabtree Road))
        property([postcode] → [locality], (OX28 4GE || Witney))
        property([postcode] → [streetname], (OX4 2DU || Oxford Road))

    Given a set of CFDs and a dataset, the Data Repair component will identify violations of these CFDs, which can then be repaired, in the sense that violations are removed [11]. In Figure 2, rules are learnt from the repeated presence of the italicised tuples highlighted in blue in the reference data in Figure 2. Thus, the incorrect values Market Street, Crabtree Rd, etc. have been corrected to Lakeside Rise, Crabtree Road, etc., resulting in an end product that is consistent with respect to the reference data.

3 PROBLEM STATEMENT
This section provides more details on the problem to be solved, along with an overview of the approach to be followed. The problem can be described as follows.

    Assume we have a data preparation pipeline P that orchestrates a collection of data preparation steps {s1, ..., sn}, to produce an end data product E that consists of a set of tuples. The problem is, given a set of feedback instances F on tuples from E, to re-orchestrate some or all of the data preparation steps si, revised in the light of the feedback, in a way that produces an improved end data product.

   In this paper, we assume that the feedback takes the form of true positive (TP) or false positive (FP) annotations on tuples or attribute values from E. Where a tuple is labelled as a TP, this means that it is considered to belong in the result; where a tuple is labelled as an FP, this means that it should not have been included in the result. When an attribute value is labelled as a TP, this means that the given value is a legitimate member of the domain of the column; when it is labelled as an FP, this value should not have appeared in that column².

² These annotations lend themselves to propagation as follows. If a tuple is marked as TP, all of its attribute values are marked as TP. If an attribute value is marked as FP, all tuples containing any of these attribute values are marked as FP.

   Given such annotations, the approach consists of the following steps:

    (1) Given feedback F, identify a collection of hypotheses H that could explain the feedback. For example, if an attribute value is incorrect, the following are possible hypotheses: (i) a match that was used to associate that value in a source with this attribute in the target is incorrect; (ii) a mapping that was used to populate that value in the target is incorrect, for example joining two tables that should not have been joined; and (iii) a format transformation has introduced an error into the value.
    (2) Given a hypothesis h ∈ H, review all the evidence pertaining to h from the feedback to establish if the confidence in the hypothesis is sufficient to suggest that it should be investigated further. For example, if the hypothesis is that a match is incorrect, all the feedback on data derived from that match should be considered together, with a view
to determining whether the match should be considered problematic.
    (3) Given a hypothesis h ∈ H in which there is some confidence from feedback, identify actions that could be taken in the pipeline P. For example, if the hypothesis is that a match is incorrect, possible actions would be to drop that match or to drop all the mappings that use the match.
    (4) Given that evidence from feedback F may support several different hypotheses, there is a space of possible actions that could be carried out, each leading to a different collection of data preparation steps. As a result, we must explore the space of candidate integrations that implement the different actions.

4 SOLUTION
This section provides additional details on how the steps from Section 3 are carried out in practice, and includes details on how the feedback can be used to inform actions on matches, mappings and CFDs. In particular, Section 4.1 identifies hypotheses that may be suggested by the available feedback; Section 4.2 describes how the statistical significance of such hypotheses can be ascertained; Section 4.3 identifies some actions that may be taken in response to a hypothesis; and Section 4.4 describes how the different aspects of the solution are brought together in an algorithm.

4.1 Hypotheses
Given an end data product on which some feedback has been collected, an obvious question is what the feedback can tell us about the way in which the data product has been produced. More specifically, given an end data product and some feedback that identifies problems with the data product, an obvious question is what went wrong in the production of this data product. For example, in Figure 1, the street_name Market Street is not consistent with the postcode M9 8QB, and thus could have been labelled as an FP by the user. It is straightforward to identify possible reasons for this problem, which here we term hypotheses. In this case, possible reasons include:

    Matching: An incorrect match was used to relate the source to the target.
    Mapping Generation: An incorrect mapping combined sources in an inappropriate way.
    Data Repair: An incorrect data repair replaced the correct value with the wrong one.

   In this paper, all these hypotheses are considered as possible explanations for the identified FP. The question then is when to give credence to a hypothesis. Given the evidence available in the form of feedback, hypotheses are tested as follows:

    Matching: If the data that results from a match is significantly worse than that produced from other matches for the same target schema attribute, then the match is suspicious.
    Mapping Generation: If the result of a mapping is significantly worse than the results from the other mappings, then the mapping is suspicious.
    Data Repair: If the repaired values from a repair rule are significantly worse than the other values for the repaired attribute, then the rule is suspicious.

   The following section uses statistical techniques to compare matches, mappings and repair rules using evidence from feedback.

4.2 Hypothesis Testing
This section describes how hypotheses are tested with respect to the evidence from feedback.

   As stated in Section 2, the user can define the desired characteristics of the end product in the form of a set of criteria. Some of these criteria can be determined without feedback (such as the completeness with respect to the data context of the values in a column), but some can be informed by feedback (such as relevance and correctness). In the examples and experiments in this paper, feedback is on correctness. For criteria that are based on feedback, Equation 1 is used, in which ĉ_s is the criterion evaluation on a source s, tp (resp. fp) is the number of tuples marked by the user as true (resp. false) positives, and |s| is the source size.

    ĉ_s = (1/2)(1 + (tp − fp)/|s|)    (1)

   Thus, in the absence of feedback, the default value for a criterion is 0.5. However, if we have feedback on the correctness of an attribute to the effect that there are 6 tp values and 3 fp values, then if there are 100 tuples, the estimated correctness for the attribute in s is now (1/2)(1 + (6 − 3)/100) = 0.515.

   We now proceed to establish when criteria estimates are significantly different, using standard statistical techniques [7]³. The estimated values of a criterion ĉ on two sources s2 and s1 are considered to be significantly different when Equation 2 holds:

    ĉ_s2 − ĉ_s1 > z √(se_s2² + se_s1²)    (2)

   where ĉ_s2 (resp. ĉ_s1) is the evaluation of criterion ĉ on s2 (resp. s1), se_s2 and se_s1 are the standard errors for sources s2 and s1 respectively, calculated by Equation 3, and z is the z-score, a statistical term measuring the relationship between a value and the mean of a group of values. For instance, a z-score of zero indicates that the value and the mean are identical. In our experiments we use the z-score that corresponds to a confidence level of 80%, i.e. 1.282. Standard error is calculated by Equation 3 below.

    se_s = √(ĉ_s(1 − ĉ_s)/L_s)    (3)

   where s is a source, ĉ_s is the evaluated feedback-based criterion on source s, and L_s is the amount of feedback collected on source s.

   Given the above, our estimate of a data criterion ĉ_s on a source s is ĉ_s ± e_s, where e_s is the margin of error for the data criterion evaluation on s, and 0 ≤ ĉ_s ± e_s ≤ 1. A source s can be either a set of values or a set of tuples. The formula for the margin of error is given in Equation 4.

    e_s = z · se_s · √((|s| − L_s)/(|s| − 1))    (4)

   where |s| is the number of elements in s (the source size), and L_s is the amount of feedback collected for source s. This is feedback either provided directly on the data product (if it is the end data product) or propagated to it. We only consider attribute-level feedback instances when evaluating L_s on a set of values, and we only consider tuple-level feedback instances when evaluating L_s on a set of tuples.

³ Such statistical techniques have been used before to compare mappings on the basis of feedback, with a view to targeting feedback collection [28]. In that work, where there is uncertainty as to which mappings should be preferred in mapping selection, additional feedback is obtained to reduce the uncertainty.

Figure 4: Detecting significantly different matches

Figure 5: Detecting significantly different mappings
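The estimates and tests above are straightforward to compute. The following sketch (ours, for illustration only; it is not code from the VADA system, and the function names are assumed) evaluates a feedback-based criterion, its standard error with a fallback for the no-feedback case (obtained, as the text derives, by setting e_s = 1/2 and L_s = 0 in Equation 4), the margin of error, and the significance test:

```python
import math

Z = 1.282  # z-score for the 80% confidence level used in the paper


def criterion(tp, fp, size):
    """Equation 1: feedback-based criterion estimate on a source of |s| = size."""
    return 0.5 * (1 + (tp - fp) / size)


def standard_error(c_hat, l_s, size, z=Z):
    """Equation 3; with no feedback (L_s = 0), fall back to the value
    obtained by setting e_s = 1/2 in Equation 4 and solving for se_s."""
    if l_s == 0:
        return (1 / (2 * z)) * math.sqrt((size - 1) / size)
    return math.sqrt(c_hat * (1 - c_hat) / l_s)


def margin_of_error(se, l_s, size, z=Z):
    """Equation 4: margin of error with a finite-population correction."""
    return z * se * math.sqrt((size - l_s) / (size - 1))


def significantly_different(c2, se2, c1, se1, z=Z):
    """Equation 2: is the criterion estimate on s2 significantly above s1?"""
    return c2 - c1 > z * math.sqrt(se2 ** 2 + se1 ** 2)
```

With the worked example from the text (6 tp, 3 fp, |s| = 100), criterion(6, 3, 100) yields 0.515; and with no feedback at all, margin_of_error(standard_error(0.5, 0, 100), 0, 100) evaluates to 1/2, matching the default estimate ĉ_s = 1/2 ± 1/2.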
   In the absence of feedback on a set of tuples or values, given that ĉ_s is unknown, and hence ĉ_s = 1/2 ± 1/2, we can solve Equation 4 by setting e_s = 1/2 and L_s = 0, and evaluate the standard error se_s for the source s (needed for evaluating significant difference using Equation 2) in the absence of feedback as:

    se_s = (1/(2z)) √((|s| − 1)/|s|)    (5)

   Next, we discuss the comparisons we used when testing the different components in the system. The focus is on what is considered as s1 and s2 when evaluating Equation 2.

Testing matches for significant difference. Figure 4 shows the target schema T and a source s, with a match between attribute s.d and the target schema attribute T.d: m1 : s.d ∼ T.d.

   When testing match m1 for significant difference, Equation 2 is evaluated on the projections on the matching attributes. Specifically, to test the significance of match m1, for the evaluation of Equation 2, we use value sets s1 and s2 (i.e., s.d and T.d), as illustrated in Figure 4.

   Note that the set s1 is the complete set of values for attribute s.d, regardless of whether the values are selected to populate the end product or not, while s2 is the greyed part of attribute T.d.

   Consider the example in Figure 1, for which some feedback has been obtained on the initial end product in Figure 2, to the effect

from 5 candidate mappings. Candidate mappings m1 to m4 contribute to the solution, while m5 does not. For each of the candidate mappings that contributes to the solution, we evaluate the participating tuples against the rest of the tuples of the end product, using Equation 2. For instance, to evaluate whether mapping m1 is significantly different, we use s1 and s2 as illustrated in Figure 5.

   It is important to mention that for mappings, before evaluation, we propagate to the mapping the feedback that the user has provided on the end product, and on the mappings themselves. Furthermore, the Mapping Selection component ensures that the tuples that are marked as true (resp. false) positives are selected (resp. not selected).

Testing CFDs for significant difference. In Figure 6 we see the result of a CFD on the end product. We mark in green the 2 attribute values for column c that are found to be correct, and in red the 3 values in column d found to violate the CFD, and that were thus repaired. As before, we consider tuple sets s1 and s2 in Figure 6 as inputs to Equation 2. Note that the correct values are not considered in s1, as they have no effect on the end product. We only consider the tuples that were modified.

4.3 Actions
Each of the hypotheses is associated with a respective action, typically set in this paper to ruling out the suspicious item before rerunning the data preparation pipeline. An item here is a
that the first two tuples have fp annotations for their bedroom_no.     match, a candidate mapping, or a CFD rule. The hypotheses and
In Figure 2, the first 3 tuples have been obtained from Source 1,       respective actions per component are thus summarised below:
but involve an incorrect match of Source1.bathrooms with Tar-
get.bedroom_no. The correctness estimated for this match using
Equation 1 is 12 (1 + 0−2
                       |3 | ) = 0.167. The correctness for the values
obtained from other matches on the same column, on which no
feedback has been collected, is estimated using Equation 1 to
be 0.5. Although with this small illustrative example statistical
significance is difficult to establish, we can see that the correct-
ness estimate associated with the match from Source1.bathrooms
with Target.bedroom_no is lower than that for the other values in
the column (obtained using matches involving source attributes
Source 2.bedroom_no and Source 3.bed_num), and thus the match
involving Source1.bathrooms with Target.bedroom_no may be
identified as being suspicious.
Testing mappings for significant difference. Figure 5 shows
an example of a target schema T populated with tuples selected                Figure 6: Detecting significantly different CFDs
• Matching. If a match is suspicious, discard the match. This means that mapping generation (or any of the other components down the line) will not take this match into account.
• Mapping Generation. If a mapping is suspicious, discard the mapping. This means that Mapping Selection will not take this mapping into account.
• Data Repair. If a repair rule is suspicious, discard the rule.

We need not take actions for items that are found to be significantly better than others, as they are retained by default. As such, we focus only on removing suspicious items.

4.4 Algorithm
Algorithm 1 provides the basic sequence of events while assimilating feedback. It is assumed that the feedback assimilation takes place in the context of an integration, in which Matches is the set of existing matches, profileData contains the existing inclusion dependency and primary key data, and CFDs is a set of existing conditional functional dependencies, as produced in the upper part of Figure 3.

Algorithm 1 Apply collected feedback
 1: function ApplyFeedback
 2:     for all m ∈ Matches do
 3:         if significantlyWorse(m.a, T.a \ m.a) then
 4:             Matches.remove(m)
 5:         end if
 6:     end for
 7:     Mappings ← MappingGeneration(Matches, profileData)
 8:     for all map ∈ Mappings do
 9:         if significantlyWorse(map, T \ map) then
10:             Mappings.remove(map)
11:         end if
12:     end for
13:     endProduct ← MappingSelection(Mappings)
14:     for all cfd ∈ CFDs do
15:         s ← ViolatingTuples(T, cfd)
16:         if significantlyWorse(s, T \ s) then
17:             CFDs.remove(cfd)
18:         end if
19:     end for
20:     endProduct ← DataRepair(CFDs)
21: end function

Then, the algorithm considers the hypotheses on matches, mappings and CFDs in turn; this order is chosen because changes to matches give rise to changes to mappings, and CFDs are applied to the results of mappings.

First, we iterate over the available matches (lines 2–6) and test whether any of these produces a result that is significantly worse than the results of other matches. T.a (line 3) is a target schema attribute that is the target of a match m with a source attribute m.a. Any match that yields a significantly worse result is removed.

The remaining matches are then used for mapping generation (line 7), and any generated mappings that perform significantly worse than the others are removed (lines 8–12).

A similar process is followed with CFDs, removing CFDs that are found to be problematic in the light of the feedback (lines 14–19). The end product is then repaired (line 20) using the reduced set of CFDs. The resulting end data product can then be provided to the user for analysis or further feedback collection. Newly collected feedback instances are added to the existing ones, and propagated to the end product and the candidate mappings. In addition, the feedback can periodically be applied to the whole process using Algorithm 1 to generate a revised end product. The process continues until the user finds the result fit for purpose and terminates the data preparation pipeline.

Function significantlyWorse (lines 3, 9 and 16) essentially tests Equation 2 with s1 and s2 as arguments, and returns true if it holds, or false otherwise. The arguments for this function are the ones illustrated as s1 and s2 in Figure 4 when detecting significantly worse matches, Figure 5 when detecting significantly worse mappings, and Figure 6 when detecting significantly worse rules. Using the same approach, we detect and remove suspicious mappings (lines 8–12) and suspicious CFDs (lines 14–19). As s in line 15 we define the set of tuples in T violating a given cfd.

5 EVALUATION
5.1 Experimental Setup
For the evaluation, we used as sources the following datasets: (a) forty datasets with real-estate properties extracted from the web using OXPath [13], with contents similar to sources s1 to s3 in Figure 1, and (b) English indices of deprivation data, downloaded from www.gov.uk, as shown in s4 in Figure 1. As reference data, we used open address data from openaddressesuk.org.

Then, to enable the automatic evaluation of correctness on the end product and throughout the pipeline, we constructed a ground truth based on the available data, which we used for our experiments, as follows: we manually matched, mapped, deduplicated, and then repaired the end product. This gave us a dataset consisting of approximately 4.5k tuples.

For the experiments, the match threshold was set to 0.6, the top 100 mappings are retained from mapping generation, and the support size for the data repair component was set to 5.

Each of the criteria in the user context was set to be the correctness of an attribute from the target schema. Thus, the user criteria are: correctness(price), correctness(postcode), correctness(income), correctness(bedroom_no), correctness(street_name), and correctness(location). They all have the same weight, 1/6. The correctness of each of these properties is estimated based on feedback. Mapping selection is configured to select what are predicted to be the best 1000 tuples from a subset of the generated mappings.

The workflow for the experiments is illustrated in Figure 7. We automatically reran the workflow of Figure 3, collecting 25 feedback instances in each iteration, until 500 feedback instances on the end product had been collected and propagated. Feedback is generated randomly in the experiments, based on the correctness of the respective tuple with respect to the ground truth. In each iteration, half of the feedback is given at tuple level, and half at attribute level.

We then configured the system to test:
• Matching. This means running Algorithm 1 without lines 8–12, and without lines 14–19.
• Mapping generation. This means running Algorithm 1 without lines 2–6, and without lines 14–19.
• Data repair. This means running Algorithm 1 without lines 2–6, and without lines 8–12.
Figure 7: Experiments

Figure 8: Precision for different configurations

• All of the components. This means running Algorithm 1 as defined.
• None of the components. This means running Algorithm 1 without lines 2–6, without lines 8–12, and without lines 14–19. In this case, although no actions are taken to remove matches, mappings or CFDs, the data product is still improved in the light of feedback, as MappingSelection in line 13 is able to benefit from improved estimates of mapping correctness.

The none of the components case forms the baseline, as we know of no direct competitor that is seeking to apply feedback on the result of an integration to revise the behaviour of several integration components.

5.2 Results
The precision values obtained for the different configurations are illustrated in Figure 8⁴. Each of the lines in Figure 8 is the average of 5 runs from 0 to 500 randomly collected feedback instances, using the configurations from Section 5.1. Note that, as the collection of different feedback instances leads to different possible actions, which in turn leads to different results being made available for feedback collection, there is no realistic way to provide identical feedback instances to different runs of the experiment. This is reflected in the uneven curves in Figure 8.

As usual, the precision of the end product is evaluated as precision = tp/(tp + fp), where a tuple is considered a true positive (tp) if all of its values appear in the ground truth, and a false positive (fp) otherwise.

Notice in Figure 8 that the none case (i.e., when there is no discarding of matches, mappings or rules along the way) still leads to an increase in the precision of the end product. This is because, throughout the experiments, mapping selection (as outlined in Section 2.2) is informed by correctness estimates, and feedback is used to improve the correctness estimates. So, before any feedback is collected, all mappings have the same correctness (estimated at 0.5) and thus mapping selection considers all mappings to be of equal quality. However, as feedback is gathered, different mappings will come to have different levels of correctness, and mapping selection will make increasingly well informed selection decisions. As these selection decisions relate to correctness, this in turn leads to more true positives in the results and a higher precision.

As such, there might be considered to be two baselines in this experiment: one in which there is no feedback, and the precision is around 0.2, and one in which the feedback informs mapping selection, but the approach from Section 4 is not applied.

The actions that have been taken by the matching and mapping components, i.e., the removal of suspicious matches or mappings, have had a positive impact on the precision of the end product, as shown in Figure 8. It is unclear from this experiment which of the two has more impact.

Taking action on the CFDs has had little impact on the end product precision. This can be understood as follows:
• The rules being learnt are numerous; e.g., 3526 rules were learnt with a support size of 5. As a result, each rule applies to rather few rows, and it can be difficult to obtain enough feedback to draw statistically significant conclusions as to the quality of an individual rule.
• Each of the rules typically has a small effect on the end product. As such, even when a problematic rule is removed, this tends not to have much of an effect on overall quality.

However, it is still clear that identifying CFDs that introduced errors to the data, and discarding them, has the potential to have a positive effect on the resulting quality.

The most important result for the overall proposal, however, is that when the actions across all the components are combined, this provides by far the greatest improvement compared with any of the individual means of applying the feedback. Furthermore, much of this improvement is obtained with modest amounts of feedback.

We noticed in the experiments that discarding suspicious matches, mappings or CFDs, even in the case of significant issues with the quality of their associated results, did not always guarantee that the precision of the end product would increase. This happens because the contents of the tuples that replace the discarded ones are as yet unknown, as feedback on them is not always present. As such, the new tuples are not guaranteed to be of better quality. The trend, though, is that the overall correctness can be expected to increase, even if not monotonically.

Next, we report on the number of suspicious items discovered in each of the experiments. Each line in Figure 9 corresponds to the average of the same 5 runs as reported in Figure 8. Figures 9a to 9c respectively show the numbers of suspicious matches,

⁴ In a setting where mapping selection retains the top 1000 tuples from a ground truth of around 4500 tuples, the recall tends to increase broadly in line with the precision.
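As an aside, the quantities used in Sections 4 and 5 can be pulled together in one sketch. This is illustrative only: Equations 1, 2 and 4 are not reproduced in full in this text, so the exact correctness estimate, the two-sample z-test standing in for Equation 2, the component stand-ins in apply_feedback, and all function and parameter names are assumptions consistent with Equation 5 and the worked example, not the system's actual implementation.

```python
import math

def correctness(tp: int, fp: int, n: int) -> float:
    """Correctness estimate in the style of Equation 1: 0.5 with no
    feedback, shifted by the balance of tp/fp annotations over the
    n values derived from the source."""
    return 0.5 * (1 + (tp - fp) / n)

def se_no_feedback(n: int, z: float = 1.96) -> float:
    """Equation 5: standard error for a source with no feedback,
    where the estimate is 1/2 +/- 1/2."""
    return (1 / (2 * z)) * math.sqrt((n - 1) / n)

def significantly_worse(c1: float, se1: float,
                        c2: float, se2: float,
                        z: float = 1.96) -> bool:
    """Assumed form of Equation 2: a two-sample z-test on two
    correctness estimates and their standard errors."""
    return (c2 - c1) > z * math.sqrt(se1 ** 2 + se2 ** 2)

def precision(end_product, ground_truth) -> float:
    """Precision as defined in Section 5.2: a tuple is a true positive
    only if all of its values appear in the ground truth."""
    if not end_product:
        return 0.0
    tp = sum(1 for t in end_product if t in ground_truth)
    return tp / len(end_product)

def apply_feedback(matches, profile_data, cfds, target,
                   mapping_generation, mapping_selection,
                   data_repair, violating_tuples, worse):
    """Transcription of Algorithm 1 (ApplyFeedback). The callables are
    hypothetical stand-ins for the platform's components; `worse`
    compares two tuple/value sets in the manner of Equation 2."""
    # Lines 2-6: drop matches whose values are significantly worse
    # than the rest of the matched target attribute.
    matches = [m for m in matches
               if not worse(m['values'], target['values'] - m['values'])]
    # Line 7: regenerate candidate mappings from the surviving matches.
    mappings = mapping_generation(matches, profile_data)
    # Lines 8-12: drop significantly worse mappings.
    mappings = [mp for mp in mappings
                if not worse(mp['tuples'], target['tuples'] - mp['tuples'])]
    # Line 13: reselect the tuples that populate the end product.
    end_product = mapping_selection(mappings)
    # Lines 14-19: drop CFDs whose violating tuples are significantly worse.
    kept = []
    for cfd in cfds:
        s = violating_tuples(target, cfd)
        if not worse(s, target['tuples'] - s):
            kept.append(cfd)
    # Line 20: repair the end product with the reduced set of CFDs.
    return data_repair(end_product, kept)

# Worked example from Section 4: three tuples from Source1.bathrooms,
# two of them annotated as false positives.
print(round(correctness(tp=0, fp=2, n=3), 3))  # 0.167
```

With z = 1.96, significantly_worse corresponds to a one-sided test at roughly the 2.5% level; the value of z actually used by the system is not given in this excerpt.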
Figure 9: Detecting Suspicious Products (a. Matching, b. Mapping Generation, c. Data Repair, d. All Components)


mappings and CFDs considered, when only the respective component is being tested. Figure 9d shows the overall number of suspicious items discovered and discarded when all components were being tested.

We can observe the following: (i) Rather few suspicious matches have been detected, so the benefit obtained from the removal of each such match has been quite substantial. We note that, as matches relate to individual columns, obtaining sufficient fp feedback on the data deriving from a match can require quite a lot of feedback. (ii) More suspicious mappings are identified, and they are identified from early in the process. (iii) Although quite a few suspicious CFDs are identified, this is a small fraction of the overall number.

From Figures 8 and 9 it is clear that taking feedback into account and acting on all components increases the return on the investment from feedback. When considering all possible actions together, the overall benefit is greater, and the benefit accrues with smaller amounts of feedback.

6 RELATED WORK
In this section, we consider related work under three headings: pay-as-you-go data preparation, applying feedback to multiple activities, and reducing manual effort in data preparation.

In pay-as-you-go data preparation, following from the vision of dataspaces [12], a best-effort integration is produced through automated bootstrapping, which is refined in the light of user feedback. There have been many proposals for pay-as-you-go data preparation components, for example relating to data extraction [10], matching [18, 36], mapping [5, 6, 30] and entity resolution [14, 25]. Such proposals have addressed issues such as targeting the most appropriate feedback [10, 28, 35], and accommodating unreliable feedback, in particular in the context of crowdsourcing [9, 23]. However, such work has primarily used a single type of feedback to inform a single data preparation step.

In relation to applying feedback to multiple activities, it has been recognised that the same feedback may be able to inform different aspects of the same step. For example, in [5], feedback on the correctness of results is used to inform both mapping selection and mapping refinement. In contrast with the work here, these are two essentially independent proposals that happen to use the same feedback, where the feedback has been obtained directly on the mapping results. In Corleone [14], several different aspects of entity resolution are informed by feedback on matching pairs; in particular, the feedback is used to inform the learning of comparison rules and to identify and resolve problem cases. This may be the closest work to ours, though the focus is on a single data integration step, for which custom feedback has been obtained. Also for entity resolution, feedback on matching pairs is used in [25] as part of a single optimisation step that configures together blocking, detailed comparison and clustering. This contrasts with the current paper in focusing on different aspects of the same integration step. To the best of our knowledge, there is no prior work in which several distinct steps in the data preparation process are considered together when responding to feedback.

In terms of reducing manual effort in data preparation, there are a variety of different approaches. For example, again in the context of individual steps, tools can be used to support data engineers in developing transformations. In relation to format transformation, Wrangler [19] can suggest potential transformation programs, and FlashFill [16] can synthesize transformation programs from examples. There are also proposals in which mappings are discovered based on data instances (e.g., [2, 15, 26]). In ETL, there are also a variety of approaches that seek to reduce the amount of manual labour required. For example, this includes the provision of language features [4, 32] or patterns [31, 33] that support recurring data preparation behaviours, techniques for managing the evolution of ETL programs [8], and development of ETL processes that abstract over more concrete implementation details [3, 22]. However, although such work focuses on raising the abstraction levels at which data engineers engage in data preparation tasks, we are not aware of prior results that use feedback on data products to make changes across complete data preparation processes.

7 CONCLUSIONS
The development of data preparation processes is laborious, requiring sustained attention to detail from data engineers across a variety of tasks. Many of these tasks involve activities that are amenable to automation. However, automated approaches have partial knowledge, and thus cannot necessarily be relied upon to make the best integration decisions. When automation falls short, one way to improve the situation is through feedback on the candidate end data product. The successful combination of automation and feedback provides a route to data preparation without programming, which was considered to be important by 90% of participants in a survey on end user data preparation⁵.

Towards this goal of reducing the burden of data preparation, we now revisit and elaborate on the contributions from the introduction.

(1) A technique for applying feedback on a data product across a multi-step data preparation process that both identifies statistically significant issues and provides a mechanism for exploring the actions that may resolve these issues. We have described an approach in which hypotheses about the problems with an integration are tested for statistical significance with respect to user feedback on the candidate end data product, giving rise to actions that seek to resolve issues with the feedback. The same approach is potentially applicable to different types of feedback, diverse data preparation components, and a variety of actions.

(2) A realisation of the technique from (1) in a specific data preparation platform, where feedback is used to change the matches used in an integration, change which mappings are selected, and change which data quality rules are applied. We have indicated how the technique can be applied to matching, mapping and repair steps, on the basis of true/false positive annotations on data product tuples, in the VADA data preparation system.

(3) An empirical evaluation of the implementation of the approach from (2) that investigates the effectiveness of the proposed approach both for individual data preparation constructs (matches, mappings and CFDs) and for applying feedback across all these constructs together. An experimental evaluation with real estate data has shown how the approach can identify actions that can improve data product quality on the basis of changes to individual data preparation steps, and can coordinate changes across multiple such steps, with particularly significant benefits from the combined approach.

There are several promising directions for further investigation, which include: (a) extending the data preparation components to which the approach is applied, for example to include
identifying the subset of the candidate actions that together are the most effective.

Acknowledgement: This work is funded by the UK Engineering and Physical Sciences Research Council (EPSRC) through the VADA Programme Grant.

REFERENCES
[1] Edward Abel, John Keane, Norman W. Paton, Alvaro A. A. Fernandes, Martin Koehler, Nikolaos Konstantinou, Julio Cesar Cortes Rios, Nurzety A. Azuan, and Suzanne M. Embury. 2018. User driven multi-criteria source selection. Information Sciences 430–431 (2018), 179–199. https://doi.org/10.1016/j.ins.2017.11.019
[2] Bogdan Alexe, Balder ten Cate, Phokion G. Kolaitis, and Wang Chiew Tan. 2011. Designing and refining schema mappings via data examples. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD. 133–144. https://doi.org/10.1145/1989323.1989338
[3] Syed Muhammad Fawad Ali and Robert Wrembel. 2017. From conceptual design to performance optimization of ETL workflows: current state of research and open problems. VLDB J. 26, 6 (2017), 777–801. https://doi.org/10.1007/s00778-017-0477-2
[4] Ove Andersen, Christian Thomsen, and Kristian Torp. 2018. SimpleETL: ETL Processing by Simple Specifications. In Proceedings of the 20th International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data co-located with 10th EDBT/ICDT Joint Conference (EDBT/ICDT 2018), Vienna, Austria, March 26-29, 2018. http://ceur-ws.org/Vol-2062/paper10.pdf
[5] Khalid Belhajjame, Norman W. Paton, Suzanne M. Embury, Alvaro A. A. Fernandes, and Cornelia Hedeler. 2010. Feedback-based annotation, selection and refinement of schema mappings for dataspaces. In EDBT. 573–584. https://doi.org/10.1145/1739041.1739110
[6] Angela Bonifati, Radu Ciucanu, and Slawek Staworko. 2014. Interactive Inference of Join Queries. In 17th International Conference on Extending Database Technology, EDBT. 451–462. https://doi.org/10.5441/002/edbt.2014.41
[7] Michael G. Bulmer. 1979. Principles of Statistics. Dover Publications.
[8] Darius Butkevicius, Philipp D. Freiberger, and Frederik M. Halberg. 2017. MAIME: A Maintenance Manager for ETL Processes. In Proceedings of the Workshops of the EDBT/ICDT 2017 Joint Conference (EDBT/ICDT 2017), Venice, Italy, March 21-24, 2017. http://ceur-ws.org/Vol-1810/DOLAP_paper_08.pdf
[9] Valter Crescenzi, Alvaro A. A. Fernandes, Paolo Merialdo, and Norman W. Paton. 2017. Crowdsourcing for data management. Knowl. Inf. Syst. 53, 1 (2017), 1–41. https://doi.org/10.1007/s10115-017-1057-x
[10] Valter Crescenzi, Paolo Merialdo, and Disheng Qiu. 2015. Crowdsourcing large scale wrapper inference. Distributed and Parallel Databases 33, 1 (2015), 95–122. https://doi.org/10.1007/s10619-014-7163-9
[11] Wenfei Fan, Floris Geerts, Laks V. S. Lakshmanan, and Ming Xiong. 2011. Discovering conditional functional dependencies. Proceedings - International Conference on Data Engineering 23, 5 (2011). https://doi.org/10.1109/ICDE.2009.208
[12] Michael J. Franklin, Alon Y. Halevy, and David Maier. 2005. From databases to dataspaces: a new abstraction for information management. SIGMOD Record 34, 4 (2005), 27–33. https://doi.org/10.1145/1107499.1107502
[13] Tim Furche, Georg Gottlob, Giovanni Grasso, Christian Schallhart, and Andrew Sellers. 2013. OXPath: A language for scalable data extraction, automation, and crawling on the deep web. The VLDB Journal 22, 1 (2013), 47–72. https://doi.org/10.1007/s00778-012-0286-6
[14] Chaitanya Gokhale, Sanjib Das, AnHai Doan, Jeffrey F. Naughton, Narasimhan Rampalli, Jude W. Shavlik, and Xiaojin Zhu. 2014. Corleone: hands-off crowdsourcing for entity matching. In SIGMOD. 601–612. https://doi.org/10.1145/2588555.2588576
[15] Georg Gottlob and Pierre Senellart. 2010. Schema Mapping Discovery from Data Instances. JACM 57, 2 (2010), 6:1–6:37. https://doi.org/10.1145/1667053.1667055
[16] Sumit Gulwani, William R. Harris, and Rishabh Singh. 2012. Spreadsheet data manipulation using examples. Commun. ACM 55, 8 (2012), 97–105. https://doi.org/10.1145/2240236.2240260
[17] Cornelia Hedeler, Khalid Belhajjame, Norman W. Paton, Alessandro Campi, Alvaro A. A. Fernandes, and Suzanne M. Embury. 2010. Dataspaces. In Search Computing. LNCS, Vol. 5950. Springer Berlin Heidelberg, 114–134. http://dx.doi.org/10.1007/978-3-642-12310-8_7
[18] Nguyen Quoc Viet Hung, Nguyen Thanh Tam, Vinh Tuan Chau, Tri Kurniawan Wijaya, Zoltán Miklós, Karl Aberer, Avigdor Gal, and Matthias Weidlich. 2015. SMART: A tool for analyzing and reconciling schema matching networks. In 31st IEEE International Conference on Data Engineering, ICDE 2015, Seoul, South Korea, April 13-17, 2015. 1488–1491. https://doi.org/10.1109/ICDE.2015.7113408
source selection or data format transformation; (b) considering              [19] Sean Kandel, Andreas Paepcke, Joseph M. Hellerstein, and Jeffrey Heer. 2011.
                                                                                  Wrangler: Interactive Visual Specification of Data Transformation Scripts. In
alternative actions, such as changing match scores, mapping al-                   CHI. 3363–3372.
gorithm thresholds, or rule learning parameters; and (c) more                [20] Martin Koehler, Alex Bogatu, Cristina Civili, Nikolaos Konstantinou, Ed-
                                                                                  ward Abel, Alvaro A. A. Fernandes, John A. Keane, Leonid Libkin, and
selective or incremental application of actions, with a view to                   Norman W. Paton. 2017. Data context informed data wrangling. In 2017
                                                                                  IEEE International Conference on Big Data, BigData 2017. 956–963. https:
5 https://www.datawatch.com/2017-end-user-data-preparation-market-study-2/        //doi.org/10.1109/BigData.2017.8258015
[21] Nikolaos Konstantinou, Martin Koehler, Edward Abel, Cristina Civili, Bernd
     Neumayr, Emanuel Sallinger, Alvaro A. A. Fernandes, Georg Gottlob, John A.
     Keane, Leonid Libkin, and Norman W. Paton. 2017. The VADA Architecture
     for Cost-Effective Data Wrangling. In ACM SIGMOD. 1599–1602. https://doi.
     org/10.1145/3035918.3058730
[22] Georgia Kougka, Anastasios Gounaris, and Alkis Simitsis. 2018. The many
     faces of data-centric workflow optimization: a survey. I. J. Data Science and
     Analytics 6, 2 (2018), 81–107. https://doi.org/10.1007/s41060-018-0107-0
[23] Guoliang Li, Jiannan Wang, Yudian Zheng, and Michael J. Franklin. 2016.
     Crowdsourced Data Management: A Survey. IEEE Trans. Knowl. Data Eng. 28,
     9 (2016), 2296–2319. https://doi.org/10.1109/TKDE.2016.2535242
[24] Mark A. Beyer, Eric Thoo, and Ehtisham Zaidi. 2018. Magic Quadrant for Data
     Integration Tools. Technical Report G00340493. Gartner.
[25] Ruhaila Maskat, Norman W. Paton, and Suzanne M. Embury. 2016. Pay-as-you-
     go Configuration of Entity Resolution. T. Large-Scale Data- and Knowledge-
     Centered Systems 29 (2016), 40–65. https://doi.org/10.1007/978-3-662-54037-4_
     2
[26] Fotis Psallidas, Bolin Ding, Kaushik Chakrabarti, and Surajit Chaudhuri.
     2015. S4: Top-k Spreadsheet-Style Search for Query Discovery. In Pro-
     ceedings of the 2015 ACM SIGMOD International Conference on Management
     of Data, Melbourne, Victoria, Australia, May 31 - June 4, 2015. 2001–2016.
     https://doi.org/10.1145/2723372.2749452
[27] Tye Rattenbury, Joe Hellerstein, Jeffrey Heer, Sean Kandel, and Connor Car-
     reras. 2017. Principles of Data Wrangling. O’Reilly.
[28] Julio César Cortés Ríos, Norman W. Paton, Alvaro A. A. Fernandes, and Khalid
     Belhajjame. 2016. Efficient Feedback Collection for Pay-as-you-go Source
     Selection. In Proceedings of the 28th International Conference on Scientific and
     Statistical Database Management, SSDBM 2016, Budapest, Hungary, July 18-20,
     2016. 1:1–1:12. https://doi.org/10.1145/2949689.2949690
[29] Michael Stonebraker, Daniel Bruckner, Ihab F. Ilyas, George Beskales, Mitch
     Cherniack, Stanley B. Zdonik, Alexander Pagan, and Shan Xu. 2013. Data Cura-
     tion at Scale: The Data Tamer System. In CIDR 2013, Sixth Biennial Conference
     on Innovative Data Systems Research, Asilomar, CA, USA, January 6-9, 2013, On-
     line Proceedings. http://www.cidrdb.org/cidr2013/Papers/CIDR13_Paper28.pdf
[30] Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby
     Crammer, Zachary G. Ives, Fernando C. N. Pereira, and Sudipto Guha. 2008.
     Learning to create data-integrating queries. PVLDB 1, 1 (2008), 785–796.
     https://doi.org/10.14778/1453856.1453941
[31] Vasileios Theodorou, Alberto Abelló, Maik Thiele, and Wolfgang Lehner. 2017.
     Frequent patterns in ETL workflows: An empirical approach. Data Knowl.
     Eng. 112 (2017), 1–16. https://doi.org/10.1016/j.datak.2017.08.004
[32] Christian Thomsen and Torben Bach Pedersen. 2009. pygrametl: a powerful
     programming framework for extract-transform-load programmers. In DOLAP
     2009, ACM 12th International Workshop on Data Warehousing and OLAP, Hong
     Kong, China, November 6, 2009, Proceedings. 49–56. https://doi.org/10.1145/
     1651291.1651301
[33] Kalle Tomingas, Margus Kliimask, and Tanel Tammet. 2014. Data Integration
     Patterns for Data Warehouse Automation. In 18th East European Conference
     on Advances in Databases and Information Systems, ADBIS. 41–55. https:
     //doi.org/10.1007/978-3-319-10518-5_4
[34] Panos Vassiliadis. 2011. A Survey of Extract-Transform-Load Technology. IJDWM
     5, 3 (2011), 1–27.
[35] Zhepeng Yan, Nan Zheng, Zachary G. Ives, Partha Pratim Talukdar, and Cong
     Yu. 2015. Active learning in keyword search-based data integration. VLDB J.
     24, 5 (2015), 611–631. https://doi.org/10.1007/s00778-014-0374-x
[36] Chen Jason Zhang, Lei Chen, H. V. Jagadish, and Caleb Chen Cao. 2013.
     Reducing Uncertainty of Schema Matching via Crowdsourcing. PVLDB 6, 9
     (2013), 757–768. https://doi.org/10.14778/2536360.2536374