Automating Data Preparation: Can We? Should We? Must We?
                                                                    Norman W. Paton
                                                             School of Computer Science
                                                              University of Manchester
                                                                  Manchester, UK
                                                           norman.paton@manchester.ac.uk

ABSTRACT
Obtaining value from data through analysis often requires significant prior effort on data preparation. Data preparation covers the discovery, selection, integration and cleaning of existing data sets into a form that is suitable for analysis. Data preparation, also known as data wrangling or extract transform load, is reported as taking 80% of the time of data scientists. How can this time be reduced? Can it be reduced by automation? There have been significant results on the automation of individual steps within the data wrangling process, and there are now a few proposals for end-to-end automation. This paper reviews the state-of-the-art, and asks the following questions: Can we automate data preparation – what techniques are already available? Should we – what data preparation activities seem likely to be carried out better by software than by human experts? Must we – what data preparation challenges cannot realistically be carried out using manual approaches?

© 2019 Copyright held by the author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2019 Joint Conference (March 26, 2019, Lisbon, Portugal) on CEUR-WS.org.

1 INTRODUCTION
It is widely reported that data scientists are spending around 80% of their time on data preparation.¹,² Data scientists can't realistically expect to spend 0% of their time on data preparation, but 80% seems unnecessarily high. Why is this figure so high? It isn't because there are no products to support data preparation; the data preparation tools market is reported to be worth $2.9B and growing rapidly [24].

¹ https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html
² https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#6d86d9256f63

It seems likely that data preparation is expensive because it is still in significant measure a programming task: data scientists either write data wrangling programs directly (e.g., [19]), use visual programming interfaces [18] to develop transformation scripts, or write workflows [35] to combine data preparation operations. This leads to a significant amount of work, as the activities facing the data scientist are likely to include the following:
   • Data discovery: the identification of potentially relevant data sources, such as those that are similar to or join with a given target.
   • Data extraction: obtaining usable data sets from challenging and heterogeneous types of source, such as the deep web.
   • Data profiling: understanding the basic properties of individual data sets (such as keys) and the relationships between data sets (such as inclusion dependencies).
   • Format transformation: resolving inconsistencies in value representations, for example for dates and names.
   • Source selection: choosing the data sets that are actually suitable for the problem at hand, in terms of relevance, coverage and quality.
   • Matching: identifying which properties of different sources may contain the same type of information.
   • Mapping: the development of transformation programs that combine or reorganize data sources to remove structural inconsistencies.
   • Data repair: the removal of constraint violations, for example between a zip code and a street name.
   • Duplicate detection: the identification of duplicate entries for the same real world object within individual data sets or in the results of mappings.
   • Data fusion: the selection of the data from identified duplicates for use in later steps of the process.
   This is an intimidating collection of stages to approach manually. Quite a few of these are individually challenging, involving tasks such as the authoring of mappings or format transformation rules, and the setting of thresholds and parameters (e.g., for matching and duplicate detection). Current data preparation systems provide frameworks within which such tasks can be carried out, but typically the data scientist or engineer remains responsible for making many fine grained decisions.
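As a toy illustration of how these stages compose into a pipeline, consider the following sketch. Every name is hypothetical and every function a trivial stand-in for what is, in reality, a hard research problem; the point is only the shape of the process, not any cited system's API.

# Every function below is a trivial, hypothetical stand-in for a hard
# problem; the sketch only shows how the stages listed above compose.

def discover(catalog, target_columns):
    # Data discovery: keep sources sharing column names with the target.
    return [s for s in catalog if target_columns & set(s["rows"][0])]

def profile(source):
    # Data profiling: columns whose values are unique (candidate keys).
    rows = source["rows"]
    return {c for c in rows[0] if len({r[c] for r in rows}) == len(rows)}

def select_sources(sources, min_rows=1):
    # Source selection: a (trivial) quality criterion.
    return [s for s in sources if len(s["rows"]) >= min_rows]

def integrate(sources, target_columns):
    # Matching, mapping and fusion collapsed into a union on shared columns.
    return [{c: r.get(c) for c in target_columns}
            for s in sources for r in s["rows"]]

catalog = [
    {"name": "s1", "rows": [{"id": 1, "price": 9.5}, {"id": 2, "price": 3.0}]},
    {"name": "s2", "rows": [{"id": 7, "cost": 4.2}]},
]
target = {"id", "price"}
candidates = select_sources(discover(catalog, target))
print([profile(s) for s in candidates])   # candidate keys per source
print(integrate(candidates, target))      # a (poorly fused!) target table

Even in this caricature, each stage hides decisions (which similarity to use, which thresholds, which values to trust) that real systems must make well.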
   Any data preparation solution in which the data scientist retains fine grained control over each of the many steps that may be involved in data preparation is likely to be expensive. As a result, the hypothesis of this paper is that automation of these steps, and indeed of complete data preparation processes, should be a priority. This could be expressed in the following principle for data preparation systems:

      Data preparation systems should involve the description of what is required, and not the specification of how it should be obtained.

   To realise this principle in practice, all the steps described above need to be automated, informed by the description of what is required. Providing a description of what is required involves the data scientist in understanding the problem to be solved, but this is what the data scientist should be focusing on.³
   In this setting, we assume that it would be desirable for a system to take more responsibility for data preparation tasks than is currently the case, if there can be confidence that the system will perform as well as a human expert. As a result, in this paper we explore three questions:
   • Can we automate? There has been previous work on automating individual data preparation steps. Section 2 reviews such work, and the information that is used to inform automation.
   • Should we automate? There are likely data preparation steps for which automation can be expected to out-perform a human expert, and vice versa. Section 3 discusses which steps seem most amenable to automation, and the situations in which human experts are likely to be more difficult to replace.
   • Must we automate? There are likely settings in which either the scale of the data preparation task or the resources available preclude a labour intensive approach. Section 4 considers where automation may be an imperative, i.e., where a lack of automation means that an investigation cannot be carried out.

³ We note that there are also likely to be further data preparation steps that are more closely tied to the analysis to be undertaken [29]. Such steps are also both important and potentially time consuming, and are likely to benefit from some level of automation, but are beyond the scope of this paper.
Stage                    What is Automated                                       What Evidence is Used            Citation
Data discovery           The search for unionable data sets                      A populated target table         [26]
Data extraction          The creation of extraction rules                        Training data, feedback          [11]
Data profiling           Annotations on dependencies, keys, ...                  The source data                  [27]
Format transformation    The learning of transformations                         Training examples                [15]
Source selection         Identification of the sources matching user criteria   Source criteria                  [2]
Matching                 Identification of related attributes                    Schema and instances             [3]
Mapping                  The search for queries that can populate a target       Source schema, target examples   [28]
Data repair              The correction of constraint violations                 Master data                      [13]
Duplicate detection      Setting parameters, comparison functions                Training data, feedback          [25]
Data fusion              Selection of values across sources                      Other data sources               [12]

Table 1: Proposals for automating data preparation steps.



2 CAN WE?
What might an automated approach to data preparation look like? In our principle for data preparation automation in Section 1, it is proposed that the user should provide a description of what is required. This means that it can be assumed that the user is familiar with the application in which data preparation is to take place, and can provide information about it.
   What sort of information may be required by an automated system? Table 1 lists proposals for automating each of the data preparation steps listed in Section 1; for each of these steps, other proposals exist that either use different techniques to underpin the automation or different evidence to inform it.
   It cannot be assumed that the automation of data preparation can take place without some supplementary evidence to inform the step. In Table 1, the What Evidence is Used column outlines what data is used by each proposal to inform the decisions it makes. In some cases, rather little supplementary data is required. For example, Data Profiling acts on source data directly, and Matching builds on source data and metadata.
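As an illustration of evidence that can be derived from the source data alone, the following toy sketch computes single-column candidate keys and inclusion dependencies over small in-memory tables. It only conveys the kind of annotation a profiler produces; systems such as Metanome [27] do this at scale with far more sophisticated algorithms, and nothing here reflects their APIs.

from itertools import permutations

def candidate_keys(table):
    # A column is a (single-column) candidate key if its values are unique.
    return [c for c in table[0]
            if len({row[c] for row in table}) == len(table)]

def inclusion_dependencies(tables):
    # Report column pairs (t1.c1, t2.c2) where every value of c1 in t1
    # also appears in c2 of t2: evidence that the tables may join.
    inds = []
    for (n1, t1), (n2, t2) in permutations(tables.items(), 2):
        for c1 in t1[0]:
            for c2 in t2[0]:
                if {r[c1] for r in t1} <= {r[c2] for r in t2}:
                    inds.append((f"{n1}.{c1}", f"{n2}.{c2}"))
    return inds

orders = [{"order_id": 1, "cust": 10}, {"order_id": 2, "cust": 11}]
customers = [{"cust_id": 10}, {"cust_id": 11}, {"cust_id": 12}]
print(candidate_keys(orders))
print(inclusion_dependencies({"orders": orders, "customers": customers}))
# -> [('orders.cust', 'customers.cust_id')]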
   However, in general, automation stands to benefit from additional evidence. As an example, for Format Transformation, the cited method infers format transformation programs from examples. For example, if we have the source names Robert Allen Zimmerman and Farrokh Bulsara⁴, and the target names R. Zimmerman and F. Bulsara, a format transformation program can be synthesized for reformatting extended names into abbreviated names consisting of an initial, a dot, and the surname [15]. So, instead of the data scientist writing format transformation programs, the programs are synthesized. This reduces the programming burden, but still requires that examples are provided, which may be difficult and time consuming. However, there are also proposals for discovering the examples, for example making use of web tables [1] or instance data (such as master data) for a target representation [8].

⁴ The birth names of Bob Dylan and Freddie Mercury, in case you are wondering.
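The cited approach searches a rich space of string-manipulation programs. The following deliberately tiny sketch conveys the flavour: it infers a token-level rule from a single input/output example and applies it to new values. It is a simplification for illustration, not the synthesis algorithm of [15], which handles a vastly larger program space.

def infer_program(src, tgt):
    # Express each output token in terms of an input token: either copied
    # verbatim or abbreviated to its initial plus a dot. Token positions in
    # the latter half of the input are recorded relative to the end, so the
    # rule generalises to names with a different number of middle names.
    s_toks, t_toks = src.split(), tgt.split()
    prog = []
    for t in t_toks:
        for i, s in enumerate(s_toks):
            pos = i if i < len(s_toks) / 2 else i - len(s_toks)
            if t == s:
                prog.append(("copy", pos)); break
            if t == s[0] + ".":
                prog.append(("initial", pos)); break
        else:
            return None  # this tiny rule space cannot explain the example
    return prog

def apply_program(prog, src):
    toks = src.split()
    return " ".join(toks[p] if op == "copy" else toks[p][0] + "."
                    for op, p in prog)

prog = infer_program("Robert Allen Zimmerman", "R. Zimmerman")
print(apply_program(prog, "Farrokh Bulsara"))       # -> F. Bulsara
print(apply_program(prog, "Norman William Paton"))  # -> N. Paton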
   An important feature of automated approaches is that they can often generate (large numbers of) alternative proposals. For example, a mapping generation process is likely to produce multiple candidate mappings, and a duplicate detection program can likely generate alternative comparison rules and thresholds. As a result, a popular approach is to generate a solution automatically, and then request feedback on the results. This feedback can then be used to refine individual steps within the automated data preparation process (e.g., for mapping generation [5, 9, 34] or for duplicate detection [14, 25]) or to influence the behaviour of the complete data preparation pipeline (e.g., [22]). The provision of feedback involves some effort from the user, but builds on knowledge of the domain, and does not require the user to take fine grained control over how data preparation is being carried out.
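One schematic way to picture feedback-driven refinement is as re-ranking of automatically generated candidates against the tuples a user has marked correct or incorrect. The sketch below illustrates the general idea only; it is not the algorithm of any of the cited proposals, and the candidates and feedback are invented.

def score(candidate_output, feedback):
    # feedback maps a result tuple to True (correct) or False (incorrect).
    s = 0
    for tup, ok in feedback.items():
        produced = tup in candidate_output
        if produced and ok:
            s += 1   # candidate keeps a confirmed tuple
        elif produced and not ok:
            s -= 1   # candidate keeps a known-bad tuple
        elif not produced and ok:
            s -= 1   # candidate misses a confirmed tuple
    return s

candidates = {
    "m1": {("Bob Dylan", 1941), ("noise", 0)},
    "m2": {("Bob Dylan", 1941), ("Freddie Mercury", 1946)},
}
feedback = {("Bob Dylan", 1941): True, ("noise", 0): False}
best = max(candidates, key=lambda m: score(candidates[m], feedback))
print(best)  # -> m2, the candidate that agrees with all feedback so far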
   Up to this point, the focus has been on automation of individual steps within the data preparation process; most results to date have involved individual steps, but there are now a few more end-to-end proposals. In Data Tamer [33]⁵, a learning-based approach is taken to instance-level data integration, in particular focusing on aligning schemas through matching, and bringing together the data about application concepts through duplicate detection and data fusion. In Data Tamer, the approach is semi-automatic, in that the automatically produced results of different steps are reviewed by users, so the principal forms of evidence deployed are feedback and training data. In VADA [21], all of format transformation, source selection, matching, mapping and data repair are informed by evidence in the form of the data context [20], instance values that are aligned with a subset of the target schema (e.g., master data or example values). Furthermore, feedback on the automatically produced results can be used to revisit several of the steps within the automated process [22]. These proposals both satisfy our principle that the user should provide information about the domain of application, and not about how to wrangle the data.

⁵ Commercialised as Tamr: https://www.tamr.com/
3 SHOULD WE?
In considering when or whether to automate data preparation steps, it seems important to understand the consequences of automation for the quality of the result, both for individual steps and for complete data preparation processes. In enterprise data integration, for example for populating data warehouses using ETL tools, the standard practice is for data engineers to craft well understood ETL steps, and to work on these steps and their dependencies until there is high confidence that the result is of good quality. It is then expected that analyses over the data warehouse will provide dependable results. This high-cost, high-quality setting is both important and well established, and may represent a class of application for which expert authoring of ETL processes will continue to be appropriate. In such settings, the warehouse is primarily populated using data from inside the organisation, typically from a moderate number of stable and well understood transactional databases, to support management reporting. However, there are other important settings for data preparation and analytics; for example, any analysis over a data lake is likely to be faced with numerous, highly heterogeneous and rapidly changing data sources, of variable quality and relevance, for which a labour-intensive approach is less practical. Yet such data lakes provide new opportunities, for example for analysing external and internal data sets together [23]. In such a setting, an important question is: what are the implications for the quality of the result from the use of automated techniques?
   It seems that there have been few studies on the effectiveness of automated techniques in direct comparison with manual approaches, but there are a few specific studies:
   Format Transformation: Bartoli et al. [4] have developed techniques for generating regular expressions from training examples, for extracting data such as URLs or dates from documents. In a study comparing the technique with human users, it was found that the generated regular expressions were broadly as effective (in terms of F-measure) as those of the most experienced group of humans, while taking significantly less time to produce. There is also a usability study on semi-automatic approaches for format transformation [16], in which the system (Wrangler) suggests transformations to users. In this study, the users made rather sparing use of the system-generated transformations, and completion times were similar with and without the suggested transformations. This study at least calls into question the effectiveness of a semi-automated approach.
   Mapping generation: Qian et al. [30] have developed a system for generating schema mappings from example values in the target schema. An experimental evaluation found that mapping construction was substantially quicker when based on examples than when using a traditional mapping development system, in which the user curates matches and is provided with generated mappings to refine [7]. This study at least suggests that the provision of instance data to inform automated data preparation may be a practical option.
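In the same spirit as sample-driven mapping, the toy search below proposes, for a handful of example target rows, the source column combinations whose values cover them. It is far simpler than the system of [30] (no joins, no ranking), and all data and names are illustrative.

from itertools import permutations

def covering_mappings(sources, examples):
    # Propose (source, column positions) pairs whose projection contains
    # every example target row.
    width = len(examples[0])
    found = []
    for name, rows in sources.items():
        for cols in permutations(range(len(rows[0])), width):
            projected = {tuple(r[c] for c in cols) for r in rows}
            if all(e in projected for e in examples):
                found.append((name, cols))
    return found

sources = {
    "artists": [("Bob Dylan", 1941, "US"), ("Freddie Mercury", 1946, "UK")],
    "albums": [("Blonde on Blonde", 1966)],
}
examples = [("Bob Dylan", 1941), ("Freddie Mercury", 1946)]
print(covering_mappings(sources, examples))  # -> [('artists', (0, 1))]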
   Overall, evidence on the quality of the results of automated data preparation in direct comparison with manual approaches seems to be quite hard to come by in the literature, and further studies would be valuable. However, research papers on automated techniques often report empirical evaluations of their absolute performance and/or performance against a computational baseline, which provides evidence that such techniques can provide respectable results. Furthermore, there are also empirical evaluations of the impact of feedback on results; these show significant variety. In some problems substantial improvements are observed with modest amounts of feedback (e.g., [25]), and in some cases more substantial samples are required (e.g., [31]). The amount of feedback required for refining a solution partly depends on the role it is playing, and it seems important to the cost-effectiveness of feedback collection for the same feedback to be used for more than one task [22]. We note that some feedback-based proposals obtain feedback on the final data product (e.g., [6, 9, 11, 34]), but that in other proposals the feedback is more tightly coupled to a single step in the data integration process (e.g., for entity resolution [25, 36]) or to the specific method being used to generate a solution (e.g., for matching [17] or mapping generation [10]).
   Should automation be focused on individual steps or on the end-to-end data preparation process? Likely this depends on the task and environment at hand. Where data preparation involves programming, data engineers have complete control over how the data is manipulated, and thus bespoke processing and complex transformations are possible. End-to-end automation will not be able to provide the same levels of customization as are available to programmers. As a result, there is certainly scope for automating certain steps within an otherwise manual process, although the potential cost savings, and the synergies between automated steps, will not be as substantial as with end-to-end automation. Furthermore, we note that avoiding programming is a common requirement in self-service data preparation [32].

4 MUST WE?
Are there circumstances in which the only option is to automate? It seems that automation must be used if the alternative is to leave the task undone; in such situations, a best-effort automated approach creates opportunities for obtaining value from data that would otherwise be missed. Here are two situations where automation seems to be the only option:
   • The task presents challenges that are punishing for manual approaches. The big data movement is associated with the production of ever larger numbers of data sources, from which value can potentially be achieved by bringing the data together in new ways. The Variety, Veracity and Velocity features of big data militate against the use of manual data preparation processes, where specific cleaning and integration steps may need to be developed for each new format of data set. It seems likely that manually produced data preparation processes will always lag behind the available data. In particular, the growth of open data sets and the development of data lakes present opportunities for exploratory analyses that require flexible and rapid data preparation, even if the results may not be as carefully curated as a human expert could produce given sufficient time.
   • The resources are not available to enable a thorough, more manual approach. The knowledge economy doesn't only consist of large enterprises; e.g., as noted in the UK government's Information Economy Strategy⁶, the overwhelming majority of information economy businesses – 95% of the 120,000 enterprises in the sector – employ fewer than 10 people. As a result, many small and medium sized enterprises are active in data science, but cannot employ large teams or have large budgets for data preparation. For example, an e-Commerce start-up that seeks to compare its prices with those of competitors, or a local house builder that is trying to understand pricing trends in a region, may need to carry out analyses over a collection of data sets, but may not employ a team of data scientists.

⁶ https://www.gov.uk/government/publications/information-economy-strategy
   What about the individual steps within data preparation, from Table 1? Are there cases in which an automated approach seems the most likely to succeed? The following seem like cases where it may be difficult to produce good results without automation:
   • Matching: Identifying the relationships between the attributes of n sources involves n² pairwise comparisons (see the sketch after this list); even manually curating the results of such automated comparisons is a significant task.
   • Mapping: Exploring how data sets can be combined potentially involves considering all permutations; again, any manual exploration of how data sets can be combined for large numbers of sources seems likely to miss potentially useful solutions.
   • Entity Resolution: Entity resolution strategies need to configure significant numbers of parameters (typically in all of blocking, pairwise comparison and clustering), as well as defining a comparison function; this is a difficult, multi-dimensional search space for a human to navigate.
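The following small sketch shows why matching resists manual effort: every pair of attributes across every pair of sources must be compared, so the work grows quadratically with the number of sources. The crude Jaccard similarity on instance values stands in for the much richer matchers of systems such as COMA++ [3]; the data and the 0.4 threshold are illustrative only.

from itertools import combinations

def jaccard(a, b):
    # Overlap of instance values as a crude signal that columns match.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

sources = {
    "s1": {"zip": ["M13", "M1"], "name": ["Ann", "Bo"]},
    "s2": {"postcode": ["M1", "M13"], "person": ["Bo", "Cy"]},
    "s3": {"pcode": ["M13"], "city": ["Manchester"]},
}
for (n1, cols1), (n2, cols2) in combinations(sources.items(), 2):
    for c1, v1 in cols1.items():
        for c2, v2 in cols2.items():
            sim = jaccard(v1, v2)
            if sim > 0.4:  # an arbitrary threshold, itself a parameter
                print(f"{n1}.{c1} ~ {n2}.{c2}: {sim:.2f}")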
   These challenges at the level of individual steps are compounded when considering a pipeline of operations; we have the experience that the best results come when parameter setting across multiple steps is coordinated [25]. Again, manual multi-component tuning is likely to be difficult in practice.
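To give a feel for what such parameter setting involves, the sketch below grid-searches a single duplicate-detection threshold against a few labelled pairs, scoring by F-measure. This is a toy with invented data; the configuration problem addressed in [25] spans many interacting parameters across blocking, comparison and clustering, which is precisely why coordinated, automated search pays off.

import re

def tokens(s):
    return set(re.findall(r"\w+", s.lower()))

def similarity(a, b):
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

labelled = [  # (record a, record b, is_duplicate)
    ("Bob Dylan", "Dylan, Bob", True),
    ("Bob Dylan", "Freddie Mercury", False),
    ("F. Mercury", "Freddie Mercury", True),
]

def f_measure(threshold):
    tp = fp = fn = 0
    for a, b, dup in labelled:
        predicted = similarity(a, b) >= threshold
        tp += int(predicted and dup)
        fp += int(predicted and not dup)
        fn += int(not predicted and dup)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

best = max((t / 10 for t in range(1, 10)), key=f_measure)
print(f"best threshold {best}, F-measure {f_measure(best):.2f}")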
5 CONCLUSIONS
This paper has discussed the hypothesis that data preparation should be automated, with the many components being configured for specific sets of sources on the basis of information about the target. Three questions have been considered:
   Can we? There are significant results on the automation of individual steps, and several proposals for end-to-end automation, where the steps are informed by data about the intended outcome of the process, typically in the form of training data or examples. For the future, further work on each of the steps, for example to use different sorts of evidence about the target, should increase the applicability of automated methods. Early work on automating end-to-end data preparation seems promising, but there is likely much more to do.
   Should we? A case can be made that automating many of the steps should be able to produce results that are at least as good as a human expert would manage, especially for large applications. There is a need for more systematic evaluation of automated techniques in comparison with human experts, to identify when automation can already be trusted to produce solutions that compete with those of experts, and when the automated technique or the evidence used can usefully be revisited.
   Must we? There will be tasks that are out of reach for manual approaches. These may not only be the large and challenging tasks; if your budget is x and the cost of manual data preparation is 2x, then the task is out of reach. As in many cases the available budget may be severely constrained, there is likely to be a market for automated techniques in small to medium sized organisations, where at the moment more manual approaches are rather partial (e.g., investigating only a small subset of the available data). In addition, with the data lakes market predicted to grow at a 28% compound annual growth rate to $28B by 2023⁷, efficient techniques for exploratory analyses over data lakes are likely to be in growing demand.

Acknowledgement: Research into Data Preparation at Manchester is funded by the UK Engineering and Physical Sciences Research Council (EPSRC) through the VADA Programme Grant.

⁷ https://www.marketresearchfuture.com/reports/data-lakes-market-1601

REFERENCES
 [1] Ziawasch Abedjan, John Morcos, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, and Michael Stonebraker. 2016. DataXFormer: A robust transformation discovery system. In 32nd IEEE International Conference on Data Engineering, ICDE. 1134–1145. https://doi.org/10.1109/ICDE.2016.7498319
 [2] Edward Abel, John Keane, Norman W. Paton, Alvaro A. A. Fernandes, Martin Koehler, Nikolaos Konstantinou, Julio Cesar Cortes Rios, Nurzety A. Azuan, and Suzanne M. Embury. 2018. User driven multi-criteria source selection. Information Sciences 430-431 (2018), 179–199. https://doi.org/10.1016/j.ins.2017.11.019
 [3] David Aumueller, Hong Hai Do, Sabine Massmann, and Erhard Rahm. 2005. Schema and ontology matching with COMA++. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Baltimore, Maryland, USA, June 14-16, 2005. 906–908. https://doi.org/10.1145/1066157.1066283
 [4] Alberto Bartoli, Andrea De Lorenzo, Eric Medvet, and Fabiano Tarlao. 2016. Inference of Regular Expressions for Text Extraction from Examples. IEEE Trans. Knowl. Data Eng. 28, 5 (2016), 1217–1230. https://doi.org/10.1109/TKDE.2016.2515587
 [5] Khalid Belhajjame, Norman W. Paton, Suzanne M. Embury, Alvaro A. A. Fernandes, and Cornelia Hedeler. 2010. Feedback-based annotation, selection and refinement of schema mappings for dataspaces. In EDBT. 573–584. https://doi.org/10.1145/1739041.1739110
 [6] Khalid Belhajjame, Norman W. Paton, Suzanne M. Embury, Alvaro A. A. Fernandes, and Cornelia Hedeler. 2013. Incrementally improving dataspaces based on user feedback. Inf. Syst. 38, 5 (2013), 656–687. https://doi.org/10.1016/j.is.2013.01.006
 [7] Philip A. Bernstein and Laura M. Haas. 2008. Information integration in the enterprise. CACM 51, 9 (2008), 72–79. https://doi.org/10.1145/1378727.1378745
 [8] Alex Bogatu, Norman W. Paton, and Alvaro A. A. Fernandes. 2017. Towards Automatic Data Format Transformations: Data Wrangling at Scale. In Data Analytics - 31st British International Conference on Databases, BICOD. 36–48. https://doi.org/10.1007/978-3-319-60795-5_4
 [9] Angela Bonifati, Radu Ciucanu, and Slawek Staworko. 2014. Interactive Inference of Join Queries. In 17th International Conference on Extending Database Technology, EDBT. 451–462. https://doi.org/10.5441/002/edbt.2014.41
[10] Angela Bonifati, Ugo Comignani, Emmanuel Coquery, and Romuald Thion. 2017. Interactive Mapping Specification with Exemplar Tuples. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 14-19, 2017. 667–682. https://doi.org/10.1145/3035918.3064028
[11] Valter Crescenzi, Paolo Merialdo, and Disheng Qiu. 2015. Crowdsourcing large scale wrapper inference. Distributed and Parallel Databases 33, 1 (2015), 95–122. https://doi.org/10.1007/s10619-014-7163-9
[12] Xin Luna Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Kevin Murphy, Shaohua Sun, and Wei Zhang. 2014. From Data Fusion to Knowledge Fusion. PVLDB 7, 10 (2014), 881–892. https://doi.org/10.14778/2732951.2732962
[13] Wenfei Fan and Floris Geerts. 2012. Foundations of Data Quality Management. Morgan & Claypool.
[14] Chaitanya Gokhale, Sanjib Das, AnHai Doan, Jeffrey F. Naughton, Narasimhan Rampalli, Jude W. Shavlik, and Xiaojin Zhu. 2014. Corleone: hands-off crowdsourcing for entity matching. In SIGMOD. 601–612. https://doi.org/10.1145/2588555.2588576
[15] Sumit Gulwani, William R. Harris, and Rishabh Singh. 2012. Spreadsheet data manipulation using examples. Commun. ACM 55, 8 (2012), 97–105. https://doi.org/10.1145/2240236.2240260
[16] Philip J. Guo, Sean Kandel, Joseph M. Hellerstein, and Jeffrey Heer. 2011. Proactive wrangling: mixed-initiative end-user programming of data transformation scripts. In Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, Santa Barbara, CA, USA, October 16-19, 2011. 65–74. https://doi.org/10.1145/2047196.2047205
[17] Nguyen Quoc Viet Hung, Nguyen Thanh Tam, Zoltán Miklós, Karl Aberer, Avigdor Gal, and Matthias Weidlich. 2014. Pay-as-you-go reconciliation in schema matching networks. In IEEE 30th International Conference on Data Engineering ICDE. 220–231. https://doi.org/10.1109/ICDE.2014.6816653
[18] Sean Kandel, Andreas Paepcke, Joseph M. Hellerstein, and Jeffrey Heer. 2011. Wrangler: Interactive Visual Specification of Data Transformation Scripts. In CHI. 3363–3372.
[19] Jacqueline Kazil and Katharine Jarmul. 2016. Data Wrangling in Python. O'Reilly.
[20] Martin Koehler, Alex Bogatu, Cristina Civili, Nikolaos Konstantinou, Edward Abel, Alvaro A. A. Fernandes, John A. Keane, Leonid Libkin, and Norman W. Paton. 2017. Data context informed data wrangling. In 2017 IEEE International Conference on Big Data, BigData 2017. 956–963. https://doi.org/10.1109/BigData.2017.8258015
[21] Nikolaos Konstantinou, Martin Koehler, Edward Abel, Cristina Civili, Bernd
     Neumayr, Emanuel Sallinger, Alvaro A. A. Fernandes, Georg Gottlob, John A.
     Keane, Leonid Libkin, and Norman W. Paton. 2017. The VADA Architecture
     for Cost-Effective Data Wrangling. In ACM SIGMOD. 1599–1602. https:
     //doi.org/10.1145/3035918.3058730
[22] Nikolaos Konstantinou and Norman W. Paton. 2019. Feedback Driven Im-
     provement of Data Preparation Pipelines. In Proc. 21st International Workshop
     on Design, Optimization, Languages and Analytical Processing of Big Data
     (DOLAP). CEUR.
[23] Jorn Lyseggen. 2017. Outside Insight Navigating a World Drowning in Data.
     Penguin.
[24] Mark A. Beyer, Eric Thoo, and Ehtisham Zaidi. 2018. Magic Quadrant for Data
     Integration Tools. Technical Report. Gartner. G00340493.
[25] Ruhaila Maskat, Norman W. Paton, and Suzanne M. Embury. 2016. Pay-as-you-
     go Configuration of Entity Resolution. T. Large-Scale Data- and Knowledge-
     Centered Systems 29 (2016), 40–65. https://doi.org/10.1007/978-3-662-54037-4_
     2
[26] Fatemeh Nargesian, Erkang Zhu, Ken Q. Pu, and Renée J. Miller. 2018. Table
     Union Search on Open Data. PVLDB 11, 7 (2018), 813–825. https://doi.org/10.
     14778/3192965.3192973
[27] Thorsten Papenbrock, Tanja Bergmann, Moritz Finke, Jakob Zwiener, and
     Felix Naumann. 2015. Data Profiling with Metanome. Proceedings of the VLDB
     Endowment 8, 12 (2015), 1860–1863. https://doi.org/10.14778/2824032.2824086
[28] Fotis Psallidas, Bolin Ding, Kaushik Chakrabarti, and Surajit Chaudhuri.
     2015. S4: Top-k Spreadsheet-Style Search for Query Discovery. In Pro-
     ceedings of the 2015 ACM SIGMOD International Conference on Management
     of Data, Melbourne, Victoria, Australia, May 31 - June 4, 2015. 2001–2016.
     https://doi.org/10.1145/2723372.2749452
[29] Dorian Pyle. 1999. Data Preparation for Data Mining. Morgan Kaufmann.
[30] Li Qian, Michael J. Cafarella, and H. V. Jagadish. 2012. Sample-driven schema
     mapping. In Proceedings of the ACM SIGMOD International Conference on
     Management of Data, SIGMOD 2012, Scottsdale, AZ, USA, May 20-24, 2012.
     73–84. https://doi.org/10.1145/2213836.2213846
[31] Julio César Cortés Ríos, Norman W. Paton, Alvaro A. A. Fernandes, and Khalid
     Belhajjame. 2016. Efficient Feedback Collection for Pay-as-you-go Source
     Selection. In Proceedings of the 28th International Conference on Scientific and
     Statistical Database Management, SSDBM 2016, Budapest, Hungary, July 18-20,
     2016. 1:1–1:12. https://doi.org/10.1145/2949689.2949690
[32] Rita L. Sallam, Paddy Forry, Ehtisham Zaidi, and Shubhangi Vashisth. 2016.
     Market Guide for Self-Service Data Preparation. Technical Report. Gartner.
[33] Michael Stonebraker, Daniel Bruckner, Ihab F. Ilyas, George Beskales, Mitch
     Cherniack, Stanley B. Zdonik, Alexander Pagan, and Shan Xu. 2013. Data
     Curation at Scale: The Data Tamer System. In CIDR 2013, Sixth Biennial Con-
     ference on Innovative Data Systems Research, Asilomar, CA, USA, January 6-9,
     2013, Online Proceedings. http://www.cidrdb.org/cidr2013/Papers/CIDR13_
     Paper28.pdf
[34] Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby
     Crammer, Zachary G. Ives, Fernando C. N. Pereira, and Sudipto Guha. 2008.
     Learning to create data-integrating queries. PVLDB 1, 1 (2008), 785–796.
     https://doi.org/10.14778/1453856.1453941
[35] Panos Vassiliadis. 2011. A Survey of Extract-Transform-Load Technology.
     IJDWM 5, 3 (2011), 1–27.
[36] Jiannan Wang, Tim Kraska, Michael J. Franklin, and Jianhua Feng. 2012.
     CrowdER: Crowdsourcing Entity Resolution. PVLDB 5, 11 (2012), 1483–1494.
     https://doi.org/10.14778/2350229.2350263