<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Conceptual Constraints for Data Quality in Data Lakes</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Paolo Ciaccia</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Davide Martinenghi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Riccardo Torlone</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dipartimento di Elettronica, Informazione e Bioingegneria</institution>
          ,
          <addr-line>Politecnico di Milano</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dipartimento di Informatica - Scienza e Ingegneria, Università di Bologna</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Dipartimento di Ingegneria, Università Roma Tre</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>A data lake is a loosely structured collection of data at scale, built for analysis purposes, that is initially fed with almost no data quality requirements. This approach aims at eliminating any effort before the actual exploitation of the data, but the problem is only delayed, since robust and defensible data analysis can only be performed after very complex data preparation activities. In this paper, we address this problem by proposing a novel and general approach to data curation in data lakes based on: (i) the specification of integrity constraints over a conceptual representation of the data lake and (ii) the automatic translation and enforcement of such constraints over the actual data. We discuss the advantages of this idea and the challenges behind its implementation.</p>
      </abstract>
      <kwd-group>
        <kwd>Data Lake</kwd>
        <kwd>Schema</kwd>
        <kwd>Constraints</kwd>
        <kwd>Metadata</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In traditional big data analysis, activities such as cleaning, transforming, and integrating source
data are essential but they usually make knowledge extraction a very long and tedious process.
For this reason, data-driven organizations have recently adopted an agile strategy that dismisses
any data processing before their actual consumption. This is done by building and maintaining
a repository, called “data lake”, for storing any kind of data in its native format. A dataset in the
lake is usually just a collection of raw data, either gathered from internal applications (e.g., logs
or user-generated data) or from external sources (e.g., open data), that is made persistent on a
storage system, usually distributed, “as is”, without going through an ETL process.</p>
      <p>
        Unfortunately, reducing the engineering effort upfront just delays the traditional issues
of data pre-processing since this approach does not eliminate the need for high quality data
and schema understanding. Therefore, to guarantee reliable results, a long process of data
preparation (a.k.a. data wrangling) is required over the portion of the data lake that is relevant
for a business purpose before any meaningful analysis can be performed on it [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ]. This
process typically consists of pipelines of operations such as: source and feature selection, data
enrichment, data transformation, data curation, and data integration. A number of
state-of-the-art applications can support these activities, including: (i) data and metadata catalogs, for
understanding and selecting the appropriate datasets [
        <xref ref-type="bibr" rid="ref4 ref5 ref6 ref7">4, 5, 6, 7</xref>
        ]; (ii) tools for full-text indexing,
for providing keyword search and other advanced search capabilities [
        <xref ref-type="bibr" rid="ref6 ref8">8, 6</xref>
        ]; (iii) data profilers,
for collecting meta-information from datasets [
        <xref ref-type="bibr" rid="ref1 ref8 ref9">1, 8, 9</xref>
        ]; (iv) distributed data processing engines
like Spark [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], and (v) tools and libraries for data manipulation and analysis, such as Pandas (https://pandas.pydata.org/)
and Scikit-learn (https://scikit-learn.org/), in conjunction with data science notebooks, such as Jupyter (https://jupyter.org/) and Zeppelin (https://zeppelin.apache.org/).
Still, data preparation is an involved, fragmented and time-consuming process, thus making the
extraction of valuable knowledge from the lake hard.
      </p>
      <p>
        In this scenario, we argue that the availability of a high-level, conceptual representation of the
data lake is fundamental, not only for data discovery, understanding, and searching [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ], but
also for evaluating and possibly improving the quality of data. This is because a representation
of the real-world concepts and relationships that the data capture (e.g., employees, customers,
products, locations, sales, and so on) provides an ideal setting for identifying the constraints
that hold in the application domain of reference (e.g., the fact that, for business purposes, all
the products for sale must be classified in categories). If we are able to map and enforce such
constraints on the underlying data, their quality naturally improves and makes the subsequent
analysis more effective and less prone to errors.
      </p>
      <p>
        Building on this idea, in this vision paper we propose a principled approach to data curation in
data lakes based on the identification and enforcement of conceptual constraints. The approach
is based on the following main activities: (1) the gathering of metadata from the data lake (or
from a portion of interest for a specific business goal) in the form of a conceptual schema, (2) the
analysis of the conceptual schema and the specification of integrity constraints over it, (3) the
automatic translation of the constraints defined at the conceptual level into constraints over
the datasets in the data lake, (4) the enforcement of the integrity constraints so obtained over
the actual data. While there is a large body of work on extracting and collecting metadata
from data sources [
        <xref ref-type="bibr" rid="ref1 ref8 ref9">1, 8, 9</xref>
        ] and on repairing data given a set of integrity constraints [
        <xref ref-type="bibr" rid="ref13 ref14 ref15">13, 14, 15</xref>
        ],
corresponding to steps (1) and (4) above, to our knowledge the issue of exploiting conceptual
representations for data lake curation has never been explored before.
      </p>
      <p>The rest of the paper is devoted to the presentation of some initial steps towards this goal.</p>
      <p>Specifically, in Section 2 we state the problem by recalling the typical data life-cycle in a data
lake and by illustrating, in this framework, our proposal for data curation. Then, in Section 3 we
introduce the basic notions (datasets, schemas, constraints, and mappings) underlying our approach.
This is done by means of very general definitions, in order to make the approach independent
of any specific data model and format. In Section 4 we provide some details of our solution
through an example. Finally, in Section 5 we discuss related work, the main issues involved
in the implementation of our proposal, and the work that needs to be done to tackle these issues.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Data Quality in Data Lakes</title>
      <sec id="sec-2-1">
        <title>2.1. Data life-cycle</title>
        <p>The typical data life-cycle in a data lake is illustrated in Figure 1, in which blue boxes represent
activities and green ones represent repositories of persistent data. The following main phases
are usually involved in this process.</p>
        <p>1. During data ingestion, raw copies of source data are stored in their native format (e.g.,
relational, CSV, XML, JSON, or just text) in a centralized repository. Usually, a simple file
system, possibly distributed, is used for this purpose.
2. In the data preparation step, data that are relevant for a specific business goal are extracted
from the central repository and suitably transformed into a curated form so as to be
effectively used for analysis purposes. This activity includes various tasks, such as data
cleaning, standardization, enrichment, and integration. During this stage, data is usually
stored into a more advanced system for data management (e.g., a relational or a NoSQL
database store), which allows the specialists to specify the constraints that need to be
enforced for guaranteeing an adequate level of data quality.
3. Data analysis includes the final activities of knowledge extraction from curated data,
which may involve a broad spectrum of techniques, based on statistics, data mining, and
machine learning. Also in this case, the output is usually stored in a persistent database
to simplify the final consumption of the results of the analysis by means of
various forms of data visualization.</p>
        <p>
          As highlighted in Figure 1, the management of metadata plays a fundamental role along all
the above mentioned activities. This is done by building and maintaining a repository of
information describing, possibly at different levels of abstraction, all the various kinds of data
that are produced in the various stages of data processing occurring in the data lake [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Note
also that the processes of data preparation and data analysis are iterative, since the quality of
both the data and the results of the analysis usually improves only progressively.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Using conceptual constraints for data curation</title>
        <p>In this scenario, we envisage the need for a conceptual representation of the metadata describing
the content of interest of the data lake, which we call the conceptual schema. This involves
concepts (such as entities, relationships, and generalizations) that map to the actual components
(such as attributes, documents, and labels) of datasets stored in the data lake.</p>
        <p>The availability of a conceptual schema 𝒮 of a data lake ℒ can provide a number of important
benefits:
1. it allows the analysts to have a general and system-independent vision of the data available
in ℒ,
2. it provides an abstract view of the data lake content, which can be used to define and
possibly specify queries over ℒ, and
3. it allows the specification of real-world constraints that, enforced on ℒ, improve the
overall quality of its content.</p>
        <p>
          In this paper, we focus on problem 3 above, which, to the best of our knowledge, has not been
studied before. As shown graphically in Figure 2, it basically requires the tasks that follow.
1. A (portion of interest of a) data lake ℒ is initially transformed into a “standardized”
version, obtained by adapting source data to the format of the data storage system chosen
for the curated layer.
2. The skeleton 𝒮̂ of a conceptual schema is built from ℒ. Basically, 𝒮̂ includes the main
entities and relationships involved in ℒ as well as a mapping between the components of ℒ
and the elements of 𝒮̂. This task can be done manually and/or using available techniques
and tools for semantic annotation or column-type discovery in data lakes [
          <xref ref-type="bibr" rid="ref16 ref17 ref18">16, 17, 18</xref>
          ].
3. 𝒮̂ is refined, possibly incrementally, into an “evolved” schema 𝒮 by adding a collection of
real-world constraints, for instance by stating that an entity is a special case of another
entity or that an entity can only participate in a single occurrence of a certain relationship.
Typically, this step requires knowledge of the specific domain (e.g., that a department
has a single manager).
4. The constraints represented by 𝒮 are mapped to constraints 𝒞 over the actual data stored
in ℒ. 𝒞 can be expressed in several ways, depending on the system used to store and
manage ℒ.
5. The constraints 𝒞 are enforced on ℒ. Again, this can be done in several ways, depending
on the tools available for storing and manipulating data in the data lake [
          <xref ref-type="bibr" rid="ref15 ref19">15, 19</xref>
          ].
        </p>
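        <p>To make the five tasks above concrete, the following minimal Python sketch mirrors the pipeline on toy data; all function names and data shapes are our own illustrative choices, not part of the proposal itself.</p>

```python
# Hypothetical sketch of the five-step curation pipeline; names are ours.

def standardize(raw_datasets):
    """Step 1: adapt source data to the curated layer's format (here: lists of dicts)."""
    return {name: [dict(item) for item in items] for name, items in raw_datasets.items()}

def build_skeleton(datasets):
    """Step 2: derive a schema skeleton: one entity per dataset, mapped to its attributes."""
    return {name: sorted({attr for item in items for attr in item})
            for name, items in datasets.items()}

def refine(skeleton, constraints):
    """Step 3: the analyst enriches the skeleton with real-world constraints."""
    return {"entities": skeleton, "constraints": list(constraints)}

def translate(schema):
    """Step 4: compile each conceptual constraint into a check over the datasets."""
    checks = []
    for kind, entity, attr in schema["constraints"]:
        if kind == "unique":
            checks.append((entity, attr,
                           lambda items, a=attr: len(items) == len({i[a] for i in items})))
    return checks

def enforce(datasets, checks):
    """Step 5: evaluate every check; report the violated (dataset, attribute) pairs."""
    return [(name, attr) for name, attr, ok in checks if not ok(datasets[name])]
```

        <p>For example, a duplicated DeptCode in a dataset D_Dept would be reported by the last step as a violation of a uniqueness constraint stated at the conceptual level.</p>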
      <p>We note that no existing work has specifically addressed point 4 of the process above.
In the rest of the paper, we focus on this challenging task by first introducing the relevant
elements of the problem (Section 3) and then illustrating the main ideas for its solution
through an example (Section 4).</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Data and Metadata Management</title>
      <p>Let us now fix some basic notions that we will refer to in the following. Our definitions are
deliberately abstract so as to be as general as possible, without the need to commit to any
specific data lake model and format.</p>
      <p>Dataset. We consider that a dataset 𝐷 = (𝐴, 𝐼) has a name 𝑁 and is composed of a set 𝐴 of
attributes and a set 𝐼 of data items. Each data item in 𝐼 is a set of attribute-value pairs, with
attributes taken from 𝐴.</p>
      <p>Figure 3 shows an example of datasets still in a “raw” format, reporting data about the finance
and tech departments of a company. After curation, the so-obtained datasets also take part in
the data lake.</p>
      <p>Data Lake. For our purposes, a data lake ℒ = (𝒟, ℳ) can be modeled as a collection 𝒟 of
datasets having distinct names, plus a set of metadata ℳ, including a (possibly empty) set of
constraints 𝒞 on the datasets.</p>
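      <p>The two definitions above can be rendered, for illustration only, as plain Python structures (the class and field names are ours, not prescribed by the paper):</p>

```python
# Illustrative encoding of the Dataset and Data Lake notions defined above.
from dataclasses import dataclass, field

@dataclass
class Dataset:
    name: str        # the dataset name N
    attributes: set  # the set A of attributes
    items: list      # data items: attribute-value dicts with keys drawn from A

    def is_well_formed(self):
        # every data item may only use attributes taken from A
        return all(set(item).issubset(self.attributes) for item in self.items)

@dataclass
class DataLake:
    datasets: dict = field(default_factory=dict)  # name to Dataset, names must be distinct
    metadata: dict = field(default_factory=dict)  # includes a (possibly empty) set of constraints

    def add(self, ds):
        assert ds.name not in self.datasets, "dataset names must be distinct"
        self.datasets[ds.name] = ds
```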
      <p>Figure 4 shows a collection of partially curated datasets in  (D_Emp, D_Dept, and D_Act) that
have been obtained from the raw datasets of Figure 3 by unnesting employees from departments
and activities from employees. The metadata include, e.g., cross-dataset constraints, such as the
fact that DeptCodes appearing in D_Emp must also appear in D_Dept, as well as, say, domain
constraints such as the fact that Level must be an integer (so employee E_05 violates this).</p>
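      <p>Both kinds of metadata constraints just mentioned can be checked mechanically; the snippet below is a sketch with made-up rows, not the actual content of Figure 4.</p>

```python
# Sketch: checking a cross-dataset inclusion constraint and a domain constraint.

def inclusion_ok(d_emp, d_dept):
    # every DeptCode appearing in D_Emp must also appear in D_Dept
    dept_codes = {t["DeptCode"] for t in d_dept}
    return all(t["DeptCode"] in dept_codes for t in d_emp)

def level_domain_ok(d_emp):
    # Level must be an integer
    return all(isinstance(t["Level"], int) for t in d_emp)
```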
      <p>Conceptual schema. We consider that the domain of interest for analysis purposes is
represented by a conceptual schema 𝒮, expressed by means of a suitable language ℒ𝒮. Examples
are Entity-Relationship (E-R) diagrams, RDF(S), UML class diagrams, and Description Logic
(DL) languages, such as those underlying the OWL 2 standard and its profiles
(https://www.w3.org/TR/owl2-profiles/). Besides specific
differences, each of these languages allows for the definition of concepts (i.e., classes of objects,
entities), relationships (a.k.a. roles) among them, and properties (of concepts and relationships).</p>
      <p>Conceptual constraints. Of particular interest to us are the conceptual constraints that
characterize the elements of the schema 𝒮. Clearly, these are a subset of those available in the chosen
language ℒ𝒮. For instance, in the E-R formalism we can state that two entities 𝐸1 and 𝐸2 have
a common generalizing entity 𝐸 (subset(𝐸1, 𝐸) and subset(𝐸2, 𝐸)) and that 𝐸1 and 𝐸2 are
disjoint (disjoint(𝐸1, 𝐸2)). However, the E-R model provides no means to state, say, that
the instances of 𝐸1 are exactly those instances of 𝐸 for which the attribute 𝐴 of 𝐸 has a value
≥ 20 (this would require an additional, possibly ad hoc language, a scenario we do not consider here).</p>
      <p>Mapping. The connection between the conceptual schema 𝒮 and the data lake ℒ = (𝒟, ℳ) is
based on a mapping 𝜇, i.e., a set of assertions relating the elements in 𝒮 to the datasets in
𝒟. For instance, an entity Departments in 𝒮 could be mapped to the projection of dataset
D_Dept on just the attributes DeptCode and DeptName, with the MgrNo attribute representing
a relationship between Departments and Employees.</p>
      <p>
        Before proceeding, we remark that, unlike OBDA (Ontology-Based Data Access)
approaches [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], we do not use 𝜇 for the purpose of obtaining results from 𝒟 given a query
on 𝒮. Rather, 𝜇 is the key ingredient to define and enforce on the data lake the conceptual
constraints in 𝒮. Concisely, we denote as 𝒞 the effects of this constraint propagation back to
the datasets in 𝒟:
      </p>
      <p>𝒞 = 𝜇⁻¹(𝒮).</p>
      <p>Once the constraints 𝒞 on the data lake ℒ have been generated, they may be used to
check if 𝒟 is consistent with respect to 𝒞 and, possibly, to repair 𝒟.</p>
    </sec>
    <sec id="sec-4">
      <title>4. An Example</title>
      <p>The E-R schema 𝒮 in Figure 5 describes a simplified scenario regarding the departments of a
company. The schema includes structural information (such as the fact that Employees have a
Name and a Salary) as well as constraints (such as the fact that Managers are also Employees or
that each Department has at least one Employee). Notice that the schema 𝒮 deliberately does
not include the NoHours attribute that characterizes each activity of a researcher (see dataset
D_Act in Figure 4). This is to emphasize that 𝒮 only focuses on the part of the data lake that is
of interest for the analysis, which, as we assume here, does not include the NoHours attribute.</p>
      <p>Besides basic constraints on attributes, such as non-nullability and domain of admitted values
(which, in the following, we will omit for brevity), relevant constraints in 𝒮, here informally
described as self-explanatory predicates, are:
unique(EmpNo,Employees) every employee is identified by EmpNo
unique(DeptCode,Departments) every department is identified by DeptCode
. . .
subset(Managers,Employees) managers are employees
subset(Researchers,Employees) researchers are employees
disjoint(Managers,Researchers) no manager is a researcher
card(Departments,Direct,1,1) every department has exactly one manager
card(Employees,Work,1,1) every employee works in exactly one department
card(Departments,Work,1,n) every department has at least one employee
Now, consider the datasets in Figure 4, whose structure is reported below for the sake of clarity:
D_Emp(EmpNo,Name,Salary,DeptCode,Level,CV,PID,PName,Budget),
D_Dept(DeptCode,DeptName,MgrNo),</p>
      <p>D_Act(ResNo,Activity,NoHours).</p>
      <p>Then, we can define the mapping 𝜇 by means of the following statements, one for each entity
and relationship in 𝒮 (the underscore symbol indicates (anonymous) variables not relevant to a
statement; the adopted notation is therefore positional like in, e.g., Datalog).</p>
      <p>The constraints 𝒞 corresponding to this mapping include, among others, the following ones,
where we additionally assume that any two tuples 𝑡1, 𝑡2 mentioned in the constraints are distinct:
• Uniqueness of DeptCode:
𝑐1 : ∀𝑡1, 𝑡2 ∈ D_Dept : ¬(𝑡1.DeptCode = 𝑡2.DeptCode)
• Disjointness of managers and researchers:
𝑐2 : ∀𝑡1 ∈ D_Emp : ¬(NotNull(𝑡1.Level) ∧ NotNull(𝑡1.CV))
• Departments are directed by managers:
𝑐3 : ∀𝑡1 ∈ D_Dept ∃𝑡2 ∈ D_Emp : 𝑡1.MgrNo = 𝑡2.EmpNo ∧ NotNull(𝑡2.Level)
• Each department has at least one employee:
𝑐4 : ∀𝑡1 ∈ D_Dept ∃𝑡2 ∈ D_Emp : 𝑡1.DeptCode = 𝑡2.DeptCode
• Each employee has activities only within a project:
𝑐5 : ∀𝑡1 ∈ D_Act ∃𝑡2 ∈ D_Emp : 𝑡1.ResNo = 𝑡2.EmpNo ∧ NotNull(𝑡2.PID)</p>
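      <p>As an illustration, constraints of this kind translate directly into executable checks; the sample tuples below are abridged stand-ins for the data of Figure 4 (our own, not the paper's full datasets), arranged so as to exhibit the violations discussed next.</p>

```python
# Executable rendering of the five constraints; sample rows are illustrative
# stand-ins for Figure 4, arranged to exhibit the violations discussed in the text.

def not_null(v):
    return v is not None

def c1(d_dept):  # uniqueness of DeptCode (tuples assumed distinct)
    codes = [t["DeptCode"] for t in d_dept]
    return len(codes) == len(set(codes))

def c2(d_emp):  # no employee is both a manager (Level) and a researcher (CV)
    return all(not (not_null(t["Level"]) and not_null(t["CV"])) for t in d_emp)

def c3(d_dept, d_emp):  # departments are directed by managers
    return all(any(t1["MgrNo"] == t2["EmpNo"] and not_null(t2["Level"])
                   for t2 in d_emp) for t1 in d_dept)

def c4(d_dept, d_emp):  # each department has at least one employee
    return all(any(t1["DeptCode"] == t2["DeptCode"] for t2 in d_emp) for t1 in d_dept)

def c5(d_act, d_emp):  # activities only for employees within a project
    return all(any(t1["ResNo"] == t2["EmpNo"] and not_null(t2["PID"])
                   for t2 in d_emp) for t1 in d_act)

d_emp = [
    {"EmpNo": "E07", "Level": 3, "CV": "cv.pdf", "DeptCode": "D01", "PID": "P1"},
    {"EmpNo": "E10", "Level": None, "CV": "cv.pdf", "DeptCode": "D02", "PID": "P2"},
    {"EmpNo": "E12", "Level": None, "CV": "cv.pdf", "DeptCode": "D01", "PID": None},
]
d_dept = [{"DeptCode": "D01", "MgrNo": "E07"}, {"DeptCode": "D02", "MgrNo": "E10"}]
d_act = [{"ResNo": "E12", "Activity": "testing"}]
```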
      <p>Consider now the datasets in Figure 4. It is apparent that 𝒟 violates the following
constraints in 𝒞:
• Employee E07 has both attributes Level and CV not null, thus violating constraint 𝑐2;
• Department D02 is managed by an employee (E10) that is not a manager, contradicting
constraint 𝑐3;
• Constraint 𝑐5 is also violated, since employee E12 appears in the dataset D_Act although
she does not participate in any project.</p>
      <p>
        Once the above violations are discovered, the datasets can be cleaned using some of the
available methods (see, e.g., [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]).
      </p>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion and Conclusions</title>
      <p>In this vision paper we have put forward the idea of generating constraints on the datasets of a
data lake by exploiting a high-level, conceptual representation, in order to improve the quality
of data and, consequently, that of subsequent analysis.</p>
      <p>
        Our approach can be regarded as complementary to those that aim to curate data by directly
specifying constraints through ad-hoc languages/tools. For instance, CLAMS [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] adopts the
RDF data model for representing data in the curated layer, and defines conditional denial
constraints over views of the data lake defined using SPARQL queries. Although this is a
powerful approach, able to exploit the expressivity of SPARQL, it leaves the full burden of
specifying constraints (and queries) to the designer/analyst. Furthermore, there is no guarantee
that the set of constraints is consistent, i.e., non-contradictory. The Deequ system [
        <xref ref-type="bibr" rid="ref21 ref22">21, 22</xref>
        ] is an
open-source library aimed at supporting the automatic verification of data quality. However,
the constraints available in the library apply to a single dataset, thus inter-dataset constraints
cannot be specified.
      </p>
      <p>A major challenge of our approach is to demonstrate that the propagation of conceptual
constraints, i.e., the generation of 𝒞, can be fully automated. Although in the past decades a
large body of work has investigated how to automatically translate E-R schemas to relational
tables (see, e.g., [23]), much less is known for other conceptual models and/or data models such
as RDF. Our view of the problem currently considers (automatic) constraint propagation as a
two-step process: (1) first, one operates a canonical transformation of the conceptual schema
𝒮 into a schema 𝒮′ in the target data model of the curated layer; (2) then, 𝒮′ is mapped
to the actual 𝒟. Besides the obvious advantage of splitting the complexity of the problem into
two well-defined sub-problems, this approach can exploit in step (2) all that is known about the
equivalence of schemas (𝒮′ and the schema of 𝒟 in our case) expressed in the same formalism.</p>
      <p>
        In the example introduced in Section 4 we have implicitly assumed a complete mapping, i.e., a
mapping in which all elements of the conceptual schema are described in terms of the available
datasets. This is not a necessary condition for our approach, which can also consider larger,
preexisting domain ontologies to enrich the quality of the datasets [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>[23] V. M. Markowitz, A. Shoshani, Representing extended entity-relationship structures in
relational databases: A modular approach, ACM Trans. Database Syst. 17 (1992) 423-464.
doi:10.1145/132271.132273.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. C.</given-names>
            <surname>Fernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Abedjan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stonebraker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Elmagarmid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. F.</given-names>
            <surname>Ilyas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Madden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ouzzani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <article-title>The data civilizer system</article-title>
          ,
          <source>in: CIDR</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N.</given-names>
            <surname>Heudecker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>White</surname>
          </string-name>
          ,
          <article-title>The data lake fallacy: All water and little substance</article-title>
          ,
          <source>Gartner Report G 264950</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>I.</given-names>
            <surname>Terrizzano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. M.</given-names>
            <surname>Schwarz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Roth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Colino</surname>
          </string-name>
          ,
          <article-title>Data wrangling: The challenging journey from the wild to the lake</article-title>
          ,
          <source>in: CIDR</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>CKAN:</surname>
          </string-name>
          <article-title>The open source data portal software</article-title>
          , http://ckan.org/, (accessed November,
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Bhardwaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Deshpande</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Elmore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Karger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Madden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Parameswaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Subramanyam</surname>
          </string-name>
          , E. Wu,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Collaborative data analytics with DataHub</article-title>
          ,
          <source>PVLDB</source>
          <volume>8</volume>
          (
          <year>2015</year>
          )
          <fpage>1916</fpage>
          -
          <lpage>1927</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Halevy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Korn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. F.</given-names>
            <surname>Noy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Olston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Polyzotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Whang</surname>
          </string-name>
          , Goods:
          <article-title>Organizing google's datasets</article-title>
          ,
          <source>in: SIGMOD</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Hellerstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sreekanti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dalton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ramachandran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Arora</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bhattacharyya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Donsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Fierro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>She</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Steinbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Subramanian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Ground: A data context service</article-title>
          ,
          <source>in: CIDR</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Hai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Geisler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Quix</surname>
          </string-name>
          ,
          <article-title>Constance: An intelligent data lake system</article-title>
          , in: F. Özcan, G. Koutrika, S. Madden (Eds.),
          <source>Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference</source>
          <year>2016</year>
          , San Francisco, CA, USA, June 26 - July 01,
          <year>2016</year>
          , ACM,
          <year>2016</year>
          , pp.
          <fpage>2097</fpage>
          -
          <lpage>2100</lpage>
          . URL: https://doi.org/10.1145/2882903.2899389. doi:10.1145/2882903.2899389.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Papenbrock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Bergmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Finke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zwiener</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Naumann</surname>
          </string-name>
          ,
          <article-title>Data profiling with metanome</article-title>
          ,
          <source>PVLDB</source>
          <volume>8</volume>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaharia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Xin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wendell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Armbrust</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rosen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Venkataraman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Franklin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ghodsi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shenker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Stoica</surname>
          </string-name>
          ,
          <article-title>Apache spark: a unified engine for big data processing</article-title>
          ,
          <source>Commun. ACM</source>
          <volume>59</volume>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Nargesian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Pu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. C.</given-names>
            <surname>Arocena</surname>
          </string-name>
          ,
          <article-title>Data lake management: Challenges and opportunities</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>12</volume>
          (
          <year>2019</year>
          )
          <fpage>1986</fpage>
          -
          <lpage>1989</lpage>
          . URL: https://doi.org/10.14778/3352063.3352116. doi:10.14778/3352063.3352116.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Halevy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Korn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. F.</given-names>
            <surname>Noy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Olston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Polyzotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Whang</surname>
          </string-name>
          ,
          <article-title>Managing google's data lake: an overview of the goods system</article-title>
          ,
          <source>IEEE Data Eng. Bull</source>
          .
          <volume>39</volume>
          (
          <year>2016</year>
          )
          <fpage>5</fpage>
          -
          <lpage>14</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Yakout</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Elmagarmid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Neville</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ouzzani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. F.</given-names>
            <surname>Ilyas</surname>
          </string-name>
          ,
          <article-title>Guided data repair</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>4</volume>
          (
          <year>2011</year>
          )
          <fpage>279</fpage>
          -
          <lpage>289</lpage>
          . URL: https://doi.org/10.14778/1952376.1952378. doi:10.14778/1952376.1952378.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>F.</given-names>
            <surname>Chiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <article-title>A unified model for data and constraint repair</article-title>
          , in: S. Abiteboul,
          <string-name>
            <given-names>K.</given-names>
            <surname>Böhm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Koch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Tan</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 27th International Conference on Data Engineering, ICDE 2011, April 11-16</source>
          ,
          <year>2011</year>
          , Hannover, Germany, IEEE Computer Society,
          <year>2011</year>
          , pp.
          <fpage>446</fpage>
          -
          <lpage>457</lpage>
          . URL: https://doi.org/10.1109/ICDE.2011.5767833. doi:10.1109/ICDE.2011.5767833.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>F.</given-names>
            <surname>Geerts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Mecca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Papotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Santoro</surname>
          </string-name>
          ,
          <article-title>Cleaning data with llunatic</article-title>
          ,
          <source>VLDB J</source>
          .
          <volume>29</volume>
          (
          <year>2020</year>
          )
          <fpage>867</fpage>
          -
          <lpage>892</lpage>
          . URL: https://doi.org/10.1007/s00778-019-00586-5. doi:10.1007/s00778-019-00586-5.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hulsebos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bakker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Zgraggen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Satyanarayan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kraska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ç.</given-names>
            <surname>Demiralp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hidalgo</surname>
          </string-name>
          ,
          <article-title>Sherlock: A deep learning approach to semantic data type detection</article-title>
          ,
          <source>in: Proceedings of the 25th ACM SIGKDD, KDD '19</source>
          ,
          Association for Computing Machinery, New York, NY, USA,
          <year>2019</year>
          , p.
          <fpage>1500</fpage>
          -
          <lpage>1508</lpage>
          . URL: https://doi.org/10.1145/3292500.3330993. doi:10.1145/3292500.3330993.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ota</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Freire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <article-title>Data-driven domain discovery for structured datasets</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>13</volume>
          (
          <year>2020</year>
          )
          <fpage>953</fpage>
          -
          <lpage>967</lpage>
          . URL: https://doi.org/10.14778/3384345.3384346. doi:10.14778/3384345.3384346.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hulsebos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Suhara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ç.</given-names>
            <surname>Demiralp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-C.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <article-title>Sato: Contextual semantic type detection in tables</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>13</volume>
          (
          <year>2020</year>
          )
          <fpage>1835</fpage>
          -
          <lpage>1848</lpage>
          . URL: https://doi.org/10.14778/3407790.3407793. doi:10.14778/3407790.3407793.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Farid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roatis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. F.</given-names>
            <surname>Ilyas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hofmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <article-title>CLAMS: bringing quality to data lakes</article-title>
          , in: F. Özcan, G. Koutrika, S. Madden (Eds.),
          <source>Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference</source>
          <year>2016</year>
          , San Francisco, CA, USA, June 26 - July 01,
          <year>2016</year>
          , ACM,
          <year>2016</year>
          , pp.
          <fpage>2089</fpage>
          -
          <lpage>2092</lpage>
          . URL: https://doi.org/10.1145/2882903.2899391. doi:10.1145/2882903.2899391.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>G.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Calvanese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kontchakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lembo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Poggi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rosati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zakharyaschev</surname>
          </string-name>
          ,
          <article-title>Ontology-based data access: A survey</article-title>
          , in:
          <string-name>
            <given-names>J.</given-names>
            <surname>Lang</surname>
          </string-name>
          (Ed.),
          <source>Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19</source>
          ,
          <year>2018</year>
          , Stockholm, Sweden, ijcai.org,
          <year>2018</year>
          , pp.
          <fpage>5511</fpage>
          -
          <lpage>5519</lpage>
          . URL: https://doi.org/10.24963/ijcai.2018/777. doi:10.24963/ijcai.2018/777.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>S.</given-names>
            <surname>Schelter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lange</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Celikel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bießmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Grafberger</surname>
          </string-name>
          ,
          <article-title>Automating large-scale data quality verification</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>11</volume>
          (
          <year>2018</year>
          )
          <fpage>1781</fpage>
          -
          <lpage>1794</lpage>
          . URL: http://www.vldb.org/pvldb/vol11/p1781-schelter.pdf. doi:10.14778/3229863.3229867.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>S.</given-names>
            <surname>Schelter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bießmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lange</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rukat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Seufert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Brunelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Taptunov</surname>
          </string-name>
          ,
          <article-title>Unit testing data with deequ</article-title>
          , in:
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Boncz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Manegold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ailamaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Deshpande</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kraska</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference</source>
          <year>2019</year>
          , Amsterdam, The Netherlands, June 30 - July 5,
          <year>2019</year>
          , ACM,
          <year>2019</year>
          , pp.
          <fpage>1993</fpage>
          -
          <lpage>1996</lpage>
          . URL: https://doi.org/10.1145/3299869.3320210. doi:10.1145/3299869.3320210.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>