<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Analytic query answering in a Semantic Data Lake (extended abstract)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Claudia Diamantini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Domenico Potena</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
<string-name>Emanuele Storti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DII, Università Politecnica delle Marche</institution>
          ,
          <addr-line>via Brecce Bianche 60131</addr-line>
        </aff>
      </contrib-group>
      <abstract>
<p>The management of Data Lake technologies is challenged by the increasing flexibility they provide in data storage, as well as by the fast-changing and diverse data they handle. In order to effectively identify relevant sources for analysis, it is crucial to make sense of disparate data, which is especially important in data science applications where users need to analyze statistical measures from multiple heterogeneous sources. In this paper, a knowledge-based approach for a Semantic Data Lake is presented to enable efficient integration of data sources and alignment to a Knowledge Graph, which represents indicators of interest, their mathematical formulas, and dimensions of analysis. A query-driven discovery approach is used to dynamically identify, integrate and rank the sources to respond to a given analytical query.</p>
      </abstract>
      <kwd-group>
        <kwd>Data Lake</kwd>
        <kwd>Query-driven discovery</kwd>
        <kwd>Knowledge graph</kwd>
        <kwd>Multidimensional model</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Data Lakes (DL) are repositories for storing data in their native format, providing centralized
access and the capability to apply data transformations when needed according to an ELT
approach. However, the lack of a global schema and the need to make sense of disparate raw
data pose challenges related to data management. As recognized by recent literature [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], how
to integrate heterogeneous data sources and help users to find the most relevant data are still
open issues in this setting.
      </p>
      <p>
        A variety of solutions have been proposed for DL integration, ranging from raw data
management to semantic-enriched frameworks. Traditional techniques based on schema matching
typically assume complete metadata, which is not realistic for real-world Data Lakes [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ].
Among semantic-enriched frameworks, Knowledge Graphs have been exploited to drive integration, relying on
information extraction tools (e.g., [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4</xref>
        ]), while recent effort has focused on combining the
Ontology-Based Data Access paradigm with Data Lakes to support uniform access (e.g., [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]).
      </p>
      <p>
        In order to combine the two aspects of discovery and integration, which are often seen as
intertwined operations, a query-driven discovery paradigm was recently proposed [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], aimed at finding
datasets that are similar to a query dataset and that can be integrated in some way
(e.g., by join, union or aggregates). A related problem is the correlated dataset search, in which
besides identifying possible joins, it is also necessary to compute, or estimate, the joinability
among the sources (or their correlation). Algorithms such as JOSIE [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] provide an exact solution,
while Lazo [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], LSH Ensemble [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] or GB-KMV [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], focus on approximate solutions, trading
precision and recall for reduced cost. Aurum [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] exploits hypergraphs to find similarity-based
relationships through LSH among tabular datasets. In [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], given an input query table, the aim
is to find the top-k tables that are both joinable with it and contain columns correlated
with a column in the query, through a novel hashing scheme that allows the construction of a
sketch-based index to support efficient correlated table search. After the discovery has been
performed (through join, union or related-table search), tables can be integrated (e.g., [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]).
      </p>
      <p>However, when dealing with summary data, that is, statistical measures or indicators derived
from raw data, specific issues arise that have not been taken into account by the literature. This
is the case of open Data Lakes managed by public bodies, e.g., to monitor economic trends or
the effectiveness of governmental policies and initiatives such as a vaccination campaign.</p>
      <p>In this work, we propose a query-driven knowledge-based approach for integration and
discovery in a Data Lake. The approach builds on a Knowledge Graph that includes a formal
model of measures and their computation formulas, in which concepts are used to enrich source
metadata. The approach defines mechanisms for integration and mapping discovery, based on
efficient evaluation of set containment between a source domain and a concept in the Knowledge
Graph. It also defines an ontology-based and math-aware query answering function, specifically
tailored to analytical processing, which identifies the set of sources collectively capable
of responding to the user request and the proper transformation rules to perform the needed
calculation. To quantitatively estimate the quality of such results, we propose an algorithm to
efficiently evaluate the degree of joinability index, which estimates the cardinality of the join
among a set of sources.</p>
      <p>
        Unlike alternative solutions in the literature, our approach takes into account both data and
metadata (i.e., mappings to indicator concepts in the Knowledge Graph and their formulas) as a
support to reformulate the query and determine which sources can be used to respond. This
helps in reducing the search space by identifying the most semantically relevant data sources
according to the discovery need. Moreover, in our case the target query is extended to general
OLAP queries. As such, the system can also be called a Semantic Data Lakehouse, following a recent
terminological proposal in the literature [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. This article is an extended abstract of a work
submitted to Information Systems Frontiers; a prior version is available at [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <p>The rest of the paper is structured as follows: Section 2 is devoted to introducing the Semantic
Data Lake model. The approach for source integration is discussed in Section 3, while query
answering mechanisms are introduced in Section 4. An evaluation of the approach is discussed
in Section 5 while Section 6 concludes the work and draws future research lines.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Semantic Data Lake: data model</title>
      <p>
        A Semantic Data Lake is defined as a tuple 𝒟ℒ = ⟨𝒮, ℳ, 𝒦, map⟩, where 𝒮 = {S1, . . . , Sn} is a
set of data sources, ℳ = {M1, . . . , Mn} is the corresponding set of metadata, 𝒦 is a Knowledge
Graph and map ⊆ ℳ × 𝒦 is a mapping function relating metadata to knowledge concepts.
Our approach is agnostic w.r.t. both the degree of structuredness of the sources, ranging
from structured datasets to semi-structured documents (e.g., XML, JSON), and the specific DL
architecture at hand, e.g., based on ponds vs. zones (see also [
        <xref ref-type="bibr" rid="ref16">16, 17</xref>
        ]). If the architecture is
pond-based, in fact, the approach is applied to datasets in a single stage, while in zone-based
DLs the approach can be applied at any stage of the platform, although it is best suited to
the area staged for data exploration/analysis. As a minimum requirement, we assume a data
ingestion process to wrap separate data sources and load them into a data storage. The model
for a Semantic Data Lake is detailed in the following.
      </p>
      <sec id="sec-2-1">
        <title>2.1. Metadata layer</title>
        <p>Different types of metadata can be related to a resource, depending on how they are gathered
[18]. Hereby, we refer to technical metadata, i.e., metadata related to data format and, whenever applicable,
to their schema. Since the representation of metadata is highly source-dependent (e.g., the
schema definition for a relational table), a uniform representation of data sources in a metadata
layer is required for the management of a Data Lake. The procedure to represent technical
metadata of a given source depends on the typology of data source, e.g., a relational database
has tables with attributes, while XML/JSON documents include complex/simple elements and
their attributes. For each source Si, metadata are represented as a directed graph Mi that is
built incrementally by a metadata management system [19], starting from the definition of a
node for each metadata element. An edge is defined to represent the structural relation between
a table and a column of a relational database, or between a JSON complex object and a simple
object.</p>
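As an illustration, the incremental construction of a metadata graph can be sketched as follows; the adjacency-list representation and the sample JSON record are our own simplifications, not the metadata management system of [19].

```python
def build_metadata_graph(doc, parent="root", graph=None):
    """Recursively add a node for each metadata element of a JSON document
    and an edge from each complex object to its children, as described above."""
    if graph is None:
        graph = {parent: []}
    if isinstance(doc, dict):
        for key, value in doc.items():
            graph.setdefault(parent, []).append(key)  # structural edge parent -> key
            graph.setdefault(key, [])                 # one node per element
            build_metadata_graph(value, parent=key, graph=graph)
    return graph

# Hypothetical record from a semi-structured COVID data source
record = {"report": {"date": "2021-01-01", "location": {"country": "Italy"}}}
g = build_metadata_graph(record)
print(g["report"])  # children of the complex element "report"
```

Relational sources would be handled analogously, with an edge between each table node and its column nodes.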
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Knowledge layer</title>
        <p>The knowledge layer of the Semantic Data Lake is based on KPIOnto, an OWL 2 RL ontology
aimed at providing the terminology to model an indicator in terms of description, unit of
measurement and mathematical formula for its computation. The ontology also provides classes and
properties to fully represent multidimensional hierarchies for dimensions (e.g., level Province
rolls up to Country in the Geo dimension) and members.</p>
        <p>On top of this, a Knowledge Graph provides a representation of the domain knowledge in terms
of definitions of indicators, dimension hierarchies and dimension members. Concepts are
represented in RDF as Linked Data according to the KPIOnto ontology, thus enabling standard
graph access and query mechanisms. Finally, Logic Programming rules are enacted by the XSB
logical reasoner providing algebraic services. These are capable of performing mathematical
manipulation of formulas (e.g., equation solving), which are exploited to infer all formulas for a
given indicator. This functionality is used to support query answering (see Section 4).</p>
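A minimal sketch of the kind of algebraic manipulation delegated to the reasoner, restricted to purely linear formulas; the formula for Cases is hypothetical, loosely inspired by Figure 1b, and real KPIOnto formulas are handled by the XSB reasoner rather than by Python code.

```python
from fractions import Fraction

def solve_for(formula, target):
    """Equation solving for a linear formula lhs = sum(coeff * term):
    rearrange it to express `target` in terms of the remaining indicators."""
    lhs, terms = formula
    coeff = terms[target]
    rewritten = {lhs: Fraction(1) / coeff}      # move lhs to the right-hand side
    for term, c in terms.items():
        if term != target:
            rewritten[term] = -c / coeff        # move the other terms across
    return target, rewritten

# Hypothetical formula: Cases = Positive + Deaths + Recovered
cases = ("Cases", {"Positive": Fraction(1), "Deaths": Fraction(1), "Recovered": Fraction(1)})
positive = solve_for(cases, "Positive")
print(positive)  # Positive = Cases - Deaths - Recovered
```

This is the mechanism by which all formulas for a given indicator can be inferred from the ones explicitly stated in the Knowledge Graph.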
        <p>Figure 1a shows a fragment of a Knowledge Graph representing the two dimensions Time and
Geo with the corresponding hierarchies of levels, while Figure 1b highlights the
mathematical relations among a set of indicators related to the monitoring of COVID: some
are atomic (e.g., Positive, Deaths, Recovered, ICU), while others can be calculated from the
former (Cases and ICU on Positive Rate).
(KPIOnto specifications are available at http://w3id.org/kpionto; XSB: http://xsb.sourceforge.net/)</p>
        <p>For each source, the nodes in the metadata graphs are aligned with concepts in the Knowledge
Graph through the definition of the mapping function map ⊆ ℳ × 𝒦, which links the metadata to
the knowledge layer, following the approach discussed in the next section.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Integration and mapping discovery</title>
      <p>This section discusses (a) how to identify dimensions, given a new data source, and (b)
how to properly map them to the Knowledge Graph. Hereby, we refer to a data domain as a set
of values from a data source: for relational tables it is the projection of one attribute, while
for a JSON collection it is the set of values extracted from all the included documents according
to a given path.</p>
      <p>In order to identify whether a given domain from a data source and a dimensional level
represent the same concept, a matching step is required. The Jaccard similarity coefficient
is one of the most widely adopted indexes for comparing sets; however, when sets are skewed,
i.e., have very different cardinalities, this index is biased against the largest one. Given that the
cardinality of a domain (without duplicates) is typically much lower than that of a dimensional
level, we refer to an asymmetric variant named set containment, which is better suited than
Jaccard to evaluate whether a domain has an intersection with a given level. Given two sets D, L,
it is defined as C(D, L) = |D ∩ L| / |D|, i.e., it is independent of the cardinality of the second set. As
an example, let us consider a domain D of three city names and the dimensional level
Geo.City including 100 cities in Europe. In this case, C(D, Geo.City) = 3/3 = 1, meaning that the
domain perfectly matches the dimensional level, while J(D, Geo.City) = 3/100.</p>
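The two indexes can be contrasted in a few lines; the city names below are placeholders for the example above.

```python
def containment(D, L):
    """Set containment C(D, L) = |D & L| / |D|: independent of |L|."""
    return len(D & L) / len(D)

def jaccard(D, L):
    """Jaccard index: biased against the larger set when sizes are skewed."""
    return len(D & L) / len(D | L)

# Placeholder city names: a 3-element domain fully contained in a
# Geo.City level with 100 members.
domain = {"city_1", "city_2", "city_3"}
geo_city = {f"city_{i}" for i in range(1, 101)}
print(containment(domain, geo_city))  # 1.0 -> perfect match with the level
print(jaccard(domain, geo_city))      # 0.03 -> misleadingly low
```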
      <p>
        We formalize the problem of mapping a domain of a data source to a dimensional level as
a reformulation of the domain search problem [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], which belongs to the class of R-nearest
neighbor search problems. Here, the goal is to determine which dimensional level, defined in the
Knowledge Graph, is the most relevant, i.e., best represents the values in the domain at hand.
Formally, given a set of dimensional levels ℒ, a domain D, and a threshold t ∈ [0, 1], the set of
relevant dimensional levels from ℒ is {L : C(D, L) ≥ t, L ∈ ℒ}. In the following we refer to
the most relevant dimensional level as the one having the greatest containment score. As an example,
the most relevant dimensional level for a domain country_region in a data source, containing
names of countries, is Geo.Country.
      </p>
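A direct, non-approximate reading of this definition can be sketched as follows; the level extensions are hypothetical.

```python
def containment(D, L):
    return len(D & L) / len(D)

def relevant_levels(domain, levels, t):
    """Dimensional levels whose containment with `domain` is at least t,
    most relevant (highest score) first."""
    scored = [(name, containment(domain, vals)) for name, vals in levels.items()]
    return sorted([x for x in scored if x[1] >= t], key=lambda x: -x[1])

# Hypothetical level extensions from the Knowledge Graph
levels = {
    "Geo.Country": {"Italy", "France", "Spain", "Germany"},
    "Geo.City": {"Rome", "Paris", "Madrid"},
}
country_region = {"Italy", "France", "Spain"}
print(relevant_levels(country_region, levels, t=0.5))  # [('Geo.Country', 1.0)]
```

An exhaustive scan like this is exactly what the approximate scheme of the next paragraphs avoids.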
      <p>Comparing a given domain to a dimensional level has linear time complexity in the
size of the sets. Given the target scenario, which may include data sources with hundreds of
thousands or even millions of tuples, the computation of the index is often not scalable in
practice. An improvement discussed in the literature consists in estimating
the index using MinHash [20], i.e., performing the comparison on
MinHash signatures instead of on the original sets. For high-dimensional data sources,
moreover, MinHash is combined with a data structure capable of significantly reducing the running
time, Locality Sensitive Hashing (LSH) [21], a sub-linear approximate algorithm.</p>
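A self-contained MinHash sketch illustrating the estimation idea; this simplified salted-hash scheme is our own, the paper relying on [20, 21] instead.

```python
import hashlib

def minhash_signature(values, num_perm=128):
    """MinHash signature: for each of num_perm salted hash functions,
    keep the minimum hash value over the set's elements."""
    return [
        min(int.from_bytes(hashlib.sha1(f"{i}:{v}".encode()).digest()[:8], "big")
            for v in values)
        for i in range(num_perm)
    ]

def estimate_jaccard(sig_a, sig_b):
    """The fraction of agreeing signature positions estimates the Jaccard index."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

A = {f"x{i}" for i in range(100)}
B = {f"x{i}" for i in range(50, 150)}   # true Jaccard = 50/150, i.e. about 0.33
est = estimate_jaccard(minhash_signature(A), minhash_signature(B))
print(est)  # close to 0.33, within MinHash estimation error
```

Comparing two 128-element signatures is constant-time, regardless of the cardinality of the original sets, which is the point of the technique.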
      <p>
        While the previous approach is targeted to the Jaccard index, an estimation of the set
containment can be obtained through LSH Ensemble [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], which is proved to be suitable for skewed
sets and to outperform alternative solutions in terms of accuracy and execution time.
In our approach, given a domain of a source, we rely on LSH Ensemble to obtain the
dimensional level(s) that are estimated to have a containment score above a given threshold. In the
following, given a domain D from a data source S, a set of dimensional levels ℒ, and a
threshold t ∈ [0, 1], we denote by relevant_levels the function returning the set of relevant
dimensional levels for D.
      </p>
      <p>Measures are particular domains whose values are purely quantitative. As
such, unlike dimensional levels, they are not constrained to a finite number of possible values.
For this reason, solutions for evaluating domain similarity through containment, such as LSH
Ensemble, cannot be applied. Several approaches can be considered, ranging from string-based
ones to those based on dictionaries, semantic similarity (e.g., [19]) or frequency distributions; they
will be discussed in future work.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Query answering</title>
      <p>The mappings defined between the metadata graphs and the Knowledge Graph are exploited to
support query-driven discovery and query answering in the Data Lake context. This requires
to determine what data sources are needed and how to combine them for a given request. A
user query Q is expressed as a tuple Q = ⟨ind, {l1, . . . , lk}⟩, where ind is an indicator and
{l1, . . . , lk} is a set of levels, each belonging to a different dimension.</p>
      <p>A data source S has a compatible dimensional schema with respect to a query if S contains
a subset of the levels in the query. For all dimensions of the query that are not included in S,
the source is assumed to supply such dimensions at the most aggregate level. A data source
can respond to a query if its dimensional schema is compatible and if it provides the requested
indicator. On the other hand, if the indicator is not provided by any source but it can be
calculated from other indicators, a set of data sources may collectively answer the query if they
have a compatible dimensional schema and provide all the component indicators. In the latter
case, the actual calculation of the indicator requires to join the needed data sources.</p>
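The compatibility test can be sketched as follows; the flat metadata layout (a set of measures and a set of levels per source) is a simplification of the metadata graphs of Section 2.1.

```python
def is_compatible(source_levels, query_levels):
    """A source's dimensional schema is compatible if its levels are a
    subset of the query's; query dimensions missing from the source are
    assumed to be supplied at the most aggregate level."""
    return set(source_levels) <= set(query_levels)

def can_respond(source, query):
    """A source can respond if it is compatible and provides the indicator."""
    indicator, levels = query
    return indicator in source["measures"] and is_compatible(source["levels"], levels)

# Hypothetical source metadata and a query asking for Positive by country/day
s1 = {"measures": {"Positive"}, "levels": {"Geo.Country", "Time.Day"}}
s2 = {"measures": {"Positive"}, "levels": {"Geo.City", "Time.Day"}}  # City not in query
q = ("Positive", {"Geo.Country", "Time.Day"})
print(can_respond(s1, q), can_respond(s2, q))  # True False
```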
      <p>It is worth noting that multiple formulas may exist to calculate an indicator, and for each
formula there may be multiple sets of sources that provide the necessary measures. Clearly, the
different solutions must be compared to assess the quality of the query result. To this end, it
is necessary to join the sources considered in each solution, which is highly inefficient in the
context of a Data Lake. Therefore, we propose an efficient algorithm (Algorithm 1, Computing
the degree of joinability) to estimate the quality of the query result in terms of its cardinality.
The outcome of the algorithm is then used to
choose which sources will be joined to compute the query result.</p>
      <p>The algorithm takes as input a query Q = ⟨ind, {l1, . . . , lk}⟩ and returns the list of possible
solutions, in terms of the formula to be applied and the sources to be considered, enriched with the
estimated cardinality of the result. First, using the reasoning services defined over KPIOnto,
the algorithm searches for all formulas f(m1, . . . , mn) for ind that can be derived from
𝒦, such that each component measure mi, i = 1, . . . , n, is provided by a data source with a
dimensional schema compatible with Q. For each formula, the sets of sources that can
provide m1, . . . , mn are also returned. On these sets the degree of joinability is calculated,
which is used to estimate the cardinality of the query result. This index measures the likelihood
of producing a result out of a join among a set of domains.</p>
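The enumeration of candidate solutions can be sketched as follows; the single formula and the measure-to-sources map are hypothetical, chosen to mirror the evaluation of Section 5.

```python
from itertools import product

def candidate_solutions(formulas, providers):
    """For each candidate formula (a list of component measures), enumerate
    the source combinations that collectively provide all components."""
    solutions = []
    for measures in formulas:
        if all(m in providers for m in measures):
            # one source per component measure, in every combination
            for combo in product(*(providers[m] for m in measures)):
                solutions.append((tuple(measures), set(combo)))
    return solutions

# Hypothetical derived formula for an indicator, with its providing sources
formulas = [["ICU", "ICU_on_Positives_Rate"]]
providers = {"ICU": ["S5"], "ICU_on_Positives_Rate": ["S1", "S3"]}
sols = candidate_solutions(formulas, providers)
print(sols)  # S5 combined with either S1 or S3
```

Each returned source set is then scored with the degree of joinability before any actual join is executed.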
      <p>Sources are joinable if they have the same values for domains that are mapped to the same
dimensional levels. To check this condition, the corresponding domains should be compared in
order to determine how many values are shared between the sources through set containment.
However, a full comparison is not practical in a Data Lake scenario. For this reason, we resort
to the LSH Ensemble to provide an estimated evaluation of the joinability of  data sources.
Typical use of LSH Ensemble is based on a single join attribute at a time (similarity between sets),
while in our case the match needs to be performed on sets of dimensional levels. Hence, we
apply a combination function (e.g., a concatenation of strings) to the domains representing
levels, in order to map them into a single domain before applying the hashing function. In the
following, we refer to combined MinHashes, that can be pre-computed at source loading time
in order to speed up the evaluation of the joinability index.</p>
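The combination function can be sketched as plain string concatenation; the rows and column names are hypothetical.

```python
def combine_domains(rows, level_columns, sep="|"):
    """Map the domains of several join levels into a single domain by
    concatenating, per row, the values of the level columns."""
    return {sep.join(str(row[c]) for c in level_columns) for row in rows}

# Hypothetical rows of a source mapped to Geo.Country and Time.Day
rows = [
    {"country": "Italy", "day": "2021-01-01", "positive": 120},
    {"country": "Italy", "day": "2021-01-02", "positive": 140},
]
keys = combine_domains(rows, ["country", "day"])
print(keys)  # two combined join keys
```

The resulting single domain is what gets MinHashed at loading time, so multi-level join keys can be compared exactly like ordinary sets.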
      <p>The procedure for computing the degree of joinability is summarized in Algorithm 1. Given
the set of sources {S1, . . . , Sn}, with S* being the one with the lowest cardinality, the algorithm
returns the portion of elements of S* that will be considered in computing the join with the
other sources. Since the set of levels {l1, . . . , lk} defines a unique identifier for each tuple, multiplying
the degree of joinability by |S*| yields an estimation of the cardinality of the join. In case
the indicator is already available in a source, the cardinality of the query result is equal to
the cardinality of the source. As a first step (line 2), the threshold t is set to the maximum
value. Then, after identifying the source S* with the lowest cardinality (line 3), the function
LSH_Ensemble is called to obtain the set of sources with which S* is estimated to have a
containment score above t. If there is at least one source for which this does not hold, the
degree of joinability is less than t and the threshold is decreased by a given step (line 7).</p>
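A sketch of Algorithm 1, with exact set containment standing in for the LSH Ensemble estimate, so that the thresholding logic rather than the hashing is illustrated:

```python
def compute_joinability(domains, step=0.01):
    """Sketch of Algorithm 1: starting from the maximum threshold, decrease
    it until every other source contains at least a fraction t of the
    elements of the smallest source S*."""
    smallest = min(domains, key=len)                  # S*, lowest cardinality
    others = [d for d in domains if d is not smallest]
    t = 1.0                                           # line 2: maximum threshold
    while t > 0:
        if all(len(smallest & d) / len(smallest) >= t for d in others):
            return t
        t = round(t - step, 10)                       # line 7: decrease threshold
    return 0.0

# Three hypothetical sources: the result (about 2/3) overestimates the true
# degree of joinability (1/3), since the three-way join shares only "a".
A, B, C = {"a", "b", "c"}, {"a", "b", "d", "e"}, {"a", "c", "d", "e"}
deg = compute_joinability([A, B, C])
print(deg)  # 0.66 with the default step
```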
      <p>It is noteworthy that Algorithm 1 returns an overestimate of the degree of joinability of the
n sources. To give an example, if A = {a, b, c}, B = {a, b, d, e} and C = {a, c, d, e},
compute_joinability returns 2/3, but the cardinality of the join is 1, so the degree of joinability
should be 1/3. To get a more accurate result, MinHash could be used directly to estimate
the set containment, and then to perform the join among the n sources. Clearly this solution
lengthens the computation time, so for the scenario of this work we consider the approximation
proposed above.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Evaluation</title>
      <p>An evaluation of the approach is proposed here on a case study based on the Microsoft Azure
Covid-19 Data Lake and the Our World in Data repository. The Data Lake contains 5 sources
reporting several measures on COVID, aggregated by temporal and geographical dimensions
(basic information is available in the first columns of Table 1).</p>
      <p>The Knowledge Graph was setup by defining dimensions and levels from available online
resources. For any loaded data source, initialization includes computation of MinHashes for any
domain, mapping with the dimensional levels and computation of the combined MinHashes
for domains mapped to dimensional levels. For LSH Ensemble we set the number of hashing
permutations to 256 and number of parts to 32. The average execution time for hashing
computation for a domain ranges from 0.076 s (S3) to 28.125 s (S1). The mapping discovery
phase always requires less than 0.001 s per domain, while the time for computation of combined
MinHashes ranges from 0.151 s (S2) to 21.235 s (S1) per domain. Overall, domains are processed
in less than 1.6 s on average.</p>
      <p>In the following, we report an example of the application of the algorithms on the case study.
The result of the mapping discovery is shown in Table 1, where mapped levels and measures
are reported for each source. Let us assume the user is interested in analysing the measures
ICU_on_Positives_Rate and Positive at the Geo.Country and Time.Day levels. As for the first measure,
the algorithm returns (ICU_on_Positives_Rate, {{S1}, {S3}}). In this case, no join is needed, as the measure
is directly available from multiple sources. Therefore, the degree of joinability is equal to 1.
(Tests have been carried out on an Intel Core i5-1135G7, 8 cores @ 2.40GHz, x86_64 architecture, with 8 GB RAM,
running Linux Fedora 34. Azure Covid-19 Data Lake: https://docs.microsoft.com/en-us/azure/open-datasets/dataset-covid-19-data-lake.
Our World in Data: https://github.com/owid/covid-19-data)</p>
      <p>As for the second measure, the function returns a derived formula together with the source
sets {{S5}, {S1, S3}}. Combinations of sources are produced, and two alternative solutions are
available by combining S5 with either
S1 or S3. They are checked for joinability as follows, considering that the cardinality of S5 is
28661:
• ⟨S5, S1⟩: the degree of joinability between S5 and S1 is 0.78; hence, the estimated join
cardinality is 0.78 * 28661 = 22355, with a query time equal to 3.109 s;
• ⟨S5, S3⟩: the degree of joinability between S5 and S3 is 0.31; hence, the estimated join
cardinality is 0.31 * 28661 = 8884, with a query time equal to 3.283 s.</p>
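The estimated cardinalities above can be reproduced directly, as degree of joinability times the cardinality of the smallest source S5:

```python
# |S5| = 28661; degree of joinability times |S*| estimates the join cardinality.
card_S5 = 28661
estimates = {pair: int(degree * card_S5)
             for pair, degree in [("S5-S1", 0.78), ("S5-S3", 0.31)]}
print(estimates)  # {'S5-S1': 22355, 'S5-S3': 8884}
```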
      <p>As a result, the solution (S5,S1) is preferred over (S5,S3). This is motivated by the fact that S5
and S1 include data for both years 2020 and 2021, while S3 includes data only on year 2020.
Therefore, the degree of joinability of S3 with S5 is lower than that of S1, as the former shares a
smaller subset of data with the latter.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>This paper has introduced a knowledge-based approach for analytic query-driven discovery in
a Data Lake, which is characterized by the formal representation of indicators’ formulas and
efficient mechanisms for source integration and mapping discovery. Given a query ontologically
expressed as a measure of interest and relevant analysis dimensions, the framework identifies
sources capable of collectively responding by utilizing math-aware reasoning on indicator
formulas. The joinability of sources is quantitatively evaluated through the degree of joinability
index. With respect to previous work on query-driven discovery, which requires a number of
evaluations among sources that increases linearly with their number, our approach reduces such
evaluations to only the relevant sources by performing a preliminary evaluation based on mappings
to domains in the Knowledge Graph and formula rewriting.</p>
      <p>Future work will be devoted to defining a more comprehensive metadata model for the Data Lake,
including also operational and business metadata. We also aim to extend the query answering
approach towards interesting research directions. In particular, the degree of joinability could
be adapted to evaluate the completeness of a data source with respect to the Knowledge Graph
concepts. This would enable determining the scope of a source and paves the way for an
efficient evaluation of the overlap or complementarity among sources, and possibly more
efficient indexing approaches. Merging capabilities could also be beneficial to find unionable
sources and hence to vertically integrate data providing the same measures. Finally, dynamic
calculation of indicators can be envisaged for a variety of analytical tasks, including interactive
data exploration [22] or navigation [23]. Furthermore, we plan to identify real case studies
for an extensive evaluation, which will help in more precisely identifying potential benefits and
limitations for specific application contexts.
2019, pp. 179–188.
[17] P. Sawadogo, J. Darmont, On data lake architectures and metadata management, Journal
of Intelligent Information Systems 56 (2021) 97–120.
[18] A. Oram, Managing the Data Lake, O’Reilly, Sebastopol, CA, USA, 2015.
[19] C. Diamantini, P. L. Giudice, D. Potena, E. Storti, D. Ursino, An approach to extracting
topic-guided views from the sources of a data lake, Information Systems Frontiers 23
(2021) 243–262.
[20] A. Z. Broder, On the resemblance and containment of documents, in: Proceedings of</p>
      <p>Compression and Complexity of SEQUENCES 1997 (Cat. No. 97TB100171), IEEE, 1997, pp.
21–29.
[21] P. Indyk, R. Motwani, Approximate nearest neighbors: towards removing the curse of
dimensionality, in: Proceedings of the thirtieth annual ACM symposium on Theory of
computing, 1998, pp. 604–613.
[22] C. Diamantini, D. Potena, E. Storti, H. Zhang, An ontology-based data exploration tool for
key performance indicators, Lecture Notes in Computer Science 8841 (2014) 727–744.
[23] E. Zhu, K. Q. Pu, F. Nargesian, R. J. Miller, Interactive navigation of open data linkages,
Proc. VLDB Endow. 10 (2017) 1837–1840.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F.</given-names>
            <surname>Nargesian</surname>
          </string-name>
          , E. Zhu,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Pu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. C.</given-names>
            <surname>Arocena</surname>
          </string-name>
          ,
          <article-title>Data lake management: challenges and opportunities</article-title>
          ,
          <source>Proceedings of the VLDB Endowment</source>
          <volume>12</volume>
          (
          <year>2019</year>
          )
          <fpage>1986</fpage>
          -
          <lpage>1989</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Farid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roatis</surname>
          </string-name>
          , I. Ilyas,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hofmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <article-title>CLAMS: bringing quality to Data Lakes</article-title>
          ,
          <source>in: Proc. of the International Conference on Management of Data (SIGMOD/PODS'16)</source>
          , San Francisco, CA, USA,
          <year>2016</year>
          , pp.
          <fpage>2089</fpage>
          -
          <lpage>2092</lpage>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R. C.</given-names>
            <surname>Fernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Mansour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Qahtan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Elmagarmid</surname>
          </string-name>
          , I. Ilyas,
          <string-name>
            <given-names>S.</given-names>
            <surname>Madden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ouzzani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stonebraker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <article-title>Seeping semantics: Linking datasets using word embeddings for data discovery</article-title>
          ,
          <source>in: 2018 IEEE 34th International Conference on Data Engineering (ICDE)</source>
          , IEEE,
          <year>2018</year>
          , pp.
          <fpage>989</fpage>
          -
          <lpage>1000</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Hai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Geisler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Quix</surname>
          </string-name>
          ,
          <article-title>Constance: An intelligent data lake system</article-title>
          ,
          <source>in: Proc. of the International Conference on Management of Data (SIGMOD</source>
          <year>2016</year>
          ), San Francisco, CA, USA,
          <year>2016</year>
          , pp.
          <fpage>2097</fpage>
          -
          <lpage>2100</lpage>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M. N.</given-names>
            <surname>Mami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Graux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Scerri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jabeen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <article-title>Uniform access to multiform data lakes using semantic technologies</article-title>
          ,
          <source>in: Proceedings of the 21st International Conference on Information Integration and Web-based Applications &amp; Services</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>313</fpage>
          -
          <lpage>322</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <article-title>Open data integration</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>11</volume>
          (
          <year>2018</year>
          )
          <fpage>2130</fpage>
          -
          <lpage>2139</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>E.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Nargesian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <article-title>Josie: Overlap set similarity search for finding joinable tables in data lakes</article-title>
          ,
          <source>in: Proceedings of the 2019 International Conference on Management of Data</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>847</fpage>
          -
          <lpage>864</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R. C.</given-names>
            <surname>Fernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Madden</surname>
          </string-name>
          ,
          <article-title>Lazo: A cardinality-based method for coupled estimation of Jaccard similarity and containment</article-title>
          ,
          <source>in: 2019 IEEE 35th International Conference on Data Engineering (ICDE)</source>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>1190</fpage>
          -
          <lpage>1201</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>E.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Nargesian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Pu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <article-title>LSH ensemble: Internet-scale domain search</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>9</volume>
          (
          <year>2016</year>
          )
          <fpage>1185</fpage>
          -
          <lpage>1196</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>GB-KMV: An augmented KMV sketch for approximate containment similarity search</article-title>
          ,
          <source>in: 2019 IEEE 35th International Conference on Data Engineering (ICDE)</source>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>458</fpage>
          -
          <lpage>469</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>R. C.</given-names>
            <surname>Fernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Abedjan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Koko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Madden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stonebraker</surname>
          </string-name>
          ,
          <article-title>Aurum: A data discovery system</article-title>
          ,
          <source>in: 2018 IEEE 34th International Conference on Data Engineering (ICDE)</source>
          , IEEE,
          <year>2018</year>
          , pp.
          <fpage>1001</fpage>
          -
          <lpage>1012</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Santos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bessa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Musco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Freire</surname>
          </string-name>
          ,
          <article-title>A sketch-based index for correlated dataset search</article-title>
          ,
          <source>in: 2022 IEEE 38th International Conference on Data Engineering (ICDE)</source>
          , IEEE,
          <year>2022</year>
          , pp.
          <fpage>2928</fpage>
          -
          <lpage>2941</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Khatiwada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Shraga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Gatterbauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <article-title>Integrating data lake tables</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>16</volume>
          (
          <year>2022</year>
          )
          <fpage>932</fpage>
          -
          <lpage>945</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Armbrust</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ghodsi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaharia</surname>
          </string-name>
          ,
          <article-title>Lakehouse: a new generation of open platforms that unify data warehousing and advanced analytics</article-title>
          ,
          <source>in: Proceedings of CIDR</source>
          ,
          <year>2021</year>
          , p.
          <fpage>8</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>C.</given-names>
            <surname>Diamantini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Potena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Storti</surname>
          </string-name>
          ,
          <article-title>A knowledge-based approach to support analytic query answering in semantic data lakes</article-title>
          ,
          <source>in: Advances in Databases and Information Systems: 26th European Conference, ADBIS 2022, Turin, Italy, September 5-8, 2022, Proceedings</source>
          , Springer,
          <year>2022</year>
          , pp.
          <fpage>179</fpage>
          -
          <lpage>192</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>C.</given-names>
            <surname>Giebler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gröger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hoos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schwarz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mitschang</surname>
          </string-name>
          ,
          <article-title>Leveraging the data lake: Current state and challenges</article-title>
          , in:
          <string-name>
            <given-names>C.</given-names>
            <surname>Ordonez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Anderst-Kotsis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Tjoa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Khalil</surname>
          </string-name>
          (Eds.),
          <source>Big Data Analytics and Knowledge Discovery</source>
          , Springer International Publishing, Cham,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>