<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Ben-Gurion University of the Negev, Computer Science Department</institution>
          ,
          <addr-line>Beer-Sheva</addr-line>
          ,
          <country country="IL">Israel</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Cloud data lakes have emerged as an inexpensive solution for storing very large amounts of data. The main idea is the separation of the compute and storage layers: cheap cloud storage is used for storing the data, while compute engines are used for running analytics on this data in “on-demand” mode. However, to perform any computation on the data in this architecture, the data has to be moved from the storage layer to the compute layer over the network for each calculation. Obviously, this hurts calculation performance and requires huge network bandwidth. Our research focuses on three related topics: (1) identifying the key challenges to improving query performance in cloud data lakes, (2) providing a theoretical model that formally defines the problem of poor query performance in cloud data lakes, and (3) designing a practical solution to the problem and demonstrating its efficiency via a large-scale experimental evaluation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Keywords</title>
      <p>data lakes, cloud storage, query optimization</p>
      <p>© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>Traditionally, storage systems have favored data locality (meaning they wanted to be as close to the data as possible to speed up calculations on the data). In single-node databases, data locality occurs trivially, whereas in shared-nothing distributed systems, data locality is achieved by performing computation on the same machines that store the data. However, with the rise of cloud technologies, a new family of storage systems has emerged – cloud object stores (e.g., AWS S3 and Azure Blob Storage). These systems provide an object storage service through a web interface. Users create buckets, and each bucket may contain multiple binary objects uniquely identified within the bucket by a string key.</p>
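      <p>To make the bucket/key model concrete, the following is a minimal in-memory sketch of the object-store interface (a toy stand-in with names of our own choosing, not a real client library such as boto3):</p>
      <preformat>
```python
# Minimal in-memory sketch of a cloud object store's data model: buckets
# hold opaque binary objects, each uniquely identified by a string key.
# Toy illustration only; real stores (e.g., AWS S3) expose this via a web API.

class ToyObjectStore:
    def __init__(self):
        self.buckets = {}

    def create_bucket(self, name):
        self.buckets[name] = {}

    def put_object(self, bucket, key, data):
        # objects are opaque byte strings addressed by (bucket, key)
        self.buckets[bucket][key] = data

    def get_object(self, bucket, key):
        return self.buckets[bucket][key]

store = ToyObjectStore()
store.create_bucket("lake")
store.put_object("lake", "tables/metrics/file201", b"rows 1-3")
print(store.get_object("lake", "tables/metrics/file201"))  # b'rows 1-3'
```
      </preformat>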
      <p>Cloud object stores are widely considered to be the most cost-effective storage systems in the world right now [1, 2, 3, 4], and as a result, they are heavily used as the main building block of enterprise data repositories that have come to be known as cloud data lakes [1, 2, 3, 4, 5]. The main feature of cloud data lakes is that they store data in cloud object stores and, as a result, do not follow the traditional shared-nothing architecture but instead disaggregate the compute layer from the storage layer. This new approach is commonly called a data lake architecture [5].</p>
      <p>VLDB 2023 PhD Workshop, co-located with the 49th International Conference on Very Large Data Bases (VLDB 2023), Vancouver, Canada. This research was partially supported by the Israeli Council for Higher Education (CHE) via the Data Science Research Center, the Israel Data Science Initiative (IDSI), and the Lynne and William Frankel Center for Computer Science.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Problem Statement</title>
      <p>Let us introduce the problem via a simple example first. Consider the typical metric data presented in Table 1. Let us assume that this table is stored in the cloud data lake such that records 1-3 are stored in “file201”, records 4-6 are stored in “file170”, and records 7-9 in “file051”. The files’ format can be any of the standard supported formats (e.g., Parquet or ORC).</p>
      <p>[Table 1: sample metric data with columns including “metric” and “val”, partitioned across file201, file170, and file051.]</p>
      <p>Let us briefly review how state-of-the-art query engines [6, 7, 8] would execute a typical query on Table 1 stored in the cloud data lake. For example, when a query engine receives a query with “where” condition “metric = ’memory’ and val &gt; 10”, it reads the files from the storage, scans them in-memory to find the records satisfying the predicate, and returns the result. In our example, only record 3 from file201 will be returned. However, all the data lake files have to be read from the storage and processed; and since production data lakes might contain billions of files [9], this approach is extremely wasteful. So, intuitively, we would like to read only the “relevant” files for a given query (file201 in our example) and skip all the rest. Let us define the problem formally now.</p>
      <sec id="sec-3-1">
        <title>2.1. Formal problem definition</title>
        <p>We model data lake tables according to the standard relational model. Given a set of n domains D = {D1, D2, …, Dn}, a table T is defined as a subset of the Cartesian product D1 × D2 × … × Dn. Each domain Di ∈ D has an associated column name ci ∈ C, where C is the set of all column names. T is a set of tuples {t1, t2, …, t|T|}, where each tuple t is a set of pairs {(ci : vi) ∣ ci ∈ C, vi ∈ Di, i ∈ {1, …, n}}. T is stored in a cloud object store as a collection of files denoted by F = {f1, f2, …, f|F|}. F is a partition of T, meaning that ∀fi ∈ F: fi ⊆ T, ⋃ fi = T, and ∀i ≠ j: fi ∩ fj = ∅.</p>
        <p>Definition 1 (data lake query). We define a data lake query q as a standard SQL query on table T and denote the predicate in the where clause of q as Pq. We assume that Pq is given in conjunctive normal form (CNF). If tuple t from file f ∈ F satisfies Pq, we denote it by sat(Pq, t).</p>
        <p>Definition 2 (query coverage). Given a data lake query q, S ⊆ F covers q ↔ ∀f ∈ F ∖ S, ¬∃t ∈ f, sat(Pq, t).</p>
        <p>When S satisfies Definition 2 for some data lake query q, we say that S covers q (meaning that S contains all the files needed to satisfy q) and denote it by cov(S, q). We call such S a coverage set of q. Note that for any q, cov(F, q) holds.</p>
        <p>Definition 3 (query tight coverage). Given a data lake query q, S ⊆ F tightly covers q ↔ cov(S, q) ⋀ ¬∃f ∈ S, ∀t ∈ f, ¬sat(Pq, t).</p>
        <p>When S satisfies Definition 3 for some data lake query q, we say that S tightly covers q (meaning that S contains all the files needed to satisfy q and only them) and denote it by tcov(S, q). We call such S a tight coverage set of q and denote it by tc(q).</p>
        <p>If for any data lake query q we could (efficiently) compute S such that tcov(S, q), we would be able to significantly improve query performance in a cloud data lake architecture by accessing only the files in S instead of all those in F (and in most real-world scenarios |S| &lt;&lt; |F|). In fact, as we show below, in many cases finding the exact tight coverage might be too complicated, and we can be content with some coverage set that is not tight but still can help us improve query performance. For such scenarios, the definition of tightness degree might be useful.</p>
        <p>Definition 4 (tightness degree). Given a data lake query q, for any coverage set S of q, the tightness degree of S is defined as: td(S, q) = 1 − (|S| − |tc(q)|) / (|F| − |tc(q)|) if |tc(q)| &lt; |F|, and 0 otherwise.</p>
        <p>Intuitively, the tightness degree shows to what extent the given coverage set is close to the tight coverage set (1 means perfectly close). Based on the above semantics we can formulate our main research question as follows:
• Can we develop an algorithm that, for any data lake query q, can find a coverage set S of q such that:
– the tightness degree of S is maximized
– the cost of computing S (measured by the number of files read from the cloud storage) is minimized
– as a result of the above, the total execution time of q is reduced as much as possible</p>
      </sec>
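      <p>The coverage notions in this section can be illustrated with a small executable sketch. This is a hedged toy model under our own naming (sat, covers, tight_coverage, tightness_degree), not the prototype implementation; records are dictionaries and files are plain lists:</p>
      <preformat>
```python
# Toy illustration of the coverage definitions on a Table 1-like dataset.
from operator import gt  # gt(a, b) means "a greater than b"

# table partitioned into three files of records
files = {
    "file201": [{"metric": "cpu", "val": 5}, {"metric": "memory", "val": 12}],
    "file170": [{"metric": "cpu", "val": 7}, {"metric": "cpu", "val": 3}],
    "file051": [{"metric": "disk", "val": 1}],
}

def sat(record):
    # predicate of the example query: metric = 'memory' and val above 10
    return record["metric"] == "memory" and gt(record["val"], 10)

def covers(S):
    # query coverage: no file outside S contains a satisfying record
    return all(not any(sat(r) for r in files[f]) for f in files if f not in S)

def tight_coverage():
    # tight coverage: exactly the files holding at least one satisfying record
    return {f for f in files if any(sat(r) for r in files[f])}

def tightness_degree(S):
    # tightness degree: 1.0 means S is as small as the tight coverage set
    tc, n = tight_coverage(), len(files)
    if len(tc) == n:
        return 0.0
    return 1.0 - (len(S) - len(tc)) / (n - len(tc))

print(tight_coverage())                          # {'file201'}
print(covers({"file201"}))                       # True
print(tightness_degree({"file201", "file170"}))  # 0.5
```
      </preformat>
      <p>Reading only the tight coverage set {file201} answers the example query while skipping two of the three files; that saving is exactly what the tightness degree quantifies.</p>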
    </sec>
    <sec id="sec-5">
      <title>3. Related Work</title>
      <p>The most trivial approach for query execution in cloud data lakes is to read all the files. Clearly, F covers any q, but the coverage is far from being tight and hence query performance is poor.</p>
      <p>One of the first suggested optimizations was data partitioning [6], which is supported by all modern query engines. Unfortunately, only a limited subset of table columns can be used in partitioning, while production tables may contain tens of thousands of columns [10].</p>
      <p>Another well-known approach [11] is to attach metadata to each data lake file and use it during the reads to skip irrelevant files. Columnar formats support metadata-based skipping out-of-the-box by storing the metadata and the data in the same file and relying on the fact that cloud object stores support reading of particular sections of a file. Unfortunately, metadata-based skipping is very sensitive to data distribution and helps only in cases where the data is nicely clustered.</p>
      <p>Storage-level predicate pushdown (e.g., AWS S3 Select) can reduce the movement of data between the storage and compute layers. However, we still perform a lot of costly filtering operations; the only difference is that the operation is performed in a different place (i.e., in the worst-case scenario all |F| files should be accessed).</p>
      <p>Table formats (e.g., Delta Lake, Apache Iceberg) [9] are a novel approach to add missing capabilities (e.g., transactions, schema evolution, query optimization) to the data lake architecture. The main idea is to add an additional layer of metadata between the files in the storage and the compute layer. This metadata can then be used, for example, for storing information about schema changes, data mutation, and various statistics to improve query performance. While table formats are an important step towards query optimization in cloud data lakes, so far they do not solve the main problem considered in our study: reading of (a lot of) irrelevant files from the storage.</p>
      <p>Indexing is the primary method for improving query performance in relational databases, but it is rarely used in big data environments. Our preliminary work in [12] presents an indexing scheme for relational data in cloud data lakes where indexes are stored inside the lake and their creation is performed by parallel algorithms. This scheme, however, limits the query model to simple selection/projection queries with a single-column predicate.</p>
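      <p>The metadata-based skipping approach [11] can be sketched with per-file min/max statistics (a zone-map-style toy of our own construction, not tied to any particular columnar format):</p>
      <preformat>
```python
# Toy zone-map-style skipping: keep min and max of column "val" per file
# and prune files whose value range cannot satisfy the predicate.
from operator import gt  # gt(a, b) means "a greater than b"

file_stats = {
    # per-file statistics, as a columnar footer might store them
    "file201": {"min": 1, "max": 12},
    "file170": {"min": 3, "max": 7},
    "file051": {"min": 1, "max": 1},
}

def candidate_files(threshold):
    # predicate: val above threshold; a file may match only if its max
    # exceeds the threshold, so every other file is skipped unread
    return [f for f, s in file_stats.items() if gt(s["max"], threshold)]

print(candidate_files(10))  # ['file201']
```
      </preformat>
      <p>Note that the pruned set is a coverage set but not necessarily a tight one: when the data is poorly clustered, most per-file ranges overlap the predicate and little is skipped, which is exactly the sensitivity to data distribution mentioned above.</p>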
    </sec>
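    <p>The balls-into-bins file-count estimation used in our approach (Section 4) admits a simple closed form: the expected number of non-empty bins after throwing r balls into |F| bins. A small sketch follows (the function name is ours):</p>
    <preformat>
```python
# Expected number of files touched when r matching records are spread
# uniformly at random over num_files files (balls-into-bins estimate).

def expected_files(r, num_files):
    # E[non-empty bins] = |F| * (1 - (1 - 1/|F|) ** r)
    return num_files * (1.0 - (1.0 - 1.0 / num_files) ** r)

# e.g., a clause estimated to match 100 records over 1000 files is
# expected to touch about 95 of them
print(round(expected_files(100, 1000), 1))  # 95.2
```
    </preformat>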
    <sec id="sec-8">
      <title>4. Our Approach</title>
      <p>Finding the tight coverage set of a query might be too costly (it can be proved that in the worst-case scenario it cannot be better than Ω(|F|)). The main idea of our approach is, instead of looking for the tight coverage set of the given query, to find some coverage set that will result in an optimal total query execution time (i.e., we are looking for the coverage set S such that the sum of the cost of finding S and its size is minimized). We focus on the “where” condition of the query and look at each predicate clause separately. There may be many coverage sets associated with each clause, and there may be many ways to compute each of these coverage sets. We assume that for each possible coverage execution plan we can estimate (e.g., via statistics, caching, ML models, etc.) the expected cost and the expected result of the plan. An important observation is that the intersection of coverage sets of any subset of the query clauses is a coverage set of the original query. Based on this observation, our optimization scheme consists of the following two steps:
1. Given a query, we choose a set of coverage plans such that the sum of their estimated costs and the estimated size of the resulting coverage set is minimized.
2. Then, we execute each of the coverage plans, intersect their results, and execute the original query on the files in the intersection only.</p>
      <p>Step (1) can be formulated as an optimization problem and we prove that it is NP-hard (by reduction from the set covering problem); we suggest heuristic algorithms to deal with the hardness of the problem. Our result estimation scheme is based on the regular relational database selection size estimation [13], in which we can estimate the number of records satisfying the predicate based on statistical information. Once we have the estimated number of records (r) for the given clause, we can estimate the corresponding number of files as the number of non-empty bins after randomly throwing r balls into |F| bins. Execution of the coverage plans (step 2) is based on our previous work on indexing in cloud data lakes [12, 14].</p>
      <p>We built a prototype of our solution; the implementation is available online at https://github.com/grishaw/data-lake-coverage. We used Apache Spark as the main compute engine and AWS as the cloud provider. For the benchmark, we used the TPC-H dataset (https://www.tpc.org/tpch/) with a scale factor of 1 TB. We executed different queries with and without our scheme and achieved up to x19 query performance improvement.</p>
    </sec>
    <sec id="sec-6">
      <title>5. Conclusions and Future Research Directions</title>
      <p>Data volumes keep growing, and new technologies are being developed all the time to adjust to the constant increase in data traffic. Cloud data lakes are one of the most successful solutions to the big data challenge. They provide great scalability, usability, and cost-efficiency, but perform poorly with interactive queries.</p>
      <p>We analyze the problem of poor performance of interactive queries in cloud data lakes and identify the main obstacles to achieving better performance. We formally define the problem by introducing the new concept of the query (tight) coverage set. We use our terminology to define an optimization problem that finds the best coverage set for the given query. We show that the problem is NP-hard and deal with its hardness by suggesting heuristic algorithms. Our solution is based on ideas from relational databases related to indexing and statistics management, adjusted to a cloud data lake architecture. The main idea is that storage resources are usually much cheaper than compute resources [4, 12], so if we could provably improve query performance by adding more storage, it would probably be very useful for many data lake users.</p>
      <p>For future research, we are planning to focus on the following directions:
• We can try improving our estimation scheme by training ML models to “guess” the expected result of a given query coverage plan.
• We want to extend our basic query model to more complex query types (e.g., joins and group by).
• An interesting caching technique can be based on the tight coverage sets: the tc function defines an equivalence relation on the set of all possible queries over the table. So, if for any query q we could tell which equivalence class it belongs to, we would be able to cache a tight coverage set per equivalence class on the query engine side and by that improve query performance even more while using limited memory resources.
• We want to apply our optimization techniques in real-world applications (we already have initial results in a multidisciplinary project where we analyze large-scale genomic data with cloud data lakes [15, 16]).</p>
    </sec>
    <sec id="sec-8-1">
      <title>References</title>
      <p>[1] M. Armbrust, T. Das, L. Sun, B. Yavuz, S. Zhu, M. Murthy, et al., Delta lake: High-performance ACID table storage over cloud object stores, Proceedings of the VLDB Endowment 13 (2020) 3411–3424.
[2] M. Armbrust, A. Ghodsi, R. Xin, M. Zaharia, Lakehouse: A new generation of open platforms that unify data warehousing and advanced analytics, in: Proceedings of CIDR, 2021.
[3] Y. Yang, M. Youill, M. Woicik, Y. Liu, X. Yu, M. Serafini, A. Aboulnaga, M. Stonebraker, FlexPushdownDB: Hybrid pushdown and caching in a cloud DBMS, Proceedings of the VLDB Endowment 14 (2021) 2101–2113.
[4] J. Tan, T. Ghanem, M. Perron, X. Yu, M. Stonebraker, D. DeWitt, M. Serafini, A. Aboulnaga, T. Kraska, Choosing a cloud DBMS: Architectures and tradeoffs, Proceedings of the VLDB Endowment 12 (2019) 2170–2182.
[5] D. Abadi, A. Ailamaki, D. Andersen, P. Bailis, M. Balazinska, P. A. Bernstein, P. Boncz, S. Chaudhuri, A. Cheung, A. Doan, et al., The Seattle report on database research, Communications of the ACM 65 (2022) 72–79.
[6] J. Camacho-Rodríguez, A. Chauhan, A. Gates, E. Koifman, O. O’Malley, V. Garg, et al., Apache Hive: From MapReduce to enterprise-grade big data warehousing, in: Proceedings of the 2019 International Conference on Management of Data, 2019, pp. 1773–1786.
[7] M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, et al., Spark SQL: Relational data processing in Spark, in: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, 2015, pp. 1383–1394.
[8] R. Sethi, M. Traverso, D. Sundstrom, D. Phillips, W. Xie, Y. Sun, N. Yegitbasi, H. Jin, E. Hwang, N. Shingte, et al., Presto: SQL on everything, in: 2019 IEEE 35th International Conference on Data Engineering (ICDE), IEEE, 2019, pp. 1802–1813.
[9] P. Jain, P. Kraft, C. Power, T. Das, I. Stoica, M. Zaharia, Analyzing and comparing lakehouse storage systems, in: Proceedings of CIDR, 2023.
[10] P. Edara, M. Pasumansky, Big metadata: When metadata is big data, Proceedings of the VLDB Endowment 14 (2021) 3083–3095.
[11] G. Moerkotte, Small materialized aggregates: A light weight index structure for data warehousing, in: VLDB’98, Proceedings of 24th International Conference on Very Large Data Bases, 1998, pp. 476–487.
[12] G. Weintraub, E. Gudes, S. Dolev, Needle in a haystack queries in cloud data lakes, in: EDBT/ICDT Workshops, 2021.
[13] A. Silberschatz, H. F. Korth, S. Sudarshan, et al., Database System Concepts, volume 5, McGraw-Hill, New York, 2002.
[14] G. Weintraub, E. Gudes, S. Dolev, Indexing cloud data lakes within the lakes, in: Proceedings of the 14th ACM International Conference on Systems and Storage, 2021, pp. 1–1.
[15] G. Weintraub, N. Hadar, E. Gudes, S. Dolev, O. Birk, Analyzing large-scale genomic data with cloud data lakes, in: Proceedings of the 16th ACM International Conference on Systems and Storage, SYSTOR ’23, 2023, p. 142.
[16] N. Hadar, G. Weintraub, E. Gudes, S. Dolev, O. S. Birk, GeniePool: Genomic database with corresponding annotated samples based on a cloud data lake architecture, Database 2023 (2023) baad043.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>