<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Ben-Gurion University of the Negev, Computer Science Department</institution>
          ,
          <addr-line>Beer-Sheva</addr-line>
          ,
          <country country="IL">Israel</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Cloud data lakes have emerged as an inexpensive solution for storing very large amounts of data. The main idea is the separation of the compute and storage layers: cheap cloud storage is used for storing the data, while compute engines are used for running analytics on this data in “on-demand” mode. However, to perform any computation on the data in this architecture, the data has to be moved from the storage layer to the compute layer over the network for each calculation. Obviously, this hurts calculation performance and requires huge network bandwidth. Our research focuses on three related topics: (1) identifying the key challenges to improving query performance in cloud data lakes, (2) providing a theoretical model that formally defines the problem of poor query performance in cloud data lakes, and (3) designing a practical solution to the problem and demonstrating its efficiency via a large-scale experimental evaluation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Keywords</title>
      <p>data lakes, cloud storage, query optimization</p>
      <p>© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>Traditionally, storage systems have favored data locality (meaning they wanted to be as close to the data as possible to speed up calculations on the data). In single-node databases, data locality occurs trivially, whereas in shared-nothing distributed systems, data locality is achieved by performing computation on the same machines that store the data. However, with the rise of cloud technologies, a new family of storage systems has emerged – cloud object stores (e.g., AWS S3 and Azure Blob Storage). These systems provide an object storage service through a web interface. Users create buckets, and each bucket may contain multiple binary objects uniquely identified within the bucket by a string key.</p>
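      <p>To make the bucket/key model concrete, the following is a minimal in-memory sketch of the object-store interface (a toy stand-in with names of our own choosing, not a real client library such as boto3):</p>
      <preformat>
```python
# Minimal in-memory sketch of a cloud object store's data model: buckets
# hold opaque binary objects, each uniquely identified by a string key.
# Toy illustration only; real stores (e.g., AWS S3) expose this via a web API.

class ToyObjectStore:
    def __init__(self):
        self.buckets = {}

    def create_bucket(self, name):
        self.buckets[name] = {}

    def put_object(self, bucket, key, data):
        # objects are opaque byte strings addressed by (bucket, key)
        self.buckets[bucket][key] = data

    def get_object(self, bucket, key):
        return self.buckets[bucket][key]

store = ToyObjectStore()
store.create_bucket("lake")
store.put_object("lake", "tables/metrics/file201", b"rows 1-3")
print(store.get_object("lake", "tables/metrics/file201"))  # b'rows 1-3'
```
      </preformat>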
      <p>Cloud object stores are widely considered to be the most cost-effective storage systems in the world right now [1, 2, 3, 4], and as a result, they are heavily used as the main building block of enterprise data repositories that have come to be known as cloud data lakes [1, 2, 3, 4, 5]. The main feature of cloud data lakes is that they store data in cloud object stores and, as a result, do not follow the traditional shared-nothing architecture but instead disaggregate the compute layer from the storage layer. This new approach is commonly called a data lake architecture [5].</p>
      <p>VLDB 2023 PhD Workshop, co-located with the 49th International Conference on Very Large Data Bases (VLDB 2023), Vancouver, Canada. This research was partially supported by the Israeli Council for Higher Education (CHE) via the Data Science Research Center, the Israel Data Science Initiative (IDSI), and the Lynne and William Frankel Center for Computer Science.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Problem Statement</title>
      <p>Let us introduce the problem via a simple example first. Consider the typical metric data presented in Table 1. Let us assume that this table is stored in the cloud data lake such that records 1-3 are stored in “file201”, records 4-6 are stored in “file170”, and records 7-9 in “file051”. The files’ format can be any of the standard supported formats (e.g., Parquet or ORC).</p>
      <p>[Table 1: sample metric data with columns including “metric” and “val”, partitioned across file201, file170, and file051.]</p>
      <p>Let us briefly review how state-of-the-art query engines [6, 7, 8] would execute a typical query on Table 1 stored in the cloud data lake. For example, when a query engine receives a query with “where” condition “metric = ’memory’ and val &gt; 10”, it reads the files from the storage, scans them in-memory to find the records satisfying the predicate, and returns the result. In our example, only record 3 from file201 will be returned. However, all the data lake files have to be read from the storage and processed; and since production data lakes might contain billions of files [9], this approach is extremely wasteful. So, intuitively, we would like to read only the “relevant” files for a given query (file201 in our example) and skip all the rest. Let us define the problem formally now.</p>
      <sec id="sec-3-1">
        <title>2.1. Formal problem definition</title>
        <p>We model data lake tables according to the standard relational model. Given a set of n domains D = {D1, D2, …, Dn}, a table T is defined as a subset of the Cartesian product D1 × D2 × … × Dn. Each domain Di ∈ D has an associated column name ci ∈ C, where C is the set of all column names. T is a set of tuples {t1, t2, …, t|T|}, where each tuple t is a set of pairs {(ci : vi) ∣ ci ∈ C, vi ∈ Di, i ∈ {1, …, n}}. T is stored in a cloud object store as a collection of files denoted by F = {f1, f2, …, f|F|}. F is a partition of T, meaning that ∀fi ∈ F: fi ⊆ T, ⋃ fi = T, and ∀i ≠ j: fi ∩ fj = ∅.</p>
        <p>Definition 1 (data lake query). We define a data lake query q as a standard SQL query on table T and denote the predicate in the where clause of q as Pq. We assume that Pq is given in conjunctive normal form (CNF). If tuple t from file f ∈ F satisfies Pq, we denote it by sat(Pq, t).</p>
        <p>Definition 2 (query coverage). Given a data lake query q, S ⊆ F covers q ↔ ∀f ∈ F ∖ S, ¬∃t ∈ f, sat(Pq, t).</p>
        <p>When S satisfies Definition 2 for some data lake query q, we say that S covers q (meaning that S contains all the files needed to satisfy q) and denote it by cov(S, q). We call such S a coverage set of q. Note that for any q, cov(F, q) holds.</p>
        <p>Definition 3 (query tight coverage). Given a data lake query q, S ⊆ F tightly covers q ↔ cov(S, q) ⋀ ¬∃f ∈ S, ∀t ∈ f, ¬sat(Pq, t).</p>
        <p>When S satisfies Definition 3 for some data lake query q, we say that S tightly covers q (meaning that S contains all the files needed to satisfy q and only them) and denote it by tcov(S, q). We call such S a tight coverage set of q and denote it by tc(q).</p>
        <p>If for any data lake query q we could (efficiently) compute S such that tcov(S, q), we would be able to significantly improve query performance in a cloud data lake architecture by accessing only the files in S instead of all those in F (and in most real-world scenarios |S| &lt;&lt; |F|). In fact, as we show below, in many cases finding the exact tight coverage might be too complicated, and we can be content with some coverage set that is not tight but still can help us improve query performance. For such scenarios, the definition of tightness degree might be useful.</p>
        <p>Definition 4 (tightness degree). Given a data lake query q, for any coverage set S of q, the tightness degree of S is defined as: td(S, q) = 1 − (|S| − |tc(q)|) / (|F| − |tc(q)|) if |tc(q)| &lt; |F|, and 0 otherwise.</p>
        <p>Intuitively, the tightness degree shows to what extent the given coverage set is close to the tight coverage set (1 means perfectly close). Based on the above semantics we can formulate our main research question as follows:
• Can we develop an algorithm that, for any data lake query q, can find a coverage set S of q such that:
– the tightness degree of S is maximized
– the cost of computing S (measured by the number of files read from the cloud storage) is minimized
– as a result of the above, the total execution time of q is reduced as much as possible</p>
      </sec>
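      <p>The coverage notions in this section can be illustrated with a small executable sketch. This is a hedged toy model under our own naming (sat, covers, tight_coverage, tightness_degree), not the prototype implementation; records are dictionaries and files are plain lists:</p>
      <preformat>
```python
# Toy illustration of the coverage definitions on a Table 1-like dataset.
from operator import gt  # gt(a, b) means "a greater than b"

# table partitioned into three files of records
files = {
    "file201": [{"metric": "cpu", "val": 5}, {"metric": "memory", "val": 12}],
    "file170": [{"metric": "cpu", "val": 7}, {"metric": "cpu", "val": 3}],
    "file051": [{"metric": "disk", "val": 1}],
}

def sat(record):
    # predicate of the example query: metric = 'memory' and val above 10
    return record["metric"] == "memory" and gt(record["val"], 10)

def covers(S):
    # query coverage: no file outside S contains a satisfying record
    return all(not any(sat(r) for r in files[f]) for f in files if f not in S)

def tight_coverage():
    # tight coverage: exactly the files holding at least one satisfying record
    return {f for f in files if any(sat(r) for r in files[f])}

def tightness_degree(S):
    # tightness degree: 1.0 means S is as small as the tight coverage set
    tc, n = tight_coverage(), len(files)
    if len(tc) == n:
        return 0.0
    return 1.0 - (len(S) - len(tc)) / (n - len(tc))

print(tight_coverage())                          # {'file201'}
print(covers({"file201"}))                       # True
print(tightness_degree({"file201", "file170"}))  # 0.5
```
      </preformat>
      <p>Reading only the tight coverage set {file201} answers the example query while skipping two of the three files; that saving is exactly what the tightness degree quantifies.</p>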
    </sec>
    <sec id="sec-5">
      <title>3. Related Work</title>
      <p>The most trivial approach for query execution in cloud data lakes is to read all the files. Clearly, F covers any q, but the coverage is far from being tight and hence query performance is poor.</p>
      <p>One of the first suggested optimizations was data partitioning [6], which is supported by all modern query engines. Unfortunately, only a limited subset of table columns can be used in partitioning, while production tables may contain tens of thousands of columns [10].</p>
      <p>Another well-known approach [11] is to attach metadata to each data lake file and use it during the reads to skip irrelevant files. Columnar formats support metadata-based skipping out-of-the-box by storing the metadata and the data in the same file and relying on the fact that cloud object stores support reading of particular sections of a file. Unfortunately, metadata-based skipping is very sensitive to data distribution and helps only in cases where the data is nicely clustered.</p>
      <p>Storage-level predicate pushdown (e.g., AWS S3 Select) can reduce the movement of data between the storage and compute layers. However, we still perform a lot of costly filtering operations; the only difference is that the operation is performed in a different place (i.e., in the worst-case scenario all |F| files should be accessed).</p>
      <p>Table formats (e.g., Delta Lake, Apache Iceberg) [9] are a novel approach to add missing capabilities (e.g., transactions, schema evolution, query optimization) to the data lake architecture. The main idea is to add an additional layer of metadata between the files in the storage and the compute layer. This metadata can then be used, for example, for storing information about schema changes, data mutation, and various statistics to improve query performance. While table formats are an important step towards query optimization in cloud data lakes, so far they do not solve the main problem considered in our study: reading of (a lot of) irrelevant files from the storage.</p>
      <p>Indexing is the primary method for improving query performance in relational databases, but it is rarely used in big data environments. Our preliminary work in [12] presents an indexing scheme for relational data in cloud data lakes where indexes are stored inside the lake and their creation is performed by parallel algorithms. This scheme, however, limits the query model to simple selection/projection queries with a single-column predicate.</p>
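      <p>The metadata-based skipping approach [11] can be sketched with per-file min/max statistics (a zone-map-style toy of our own construction, not tied to any particular columnar format):</p>
      <preformat>
```python
# Toy zone-map-style skipping: keep min and max of column "val" per file
# and prune files whose value range cannot satisfy the predicate.
from operator import gt  # gt(a, b) means "a greater than b"

file_stats = {
    # per-file statistics, as a columnar footer might store them
    "file201": {"min": 1, "max": 12},
    "file170": {"min": 3, "max": 7},
    "file051": {"min": 1, "max": 1},
}

def candidate_files(threshold):
    # predicate: val above threshold; a file may match only if its max
    # exceeds the threshold, so every other file is skipped unread
    return [f for f, s in file_stats.items() if gt(s["max"], threshold)]

print(candidate_files(10))  # ['file201']
```
      </preformat>
      <p>Note that the pruned set is a coverage set but not necessarily a tight one: when the data is poorly clustered, most per-file ranges overlap the predicate and little is skipped, which is exactly the sensitivity to data distribution mentioned above.</p>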
    </sec>
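    <p>The balls-into-bins file-count estimation used in our approach (Section 4) admits a simple closed form: the expected number of non-empty bins after throwing r balls into |F| bins. A small sketch follows (the function name is ours):</p>
    <preformat>
```python
# Expected number of files touched when r matching records are spread
# uniformly at random over num_files files (balls-into-bins estimate).

def expected_files(r, num_files):
    # E[non-empty bins] = |F| * (1 - (1 - 1/|F|) ** r)
    return num_files * (1.0 - (1.0 - 1.0 / num_files) ** r)

# e.g., a clause estimated to match 100 records over 1000 files is
# expected to touch about 95 of them
print(round(expected_files(100, 1000), 1))  # 95.2
```
    </preformat>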
    <sec id="sec-8">
      <title>4. Our Approach</title>
      <p>Finding the tight coverage set of a query might be too costly (it can be proved that in the worst-case scenario it cannot be better than Ω(|F|)). The main idea of our approach is, instead of looking for the tight coverage set of the given query, to find some coverage set that will result in an optimal total query execution time (i.e., we are looking for the coverage set S such that the sum of the cost of finding S and its size is minimized). We focus on the “where” condition of the query and look at each predicate clause separately. There may be many coverage sets associated with each clause, and there may be many ways to compute each of these coverage sets. We assume that for each possible coverage execution plan we can estimate (e.g., via statistics, caching, ML models, etc.) the expected cost and the expected result of the plan. An important observation is that the intersection of coverage sets of any subset of the query clauses is a coverage set of the original query. Based on this observation, our optimization scheme consists of the following two steps:
1. Given a query, we choose a set of coverage plans such that the sum of their estimated costs and the estimated size of the resulting coverage set is minimized.
2. Then, we execute each of the coverage plans, intersect their results, and execute the original query on the files in the intersection only.</p>
      <p>Step (1) can be formulated as an optimization problem and we prove that it is NP-hard (by reduction from the set covering problem); we suggest heuristic algorithms to deal with the hardness of the problem. Our result estimation scheme is based on the regular relational database selection size estimation [13], in which we can estimate the number of records satisfying the predicate based on statistical information. Once we have the estimated number of records (r) for the given clause, we can estimate the corresponding number of files as the number of non-empty bins after randomly throwing r balls into |F| bins. Execution of the coverage plans (step 2) is based on our previous work on indexing in cloud data lakes [12, 14].</p>
      <p>We built a prototype of our solution; the implementation is available online at https://github.com/grishaw/data-lake-coverage. We used Apache Spark as the main compute engine and AWS as the cloud provider. For the benchmark, we used the TPC-H dataset (https://www.tpc.org/tpch/) with a scale factor of 1 TB. We executed different queries with and without our scheme and achieved up to x19 query performance improvement.</p>
    </sec>
    <sec id="sec-6">
      <title>5. Conclusions and Future Research Directions</title>
      <p>Data volumes keep growing, and new technologies are being developed all the time to adjust to the constant increase in data traffic. Cloud data lakes are one of the most successful solutions to the big data challenge. They provide great scalability, usability, and cost-efficiency, but perform poorly with interactive queries.</p>
      <p>We analyze the problem of poor performance of interactive queries in cloud data lakes and identify the main obstacles to achieving better performance. We formally define the problem by introducing the new concept of the query (tight) coverage set. We use our terminology to define an optimization problem that finds the best coverage set for the given query. We show that the problem is NP-hard and deal with its hardness by suggesting heuristic algorithms. Our solution is based on ideas from relational databases related to indexing and statistics management, adjusted to a cloud data lake architecture. The main idea is that storage resources are usually much cheaper than compute resources [4, 12], so if we could provably improve query performance by adding more storage, it would probably be very useful for many data lake users.</p>
      <p>For future research, we are planning to focus on the following directions:
• We can try improving our estimation scheme by training ML models to “guess” the expected result of a given query coverage plan.
• We want to extend our basic query model to more complex query types (e.g., joins and group by).
• An interesting caching technique can be based on the tight coverage sets: the tc function defines an equivalence relation on the set of all possible queries over the table. So, if for any query q we could tell which equivalence class it belongs to, we would be able to cache a tight coverage set per equivalence class on the query engine side and by that improve query performance even more while using limited memory resources.
• We want to apply our optimization techniques in real-world applications (we already have initial results in a multidisciplinary project where we analyze large-scale genomic data with cloud data lakes [15, 16]).</p>
    </sec>
    <sec id="sec-8-1">
      <title>References</title>
      <p>[1] M. Armbrust, T. Das, L. Sun, B. Yavuz, S. Zhu, M. Murthy, et al., Delta lake: High-performance ACID table storage over cloud object stores, Proceedings of the VLDB Endowment 13 (2020) 3411–3424.
[2] M. Armbrust, A. Ghodsi, R. Xin, M. Zaharia, Lakehouse: A new generation of open platforms that unify data warehousing and advanced analytics, in: Proceedings of CIDR, 2021.
[3] Y. Yang, M. Youill, M. Woicik, Y. Liu, X. Yu, M. Serafini, A. Aboulnaga, M. Stonebraker, FlexPushdownDB: Hybrid pushdown and caching in a cloud DBMS, Proceedings of the VLDB Endowment 14 (2021) 2101–2113.
[4] J. Tan, T. Ghanem, M. Perron, X. Yu, M. Stonebraker, D. DeWitt, M. Serafini, A. Aboulnaga, T. Kraska, Choosing a cloud DBMS: Architectures and tradeoffs, Proceedings of the VLDB Endowment 12 (2019) 2170–2182.
[5] D. Abadi, A. Ailamaki, D. Andersen, P. Bailis, M. Balazinska, P. A. Bernstein, P. Boncz, S. Chaudhuri, A. Cheung, A. Doan, et al., The Seattle report on database research, Communications of the ACM 65 (2022) 72–79.
[6] J. Camacho-Rodríguez, A. Chauhan, A. Gates, E. Koifman, O. O’Malley, V. Garg, et al., Apache Hive: From MapReduce to enterprise-grade big data warehousing, in: Proceedings of the 2019 International Conference on Management of Data, 2019, pp. 1773–1786.
[7] M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, et al., Spark SQL: Relational data processing in Spark, in: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, 2015, pp. 1383–1394.
[8] R. Sethi, M. Traverso, D. Sundstrom, D. Phillips, W. Xie, Y. Sun, N. Yegitbasi, H. Jin, E. Hwang, N. Shingte, et al., Presto: SQL on everything, in: 2019 IEEE 35th International Conference on Data Engineering (ICDE), IEEE, 2019, pp. 1802–1813.
[9] P. Jain, P. Kraft, C. Power, T. Das, I. Stoica, M. Zaharia, Analyzing and comparing lakehouse storage systems, in: Proceedings of CIDR, 2023.
[10] P. Edara, M. Pasumansky, Big metadata: When metadata is big data, Proceedings of the VLDB Endowment 14 (2021) 3083–3095.
[11] G. Moerkotte, Small materialized aggregates: A light weight index structure for data warehousing, in: VLDB’98, Proceedings of 24th International Conference on Very Large Data Bases, 1998, pp. 476–487.
[12] G. Weintraub, E. Gudes, S. Dolev, Needle in a haystack queries in cloud data lakes, in: EDBT/ICDT Workshops, 2021.
[13] A. Silberschatz, H. F. Korth, S. Sudarshan, et al., Database System Concepts, volume 5, McGraw-Hill, New York, 2002.
[14] G. Weintraub, E. Gudes, S. Dolev, Indexing cloud data lakes within the lakes, in: Proceedings of the 14th ACM International Conference on Systems and Storage, 2021, pp. 1–1.
[15] G. Weintraub, N. Hadar, E. Gudes, S. Dolev, O. Birk, Analyzing large-scale genomic data with cloud data lakes, in: Proceedings of the 16th ACM International Conference on Systems and Storage, SYSTOR ’23, 2023, p. 142.
[16] N. Hadar, G. Weintraub, E. Gudes, S. Dolev, O. S. Birk, GeniePool: Genomic database with corresponding annotated samples based on a cloud data lake architecture, Database 2023 (2023) baad043.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>