<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Needle in a haystack queries in cloud data lakes</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Grisha</forename><surname>Weintraub</surname></persName>
							<email>grishaw@post.bgu.ac.il</email>
						</author>
						<author>
							<persName><forename type="first">Ehud</forename><surname>Gudes</surname></persName>
						</author>
						<author>
							<affiliation key="aff0">
								<orgName type="institution">Ben-Gurion University of the Negev Beer-Sheva</orgName>
								<address>
									<country key="IL">Israel</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Shlomi</forename><surname>Dolev</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">Ben-Gurion University of the Negev Beer-Sheva</orgName>
								<address>
									<country key="IL">Israel</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff2">
								<orgName type="institution">Ben-Gurion University of the Negev Beer-Sheva</orgName>
								<address>
									<country key="IL">Israel</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Needle in a haystack queries in cloud data lakes</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">9B1C766797415A5178E1F3FEF2556EB6</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T01:25+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Cloud data lakes are a modern approach for storing large amounts of data in a convenient and inexpensive way. Query engines (e.g. Hive, Presto, SparkSQL) are used to run SQL queries on data lakes. Their main focus is on analytical queries while random reads are overlooked. In this paper, we present our approach for optimizing needle in a haystack queries in cloud data lakes. The main idea is to maintain an index structure that maps indexed column values to their files. According to our analysis and experimental evaluation, our solution imposes a reasonable storage overhead while providing an order of magnitude performance improvement.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>Data lakes are a relatively new concept, which has recently received much attention from both the research community and industry <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b28">29,</ref><ref type="bibr" target="#b40">41]</ref>. There is no formal definition of the term data lake, but it is commonly described as a centralized repository containing very large amounts of data in their original (or minimally processed) format. The main idea behind this approach is that instead of loading the data into traditional data warehouses, which require scrupulous schema design and a high development and maintenance cost, the data is simply streamed into a distributed storage system (e.g. HDFS <ref type="bibr" target="#b42">[43]</ref>) and from that moment becomes available for analytics. Cloud data lakes are managed by cloud providers and store their data in cloud object stores, such as AWS S3 <ref type="bibr" target="#b11">[12]</ref>, Google Cloud Storage <ref type="bibr" target="#b30">[31]</ref>, and Azure Blob Storage <ref type="bibr" target="#b36">[37]</ref>. For analytics, compute engines (e.g. Spark <ref type="bibr" target="#b46">[47]</ref>, MapReduce <ref type="bibr" target="#b24">[25]</ref>) are used "on-demand". They access cloud data lakes via a simple get/put API over the network, and this separation of the storage and compute layers results in a lower cost and allows independent scaling of each layer (Fig. <ref type="figure" target="#fig_0">1</ref>). Query engines are used to run SQL queries on top of data lakes. Hive <ref type="bibr" target="#b17">[18]</ref>, originally built at Facebook, was the first such system. In Hive, SQL-like queries are compiled into a series of map-reduce jobs and are executed on the MapReduce <ref type="bibr" target="#b24">[25]</ref> framework. 
A somewhat similar approach is implemented in Spark SQL <ref type="bibr" target="#b12">[13]</ref>, where SQL queries run on top of the Spark engine <ref type="bibr" target="#b46">[47]</ref>.</p><p>Another group of query engines does not use general-purpose frameworks such as MapReduce or Spark but relies on purpose-built engines drawing on ideas from parallel databases. The first such system was Google's Dremel <ref type="bibr" target="#b35">[36]</ref>, which inspired many modern query engines; the most well-known are Drill <ref type="bibr" target="#b31">[32]</ref>, Presto <ref type="bibr" target="#b41">[42]</ref>, and Impala <ref type="bibr" target="#b32">[33]</ref>.</p><p>The following techniques are commonly associated with query engines running on data lakes:</p><p>• in-situ processing: In-situ means that the data is accessed in place (i.e. in HDFS or a cloud object store), and there is no need to load the data into a database management system (DBMS). Thanks to this feature, the same data lake can be queried simultaneously by different query engines without any difficulties. • parallel processing: The data is accessed in parallel (usually by a cluster of dedicated machines). • columnar storage: The main idea is that instead of storing the data row by row, as is customary in relational databases, the data is stored column by column. The most important benefits of such column-wise storage are fast projection queries and efficient column-specific compression and encoding techniques. Popular open-source columnar formats include ORC and Parquet <ref type="bibr" target="#b27">[28]</ref>. • data partitioning: The idea is simple and yet powerful.</p><p>The data is partitioned horizontally based on some of the input columns. Then, the files related to each partition are stored in a different directory in the storage system (see an example below). Let us demonstrate how these techniques look in a typical real-world scenario. 
Consider a large enterprise company that receives a high volume of events (e.g. from IoT sensors) and stores them in a data lake. The data lake would look like a collection of files (Fig. <ref type="figure" target="#fig_2">2</ref>) stored in some distributed storage system. In our example, the data lake is partitioned by the year and month columns, so, for example, events from March 2020 are located under the path datalake/year:2020/month:03/. Events are stored in the columnar Parquet format, so queries on specific columns scan only the relevant chunks of data. When a query engine executes a query, it scans all the relevant files in parallel and returns the result to the client.</p><p>Data lakes and query engines are optimized for analytic queries, and a typical query calculates aggregate values (e.g. sum, max, min) over a specific partition. For such queries, all the files under the specified partition must be scanned.</p><p>However, sometimes users may be merely interested in finding a specific record in the data lake (a.k.a. a "needle in a haystack" query). Consider, for example, a scenario where a specific event should be fetched from the lake based on its unique identifier for troubleshooting or business flow analysis. Another important use case is managing General Data Protection Regulation (GDPR) compliance <ref type="bibr" target="#b44">[45]</ref>, where records related to a specific user should be found in the data lake upon user request. Unfortunately, even when the requested information resides in a single file, query engines will perform a scan of all the files in the data lake (possibly millions). Obviously, this behavior results in long query latency and an expensive computation cost.</p><p>In this paper, we present our approach for optimizing "needle in a haystack" queries in cloud data lakes. The main idea is to build an index structure that maps indexed column values to the files that contain those values. 
We assume enormous data volumes in cloud data lakes, and hence, we need our index to be highly scalable in both the compute and the storage sense. For scalable compute, we utilize the parallel processing paradigm <ref type="bibr" target="#b24">[25,</ref><ref type="bibr" target="#b46">47]</ref>; for scalable storage, we design a new storage format that allows unlimited scaling while minimizing the number of network operations during reads.</p><p>Our main contributions are as follows:</p><p>• Development of a novel index structure that speeds up random queries in cloud data lakes. • A prototype implementation of our solution and its experimental evaluation in a real-world cloud data lake. The rest of the paper is structured as follows. In Section 2, we formally define the problem. In Section 3, we describe the proposed solution. In Section 4, we provide a performance analysis of our solution. In Section 5, we introduce our proof-of-concept implementation and provide an experimental evaluation thereof. In Section 6, we review related work. We conclude in Section 7.</p></div>
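To make the "needle in a haystack" problem concrete, here is a minimal single-machine sketch (not the paper's system) that models a partitioned data lake as an in-memory map from file paths to records. Without an index, a point lookup must open every file, even though the match lives in just one. All paths and records below are illustrative.

```python
# Minimal sketch: a partitioned data lake as {path: [records]}, and a naive
# point lookup that has to read every file, relevant or not.

def full_scan(datalake, column, value):
    """Scan every file in the lake; return (matching_records, files_read)."""
    matches, files_read = [], 0
    for path, records in datalake.items():
        files_read += 1  # every file is opened, regardless of its contents
        matches += [r for r in records if r.get(column) == value]
    return matches, files_read

datalake = {
    "datalake/year:2020/month:03/part-0.parquet": [{"event-id": 207}],
    "datalake/year:2020/month:03/part-1.parquet": [{"event-id": 350}],
    "datalake/year:2020/month:04/part-0.parquet": [{"event-id": 418}],
}

matches, files_read = full_scan(datalake, "event-id", 418)
# The single matching record lives in one file, yet all three files were read.
```

An index that maps each indexed value to its files would let the lookup read exactly one file here, which is the idea developed in the rest of the paper.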
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">PROBLEM STATEMENT</head><p>We model a data lake as a set of tuples 𝐷 = {&lt; 𝑐 1 : 𝑣 1 , 𝑐 2 : 𝑣 2 , ..., 𝑐 𝑛 : 𝑣 𝑛 &gt;| ∀1 ≤ 𝑖 ≤ 𝑛, 𝑐 𝑖 ∈ 𝐶, 𝑣 𝑖 ∈ 𝑉 } where 𝐶 is the set of column names and 𝑉 is a set of column values (see notation in Table <ref type="table" target="#tab_4">5</ref>). Data lake 𝐷 is stored in a distributed storage system as a collection of 𝑁 files, denoted by 𝐹 = {𝑓 1 , 𝑓 2 , ..., 𝑓 𝑁 }.</p><p>Query engines support SQL queries on data lakes. In this paper, our focus is on simple selection queries of the type: SELECT C1, C2, ... FROM DATA-LAKE WHERE SOME-COLUMN = SOME-VALUE Files that contain tuples that satisfy query 𝑄 are denoted by 𝐹 (𝑄). As already mentioned in the previous section, current query engines scan all the files in 𝐹 regardless of 𝑄. If we knew 𝐹 (𝑄) in advance, then for a given 𝑄 we would read only the relevant files and so would be able to significantly improve query performance. Thus, the problem we are trying to solve in this paper can be defined as: "construct a function 𝑔𝑒𝑡𝐹𝑖𝑙𝑒𝑠 (𝑄) that receives a query 𝑄 and returns 𝐹 (𝑄), the list of files that contain tuples that satisfy 𝑄".</p><p>Figure <ref type="figure">3</ref>: System model Fig. <ref type="figure">3</ref> presents a high-level diagram of the considered system model.</p><p>(1) A client submits a query to the query engine. (2) The query engine queries the index to get the relevant file names. (3) The query engine reads the relevant files from the data lake. (4) The query engine computes the query result and returns it to the client. An obvious approach to implement 𝑔𝑒𝑡𝐹𝑖𝑙𝑒𝑠 (𝑄) is to create a simple inverted index <ref type="bibr" target="#b39">[40]</ref> that stores a mapping between the values of relevant columns and the files that contain these values (Table <ref type="table" target="#tab_0">1</ref>). An open question is what the best way to do that in a big data environment is. 
More formally, we are looking for a solution that will satisfy the following constraints:</p><p>(1) scalability: we want our index to be able to handle very large and growing amounts of data (2) performance: we need our index to be able to respond to lookup queries in a reasonable time (optimally sub-seconds) (3) cost: we want to minimize the monetary cost of our solution as much as possible The trivial solution would be to store our index in a distributed NoSQL database (e.g. Cassandra <ref type="bibr" target="#b33">[34]</ref>, DynamoDB <ref type="bibr" target="#b10">[11]</ref>) and update it with parallel compute engines (e.g. Spark, MapReduce). However, while this approach may satisfy the first two of our requirements (scalability and performance), its monetary cost is far from optimal. Let us show why by using some real numbers.</p><p>We have two main alternatives for using a database service in the cloud environment: IaaS and DaaS. (1) IaaS: the "Infrastructure-as-a-Service" model provides users with the requested number of virtual machines where they can install a DBMS and use it as they wish (e.g. Cassandra on top of EC2). (2) DaaS: the "Database-as-a-Service" model provides users with a database API without the need for managing hardware or software (e.g. DynamoDB).</p><p>In Table <ref type="table" target="#tab_1">2</ref> 1 , we compare the monetary cost of storing data in each one of these alternatives against storing it in a cloud object store (similarly to how the data is stored in cloud data lakes). It can clearly be seen that an object store is at least one order of magnitude lower-priced than any other option. Since our focus is on real-world cloud data lakes (i.e. petabyte scale and beyond), we need to support indexes of very large volumes as well (an index on a unique column of data lake 𝐷 will need 𝑂 (|𝐷 |) storage). 
Thus, the potential monetary savings of implementing indexing using object stores, instead of the naive solution based on key-value stores, can be tremendous.</p><p>This observation is not new, and object stores' superiority in cost is a well-known fact in both industry <ref type="bibr" target="#b23">[24]</ref> and the research community <ref type="bibr" target="#b43">[44]</ref>. However, object stores are not suitable for every case. More specifically, when comparing them to key-value stores, they have the following caveats:</p><p>(1) higher latency: the actual numbers may vary, but usually object stores have higher latency than a DBMS <ref type="bibr" target="#b43">[44]</ref> (2) limited request rate: while the request rate is nearly unlimited in scalable databases, object stores impose an upper limit (typically around thousands of requests per second <ref type="bibr" target="#b37">[38]</ref>) (3) optimized for large objects: object stores are optimized for storing large objects (typically GBs <ref type="bibr" target="#b7">[8]</ref>) and hence are not suitable for systems that need random reads (4) no random writes: the only way to mutate objects in object stores is to replace them, which may be problematic for systems that need random writes While the first two points ("higher latency" and "limited request rate") are usually acceptable in OLAP systems, the last two ("large objects" and "no random writes") introduce significant implementation challenges. 
In our solution, we address both of these challenges.</p><p>To summarize, our research goal in this work is twofold:</p><p>• to implement 𝑔𝑒𝑡𝐹𝑖𝑙𝑒𝑠 (𝑄) and thereby improve random read performance in cloud data lakes • to build our solution on top of object stores so that the additional monetary cost imposed by our scheme is as low as possible 1 For simplicity, we ignore some factors that cannot change the general trend, like the maintenance cost of Cassandra, the cost of requests in S3 and DynamoDB, and the fact that the data in object stores can be compressed. Our goal is only to show that object stores are significantly cheaper than the other options. We also provide numbers for specific systems (Cassandra, DynamoDB, AWS), but the same cost trend is preserved in other similar systems as well.  </p></div>
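The inverted index of Table 1 can be pictured as nothing more than a map from (column, value) pairs to file-name lists. The toy sketch below (plain Python; all values and file names are made up) shows the intended behavior of 𝑔𝑒𝑡𝐹𝑖𝑙𝑒𝑠 (𝑄); the rest of the paper is about realizing this map at big-data scale on cheap object storage.

```python
# A toy in-memory inverted index implementing getFiles(Q) for queries of the
# form `WHERE column = value`. Keys and file names are illustrative.

index = {
    ("event-id", 207): ["file1", "file7"],
    ("event-id", 350): ["file2"],
    ("event-id", 418): ["file2", "file9"],
}

def get_files(column, value):
    """Return F(Q): the files containing tuples with column == value."""
    return index.get((column, value), [])

# A query engine would then scan only get_files("event-id", 418)
# instead of every file in the lake.
```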
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">OUR APPROACH</head><p>In Sections 3.1, 3.2, and 3.3 below, we explain how exactly we store, compute, and use our index. In Section 3.4, we discuss potential optimizations to our basic scheme.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Index Storage</head><p>We store our index as a collection of files partitioned by the indexed column name (Fig. <ref type="figure" target="#fig_4">4</ref>). Each index file is responsible for some range of column values and stores the relevant information about this range in the format depicted in Table <ref type="table" target="#tab_2">3</ref>.</p><p>There are two sections in the index file: data and metadata. The data part consists of a map between a column value and a list of corresponding file names, ordered by column value. The map is divided into 𝐾 chunks based on the predefined number of values per chunk 𝑀, such that each of the 𝐾 chunks contains a mapping for 𝑀 consecutive values of the particular indexed column. For example, in the index presented in Table <ref type="table" target="#tab_2">3</ref>, we have 𝑀 = 3, and the first chunk contains a mapping between 3 column values (207, 350, 418) and the file names that contain tuples with these values.</p><p>Metadata stores information about the location of the data chunks inside the index file along with their "border" values. Considering the example in Table <ref type="table" target="#tab_2">3</ref> again, from the metadata section we learn that the first chunk's minimum value is 207, its maximum value is 418, and its location in the file is the offset range 0:2048.</p><p>For very big data lakes, we will have many index files and will start experiencing the very problem we are trying to solve in data lakes (many irrelevant files are accessed). We solve this issue by managing an index of indexes (i.e. a root index): a single file that stores a mapping between each index file name and its border values (Table <ref type="table" target="#tab_3">4</ref>). In Section 3.3 below, we explain how the root index and metadata sections are used by the index users.</p></div>
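The index-file layout above can be sketched as a simplified in-memory model, assuming 𝑀 = 3 and using chunk ids in place of byte offsets; the values follow the Table 3 example:

```python
# Sketch of the index-file layout of Section 3.1: the data part is a list of
# K chunks (each a sorted map of M column values -> file-name lists), and the
# metadata part records each chunk's border values and location. Chunk ids
# stand in for the real byte offsets for brevity.

M = 3  # predefined number of values per chunk

def build_index_file(value_to_files):
    """Split a {value: [files]} map into chunks and derive their metadata."""
    items = sorted(value_to_files.items())
    chunks = [dict(items[i:i + M]) for i in range(0, len(items), M)]
    metadata = [
        {"min": min(c), "max": max(c), "chunk-id": k}
        for k, c in enumerate(chunks)
    ]
    return {"data": chunks, "metadata": metadata}

idx = build_index_file({207: ["f1"], 350: ["f2"], 418: ["f2", "f9"],
                        450: ["f3"], 645: ["f4"]})
# Two chunks: values (207, 350, 418) and (450, 645), as in the Table 3 example.
```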
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Index Computation</head><p>Since data lakes store enormous volumes of data, our index will be very large as well. To compute such an index in a reasonable time, we would like to compute it in parallel using common parallel compute engines like MapReduce <ref type="bibr" target="#b24">[25]</ref> and Spark <ref type="bibr" target="#b46">[47]</ref>.</p><p>Fortunately, we can easily do that with a simple map-reduce flow. Pseudo-code (in Spark-like syntax) of the algorithm for creating an index on a specific column in a given data lake is presented in Algorithm 1. In the map phase, each worker emits (col-value, file-name) pairs, and in the reduce phase, the map results are grouped into a (col-value, List&lt;file-name&gt;) map. Then, the result is written into the distributed storage in our custom format (presented in Table <ref type="table" target="#tab_2">3</ref> and denoted in the pseudo-code as my.index). Finally, we create the root index file by scanning the metadata sections of all the created index files and taking the minimum and maximum values from each file.</p><p>Algorithm 1 creates an index for an existing data lake (i.e. the static scenario). In the dynamic scenario, batches of new files are added to the lake periodically during scheduled Extract-Transform-Load (ETL) processes. To support the dynamic scenario, we should be able to update our index in accordance with the new data.</p><p>Our algorithm for index update, presented in Spark-like syntax in Algorithm 2, runs in parallel and tries to modify as few existing index files as possible. It works as follows:</p><p>• First (line 2), we read the root index to get "ranges", a set of tuples (min, max, index-file-name) for a given column (the result may look similar to Table <ref type="table" target="#tab_3">4</ref>). 
• Then (lines 3-4), we use the broadcast join technique <ref type="bibr" target="#b15">[16]</ref> to compute "labeled-delta": values from the delta files labeled with the corresponding "index-file-name" from the "ranges" set. Considering the example in Tables <ref type="table" target="#tab_3">1-4</ref>, the result would be a set of triples (event-id, data-file-name, index-file-name); for example, (450, file1111, index13), (645, file1112, index23). • Finally (lines 5-8), we use the index file names computed in "labeled-delta" to identify the index files that should be changed. We read these files and combine their data with the delta files. Then, we apply Algorithm 1 to this data and overwrite the relevant existing index files and the root index. • Note that we need to re-partition our data by "index-file-name" (line 7) to ensure that different index files are not mixed and that each index file is computed by a different process.</p></div>
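For illustration, the map-reduce flow of Algorithm 1 can be imitated on a single machine. The sketch below is plain Python standing in for the parallel Spark job: the grouping step mirrors the groupBy/collect of the pseudo-code, and all names and parameters are illustrative.

```python
# Single-machine stand-in for Algorithm 1. map: emit (col-value, file-name)
# pairs; reduce: group them into (col-value, [file-names]); finally derive
# the root index from each index file's minimum and maximum values.
from collections import defaultdict

def create_index(datalake, col_name, values_per_file):
    # "map" phase: one (value, file) pair per tuple
    pairs = [(rec[col_name], fname)
             for fname, recs in datalake.items() for rec in recs]
    # "reduce" phase: group file names per column value
    grouped = defaultdict(list)
    for value, fname in pairs:
        grouped[value].append(fname)
    # split the sorted map into index files of `values_per_file` values each
    items = sorted(grouped.items())
    index_files = {}
    for j, i in enumerate(range(0, len(items), values_per_file)):
        index_files[f"index{j}"] = dict(items[i:i + values_per_file])
    # root index: the border values of every index file
    root = {name: (min(f), max(f)) for name, f in index_files.items()}
    return index_files, root
```

In the real system each index file would additionally be split into chunks with a metadata section (Table 3); that detail is elided here.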
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Index Usage</head><p>In this section, we explain how to use our index format to implement 𝑔𝑒𝑡𝐹𝑖𝑙𝑒𝑠 (𝑄) (pseudo-code is presented in Algorithm 3).</p><p>Upon receiving a "needle in a haystack" query 𝑄 and a path to the index location, we first extract the column name and the column value from the "where" clause of the query (line 2). We then read the root index and find to which index file the given value belongs (lines 3-4). Once we know which index file has information about our value, we perform the following:</p><p>• we read the metadata part of the file • we find to which data chunk our value belongs based on the min/max values • once we know which chunk our value belongs to, we read the (column value, List&lt;file name&gt;) map from the specified offset • if our value appears in the map, we are done and the method returns the list of file names • otherwise, we know that the requested value is not in the index, so we exit with a relevant flag Note that Algorithm 3 does not require parallel processing and can run as a standalone function.</p><p>Algorithm 1 Create Index
1: procedure CreateIndex(datalake-path, index-path, col-name)
2:   datalake = read(datalake-path)
3:   index = datalake.groupBy(col-name).agg(collect("f-name"))
…
end if
14: end procedure</p></div>
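The lookup path of Algorithm 3 can be sketched as follows, assuming index files shaped as in Table 3: a "metadata" list of per-chunk border values and a "data" list of value-to-file-names chunks, with chunk ids standing in for byte offsets. Each step corresponds to one remote read (root index, metadata, data chunk); `read_index_file` is a hypothetical stand-in for an object-store GET.

```python
# A sketch of Algorithm 3: three logical reads locate the file list for a
# value, or report that the value is absent. All structures are illustrative.

def get_files(root_index, read_index_file, value):
    # read 1: root index -> which index file covers `value`
    target = next((name for name, (lo, hi) in root_index.items()
                   if lo <= value <= hi), None)
    if target is None:
        return None                       # value outside every border range
    index_file = read_index_file(target)  # stands in for an object-store GET
    # read 2: metadata -> which data chunk covers `value`
    meta = next((m for m in index_file["metadata"]
                 if m["min"] <= value <= m["max"]), None)
    if meta is None:
        return None
    # read 3: the data chunk itself (a ranged GET at the recorded offset)
    chunk = index_file["data"][meta["chunk-id"]]
    return chunk.get(value)               # None is the "not found" flag
```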
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Optimizations</head><p>In this section, we briefly discuss several potential optimizations of our basic scheme. We plan to explore some of them further in our future work.</p><p>• Bloom filter: In Algorithm 3, we perform two reads from the index file that may contain our queried value, even if the value does not exist in the data lake at all. A possible solution to this issue is adding to the metadata section a Bloom filter <ref type="bibr" target="#b16">[17]</ref> containing all the column values in the index file, and checking it before querying the index data chunk. This way, in most cases, we will perform a single read operation for a non-existing value but will have a larger metadata section in each index file. • Caching: By caching the root index and metadata sections on the query engine side, we can significantly reduce the number of network operations per query. • Compression/Encoding: We can apply different encoding and compression techniques to reduce the storage overhead of our index. For example, instead of storing file name lists as strings, we can use compressed bit vectors <ref type="bibr" target="#b20">[21]</ref>; metadata offsets can be encoded with delta encoding <ref type="bibr" target="#b34">[35]</ref>.</p></div>
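The Bloom filter optimization can be sketched as follows: a minimal filter over a Python integer bitmap, checked before paying a read for the data chunk. The size m and hash count k are illustrative, not tuned to the data lake's value counts.

```python
# A minimal Bloom filter of the kind Section 3.4 proposes adding to each
# index file's metadata: may_contain() never gives a false negative, so a
# "no" answer lets the reader skip the data-chunk read entirely.
import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, value):
        # k independent bit positions derived from salted SHA-256 digests
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, value):
        for p in self._positions(value):
            self.bits |= 1 << p

    def may_contain(self, value):
        # True means "possibly present"; False means "definitely absent"
        return all(self.bits >> p & 1 for p in self._positions(value))
```

A filter like this grows the metadata section but, for queries on non-existing values, usually replaces two index reads with one.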
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">PERFORMANCE ANALYSIS</head><p>In this section, we analyze our solution from a performance perspective. We first show how our index improves query performance, and then we quantify the overhead it imposes. The cost of a query 𝑄 is dominated by the cost of read operations from remote storage. When a columnar format is used, irrelevant columns can be skipped, but at least some data must be read from each file. Thus, using the notation from Table <ref type="table" target="#tab_4">5</ref>, we can define the cost of query 𝑄 as:</p><formula xml:id="formula_0">𝐶 𝑄 = 𝐶 𝑅 • |𝐹 | (1)</formula><p>When using our index, we read only the relevant data files 𝐹 (𝑄) and perform at most three read operations from the index files (for the root index, the metadata, and the data chunk). Hence, the cost is reduced to:</p><formula xml:id="formula_1">𝐶 𝑄 = 𝐶 𝑅 • (|𝐹 (𝑄)| + 3) (2)</formula><p>Thus, the performance is improved by:</p><formula xml:id="formula_2">|𝐹 | |𝐹 (𝑄)| + 3 (3)</formula><p>Since our focus is on "needle in a haystack" queries, we assume that |𝐹 (𝑄)| is a small number (typically below 100), while |𝐹 | in real-world data lakes can reach millions <ref type="bibr" target="#b41">[42]</ref>. Thus, we can expect a reduction of the query cost by an order of magnitude.</p><p>Our solution imposes overhead in the following two dimensions:</p><p>• Storage: The additional storage required for our index has an upper limit of |𝐷 | (in the case that all columns are indexed).</p><p>In a more realistic scenario, when only a few of the data lake columns are indexed, the storage overhead is expected to be much lower than |𝐷 |. For example, in our experimental evaluation, an index on a unique column requires around 5% of the data lake size (exact numbers are presented in Section 5.1). • CreateIndex/UpdateIndex: Algorithms 1 and 2 are supposed to run offline without affecting query performance. 
Their cost is dominated by the cost of reading the relevant data lake files, creating the index by a map-reduce flow, and writing the index into distributed storage. In summary, our solution imposes a reasonable storage and (offline) computation overhead but improves query performance by an order of magnitude (equations 1-3).</p></div>
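As a concrete instance of equation (3), we can plug in the data lake used in Section 5 (1,242 files) and a unique-column needle query that touches a single file:

```python
# Expected read-cost reduction from equation (3): |F| / (|F(Q)| + 3),
# using the evaluation numbers of Section 5.
F = 1242    # files in the evaluated data lake
F_Q = 1     # a needle-in-a-haystack query on a unique column touches one file
speedup = F / (F_Q + 3)
# = 310.5: several hundred times fewer read operations than a full scan.
```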
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">IMPLEMENTATION AND EXPERIMENTAL RESULTS</head><p>We have implemented a prototype of our solution (Algorithms 1 and 3); the implementation and extended results are available online <ref type="bibr" target="#b45">[46]</ref>. For the experimental evaluation, we cooperated with a large enterprise company that manages cloud data lakes at petabyte scale. We used an anonymized subset of the real data lake for our experiments.</p><p>The data lake we used is stored in AWS S3 cloud storage. Its properties are as follows:</p><p>• 1,242 files in Parquet format (compressed with Snappy)</p><p>• 605,539,843 total records</p><p>• each record has a unique identifier (8 bytes)</p><p>• 233 columns (mainly strings but also numbers and dates)</p><p>• 54.2 GB (total compressed size)</p><p>Our evaluation focused on the following stages:</p><p>(1) Index Creation: We created an index according to our approach (Algorithm 1) and according to a naive approach based on a key-value store. Then, we compared the index creation time and the index storage size in each of the approaches. (2) Index Usage: We queried random values from the data lake in different ways (with and without an index) and compared query execution times across the different modes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Index Creation Evaluation</head><p>We built an index on two data lake columns:</p><p>• record-id (Long): the unique identifier of each record in the data lake (distinct count: 605,539,843) • event-id (String): the unique identifier of each event in the data lake (distinct count: 99,765,289); each event-id is mapped to 6 record-ids on average. For index creation, we used Apache Spark (2.4.4) running on AWS EMR (5.29.0). The EMR cluster had 15 nodes of type r5d.2xlarge (8 cores, 64GB memory, 300GB SSD storage). Our index was stored in AWS S3 as a collection of Parquet files compressed with Snappy and with a chunk size of 10 MB; the root index was stored in a single line-delimited JSON file. The key-value index was stored in Cassandra. It is worth noting that even though our primary reason to implement the data lake index on top of object stores was to reduce the monetary cost, our experiments revealed an additional benefit of this decision. The index creation time of our approach is an order of magnitude faster than that of the key-value-store-based approach. In addition, in our approach, the storage size of the index is much smaller for a unique column and more or less the same for a non-unique column.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Index Usage Evaluation</head><p>For the index usage evaluation, we performed the following test for the indexed columns (record-id, event-id):</p><p>(1) get 10 random column values from the data lake (2) for each column-value, do (a) run "select * from data-lake where column = column-value" (b) print the query execution time</p><p>We executed this test in the following modes and compared their results.</p><p>(1) SparkSQL with our index (Algorithm 3) (2) SparkSQL with a Cassandra index (3) SparkSQL without an index (4) AWS Athena <ref type="bibr" target="#b2">[3]</ref> without an index SparkSQL was executed on the same cluster as in the previous section (AWS EMR with 15 nodes). AWS Athena (Amazon's serverless query engine based on Presto <ref type="bibr" target="#b41">[42]</ref>) ran on an AWS Glue <ref type="bibr" target="#b6">[7]</ref> table created on top of the data lake Parquet files. The average results of the 10 runs for each one of the modes are presented in Fig. <ref type="figure">7</ref>.</p><p>The experimental results confirm our performance analysis in Section 4, as it can clearly be seen that our solution (as well as the Cassandra-based one) outperforms existing query engines (Spark and Athena) by an order of magnitude.</p><p>To better understand the trade-offs between our solution and the Cassandra-based one, we also compared their execution times for computing 𝑔𝑒𝑡𝐹𝑖𝑙𝑒𝑠 (𝑄). The results are presented in Fig. <ref type="figure">8</ref>. We evaluated two versions of our approach, with and without caching of the root index (see Section 3.4). Cassandra's latency is lower than that of our approach, but when using a cache, the difference is not significant (less than a second).</p><p>To summarize, as expected, adding indexing to data lakes drastically improves query performance. 
In this evaluation, we have demonstrated that the index can be implemented in two different ways: a trivial approach based on key-value stores and our approach based on cloud object stores. Both approaches improve query execution time in an almost identical way (the difference is at millisecond resolution). However, our approach has the following advantages over the naive solution:</p><p>(1) an order of magnitude monetary cost improvement (Table <ref type="table" target="#tab_1">2</ref>) (2) an order of magnitude index creation time improvement (Fig. <ref type="figure" target="#fig_6">5</ref>) (3) significantly less storage overhead in some cases (Fig. <ref type="figure" target="#fig_7">6</ref>)  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">RELATED WORK</head><p>Data lakes are a relatively new approach for storing and analyzing large amounts of data. As a new technology, it still lacks a formal definition, and in the existing literature, different authors assign different (and sometimes even contradictory) meanings to this concept <ref type="bibr" target="#b28">[29,</ref><ref type="bibr" target="#b40">41]</ref>. In this paper, we stick to the definition from the recent Seattle Report on Database Research <ref type="bibr" target="#b1">[2]</ref>, where the main characteristic of the data lakes architecture is that it "disaggregates compute and storage, so they can scale independently."</p><p>There are several research challenges in the data lakes domain. The most fundamental ones are knowledge discovery <ref type="bibr" target="#b29">[30]</ref> and various optimizations <ref type="bibr" target="#b1">[2]</ref>. We focus on the latter and study the optimization of a specific scenario in data lake systems: "needle in a haystack" queries. Our solution can be seen as a combination of different techniques from the database and big data domains. Thus, we use partitions as in <ref type="bibr" target="#b17">[18]</ref> to organize our index according to columns. For scalability, we use a root index as in B+ trees <ref type="bibr" target="#b39">[40]</ref> and some distributed databases (e.g. <ref type="bibr" target="#b21">[22]</ref>). The format of our index files is similar to columnar formats <ref type="bibr" target="#b27">[28]</ref>, with multiple chunks of data and a single metadata section. We use Bloom filters <ref type="bibr" target="#b16">[17]</ref> to improve the performance of queries on non-existing values. 
The parallel processing paradigm <ref type="bibr" target="#b24">[25,</ref><ref type="bibr" target="#b46">47]</ref> is used for index creation and updating.</p><p>At a high level, our solution resembles the well-known inverted index structure, which is extensively used in full-text search engines <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b19">20,</ref><ref type="bibr" target="#b39">40,</ref><ref type="bibr" target="#b47">48]</ref>. An inverted index maps words to the lists of documents that contain them, and for data sets of moderate size, it can be implemented with standard index structures such as a B+ tree or a hash index <ref type="bibr" target="#b39">[40]</ref>. For "big data" volumes, distributed inverted indexes were proposed <ref type="bibr" target="#b47">[48]</ref>. The two main techniques used in distributed inverted indexing are document-based <ref type="bibr" target="#b13">[14]</ref> and term-based <ref type="bibr" target="#b19">[20]</ref> partitioning. Both techniques assume combined compute and storage layers, while in our system model the main assumption is that compute and storage are separated.</p><p>In <ref type="bibr" target="#b26">[27]</ref>, the authors study inverted indexes in dataspaces ("collections of heterogeneous and partially unstructured data"). Their main focus is on the heterogeneity of data, and very large indexes are not considered (the index implementation is based on a single-node Lucene engine <ref type="bibr" target="#b14">[15]</ref>); in our solution, the focus is on structured data, yet we expect very large volumes of both data and indexes. In <ref type="bibr" target="#b25">[26]</ref>, indexing in big data is achieved by enhancing the MapReduce load process via user-defined functions (UDF). However, this solution is limited to a particular system (Hadoop), where the compute (MapReduce) and storage (HDFS) components run on the same machines; our focus is on a disaggregated architecture. 
A system model similar to ours is considered in <ref type="bibr" target="#b18">[19]</ref>, as well as in multiple industry approaches (e.g. <ref type="bibr" target="#b9">[10]</ref>); however, in these approaches the index is stored in a key-value database, which, as we show in this work, has two severe drawbacks compared to our solution: significantly higher monetary cost and much longer load time.</p><p>To the best of our knowledge, our work is the first research attempt to improve query performance in cloud data lakes by building indexing on top of object stores. However, two industry projects focus on the same domain: Microsoft Hyperspace <ref type="bibr" target="#b38">[39]</ref> and Databricks Delta Lake <ref type="bibr" target="#b22">[23]</ref>. Both assume a data lake architecture and try to improve query performance by using indexes stored in object stores. Below we briefly discuss how these projects differ from our approach, based on the available online information about these systems <ref type="bibr" target="#b22">[23,</ref><ref type="bibr" target="#b38">39]</ref>.</p><p>• Hyperspace currently uses only "covering indexes" and plans to support other index types in the future. Covering indexes are built upon user-provided column lists ("indexed columns" and "data columns"), and aim to improve the performance of queries that search by "indexed columns" and retrieve "data columns". The implementation is based on copying both "indexed" and "data" columns from the data lake and storing them in a columnar format sorted by the "indexed" columns. While covering indexes can improve some types of queries, they are not suitable for "needle in a haystack" queries, where retrieval of all the columns should be supported (e.g. for the GDPR scenario). • Delta Lake uses Bloom filter indexes in the following way. Each data file has an associated file containing a Bloom filter over the values of the indexed columns in that file. 
Then, upon a user query, the Bloom filters are checked before reading the files, and a file is read only if the value was found in its Bloom filter. We use a similar technique (Section 3.4), but in our case, only a single Bloom filter is read for a particular value, while in Delta Lake all existing Bloom filters must be read.</p><p>In recent years, cloud data lakes have become a prevalent way of storing large amounts of data. The main implication is the separation between the storage and compute layers and, as a result, a rise of new technologies in both the compute and storage domains. Query engines, the main players in the compute domain, are designed to run on in-place data that is usually stored in a columnar format. Their main focus is on analytical queries, while random reads are overlooked.</p><p>In this paper, we present our approach for optimizing needle in a haystack queries in cloud data lakes. The main idea is to maintain an index structure that maps indexed column values to the files that contain them. We built our solution in accordance with the data lake architecture, where storage is completely separated from compute. Our decision to build the index on top of object stores was motivated by the observation that the monetary cost of this solution is significantly lower than that of a DBMS-based one. Our experimental evaluation provides an additional reason to favor object stores over the DBMS approach: much lower index computation time.</p><p>For future work, we are planning to extend our index scheme to support more complex query types (e.g. joins, group by). An additional research direction may be the implementation of different systems according to the data lake architecture, thereby improving their monetary cost and load time. 
For example, implementing a key-value store on top of object stores seems to be a promising research direction that can build on the ideas presented in this work.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Data Lakes Architecture</figDesc><graphic coords="1,61.74,537.17,215.99,123.79" type="bitmap" /></figure>
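The single-Bloom-filter lookup path described above (root-index locates one index file per value, and only that file's Bloom filter is consulted) can be sketched as follows. This is an illustrative Python sketch, not the paper's implementation; the names `BloomFilter`, `root_index`, and `lookup`, and the sample ranges (taken from the Table 4 example), are assumptions.

```python
# Illustrative sketch: root-index + per-index-file Bloom filter lookup.
# All names and data here are hypothetical, for exposition only.
import hashlib

class BloomFilter:
    """A minimal Bloom filter: k hash positions over a fixed bit array."""
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive k positions from salted SHA-256 digests of the item.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = True

    def might_contain(self, item):
        # False => definitely absent; True => possibly present.
        return all(self.bits[p] for p in self._positions(item))

# Root-index: (min, max, index-file) ranges per column, as in the
# Table 4 example for the event-id column.
root_index = [(207, 560, "index13"), (590, 897, "index23")]

# One Bloom filter per index file (stored alongside the file's metadata).
bloom_filters = {"index13": BloomFilter(), "index23": BloomFilter()}
bloom_filters["index13"].add(350)

def lookup(value):
    """Consult the single Bloom filter whose range covers the value,
    instead of reading every filter as in the Delta Lake approach."""
    for lo, hi, index_file in root_index:
        if lo <= value <= hi:
            return bloom_filters[index_file].might_contain(value)
    return False  # no range covers the value: it cannot exist
```

For a query on a non-existing value, at most one Bloom filter read is needed before the search can be abandoned, which is the cost advantage the comparison with Delta Lake points to.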
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Data lake layout in storage system</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Index storage layout example</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head></head><label></label><figDesc>4: index-sorted = index.orderBy(col-name) 5: index-sorted.write.format("my.index").save(index-path) 6: create-root-index(index-path) 7: end procedure Algorithm 2 Update Index 1: procedure UpdateIndex(path-to-new-files, index-path, col-name) 2: ranges = getIndexRanges(index-path, col-name) 3: delta = read(path-to-new-files).select(col-name, "f-name") 4: labeled-delta = delta.join(broadcast(ranges)) 5: old-index = read(labeled-delta.select(index-file-name)) 6: new-index = labeled-delta.union(old-index) 7: new-index.repartition(index-file-name) 8: run CreateIndex on new-index and override old-index files and the root-index 9: end procedure Algorithm 3 Get Files 1: procedure GetFiles(Q, index-path) 2: col-name, col-value = extract-col-values(Q) 3: root-index = readRootIndex(index-path) 4: index-file = root-index.getFile(col-name, col-value)</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 5</head><label>5</label><figDesc>Figure 5: Index Creation Time</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: Index Creation Storage</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_9"><head>Figure 7 :Figure 8 :</head><label>78</label><figDesc>Figure 7: Data Lake Queries Latency</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Index content example 𝑄 is to scan all 𝐹 files in parallel, which is wasteful as only 𝐹 (𝑄) files should be scanned and in many cases |𝐹 | &gt;&gt; |𝐹 (𝑄)|. If we knew 𝐹 (𝑄)</figDesc><table><row><cell cols="3">Column Name Column Value File Names</cell></row><row><cell>event-id</cell><cell>207</cell><cell>file051, file033</cell></row><row><cell>event-id</cell><cell>350</cell><cell>file002</cell></row><row><cell>event-id</cell><cell>418</cell><cell>file170, file034</cell></row><row><cell>...</cell><cell>...</cell><cell>...</cell></row><row><cell>client-ip</cell><cell>192.168.1.15</cell><cell>file170</cell></row><row><cell>client-ip</cell><cell>172.16.254.1</cell><cell>file883, file051</cell></row><row><cell cols="2">approach to perform query</cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 :</head><label>2</label><figDesc>Storage cost comparison</figDesc><table><row><cell></cell><cell>cost of 1TB/year</cell><cell>pricing info references</cell></row><row><cell>Cassandra on EC2</cell><cell>16,164 $</cell><cell>[5, 9]</cell></row><row><cell>DynamoDB</cell><cell>2,925 $</cell><cell>[4]</cell></row><row><cell>S3</cell><cell></cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3 :</head><label>3</label><figDesc>Index file example</figDesc><table><row><cell>Data</cell><cell cols="2">207 file051, file033</cell></row><row><cell></cell><cell cols="2">350 file002</cell></row><row><cell></cell><cell cols="2">418 file170, file034</cell></row><row><cell></cell><cell cols="2">513 file0002</cell></row><row><cell></cell><cell cols="2">680 file443, file001, file033</cell></row><row><cell></cell><cell cols="2">799 file883</cell></row><row><cell></cell><cell>...</cell><cell>...</cell></row><row><cell cols="2">Metadata 1</cell><cell>min:207|max:418|offset:0</cell></row><row><cell></cell><cell>2</cell><cell>min:513|max:799|offset:2048</cell></row><row><cell></cell><cell>...</cell><cell>...</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4 :</head><label>4</label><figDesc>Root index example</figDesc><table><row><cell cols="4">Column Name Min Max Index File Name</cell></row><row><cell>event-id</cell><cell cols="2">207 560</cell><cell>index13</cell></row><row><cell>event-id</cell><cell cols="2">590 897</cell><cell>index23</cell></row><row><cell>...</cell><cell>...</cell><cell>...</cell><cell>...</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 5 :</head><label>5</label><figDesc>Notation</figDesc><table><row><cell cols="2">Symbol Description</cell></row><row><cell>𝐷</cell><cell>A data lake</cell></row><row><cell>𝐶</cell><cell>Data lake column names</cell></row><row><cell>𝑉</cell><cell>Data lake column values</cell></row><row><cell>𝐹</cell><cell>Data lake files</cell></row><row><cell>𝑁</cell><cell>Number of files in the data lake</cell></row><row><cell>𝐾</cell><cell>Number of data chunks in index file</cell></row><row><cell>𝑀</cell><cell>Number of values in a data chunk</cell></row><row><cell>𝑄</cell><cell>A query</cell></row><row><cell>𝑅</cell><cell>Read operation from remote storage</cell></row><row><cell>𝐹 (𝑄)</cell><cell>Files that contain tuples that satisfy 𝑄</cell></row><row><cell>|𝑥 |</cell><cell>Size of object 𝑥</cell></row><row><cell>𝐶 𝑥</cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 6 :</head><label>6</label><figDesc>Index creation results</figDesc><table><row><cell>column name</cell><cell>our index (S3)</cell><cell>Cassandra index</cell></row><row><cell>record-id (pk)</cell><cell>compute time: 3 min; storage size: 1 root file - 5.5 KB, 30 index files - 3.1 GB</cell><cell>compute time: 4 hours, 23 minutes; storage size: 13.2 GB</cell></row><row><cell>event-id</cell><cell>compute time: 3 min; storage size: 1 root file - 5.3 KB, 30 index files - 5.2 GB</cell><cell>compute time: 47 min; storage size: 4.4 GB</cell></row></table><note>Cassandra index in Apache Cassandra (3.4.4) running on AWS EC2 (3 nodes of type i3.xlarge, a replication factor of 1, LZ4 compression). Experimental results of index creation are presented in Table 6 and Figures 5, 6.</note></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">SQL-on-hadoop systems: tutorial</title>
		<author>
			<persName><forename type="first">Daniel</forename><surname>Abadi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the VLDB Endowment</title>
				<meeting>the VLDB Endowment</meeting>
		<imprint>
			<date type="published" when="2015">2015. 2015</date>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="page" from="2050" to="2051" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">The Seattle Report on Database Research</title>
		<author>
			<persName><forename type="first">Daniel</forename><surname>Abadi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM SIGMOD Record</title>
		<imprint>
			<biblScope unit="volume">48</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="44" to="53" />
			<date type="published" when="2020">2020. 2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">Amazon Athena</title>
		<author>
			<persName><surname>Amazon</surname></persName>
		</author>
		<ptr target="https://aws.amazon.com/athena" />
		<imprint>
			<date type="published" when="2020-11-14">2020. November 14, 2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Amazon DynamoDB On-Demand Pricing</title>
		<author>
			<persName><surname>Amazon</surname></persName>
		</author>
		<ptr target="https://aws.amazon.com/dynamodb/pricing/on-demand/" />
		<imprint>
			<date type="published" when="2020-11-14">2020. November 14, 2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Amazon EC2 On-Demand Pricing</title>
		<author>
			<persName><surname>Amazon</surname></persName>
		</author>
		<ptr target="https://aws.amazon.com/ec2/pricing/on-demand/" />
		<imprint>
			<date type="published" when="2020-11-14">2020. November 14, 2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Amazon S3 pricing</title>
		<ptr target="https://aws.amazon.com/s3/pricing/" />
		<imprint>
			<date type="published" when="2020-11-14">2020. November 14, 2020</date>
			<publisher>Amazon</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">AWS Glue</title>
		<author>
			<persName><surname>Amazon</surname></persName>
		</author>
		<ptr target="https://aws.amazon.com/glue" />
		<imprint>
			<date type="published" when="2020-11-14">2020. November 14, 2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">Best Practices for Amazon EMR</title>
		<author>
			<persName><surname>Amazon</surname></persName>
		</author>
		<ptr target="https://d0.awsstatic.com/whitepapers/aws-amazon-emr-best-practices.pdf" />
		<imprint>
			<date type="published" when="2020-11-14">2020. November 14, 2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">Best Practices for Running Apache Cassandra on Amazon EC2</title>
		<author>
			<persName><surname>Amazon</surname></persName>
		</author>
		<ptr target="https://aws.amazon.com/blogs/big-data/best-practices-for-running-apache-cassandra-on-amazon-ec2/" />
		<imprint>
			<date type="published" when="2020-11-14">2020. November 14, 2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Building and Maintaining an Amazon S3 Metadata Index without Servers</title>
		<author>
			<persName><surname>Amazon</surname></persName>
		</author>
		<ptr target="https://aws.amazon.com/blogs/big-data/building-and-maintaining-an-amazon-s3-metadata-index-without-servers/" />
	</analytic>
	<monogr>
		<imprint>
			<date type="published" when="2020-11-14">2020. November 14, 2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<author>
			<persName><surname>Amazon</surname></persName>
		</author>
		<ptr target="https://aws.amazon.com/dynamodb/" />
		<title level="m">DynamoDB</title>
				<imprint>
			<date type="published" when="2020-11-14">2020. November 14, 2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">S3</title>
		<ptr target="https://aws.amazon.com/s3/" />
		<imprint>
			<date type="published" when="2020-11-14">2020. November 14, 2020</date>
			<publisher>Amazon</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Spark sql: Relational data processing in spark</title>
		<author>
			<persName><forename type="first">Michael</forename><surname>Armbrust</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Reynold</forename><forename type="middle">S</forename><surname>Xin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Cheng</forename><surname>Lian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yin</forename><surname>Huai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Davies</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Joseph</forename><forename type="middle">K</forename><surname>Bradley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Xiangrui</forename><surname>Meng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tomer</forename><surname>Kaftan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Michael</forename><forename type="middle">J</forename><surname>Franklin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ali</forename><surname>Ghodsi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2015 International Conference on Management of Data (SIGMOD)</title>
				<meeting>the 2015 International Conference on Management of Data (SIGMOD)</meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="1383" to="1394" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Web search for a planet: The Google cluster architecture</title>
		<author>
			<persName><forename type="first">Luiz</forename><forename type="middle">André</forename><surname>Barroso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jeffrey</forename><surname>Dean</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Urs</forename><surname>Hölzle</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE micro</title>
		<imprint>
			<biblScope unit="volume">23</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="22" to="28" />
			<date type="published" when="2003">2003. 2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Apache Lucene 4</title>
		<author>
			<persName><forename type="first">Andrzej</forename><surname>Białecki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Robert</forename><surname>Muir</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Grant</forename><surname>Ingersoll</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">SIGIR 2012 workshop on open source information retrieval</title>
				<imprint>
			<date type="published" when="2012">2012</date>
			<biblScope unit="volume">17</biblScope>
		</imprint>
	</monogr>
	<note>Apache lucene 4</note>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">A comparison of join algorithms for log processing in mapreduce</title>
		<author>
			<persName><forename type="first">Spyros</forename><surname>Blanas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jignesh</forename><forename type="middle">M</forename><surname>Patel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Vuk</forename><surname>Ercegovac</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jun</forename><surname>Rao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Eugene</forename><forename type="middle">J</forename><surname>Shekita</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yuanyuan</forename><surname>Tian</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2010 ACM SIGMOD International Conference on Management of data (SIGMOD)</title>
				<meeting>the 2010 ACM SIGMOD International Conference on Management of data (SIGMOD)</meeting>
		<imprint>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="975" to="986" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Space/time trade-offs in hash coding with allowable errors</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">H</forename><surname>Bloom</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Commun. ACM</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="issue">7</biblScope>
			<biblScope unit="page" from="422" to="426" />
			<date type="published" when="1970">1970. 1970</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Apache hive: From mapreduce to enterprise-grade big data warehousing</title>
		<author>
			<persName><forename type="first">Jesús</forename><surname>Camacho-Rodríguez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ashutosh</forename><surname>Chauhan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alan</forename><surname>Gates</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Eugene</forename><surname>Koifman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Owen</forename><surname>O'Malley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Vineet</forename><surname>Garg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Zoltan</forename><surname>Haindrich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sergey</forename><surname>Shelukhin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Prasanth</forename><surname>Jayachandran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Siddharth</forename><surname>Seth</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 International Conference on Management of Data (SIGMOD)</title>
				<meeting>the 2019 International Conference on Management of Data (SIGMOD)</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="1773" to="1786" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Web data indexing in the cloud: efficiency and cost reductions</title>
		<author>
			<persName><forename type="first">Jesús</forename><surname>Camacho-Rodríguez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dario</forename><surname>Colazzo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ioana</forename><surname>Manolescu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 16th International Conference on Extending Database Technology (EDBT)</title>
				<meeting>the 16th International Conference on Extending Database Technology (EDBT)</meeting>
		<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="41" to="52" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">A term-based inverted index partitioning model for efficient distributed query processing</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">Barla</forename><surname>Cambazoglu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Transactions on the Web (TWEB)</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="1" to="23" />
			<date type="published" when="2013">2013. 2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Better bitmap performance with roaring bitmaps</title>
		<author>
			<persName><forename type="first">Samy</forename><surname>Chambi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Software: practice and experience</title>
		<imprint>
			<biblScope unit="volume">46</biblScope>
			<biblScope unit="issue">5</biblScope>
			<biblScope unit="page" from="709" to="719" />
			<date type="published" when="2016">2016. 2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Bigtable: A distributed storage system for structured data</title>
		<author>
			<persName><forename type="first">Fay</forename><surname>Chang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Transactions on Computer Systems (TOCS)</title>
		<imprint>
			<biblScope unit="volume">26</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="1" to="26" />
			<date type="published" when="2008">2008. 2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<title level="m" type="main">Delta Lake</title>
		<author>
			<persName><surname>Databricks</surname></persName>
		</author>
		<ptr target="https://docs.databricks.com/delta" />
		<imprint>
			<date type="published" when="2020-11-14">2020. November 14, 2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<title level="m" type="main">Top 5 Reasons for Choosing S3 over HDFS</title>
		<author>
			<persName><surname>Databricks</surname></persName>
		</author>
		<ptr target="https://databricks.com/blog/2017/05/31/top-5-reasons-for-choosing-s3-over-hdfs.html" />
		<imprint>
			<date type="published" when="2020-11-14">2020. November 14, 2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">MapReduce: simplified data processing on large clusters</title>
		<author>
			<persName><forename type="first">Jeffrey</forename><surname>Dean</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sanjay</forename><surname>Ghemawat</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Commun. ACM</title>
		<imprint>
			<biblScope unit="volume">51</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="107" to="113" />
			<date type="published" when="2008">2008. 2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Hadoop++ making a yellow elephant run like a cheetah (without it even noticing)</title>
		<author>
			<persName><forename type="first">Jens</forename><surname>Dittrich</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the VLDB Endowment</title>
				<meeting>the VLDB Endowment</meeting>
		<imprint>
			<date type="published" when="2010">2010. 2010</date>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="515" to="529" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Indexing dataspaces</title>
		<author>
			<persName><forename type="first">Xin</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alon</forename><surname>Halevy</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2007 ACM SIGMOD international conference on Management of data (SIGMOD)</title>
				<meeting>the 2007 ACM SIGMOD international conference on Management of data (SIGMOD)</meeting>
		<imprint>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="43" to="54" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Sql-on-hadoop: Full circle back to shared-nothing database architectures</title>
		<author>
			<persName><forename type="first">Avrilia</forename><surname>Floratou</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the VLDB Endowment</title>
				<meeting>the VLDB Endowment</meeting>
		<imprint>
			<date type="published" when="2014">2014. 2014</date>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="page" from="1295" to="1306" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Leveraging the Data Lake: Current State and Challenges</title>
		<author>
			<persName><forename type="first">Corinna</forename><surname>Giebler</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Big Data Analytics and Knowledge Discovery (DaWaK)</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="179" to="188" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">An approach to extracting complex knowledge patterns among concepts belonging to structured, semi-structured and unstructured sources in a data lake</title>
		<author>
			<persName><forename type="first">Paolo</forename><surname>Giudice</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information Sciences</title>
		<imprint>
			<biblScope unit="volume">478</biblScope>
			<biblScope unit="page" from="606" to="626" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<monogr>
		<title level="m" type="main">Google Cloud Storage</title>
		<ptr target="https://cloud.google.com/storage" />
		<imprint>
			<date type="published" when="2020-11-14">November 14, 2020</date>
		</imprint>
		<respStmt>
			<orgName>Google</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<analytic>
		<title level="a" type="main">Apache Drill: interactive ad-hoc analysis at scale</title>
		<author>
			<persName><forename type="first">Michael</forename><surname>Hausenblas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jacques</forename><surname>Nadeau</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Big Data</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="100" to="104" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">Impala: A Modern, Open-Source SQL Engine for Hadoop</title>
		<author>
			<persName><forename type="first">Marcel</forename><surname>Kornacker</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 7th Conference on Innovative Data Systems Research (CIDR)</title>
		<meeting>the 7th Conference on Innovative Data Systems Research (CIDR)</meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="volume">1</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<analytic>
		<title level="a" type="main">Cassandra: a decentralized structured storage system</title>
		<author>
			<persName><forename type="first">Avinash</forename><surname>Lakshman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Prashant</forename><surname>Malik</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM SIGOPS Operating Systems Review</title>
		<imprint>
			<biblScope unit="volume">44</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="35" to="40" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<analytic>
		<title level="a" type="main">Decoding billions of integers per second through vectorization</title>
		<author>
			<persName><forename type="first">Daniel</forename><surname>Lemire</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Leonid</forename><surname>Boytsov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Software: Practice and Experience</title>
		<imprint>
			<biblScope unit="volume">45</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="1" to="29" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b35">
	<analytic>
		<title level="a" type="main">Dremel: interactive analysis of web-scale datasets</title>
		<author>
			<persName><forename type="first">Sergey</forename><surname>Melnik</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Commun. ACM</title>
		<imprint>
			<biblScope unit="volume">54</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="114" to="123" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b36">
	<monogr>
		<title level="m" type="main">Azure Blob Storage</title>
		<ptr target="https://azure.microsoft.com/services/storage/blobs/" />
		<imprint>
			<date type="published" when="2020-11-14">November 14, 2020</date>
		</imprint>
		<respStmt>
			<orgName>Microsoft</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b37">
	<monogr>
		<title level="m" type="main">Azure subscription and service limits, quotas, and constraints</title>
		<ptr target="https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/azure-subscription-service-limits" />
		<imprint>
			<date type="published" when="2020-11-14">November 14, 2020</date>
		</imprint>
		<respStmt>
			<orgName>Microsoft</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b38">
	<monogr>
		<title level="m" type="main">Hyperspace</title>
		<ptr target="https://microsoft.github.io/hyperspace" />
		<imprint>
			<date type="published" when="2020-11-14">November 14, 2020</date>
		</imprint>
		<respStmt>
			<orgName>Microsoft</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b39">
	<monogr>
		<title level="m" type="main">Database Management Systems</title>
		<author>
			<persName><forename type="first">Raghu</forename><surname>Ramakrishnan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Johannes</forename><surname>Gehrke</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2000">2000</date>
			<publisher>McGraw-Hill</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b40">
	<analytic>
		<title level="a" type="main">Data lakes: Trends and perspectives</title>
		<author>
			<persName><forename type="first">Franck</forename><surname>Ravat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yan</forename><surname>Zhao</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Database and Expert Systems Applications (DEXA)</title>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="304" to="313" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b41">
	<analytic>
		<title level="a" type="main">Presto: SQL on everything</title>
		<author>
			<persName><forename type="first">Raghav</forename><surname>Sethi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Martin</forename><surname>Traverso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dain</forename><surname>Sundstrom</surname></persName>
		</author>
		<author>
			<persName><forename type="first">David</forename><surname>Phillips</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Wenlei</forename><surname>Xie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yutian</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nezih</forename><surname>Yegitbasi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Haozhun</forename><surname>Jin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Eric</forename><surname>Hwang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nileema</forename><surname>Shingte</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 35th International Conference on Data Engineering (ICDE). IEEE</title>
		<meeting>the 35th International Conference on Data Engineering (ICDE). IEEE</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="1802" to="1813" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b42">
	<analytic>
		<title level="a" type="main">The Hadoop distributed file system</title>
		<author>
			<persName><forename type="first">Konstantin</forename><surname>Shvachko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hairong</forename><surname>Kuang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sanjay</forename><surname>Radia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Robert</forename><surname>Chansler</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 26th Symposium on Mass Storage Systems and Technologies (MSST). IEEE</title>
		<meeting>the 26th Symposium on Mass Storage Systems and Technologies (MSST). IEEE</meeting>
		<imprint>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="1" to="10" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b43">
	<analytic>
		<title level="a" type="main">Choosing a cloud DBMS: architectures and tradeoffs</title>
		<author>
			<persName><forename type="first">Junjay</forename><surname>Tan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Proceedings of the VLDB Endowment</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="issue">12</biblScope>
			<biblScope unit="page" from="2170" to="2182" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b44">
	<monogr>
		<title level="m">EU General Data Protection Regulation (GDPR) - An implementation and compliance guide</title>
		<imprint>
			<publisher>IT Governance Ltd</publisher>
			<date type="published" when="2020">2020</date>
		</imprint>
		<respStmt>
			<orgName>IT Governance Privacy Team</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b45">
	<monogr>
		<title level="m" type="main">data-lake-index</title>
		<author>
			<persName><forename type="first">Grisha</forename><surname>Weintraub</surname></persName>
		</author>
		<ptr target="https://github.com/grishaw/data-lake-index" />
		<imprint>
			<date type="published" when="2020-11-14">November 14, 2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b46">
	<analytic>
		<title level="a" type="main">Apache Spark: a unified engine for big data processing</title>
		<author>
			<persName><forename type="first">Matei</forename><surname>Zaharia</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Commun. ACM</title>
		<imprint>
			<biblScope unit="volume">59</biblScope>
			<biblScope unit="page" from="56" to="65" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b47">
	<analytic>
		<title level="a" type="main">Inverted files for text search engines</title>
		<author>
			<persName><forename type="first">Justin</forename><surname>Zobel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alistair</forename><surname>Moffat</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Computing Surveys (CSUR)</title>
		<imprint>
			<biblScope unit="volume">38</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page">6</biblScope>
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
