<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>it - Information Technology</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>The Data Platform Evolution: From Data Warehouses over Data Lakes to Lakehouses</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jan Schneider</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christoph Gröger</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arnold Lutsch</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute for Parallel and Distributed Systems, University of Stuttgart</institution>
          ,
          <addr-line>Universitätsstraße 38, 70569 Stuttgart</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Robert Bosch GmbH</institution>
          ,
          <addr-line>Borsigstraße 4, 70469 Stuttgart</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>64</volume>
      <issue>2022</issue>
      <abstract>
<p>The continuously increasing availability of data and the growing maturity of data-driven analysis techniques have encouraged enterprises to collect and analyze huge amounts of business-relevant data in order to exploit it for competitive advantages. To facilitate these processes, various platforms for analytical data management have been developed: While data warehouses have traditionally been used by business analysts for reporting and OLAP, data lakes emerged as an alternative concept that also supports advanced analytics. As these two common types of data platforms show rather contrary characteristics and target different user groups and analytical approaches, enterprises usually need to employ both of them, resulting in complex, error-prone and costly architectures. To address these issues, efforts have recently become apparent to combine features of data warehouses and data lakes into so-called lakehouses, which pursue the goal of serving all kinds of analytics from a single data platform. This paper provides an overview of the evolution of analytical data platforms from data warehouses over data lakes to lakehouses and elaborates on the vision and characteristics of the latter. Furthermore, it addresses the question of which aspects common data lakes are currently missing that prevent them from transitioning to lakehouses.</p>
      </abstract>
      <kwd-group>
<kwd>Lakehouse</kwd>
        <kwd>Data Warehouse</kwd>
        <kwd>Data Lake</kwd>
        <kwd>Data Management</kwd>
        <kwd>Data Analytics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>1. Introduction</title>
      <p>Within the course of the digital transformation of society and economy, the importance of data for enterprises is continuously growing. Due to the ever-increasing affordability of smart devices and sensors in the scope of the Internet of Things [1], as well as a wide range of other upcoming technologies for capturing data about products, shop floors, suppliers, customers and other entities, enterprises have gained manifold opportunities for collecting business-related data along their value chains. By leveraging data-driven analysis techniques, this data can be exploited for evaluating and optimizing products and business processes and hence constitutes a key factor for continuous development and improvement. However, in order to be able to derive valuable insights and knowledge from huge amounts of collected data, this data needs to be organized and prepared in a systematic manner, along with metadata that describes the context in which the data was created and processed [2]. Platforms for analytical data management can support these tasks, as they are specifically developed for the storage, management, processing and provisioning of data from all types of data sources that is supposed to be made available for different types of analytics applications [3]. In practice, especially the traditional data warehouses and the more recent data lakes have become the predominant types of data platforms. With so-called lakehouses, a supposedly new kind of data platform has recently attracted attention: They are driven by the vision of combining the characteristics and features of data warehouses and data lakes, which are perceived as complementary, into integrated data platforms. With the prospect of being able to serve all kinds of analytical workloads from one universally applicable platform, lakehouses promise to simplify and improve existing enterprise analytics architectures, which commonly needed to operate data warehouses and data lakes in parallel and hence suffered from high operational costs, slow analytical processes, as well as a low trustworthiness of analysis results [4]. Over the past years, a variety of technologies have emerged or evolved with the intention to address these issues and hence to enable the construction of lakehouse-like data platforms, such as Delta Lake1, Dremio2 or Snowflake3. As indicated by our evaluation of several data management tools [5], frameworks that operate on top of data lakes and pursue to enhance them with typical features of data warehouses appear to be particularly promising in this regard, including Delta Lake, Apache Hudi4 and Apache Iceberg5. This paper first provides an overview of the evolution of data platforms and explains the vision behind the lakehouse paradigm. Section 3 then elaborates on the characteristics of lakehouses, which are compared to the architecture of a typical data lake in Section 4. This way, several aspects are identified that conventional data lakes need to address in order to be able to complete the transition to lakehouses.</p>
    </sec>
    <sec id="sec-evolution">
      <title>2. Evolution of Data Platforms</title>
      <p>Between 1960 and 1970, the first databases appeared and the relational data model [6] was developed. The purpose of these databases was primarily to provide data management capabilities for applications; they were accordingly designed for workloads where rather simple read and write operations have to be performed on large datasets with high frequency. However, many of these databases are less suitable for analytics applications, where large amounts of historical data have to be sporadically analyzed with rather complex queries in order to derive insights and knowledge that can then be used for guiding business decisions. For this reason, data platforms for analytical data management have been developed, which support the systematic long-term storage, management and querying of data for analytical purposes.</p>
      <p>Data warehouses [7] represent the most established type of analytical data platform and emerged from relational database systems in the 1980s. They are primarily designed for the management of structured data, impose well-defined and possibly multi-dimensional data models [8], often provide ACID guarantees and tend to offer features that go beyond those of conventional relational databases, such as time travel and data governance capabilities. The left side of Fig. 2 shows the common architecture of a data warehouse, based on the reference architecture by Bauer and Günzel [9]. Data warehouses are typically designed specifically for a given application scenario and employ an Extract-Transform-Load (ETL) process, where the data is first extracted from the data sources, then prepared and transformed into the target schema in a dedicated data staging area, and finally loaded into the core data warehouse, which is responsible for the long-term storage of all data. While the data staging area can leverage different types of storage systems, such as relational or NoSQL databases, the core data warehouse typically relies on relational databases. Due to the large amount of data that resides in the core data warehouse, it can be reasonable to extract parts of the data and make it available in dependent data marts [10], which then allow to speed up downstream analyses. For example, some data marts may be based on relational databases and optimized for reporting, while other data marts employ multi-dimensional databases in order to support Online Analytical Processing (OLAP) [10]. By using appropriate query languages, data analysts can perform their analyses either on individual data marts or directly on the core data warehouse. As data warehouses employ complex, static data models, store pre-processed data instead of the raw data and leverage proprietary data formats that impede direct data access, they are mostly suited for analysis questions that are already known in advance and provide only very limited support for data mining and machine learning. Moreover, since data warehouses are primarily optimized for the batch processing of huge amounts of data [11], they can only barely meet the latency requirements of frequent query executions.</p>
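<p>The ETL flow described above can be pictured in a few lines of Python. The sketch is purely illustrative and not part of the architectures discussed in this paper: all function names are hypothetical, and the "core warehouse" is only an in-memory list standing in for a relational database.</p>

```python
# Illustrative ETL sketch: extract raw records, transform them in a
# "staging area", then load them into a (purely in-memory) core warehouse.
# All names are hypothetical; a real core warehouse would be a relational DBMS.

def extract(source):
    # Extraction: pull raw records from an operational source system.
    return list(source)

def transform(staging):
    # Transformation in the staging area: cleanse and map to the target schema.
    out = []
    for rec in staging:
        if rec.get("price") is None:
            continue  # discard incomplete records during cleansing
        out.append({"product": rec["name"].strip().lower(),
                    "revenue": float(rec["price"]) * int(rec.get("qty", 1))})
    return out

def load(core_warehouse, rows):
    # Loading: append the transformed rows to the long-term store.
    core_warehouse.extend(rows)
    return core_warehouse

source = [{"name": " Widget ", "price": "9.5", "qty": 3},
          {"name": "Gadget", "price": None}]
warehouse = load([], transform(extract(source)))
```

Note that the transformation happens before loading: only data that already conforms to the target schema reaches the warehouse, which is exactly the property that the ELT approach of data lakes later relaxes.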
<p>Figure 2: Comparison of typical high-level architectures of data warehouses (left) and data lakes (right).</p>
      <p>With the goal of making data warehouses more flexible, there have been various attempts to enable the storage of structured raw data. For example, data vault [12] represents a data modeling approach that facilitates the easy incorporation of changes to the data schema without requiring adjustments to the structure of existing tables and hence accommodates the variability of raw data.</p>
      <p>The continuously increasing demand for organizing and analyzing semi-structured and unstructured data led to the emergence of data lakes [13] in about 2010. Data lakes are based on the idea of collecting raw data from the data sources and deciding at a later point how this data can be processed and analyzed.</p>
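<p>This load-first, transform-later idea can likewise be sketched with a small toy example. Again, the sketch is purely illustrative and all names are hypothetical: raw records, including corrupt ones, are kept unchanged in the lake, and a prepared view is derived only on demand.</p>

```python
# Illustrative ELT sketch: raw data is loaded into the lake unchanged and
# only transformed later, when a concrete analysis needs it. The "lake" is
# just a dict of named datasets; all names are hypothetical.

lake = {}

def load_raw(lake, dataset, records):
    # Load step: store the records exactly as delivered by the source.
    lake[dataset] = list(records)

def transform_on_demand(lake, dataset):
    # Transform step, deferred: derive a prepared view from the raw data.
    prepared = []
    for rec in lake[dataset]:
        if isinstance(rec, dict) and "temp_c" in rec:
            prepared.append({"sensor": rec.get("id", "unknown"),
                             "temp_f": rec["temp_c"] * 9 / 5 + 32})
    return prepared

# Corrupt or schema-less records are kept as well; they may become useful later.
load_raw(lake, "sensors_raw",
         [{"id": "s1", "temp_c": 20}, "corrupt line", {"note": "no reading"}])
view = transform_on_demand(lake, "sensors_raw")
```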
      <p>Data warehouses [7] represent the most established Extract-Load-Transform (ELT) process, where the data
type of analytical data platform and emerged from rela- is first extracted and load into the data lake and
subsetional database systems in the 1980s. They are primarily quently prepared and transformed in order to make it
designed for the management of structured data, impose accessible for diferent types of analytics applications. As
well-defined and possibly multi-dimensional data mod- a result, data lakes manage not only preprocessed and
els [8], often provide ACID guarantees and tend to ofer pre-aggregated data, but also raw data, which allows to
3. The Lakehouse Paradigm
increase the eficiency of re-occurring analyses while still
maintaining a high level of flexibility. As indicated on
the right-hand side of Fig. 2, data lakes typically impose Although there is a widespread agreement that
lakea polyglot architecture, in which several diferent sys- houses represent amalgamations of data warehouses and
tems for data storage and data processing are utilized, data lakes, diferent opinions in literature exist about how
including relational and NoSQL databases, distributed file the architecture of lakehouses should look like and what
systems, batch and stream processing engines and event characteristics these data platforms must necessarily
poshubs. By applying zone models, the architecture is com- sess. For example, many authors consider lakehouses
monly divided into zones that reflect diferent degrees as integrated data platforms that are based on
directlyof data processing and governance policies [14]. Instead accessible storage, such as distributed file systems or
of proprietary file formats, data lakes tend to leverage object storages and can also provide typical features of
open file formats, such as Apache Parquet 6 or Apache data warehouses like ACID transactions [4]. However,
ORC7. These formats enable tabular data representations others argue that a two-tier architecture consisting of
and provide further optimizations in terms of data com- self-contained data warehouses and data lakes that are
pression and query processing. These aspects and the potentially connected by an integration layer for unified
possibility to directly access the data on the underlying data access can also constitute a lakehouse [17]. In our
storage systems enable the execution of data mining and work [5], we assessed diferent views and definitions of
machine learning applications on top of data lakes. By in- the lakehouse paradigm and finally derived a new
definitegrating stream storage and stream processing systems, tion that reflects the additional value of lakehouses for
such as Apache Spark and Apache Kafka8, respectively, enterprises in comparison to conventional data platforms.
into the architecture and by applying well-established From our perspective, lakehouses are beneficial for
enarchitecture patterns like the Lambda [15] or Kappa [16] terprises when they contribute to simplifying enterprise
architecture, data lakes are also suitable for near-realtime analytics architectures by providing a single source of
reporting and streaming analytics. truth, limiting the variety of involved technologies and</p>
      <p>Due to these complementary alignments of data ware- hence reducing the number of required data movement
houses and data lakes, enterprises tend to employ com- and transformation processes. Accordingly, we define a
plex analytics architectures in which both types of data lakehouse as "integrated data platform that leverages the
platforms are operated in parallel This approach com- same storage type and data format for reporting and OLAP,
monly results in several shortcomings [4], such as data data mining and machine learning, as well as streaming
replication across multiple storage systems and the need workloads." [5]. Fig. 3 illustrates how such a data platform
for continuously transferring, transforming and synchro- may look like. First of all, the term "integrated platform"
nizing the data between the involved data platforms, expresses that a lakehouse should not be considered as a
which likely leads to high operational costs and inconsis- loose amalgamation of standalone data warehouses and
tent or erroneous data. In addition, the necessary move- data lakes, but rather as a single, self-contained data
platment of data extends the time until analysis results are form. Limiting the architecture to one type of storage, e.g.
available. Vendors of various data management tools to a distributed file system, and one data format, e.g. to
have recognized these problems and recently developed Apache Parquet, eliminates the need for additional data
products that pursue to close the gap between data ware- movement and transformation processes within the
lakehouses and data lakes: On the one hand, modern and house and therefore reduces the complexity and
errorpossibly cloud-based data warehouses like Snowflake are proneness of the overall architecture. Furthermore, it
evolving in order to support the management of unstruc- supports the formation of a single source of truth, as
tured data, the stream ingestion of near-realtime data, as the same data may no longer be replicated between
difwell the querying of data that is stored in open formats ferent systems with varying characteristics. Finally, the
on external, third-party storage systems. On the other definition emphasizes that lakehouses must support all
hand, frameworks and query engines like Apache Hudi, typical analytical workloads of data warehouses and data
Apache Iceberg, Dremio and Trino9 are emerging that lakes, so that data analysts and data scientists can use a
can be used to enhance data lakes by typical features of lakehouse instead of the former data platforms.
data warehouses and hence make analyses more conve- Based on this definition and the characteristics of the
nient. This observable convergence of data warehouses workloads mentioned therein, we derived a total of eight
and data lakes contributed to the coining of the term technical requirements that lakehouses should fulfill [ 5]:
"lakehouse" and its underlying vision. R1: Same type of storage and data format
Lakehouses must employ only a single type of storage for all
data and metadata and use only one format for the data.</p>
<p>R2: CRUD for all types of data. Lakehouses must support the ingestion, retrieval, updating and deletion of all kinds of data, at least on the level of data collections.</p>
<p>R3: Relational data collections. Lakehouses must provide means to abstract from the stored data files and to represent them as cohesive data collections with relational properties on the logical level.</p>
      <p>R4: Query language. Lakehouses must offer a declarative, structured query language that allows to query the data in a relational manner.</p>
      <p>R5: Consistency guarantees. Lakehouses must provide consistency guarantees for the data, such as schema validation, which can be enforced either on data ingestion or when the data is queried.</p>
      <p>R6: Isolation and atomicity. Similar to relational database systems, lakehouses must provide isolation and atomicity for data operations in order to ensure the consistency of the data and to support concurrency.</p>
      <p>R7: Direct read access. Lakehouses must provide direct access to the data and metadata on the underlying storage system and must employ open data formats only.</p>
      <p>R8: Unified batch and stream processing. Lakehouses must support record-wise data operations in near-realtime and allow to treat data collections as sources and sinks for batch and stream processing.</p>
      <p>These requirements can be achieved in various ways, for example by opening existing data warehouses and driving them into the direction of data lakes, or by developing technologies that enhance data lakes with common features and characteristics of data warehouses.</p>
    </sec>
    <sec id="sec-transition">
      <title>4. Transitioning from Data Lakes to Lakehouses</title>
      <p>In the course of our evaluation of several data management tools [5], frameworks for data lakes like Delta Lake, Apache Hudi and Apache Iceberg appeared to be particularly promising for the fulfillment of the aforementioned requirements and thus for the construction of lakehouses.</p>
<p>These frameworks basically act as libraries for highly scalable batch and stream processing engines, such as Apache Spark10 or Apache Flink11, and implement data access protocols that control how these engines read data from and write data to storage systems (cf. [18]). In addition, they manage technical metadata, which allows them to represent datasets as relational data collections and to track additions, updates and deletions of data.</p>
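<p>The metadata mechanism described above can be approximated with a minimal, purely illustrative sketch: a log of file-level "add" and "remove" actions from which the set of files constituting a table can be reconstructed for any version. This only mimics the general idea behind such logs and does not correspond to the actual APIs or formats of Delta Lake, Apache Hudi or Apache Iceberg.</p>

```python
# Hedged sketch of a table-metadata log: each commit records which data
# files were added to or removed from the table, and replaying the log up
# to a version yields that version's file set (the basis for features such
# as time travel). Purely illustrative; no real framework API is used.

log = []  # list of commits; each commit is a list of (action, filename)

def commit(actions):
    # One commit appends a group of file-level actions atomically.
    log.append(list(actions))

def snapshot(version):
    # Replay the log up to the given version to get the live file set.
    files = set()
    for entry in log[:version]:
        for action, name in entry:
            if action == "add":
                files.add(name)
            else:
                files.discard(name)
    return sorted(files)

commit([("add", "part-0.parquet")])
commit([("add", "part-1.parquet")])
# A rewrite (e.g. an update) replaces a file in a single commit:
commit([("remove", "part-0.parquet"), ("add", "part-0-rewritten.parquet")])
```

Because every version of the file set remains reconstructable, an engine reading through such a log can present the files as one relational data collection and expose earlier versions on demand.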
    </sec>
    <sec id="sec-3">
<p>Figure 4: Typical architecture of a data lake that can transition to a lakehouse by adding a corresponding framework. The figure marks each requirement Rn as fulfilled either natively by the data lake or by the framework.</p>
      <p>Fig. 4 shows the conceptual architecture of a data lake as it can often be encountered in practice. It essentially consists of a storage system, which can be either a distributed file system or an object storage and which persists the data as data files in an open file format. A batch and stream processing engine can read data from the storage system, process it and then write the results back to the storage system. Hence, the data lake is supposed to store the raw data next to pre-processed and aggregated data. This processing engine is also used to ingest data, and data analysts can leverage it in order to query the data via a query language like SQL. For data mining and machine learning, data science applications can directly access the data on the storage system. Even without the lakehouse framework that is depicted in Fig. 4, the data lake would already satisfy the requirements R1, R2, R3, R4 and R7. R3 is satisfied because many processing engines like Apache Spark already enable relational data abstraction, so that multiple data files that reside on the storage system can collectively represent the contents of a table. R5 and R6 are not met, since processing engines neither ensure the consistency of a table, nor do they guarantee atomicity.</p>
      <p>10https://spark.apache.org 11https://flink.apache.org</p>
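<p>What a lakehouse framework adds with regard to R5 and R6 can be sketched as follows. The example is purely illustrative and does not reflect any real framework API: records are validated against a table schema on ingestion, and a batch is published in an all-or-nothing manner, so readers never observe a half-written batch.</p>

```python
# Hedged sketch of framework-provided guarantees on top of a plain engine:
# schema validation on ingestion (cf. R5) and an all-or-nothing batch
# commit (cf. R6). All names are hypothetical.

TABLE_SCHEMA = {"id": int, "name": str}

def validate(record):
    # R5: reject records whose fields or types do not match the schema.
    if set(record) != set(TABLE_SCHEMA):
        raise ValueError("schema mismatch")
    for field, ftype in TABLE_SCHEMA.items():
        if not isinstance(record[field], ftype):
            raise ValueError("schema mismatch")

def atomic_ingest(table, records):
    # R6: validate the whole batch first, then publish it in one step.
    for rec in records:
        validate(rec)
    table.append(list(records))  # single "commit" of the entire batch

table = []
atomic_ingest(table, [{"id": 1, "name": "a"}])
try:
    # The second record violates the schema, so the whole batch is rejected
    # and the table keeps its previous, consistent state.
    atomic_ingest(table, [{"id": 2, "name": "b"}, {"id": "bad", "name": "c"}])
except ValueError:
    pass
```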
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>