The Data Platform Evolution: From Data Warehouses over Data Lakes to Lakehouses

Jan Schneider¹, Christoph Gröger², Arnold Lutsch²

¹ Institute for Parallel and Distributed Systems, University of Stuttgart, Universitätsstraße 38, 70569 Stuttgart, Germany
² Robert Bosch GmbH, Borsigstraße 4, 70469 Stuttgart, Germany

GvDB’23: 34th Workshop on Foundations of Database Systems, June 07–09, 2023, Hirsau, Germany
Contact: {firstname.lastname}@ipvs.uni-stuttgart.de (J. Schneider); {firstname.lastname}@de.bosch.com (C. Gröger, A. Lutsch)

Abstract
The continuously increasing availability of data and the growing maturity of data-driven analysis techniques have encouraged enterprises to collect and analyze huge amounts of business-relevant data in order to exploit it for competitive advantages. To facilitate these processes, various platforms for analytical data management have been developed: While data warehouses have traditionally been used by business analysts for reporting and OLAP, data lakes emerged as an alternative concept that also supports advanced analytics. As these two common types of data platforms show rather contrary characteristics and target different user groups and analytical approaches, enterprises usually need to employ both of them, resulting in complex, error-prone and costly architectures. To address these issues, efforts have recently become apparent to combine features of data warehouses and data lakes into so-called lakehouses, which aim to serve all kinds of analytics from a single data platform. This paper provides an overview of the evolution of analytical data platforms from data warehouses over data lakes to lakehouses and elaborates on the vision and characteristics of the latter. Furthermore, it addresses the question of which aspects common data lakes are currently missing that prevent them from transitioning to lakehouses.

Keywords
Lakehouse, Data Warehouse, Data Lake, Data Management, Data Analytics

1. Introduction

Within the course of the digital transformation of society and economy, the importance of data for enterprises is continuously growing. Due to the ever-increasing affordability of smart devices and sensors in the scope of the Internet of Things [1], as well as a wide range of other upcoming technologies for capturing data about products, shop floors, suppliers, customers and other entities, enterprises have gained manifold opportunities for collecting business-related data along their value chains. By leveraging data-driven analysis techniques, this data can be exploited for evaluating and optimizing products and business processes and hence constitutes a key factor for continuous development and improvement. However, in order to derive valuable insights and knowledge from huge amounts of collected data, this data needs to be organized and prepared in a systematic manner, along with metadata that describes the context in which the data was created and processed [2]. Platforms for analytical data management can support these tasks, as they are specifically developed for the storage, management, processing and provisioning of data from all types of data sources that is supposed to be made available for different types of analytics applications [3]. In practice, especially the traditional data warehouses and the more recent data lakes have become the predominant types of data platforms.

With so-called lakehouses, a supposedly new kind of data platform has recently attracted attention: They are driven by the vision of combining the characteristics and features of data warehouses and data lakes, which are perceived as complementary, into integrated data platforms. With the prospect of being able to serve all kinds of analytical workloads from one universally applicable platform, lakehouses promise to simplify and improve existing enterprise analytics architectures, which commonly need to operate data warehouses and data lakes in parallel and hence suffer from high operational costs, slow analytical processes, as well as a low trustworthiness of analysis results [4]. Over the past years, a variety of technologies have emerged or evolved with the intention to address these issues and hence to enable the construction of lakehouse-like data platforms, such as Delta Lake (https://delta.io), Dremio (https://www.dremio.com) or Snowflake (https://www.snowflake.com). As indicated by our evaluation of several data management tools [5], frameworks that operate on top of data lakes and aim to enhance them with typical features of data warehouses appear to be particularly promising in this regard, including Delta Lake, Apache Hudi (https://hudi.apache.org) and Apache Iceberg (https://iceberg.apache.org).

This paper first provides an overview of the evolution of data platforms and explains the vision behind the lakehouse paradigm. Section 3 then elaborates on the characteristics of lakehouses, which are compared to the architecture of a typical data lake in Section 4. This way, several aspects are identified that conventional data lakes need to address in order to complete the transition to lakehouses.
2. Evolution of Data Platforms

Between 1960 and 1970, the first databases appeared and the relational data model [6] was developed. The purpose of these databases was primarily to provide data management capabilities for applications; accordingly, they were designed for workloads where rather simple read and write operations have to be performed on large datasets with high frequency. However, many of these databases are less suitable for analytics applications, where large amounts of historical data have to be sporadically analyzed with rather complex queries in order to derive insights and knowledge that can then be used for guiding business decisions. For this reason, data platforms for analytical data management have been developed, which support the systematic long-term storage, management and querying of data for analytical purposes.

Data warehouses [7] represent the most established type of analytical data platform and emerged from relational database systems in the 1980s. They are primarily designed for the management of structured data, impose well-defined and possibly multi-dimensional data models [8], often provide ACID guarantees and tend to offer features that go beyond those of conventional relational databases, such as time travel and data governance capabilities. The left side of Fig. 1 shows the common architecture of a data warehouse, based on the reference architecture by Bauer and Günzel [9]. Data warehouses are typically designed specifically for a given application scenario and employ an Extract-Transform-Load (ETL) process, where the data is first extracted from the data sources, then prepared and transformed into the target schema in a dedicated data staging area and finally loaded into the core data warehouse, which is responsible for the long-term storage of all data. While the data staging area can leverage different types of storage systems, such as relational or NoSQL databases, the core data warehouse typically relies on relational databases.
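The ETL flow described above can be sketched in a few lines. The following minimal illustration uses an in-memory SQLite database as a stand-in for the core data warehouse; all table names, column names and source records are hypothetical:

```python
import sqlite3

# Hypothetical source records, e.g. extracted from an operational system.
source_rows = [
    {"order_id": 1, "amount": "19.99", "country": "DE"},
    {"order_id": 2, "amount": "5.50", "country": "FR"},
]

# Transform: cast types and normalize values in a "staging area" (here a list).
staged = [(r["order_id"], float(r["amount"]), r["country"].lower()) for r in source_rows]

# Load: write the transformed rows into the core warehouse table,
# whose schema was defined up front (schema-on-write).
dwh = sqlite3.connect(":memory:")
dwh.execute("CREATE TABLE sales (order_id INTEGER, amount REAL, country TEXT)")
dwh.executemany("INSERT INTO sales VALUES (?, ?, ?)", staged)

total = dwh.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
```

The key point is that the transformation into the target schema happens before the data reaches long-term storage, which is what distinguishes ETL from the ELT approach of data lakes discussed below.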
Due to the large amount of data that resides in the core data warehouse, it can be reasonable to extract parts of the data and to make it available in dependent data marts [10], which then help to speed up downstream analyses. For example, some data marts may be based on relational databases and optimized for reporting, while other data marts employ multi-dimensional databases in order to support Online Analytical Processing (OLAP) [10]. By using appropriate query languages, data analysts can perform their analyses either on individual data marts or directly on the core data warehouse. As data warehouses employ complex, static data models, store pre-processed data instead of the raw data and leverage proprietary data formats that impede direct data access, they are mostly suited for analysis questions that are already known in advance and provide only very limited support for data mining and machine learning. Moreover, since data warehouses are primarily optimized for the batch processing of huge amounts of data, they can barely be used for streaming applications [11] that rely on the near-realtime execution of simple data operations with high frequency. With the goal of making data warehouses more flexible, there have been various attempts to enable the storage of structured raw data. For example, data vault [12] represents a data modeling approach that facilitates the easy incorporation of changes to the data schema without requiring adjustments to the structure of existing tables and hence accommodates the variability of raw data.

Figure 1: Comparison of typical high-level architectures of data warehouses (left) and data lakes (right).
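To make the data vault idea more concrete, the following sketch models a hypothetical customer entity as a hub with its descriptive attributes in a satellite, following the hub/satellite pattern of data vault modeling; a later schema change is absorbed by adding a further satellite instead of altering existing tables. All table and column names are illustrative:

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hub: stable business keys only.
db.execute("""CREATE TABLE hub_customer (
    customer_key INTEGER PRIMARY KEY,
    business_key TEXT NOT NULL,
    load_date TEXT NOT NULL)""")
# Satellite: descriptive, changeable attributes referencing the hub.
db.execute("""CREATE TABLE sat_customer_details (
    customer_key INTEGER REFERENCES hub_customer(customer_key),
    name TEXT,
    city TEXT,
    load_date TEXT NOT NULL)""")

db.execute("INSERT INTO hub_customer VALUES (1, 'C-42', '2023-01-01')")
db.execute("INSERT INTO sat_customer_details VALUES (1, 'ACME', 'Stuttgart', '2023-01-01')")

# A schema change (e.g. a new source attribute) becomes a new satellite
# instead of an ALTER TABLE on existing structures.
db.execute("""CREATE TABLE sat_customer_rating (
    customer_key INTEGER REFERENCES hub_customer(customer_key),
    rating TEXT,
    load_date TEXT NOT NULL)""")

tables = [r[0] for r in db.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
```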
The continuously increasing demand for organizing and analyzing semi-structured and unstructured data led to the emergence of data lakes [13] in about 2010. Data lakes are based on the idea of collecting raw data from the data sources and deciding at a later point how this data can be processed and analyzed. This leads to an Extract-Load-Transform (ELT) process, where the data is first extracted and loaded into the data lake and subsequently prepared and transformed in order to make it accessible for different types of analytics applications. As a result, data lakes manage not only preprocessed and pre-aggregated data, but also raw data, which allows recurring analyses to be made more efficient while still maintaining a high level of flexibility.
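A minimal sketch of this ELT pattern with a zone-structured lake follows; the directory names `raw` and `harmonized` and the sensor records are illustrative. The point is that raw data is loaded first, unchanged, and only transformed later when an analysis needs it (schema-on-read):

```python
import json
import tempfile
from pathlib import Path

lake = Path(tempfile.mkdtemp())
(lake / "raw").mkdir()
(lake / "harmonized").mkdir()

# Extract & load: raw records land in the lake unchanged.
raw_records = [{"sensor": "s1", "temp_f": "98.6"}, {"sensor": "s2", "temp_f": "212"}]
(lake / "raw" / "sensors.json").write_text(json.dumps(raw_records))

# Transform later, once an analysis needs harmonized units.
loaded = json.loads((lake / "raw" / "sensors.json").read_text())
harmonized = [
    {"sensor": r["sensor"], "temp_c": round((float(r["temp_f"]) - 32) * 5 / 9, 1)}
    for r in loaded
]
(lake / "harmonized" / "sensors.json").write_text(json.dumps(harmonized))
```

Because the raw zone is kept, a different analysis can later apply a different transformation to the same source data, which is the flexibility the paragraph above refers to.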
As indicated on the right-hand side of Fig. 1, data lakes typically employ a polyglot architecture, in which several different systems for data storage and data processing are utilized, including relational and NoSQL databases, distributed file systems, batch and stream processing engines and event hubs. By applying zone models, the architecture is commonly divided into zones that reflect different degrees of data processing and governance policies [14]. Instead of proprietary file formats, data lakes tend to leverage open file formats, such as Apache Parquet (https://parquet.apache.org) or Apache ORC (https://orc.apache.org). These formats enable tabular data representations and provide further optimizations in terms of data compression and query processing. These aspects and the possibility to directly access the data on the underlying storage systems enable the execution of data mining and machine learning applications on top of data lakes. By integrating stream storage and stream processing systems, such as Apache Kafka (https://kafka.apache.org) and Apache Spark, respectively, into the architecture and by applying well-established architecture patterns like the Lambda [15] or Kappa [16] architecture, data lakes are also suitable for near-realtime reporting and streaming analytics.

Due to these complementary alignments of data warehouses and data lakes, enterprises tend to employ complex analytics architectures in which both types of data platforms are operated in parallel. This approach commonly results in several shortcomings [4], such as data replication across multiple storage systems and the need for continuously transferring, transforming and synchronizing the data between the involved data platforms, which likely leads to high operational costs and inconsistent or erroneous data. In addition, the necessary movement of data extends the time until analysis results are available. Vendors of various data management tools have recognized these problems and recently developed products that aim to close the gap between data warehouses and data lakes: On the one hand, modern and possibly cloud-based data warehouses like Snowflake are evolving in order to support the management of unstructured data, the stream ingestion of near-realtime data, as well as the querying of data that is stored in open formats on external, third-party storage systems. On the other hand, frameworks and query engines like Apache Hudi, Apache Iceberg, Dremio and Trino (https://trino.io) are emerging that can be used to enhance data lakes with typical features of data warehouses and hence make analyses more convenient. This observable convergence of data warehouses and data lakes contributed to the coining of the term "lakehouse" and its underlying vision.
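The Kappa architecture mentioned above can be illustrated without any infrastructure: a single append-only event log serves both batch analyses (full replay) and streaming analyses (incremental processing of new events), so both paths converge on the same result. The following is a minimal stand-alone sketch, with the log standing in for a stream storage system like Kafka; all names are hypothetical:

```python
# Append-only event log, as a stand-in for a stream storage system.
event_log = []

def append_event(value):
    event_log.append(value)

# Batch view: recompute the result from the full log (replay).
def batch_sum():
    return sum(event_log)

# Streaming view: maintain the same result incrementally, event by event.
class StreamingSum:
    def __init__(self):
        self.total = 0
        self.offset = 0  # position in the log up to which events were consumed

    def poll(self):
        for value in event_log[self.offset:]:
            self.total += value
        self.offset = len(event_log)
        return self.total

stream = StreamingSum()
for v in (3, 5):
    append_event(v)
first = stream.poll()   # processes events 3 and 5
append_event(4)
second = stream.poll()  # processes only the new event
```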
3. The Lakehouse Paradigm

Although there is widespread agreement that lakehouses represent amalgamations of data warehouses and data lakes, different opinions exist in the literature about what the architecture of lakehouses should look like and which characteristics these data platforms must necessarily possess. For example, many authors consider lakehouses as integrated data platforms that are based on directly-accessible storage, such as distributed file systems or object storages, and can also provide typical features of data warehouses like ACID transactions [4]. However, others argue that a two-tier architecture consisting of self-contained data warehouses and data lakes that are potentially connected by an integration layer for unified data access can also constitute a lakehouse [17]. In our work [5], we assessed different views and definitions of the lakehouse paradigm and finally derived a new definition that reflects the additional value of lakehouses for enterprises in comparison to conventional data platforms. From our perspective, lakehouses are beneficial for enterprises when they contribute to simplifying enterprise analytics architectures by providing a single source of truth, limiting the variety of involved technologies and hence reducing the number of required data movement and transformation processes. Accordingly, we define a lakehouse as an "integrated data platform that leverages the same storage type and data format for reporting and OLAP, data mining and machine learning, as well as streaming workloads" [5].

Fig. 2 illustrates how such a data platform may look. First of all, the term "integrated platform" expresses that a lakehouse should not be considered as a loose amalgamation of standalone data warehouses and data lakes, but rather as a single, self-contained data platform. Limiting the architecture to one type of storage, e.g. to a distributed file system, and one data format, e.g. to Apache Parquet, eliminates the need for additional data movement and transformation processes within the lakehouse and therefore reduces the complexity and error-proneness of the overall architecture. Furthermore, it supports the formation of a single source of truth, as the same data may no longer be replicated between different systems with varying characteristics. Finally, the definition emphasizes that lakehouses must support all typical analytical workloads of data warehouses and data lakes, so that data analysts and data scientists can use a lakehouse instead of the former data platforms.
Figure 2: Example of a lakehouse that uses HDFS as storage system and Apache Parquet as data format.

Based on this definition and the characteristics of the workloads mentioned therein, we derived a total of eight technical requirements that lakehouses should fulfill [5]:

R1: Same type of storage and data format. Lakehouses must employ only a single type of storage for all data and metadata and use only one format for the data.

R2: CRUD for all types of data. Lakehouses must support the ingestion, retrieval, updating and deletion of all kinds of data at least on the level of data collections.

R3: Relational data collections. Lakehouses must provide means to abstract from the stored data files and to represent them as cohesive data collections with relational properties on the logical level.
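Requirement R3 can be illustrated with plain files: several part files in one directory are exposed as a single logical table, similar to how engines such as Spark treat a directory of Parquet files as one dataset. CSV, the file names and the helper function are stand-ins for illustration only:

```python
import csv
import tempfile
from pathlib import Path

table_dir = Path(tempfile.mkdtemp()) / "sales"
table_dir.mkdir()

# Two part files, e.g. written by different tasks of a processing engine.
(table_dir / "part-0000.csv").write_text("order_id,amount\n1,19.99\n2,5.50\n")
(table_dir / "part-0001.csv").write_text("order_id,amount\n3,7.25\n")

def read_table(path: Path):
    """Present all part files of a directory as one relational collection."""
    rows = []
    for part in sorted(path.glob("part-*.csv")):
        with part.open() as f:
            rows.extend(csv.DictReader(f))
    return rows

table = read_table(table_dir)
```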
R4: Query language. Lakehouses must offer a declarative, structured query language that allows querying the data in a relational manner.

R5: Consistency guarantees. Lakehouses must provide consistency guarantees for the data, such as schema validation, which can either be enforced on data ingestion or when the data is queried.

R6: Isolation and atomicity. Similar to relational database systems, lakehouses must provide isolation and atomicity for data operations in order to ensure the consistency of the data and to support concurrency.

R7: Direct read access. Lakehouses must provide direct access to the data and metadata on the underlying storage system and must employ open data formats only.

R8: Unified batch and stream processing. Lakehouses must support record-wise data operations in near-realtime and allow data collections to be treated as sources and sinks for batch and stream processing.

These requirements can be achieved in various ways, for example by opening up existing data warehouses and driving them into the direction of data lakes, or by developing technologies that enhance data lakes with common features and characteristics of data warehouses.

4. Transitioning from Data Lakes to Lakehouses

In the course of our evaluation of several data management tools [5], frameworks for data lakes like Delta Lake, Apache Hudi and Apache Iceberg appeared to be particularly promising for the fulfillment of the aforementioned requirements and thus for the construction of lakehouses. These frameworks basically act as libraries for highly scalable batch and stream processing engines, such as Apache Spark (https://spark.apache.org) or Apache Flink (https://flink.apache.org), and implement data access protocols that control how these engines read data from and write data to storage systems (cf. [18]). In addition, they manage technical metadata, which allows them to represent datasets as relational data collections and to track additions, updates and deletions of data.
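Requirement R8 asks that the same data collection can act as a source and sink for both batch and stream processing. The following stand-alone sketch mimics this with a versioned table that supports full scans as well as incremental reads from a known version, a simplified analogue of the change feeds such frameworks expose; all names and the in-memory representation are illustrative:

```python
class VersionedTable:
    """A table whose committed appends are readable in batch or as a stream."""

    def __init__(self):
        self.commits = []  # each commit holds the rows added in that version

    def append(self, rows):
        self.commits.append(list(rows))
        return len(self.commits) - 1  # version number of this commit

    def scan(self):
        """Batch source: read the full table."""
        return [row for commit in self.commits for row in commit]

    def read_changes(self, since_version):
        """Streaming source: read only commits after a known version."""
        new_rows = [row for commit in self.commits[since_version + 1:] for row in commit]
        return new_rows, len(self.commits) - 1

table = VersionedTable()
v0 = table.append([{"id": 1}, {"id": 2}])
batch = table.scan()                      # a batch job sees the whole table
v1 = table.append([{"id": 3}])
changes, latest = table.read_changes(v0)  # a streaming job sees only commit v1
```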
Figure 3: Typical architecture of a data lake that can transition to a lakehouse by adding a corresponding framework.

Fig. 3 shows the conceptual architecture of a data lake as it can often be encountered in practice. It essentially consists of a storage system, which can be either a distributed file system or an object storage, that persists the data as data files in an open file format. A batch and stream processing engine can read data from the storage system, process it and then write the results back to the storage system. Hence, the data lake is supposed to store the raw data next to pre-processed and aggregated data. This processing engine is also used to ingest data, and data analysts can leverage it in order to query the data via a query language like SQL. For data mining and machine learning, data science applications can directly access the data on the storage system.

Without the lakehouse framework that is depicted in Fig. 3, the data lake would already satisfy the requirements R1, R2, R3, R4 and R7. R3 is satisfied because many processing engines like Apache Spark already enable relational data abstraction, so that multiple data files that reside on the storage system can collectively represent the contents of a table. R5 and R6 are not met, since processing engines usually do not provide means for enforcing the internal consistency of a table, nor do they guarantee atomicity and isolation when performing operations on the data.
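The missing atomicity is exactly what the transaction logs of lakehouse frameworks address: a writer first stages new data files and then publishes them with a single atomic action, so readers see either the old or the new snapshot, never a half-written state. The following file-based sketch of such a protocol is modeled loosely on the idea of a commit log next to the data files; the log layout, names and JSON format are illustrative, not the format of any particular framework:

```python
import json
import os
import tempfile
from pathlib import Path

table = Path(tempfile.mkdtemp())
log = table / "_log"
log.mkdir()

def commit(version: int, added_files: list) -> bool:
    """Atomically publish a commit; fails if the version already exists."""
    try:
        # O_EXCL makes file creation atomic: at most one writer wins version N.
        fd = os.open(log / f"{version:020d}.json", os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # a concurrent writer won; the caller must retry on a new version
    with os.fdopen(fd, "w") as f:
        json.dump({"add": added_files}, f)
    return True

def snapshot() -> list:
    """Readers reconstruct the table from committed log entries only."""
    files = []
    for entry in sorted(log.glob("*.json")):
        files.extend(json.loads(entry.read_text())["add"])
    return files

(table / "part-0.parquet").write_text("staged data")  # staged but not yet visible
ok_first = commit(0, ["part-0.parquet"])
ok_conflict = commit(0, ["part-1.parquet"])           # loses the race for version 0
current = snapshot()
```

Because readers only consult the log, a staged data file that was never committed remains invisible, which yields atomic writes and snapshot-style reads on top of a plain file system.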
Although processing engines like Apache Spark generally support the batch and stream processing of data that resides on a distributed file system or object storage, R8 is often not met, because especially engines that apply micro-batching are often not optimized for simple data operations that occur at high frequency, which results in the creation of many small data files when streaming data needs to be materialized. This high number of data files prevents the efficient querying of data, as many files have to be read and consolidated [18]. To solve this issue, a dedicated stream storage system, such as Apache Kafka, could be leveraged, but this would in turn increase the complexity of the data lake and in particular violate R1, as it represents another type of storage system.

When integrating a lakehouse framework into the processing engine, the previously unmet requirements R5, R6 and R8 can be satisfied [5]: As these frameworks provide means for enforcing the inner consistency of data collections, such as schema validation and constraint checking, R5 can be fulfilled. Furthermore, they use the collected technical metadata in order to implement data access protocols that achieve atomicity and at least snapshot isolation [19] via multi-version concurrency control [20] (cf. R6).
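Schema validation on ingestion (cf. R5) can be sketched as a check that is performed before an append is accepted: a batch that does not match the declared table schema is rejected as a whole, so no partially valid data reaches the table. The schema, records and helper below are illustrative:

```python
# Declared table schema: column name -> required Python type.
SCHEMA = {"order_id": int, "amount": float}

table_rows = []

def validated_append(rows):
    """Reject the whole batch if any record violates the schema (no partial writes)."""
    for row in rows:
        if set(row) != set(SCHEMA):
            raise ValueError(f"unexpected columns: {sorted(row)}")
        for column, required_type in SCHEMA.items():
            if not isinstance(row[column], required_type):
                raise ValueError(f"bad type for column {column!r}")
    table_rows.extend(rows)  # only reached when every record passed validation

validated_append([{"order_id": 1, "amount": 19.99}])
try:
    validated_append([{"order_id": 2, "amount": "5.50"}])  # string, not float
    rejected = False
except ValueError:
    rejected = True
```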
By offering various optimizations, such as different table types that are designed either for frequent reads or for frequent writes, as well as compaction techniques for data and metadata, these frameworks avoid the creation of many small data and metadata files and hence increase the efficiency of stream processing (cf. R8).

5. Conclusion

By assessing the properties of a typical data lake architecture and comparing them to requirements that are relevant for lakehouses, it became apparent that such an architecture lacks consistency guarantees, atomicity and isolation for data operations, as well as optimizations for stream processing, in order to complete the transition to a lakehouse. While the lakehouse approach looks promising, its concepts and technologies have not reached maturity yet and hence require further research, for example in terms of data modeling and the suitability of different architectures.
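The compaction techniques mentioned above reduce to a simple idea: the many small files produced by streaming writes are periodically rewritten into fewer large files, which keeps reads efficient. A stand-alone sketch with an illustrative file layout:

```python
import tempfile
from pathlib import Path

table = Path(tempfile.mkdtemp())

# Streaming ingestion materializes each micro-batch as its own small file.
for i in range(10):
    (table / f"part-{i:04d}.txt").write_text(f"record-{i}\n")

def compact(path: Path):
    """Rewrite all small files into a single larger file and drop the originals."""
    small = sorted(path.glob("part-*.txt"))
    merged = "".join(p.read_text() for p in small)
    (path / "compacted-0000.txt").write_text(merged)
    for p in small:
        p.unlink()

before = len(list(table.glob("part-*.txt")))
compact(table)
after_parts = len(list(table.glob("part-*.txt")))
records = (table / "compacted-0000.txt").read_text().strip().splitlines()
```

In a real framework the rewrite additionally has to be published atomically via the transaction log, so that concurrent readers never observe both the small files and the compacted file at once.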
References

[1] O. Vermesan, P. Friess, Internet of Things: Converging Technologies for Smart Environments and Integrated Ecosystems, River Publishers, 2013.
[2] DAMA International, DAMA-DMBOK: Data Management Body of Knowledge, second ed., Technics Publications, 2017.
[3] C. Gröger, Industrial Analytics – An Overview, it - Information Technology 64 (2022) 55–65.
[4] M. Armbrust, A. Ghodsi, R. Xin, et al., Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics, in: 11th CIDR, 2021.
[5] J. Schneider, C. Gröger, A. Lutsch, et al., Assessing the Lakehouse: Analysis, Requirements and Definition, in: Proceedings of the 25th International Conference on Enterprise Information Systems (ICEIS), 2023, pp. 44–56.
[6] E. F. Codd, A Relational Model of Data for Large Shared Data Banks, Communications of the ACM 13 (1970) 377–387.
[7] W. H. Inmon, Building the Data Warehouse, John Wiley & Sons, 2005.
[8] R. Kimball, M. Ross, The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, third ed., John Wiley & Sons, 2013.
[9] A. Bauer, H. Günzel, Data-Warehouse-Systeme: Architektur, Entwicklung, Anwendung, dpunkt.verlag, 2013.
[10] H. Baars, H.-G. Kemper, Business Intelligence & Analytics, fourth ed., Springer Vieweg, 2021.
[11] T. Akidau, S. Chernyak, R. Lax, Streaming Systems: The What, Where, When, and How of Large-Scale Data Processing, O’Reilly Media, 2018.
[12] D. Linstedt, M. Olschimke, Building a Scalable Data Warehouse with Data Vault 2.0, Elsevier Science & Technology Books, 2015.
[13] C. Giebler, C. Gröger, E. Hoos, et al., Leveraging the Data Lake: Current State and Challenges, in: Big Data Analytics and Knowledge Discovery, Springer International Publishing, 2019.
[14] C. Giebler, C. Gröger, E. Hoos, et al., A Zone Reference Model for Enterprise-Grade Data Lake Management, in: 24th International Enterprise Distributed Object Computing Conference (EDOC), 2020.
[15] J. Warren, N. Marz, Big Data: Principles and Best Practices of Scalable Realtime Data Systems, Simon and Schuster, 2015.
[16] J. Kreps, Questioning the Lambda Architecture, 2014. URL: https://www.oreilly.com/radar/questioning-the-lambda-architecture/.
[17] D. Oreščanin, T. Hlupić, Data Lakehouse - A Novel Step in Analytics Architecture, in: 44th International Convention on Information, Communication and Electronic Technology (MIPRO), 2021.
[18] M. Armbrust, T. Das, L. Sun, et al., Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores, Proc. VLDB Endow. 13 (2020).
[19] G. Weikum, G. Vossen, Transactional Information Systems, Elsevier, 2001.
[20] P. Jain, P. Kraft, C. Power, et al., Analyzing and Comparing Lakehouse Storage Systems, CIDR, 2023.