=Paper=
{{Paper
|id=Vol-3714/invited2
|storemode=property
|title=The Data Platform Evolution: From Data Warehouses over Data Lakes to Lakehouses
|pdfUrl=https://ceur-ws.org/Vol-3714/invited2.pdf
|volume=Vol-3714
|authors=Jan Schneider,Christoph Gröger,Arnold Lutsch
|dblpUrl=https://dblp.org/rec/conf/gvd/SchneiderGL23
}}
==The Data Platform Evolution: From Data Warehouses over Data Lakes to Lakehouses==
Jan Schneider¹, Christoph Gröger² and Arnold Lutsch²
¹ Institute for Parallel and Distributed Systems, University of Stuttgart, Universitätsstraße 38, 70569 Stuttgart, Germany
² Robert Bosch GmbH, Borsigstraße 4, 70469 Stuttgart, Germany
Abstract
The continuously increasing availability of data and the growing maturity of data-driven analysis techniques have encouraged enterprises to collect and analyze huge amounts of business-relevant data in order to exploit it for competitive advantages. To facilitate these processes, various platforms for analytical data management have been developed: While data warehouses have traditionally been used by business analysts for reporting and OLAP, data lakes emerged as an alternative concept that also supports advanced analytics. As these two common types of data platforms show rather contrary characteristics and target different user groups and analytical approaches, enterprises usually need to employ both of them, resulting in complex, error-prone and costly architectures. To address these issues, efforts have recently become apparent to combine features of data warehouses and data lakes into so-called lakehouses, which aim to serve all kinds of analytics from a single data platform. This paper provides an overview of the evolution of analytical data platforms from data warehouses over data lakes to lakehouses and elaborates on the vision and characteristics of the latter. Furthermore, it addresses the question of what aspects common data lakes are currently missing that prevent them from transitioning to lakehouses.
Keywords
Lakehouse, Data Warehouse, Data Lake, Data Management, Data Analytics
1. Introduction

Within the course of the digital transformation of society and economy, the importance of data for enterprises is continuously growing. Due to the ever-increasing affordability of smart devices and sensors in the scope of the Internet of Things [1], as well as a wide range of other upcoming technologies for capturing data about products, shop floors, suppliers, customers and other entities, enterprises have gained manifold opportunities for collecting business-related data along their value chains. By leveraging data-driven analysis techniques, this data can be exploited for evaluating and optimizing products and business processes and hence constitutes a key factor for continuous development and improvement. However, in order to derive valuable insights and knowledge from huge amounts of collected data, this data needs to be organized and prepared in a systematic manner, along with metadata that describes the context in which the data was created and processed [2]. Platforms for analytical data management can support these tasks, as they are specifically developed for the storage, management, processing and provisioning of data from all types of data sources that is supposed to be made available for different types of analytics applications [3]. In practice, especially the traditional data warehouses and the more recent data lakes have become the predominant types of data platforms. With so-called lakehouses, a supposedly new kind of data platform has recently attracted attention: They are driven by the vision of combining the characteristics and features of data warehouses and data lakes, which are perceived as complementary, into integrated data platforms. With the prospect of being able to serve all kinds of analytical workloads from one universally applicable platform, lakehouses promise to simplify and improve existing enterprise analytics architectures, which commonly need to operate data warehouses and data lakes in parallel and hence suffer from high operational costs, slow analytical processes, as well as a low trustworthiness of analysis results [4]. Over the past years, a variety of technologies have emerged or evolved with the intention to address these issues and hence to enable the construction of lakehouse-like data platforms, such as Delta Lake¹, Dremio² or Snowflake³. As indicated by our evaluation of several data management tools [5], frameworks that operate on top of data lakes and aim to enhance them with typical features of data warehouses appear to be particularly promising in this regard, including Delta Lake, Apache Hudi⁴ and Apache Iceberg⁵. This paper first provides an overview of the evolution of data platforms and explains the vision behind the lakehouse paradigm. Section 3 then elaborates on the characteristics of lakehouses, which are compared to the architecture of a typical data lake in Section 4. This way, several aspects are identified that conventional data lakes need to address in order to complete the transition to lakehouses.
GvDB’23: 34th Workshop on Foundations of Database Systems, June 07–09, 2023, Hirsau, Germany
{firstname.lastname}@ipvs.uni-stuttgart.de (J. Schneider); {firstname.lastname}@de.bosch.com (C. Gröger); {firstname.lastname}@de.bosch.com (A. Lutsch)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073).

¹ https://delta.io
² https://www.dremio.com
³ https://www.snowflake.com
⁴ https://hudi.apache.org
⁵ https://iceberg.apache.org
2. Evolution of Data Platforms

Between 1960 and 1970, the first databases appeared and the relational data model [6] was developed. The purpose of these databases was primarily to provide data management capabilities for applications, and they were accordingly designed for workloads where rather simple read and write operations have to be performed on large datasets with high frequency. However, many of these databases are less suitable for analytics applications, where large amounts of historical data have to be sporadically analyzed with rather complex queries in order to derive insights and knowledge that can then be used for guiding business decisions. For this reason, data platforms for analytical data management have been developed, which support the systematic long-term storage, management and querying of data for analytical purposes.

Figure 1: Comparison of typical high-level architectures of data warehouses (left) and data lakes (right).

Data warehouses [7] represent the most established type of analytical data platform and emerged from relational database systems in the 1980s. They are primarily designed for the management of structured data, impose well-defined and possibly multi-dimensional data models [8], often provide ACID guarantees and tend to offer features that go beyond those of conventional relational databases, such as time travel and data governance capabilities. The left side of Fig. 1 shows the common architecture of a data warehouse, based on the reference architecture by Bauer and Günzel [9]. Data warehouses are typically designed specifically for a given application scenario and employ an Extract-Transform-Load (ETL) process, where the data is first extracted from the data sources, then prepared and transformed into the target schema in a dedicated data staging area and finally loaded into the core data warehouse, which is responsible for the long-term storage of all data. While the data staging area can leverage different types of storage systems, such as relational or NoSQL databases, the core data warehouse typically relies on relational databases. Due to the large amount of data that resides in the core data warehouse, it can be reasonable to extract parts of the data and to make it available in dependent data marts [10], which then allow to speed up downstream analyses. For example, some data marts may be based on relational databases and optimized for reporting, while other data marts employ multi-dimensional databases in order to support Online Analytical Processing (OLAP) [10]. By using appropriate query languages, data analysts can perform their analyses either on individual data marts or directly on the core data warehouse. As data warehouses employ complex, static data models, store pre-processed data instead of the raw data and leverage proprietary data formats that impede direct data access, they are mostly suited for analysis questions that are already known in advance and provide only very limited support for data mining and machine learning. Moreover, since data warehouses are primarily optimized for the batch processing of huge amounts of data, they can barely be used for streaming applications [11] that rely on the near-realtime execution of simple data operations with high frequency. With the goal of making data warehouses more flexible, there have been various attempts to enable the storage of structured raw data. For example, data vault [12] represents a data modeling approach that facilitates the easy incorporation of changes to the data schema without requiring adjustments to the structure of existing tables and hence accommodates the variability of raw data.
The continuously increasing demand for organizing and analyzing semi-structured and unstructured data led to the emergence of data lakes [13] in about 2010. Data lakes are based on the idea of collecting raw data from the data sources and deciding at a later point how this data can be processed and analyzed. This leads to an Extract-Load-Transform (ELT) process, where the data is first extracted and loaded into the data lake and subsequently prepared and transformed in order to make it accessible for different types of analytics applications. As a result, data lakes manage not only preprocessed and pre-aggregated data, but also raw data, which allows to increase the efficiency of re-occurring analyses while still maintaining a high level of flexibility. As indicated on the right-hand side of Fig. 1, data lakes typically impose a polyglot architecture, in which several different systems for data storage and data processing are utilized, including relational and NoSQL databases, distributed file systems, batch and stream processing engines and event hubs. By applying zone models, the architecture is commonly divided into zones that reflect different degrees of data processing and governance policies [14]. Instead of proprietary file formats, data lakes tend to leverage open file formats, such as Apache Parquet⁶ or Apache ORC⁷. These formats enable tabular data representations and provide further optimizations in terms of data compression and query processing. These aspects and the possibility to directly access the data on the underlying storage systems enable the execution of data mining and machine learning applications on top of data lakes. By integrating stream processing and stream storage systems, such as Apache Spark and Apache Kafka⁸, respectively, into the architecture and by applying well-established architecture patterns like the Lambda [15] or Kappa [16] architecture, data lakes are also suitable for near-realtime reporting and streaming analytics.
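As a hedged illustration of the ELT pattern and of direct access via open formats, the following PySpark sketch lands raw events unchanged and derives a harmonized Parquet dataset later; the bucket layout and zone names are assumptions modeled on common zone models:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("elt-sketch").getOrCreate()

# Extract & Load: persist the raw events unchanged in the lake's raw zone.
raw = spark.read.json("s3a://lake/landing/sensor-events/")
raw.write.mode("append").parquet("s3a://lake/raw/sensor-events/")

# Transform (later, on demand): derive a cleansed, harmonized dataset.
harmonized = (spark.read.parquet("s3a://lake/raw/sensor-events/")
              .filter(F.col("temperature").isNotNull())
              .withColumn("event_date", F.to_date("event_time")))
harmonized.write.mode("overwrite").parquet("s3a://lake/harmonized/sensor-events/")

# Because Parquet is an open format, other tools can read the same files
# directly, e.g.: pandas.read_parquet("s3://lake/harmonized/sensor-events/")
```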
Due to these complementary alignments of data warehouses and data lakes, enterprises tend to employ complex analytics architectures in which both types of data platforms are operated in parallel. This approach commonly results in several shortcomings [4], such as data replication across multiple storage systems and the need for continuously transferring, transforming and synchronizing the data between the involved data platforms, which likely leads to high operational costs and inconsistent or erroneous data. In addition, the necessary movement of data extends the time until analysis results are available. Vendors of various data management tools have recognized these problems and recently developed products that pursue to close the gap between data warehouses and data lakes: On the one hand, modern and possibly cloud-based data warehouses like Snowflake are evolving in order to support the management of unstructured data, the stream ingestion of near-realtime data, as well as the querying of data that is stored in open formats on external, third-party storage systems. On the other hand, frameworks and query engines like Apache Hudi, Apache Iceberg, Dremio and Trino⁹ are emerging that can be used to enhance data lakes with typical features of data warehouses and hence make analyses more convenient. This observable convergence of data warehouses and data lakes contributed to the coining of the term "lakehouse" and its underlying vision.

⁶ https://parquet.apache.org
⁷ https://orc.apache.org
⁸ https://kafka.apache.org
⁹ https://trino.io

3. The Lakehouse Paradigm

Although there is widespread agreement that lakehouses represent amalgamations of data warehouses and data lakes, different opinions exist in the literature about what the architecture of lakehouses should look like and which characteristics these data platforms must necessarily possess. For example, many authors consider lakehouses as integrated data platforms that are based on directly accessible storage, such as distributed file systems or object storages, and can also provide typical features of data warehouses like ACID transactions [4]. However, others argue that a two-tier architecture consisting of self-contained data warehouses and data lakes that are potentially connected by an integration layer for unified data access can also constitute a lakehouse [17]. In our work [5], we assessed different views and definitions of the lakehouse paradigm and finally derived a new definition that reflects the additional value of lakehouses for enterprises in comparison to conventional data platforms. From our perspective, lakehouses are beneficial for enterprises when they contribute to simplifying enterprise analytics architectures by providing a single source of truth, limiting the variety of involved technologies and hence reducing the number of required data movement and transformation processes. Accordingly, we define a lakehouse as an "integrated data platform that leverages the same storage type and data format for reporting and OLAP, data mining and machine learning, as well as streaming workloads" [5]. Fig. 2 illustrates how such a data platform may look. First of all, the term "integrated platform" expresses that a lakehouse should not be considered a loose amalgamation of standalone data warehouses and data lakes, but rather a single, self-contained data platform. Limiting the architecture to one type of storage, e.g. to a distributed file system, and one data format, e.g. to Apache Parquet, eliminates the need for additional data movement and transformation processes within the lakehouse and therefore reduces the complexity and error-proneness of the overall architecture. Furthermore, it supports the formation of a single source of truth, as the same data may no longer be replicated between different systems with varying characteristics. Finally, the definition emphasizes that lakehouses must support all typical analytical workloads of data warehouses and data lakes, so that data analysts and data scientists can use a lakehouse instead of the former data platforms.

Figure 2: Example of a lakehouse that uses the HDFS as storage system and Apache Parquet as data format.

Based on this definition and the characteristics of the workloads mentioned therein, we derived a total of eight technical requirements that lakehouses should fulfill [5]:

R1: Same type of storage and data format. Lakehouses must employ only a single type of storage for all data and metadata and use only one format for the data.

R2: CRUD for all types of data. Lakehouses must support the ingestion, retrieval, updating and deletion of
all kinds of data at least on the level of data collections.

R3: Relational data collections. Lakehouses must provide means to abstract from the stored data files and to represent them as cohesive data collections with relational properties on the logical level.

R4: Query language. Lakehouses must offer a declarative, structured query language that allows to query the data in a relational manner.

R5: Consistency guarantees. Lakehouses must provide consistency guarantees for the data, such as schema validation, which can either be enforced on data ingestion or when the data is queried.

R6: Isolation and atomicity. Similar to relational database systems, lakehouses must provide isolation and atomicity for data operations in order to ensure the consistency of the data and to support concurrency.

R7: Direct read access. Lakehouses must provide direct access to the data and metadata on the underlying storage system and must employ open data formats only.

R8: Unified batch and stream processing. Lakehouses must support record-wise data operations in near-realtime and allow to treat data collections as sources and sinks for batch and stream processing.

These requirements can be achieved in various ways, for example by opening existing data warehouses and driving them into the direction of data lakes or by developing technologies that enhance data lakes with common features and characteristics of data warehouses.
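For illustration, the sketch below uses the open-source delta-spark package to show how such a framework can address R2 to R4 on top of plain lake storage: data files become a relational data collection (R3) that supports a declarative query language (R4) and record-level updates and deletes (R2). The table path and schema are invented for the example, and the session configuration follows the Delta Lake documentation:

```python
from pyspark.sql import SparkSession

# A Spark session with the Delta Lake extensions enabled (per the Delta docs).
spark = (SparkSession.builder.appName("lakehouse-sketch")
         .config("spark.sql.extensions",
                 "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

path = "/lake/tables/machines"  # hypothetical location on the lake storage

# Ingest a batch as a relational data collection backed by Parquet files.
spark.createDataFrame([(1, "milling", 42.0), (2, "welding", 17.5)],
                      ["machine_id", "type", "hours"]) \
     .write.format("delta").save(path)

# Query, update and delete records declaratively via SQL.
spark.sql(f"UPDATE delta.`{path}` SET hours = hours + 1 WHERE machine_id = 1")
spark.sql(f"DELETE FROM delta.`{path}` WHERE type = 'welding'")
spark.sql(f"SELECT machine_id, hours FROM delta.`{path}`").show()
```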
4. Transitioning from Data Lakes to Lakehouses

In the course of our evaluation of several data management tools [5], frameworks for data lakes like Delta Lake, Apache Hudi and Apache Iceberg appeared to be particularly promising for the fulfillment of the aforementioned requirements and thus for the construction of lakehouses. These frameworks basically act as libraries for highly scalable batch and stream processing engines, such as Apache Spark¹⁰ or Apache Flink¹¹, and implement data access protocols that control how these engines read data from and write data to storage systems (cf. [18]). In addition, they manage technical metadata, which allows them to represent datasets as relational data collections and track additions, updates and deletions of data.

¹⁰ https://spark.apache.org
¹¹ https://flink.apache.org

Figure 3: Typical architecture of a data lake that can transition to a lakehouse by adding a corresponding framework.

Fig. 3 shows the conceptual architecture of a data lake as it can often be encountered in practice. It essentially consists of a storage system, which can be either a distributed file system or an object storage, and which persists the data as data files in an open file format. A batch and stream processing engine can read data from the storage system, process it and then write the results back to the storage system. Hence, the data lake is supposed to store the raw data next to pre-processed and aggregated data. This processing engine is also used to ingest data, and data analysts can leverage it in order to query the data via a query language like SQL. For data mining and machine learning, data science applications can directly access the data on the storage system. Without the lakehouse framework that is depicted in Fig. 3, the data lake would already satisfy the requirements R1, R2, R3, R4, and R7. R3 is satisfied because many processing engines like Apache Spark already enable relational data abstraction, so that multiple data files that reside on the storage system can collectively represent the contents of a table. R5 and R6 are not met, since processing engines
usually do not provide means for enforcing the internal consistency of a table, nor do they guarantee atomicity and isolation when performing operations on the data. Although processing engines like Apache Spark generally support the batch and stream processing of data that resides on a distributed file system or object storage, R8 is often not met, because especially engines that apply micro-batching are often not optimized for simple data operations that occur at high frequency, which results in the creation of many small data files when streaming data needs to be materialized. This high number of data files prevents the efficient querying of data, as many files have to be read and consolidated [18]. To solve this issue, a dedicated stream storage system, such as Apache Kafka, could be leveraged, but this would in turn increase the complexity of the data lake and in particular violate R1, as it represents another type of storage system.
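The small-file problem and its mitigation can be sketched with Delta Lake's compaction support; this is a hedged example in which the rate source merely generates synthetic rows, the paths are invented, and the OPTIMIZE statement assumes a recent delta-spark version:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Delta-enabled session as above

# Micro-batched streaming ingestion: each trigger materializes a batch,
# which over time accumulates many small data files in the table.
events = (spark.readStream.format("rate").load()
          .selectExpr("value AS machine_id", "timestamp AS event_time"))
query = (events.writeStream.format("delta")
         .option("checkpointLocation", "/lake/checkpoints/events")
         .trigger(processingTime="10 seconds")
         .start("/lake/tables/events"))

# Compaction (run periodically): rewrite the many small files into fewer,
# larger ones so that queries no longer have to consolidate them (cf. R8).
spark.sql("OPTIMIZE delta.`/lake/tables/events`")
```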
When integrating a lakehouse framework into the processing engine, the previously unmet requirements R5, R6, and R8 can be satisfied [5]: As these frameworks provide means for enforcing the inner consistency of data collections, such as schema validation and constraint checking, R5 can be fulfilled. Furthermore, they use the collected technical metadata in order to implement data access protocols that achieve atomicity and at least snapshot isolation [19] via multi-version concurrency control [20] (cf. R6). By offering various optimizations, such as different table types that are either designed for frequent reads or writes, as well as compaction techniques for data and metadata, these frameworks avoid the creation of many small data and metadata files and hence increase the efficiency of stream processing (cf. R8).
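Both mechanisms can be illustrated with delta-spark, continuing the hypothetical machines table from the earlier sketch: schema validation rejects non-conforming appends (R5), and multi-version concurrency control keeps every committed snapshot addressable (cf. R6):

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Delta-enabled session as above
path = "/lake/tables/machines"              # hypothetical table from above

# Schema validation (R5): an append whose schema conflicts with the table
# (here, a string instead of a numeric hours column) is rejected on write.
bad = spark.createDataFrame([(3, "drilling", "broken")],
                            ["machine_id", "type", "hours"])
try:
    bad.write.format("delta").mode("append").save(path)
except Exception as err:
    print("append rejected:", type(err).__name__)

# MVCC (cf. R6): every commit yields a new table version; older snapshots
# stay readable, which also enables time travel queries.
DeltaTable.forPath(spark, path).history().select("version", "operation").show()
spark.read.format("delta").option("versionAsOf", 0).load(path).show()
```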
5. Conclusion

By assessing the properties of a typical data lake architecture and comparing them to requirements that are relevant for lakehouses, it became apparent that such an architecture lacks consistency guarantees, atomicity and isolation for data operations, as well as optimizations for stream processing in order to complete the transition to a lakehouse. While the lakehouse approach looks promising, its concepts and technologies have not reached maturity yet and hence require further research, for example in terms of data modeling and the suitability of different architectures.

References

[1] O. Vermesan, P. Friess, Internet of Things: Converging Technologies for Smart Environments and Integrated Ecosystems, River Publishers, 2013.
[2] DAMA International, DAMA-DMBOK: Data Management Body of Knowledge, second ed., Technics Publications, 2017.
[3] C. Gröger, Industrial Analytics – An Overview, it - Information Technology 64 (2022) 55–65.
[4] M. Armbrust, A. Ghodsi, R. Xin, et al., Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics, in: 11th CIDR, 2021.
[5] J. Schneider, C. Gröger, A. Lutsch, et al., Assessing the Lakehouse: Analysis, Requirements and Definition, in: Proceedings of the 25th International Conference on Enterprise Information Systems (ICEIS), 2023, pp. 44–56.
[6] E. F. Codd, A Relational Model of Data for Large Shared Data Banks, Communications of the ACM 13 (1970) 377–387.
[7] W. H. Inmon, Building the Data Warehouse, John Wiley & Sons, 2005.
[8] R. Kimball, M. Ross, The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, third ed., John Wiley & Sons, 2013.
[9] A. Bauer, H. Günzel, Data-Warehouse-Systeme: Architektur, Entwicklung, Anwendung, dpunkt.verlag, 2013.
[10] H. Baars, H.-G. Kemper, Business Intelligence & Analytics, fourth ed., Springer Vieweg, 2021.
[11] T. Akidau, S. Chernyak, R. Lax, Streaming Systems: The What, Where, When, and How of Large-Scale Data Processing, O’Reilly Media, 2018.
[12] D. Linstedt, M. Olschimke, Building a Scalable Data Warehouse with Data Vault 2.0, Elsevier Science & Technology Books, 2015.
[13] C. Giebler, C. Gröger, E. Hoos, et al., Leveraging the Data Lake: Current State and Challenges, in: Big Data Analytics and Knowledge Discovery, Springer International Publishing, 2019.
[14] C. Giebler, C. Gröger, E. Hoos, et al., A Zone Reference Model for Enterprise-Grade Data Lake Management, in: 24th International Enterprise Distributed Object Computing Conference (EDOC), 2020.
[15] J. Warren, N. Marz, Big Data: Principles and Best Practices of Scalable Realtime Data Systems, Simon and Schuster, 2015.
[16] J. Kreps, Questioning the Lambda Architecture, 2014. URL: https://www.oreilly.com/radar/questioning-the-lambda-architecture/.
[17] D. Oreščanin, T. Hlupić, Data Lakehouse - A Novel Step in Analytics Architecture, in: 44th International Convention on Information, Communication and Electronic Technology (MIPRO), 2021.
[18] M. Armbrust, T. Das, L. Sun, et al., Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores, Proc. VLDB Endow. 13 (2020).
[19] G. Weikum, G. Vossen, Transactional Information Systems, Elsevier, 2001.
[20] P. Jain, P. Kraft, C. Power, et al., Analyzing and Comparing Lakehouse Storage Systems, in: CIDR, 2023.