=Paper=
{{Paper
|id=Vol-3714/invited2
|storemode=property
|title=The Data Platform Evolution: From Data Warehouses over Data Lakes to Lakehouses
|pdfUrl=https://ceur-ws.org/Vol-3714/invited2.pdf
|volume=Vol-3714
|authors=Jan Schneider,Christoph Gröger,Arnold Lutsch
|dblpUrl=https://dblp.org/rec/conf/gvd/SchneiderGL23
}}
==The Data Platform Evolution: From Data Warehouses over Data Lakes to Lakehouses==
Jan Schneider¹, Christoph Gröger² and Arnold Lutsch²
¹ Institute for Parallel and Distributed Systems, University of Stuttgart, Universitätsstraße 38, 70569 Stuttgart, Germany
² Robert Bosch GmbH, Borsigstraße 4, 70469 Stuttgart, Germany
Abstract
The continuously increasing availability of data and the growing maturity of data-driven analysis techniques have encouraged enterprises to collect and analyze huge amounts of business-relevant data in order to exploit it for competitive advantages. To facilitate these processes, various platforms for analytical data management have been developed: While data warehouses have traditionally been used by business analysts for reporting and OLAP, data lakes emerged as an alternative concept that also supports advanced analytics. As these two common types of data platforms show rather contrary characteristics and target different user groups and analytical approaches, enterprises usually need to employ both of them, resulting in complex, error-prone and costly architectures. To address these issues, efforts have recently become apparent to combine features of data warehouses and data lakes into so-called lakehouses, which aim to serve all kinds of analytics from a single data platform. This paper provides an overview of the evolution of analytical data platforms from data warehouses over data lakes to lakehouses and elaborates on the vision and characteristics of the latter. Furthermore, it addresses the question of what aspects common data lakes are currently missing that prevent them from transitioning to lakehouses.
Keywords
Lakehouse, Data Warehouse, Data Lake, Data Management, Data Analytics
1. Introduction

Within the course of the digital transformation of society and economy, the importance of data for enterprises is continuously growing. Due to the ever-increasing affordability of smart devices and sensors in the scope of the Internet of Things [1], as well as a wide range of other upcoming technologies for capturing data about products, shop floors, suppliers, customers and other entities, enterprises have gained manifold opportunities for collecting business-related data along their value chains. By leveraging data-driven analysis techniques, this data can be exploited for evaluating and optimizing products and business processes and hence constitutes a key factor for continuous development and improvement. However, in order to derive valuable insights and knowledge from huge amounts of collected data, this data needs to be organized and prepared in a systematic manner, along with metadata that describes the context in which the data was created and processed [2]. Platforms for analytical data management can support these tasks, as they are specifically developed for the storage, management, processing and provisioning of data from all types of data sources that is supposed to be made available for different types of analytics applications [3]. In practice, especially the traditional data warehouses and the more recent data lakes have become the predominant types of data platforms. With so-called lakehouses, a supposedly new kind of data platform has recently attracted attention: They are driven by the vision of combining the characteristics and features of data warehouses and data lakes, which are perceived as complementary, into integrated data platforms. With the prospect of being able to serve all kinds of analytical workloads from one universally applicable platform, lakehouses promise to simplify and improve existing enterprise analytics architectures, which commonly need to operate data warehouses and data lakes in parallel and hence suffer from high operational costs, slow analytical processes, as well as a low trustworthiness of analysis results [4]. Over the past years, a variety of technologies have emerged or evolved with the intention to address these issues and hence to enable the construction of lakehouse-like data platforms, such as Delta Lake¹, Dremio² or Snowflake³. As indicated by our evaluation of several data management tools [5], frameworks that operate on top of data lakes and aim to enhance them with typical features of data warehouses appear to be particularly promising in this regard, including Delta Lake, Apache Hudi⁴ and Apache Iceberg⁵. This paper first provides an overview of the evolution of data platforms and explains the vision behind the lakehouse paradigm. Section 3 then elaborates on the characteristics of lakehouses, which are compared to the architecture of a typical data lake in Section 4. This way, several aspects are identified that conventional data lakes need to address in order to complete the transition to lakehouses.
GvDB’23: 34th Workshop on Foundations of Database Systems, June 07–09, 2023, Hirsau, Germany
{firstname.lastname}@ipvs.uni-stuttgart.de (J. Schneider); {firstname.lastname}@de.bosch.com (C. Gröger); {firstname.lastname}@de.bosch.com (A. Lutsch)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073).

¹ https://delta.io
² https://www.dremio.com
³ https://www.snowflake.com
⁴ https://hudi.apache.org
⁵ https://iceberg.apache.org
2. Evolution of Data Platforms

Between 1960 and 1970, the first databases appeared and the relational data model [6] was developed. The purpose of these databases was primarily to provide data management capabilities for applications, and they were accordingly designed for workloads where rather simple read and write operations have to be performed on large datasets with high frequency. However, many of these databases are less suitable for analytics applications, where large amounts of historical data have to be sporadically analyzed with rather complex queries in order to derive insights and knowledge that can then be used for guiding business decisions. For this reason, data platforms for analytical data management have been developed, which support the systematic long-term storage, management and querying of data for analytical purposes.

Figure 1: Comparison of typical high-level architectures of data warehouses (left) and data lakes (right).

Data warehouses [7] represent the most established type of analytical data platform and emerged from relational database systems in the 1980s. They are primarily designed for the management of structured data, impose well-defined and possibly multi-dimensional data models [8], often provide ACID guarantees and tend to offer features that go beyond those of conventional relational databases, such as time travel and data governance capabilities. The left side of Fig. 1 shows the common architecture of a data warehouse, based on the reference architecture by Bauer and Günzel [9]. Data warehouses are typically designed specifically for a given application scenario and employ an Extract-Transform-Load (ETL) process, where the data is first extracted from the data sources, then prepared and transformed into the target schema in a dedicated data staging area and finally loaded into the core data warehouse, which is responsible for the long-term storage of all data. While the data staging area can leverage different types of storage systems, such as relational or NoSQL databases, the core data warehouse typically relies on relational databases. Due to the large amount of data that resides in the core data warehouse, it can be reasonable to extract parts of the data and to make it available in dependent data marts [10], which then allow to speed up downstream analyses. For example, some data marts may be based on relational databases and optimized for reporting, while other data marts employ multi-dimensional databases in order to support Online Analytical Processing (OLAP) [10]. By using appropriate query languages, data analysts can perform their analyses either on individual data marts or directly on the core data warehouse. As data warehouses employ complex, static data models, store pre-processed data instead of the raw data and leverage proprietary data formats that impede direct data access, they are mostly suited for analysis questions that are already known in advance and provide only very limited support for data mining and machine learning. Moreover, since data warehouses are primarily optimized for the batch processing of huge amounts of data, they can barely be used for streaming applications [11] that rely on the near-realtime execution of simple data operations with high frequency. With the goal of making data warehouses more flexible, there have been various attempts to enable the storage of structured raw data. For example, data vault [12] represents a data modeling approach that facilitates the easy incorporation of changes to the data schema without requiring adjustments to the structure of existing tables and hence accommodates the variability of raw data.
The continuously increasing demand for organizing and analyzing semi-structured and unstructured data led to the emergence of data lakes [13] in about 2010. Data lakes are based on the idea of collecting raw data from the data sources and deciding at a later point how this data can be processed and analyzed. This leads to an Extract-Load-Transform (ELT) process, where the data is first extracted and loaded into the data lake and subsequently prepared and transformed in order to make it accessible for different types of analytics applications. As a result, data lakes manage not only preprocessed and pre-aggregated data, but also raw data, which allows to increase the efficiency of re-occurring analyses while still maintaining a high level of flexibility. As indicated on the right-hand side of Fig. 1, data lakes typically impose a polyglot architecture, in which several different systems for data storage and data processing are utilized, including relational and NoSQL databases, distributed file systems, batch and stream processing engines and event hubs. By applying zone models, the architecture is commonly divided into zones that reflect different degrees of data processing and governance policies [14]. Instead of proprietary file formats, data lakes tend to leverage open file formats, such as Apache Parquet⁶ or Apache ORC⁷. These formats enable tabular data representations and provide further optimizations in terms of data compression and query processing. These aspects and the possibility to directly access the data on the underlying storage systems enable the execution of data mining and machine learning applications on top of data lakes. By integrating stream processing and stream storage systems, such as Apache Spark and Apache Kafka⁸, respectively, into the architecture and by applying well-established architecture patterns like the Lambda [15] or Kappa [16] architecture, data lakes are also suitable for near-realtime reporting and streaming analytics.
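As a hedged illustration of the ELT pattern and of direct access via open formats, the following PySpark sketch lands raw events unchanged and derives a harmonized Parquet dataset later; the bucket layout and zone names are assumptions modeled on common zone models:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("elt-sketch").getOrCreate()

# Extract & Load: persist the raw events unchanged in the lake's raw zone.
raw = spark.read.json("s3a://lake/landing/sensor-events/")
raw.write.mode("append").parquet("s3a://lake/raw/sensor-events/")

# Transform (later, on demand): derive a cleansed, harmonized dataset.
harmonized = (spark.read.parquet("s3a://lake/raw/sensor-events/")
              .filter(F.col("temperature").isNotNull())
              .withColumn("event_date", F.to_date("event_time")))
harmonized.write.mode("overwrite").parquet("s3a://lake/harmonized/sensor-events/")

# Because Parquet is an open format, other tools can read the same files
# directly, e.g.: pandas.read_parquet("s3://lake/harmonized/sensor-events/")
```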
Due to these complementary alignments of data warehouses and data lakes, enterprises tend to employ complex analytics architectures in which both types of data platforms are operated in parallel. This approach commonly results in several shortcomings [4], such as data replication across multiple storage systems and the need for continuously transferring, transforming and synchronizing the data between the involved data platforms, which likely leads to high operational costs and inconsistent or erroneous data. In addition, the necessary movement of data extends the time until analysis results are available. Vendors of various data management tools have recognized these problems and recently developed products that pursue to close the gap between data warehouses and data lakes: On the one hand, modern and possibly cloud-based data warehouses like Snowflake are evolving in order to support the management of unstructured data, the stream ingestion of near-realtime data, as well as the querying of data that is stored in open formats on external, third-party storage systems. On the other hand, frameworks and query engines like Apache Hudi, Apache Iceberg, Dremio and Trino⁹ are emerging that can be used to enhance data lakes with typical features of data warehouses and hence make analyses more convenient. This observable convergence of data warehouses and data lakes contributed to the coining of the term "lakehouse" and its underlying vision.

⁶ https://parquet.apache.org
⁷ https://orc.apache.org
⁸ https://kafka.apache.org
⁹ https://trino.io

3. The Lakehouse Paradigm

Although there is widespread agreement that lakehouses represent amalgamations of data warehouses and data lakes, different opinions exist in the literature about what the architecture of lakehouses should look like and which characteristics these data platforms must necessarily possess. For example, many authors consider lakehouses as integrated data platforms that are based on directly accessible storage, such as distributed file systems or object storages, and can also provide typical features of data warehouses like ACID transactions [4]. However, others argue that a two-tier architecture consisting of self-contained data warehouses and data lakes that are potentially connected by an integration layer for unified data access can also constitute a lakehouse [17]. In our work [5], we assessed different views and definitions of the lakehouse paradigm and finally derived a new definition that reflects the additional value of lakehouses for enterprises in comparison to conventional data platforms. From our perspective, lakehouses are beneficial for enterprises when they contribute to simplifying enterprise analytics architectures by providing a single source of truth, limiting the variety of involved technologies and hence reducing the number of required data movement and transformation processes. Accordingly, we define a lakehouse as an "integrated data platform that leverages the same storage type and data format for reporting and OLAP, data mining and machine learning, as well as streaming workloads" [5]. Fig. 2 illustrates how such a data platform may look. First of all, the term "integrated platform" expresses that a lakehouse should not be considered a loose amalgamation of standalone data warehouses and data lakes, but rather a single, self-contained data platform. Limiting the architecture to one type of storage, e.g. to a distributed file system, and one data format, e.g. to Apache Parquet, eliminates the need for additional data movement and transformation processes within the lakehouse and therefore reduces the complexity and error-proneness of the overall architecture. Furthermore, it supports the formation of a single source of truth, as the same data may no longer be replicated between different systems with varying characteristics. Finally, the definition emphasizes that lakehouses must support all typical analytical workloads of data warehouses and data lakes, so that data analysts and data scientists can use a lakehouse instead of the former data platforms.

Figure 2: Example of a lakehouse that uses the HDFS as storage system and Apache Parquet as data format.

Based on this definition and the characteristics of the workloads mentioned therein, we derived a total of eight technical requirements that lakehouses should fulfill [5]:

R1: Same type of storage and data format. Lakehouses must employ only a single type of storage for all data and metadata and use only one format for the data.

R2: CRUD for all types of data. Lakehouses must support the ingestion, retrieval, updating and deletion of
all kinds of data at least on the level of data collections.

R3: Relational data collections. Lakehouses must provide means to abstract from the stored data files and to represent them as cohesive data collections with relational properties on the logical level.

R4: Query language. Lakehouses must offer a declarative, structured query language that allows to query the data in a relational manner.

R5: Consistency guarantees. Lakehouses must provide consistency guarantees for the data, such as schema validation, which can either be enforced on data ingestion or when the data is queried.

R6: Isolation and atomicity. Similar to relational database systems, lakehouses must provide isolation and atomicity for data operations in order to ensure the consistency of the data and to support concurrency.

R7: Direct read access. Lakehouses must provide direct access to the data and metadata on the underlying storage system and must employ open data formats only.

R8: Unified batch and stream processing. Lakehouses must support record-wise data operations in near-realtime and allow to treat data collections as sources and sinks for batch and stream processing.

These requirements can be achieved in various ways, for example by opening existing data warehouses and driving them into the direction of data lakes or by developing technologies that enhance data lakes with common features and characteristics of data warehouses.
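For illustration, the sketch below uses the open-source delta-spark package to show how such a framework can address R2 to R4 on top of plain lake storage: data files become a relational data collection (R3) that supports a declarative query language (R4) and record-level updates and deletes (R2). The table path and schema are invented for the example, and the session configuration follows the Delta Lake documentation:

```python
from pyspark.sql import SparkSession

# A Spark session with the Delta Lake extensions enabled (per the Delta docs).
spark = (SparkSession.builder.appName("lakehouse-sketch")
         .config("spark.sql.extensions",
                 "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

path = "/lake/tables/machines"  # hypothetical location on the lake storage

# Ingest a batch as a relational data collection backed by Parquet files.
spark.createDataFrame([(1, "milling", 42.0), (2, "welding", 17.5)],
                      ["machine_id", "type", "hours"]) \
     .write.format("delta").save(path)

# Query, update and delete records declaratively via SQL.
spark.sql(f"UPDATE delta.`{path}` SET hours = hours + 1 WHERE machine_id = 1")
spark.sql(f"DELETE FROM delta.`{path}` WHERE type = 'welding'")
spark.sql(f"SELECT machine_id, hours FROM delta.`{path}`").show()
```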
4. Transitioning from Data Lakes to Lakehouses

In the course of our evaluation of several data management tools [5], frameworks for data lakes like Delta Lake, Apache Hudi and Apache Iceberg appeared to be particularly promising for the fulfillment of the aforementioned requirements and thus for the construction of lakehouses. These frameworks basically act as libraries for highly scalable batch and stream processing engines, such as Apache Spark¹⁰ or Apache Flink¹¹, and implement data access protocols that control how these engines read data from and write data to storage systems (cf. [18]). In addition, they manage technical metadata, which allows them to represent datasets as relational data collections and track additions, updates and deletions of data.

¹⁰ https://spark.apache.org
¹¹ https://flink.apache.org

Figure 3: Typical architecture of a data lake that can transition to a lakehouse by adding a corresponding framework.

Fig. 3 shows the conceptual architecture of a data lake as it can often be encountered in practice. It essentially consists of a storage system, which can be either a distributed file system or an object storage, and which persists the data as data files in an open file format. A batch and stream processing engine can read data from the storage system, process it and then write the results back to the storage system. Hence, the data lake is supposed to store the raw data next to pre-processed and aggregated data. This processing engine is also used to ingest data, and data analysts can leverage it in order to query the data via a query language like SQL. For data mining and machine learning, data science applications can directly access the data on the storage system. Without the lakehouse framework that is depicted in Fig. 3, the data lake would already satisfy the requirements R1, R2, R3, R4, and R7. R3 is satisfied because many processing engines like Apache Spark already enable relational data abstraction, so that multiple data files that reside on the storage system can collectively represent the contents of a table. R5 and R6 are not met, since processing engines
usually do not provide means for enforcing the internal consistency of a table, nor do they guarantee atomicity and isolation when performing operations on the data. Although processing engines like Apache Spark generally support the batch and stream processing of data that resides on a distributed file system or object storage, R8 is often not met, because especially engines that apply micro-batching are often not optimized for simple data operations that occur at high frequency, which results in the creation of many small data files when streaming data needs to be materialized. This high number of data files prevents the efficient querying of data, as many files have to be read and consolidated [18]. To solve this issue, a dedicated stream storage system, such as Apache Kafka, could be leveraged, but this would in turn increase the complexity of the data lake and in particular violate R1, as it represents another type of storage system.
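The small-file problem and its mitigation can be sketched with Delta Lake's compaction support; this is a hedged example in which the rate source merely generates synthetic rows, the paths are invented, and the OPTIMIZE statement assumes a recent delta-spark version:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Delta-enabled session as above

# Micro-batched streaming ingestion: each trigger materializes a batch,
# which over time accumulates many small data files in the table.
events = (spark.readStream.format("rate").load()
          .selectExpr("value AS machine_id", "timestamp AS event_time"))
query = (events.writeStream.format("delta")
         .option("checkpointLocation", "/lake/checkpoints/events")
         .trigger(processingTime="10 seconds")
         .start("/lake/tables/events"))

# Compaction (run periodically): rewrite the many small files into fewer,
# larger ones so that queries no longer have to consolidate them (cf. R8).
spark.sql("OPTIMIZE delta.`/lake/tables/events`")
```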
When integrating a lakehouse framework into the processing engine, the previously unmet requirements R5, R6, and R8 can be satisfied [5]: As these frameworks provide means for enforcing the inner consistency of data collections, such as schema validation and constraint checking, R5 can be fulfilled. Furthermore, they use the collected technical metadata in order to implement data access protocols that achieve atomicity and at least snapshot isolation [19] via multi-version concurrency control [20] (cf. R6). By offering various optimizations, such as different table types that are either designed for frequent reads or writes, as well as compaction techniques for data and metadata, these frameworks avoid the creation of many small data and metadata files and hence increase the efficiency of stream processing (cf. R8).
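Both mechanisms can be illustrated with delta-spark, continuing the hypothetical machines table from the earlier sketch: schema validation rejects non-conforming appends (R5), and multi-version concurrency control keeps every committed snapshot addressable (cf. R6):

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Delta-enabled session as above
path = "/lake/tables/machines"              # hypothetical table from above

# Schema validation (R5): an append whose schema conflicts with the table
# (here, a string instead of a numeric hours column) is rejected on write.
bad = spark.createDataFrame([(3, "drilling", "broken")],
                            ["machine_id", "type", "hours"])
try:
    bad.write.format("delta").mode("append").save(path)
except Exception as err:
    print("append rejected:", type(err).__name__)

# MVCC (cf. R6): every commit yields a new table version; older snapshots
# stay readable, which also enables time travel queries.
DeltaTable.forPath(spark, path).history().select("version", "operation").show()
spark.read.format("delta").option("versionAsOf", 0).load(path).show()
```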
5. Conclusion

By assessing the properties of a typical data lake architecture and comparing them to requirements that are relevant for lakehouses, it became apparent that such an architecture lacks consistency guarantees, atomicity and isolation for data operations, as well as optimizations for stream processing in order to complete the transition to a lakehouse. While the lakehouse approach looks promising, its concepts and technologies have not reached maturity yet and hence require further research, for example in terms of data modeling and the suitability of different architectures.

References

[1] O. Vermesan, P. Friess, Internet of Things: Converging Technologies for Smart Environments and Integrated Ecosystems, River Publishers, 2013.
[2] DAMA International, DAMA-DMBOK: Data Management Body of Knowledge, second ed., Technics Publications, 2017.
[3] C. Gröger, Industrial Analytics – An Overview, it - Information Technology 64 (2022) 55–65.
[4] M. Armbrust, A. Ghodsi, R. Xin, et al., Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics, in: 11th CIDR, 2021.
[5] J. Schneider, C. Gröger, A. Lutsch, et al., Assessing the Lakehouse: Analysis, Requirements and Definition, in: Proceedings of the 25th International Conference on Enterprise Information Systems (ICEIS), 2023, pp. 44–56.
[6] E. F. Codd, A Relational Model of Data for Large Shared Data Banks, Communications of the ACM 13 (1970) 377–387.
[7] W. H. Inmon, Building the Data Warehouse, John Wiley & Sons, 2005.
[8] R. Kimball, M. Ross, The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, third ed., John Wiley & Sons, 2013.
[9] A. Bauer, H. Günzel, Data-Warehouse-Systeme: Architektur, Entwicklung, Anwendung, dpunkt.verlag, 2013.
[10] H. Baars, H.-G. Kemper, Business Intelligence & Analytics, fourth ed., Springer Vieweg, 2021.
[11] T. Akidau, S. Chernyak, R. Lax, Streaming Systems: The What, Where, When, and How of Large-Scale Data Processing, O’Reilly Media, 2018.
[12] D. Linstedt, M. Olschimke, Building a Scalable Data Warehouse with Data Vault 2.0, Elsevier Science & Technology Books, 2015.
[13] C. Giebler, C. Gröger, E. Hoos, et al., Leveraging the Data Lake: Current State and Challenges, in: Big Data Analytics and Knowledge Discovery, Springer International Publishing, 2019.
[14] C. Giebler, C. Gröger, E. Hoos, et al., A Zone Reference Model for Enterprise-Grade Data Lake Management, in: 24th International Enterprise Distributed Object Computing Conference (EDOC), 2020.
[15] J. Warren, N. Marz, Big Data: Principles and Best Practices of Scalable Realtime Data Systems, Simon and Schuster, 2015.
[16] J. Kreps, Questioning the Lambda Architecture, 2014. URL: https://www.oreilly.com/radar/questioning-the-lambda-architecture/.
[17] D. Oreščanin, T. Hlupić, Data Lakehouse - A Novel Step in Analytics Architecture, in: 44th International Convention on Information, Communication and Electronic Technology (MIPRO), 2021.
[18] M. Armbrust, T. Das, L. Sun, et al., Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores, Proc. VLDB Endow. 13 (2020).
[19] G. Weikum, G. Vossen, Transactional Information Systems, Elsevier, 2001.
[20] P. Jain, P. Kraft, C. Power, et al., Analyzing and Comparing Lakehouse Storage Systems, in: CIDR, 2023.