<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>it - Information Technology</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>The Data Platform Evolution: From Data Warehouses over Data Lakes to Lakehouses</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jan Schneider</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christoph Gröger</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arnold Lutsch</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute for Parallel and Distributed Systems, University of Stuttgart</institution>
          ,
          <addr-line>Universitätsstraße 38, 70569 Stuttgart</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Robert Bosch GmbH</institution>
          ,
          <addr-line>Borsigstraße 4, 70469 Stuttgart</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>64</volume>
      <issue>2022</issue>
      <abstract>
<p>The continuously increasing availability of data and the growing maturity of data-driven analysis techniques have encouraged enterprises to collect and analyze huge amounts of business-relevant data in order to exploit it for competitive advantages. To facilitate these processes, various platforms for analytical data management have been developed: While data warehouses have traditionally been used by business analysts for reporting and OLAP, data lakes emerged as an alternative concept that also supports advanced analytics. As these two common types of data platforms show rather contrary characteristics and target different user groups and analytical approaches, enterprises usually need to employ both of them, resulting in complex, error-prone and costly architectures. To address these issues, efforts have recently become apparent to combine features of data warehouses and data lakes into so-called lakehouses, which pursue the goal of serving all kinds of analytics from a single data platform. This paper provides an overview of the evolution of analytical data platforms from data warehouses over data lakes to lakehouses and elaborates on the vision and characteristics of the latter. Furthermore, it addresses the question of which aspects common data lakes are currently missing that prevent them from transitioning to lakehouses.</p>
      </abstract>
      <kwd-group>
<kwd>Lakehouse</kwd>
        <kwd>Data Warehouse</kwd>
        <kwd>Data Lake</kwd>
        <kwd>Data Management</kwd>
        <kwd>Data Analytics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>1. Introduction</title>
      <p>Within the course of the digital transformation of society and economy, the importance of data for enterprises is continuously growing. Due to the ever-increasing affordability of smart devices and sensors in the scope of the Internet of Things [1], as well as a wide range of other upcoming technologies for capturing data about products, shop floors, suppliers, customers and other entities, enterprises have gained manifold opportunities for collecting business-related data along their value chains. By leveraging data-driven analysis techniques, this data can be exploited for evaluating and optimizing products and business processes and hence constitutes a key factor for continuous development and improvement. However, in order to be able to derive valuable insights and knowledge from huge amounts of collected data, this data needs to be organized and prepared in a systematic manner, along with metadata that describes the context in which the data was created and processed [2]. Platforms for analytical data management can support these tasks, as they are specifically developed for the storage, management, processing and provisioning of data from all types of data sources that is supposed to be made available for different types of analytics applications [3]. In practice, especially the traditional data warehouses and the more recent data lakes have become the predominant types of data platforms. With so-called lakehouses, a supposedly new kind of data platform has recently attracted attention: They are driven by the vision of combining the characteristics and features of data warehouses and data lakes, which are perceived as complementary, into integrated data platforms. With the prospect of being able to serve all kinds of analytical workloads from one universally applicable platform, lakehouses promise to simplify and improve existing enterprise analytics architectures, which commonly needed to operate data warehouses and data lakes in parallel and hence suffered from high operational costs, slow analytical processes, as well as a low trustworthiness of analysis results [4]. Over the past years, a variety of technologies have emerged or evolved with the intention to address these issues and hence to enable the construction of lakehouse-like data platforms, such as Delta Lake1, Dremio2 or Snowflake3. As indicated by our evaluation of several data management tools [5], frameworks that operate on top of data lakes and pursue to enhance them with typical features of data warehouses appear to be particularly promising in this regard, including Delta Lake, Apache Hudi4 and Apache Iceberg5. This paper first provides an overview of the evolution of data platforms and explains the vision behind the lakehouse paradigm. Section 3 then elaborates on the characteristics of lakehouses, which are compared to the architecture of a typical data lake in Section 4. This way, several aspects are identified that conventional data lakes need to address in order to be able to complete the transition to lakehouses.</p>
    </sec>
    <sec id="sec-evolution">
      <title>2. Evolution of Data Platforms</title>
      <p>Between 1960 and 1970, the first databases appeared and the relational data model [6] was developed. The purpose of these databases was primarily to provide data management capabilities for applications; they were accordingly designed for workloads where rather simple read and write operations have to be performed on large datasets with high frequency. However, many of these databases are less suitable for analytics applications, where large amounts of historical data have to be sporadically analyzed with rather complex queries in order to derive insights and knowledge that can then be used for guiding business decisions. For this reason, data platforms for analytical data management have been developed, which support the systematic long-term storage, management and querying of data for analytical purposes.</p>
      <p>Data warehouses [7] represent the most established type of analytical data platform and emerged from relational database systems in the 1980s. They are primarily designed for the management of structured data, impose well-defined and possibly multi-dimensional data models [8], often provide ACID guarantees and tend to offer features that go beyond those of conventional relational databases, such as time travel and data governance capabilities. The left side of Fig. 2 shows the common architecture of a data warehouse, based on the reference architecture by Bauer and Günzel [9]. Data warehouses are typically designed specifically for a given application scenario and employ an Extract-Transform-Load (ETL) process, where the data is first extracted from the data sources, then prepared and transformed into the target schema in a dedicated data staging area, and finally loaded into the core data warehouse, which is responsible for the long-term storage of all data. While the data staging area can leverage different types of storage systems, such as relational or NoSQL databases, the core data warehouse typically relies on relational databases. Due to the large amount of data that resides in the core data warehouse, it can be reasonable to extract parts of the data and make it available in dependent data marts [10], which then allow to speed up downstream analyses. For example, some data marts may be based on relational databases and optimized for reporting, while other data marts employ multi-dimensional databases in order to support Online Analytical Processing (OLAP) [10]. By using appropriate query languages, data analysts can perform their analyses either on individual data marts or directly on the core data warehouse. As data warehouses employ complex, static data models, store pre-processed data instead of the raw data and leverage proprietary data formats that impede direct data access, they are mostly suited for analysis questions that are already known in advance and provide only very limited support for data mining and machine learning. Moreover, since data warehouses are primarily optimized for the batch processing of huge amounts of data [11], they can only barely meet the latency requirements of frequent query executions.</p>
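<p>The ETL flow described above can be pictured in a few lines of Python. The sketch is purely illustrative and not part of the architectures discussed in this paper: all function names are hypothetical, and the "core warehouse" is only an in-memory list standing in for a relational database.</p>

```python
# Illustrative ETL sketch: extract raw records, transform them in a
# "staging area", then load them into a (purely in-memory) core warehouse.
# All names are hypothetical; a real core warehouse would be a relational DBMS.

def extract(source):
    # Extraction: pull raw records from an operational source system.
    return list(source)

def transform(staging):
    # Transformation in the staging area: cleanse and map to the target schema.
    out = []
    for rec in staging:
        if rec.get("price") is None:
            continue  # discard incomplete records during cleansing
        out.append({"product": rec["name"].strip().lower(),
                    "revenue": float(rec["price"]) * int(rec.get("qty", 1))})
    return out

def load(core_warehouse, rows):
    # Loading: append the transformed rows to the long-term store.
    core_warehouse.extend(rows)
    return core_warehouse

source = [{"name": " Widget ", "price": "9.5", "qty": 3},
          {"name": "Gadget", "price": None}]
warehouse = load([], transform(extract(source)))
```

Note that the transformation happens before loading: only data that already conforms to the target schema reaches the warehouse, which is exactly the property that the ELT approach of data lakes later relaxes.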
<p>Figure 2: Comparison of typical high-level architectures of data warehouses (left) and data lakes (right).</p>
      <p>With the goal of making data warehouses more flexible, there have been various attempts to enable the storage of structured raw data. For example, data vault [12] represents a data modeling approach that facilitates the easy incorporation of changes to the data schema without requiring adjustments to the structure of existing tables and hence accommodates the variability of raw data.</p>
      <p>The continuously increasing demand for organizing and analyzing semi-structured and unstructured data led to the emergence of data lakes [13] in about 2010. Data lakes are based on the idea of collecting raw data from the data sources and deciding at a later point how this data can be processed and analyzed.</p>
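<p>This load-first, transform-later idea can likewise be sketched with a small toy example. Again, the sketch is purely illustrative and all names are hypothetical: raw records, including corrupt ones, are kept unchanged in the lake, and a prepared view is derived only on demand.</p>

```python
# Illustrative ELT sketch: raw data is loaded into the lake unchanged and
# only transformed later, when a concrete analysis needs it. The "lake" is
# just a dict of named datasets; all names are hypothetical.

lake = {}

def load_raw(lake, dataset, records):
    # Load step: store the records exactly as delivered by the source.
    lake[dataset] = list(records)

def transform_on_demand(lake, dataset):
    # Transform step, deferred: derive a prepared view from the raw data.
    prepared = []
    for rec in lake[dataset]:
        if isinstance(rec, dict) and "temp_c" in rec:
            prepared.append({"sensor": rec.get("id", "unknown"),
                             "temp_f": rec["temp_c"] * 9 / 5 + 32})
    return prepared

# Corrupt or schema-less records are kept as well; they may become useful later.
load_raw(lake, "sensors_raw",
         [{"id": "s1", "temp_c": 20}, "corrupt line", {"note": "no reading"}])
view = transform_on_demand(lake, "sensors_raw")
```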
      <p>Data warehouses [7] represent the most established Extract-Load-Transform (ELT) process, where the data
type of analytical data platform and emerged from rela- is first extracted and load into the data lake and
subsetional database systems in the 1980s. They are primarily quently prepared and transformed in order to make it
designed for the management of structured data, impose accessible for diferent types of analytics applications. As
well-defined and possibly multi-dimensional data mod- a result, data lakes manage not only preprocessed and
els [8], often provide ACID guarantees and tend to ofer pre-aggregated data, but also raw data, which allows to
3. The Lakehouse Paradigm
increase the eficiency of re-occurring analyses while still
maintaining a high level of flexibility. As indicated on
the right-hand side of Fig. 2, data lakes typically impose Although there is a widespread agreement that
lakea polyglot architecture, in which several diferent sys- houses represent amalgamations of data warehouses and
tems for data storage and data processing are utilized, data lakes, diferent opinions in literature exist about how
including relational and NoSQL databases, distributed file the architecture of lakehouses should look like and what
systems, batch and stream processing engines and event characteristics these data platforms must necessarily
poshubs. By applying zone models, the architecture is com- sess. For example, many authors consider lakehouses
monly divided into zones that reflect diferent degrees as integrated data platforms that are based on
directlyof data processing and governance policies [14]. Instead accessible storage, such as distributed file systems or
of proprietary file formats, data lakes tend to leverage object storages and can also provide typical features of
open file formats, such as Apache Parquet 6 or Apache data warehouses like ACID transactions [4]. However,
ORC7. These formats enable tabular data representations others argue that a two-tier architecture consisting of
and provide further optimizations in terms of data com- self-contained data warehouses and data lakes that are
pression and query processing. These aspects and the potentially connected by an integration layer for unified
possibility to directly access the data on the underlying data access can also constitute a lakehouse [17]. In our
storage systems enable the execution of data mining and work [5], we assessed diferent views and definitions of
machine learning applications on top of data lakes. By in- the lakehouse paradigm and finally derived a new
definitegrating stream storage and stream processing systems, tion that reflects the additional value of lakehouses for
such as Apache Spark and Apache Kafka8, respectively, enterprises in comparison to conventional data platforms.
into the architecture and by applying well-established From our perspective, lakehouses are beneficial for
enarchitecture patterns like the Lambda [15] or Kappa [16] terprises when they contribute to simplifying enterprise
architecture, data lakes are also suitable for near-realtime analytics architectures by providing a single source of
reporting and streaming analytics. truth, limiting the variety of involved technologies and</p>
      <p>Due to these complementary alignments of data ware- hence reducing the number of required data movement
houses and data lakes, enterprises tend to employ com- and transformation processes. Accordingly, we define a
plex analytics architectures in which both types of data lakehouse as "integrated data platform that leverages the
platforms are operated in parallel This approach com- same storage type and data format for reporting and OLAP,
monly results in several shortcomings [4], such as data data mining and machine learning, as well as streaming
replication across multiple storage systems and the need workloads." [5]. Fig. 3 illustrates how such a data platform
for continuously transferring, transforming and synchro- may look like. First of all, the term "integrated platform"
nizing the data between the involved data platforms, expresses that a lakehouse should not be considered as a
which likely leads to high operational costs and inconsis- loose amalgamation of standalone data warehouses and
tent or erroneous data. In addition, the necessary move- data lakes, but rather as a single, self-contained data
platment of data extends the time until analysis results are form. Limiting the architecture to one type of storage, e.g.
available. Vendors of various data management tools to a distributed file system, and one data format, e.g. to
have recognized these problems and recently developed Apache Parquet, eliminates the need for additional data
products that pursue to close the gap between data ware- movement and transformation processes within the
lakehouses and data lakes: On the one hand, modern and house and therefore reduces the complexity and
errorpossibly cloud-based data warehouses like Snowflake are proneness of the overall architecture. Furthermore, it
evolving in order to support the management of unstruc- supports the formation of a single source of truth, as
tured data, the stream ingestion of near-realtime data, as the same data may no longer be replicated between
difwell the querying of data that is stored in open formats ferent systems with varying characteristics. Finally, the
on external, third-party storage systems. On the other definition emphasizes that lakehouses must support all
hand, frameworks and query engines like Apache Hudi, typical analytical workloads of data warehouses and data
Apache Iceberg, Dremio and Trino9 are emerging that lakes, so that data analysts and data scientists can use a
can be used to enhance data lakes by typical features of lakehouse instead of the former data platforms.
data warehouses and hence make analyses more conve- Based on this definition and the characteristics of the
nient. This observable convergence of data warehouses workloads mentioned therein, we derived a total of eight
and data lakes contributed to the coining of the term technical requirements that lakehouses should fulfill [ 5]:
"lakehouse" and its underlying vision. R1: Same type of storage and data format
Lakehouses must employ only a single type of storage for all
data and metadata and use only one format for the data.</p>
<p>R2: CRUD for all types of data. Lakehouses must support the ingestion, retrieval, updating and deletion of all kinds of data, at least on the level of data collections.</p>
<p>R3: Relational data collections. Lakehouses must provide means to abstract from the stored data files and to represent them as cohesive data collections with relational properties on the logical level.</p>
      <p>R4: Query language. Lakehouses must offer a declarative, structured query language that allows to query the data in a relational manner.</p>
      <p>R5: Consistency guarantees. Lakehouses must provide consistency guarantees for the data, such as schema validation, which can be enforced either on data ingestion or when the data is queried.</p>
      <p>R6: Isolation and atomicity. Similar to relational database systems, lakehouses must provide isolation and atomicity for data operations in order to ensure the consistency of the data and to support concurrency.</p>
      <p>R7: Direct read access. Lakehouses must provide direct access to the data and metadata on the underlying storage system and must employ open data formats only.</p>
      <p>R8: Unified batch and stream processing. Lakehouses must support record-wise data operations in near-realtime and allow to treat data collections as sources and sinks for batch and stream processing.</p>
      <p>These requirements can be achieved in various ways, for example by opening existing data warehouses and driving them into the direction of data lakes, or by developing technologies that enhance data lakes with common features and characteristics of data warehouses.</p>
    </sec>
    <sec id="sec-transition">
      <title>4. Transitioning from Data Lakes to Lakehouses</title>
      <p>In the course of our evaluation of several data management tools [5], frameworks for data lakes like Delta Lake, Apache Hudi and Apache Iceberg appeared to be particularly promising for the fulfillment of the aforementioned requirements and thus for the construction of lakehouses.</p>
<p>These frameworks basically act as libraries for highly scalable batch and stream processing engines, such as Apache Spark10 or Apache Flink11, and implement data access protocols that control how these engines read data from and write data to storage systems (cf. [18]). In addition, they manage technical metadata, which allows them to represent datasets as relational data collections and to track additions, updates and deletions of data.</p>
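<p>The metadata mechanism described above can be approximated with a minimal, purely illustrative sketch: a log of file-level "add" and "remove" actions from which the set of files constituting a table can be reconstructed for any version. This only mimics the general idea behind such logs and does not correspond to the actual APIs or formats of Delta Lake, Apache Hudi or Apache Iceberg.</p>

```python
# Hedged sketch of a table-metadata log: each commit records which data
# files were added to or removed from the table, and replaying the log up
# to a version yields that version's file set (the basis for features such
# as time travel). Purely illustrative; no real framework API is used.

log = []  # list of commits; each commit is a list of (action, filename)

def commit(actions):
    # One commit appends a group of file-level actions atomically.
    log.append(list(actions))

def snapshot(version):
    # Replay the log up to the given version to get the live file set.
    files = set()
    for entry in log[:version]:
        for action, name in entry:
            if action == "add":
                files.add(name)
            else:
                files.discard(name)
    return sorted(files)

commit([("add", "part-0.parquet")])
commit([("add", "part-1.parquet")])
# A rewrite (e.g. an update) replaces a file in a single commit:
commit([("remove", "part-0.parquet"), ("add", "part-0-rewritten.parquet")])
```

Because every version of the file set remains reconstructable, an engine reading through such a log can present the files as one relational data collection and expose earlier versions on demand.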
    </sec>
    <sec id="sec-3">
<p>Figure 4: Typical architecture of a data lake that can transition to a lakehouse by adding a corresponding framework. The figure marks each requirement Rn as fulfilled either natively by the data lake or by the framework.</p>
      <p>Fig. 4 shows the conceptual architecture of a data lake as it can often be encountered in practice. It essentially consists of a storage system, which can be either a distributed file system or an object storage and which persists the data as data files in an open file format. A batch and stream processing engine can read data from the storage system, process it and then write the results back to the storage system. Hence, the data lake is supposed to store the raw data next to pre-processed and aggregated data. This processing engine is also used to ingest data, and data analysts can leverage it in order to query the data via a query language like SQL. For data mining and machine learning, data science applications can directly access the data on the storage system. Even without the lakehouse framework that is depicted in Fig. 4, the data lake would already satisfy the requirements R1, R2, R3, R4 and R7. R3 is satisfied because many processing engines like Apache Spark already enable relational data abstraction, so that multiple data files that reside on the storage system can collectively represent the contents of a table. R5 and R6 are not met, since processing engines neither ensure the consistency of a table, nor do they guarantee atomicity.</p>
      <p>10https://spark.apache.org 11https://flink.apache.org</p>
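<p>What a lakehouse framework adds with regard to R5 and R6 can be sketched as follows. The example is purely illustrative and does not reflect any real framework API: records are validated against a table schema on ingestion, and a batch is published in an all-or-nothing manner, so readers never observe a half-written batch.</p>

```python
# Hedged sketch of framework-provided guarantees on top of a plain engine:
# schema validation on ingestion (cf. R5) and an all-or-nothing batch
# commit (cf. R6). All names are hypothetical.

TABLE_SCHEMA = {"id": int, "name": str}

def validate(record):
    # R5: reject records whose fields or types do not match the schema.
    if set(record) != set(TABLE_SCHEMA):
        raise ValueError("schema mismatch")
    for field, ftype in TABLE_SCHEMA.items():
        if not isinstance(record[field], ftype):
            raise ValueError("schema mismatch")

def atomic_ingest(table, records):
    # R6: validate the whole batch first, then publish it in one step.
    for rec in records:
        validate(rec)
    table.append(list(records))  # single "commit" of the entire batch

table = []
atomic_ingest(table, [{"id": 1, "name": "a"}])
try:
    # The second record violates the schema, so the whole batch is rejected
    # and the table keeps its previous, consistent state.
    atomic_ingest(table, [{"id": 2, "name": "b"}, {"id": "bad", "name": "c"}])
except ValueError:
    pass
```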
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>