<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>September</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Open-Source Software Project: Modernizing Java-based Query Engines for the Lakehouse Era</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Akash Shankaran</string-name>
          <email>akash.shankaran@intel.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>George Gu</string-name>
          <email>george.gu@intel.com</email>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Weiting Chen</string-name>
          <email>weiting.chen@intel.com</email>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Binwei Yang</string-name>
          <email>binwei.yang@intel.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chidamber Kulkarni</string-name>
          <email>chidamber.kulkarni@intel.com</email>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mark Rambacher</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nesime Tatbul</string-name>
          <email>nesime.tatbul@intel.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David E. Cohen</string-name>
          <email>david.e.cohen@intel.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Vancouver, Canada</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Intel</institution>
          ,
          <addr-line>Boston, Massachusetts</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Intel</institution>
          ,
          <addr-line>Portland, Oregon</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Intel</institution>
          ,
          <addr-line>Seattle, Washington</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Intel</institution>
          ,
          <addr-line>Shanghai</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Intel</institution>
          ,
          <addr-line>Vancouver</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>Workshop Proceedings</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>1</volume>
      <issue>2023</issue>
      <abstract>
        <p>Year-on-year exponential data growth, and the corresponding growth in machine learning's appetite to process that data, are transforming the industry's data management discipline. In response, the data lakehouse architecture has emerged. The transformative nature of the lakehouse architecture, and the need to enable a diverse set of query engines to access data that resides in a lakehouse, are motivating a refactoring of capabilities in these query engines. Industry's response is the composable data management system (CDMS). This paper introduces the Gluten open-source software (OSS) project, an embodiment of the CDMS concept. Gluten is a Java Native Interface (JNI) bridge that enables Java-based query engines to offload/accelerate processing to native acceleration libraries, such as the Meta-led Velox OSS project.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>This paper introduces the open-source software (OSS) project, Gluten [<xref ref-type="bibr" rid="ref1">1</xref>], a Java Native Interface (JNI) based bridge between query engines written in Java and database acceleration libraries such as the Velox OSS project. Query engines that integrate Gluten embody the [...] Currently, Gluten uses the Substrait.io OSS project [<xref ref-type="bibr" rid="ref2">2</xref>] to enable the Spark-SQL query engine to employ the Velox acceleration library [<xref ref-type="bibr" rid="ref3">3</xref>]. Work is under way to generalize the approach so that any query engine that incorporates [...] transform this plan to a canonical form. A transformation is provided that maps the canonical plan onto the targeted acceleration library’s plan (e.g. a Velox plan). Execution of the plan is then offloaded to the library.</p>
      <p>Although Gluten is intended to apply to any SQL query engine written in Java, initial work focuses on Apache Spark-SQL [<xref ref-type="bibr" rid="ref4">4</xref>]. Our work on Spark-SQL-over-Gluten serves as the motivating scenario for this paper. This work includes integration of the Substrait and Velox OSS projects into Gluten. With Substrait mappings of the Spark-SQL and Velox plans in place, the integration of Gluten has transformed [...] implementation. This effort is early in its development, but is already producing competitive results in TPC-H/TPC-DS-like characterizations. This development effort and characterization work is covered in the section entitled [...]</p>
      <p>Concretely, the contributions of this paper are to provide background on the motivation for the Data Lakehouse, the technical architecture that has emerged, and the disruption that adoption of this architecture is having on big data processing. Nowhere is this disruption more evident than amongst the largest users of Spark-SQL. Insights into the Spark-SQL market are discussed along with how the leaders of the Spark project have leveraged the Data Lakehouse architecture to their advantage. This, in turn, has served as a catalyst for introducing composability not just for Spark-SQL but to the broader set of Java-based query engines. Spark-SQL is used to illustrate the mechanics of composability, including early experimental results. Finally, the paper provides thoughts on how this composability can be extended to embrace the coming wave of heterogeneous processors and memories.</p>
      <p>© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>Computing demand continues to grow exponentially, largely driven by “big data” processing on hyperscale data stores [<xref ref-type="bibr" rid="ref5">5</xref>]. Increasingly, this data processing is in support of machine learning (ML), training models and subsequently serving up these models, to personalize digital content in an eCommerce setting, for example. This ML-centric, big data processing increasingly operates over custom, heterogeneous processors such as GPUs, TPUs, FPGAs, etc. [<xref ref-type="bibr" rid="ref6">6</xref>]. Taken together, these forces are motivating dramatic changes in the data management process of large companies.</p>
      <sec id="sec-2-1">
        <title>2.1. The Emergence of the Data Lakehouse Architecture</title>
        <p>The result of these changes has been the emergence of the Data Lakehouse architecture, a combination of traditional data warehouse functionality and more modern data lakes. A data lake stores raw, unstructured data on disaggregated storage. In contrast, a data warehouse is a repository for structured, filtered data that has already been processed for a specific purpose. In the emerging Lakehouse approach, an open table format is introduced into the data lake architecture. This enables the unbundling of the query engine from the data management facilities of the warehouse. These data management capabilities are refactored to operate over this open table abstraction. Introduction of these capabilities into the data lake is transformative, resulting in the Lakehouse [<xref ref-type="bibr" rid="ref7">7</xref>].</p>
        <p>What are the data management capabilities of a Lakehouse deployment? First and foremost is the disaggregation of log-structured storage from the servers over which processing is carried out [<xref ref-type="bibr" rid="ref8">8</xref>]. This disaggregation allows for the independent scaling of the storage and compute resources. Second is the ingestion of virtually any type of data into the Lakehouse using a supported serialization format, for example Apache Parquet [9]. Notably, the tables of a Lakehouse are mutable [10], [11], [12], [13], [14], allowing for transactional updates, schema modifications, etc. From within the Lakehouse, this data is then projected into analytic services such as SQL query engines, search systems, stream processors, query editors, notebooks, and machine learning (ML) models through direct access, real-time, and batch workflows.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.2. The Impact of the Lakehouse Architecture on Spark-as-a-Service Deployments</title>
        <p>The Apache Spark open-source software (OSS) project started as a research project at the University of California, Berkeley AMPLab in 2009, and was open sourced in early 2010. In 2013, Matei Zaharia and several others from AMPLab founded Databricks, whose charter is to provide a Cloud-only, Software-as-a-Service (SaaS) platform for working with Spark. Today, the Databricks service is globally available, running over Infrastructure-as-a-Service (IaaS) platforms operated by Alibaba Cloud, Amazon, Google, and Microsoft Azure [15].</p>
        <p>In parallel, these same hyperscale companies operate their own Spark-SQL-as-a-Service offerings. Each of these offerings operates over its respective global infrastructure, while Databricks is platform agnostic. This gives rise to a competitive environment across all participants.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. The Databricks Lakehouse Changes the Competitive Landscape</title>
        <p>The Apache Spark OSS project has historically been an engine of innovation that benefited the community as a whole. All of the Spark-SQL-as-a-Service offerings were based on this upstream Spark-SQL codebase. With the advent of the Lakehouse, Databricks has thrown down the gauntlet at its competition in this market segment. Their Delta Lake [12] initiative has been instrumental in bringing the Lakehouse architecture to market. As part of this initiative, Databricks introduced a proprietary Photon database acceleration library, giving them performance and efficiency advantages and changing the dynamics of the Spark community. In response to this competitive threat, others in this segment are motivated to find an alternative to the Databricks Photon library [16].</p>
      </sec>
      <sec id="sec-2-6">
        <title>2.4. Private Offerings of Spark-SQL-as-a-Service</title>
        <p>In addition to the Spark-SQL-as-a-Service market, some of the largest Cloud companies operate private Spark-SQL-as-a-Service offerings for internal constituents. These companies include ByteDance, eBay, JD.Com, LinkedIn, Meituan, Netflix, Pinterest, Stripe, etc.</p>
        <p>The scale of these offerings has reached a point where they are motivated to find ways to realize economies-of-scale. The performance and efficiency gains afforded by the use of the Photon library are compelling. However, its proprietary nature is counter to their objectives. This is motivating interest across this segment in an open-source alternative to the Photon library.</p>
      </sec>
      <sec id="sec-2-7">
        <title>2.5. Meta Deprecates Spark-SQL</title>
        <p>Finally, Meta has been a driving force in the Spark community for many years [17], [18]. Recently, however, they have begun deprecating the use of Spark-SQL with PrestoDB’s SQL taking its place. Meta is standardizing on PrestoDB’s SQL via the use of the Velox database acceleration native code library OSS project and that library’s CoreSQL dialect [19]. PrestoDB-on-Spark refactors PrestoDB’s functionality as a client library, similar to Google’s F1 Query client library [20], [21].</p>
        <sec id="sec-2-7-1">
          <title>3. Spark-SQL on Gluten</title>
          <sec>
            <title>3.1. Offloading Spark Processing to a Native Database Accelerator</title>
            <p>For analytical and machine learning workloads, the design of modern query engines is dominated by on-disk (e.g. Apache Parquet) and in-memory (e.g. Apache Arrow) columnar serialization formats [22], [9], [23]. While these workloads are memory/memory-bandwidth bound, Spark workloads have become CPU-bound. Three companies realized the opportunity to transform Spark into a vectorized SQL engine and break through its row-based data processing and JVM limitations. Today, Databricks, Intel, and NVIDIA each develop and maintain JNI-based database acceleration implementations that enable Spark-SQL to offload/accelerate Java code to a C++ library. These are the Photon, the Gluten, and the Spark-Rapids implementations respectively. Of these, only Gluten is an OSS project.</p>
            <p>In the same timeframe that Databricks and NVIDIA were developing their solutions, Intel’s Spark team was working on the Gazelle project, a predecessor to Gluten [24]. The Gazelle project focused on enabling Spark to exploit single instruction, multiple data (SIMD), specifically Intel’s Advanced Vector Extensions (Intel AVX) technology. A key deficiency of Gazelle was its limited community participation. This meant that the development burden fell to Intel.</p>
            <p>In the meantime, several vectorized SQL engines emerged with more active open-source communities. Among these, the Meta-led Velox project is a rising star, providing a vectorized database acceleration library [<xref ref-type="bibr" rid="ref3">3</xref>]. While these vectorized engines are popular, prior to Gluten there was no support for an open-source software option in Apache Spark. Intel has adopted the Velox native OSS library to replace its own Gazelle library. Adoption of Velox has opened Gluten up to the larger, more vibrant Velox community of developers. The integration of the Gluten JNI bridge with Spark-SQL retains much of the Spark-SQL Executor’s Java-based implementation. In contrast, Meta’s PrestoDB-on-Spark implementation replaces Spark-SQL with a new Presto native C++ implementation that incorporates the Velox library. This new C++ SQL engine is then integrated with the Spark execution framework.</p>
          </sec>
          <sec>
            <title>3.2. [...]</title>
            <p>As shown in Figure 1, Spark-SQL-on-Gluten replaces Gazelle with the Velox database acceleration library. The approach is similar to the one taken by Databricks in their Photon native library. The clear difference is software licensing. The Photon library is proprietary and only available with the Databricks Spark-as-a-Service platform. Gluten, on the other hand, is an Apache OSS licensed project. Gluten depends on OSS licensed projects such as the Apache Arrow, Substrait, and Velox projects. What follows is a brief sketch of the Spark-SQL-on-Gluten implementation.</p>
            <sec>
              <title>3.2.1. Plan Conversion</title>
              <p>Gluten uses Substrait to build a query plan tree. It converts Spark’s physical plan to a Substrait plan for the targeted backend, and then shares the Substrait plan over JNI to trigger the execution pipeline in the Velox native library.</p>
            </sec>
            <sec>
              <title>3.2.2. Fallback Processing</title>
              <p>Gluten leverages the existing Spark JVM engine to check that an operator is supported by the native library. If not, Gluten falls back to the existing Spark-JVM-based operator. This fallback mechanism comes at the cost of column-to-row and row-to-column data conversions between the memory layouts of the two environments.</p>
            </sec>
            <sec>
              <title>3.2.3. Memory Management</title>
              <p>Gluten leverages Spark’s existing memory management system. It calls the Spark memory registration API for every native memory allocation/deallocation action. Spark manages the memory for each task thread. If the thread needs more memory than is available, it can call the spill interface for operators that support this capability. Spark’s memory management system protects against memory leaks and out-of-memory issues.</p>
            </sec>
            <sec>
              <title>3.2.4. Columnar Shuffle</title>
              <p>Gluten reuses its predecessor Gazelle’s Apache Arrow-based Columnar Shuffle Manager as the default shuffle manager. A third-party library is responsible for handling the data transformation from native to Arrow. Alternatively, developers are free to implement their own shuffle manager.</p>
            </sec>
          </sec>
        </sec>
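        <p>As a minimal illustrative sketch of the fallback processing described in Section 3.2.2, the following Python snippet assigns each operator of a linear pipeline either to the native library or to the JVM, and then counts the layout conversions that the resulting placement would incur. Python is used only for brevity. The operator names, the NATIVE_SUPPORTED set, and both helper functions are hypothetical and are not taken from the Gluten or Velox code bases; in Gluten the support check is carried out by the Spark JVM engine.</p>
        <preformat>
```python
from dataclasses import dataclass

# Hypothetical set of operators the native library can execute.
# This is an assumption for illustration, not Gluten's or Velox's real list.
NATIVE_SUPPORTED = {"scan", "filter", "project", "hash_aggregate"}


@dataclass(frozen=True)
class PlacedOperator:
    name: str
    engine: str  # "native" (columnar layout) or "jvm" (row-based layout)


def place_operators(pipeline):
    """Run an operator natively when it is supported, else fall back to the JVM."""
    return [
        PlacedOperator(name, "native" if name in NATIVE_SUPPORTED else "jvm")
        for name in pipeline
    ]


def count_layout_conversions(placed):
    """Count column-to-row and row-to-column conversions between adjacent operators."""
    conversions = 0
    for upstream, downstream in zip(placed, placed[1:]):
        # A conversion is needed whenever data crosses the native/JVM boundary.
        if upstream.engine != downstream.engine:
            conversions += 1
    return conversions


# Hypothetical pipeline with one operator the native library does not support.
pipeline = ["scan", "filter", "python_udf", "hash_aggregate"]
placed = place_operators(pipeline)
print([(op.name, op.engine) for op in placed])
print(count_layout_conversions(placed))
```
        </preformat>
        <p>In this hypothetical pipeline the unsupported python_udf operator forces two conversions, one on each side of the fallback operator. This illustrates why a single unsupported operator in the middle of an otherwise native pipeline carries a conversion cost in both directions.</p>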
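        <sec id="sec-2-7-2">
          <title>3.2.5. Metrics</title>
          <p>Gluten supports Spark’s Metrics functionality. The default Spark metrics are served for Java row-based data processing. Gluten includes additional metrics to provide developers a means of debugging the targeted native database acceleration library.</p>
          <table-wrap>
            <label>Table 1</label>
            <caption>
              <title>Hardware configuration</title>
            </caption>
            <table>
              <tbody>
                <tr><td>CPU Model</td><td>Intel® Xeon® Platinum 8480+</td></tr>
                <tr><td>Micro-architecture</td><td>Sapphire Rapids</td></tr>
                <tr><td>CPUs</td><td>224</td></tr>
                <tr><td>Memory</td><td>1024GB</td></tr>
                <tr><td>NIC</td><td>1x Ethernet Controller I225-LM<break/>1x Ethernet interface</td></tr>
                <tr><td>Disks</td><td>2x 1.5T INTEL SSDPE2KE016T8<break/>1x 447.1G INTEL SSDSC2BB48<break/>1x 447.1G INTEL SSDSC2KB48<break/>7x 3.5T INTEL SSDPF2KX038TZ</td></tr>
              </tbody>
            </table>
          </table-wrap>
          <table-wrap>
            <label>Table 2</label>
            <caption>
              <title>Software configuration</title>
            </caption>
            <table>
              <thead>
                <tr><th>Name</th><th>Software Platform</th></tr>
              </thead>
              <tbody>
                <tr><td>Operating System</td><td>Ubuntu 22.04.1 LTS</td></tr>
                <tr><td>Linux Kernel</td><td>5.16.0-051600rc5-generic</td></tr>
                <tr><td>JDK version</td><td>1.8</td></tr>
                <tr><td>GCC version (Gluten only)</td><td>11</td></tr>
                <tr><td>Spark version</td><td>3.3.1</td></tr>
                <tr><td>Hadoop version</td><td>3.2</td></tr>
              </tbody>
            </table>
          </table-wrap>
        </sec>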
        <sec id="sec-2-7-3">
          <title>3.2.6. Shim Layer</title>
          <p>To fully integrate with Spark, Gluten includes a shim layer whose role is to support multiple versions of Spark. Gluten supports Spark versions 3.2 and 3.3, with support for newer versions being added.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Comparative Performance Characterization</title>
      <p>This section provides a comparative characterization of “Spark-SQL without Gluten” and “Spark-Gluten-Velox,” executing on the latest Intel processor. Efforts to optimize Java-based query engines rely on JVM/JDK packages, whose capabilities differ from version to version. For example, JDK 17 includes a SIMD-based Vector API capability that enables efficient query engine vectorization. Widely used earlier versions such as JDK 11 or JDK 8 are missing this Vector API capability. Use of Gluten removes this JVM/JDK version dependency when optimizing Java-based query engines. These results demonstrate the performance benefits of offloading Spark-SQL processing to Gluten.</p>
      <p>Two benchmarks (TPC-H-like and TPC-DS-like) are used to evaluate the performance of Gluten compared to vanilla Spark. They derive from the TPC-H and TPC-DS benchmarks, with minor changes to accommodate the Gluten and Velox implementations. The results show a significant improvement from using Gluten and Velox. Figure 2 shows that Gluten outperforms Spark-SQL by 2.71X in the TPC-H-like characterization and by 2.29X in the TPC-DS-like characterization. For reference, the hardware and software configurations are listed in Table 1 and Table 2 respectively.</p>
      <p>Figure 2: Comparative Characterization</p>
      <p>The improvement can also be observed from the CPU micro-architecture perspective. Figure 3 illustrates that Gluten reduces instruction path length by 3.7X in the TPC-H-like query and by 2.5X in the TPC-DS-like query relative to Spark-SQL. Gluten + Velox can also unleash the power of Intel AVX technology using SIMD instructions [...]</p>
    </sec>
    <sec id="sec-5">
      <title>5. Roadmap and Future Work</title>
      <p>We conclude the paper by sharing three key elements of the Gluten roadmap: (i) formalizing the use of the Substrait.io project, (ii) generalizing Gluten for use across several query engines, and (iii) enabling these Gluten-based query engines to target heterogeneous hardware.</p>
      <sec id="sec-5-1">
        <title>5.1. Formalizing the Use of the Substrait.io Project</title>
        <p>One method of vertical integration is to replace the top portion of the query engine framework [25]. This includes all of the frontend components: the user-facing interfaces, the query plan/optimizer, the distributed execution framework, etc. An adapter is introduced that allows these components to be replaced by a proxy. The new framework takes a query as input, produces a canonical plan, and then maps that plan onto the plan of various [...]</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Gluten as a JNI Bridge to Database Acceleration Libraries for Any Java Query Engine</title>
        <p>The Gluten JNI implementation is designed to support multiple database acceleration libraries. Currently, Kyligence provides a ClickHouse library while Intel employs the Velox library [26], [27]. The Kyligence implementation does not currently take advantage of the Substrait transformations. Support for Substrait is included in the Velox project. However, the lack of a Substrait ABI means the to/from Substrait transformations for PrestoDB and Spark-SQL are internal to Velox. Support for a Substrait ABI will allow for a mapping to provide a query-engine-specific Substrait schema, a shared library, and a means of registering the schema with Velox as part of the Velox initialization.</p>
        <p>In Gluten’s case, the framework is being refactored as a general Java-Native-Interface (JNI) implementation that uses the Substrait algebra to map a Java-based query engine such as PrestoDB or Spark-SQL on to a native database acceleration library. Gluten supports the Kyligence and Spark-SQL query engines. Work is now underway to enable the Trino project to integrate Gluten, and integration with Apache Flink is in the Gluten backlog [28], [29]. Trino and Flink will provide Substrait schemas and it is hoped that Kyligence will add support for Substrait to their roadmap.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Enabling Gluten to Target Heterogeneous Processors</title>
        <p>The Gluten abstraction also affords the opportunity to target heterogeneous processors via the native accelerator library. For example, the Velox library is being extended to target heterogeneous hardware accelerators which may be based on General Purpose CPU, FPGA or GPU. The diverse heterogeneous accelerators could become out-of-box components for CDMS systems to achieve significant advantages in design flexibility, system elasticity, performance and power efficiency (Figure 1).</p>
        <p>Currently, Velox provides a vectorization engine that targets general purpose CPUs: x86, IA, and ARM [30]. In contrast, the PyTorch framework provides the TorchDynamo and TorchInductor sub-projects that together enable deployments to target heterogeneous processors, such as an inductor OpenMP backend for general purpose CPUs and an inductor Triton backend for GPUs, including NVIDIA, AMD, etc. [31], [32], [33]. The proposed extension to Velox will introduce an ABI that is analogous to TorchDynamo/TorchInductor. The idea is to provide baseline support with OpenMP and Triton. The PyTorch community has investigations underway to extend this to support IA-based on-die accelerators, FPGAs, and Habana [34]. We envision Velox pursuing a path similar to PyTorch’s TorchDynamo/TorchInductor approach.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>This paper provided background on the motivation for the Data Lakehouse and the disruption that adoption of the Lakehouse architecture is causing. Nowhere is this disruption more apparent than amongst the largest users of Spark-SQL. This market includes Databricks, founded by the creators of the Spark project and arguably the leaders of the Data Lakehouse movement. Their introduction of the proprietary Photon database acceleration library has been a catalyst for interest in the Spark-Gluten open source software (OSS) project and its use of the Substrait and Velox OSS projects. The methods used by Gluten to enable Spark-SQL to take advantage of these projects embody vertical composability. Early experimental results demonstrate the promise of this approach. We believe Gluten can be generalized for use by any Java-based query engine. Further, we believe the use of Substrait and Velox allows for vertical composability to be extended to encompass the underlying heterogeneous hardware that is coming online. To that end, the paper provides insights into the Gluten roadmap along with plans to work with the Substrait and Velox communities to realize the Composable Data Management System vision.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>Thanks to Masha Basmanova, Orri Erling, Deepak Majeti, Pedro Pedreira, and the rest of the Velox community. Thanks also to Paul Amonson, Lukasz Grab, Milosz Linkiewicz, Kelly Mckeighan, Cezary Sawicki, and the rest of the Intel folks working on the Gluten and Velox projects. Special thanks to Jim Younan, who passed away unexpectedly at the end of last year.</p>
      <ref-list>
        <ref><mixed-citation>[...] IEEE International Conference on Data Engineering (ICDE), 2022, pp. 1598–1609.</mixed-citation></ref>
        <ref><mixed-citation>[9] Apache Parquet, https://parquet.apache.org/, 2023. Accessed: 2023-07-05.</mixed-citation></ref>
        <ref><mixed-citation>[10] Apache Hudi, https://hudi.apache.org/, 2023. Accessed: 2023-07-05.</mixed-citation></ref>
        <ref><mixed-citation>[11] Apache Iceberg, https://iceberg.apache.org/, 2023. Accessed: 2023-07-05.</mixed-citation></ref>
        <ref><mixed-citation>[12] M. Armbrust, T. Das, S. Paranjpye, R. Xin, S. Zhu, A. Ghodsi, B. Yavuz, M. Murthy, J. Torres, L. Sun, P. A. Boncz, M. Mokhtar, H. V. Hovell, A. Ionescu, A. Luszczak, M. Switakowski, T. Ueshin, X. Li, M. Szafranski, P. Senster, M. Zaharia, Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores, Proceedings of the VLDB Endowment 13 (2020) 3411–3424.</mixed-citation></ref>
        <ref><mixed-citation>[13] J. Camacho-Rodríguez, A. Agrawal, A. Gruenheid, A. Gosalia, C. Petculescu, J. Aguilar-Saborit, A. Floratou, C. Curino, R. Ramakrishnan, LST-Bench: Benchmarking Log-Structured Tables in the Cloud, CoRR abs/2305.01120 (2023).</mixed-citation></ref>
        <ref><mixed-citation>[14] P. Jain, P. Kraft, C. Power, T. Das, I. Stoica, M. Zaharia, Analyzing and Comparing Lakehouse Storage Systems, in: Proceedings of the 13th Conference on Innovative Data Systems Research (CIDR), 2023.</mixed-citation></ref>
        <ref><mixed-citation>[15] C. Power, H. Patel, A. Jindal, J. Leeka, B. Jenkins, M. Rys, E. Triou, D. Zhu, L. Katahanas, C. B. Talapady, J. Rowe, F. Zhang, R. Draves, I. Santa, A. Kumar, The Cosmos Big Data Platform at Microsoft: Over a Decade of Progress and a Decade to Look Forward, Proceedings of the VLDB Endowment 14 (2021) 3148–3161.</mixed-citation></ref>
        <ref><mixed-citation>[16] A. Behm, S. Palkar, U. Agarwal, T. Armstrong, D. Cashman, A. Dave, T. Greenstein, S. Hovsepian, R. Johnson, A. S. Krishnan, P. Leventis, A. Luszczak, P. Menon, M. Mokhtar, G. Pang, S. Paranjpye, G. Rahn, B. Samwel, T. van Bussel, H. V. Hovell, M. Xue, R. Xin, M. Zaharia, Photon: A Fast Query Engine for Lakehouse Systems, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, 2022, pp. 2326–2339.</mixed-citation></ref>
        <ref><mixed-citation>[17] B. Chattopadhyay, P. Pedreira, S. Agarwal, Y. J. Sun, S. Vakharia, P. Li, W. Liu, S. Narayanan, Shared Foundations: Modernizing Meta’s Data Lakehouse, in: Proceedings of the 13th Conference on Innovative Data Systems Research (CIDR), 2023.</mixed-citation></ref>
        <ref><mixed-citation>[18] M. Valdez-Vivas, V. Sharma, N. Stanisha, S. Li, L. Mi, W. Jiang, A. Kalinin, J. Metzler, Clockwork: A Delay-Based Global Scheduling Framework for More Consistent Landing Times in the Data Warehouse, in: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2021, pp. 3627–3637.</mixed-citation></ref>
        <ref><mixed-citation>[19] P. Pedreira, O. Erling, M. Basmanova, K. Wilfong, L. S. Sakka, K. Pai, W. He, B. Chattopadhyay, Velox: Meta’s Unified Execution Engine, Proceedings of the VLDB Endowment 15 (2022) 3372–3384.</mixed-citation></ref>
        <ref><mixed-citation>[20] S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, T. Vassilakis, H. Ahmadi, D. Delorey, S. Min, M. Pasumansky, J. Shute, Dremel: A Decade of Interactive SQL Analysis at Web Scale, Proceedings of the VLDB Endowment 13 (2020) 3461–3472.</mixed-citation></ref>
        <ref><mixed-citation>[21] B. Samwel, J. Cieslewicz, B. Handy, J. Govig, P. Venetis, C. Yang, K. Peters, J. Shute, D. Tenedorio, H. Apte, F. Weigel, D. Wilhite, J. Yang, J. Xu, J. Li, Z. Yuan, C. Chasseur, Q. Zeng, I. Rae, A. Biyani, A. Harn, Y. Xia, A. Gubichev, A. El-Helw, O. Erling, Z. Yan, M. Yang, Y. Wei, T. Do, C. Zheng, G. Graefe, S. Sardashti, A. M. Aly, D. Agrawal, A. Gupta, S. Venkataraman, F1 Query: Declarative Querying at Scale, Proceedings of the VLDB Endowment 11 (2018) 1835–1848.</mixed-citation></ref>
        <ref><mixed-citation>[22] Apache Arrow, https://arrow.apache.org/, 2023. Accessed: 2023-07-05.</mixed-citation></ref>
        <ref><mixed-citation>[23] X. Zeng, Y. Hui, J. Shen, A. Pavlo, W. McKinney, H. Zhang, An Empirical Evaluation of Columnar Storage Formats, CoRR abs/2304.05028 (2023).</mixed-citation></ref>
        <ref><mixed-citation>[24] Gazelle Plug-in, https://github.com/oap-project/gazelle_plugin, 2023. Accessed: 2023-07-05.</mixed-citation></ref>
        <ref><mixed-citation>[25] H. Gavriilidis, L. Behme, S. Papadopoulos, S. Bortoli, J. Quiané-Ruiz, V. Markl, Towards a Modular Data Management System Framework, in: S. R. Valluri, M. Zaït (Eds.), Proceedings of the 1st International Workshop on Composable Data Management Systems (CDMS), 2022.</mixed-citation></ref>
        <ref><mixed-citation>[26] Clickhouse, https://clickhouse.com/, 2023. Accessed: 2023-07-05.</mixed-citation></ref>
        <ref><mixed-citation>[27] Kyligence, https://kyligence.io/, 2023. Accessed: 2023-07-05.</mixed-citation></ref>
        <ref><mixed-citation>[28] Apache Flink, https://flink.apache.org/, 2023. Accessed: 2023-07-05.</mixed-citation></ref>
        <ref><mixed-citation>[29] Trino, https://trino.io/, 2023. Accessed: 2023-07-05.</mixed-citation></ref>
        <ref><mixed-citation>[30] T. Kersten, V. Leis, A. Kemper, T. Neumann, A. Pavlo, P. A. Boncz, Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask, Proceedings of the VLDB Endowment 11 (2018) 2209–2222.</mixed-citation></ref>
        <ref><mixed-citation>[31] OpenMP, https://www.openmp.org/, 2023. Accessed: 2023-07-05.</mixed-citation></ref>
        <ref><mixed-citation>[32] PyTorch/Torch/Dynamo, https://github.com/pytorch/pytorch/tree/main/torch/_dynamo, 2023. Accessed: 2023-07-05.</mixed-citation></ref>
        <ref><mixed-citation>[33] PyTorch/Torch/Inductor, https://github.com/pytorch/pytorch/tree/main/torch/_inductor, 2023. Accessed: 2023-07-05.</mixed-citation></ref>
        <ref><mixed-citation>[34] Habana Labs, https://habana.ai/, 2023. Accessed: 2023-07-05.</mixed-citation></ref>
      </ref-list>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Gluten</surname>
          </string-name>
          , https://github.com/oap-project/gluten,
          <year>2023</year>
          . Accessed: 2023-06-28.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Substrait</surname>
          </string-name>
          , https://substrait.io/,
          <year>2023</year>
          . Accessed: 2023-06-28.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Velox</surname>
          </string-name>
          , https://github.com/facebookincubator/velox,
          <year>2023</year>
          . Accessed: 2023-06-28.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Armbrust</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Xin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. K.</given-names>
            <surname>Bradley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kaftan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Franklin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ghodsi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaharia</surname>
          </string-name>
          ,
          <article-title>Spark SQL: Relational Data Processing in Spark</article-title>
          ,
          <source>in: Proceedings of the ACM SIGMOD International Conference on Management of Data</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>1383</fpage>
          -
          <lpage>1394</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Sevilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Heim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Besiroglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hobbhahn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Villalobos</surname>
          </string-name>
          ,
          <article-title>Compute Trends Across Three Eras of Machine Learning</article-title>
          ,
          <source>CoRR abs/2202.05924</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dadu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Karandikar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Asanovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ranganathan</surname>
          </string-name>
          ,
          <article-title>Profiling Hyperscale Big Data Processing</article-title>
          ,
          <source>in: Proceedings of the 50th Annual International Symposium on Computer Architecture (ISCA)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>47:1</fpage>
          -
          <lpage>47:16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaharia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ghodsi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Armbrust</surname>
          </string-name>
          ,
          <article-title>Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics</article-title>
          ,
          <source>in: Proceedings of the 11th Conference on Innovative Data Systems Research (CIDR)</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Niu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Korukanti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Basmanova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Baliga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <article-title>From Batch Processing to Real Time Analytics: Running Presto® at Scale</article-title>
          ,
          <source>in: Proceedings of the 38th</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>