Open-Source Software Project: Modernizing Java-based Query Engines for the Lakehouse Era

Akash Shankaran (Intel, Seattle, Washington, USA), akash.shankaran@intel.com
George Gu (Intel, Shanghai, China), george.gu@intel.com
Weiting Chen (Intel, Shanghai, China), weiting.chen@intel.com
Binwei Yang (Intel, Portland, Oregon, USA), binwei.yang@intel.com
Chidamber Kulkarni (Intel, Vancouver, Canada), chidamber.kulkarni@intel.com
Mark Rambacher (Intel, Boston, Massachusetts, USA)
Nesime Tatbul (Intel, Boston, Massachusetts, USA), nesime.tatbul@intel.com
David E. Cohen (Intel, Boston, Massachusetts, USA), david.e.cohen@intel.com

Workshop Proceedings. September 2023, Vancouver, Canada.

Year-on-year exponential data growth, and the corresponding growth in machine learning's appetite to process that data, are transforming the industry's data management discipline. In response, the data lakehouse architecture has emerged. The transformative nature of the lakehouse architecture, and the need to enable a diverse set of query engines to access data that resides in a lakehouse, is motivating a refactoring of capabilities in these query engines. Industry's response is the composable data management system (CDMS). This paper introduces the Gluten open-source software (OSS) project, an embodiment of the CDMS concept. Gluten is a Java Native Interface (JNI) bridge that enables Java-based query engines to offload/accelerate processing to native acceleration libraries, such as the Meta-led Velox OSS project.

© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License

1. Introduction

This paper introduces the open-source software (OSS) project, Gluten [1], a Java Native Interface (JNI) based bridge between query engines written in Java and database acceleration libraries such as the Velox OSS project. Query engines that integrate Gluten embody the [...] Currently, Gluten uses the Substrait.io OSS project [2] to enable the Spark-SQL query engine to employ the Velox acceleration library [3]. Work is under way to generalize the approach so that any query engine that incorporates [...] transform this plan to a canonical form. A transformation is provided that maps the canonical plan onto the targeted acceleration library's plan (e.g. a Velox plan). Execution of the plan is then offloaded to the library.

Although Gluten is intended to apply to any SQL query engine written in Java, initial work focuses on Apache Spark-SQL [4]. Our work on Spark-SQL-over-Gluten serves as the motivating scenario for this paper. This work includes integration of the Substrait and Velox OSS projects into Gluten. With Substrait mappings of the Spark-SQL and Velox plans in place, the integration of Gluten has trans[...] implementation. This effort is early in its development, but is already producing competitive results in TPC-H/TPC-DS-like characterizations. This development effort and characterization work is covered in the section entitled [...]

Concretely, the contributions of this paper are to provide background on the motivation for the Data Lakehouse, the technical architecture that has emerged, and the disruption that adoption of this architecture is having on big data processing. Nowhere is this disruption more evident than amongst the largest users of Spark-SQL. Insights into the Spark-SQL market are discussed, along with how the leaders of the Spark project have leveraged the Data Lakehouse architecture to their advantage. This, in turn, has served as a catalyst for introducing composability not just for Spark-SQL but to the broader set of Java-based query engines. Spark-SQL is used to illustrate the mechanics of composability, including early experimental results. Finally, the paper provides thoughts on how this composability can be extended to embrace the coming wave of heterogeneous processors and memories.

2. Background

Computing demand continues to grow exponentially, largely driven by “big data” processing on hyperscale data stores [5]. Increasingly, this data processing is in support of machine learning (ML): training models and subsequently serving up these models, for example to personalize digital content in an eCommerce setting. This ML-centric, big data processing increasingly operates over custom, heterogeneous processors such as GPUs, TPUs, FPGAs, etc. [6]. Taken together, these forces are motivating dramatic changes in the data management process of large companies.

2.1. The Emergence of the Data Lakehouse Architecture

The result of these changes has been the emergence of the Data Lakehouse architecture, a combination of traditional data warehouse functionality and more modern data lakes. A data lake stores raw, unstructured data on disaggregated storage. In contrast, a data warehouse is a repository for structured, filtered data that has already been processed for a specific purpose. In the emerging Lakehouse approach, an open table format is introduced into the data lake architecture. This enables the unbundling of the query engine from the data management facilities of the warehouse. These data management capabilities are refactored to operate over this open table abstraction. Introduction of these capabilities into the data lake is transformative, resulting in the Lakehouse [7].

What are the data management capabilities of a Lakehouse deployment? First and foremost is the disaggregation of log-structured storage from the servers over which processing is carried out [8]. This disaggregation allows for the independent scaling of the storage and compute resources. Second is the ingestion of virtually any type of data into the Lakehouse using a supported serialization format, for example Apache Parquet [9]. Notably, the tables of a Lakehouse are mutable [10], [11], [12], [13], [14], allowing for transactional updates, schema modifications, etc. From within the Lakehouse, this data is then projected into analytic services such as SQL query engines, search systems, stream processors, query editors, notebooks, and machine learning (ML) models through direct access, real-time, and batch workflows.

2.2. The Impact of the Lakehouse Architecture on Spark-as-a-Service Deployments

The Apache Spark open-source software (OSS) project started as a research project at the University of California, Berkeley AMPLab in 2009, and was open sourced in early 2010. In 2013, Matei Zaharia and several others from AMPLab founded Databricks, whose charter is to provide a Cloud-only, Software-as-a-Service (SaaS) platform for working with Spark. Today, the Databricks service is globally available, running over Infrastructure-as-a-Service (IaaS) platforms operated by Alibaba Cloud, Amazon, Google, and Microsoft Azure [15].

In parallel, these same hyperscale companies operate their own Spark-SQL-as-a-Service offerings. Each of these offerings operates over its respective global infrastructure, while Databricks is platform agnostic. This gives rise to a competitive environment across all participants.

2.3. The Databricks Lakehouse Changes the Competitive Landscape

The Apache Spark OSS project has historically been an engine of innovation that benefited the community as a whole. All of the Spark-SQL-as-a-Service offerings were based on this upstream Spark-SQL codebase. With the advent of the Lakehouse, Databricks has thrown down the gauntlet at its competition in this market segment. Their Deltalake [12] initiative has been instrumental in bringing the Lakehouse architecture to market. As part of this initiative, Databricks introduced a proprietary Photon database acceleration library, giving them performance and efficiency advantages and changing the dynamics of the Spark community. In response to this competitive threat, others in this segment are motivated to find an alternative to the Databricks Photon library [16].

2.4. Private Offerings of Spark-SQL-as-a-Service

In addition to the Spark-SQL-as-a-Service market, some of the largest Cloud companies operate private Spark-SQL-as-a-Service offerings for internal constituents. These companies include ByteDance, eBay, JD.Com, LinkedIn, Meituan, Netflix, Pinterest, Stripe, etc.

The scale of these offerings has reached a point where they are motivated to find ways to realize economies of scale. The performance and efficiency gains afforded by the use of the Photon library are compelling. However, its proprietary nature is counter to their objectives. This is motivating interest across this segment in an open-source alternative to the Photon library.

2.5. Meta Deprecates Spark-SQL

Finally, Meta has been a driving force in the Spark community for many years [17], [18]. Recently, however, they have begun deprecating the use of Spark-SQL, with PrestoDB's SQL taking its place. Meta is standardizing on PrestoDB's SQL via the use of the Velox database acceleration native code library OSS project and that library's CoreSQL dialect [19]. PrestoDB-on-Spark refactors PrestoDB's functionality as a client library, similar to Google's F1 Query client library [20], [21].

3. Spark-SQL on Gluten

3.1. Offloading Spark Processing to a Native Database Accelerator

For analytical and machine learning workloads, the design of modern query engines is dominated by on-disk (e.g. Apache Parquet) and in-memory (e.g. Apache Arrow) columnar serialization formats [22], [9], [23]. While these workloads are memory/memory bandwidth bound, Spark workloads have become CPU-bound. Three companies realized the opportunity to transform Spark into a vectorized SQL engine and break through its row-based data processing and JVM limitations. Today, Databricks, Intel, and NVIDIA each develop and maintain JNI-based database acceleration implementations that enable Spark-SQL to offload/accelerate Java code to a C++ library. These are the Photon, the Gluten, and the Spark-Rapids implementations respectively. Of these, only Gluten is an OSS project.

In the same timeframe that Databricks and NVIDIA were developing their solutions, Intel's Spark team was working on the Gazelle project, a predecessor to Gluten [24]. The Gazelle project focused on enabling Spark to exploit single instruction, multiple data (SIMD), specifically Intel's Advanced Vector Extensions (Intel AVX) technology. A key deficiency of Gazelle was its limited community participation. This meant that the development burden fell to Intel.

In the meantime, several vectorized SQL engines emerged with more active open-source communities. Among these, the Meta-led Velox project is a rising star, providing a vectorized database acceleration library [3]. While these vectorized engines are popular, prior to Gluten there was no support for an open-source software option in Apache Spark. Intel has adopted the Velox native OSS library to replace its own Gazelle library. Adoption of Velox has opened Gluten up to the larger, more vibrant Velox community of developers. The integration of the Gluten JNI bridge with Spark-SQL retains much of the Spark-SQL Executor's Java-based implementation. In contrast, Meta's PrestoDB-on-Spark implementation replaces Spark-SQL with a new Presto native C++ implementation that incorporates the Velox library. This new C++ SQL engine is then integrated with the Spark execution framework.

As shown in Figure 1, Spark-SQL-on-Gluten replaces Gazelle with the Velox database acceleration library. The approach is similar to the one taken by Databricks in their Photon native library. The clear difference is software licensing. The Photon library is proprietary and only available with the Databricks Spark-as-a-Service platform. Gluten, on the other hand, is an Apache OSS licensed project. Gluten depends on OSS licensed projects such as the Apache Arrow, Substrait, and Velox projects. What follows is a brief sketch of the Spark-SQL-on-Gluten implementation.

3.2.1. Plan Conversion

Gluten uses Substrait to build a query plan tree. It converts Spark's physical plan to a Substrait plan for the targeted backend, and then shares the Substrait plan over JNI to trigger the execution pipeline in the Velox native library.

3.2.2. Fallback Processing

Gluten leverages the existing Spark JVM engine to check that an operator is supported by the native library. If not, Gluten falls back to the existing Spark-JVM-based operator. This fallback mechanism comes at the cost of column-to-row and row-to-column data conversions between the memory layouts of the two environments.

3.2.3. Memory Management

Gluten leverages Spark's existing memory management system. It calls the Spark memory registration API for every native memory allocation/deallocation action. Spark manages the memory for each task thread. If the thread needs more memory than is available, it can call the spill interface for operators that support this capability. Spark's memory management system protects against memory leaks and out-of-memory issues.

3.2.4. Columnar Shuffle

Gluten reuses its predecessor Gazelle's Apache Arrow-based Columnar Shuffle Manager as the default shuffle manager. A third-party library is responsible for handling the data transformation from native to Arrow. Alternatively, developers are free to implement their own shuffle manager.

3.2.5. Metrics

Gluten supports Spark's Metrics functionality. The default Spark metrics are served for Java row-based data processing. Gluten includes additional metrics to provide developers a means of debugging the targeted native database acceleration library.

Table 1: Hardware configuration

CPU Model           Intel® Xeon® Platinum 8480+
Micro-architecture  Sapphire Rapids
CPUs                224
Memory              1024GB
NIC                 1x Ethernet Controller I225-LM
                    1x Ethernet interface
Disks               2x 1.5T INTEL SSDPE2KE016T8
                    1x 447.1G INTEL SSDSC2BB48
                    1x 447.1G INTEL SSDSC2KB48
                    7x 3.5T INTEL SSDPF2KX038TZ

Table 2: Software configuration

Name                        Software Platform
Operating System            Ubuntu 22.04.1 LTS
Linux Kernel                5.16.0-051600rc5-generic
JDK version                 1.8
GCC version (Gluten only)   11
Spark version               3.3.1
Hadoop version              3.2
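The fallback mechanism described under Fallback Processing (Section 3.2.2) can be illustrated with a small toy model. The sketch below is not Gluten code and uses neither Substrait nor JNI: the operator names, the `NATIVE_SUPPORTED` set, and the `plan` and `count_conversions` helpers are all hypothetical, chosen only to show how an unsupported operator in the middle of a plan forces a column-to-row and a row-to-column conversion at its boundaries.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical: operators the native library is assumed to support.
NATIVE_SUPPORTED = {"Scan", "Filter", "Project", "HashAggregate"}
# Illustrative names for the layout-conversion steps at engine boundaries.
CONVERSIONS = {"ColumnarToRow", "RowToColumnar"}


@dataclass
class Operator:
    """A node of the engine's physical plan (toy model)."""
    name: str
    children: List["Operator"] = field(default_factory=list)


@dataclass
class PlannedOperator:
    """A plan node annotated with the engine chosen to run it."""
    name: str
    engine: str  # "native" or "jvm"
    children: List["PlannedOperator"] = field(default_factory=list)


def plan(op: Operator) -> PlannedOperator:
    """Offload supported operators; fall back to the JVM otherwise."""
    engine = "native" if op.name in NATIVE_SUPPORTED else "jvm"
    planned_children = []
    for child in op.children:
        planned = plan(child)
        if planned.engine != engine:
            # The memory layout changes at this boundary, so a conversion
            # step is inserted (attributed to the parent's engine here).
            conversion = (
                "ColumnarToRow" if planned.engine == "native" else "RowToColumnar"
            )
            planned = PlannedOperator(conversion, engine, [planned])
        planned_children.append(planned)
    return PlannedOperator(op.name, engine, planned_children)


def count_conversions(node: PlannedOperator) -> int:
    """Count the layout conversions that fallback introduced."""
    own = 1 if node.name in CONVERSIONS else 0
    return own + sum(count_conversions(child) for child in node.children)


# "Window" is assumed to be unsupported natively in this toy example.
query = Operator(
    "Project", [Operator("Window", [Operator("Filter", [Operator("Scan")])])]
)
planned_query = plan(query)
print(count_conversions(planned_query))
```

In this toy plan a single unsupported operator costs two conversions, one on each side of the fallback, which is the overhead the text attributes to fallback processing.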

3.2.6. Shim Layer

To fully integrate with Spark, Gluten includes a shim layer whose role is to support multiple versions of Spark. Gluten supports Spark versions 3.2 and 3.3, with newer version support being added.

4. Comparative Performance Characterization

This section provides a comparative characterization of “Spark-SQL without Gluten” and “Spark-Gluten-Velox,” executing on the latest Intel processor. Efforts to optimize Java-based query engines rely on JVM/JDK packages, whose capabilities differ from version to version. For example, JDK 17 includes a SIMD-based Vector API capability that enables efficient query engine vectorization. Widely used earlier versions such as JDK11 or JDK8 are missing this Vector API capability. Use of Gluten removes this JVM/JDK version dependency when optimizing Java-based query engines. These results demonstrate the performance benefits of offloading Spark-SQL processing to Gluten.

Two benchmarks (TPC-H-like and TPC-DS-like) are used to evaluate the performance of Gluten compared to vanilla Spark. They derive from the TPC-H and TPC-DS benchmarks, with minor changes to accommodate the Gluten and Velox implementations. The results show a significant improvement from using Gluten and Velox. As Figure 2 shows, Gluten outperforms Spark-SQL by 2.71X in the TPC-H-like characterization and by 2.29X in the TPC-DS-like characterization. For reference, the hardware and software configurations are listed in Table 1 and Table 2 respectively.

Figure 2: Comparative Characterization

The improvement can also be observed from a CPU micro-architecture perspective. Figure 3 illustrates that, against Spark-SQL, Gluten's instruction path length is reduced by 3.7X in the TPC-H-like query and by 2.5X in the TPC-DS-like query. Gluten + Velox can also unleash the power of Intel AVX technology using SIMD instructions [...]
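To make the reported factors easier to read, the short sketch below converts each "NX" figure quoted in this section into the fraction of the Spark-SQL baseline that remains. It is plain arithmetic on the four numbers given in the text; the `remaining_fraction` helper and the dictionary labels are illustrative names, and no measurement beyond those four factors is assumed.

```python
def remaining_fraction(reduction_factor: float) -> float:
    """Fraction of the baseline left after an N-fold reduction (1 / N)."""
    if reduction_factor <= 0:
        raise ValueError("reduction factor must be positive")
    return 1.0 / reduction_factor


# Factors quoted in the text, relative to Spark-SQL without Gluten.
REPORTED_FACTORS = {
    "TPC-H-like runtime": 2.71,
    "TPC-DS-like runtime": 2.29,
    "TPC-H-like instruction path length": 3.7,
    "TPC-DS-like instruction path length": 2.5,
}

for label, factor in REPORTED_FACTORS.items():
    fraction = remaining_fraction(factor)
    print(f"{label}: {factor}X -> {fraction:.1%} of baseline, "
          f"{1.0 - fraction:.1%} lower")
```

Read this way, a 2.71X speedup corresponds to roughly 37% of the baseline runtime, and a 3.7X path-length reduction to roughly 27% of the baseline instruction count.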

5. Roadmap and Future Work

We conclude the paper by sharing three key elements of the Gluten roadmap: (i) formalizing the use of the Substrait.io project, (ii) generalizing Gluten for use across several query engines, and (iii) enabling these Gluten-based query engines to target heterogeneous hardware.

5.1. Formalizing the Use of the Substrait.io Project

One method of vertical integration is to replace the top portion of the query engine framework [25]. This includes all of the frontend components: the user-facing interfaces, the query plan/optimizer, the distributed execution framework, etc. An adapter is introduced that allows these components to be replaced by a proxy. The new framework takes a query as input, produces a canonical plan, and then maps that plan onto the plan of various [...]

5.2. Gluten as a JNI Bridge to Database Acceleration Libraries for Any Java Query Engine

The Gluten JNI implementation is designed to support multiple database acceleration libraries. Currently, Kyligence provides a ClickHouse library while Intel employs the Velox library [26], [27]. The Kyligence implementation does not currently take advantage of the Substrait transformations. Support for Substrait is included in the Velox project. However, the lack of a Substrait ABI means the to/from Substrait transformations for PrestoDB and Spark-SQL are internal to Velox. Support for a Substrait ABI will allow for a mapping to provide a query-engine-specific Substrait schema, a shared library, and a means of registering the schema with Velox as part of the Velox initialization.

In Gluten's case, the framework is being refactored as a general Java-Native-Interface (JNI) implementation that uses the Substrait algebra to map a Java-based query engine such as PrestoDB or Spark-SQL on to a native database acceleration library. Gluten supports the Kyligence and Spark-SQL query engines. Work is now underway to enable the Trino project to integrate Gluten, and integration with Apache Flink is in the Gluten backlog [28], [29]. Trino and Flink will provide Substrait schemas, and it is hoped that Kyligence will add support for Substrait to their roadmap.

5.3. Enabling Gluten to Target Heterogeneous Processors

The Gluten abstraction also affords the opportunity to target heterogeneous processors via the native accelerator library. For example, the Velox library is being extended to target heterogeneous hardware accelerators, which may be based on general purpose CPUs, FPGAs, or GPUs. These diverse heterogeneous accelerators could become out-of-box components for CDMS systems, bringing significant advantages in design flexibility, system elasticity, performance, and power efficiency (Figure 1).

Currently, Velox provides a vectorization engine that targets general purpose CPUs: x86, IA, and ARM [30]. In contrast, the PyTorch framework provides the TorchDynamo and TorchInductor sub-projects, which together enable deployments to target heterogeneous processors, such as an inductor OpenMP backend for general purpose CPUs and an inductor Triton backend for GPUs, including NVIDIA, AMD, etc. [31], [32], [33]. The proposed extension to Velox will introduce an ABI that is analogous to TorchDynamo/TorchInductor. The idea is to provide baseline support with OpenMP and Triton. The PyTorch community has investigations underway to extend this to support IA-based on-die accelerators, FPGAs, and Habana [34]. We envision Velox pursuing a path similar to PyTorch's TorchDynamo/TorchInductor approach.

6. Conclusion

This paper provided background on the motivation for the Data Lakehouse and the disruption that adoption of the Lakehouse architecture is causing. Nowhere is this disruption more apparent than amongst the largest users of Spark-SQL. This market includes Databricks, founded by the creators of the Spark project and arguably the leaders of the Data Lakehouse movement. Their introduction of the proprietary Photon database acceleration library has been a catalyst for interest in the Spark-Gluten open source software (OSS) project and its use of the Substrait and Velox OSS projects. The methods used by Gluten to enable Spark-SQL to take advantage of these projects embody vertical composability. Early experimental results demonstrate the promise of this approach. We believe Gluten can be generalized for use by any Java-based query engine. Further, we believe the use of Substrait and Velox allows vertical composability to be extended to encompass the underlying heterogeneous hardware that is coming online. To that end, the paper provides insights into the Gluten roadmap along with plans to work with the Substrait and Velox community to realize the Composable Data Management System vision.

Acknowledgments

Thanks to Masha Basmanova, Orri Erling, Deepak Majeti, Pedro Pedreira, and the rest of the Velox community. Thanks also to Paul Amonson, Lukasz Grab, Milosz Linkiewicz, Kelly Mckeighan, Cezary Sawicki, and the rest of the Intel folks working on the Gluten and Velox projects. Special thanks to Jim Younan, who passed away unexpectedly at the end of last year.

References

[1] Gluten, https://github.com/oap-project/gluten, 2023. Accessed: 2023-06-28.

[2] Substrait, https://substrait.io/, 2023. Accessed: 2023-06-28.

[3] Velox, https://github.com/facebookincubator/velox, 2023. Accessed: 2023-06-28.

[4] Armbrust, R. S. Xin, Lian, Huai, Liu, J. K. Bradley, Meng, Kaftan, M. J. Franklin, Ghodsi, Zaharia, Spark SQL: Relational Data Processing in Spark, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, 2015, pp. 1383–1394.

[5] Sevilla, Heim, Ho, Besiroglu, Hobbhahn, Villalobos, Compute Trends Across Three Eras of Machine Learning, CoRR abs/2202.05924 (2022).

[6] Gonzalez, Kolli, S. M. Khan, Liu, Dadu, Karandikar, Chang, Asanovic, Ranganathan, Profiling Hyperscale Big Data Processing, in: Proceedings of the 50th Annual International Symposium on Computer Architecture (ISCA), 2023, pp. 47:1–47:16.

[7] Zaharia, Ghodsi, Xin, Armbrust, Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics, in: Proceedings of the 11th Conference on Innovative Data Systems Research (CIDR), 2021.

[8] Luo, Niu, Korukanti, Sun, Basmanova, He, Wang, Agrawal, Luo, Tang, Singh, Li, Du, G. Baliga, Fu, From Batch Processing to Real Time Analytics: Running Presto® at Scale, in: Proceedings of the 38th IEEE International Conference on Data Engineering (ICDE), 2022, pp. 1598–1609.

[9] Apache Parquet, https://parquet.apache.org/, 2023. Accessed: 2023-07-05.

[10] Apache Hudi, https://hudi.apache.org/, 2023. Accessed: 2023-07-05.

[11] Apache Iceberg, https://iceberg.apache.org/, 2023. Accessed: 2023-07-05.

[12] M. Armbrust, T. Das, S. Paranjpye, R. Xin, S. Zhu, A. Ghodsi, B. Yavuz, M. Murthy, J. Torres, L. Sun, P. A. Boncz, M. Mokhtar, H. V. Hovell, A. Ionescu, A. Luszczak, M. Switakowski, T. Ueshin, X. Li, M. Szafranski, P. Senster, M. Zaharia, Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores, Proceedings of the VLDB Endowment 13 (2020) 3411–3424.

[13] J. Camacho-Rodríguez, A. Agrawal, A. Gruenheid, A. Gosalia, C. Petculescu, J. Aguilar-Saborit, A. Floratou, C. Curino, R. Ramakrishnan, LST-Bench: Benchmarking Log-Structured Tables in the Cloud, CoRR abs/2305.01120 (2023).

[14] P. Jain, P. Kraft, C. Power, T. Das, I. Stoica, M. Zaharia, Analyzing and Comparing Lakehouse Storage Systems, in: Proceedings of the 13th Conference on Innovative Data Systems Research (CIDR), 2023.

[15] C. Power, H. Patel, A. Jindal, J. Leeka, B. Jenkins, M. Rys, E. Triou, D. Zhu, L. Katahanas, C. B. Talapady, J. Rowe, F. Zhang, R. Draves, I. Santa, A. Kumar, The Cosmos Big Data Platform at Microsoft: Over a Decade of Progress and a Decade to Look Forward, Proceedings of the VLDB Endowment 14 (2021) 3148–3161.

[16] A. Behm, S. Palkar, U. Agarwal, T. Armstrong, D. Cashman, A. Dave, T. Greenstein, S. Hovsepian, R. Johnson, A. S. Krishnan, P. Leventis, A. Luszczak, P. Menon, M. Mokhtar, G. Pang, S. Paranjpye, G. Rahn, B. Samwel, T. van Bussel, H. V. Hovell, M. Xue, R. Xin, M. Zaharia, Photon: A Fast Query Engine for Lakehouse Systems, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, 2022, pp. 2326–2339.

[17] B. Chattopadhyay, P. Pedreira, S. Agarwal, Y. J. Sun, S. Vakharia, P. Li, W. Liu, S. Narayanan, Shared Foundations: Modernizing Meta's Data Lakehouse, in: Proceedings of the 13th Conference on Innovative Data Systems Research (CIDR), 2023.

[18] M. Valdez-Vivas, V. Sharma, N. Stanisha, S. Li, L. Mi, W. Jiang, A. Kalinin, J. Metzler, Clockwork: A Delay-Based Global Scheduling Framework for More Consistent Landing Times in the Data Warehouse, in: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2021, pp. 3627–3637.

[19] P. Pedreira, O. Erling, M. Basmanova, K. Wilfong, L. S. Sakka, K. Pai, W. He, B. Chattopadhyay, Velox: Meta's Unified Execution Engine, Proceedings of the VLDB Endowment 15 (2022) 3372–3384.

[20] S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, T. Vassilakis, H. Ahmadi, D. Delorey, S. Min, M. Pasumansky, J. Shute, Dremel: A Decade of Interactive SQL Analysis at Web Scale, Proceedings of the VLDB Endowment 13 (2020) 3461–3472.

[21] B. Samwel, J. Cieslewicz, B. Handy, J. Govig, P. Venetis, C. Yang, K. Peters, J. Shute, D. Tenedorio, H. Apte, F. Weigel, D. Wilhite, J. Yang, J. Xu, J. Li, Z. Yuan, C. Chasseur, Q. Zeng, I. Rae, A. Biyani, A. Harn, Y. Xia, A. Gubichev, A. El-Helw, O. Erling, Z. Yan, M. Yang, Y. Wei, T. Do, C. Zheng, G. Graefe, S. Sardashti, A. M. Aly, D. Agrawal, A. Gupta, S. Venkataraman, F1 Query: Declarative Querying at Scale, Proceedings of the VLDB Endowment 11 (2018) 1835–1848.

[22] Apache Arrow, https://arrow.apache.org/, 2023. Accessed: 2023-07-05.

[23] X. Zeng, Y. Hui, J. Shen, A. Pavlo, W. McKinney, H. Zhang, An Empirical Evaluation of Columnar Storage Formats, CoRR abs/2304.05028 (2023).

[24] Gazelle Plug-in, https://github.com/oap-project/gazelle_plugin, 2023. Accessed: 2023-07-05.

[25] H. Gavriilidis, L. Behme, S. Papadopoulos, S. Bortoli, J. Quiané-Ruiz, V. Markl, Towards a Modular Data Management System Framework, in: S. R. Valluri, M. Zaït (Eds.), Proceedings of the 1st International Workshop on Composable Data Management Systems (CDMS), 2022.

[26] Clickhouse, https://clickhouse.com/, 2023. Accessed: 2023-07-05.

[27] Kyligence, https://kyligence.io/, 2023. Accessed: 2023-07-05.

[28] Apache Flink, https://flink.apache.org/, 2023. Accessed: 2023-07-05.

[29] Trino, https://trino.io/, 2023. Accessed: 2023-07-05.

[30] T. Kersten, V. Leis, A. Kemper, T. Neumann, A. Pavlo, P. A. Boncz, Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask, Proceedings of the VLDB Endowment 11 (2018) 2209–2222.

[31] OpenMP, https://www.openmp.org/, 2023. Accessed: 2023-07-05.

[32] PyTorch/Torch/Dynamo, https://github.com/pytorch/pytorch/tree/main/torch/_dynamo, 2023. Accessed: 2023-07-05.

[33] PyTorch/Torch/Inductor, https://github.com/pytorch/pytorch/tree/main/torch/_inductor, 2023. Accessed: 2023-07-05.

[34] Habana Labs, https://habana.ai/, 2023. Accessed: 2023-07-05.