<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Post-Moore's Law Fusion: High-Bandwidth Memory, Accelerators, and Native Half-Precision Processing for CPU-Local Analytics</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Viktor Sanca</string-name>
          <email>viktor.sanca@epfl.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anastasia Ailamaki</string-name>
          <email>anastasia.ailamaki@epfl.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>EPFL</institution>
          ,
          <addr-line>Lausanne</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Google</institution>
          ,
          <addr-line>Sunnyvale</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <fpage>3</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>Modern data management systems aim to provide both cutting-edge functionality and hardware efficiency. With the advent of AI-driven data processing and the post-Moore's Law era, traditional memory-bound scale-up data management operations face scalability challenges. On the other hand, using accelerators such as GPUs has long been explored to offload complex analytical patterns while trading off data movement over an interconnect. GPUs typically provide massive parallelism and high-bandwidth memory, while CPUs are near-data processors and coordinators that are often memory-bound.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction: Evolving Scalability</title>
      <p>Data management has long focused on bringing both performant and efficient, hardware-conscious solutions and adapting to the ways and the requirements the data has to be analyzed to be useful and extract insights. Moore's law and the improvements in processing technology allowed scaling-down chips and circuits, leading to moving the peripheral interconnect (PCIe) and memory controller (MC) from the Northbridge (1) to being integrated with the CPU (2), depicted in Figure 1. This CPU evolution allowed for avoiding the Front-Side Bus data transfer overheads. The advent of high-capacity main memory and multi-socket, multi-core systems sparked optimizations for shifting the bottleneck from being IO- to memory-bound. This was reflected in the emerging database system designs that used vectorization [1] and compilation [2, 3], in contrast to the Volcano-style iteration model, to reduce the overheads of previous systems that were appropriate and tuned to the previously available hardware systems and memory hierarchies [4].</p>
      <fig id="fig1">
        <caption>
          <p>Figure 1: CPU evolution: 1) PCIe and memory controller attached via the Northbridge, 2) integrated memory controller, scaled to 3) multi-die chip design. Instead of only being present in multi-socket systems, NUMA evolved to include socket-local granularity.</p>
        </caption>
      </fig>
      <p>Similarly, when GPUs became more prevalent, driven by the machine learning community, data management research focused on using them as data processing accelerators [5, 6, 7], being especially useful for computationally and random-access-heavy operators such as joins [8]. The advent of accelerators has resulted in systems with computational, memory, and interconnect heterogeneity, consequentially resulting in system designs aimed to reduce their complexity for practical use [9, 10]. With memory-bound analytics, the next goal was reducing the effects of new bottlenecks, such as the PCIe interconnects that are required to transfer the data from the main memory to the limited-capacity, high-bandwidth GPU memory [11], while also studying and utilizing faster available interconnects [12]. Using all the available resources effectively leads to fast and efficient data processing, which requires system designs that adapt to the available hardware and build appropriate abstractions to make it practical.</p>
      <p>From another perspective, data management systems not only adapt to and drive the hardware trends but are also the backbone of making sense of, producing insights from, and deriving value from the increasingly high data volume and various sources. The ways the data is processed change with advancements in fields such as data mining and machine learning. While relational data is the long-established way to store and analyze data in many use cases, with increasingly high amounts of data, machine learning models have become more powerful for various previously human-driven processing, such as object recognition or semantic analysis. With increasingly practical and desirable models, machine learning (and inference) has become part of data analytics, both in terms of machine learning for databases, such as learned indexes [13] or reinforcement learning for query optimization [14], as well as optimizing databases for machine learning, for example using tensor computation runtimes [15].</p>
      <p>Furthermore, with more recent findings in the domain of ML, such as the Transformer model architectures [16] and data embeddings, a class of multi-modal models has presented state-of-the-art solutions for processing context-rich data. Such novel use cases and findings dictate the next important way to support extracting value and processing data. This is noticeable as machine learning and inference are becoming increasingly available in data management engines such as BigQuery ML, IBM DB2 Data Insights, Azure Data Studio/SQL Server, Oracle Machine Learning, Amazon Aurora Machine Learning, and others. In particular, extracting and learning embeddings from relational tables [17, 18] and using external models to enrich the data [19, 20] are some of the ways of incorporating learning-driven data transformation. This presents novel challenges in building declarative, optimizable, and hardware-conscious hybrid model-relational engines [19] and optimizing them to modern hardware capabilities.</p>
      <p>In conjunction, novel applications and the post-Moore's law era are bringing a shift towards horizontal scaling in CPUs, with multi-die chip designs (3) (Figure 1). While this design was already present in Intel Xeon Phi Knights Landing CPUs [21], more recently, chiplet-based designs have become mainstream with vendors such as AMD [22, 23], Apple [24], and yet again Intel [25]. In such designs, rather than being a monolithic die, the CPU is composed of multiple individual chiplets interconnected in a single package.</p>
      <p>Novel hardware brings novel challenges and opportunities for data management system design. For this reason, we study the performance characteristics and present an initial evaluation of Intel Sapphire Rapids (4th Generation Xeon) CPUs and:</p>
      <list list-type="bullet">
        <list-item><p>present and describe the novel multi-die chip architecture and the interplay of individual components in Section 2,</p></list-item>
        <list-item><p>analyze individually the effects of High-Bandwidth Memory (HBM) on access patterns in Section 3, the use of hardware-supported half-precision intrinsics in Section 4, and the novel on-core matrix accelerators (AMX) in Section 5,</p></list-item>
        <list-item><p>evaluate the combined effect of individual components on a vector-heavy workload motivated by recent embedding methods in Section 6.</p></list-item>
      </list>
      <p>The post-Moore's law world resulted in CPU designs to which data analytics systems must be tailored for effective and efficient use. In a unique fusion of features between high-bandwidth memory to tackle the memory bandwidth wall, half-precision analytics, and accelerators for ML workloads in a hierarchical NUMA scaling package, we explore their individual and combined effects.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Design: Scaling and Element Fusion</title>
      <p>The evolution of the scalability of physical CPU design (Figure 1) exposes bottlenecks or primitives that provide challenges and opportunities for hardware-conscious software system design. While miniaturization allowed fitting more components on a die and allowed the memory and peripheral controller to be integrated (2) rather than separated via the Northbridge (1), the same trends have long been known to slow down, as predicted by Moore's law. We consider a NUMA region a unit where processors can uniformly access memory, typically present in a single CPU socket. To scale, adding more cores entails adding chiplets representing individual NUMA regions with memory and peripheral links (3). This creates hierarchical NUMA regions inside the same socket by adding more chips instead of further reducing their size.</p>
      <fig id="fig2">
        <caption>
          <p>Figure 2: Schema of a dual-socket Intel Xeon CPU MAX 9480 in Sub-NUMA Clustering (SNC-4) mode: four tiles per socket, each a NUMA region with 14 physical cores, 16GB of HBM, and DDR5 links.</p>
        </caption>
      </fig>
      <p>Comparatively, this NUMA phenomenon was seen and studied in Intel Xeon Phi Knights Landing (KNL) processors [26] about 5 years ago. However, the hierarchical NUMA design and chiplets are more critical nowadays for scaling, with major vendors providing their solutions. While AMD chiplet-based CPUs focus on scalability with uniform cores [23] to provide increasingly more cores, memory, and PCIe bandwidth, Intel's approach is to extend the design with accelerators, novel intrinsics, and high-bandwidth memory [25].</p>
      <p>Instead of simply adding more NUMA regions and conversely approaching this new hardware design with adaptive NUMA placement strategies [27], Intel Sapphire Rapids (Figure 2) introduces further hardware changes to this design space, tailored to contemporary use-cases and workloads. In particular, we present the schema of a dual-socket Intel Xeon CPU MAX 9480 [28] with 56 physical cores and 64GB of HBM per socket. Every NUMA region has 16GB of HBM and 14 physical cores (28 with hyperthreading).</p>
      <p>First, High-Bandwidth Memory (HBM) modules are added to each NUMA region, offering lower capacity than main memory (in this case, 16GB per region) but significantly higher bandwidth. For example, in comparison to DDR5 memory that reaches ∼60 GB/s per region, HBM2 [29] reaches ∼230 GB/s. In aggregate, HBM per socket reaches nearly 1 TB/s, with DRAM at 240 GB/s. HBM allows addressing the long-standing memory wall problem [30], and the technology is already deployed in high-end GPUs. With memory-bound analytics and with many random-access analytical patterns that are a natural fit for GPUs, such as hash joins [8], higher bandwidth combined with many cores can offer a different tradeoff, having the benefit of main memory locality. This has also been demonstrated in the analysis of using the Intel Xeon Phi Knights Landing CPU over TPC-H queries [4]. In contrast to GPUs and remote accelerators, whose interconnects reduce the transfers to a fraction of the available local memory bandwidth, CPU-integrated HBM effectively allows near-memory processing [31]. Since Intel Xeon KNL had an earlier variant of HBM and a chiplet-based architecture similar to Figure 2, there is existing applicable research on the topics of optimizing for chiplet-based hierarchical NUMA and HBM [32, 33, 26]. We study this behavior in Section 3.</p>
      <p>However, the next important distinction is that half-precision integer (int8) and floating point (fp16) operations are supported through hardware intrinsics, including a brain-float format (bf16), common in machine learning. Beyond being optimized for machine learning workloads or approximate query processing [34], this allows changing the relative throughput and memory footprint for general analytics and designing data structures that are more compact and make use of full hardware processing support. Half-precision types effectively enable twice the throughput of the single-precision types and, in conjunction with high-bandwidth memory, may change the access-pattern behavior and shift memory- to compute-bound workloads. We explore diverse access patterns and these novel tradeoffs that come with combining HBM and half-precision types in Section 4.</p>
      <p>Finally, in addition to the high-bandwidth memory and native half-precision intrinsics, core-local accelerators introduce processing heterogeneity and workload specialization. In conjunction with homogeneous compute cores, the Intel Sapphire Rapids CPU comes with specialized components such as tile registers and matrix multiplication units (AMX) aimed at machine learning workloads, data encryption and compression accelerators (QAT), and a data streaming accelerator (DSA) to provide hardware acceleration and offload the CPU. In particular, we focus on AMX and matrix operations as support for machine learning and vector-embedding-related analytical tasks. Currently, only a single AMX accelerator is available, being the Tile matrix multiply unit (TMUL) that processes matrix data stored in 8x1KB register files called tile registers. As accelerators allow offloading tasks from CPU cores to specialized hardware and achieve better efficiency, we briefly study the impact of AMX in comparison to CPU-only execution in Section 5.</p>
      <p>Overall, this CPU architecture introduces a novel design space and opportunities for changing existing tradeoffs by fusing high-bandwidth memory, accelerators, and half-precision processing in a single package. This extends the memory hierarchy, as it enables fast access to the data in HBM, in addition to the main memory or NVMe drives via PCIe links that allow comparable aggregate bandwidth [35]. This is especially important as modern analytics start to use embeddings and tensor data formulations [17, 19, 15], where CPU-local accelerators can better support not only machine learning workloads but equally data analytics, by having fast and immediate access to the data with better-tailored operations, such as half-precision processing or accelerated matrix multiplications, that are lightweight enough not to necessitate GPU processing and crossing the comparatively slower interconnect.</p>
    </sec>
    <sec id="sec-hbm">
      <title>3. High-Bandwidth Memory (HBM)</title>
      <p>The server is configured using Sub-NUMA Clustering 4 (SNC-4), as in Figure 2. Each socket has 56 physical cores, equally partitioned over four tiles interconnected using Intel's embedded multi-die interconnect bridge (EMIB) technology in a mesh. This yields 14 physical cores in a tile (NUMA region), with direct HBM (separate NUMA region) and DRAM access, exposed in the finest available granularity to the OS.</p>
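      <p>For the experiments that follow, we control placement per NUMA node. A minimal C++ sketch of such a setup using libnuma (the node numbers mirror the SNC-4 exposure described above and are an assumption of this example; the numactl utility offers the same control from the shell):</p>
      <preformat>// Bind a column to an HBM NUMA node and keep computation on a CPU node.
// Build with: clang++ -O3 bind.cpp -lnuma
#include &lt;numa.h&gt;
#include &lt;cstddef&gt;
#include &lt;cstdio&gt;

int main() {
    if (numa_available() &lt; 0) { std::fprintf(stderr, "NUMA unsupported\n"); return 1; }
    const std::size_t n = 1000000000;  // 1B FP32 tuples, ~4GB
    // HBM is exposed as a memory-only NUMA node (node 8 for tile 0 here).
    float* col = static_cast&lt;float*&gt;(numa_alloc_onnode(n * sizeof(float), 8));
    numa_run_on_node(0);               // run on the cores of NUMA node 0
    // ... fill and process col ...
    numa_free(col, n * sizeof(float));
    return 0;
}</preformat>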
      <sec id="sec-hbm-1">
        <title>3.1. Bandwidth and Latency</title>
        <p>With data movement becoming increasingly expensive, we evaluate the bandwidth and latency characteristics of DRAM and HBM using the Intel Memory Latency Checker [36].</p>
        <sec id="sec-hbm-1-1">
          <title>3.1.1. DRAM</title>
          <table-wrap id="fig3">
            <caption>
              <p>Figure 3: DRAM Bandwidth (GB/s) between NUMA regions.</p>
            </caption>
            <table>
              <thead>
                <tr><th/><th>0</th><th>1</th><th>2</th><th>3</th><th>4</th><th>5</th><th>6</th><th>7</th></tr>
              </thead>
              <tbody>
                <tr><th>0</th><td>60.04</td><td>60.69</td><td>60.16</td><td>60.25</td><td>58.56</td><td>58.69</td><td>59.24</td><td>58.52</td></tr>
                <tr><th>1</th><td>60.48</td><td>59.72</td><td>59.98</td><td>60.34</td><td>58.54</td><td>58.56</td><td>59.33</td><td>58.53</td></tr>
                <tr><th>2</th><td>60.34</td><td>60.39</td><td>59.47</td><td>60.58</td><td>58.61</td><td>58.58</td><td>59.48</td><td>58.55</td></tr>
                <tr><th>3</th><td>60.31</td><td>60.56</td><td>60.39</td><td>59.88</td><td>58.51</td><td>58.54</td><td>59.50</td><td>58.59</td></tr>
                <tr><th>4</th><td>59.55</td><td>58.67</td><td>58.27</td><td>58.89</td><td>59.61</td><td>60.32</td><td>60.47</td><td>60.06</td></tr>
                <tr><th>5</th><td>59.32</td><td>58.56</td><td>58.30</td><td>58.84</td><td>60.37</td><td>59.61</td><td>60.39</td><td>60.20</td></tr>
                <tr><th>6</th><td>59.27</td><td>58.54</td><td>58.49</td><td>58.76</td><td>60.19</td><td>60.03</td><td>60.08</td><td>60.47</td></tr>
                <tr><th>7</th><td>59.23</td><td>58.65</td><td>58.54</td><td>58.85</td><td>60.05</td><td>60.13</td><td>60.65</td><td>59.42</td></tr>
              </tbody>
            </table>
          </table-wrap>
        </sec>
        <sec id="sec-hbm-1-2">
          <title>3.1.2. HBM</title>
          <p>Next, we evaluate the HBM bandwidth and latency characteristics similarly. Figure 5 indicates the available bandwidth between socket-local NUMA regions. There is almost a 2x performance impact of locality; thus, judicious data placement and locality maintenance are imperative to achieve the full bandwidth. This contrasts the DRAM bandwidth matrix (Figure 3), where the impact is negligible or nonexistent. Socket-local NUMA regions are 0-3 and 4-7, with corresponding HBM NUMA regions 8-11 and 12-15, respectively. Figure 6 summarizes the relative bandwidth speedup (in percentages) of HBM over DRAM in the corresponding NUMA regions.</p>
          <fig id="fig5">
            <caption>
              <p>Figure 5: HBM bandwidth matrix (GB/s), the socket-local measurement for each CPU socket with 4 CPU tiles/NUMA regions.</p>
            </caption>
          </fig>
          <p>On the other hand, the socket-local HBM module latencies (Figure 7) indicate little to no impact due to the socket-local NUMA and HBM mesh enabled by Intel EMIB Packaging Technology [37], which is the socket-local interconnect of all the cores and multi-chip packaging components. Figure 8 summarizes the relative latency slowdown (in percentages) of accessing HBM over DRAM by cores in the indicated NUMA regions.</p>
          <p>The available NUMA granularity using SNC-4 allows better resource and workload control at the expense of more complex placement and potential interference between similar memory and HBM bandwidths. Less granular exposure of cores is possible using Quad mode (2S) or HBM in Caching or HBM-only mode.</p>
        </sec>
      </sec>
      <sec id="sec-hbm-2">
        <title>3.2. Data Access Patterns</title>
        <p>With HBM becoming a CPU-local, high-bandwidth, and low-latency part of the memory hierarchy rather than separated by an interconnect, it is a candidate for use in a traditional sense where sequential and random access patterns appear in various workloads.</p>
        <p>We abstract out three access patterns that bind the decision of paying the random access cost versus initiating a full-bandwidth scan [38]. For that reason, we evaluate three access patterns in Figure 9 for both HBM and DRAM to find an appropriate sweet spot between:</p>
        <list list-type="order">
          <list-item><p>sequential scan (SCAN), which has the benefit of using the full available bandwidth,</p></list-item>
          <list-item><p>random access (RANDM), which we simulate by generating random indexes to probe the original data with,</p></list-item>
          <list-item><p>sequential access with probing (SEQM), in which we generate an index column to probe the original data; however, this column is fully sorted, and we pay the indirection cost as an ideal probing cost, such as in late materialization approaches, with the benefit of partially scanning the data if possible.</p></list-item>
        </list>
          <p>We run the experiment using a single FP32 column
with 1 billion tuples and generate an integer index column
randomly and sequentially for RANDM and SEQM cases,
which are of reduced size depending on the indicated
selectivity factor. For HBM experiments, we bind the
allocations to NUMA node 8, and for DRAM experiments,
to NUMA node 0. Computation remains on NUMA node
0, with 14 physical/28 logical cores available.</p>
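        <p>A minimal C++ sketch of the three access patterns as scalar kernels (the names and the aggregation are ours for illustration; the measured implementations are parallelized over the thread pool):</p>
        <preformat>#include &lt;cstddef&gt;
#include &lt;cstdint&gt;

// SCAN: sequential aggregation, using the full available bandwidth.
float scan(const float* col, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i &lt; n; ++i) s += col[i];
    return s;
}

// RANDM: probe the column through a randomly generated index column.
float randm(const float* col, const std::uint32_t* idx, std::size_t m) {
    float s = 0.0f;
    for (std::size_t i = 0; i &lt; m; ++i) s += col[idx[i]];
    return s;
}

// SEQM: probe through a fully sorted index column; the indirection cost
// remains, but the data is touched in order (ideal late materialization).
float seqm(const float* col, const std::uint32_t* sorted_idx, std::size_t m) {
    float s = 0.0f;
    for (std::size_t i = 0; i &lt; m; ++i) s += col[sorted_idx[i]];
    return s;
}</preformat>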
        <p>Despite the higher bandwidth, the relative data access characteristics of DRAM versus HBM remain similar but accordingly shifted. Therefore, the tradeoff shifts toward the sequential option when selecting between random and sequential access. We also note that the general overheads of random access are reduced, allowing faster random access pattern processing than DRAM.</p>
        <p>We note that the evaluation in Figure 9 uses all the available threads (28 in total, 14 physical and 14 hyperthreads). The RANDM access patterns are expected to be slower using HBM according to the latency slowdown matrix (Figure 8), and we study this through a sensitivity analysis on the number of threads. While the effects of higher latency are visible using fewer cores, there is a crossing point at 14 threads where scalability continues in the case of HBM, compared to DRAM (Figure 10).</p>
      </sec>
      <sec id="sec-hbm-3">
        <title>3.3. Value Aggregation</title>
        <p>Besides data access and movement, the practical implication of HBM is that it, for now, breaks the existing DRAM bandwidth wall.</p>
        <p>Figure 11 illustrates this phenomenon: while the DRAM workload plateaus at 60GB/s (the NUMA-local available DDR5 bandwidth), HBM-local data continues scaling and is not bound by memory. We use two workloads, one with a pure sequential summation of elements (SUM), and one with a predicate that introduces branching (SUM-IF). Both the branching and sequential algorithms were sustained in either the DRAM or HBM case.</p>
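        <p>A minimal sketch of the two aggregation kernels (the predicate in SUM-IF is an illustrative assumption):</p>
        <preformat>#include &lt;cstddef&gt;

// SUM: branch-free sequential aggregation that saturates memory bandwidth.
double sum(const float* col, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i &lt; n; ++i) s += col[i];
    return s;
}

// SUM-IF: a per-element predicate introduces branching.
double sum_if(const float* col, std::size_t n, float threshold) {
    double s = 0.0;
    for (std::size_t i = 0; i &lt; n; ++i)
        if (col[i] &lt; threshold) s += col[i];
    return s;
}</preformat>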
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Half-Precision Computation</title>
      <p>While half-precision numerical support is common in GPUs and for machine learning and inference workloads, CPUs have typically not supported it with hardware intrinsics. With support for brain float 16 (BF16), half-precision floating point (FP16), and INT8 data types, higher processing throughput is expected due to more values fitting per cache line and in registers for processing. Such data formats have a lower memory footprint, and higher throughput is expected compared to full (FP32) and double (FP64) precision types. This consequently allows designing data structures and algorithms that can more flexibly use the available instructions and data layouts.</p>
      <p>It is instructive to mention that kernel and compiler support is necessary for using half-precision types. We use Rocky Linux 9.1 (Blue Onyx) with a Linux 6.3.1-1 kernel and Clang 16, and implement all our prototypes and experiments in C++. Standard practices of aligned memory allocation have been followed, including a thread pool with pinned cores and controlling the affinities using the numactl utility.</p>
      <sec id="sec-4-1">
        <title>4.1. Throughput and Access Patterns</title>
        <p>We briefly revisit the previously explained SEQM pattern, extending it with FP16 and FP64 experiments in Figure 12. While in highly selective queries the effect of using a wrong (larger than needed) data type is not large, the benefit of using HBM can be significantly diminished when processing most of the data. Hardware intrinsics now allow making even more fine-grained decisions between FP32 and FP16 data types, in contrast to prior support of FP32 and FP64 only.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. BF16 Conversion</title>
        <p>An important factor to consider with half-precision data types using current Intel Sapphire Rapids processor intrinsics is that BF16 is meant only as an intermediate data type for computation. Accordingly, there are no load and store instructions, just conversions from other data types and computations. In particular, two BF16 values can be created from a single FP32 value and then processed, for example, using a dot-product intrinsic.</p>
        <p>The dot product is a common operation for vector and matrix data processing, for example, in computing the cosine vector similarity. We, therefore, explore the practicality of using BF16 as an intermediate format and evaluate the conversion speed in Figure 13. The task consists of computing a dot product and summing it over two columns of data containing elements of different data types (FP64, FP32, FP16). We include the experiment where, starting from FP32 data using DRAM only, intermediate BF16 data is converted, and a specialized intrinsic is used for the dot-product computation. We use all the NUMA-local cores (28 in total using hyperthreading).</p>
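        <p>The conversion-and-compute path can be sketched with the AVX-512 BF16 intrinsics (a minimal single-threaded version; the threading, alignment, and tail handling of our prototypes are simplified here):</p>
        <preformat>// Dot product over two FP32 columns using BF16 intermediates.
// Build with: clang++ -O3 -mavx512bf16 dot.cpp
#include &lt;immintrin.h&gt;
#include &lt;cstddef&gt;

float dot_via_bf16(const float* a, const float* b, std::size_t n) {
    __m512 acc = _mm512_setzero_ps();
    std::size_t i = 0;
    for (; i + 32 &lt;= n; i += 32) {
        // Each conversion packs 2x16 FP32 values into 32 BF16 lanes.
        __m512bh va = _mm512_cvtne2ps_pbh(_mm512_loadu_ps(a + i + 16),
                                          _mm512_loadu_ps(a + i));
        __m512bh vb = _mm512_cvtne2ps_pbh(_mm512_loadu_ps(b + i + 16),
                                          _mm512_loadu_ps(b + i));
        // BF16 dot product accumulated in FP32:
        // acc += va[2k]*vb[2k] + va[2k+1]*vb[2k+1]
        acc = _mm512_dpbf16_ps(acc, va, vb);
    }
    float s = _mm512_reduce_add_ps(acc);
    for (; i &lt; n; ++i) s += a[i] * b[i];  // scalar tail
    return s;
}</preformat>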
        <p>While the computation consists only of the multiplication and addition of elements, the computational overhead of the FP64 data type is significantly higher due to data movement and computation requirements. Furthermore, we can notice that the BF16 execution time is similar to FP32. This is due to scanning the data in FP32 format and not benefiting from reduced data access, only from faster computations. Still, this overhead is comparably insignificant in more complex computations, such as computationally intensive machine learning operations.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. HBM + Half-Precision</title>
        <p>Still, with the availability of HBM, we are interested in whether higher bandwidth can further offset the FP32-to-BF16 conversion cost and bring the processing time closer to that of FP16. We keep the setup but run the experiments on local DRAM and HBM separately, using all 28 tile-local threads.</p>
        <p>Figure 14 indicates that the best computation time is indeed achieved when FP16 is used and allocated as such, benefiting from lower data access and computational costs.</p>
        <p>This is true for both DRAM and HBM. However, when able to use more than 14 threads and consequently more HBM bandwidth, the FP32 access cost diminishes, and the conversion pays off even compared to FP16 in DRAM. The breaking of the bandwidth wall between DRAM and HBM is also apparent at the 4-thread point, briefly demonstrating the importance of both half-precision data types and high-bandwidth memory.</p>
        <p>This is equally apparent in Figure 15, where we rerun the SUM-IF query using different-precision floating points in HBM and DRAM. So long as the computations can sustain the increased bandwidth, it will be useful and break through the existing 60GB/s NUMA-local DRAM bandwidth. Interestingly, corresponding data types have a similar bandwidth until reaching that wall, with the number of threads required to transition between being compute- and memory-bound. Still, to fully benefit from the added bandwidth and the impact of a data type change, the workload must be able to consume the available bandwidth or initially suffer from data movement bottlenecks. In this case, the relative bandwidth of consuming the same number of tuples between FP16, FP32, and FP64 is not purely memory-bound (i.e., the data consumption bandwidth does not scale linearly).</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. On-CPU Accelerators (AMX)</title>
      <p>Finally, we analyze the last on-CPU building block that allows efficient data processing through hardware specialization. While other accelerators (QAT, DSA) are specialized for other tasks, we focus on Advanced Matrix Extensions (AMX) as a building block for modern analytical and machine learning workload acceleration over vector data types.</p>
      <p>AMX introduces matrix multiplication intrinsics that operate over 8 tile registers of 1KB each, storing half-precision floats and integers and performing dot-product operations. In contrast, traditional AVX-512 intrinsics also have dot-product instructions specialized for half-precision computation, however working on fewer data points that fit in standard registers.</p>
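      <p>A minimal sketch of driving the TMUL unit from C++ (the tile shapes and the Linux permission request follow the AMX documentation; the pair-interleaved relayout of the B operand is assumed to have been done beforehand):</p>
      <preformat>// One 16x32 (BF16) by 32x16 tile multiply accumulated into 16x16 FP32.
// Build with: clang++ -O3 -mamx-tile -mamx-bf16 amx.cpp
#include &lt;immintrin.h&gt;
#include &lt;sys/syscall.h&gt;
#include &lt;unistd.h&gt;
#include &lt;cstdint&gt;

struct TileConfig {                  // 64-byte AMX tile palette
    std::uint8_t  palette_id, start_row, reserved[14];
    std::uint16_t colsb[16];         // bytes per tile row
    std::uint8_t  rows[16];          // rows per tile
};

bool enable_amx() {                  // Linux opt-in for the tile state
    constexpr int ARCH_REQ_XCOMP_PERM = 0x1023, XFEATURE_XTILEDATA = 18;
    return syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA) == 0;
}

void tile_matmul_bf16(const std::uint16_t* A, const std::uint16_t* B, float* C) {
    TileConfig cfg{};
    cfg.palette_id = 1;
    cfg.rows[0] = 16; cfg.colsb[0] = 64;  // tmm0: C, 16x16 FP32
    cfg.rows[1] = 16; cfg.colsb[1] = 64;  // tmm1: A, 16x32 BF16
    cfg.rows[2] = 16; cfg.colsb[2] = 64;  // tmm2: B, pair-interleaved BF16
    _tile_loadconfig(&amp;cfg);
    _tile_zero(0);
    _tile_loadd(1, A, 64);                // stride is in bytes
    _tile_loadd(2, B, 64);
    _tile_dpbf16ps(0, 1, 2);              // tmm0 += tmm1 x tmm2
    _tile_stored(0, C, 64);
    _tile_release();
}</preformat>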
      <sec id="sec-5-1">
        <title>5.1. Cores vs Accelerators</title>
        <p>The natural question to ask in the presence of specialized hardware alongside a general-purpose core is where the line between the two lies. In particular, we want to answer how many regular cores one AMX accelerator replaces. We generate 1 million tuples at random, each tuple being a 512-dimensional vector. The given workload is to perform a cosine distance computation over all the tuples against a single 512-dimensional vector. To answer the question, we run AMX on a single thread exclusively while varying the threads for other data types and AVX-512-supported execution.</p>
        <fig id="fig16">
          <caption>
            <p>Figure 16: 1 thread AMX vs. AVX-512 and parallel execution on CPU-tile-local cores (up to 28 with hyperthreading) on NUMA node 0 and local DRAM on NUMA node 0.</p>
          </caption>
        </fig>
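        <p>For reference, the AVX-512 side of this comparison can be sketched with the AVX512-FP16 intrinsics as a per-tuple cosine-distance kernel (a simplified, single-vector version under our naming):</p>
        <preformat>// Cosine distance between a probe and one tuple; 32 FP16 lanes per FMA.
// Build with: clang++ -O3 -mavx512fp16 cosine.cpp
#include &lt;immintrin.h&gt;
#include &lt;cmath&gt;
#include &lt;cstddef&gt;

float cosine_distance_fp16(const _Float16* x, const _Float16* y,
                           std::size_t dim /* = 512 in our workload */) {
    __m512h dot = _mm512_setzero_ph(), nx = dot, ny = dot;
    for (std::size_t i = 0; i &lt; dim; i += 32) {
        __m512h vx = _mm512_loadu_ph(x + i);
        __m512h vy = _mm512_loadu_ph(y + i);
        dot = _mm512_fmadd_ph(vx, vy, dot);
        nx  = _mm512_fmadd_ph(vx, vx, nx);
        ny  = _mm512_fmadd_ph(vy, vy, ny);
    }
    float d = _mm512_reduce_add_ph(dot);
    float a = _mm512_reduce_add_ph(nx);
    float b = _mm512_reduce_add_ph(ny);
    return 1.0f - d / std::sqrt(a * b);
}</preformat>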
        <p>We first start with DRAM and HBM execution over different precision data types in Figure 16, similar to the experimental setup of Section 4.3.</p>
        <p>The experiment is conclusive that care should be taken to use appropriate data type precision, as it significantly impacts the execution time and efficiency. While HBM can offset some of that cost, the computation would still dominate.</p>
        <p>In the case of DRAM, even in a single-threaded execution, AMX is almost as fast as all-core execution on FP16 and FP32 data types and faster than FP64, indicating high accelerator efficiency for the given task. This allows offloading other tasks to the available CPU cores.</p>
        <p>When HBM is used, higher bandwidth allows moving the overhead of repeatedly loading the data into AVX-512 registers closer to AMX, and the crossing point indicates that, in this case, AMX replaces about 6 CPU cores, as shown in Figure 17. This enables more efficient resource utilization in the presence of suitable workloads and reduces the reliance on GPUs and slow data transfers for matrix-specialized operations. Still, data locality should be orchestrated carefully to benefit from the added specialized accelerators, as accessing remote data, even on an adjacent CPU tile (socket-local NUMA), reduces the efficiency to about replacing 3 CPU cores (Figure 18).</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Fusion: HBM + Half-Precision + AMX</title>
      <p>At this point, we have individually presented the components, and now we analyze them together: HBM, native half-precision hardware support, and matrix-processing accelerators. For efficient use of hardware resources and with increasingly high heterogeneity, it is essential to evaluate the entire design space and component interactions. The experimental setup is as in Section 5: we generate 1 million tuples randomly, each tuple being a 512-dimensional vector. The given workload is to perform a cosine distance computation over all the tuples against a single 512-dimensional vector. We run AMX on a single thread exclusively while varying the threads for other data types and AVX-512-supported execution.</p>
      <p>The most granular NUMA configuration (SNC-4) allows fine-grained data movement and access control over HBM, DRAM, and processor resources on sockets and tiles. However, certain workloads, such as performing matrix multiplication using the FP64 datatype (Figure 19), are handled better natively than by the specialized accelerators, which would require converting or approximating the data, if that is possible at all.</p>
      <p>When there is a balance between data movement and computation (Figure 20), an equally balanced use of accelerators and general computational resources is possible, depending on particular optimization goals. The increased bandwidth of HBM helps in the data movement tasks and further reduces the previously memory-bound workload time.</p>
      <p>Finally, using all the hardware characteristics (HBM, FP16, and AMX) allows fine-tuning the particular use case to the available resources. When respecting NUMA locality, the traditional CPU cores are utilized best, and half-precision types reduce the stress on memory bandwidth and data movement. This allows using the remaining resources, such as HBM and available cores, for other parallel tasks, and pushes the global efficiency of the whole NUMA node in comparison to using a specialized AMX accelerator, as depicted in Figure 21.</p>
      <p>The increased hardware accelerator and memory hierarchy heterogeneity provide opportunities and solutions to address the existing bottlenecks. On the other hand, careful orchestration and tailoring to the workload are necessary to benefit from the fine-grained features and their interactions. Finally, this also spans data placement and memory management problems, as HBM capacity is limited, as well as specializing the computation using correct data types or CPU-specific intrinsics.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion and Opportunities</title>
      <p>The looming end of Moore's law has introduced novel CPU architectural solutions. We analyzed the 4th Generation Intel Xeon Scalable processors, known as Sapphire Rapids, as one proposed answer to continue CPU scalability. In particular, a combination of homogeneous cores, half-precision support, specialized accelerators, and high-bandwidth memory allows a unique fusion of features.</p>
      <p>Firstly, high-bandwidth memory (HBM) allows scaling the memory wall, accelerators (AMX) allow efficiency gains while alleviating general-purpose cores, and half-precision types allow tailoring the computation to modern workloads.</p>
      <p>To use the underlying hardware efficiently, systems must be tuned to their characteristics. With a new design and tradeoffs in comparison to prior CPUs, our study on Sapphire Rapids is an initial starting point for hardware-software codesign of future chiplet-based heterogeneous CPUs and processing units, and how they impact both traditional and novel data management challenges.</p>
    </sec>
    <sec id="sec-ack">
      <title>Acknowledgments</title>
      <p>We thank the anonymous reviewers for their insightful comments and detailed feedback that improved the paper. We also thank the Intel Developer Cloud staff for their support in facilitating access to hardware, especially the team at Intel DCAI Labs in Swindon, who generously provided access to the server equipped with the 4th Generation Intel Xeon Scalable CPU (Sapphire Rapids) for testing and evaluation.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <title>References</title>
      <ref id="ref11">
        <label>11</label>
        <mixed-citation>A. Raza, et al., GPU-accelerated data management under the test of time, in: 10th Conference on Innovative Data Systems Research, CIDR 2020, Amsterdam, The Netherlands, January 12-15, 2020, Online Proceedings, www.cidrdb.org, 2020. URL: http://cidrdb.org/cidr2020/papers/p18-raza-cidr20.pdf.</mixed-citation>
      </ref>
      <ref id="ref12">
        <label>12</label>
        <mixed-citation>C. Lutz, S. Breß, S. Zeuch, T. Rabl, V. Markl, Pump up the volume: Processing large data on GPUs with fast interconnects, in: Proceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020, ACM, 2020, pp. 1633-1649. doi:10.1145/3318464.3389705.</mixed-citation>
      </ref>
      <ref id="ref13">
        <label>13</label>
        <mixed-citation>T. Kraska, A. Beutel, E. H. Chi, J. Dean, N. Polyzotis, The case for learned index structures, in: Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference 2018, Houston, TX, USA, June 10-15, 2018, ACM, 2018, pp. 489-504. doi:10.1145/3183713.3196909.</mixed-citation>
      </ref>
      <ref id="ref14">
        <label>14</label>
        <mixed-citation>P. Sioulas, A. Ailamaki, Scalable multi-query execution using reinforcement learning, in: G. Li, Z. Li, S. Idreos, D. Srivastava (Eds.), SIGMOD '21: International Conference on Management of Data, Virtual Event, China, June 20-25, 2021, ACM, 2021, pp. 1651-1663. doi:10.1145/3448016.3452799.</mixed-citation>
      </ref>
      <ref id="ref15">
        <label>15</label>
        <mixed-citation>D. He, S. C. Nakandala, D. Banda, R. Sen, K. Saur, K. Park, C. Curino, J. Camacho-Rodríguez, K. Karanasos, M. Interlandi, Query processing on tensor computation runtimes, Proc. VLDB Endow. 15 (2022) 2811-2825. URL: https://www.vldb.org/pvldb/vol15/p2811-he.pdf.</mixed-citation>
      </ref>
      <ref id="ref16">
        <label>16</label>
        <mixed-citation>A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, et al., Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).</mixed-citation>
      </ref>
      <ref id="ref17">
        <label>17</label>
        <mixed-citation>R. Bordawekar, O. Shmueli, Using word embedding to enable semantic queries in relational databases, in: Proceedings of the 1st Workshop on Data Management for End-to-End Machine Learning, DEEM'17, Association for Computing Machinery, New York, NY, USA, 2017. doi:10.1145/3076246.3076251.</mixed-citation>
      </ref>
      <ref id="ref18">
        <label>18</label>
        <mixed-citation>R. Bordawekar, O. Shmueli, Enabling cognitive intelligence queries in relational databases using low-dimensional word embeddings, arXiv:1603.07185 (2016).</mixed-citation>
      </ref>
      <ref id="ref19">
        <label>19</label>
        <mixed-citation>V. Sanca, A. Ailamaki, Analytical engines with context-rich processing: Towards efficient next-generation analytics, in: 2023 IEEE 39th International Conference on Data Engineering (ICDE), 2023, pp. 3699-3707. doi:10.1109/ICDE55515.2023.00298.</mixed-citation>
      </ref>
      <ref id="ref20">
        <label>20</label>
        <mixed-citation>K. Park, K. Saur, D. Banda, R. Sen, M. Interlandi, K. Karanasos, End-to-end optimization of machine learning prediction queries, in: Z. G. Ives, A. Bonifati, A. El Abbadi (Eds.), SIGMOD '22: International Conference on Management of Data, Philadelphia, PA, USA, June 12-17, 2022, ACM, 2022, pp. 587-601. doi:10.1145/3514221.3526141.</mixed-citation>
      </ref>
      <ref id="ref21">
        <label>21</label>
        <mixed-citation>A. Sodani, R. Gramunt, J. Corbal, H.-S. Kim, K. Vinod, S. Chinthamani, S. Hutsell, R. Agarwal, Y.-C. Liu, Knights Landing: Second-generation Intel Xeon Phi product, IEEE Micro 36 (2016) 34-46. doi:10.1109/MM.2016.25.</mixed-citation>
      </ref>
      <ref id="ref22">
        <label>22</label>
        <mixed-citation>S. Naffziger, N. Beck, T. Burd, K. Lepak, G. H. Loh, et al., Pioneering chiplet technology and design for the AMD EPYC™ and Ryzen™ processor families: Industrial product, in: 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), IEEE, 2021, pp. 57-70.</mixed-citation>
      </ref>
      <ref id="ref23">
        <label>23</label>
        <mixed-citation>S. Naffziger, K. Lepak, M. Paraschou, M. Subramony, 2.2 AMD chiplet architecture for high-performance server and desktop products, in: 2020 IEEE International Solid-State Circuits Conference (ISSCC), IEEE, 2020, pp. 44-45.</mixed-citation>
      </ref>
      <ref id="ref24">
        <label>24</label>
        <mixed-citation>C. Kenyon, C. Capano, Apple silicon performance in scientific computing, in: 2022 IEEE High Performance Extreme Computing Conference (HPEC), IEEE, 2022, pp. 1-10.</mixed-citation>
      </ref>
      <ref id="ref25">
        <label>25</label>
        <mixed-citation>N. Nassif, A. O. Munch, C. L. Molnar, G. Pasdast, et al., Sapphire Rapids: The next-generation Intel Xeon scalable processor, in: 2022 IEEE International Solid-State Circuits Conference (ISSCC), volume 65, 2022, pp. 44-46. doi:10.1109/ISSCC42614.2022.9731107.</mixed-citation>
      </ref>
      <ref id="ref26">
        <label>26</label>
        <mixed-citation>S. Williams, L. Ionkov, M. Lang, NUMA distance for heterogeneous memory, in: Proceedings of the Workshop on Memory Centric Programming for HPC, MCHPC'17, Association for Computing Machinery, New York, NY, USA, 2017, pp. 30-34. doi:10.1145/3145617.3145620.</mixed-citation>
      </ref>
      <ref id="ref27">
        <label>27</label>
        <mixed-citation>I. Psaroudakis, T. Scheuer, N. May, A. Sellami, A. Ailamaki, Adaptive NUMA-aware data placement and task scheduling for analytical workloads in main-memory column-stores, Proc. VLDB Endow. 10 (2016) 37-48. doi:10.14778/3015274.3015275.</mixed-citation>
      </ref>
      <ref id="ref28">
        <label>28</label>
        <mixed-citation>Intel, Intel Xeon CPU Max 9480 processor specifications, 2023. URL: https://www.intel.com/content/www/us/en/products/sku/232592/intel-xeon-cpu-max-9480-processor-112-5m-cache-1-90-ghz/specifications.html.</mixed-citation>
      </ref>
      <ref id="ref29">
        <label>29</label>
        <mixed-citation>H. Jun, J. Cho, K. Lee, H.-Y. Son, K. Kim, H. Jin, K. Kim, HBM (high bandwidth memory) DRAM technology and architecture, in: 2017 IEEE International Memory Workshop (IMW), 2017, pp. 1-4. doi:10.1109/IMW.2017.7939084.</mixed-citation>
      </ref>
      <ref id="ref30">
        <label>30</label>
        <mixed-citation>W. A. Wulf, S. A. McKee, Hitting the memory wall: Implications of the obvious, ACM SIGARCH Computer Architecture News 23 (1995) 20-24.</mixed-citation>
      </ref>
      <ref id="ref31">
        <label>31</label>
        <mixed-citation>W. Cui, Q. Zhang, S. Blanas, J. Camacho-Rodríguez, B. Haynes, Y. Li, R. Ramamurthy, P. Cheng, R. Sen, M. Interlandi, Query processing on gaming consoles, in: Proceedings of the 19th International Workshop on Data Management on New Hardware, DaMoN '23, Association for Computing Machinery, New York, NY, USA, 2023, pp. 86-88. doi:10.1145/3592980.3595313.</mixed-citation>
      </ref>
      <ref id="ref32">
        <label>32</label>
        <mixed-citation>S. Ramos, T. Hoefler, Capability models for manycore memory systems: A case-study with Xeon Phi KNL, in: 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2017, pp. 297-306. doi:10.1109/IPDPS.2017.30.</mixed-citation>
      </ref>
      <ref id="ref33">
        <label>33</label>
        <mixed-citation>S. Jha, B. He, M. Lu, X. Cheng, H. P. Huynh, Improving main memory hash joins on Intel Xeon Phi processors: An experimental approach, Proc. VLDB Endow. 8 (2015) 642-653. doi:10.14778/2735703.2735704.</mixed-citation>
      </ref>
      <ref id="ref34">
        <label>34</label>
        <mixed-citation>V. Sanca, A. Ailamaki, Sampling-based AQP in modern analytical engines, in: S. Blanas, N. May (Eds.), International Conference on Management of Data, DaMoN 2022, Philadelphia, PA, USA, 13 June 2022, ACM, 2022, pp. 4:1-4:8. doi:10.1145/3533737.3535095.</mixed-citation>
      </ref>
      <ref id="ref35">
        <label>35</label>
        <mixed-citation>H. Nicholson, A. Raza, P. Chrysogelos, A. Ailamaki, HetCache: Synergising NVMe storage and GPU acceleration for memory-efficient analytics, in: 13th Conference on Innovative Data Systems Research, CIDR 2023, Amsterdam, The Netherlands, January 8-11, 2023, www.cidrdb.org, 2023. URL: https://www.cidrdb.org/cidr2023/papers/p84-nicholson.pdf.</mixed-citation>
      </ref>
      <ref id="ref36">
        <label>36</label>
        <mixed-citation>K. Viswanathan, Intel® Memory Latency Checker v3.9a, 2023. URL: https://www.intel.com/content/www/us/en/developer/articles/tool/intelr-memory-latency-checker.html.</mixed-citation>
      </ref>
      <ref id="ref37">
        <label>37</label>
        <mixed-citation>R. Mahajan, R. Sankman, N. Patel, D.-W. Kim, K. Aygun, Z. Qian, Y. Mekonnen, I. Salama, S. Sharan, D. Iyengar, et al., Embedded multi-die interconnect bridge (EMIB): a high density, high bandwidth packaging interconnect, in: 2016 IEEE 66th Electronic Components and Technology Conference (ECTC), IEEE, 2016, pp. 557-565.</mixed-citation>
      </ref>
      <ref id="ref38">
        <label>38</label>
        <mixed-citation>M. S. Kester, M. Athanassoulis, S. Idreos, Access path selection in main-memory optimized data systems: Should I scan or should I probe?, in: Proceedings of the 2017 ACM International Conference on Management of Data, 2017, pp. 715-730.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>