<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Post-Moore's Law Fusion: High-Bandwidth Memory, Accelerators, and Native Half-Precision Processing for CPU-Local Analytics</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Viktor Sanca</string-name>
          <email>viktor.sanca@epfl.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anastasia Ailamaki</string-name>
          <email>anastasia.ailamaki@epfl.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>EPFL</institution>
          ,
          <addr-line>Lausanne</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Google</institution>
          ,
          <addr-line>Sunnyvale</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <fpage>3</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>Modern data management systems aim to provide both cutting-edge functionality and hardware efficiency. With the advent of AI-driven data processing and the post-Moore's Law era, traditional memory-bound scale-up data management operations face scalability challenges. On the other hand, using accelerators such as GPUs has long been explored to offload complex analytical patterns while trading off data movement over an interconnect. GPUs typically provide massive parallelism and high-bandwidth memory, while CPUs are near-data processors and coordinators that are often memory-bound.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction: Evolving Scalability</title>
      <p>Data management has long focused on bringing both performant and efficient, hardware-conscious solutions and adapting to the ways and the requirements the data has to be analyzed to be useful and extract insights. Moore's law and the improvements in processing technology allowed scaling-down chips and circuits, leading to moving the peripheral interconnect (PCIe) and memory controller (MC) from the Northbridge (1) to being integrated with the CPU (2), depicted in Figure 1. This CPU evolution allowed for avoiding the Front-Side Bus data transfer overheads. The advent of high-capacity main memory and multi-socket, multi-core systems sparked optimizations for shifting the bottleneck from being IO- to memory-bound. This was reflected in the emerging database system designs that used vectorization [1] and compilation [2, 3], in contrast to the Volcano-style iteration model, to reduce the overheads of previous systems that were appropriate and tuned to the previously available hardware systems and memory hierarchies [4].</p>
      <fig id="fig1">
        <caption>
          <p>Figure 1: CPU evolution: 1) PCIe and memory controller attached via the Northbridge, 2) integrated memory controller, scaled to 3) multi-die chip design. Instead of only being present in multi-socket systems, NUMA evolved to include socket-local granularity.</p>
        </caption>
      </fig>
      <p>Similarly, when GPUs became more prevalent, driven by the machine learning community, data management research focused on using them as data processing accelerators [5, 6, 7], being especially useful for computationally and random-access-heavy operators such as joins [8]. The advent of accelerators has resulted in systems with computational, memory, and interconnect heterogeneity, consequentially resulting in system designs aimed to reduce their complexity for practical use [9, 10]. With memory-bound analytics, the next goal was reducing the effects of new bottlenecks, such as the PCIe interconnects that are required to transfer the data from the main memory to the limited-capacity, high-bandwidth GPU memory [11], while also studying and utilizing faster available interconnects [12]. Using all the available resources effectively leads to fast and efficient data processing, which requires system designs that adapt to the available hardware and build appropriate abstractions to make it practical.</p>
      <p>From another perspective, data management systems not only adapt to and drive the hardware trends but are also the backbone of making sense of, producing insights from, and deriving value from the increasingly high data volume and various sources. The ways the data is processed change with advancements in fields such as data mining and machine learning. While relational data is the long-established way to store and analyze data in many use cases, with increasingly high amounts of data, machine learning models have become more powerful for various previously human-driven processing, such as object recognition or semantic analysis. With increasingly practical and desirable models, machine learning (and inference) has become part of data analytics, both in terms of machine learning for databases, such as learned indexes [13] or reinforcement learning for query optimization [14], as well as optimizing databases for machine learning, for example using tensor computation runtimes [15].</p>
      <p>Furthermore, with more recent findings in the domain of ML, such as the Transformer model architectures [16] and data embeddings, a class of multi-modal models has presented state-of-the-art solutions for processing context-rich data. Such novel use cases and findings dictate the next important way to support extracting value and processing data. This is noticeable as machine learning and inference are becoming increasingly available in data management engines such as BigQuery ML, IBM DB2 Data Insights, Azure Data Studio/SQL Server, Oracle Machine Learning, Amazon Aurora Machine Learning, and others. In particular, extracting and learning embeddings from relational tables [17, 18] and using external models to enrich the data [19, 20] are some of the ways of incorporating learning-driven data transformation. This presents novel challenges in building declarative, optimizable, and hardware-conscious hybrid model-relational engines [19] and optimizing them to modern hardware capabilities.</p>
      <p>In conjunction, novel applications and the post-Moore's law era are bringing a shift towards horizontal scaling in CPUs, with multi-die chip designs (3) (Figure 1). While this design was already present in Intel Xeon Phi Knights Landing CPUs [21], more recently, chiplet-based designs have become mainstream with vendors such as AMD [22, 23], Apple [24], and yet again Intel [25]. In such designs, rather than being a monolithic die, the CPU is composed of multiple individual chiplets interconnected in a single package.</p>
      <p>Novel hardware brings novel challenges and opportunities for data management system design. For this reason, we study the performance characteristics and present an initial evaluation of Intel Sapphire Rapids (4th Generation Xeon) CPUs and:</p>
      <list list-type="bullet">
        <list-item><p>present and describe the novel multi-die chip architecture and the interplay of individual components in Section 2,</p></list-item>
        <list-item><p>analyze individually the effects of High-Bandwidth Memory (HBM) on access patterns in Section 3, the use of hardware-supported half-precision intrinsics in Section 4, and the novel on-core matrix accelerators (AMX) in Section 5,</p></list-item>
        <list-item><p>evaluate the combined effect of individual components on a vector-heavy workload motivated by recent embedding methods in Section 6.</p></list-item>
      </list>
      <p>The post-Moore's law world resulted in CPU designs to which data analytics systems must be tailored for effective and efficient use. In a unique fusion of features between high-bandwidth memory to tackle the memory bandwidth wall, half-precision analytics, and accelerators for ML workloads in a hierarchical NUMA scaling package, we explore their individual and combined effects.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Design: Scaling and Element Fusion</title>
      <p>The evolution of the scalability of physical CPU design (Figure 1) exposes bottlenecks or primitives that provide challenges and opportunities for hardware-conscious software system design. While miniaturization allowed fitting more components on a die and allowed the memory and peripheral controller to be integrated (2) rather than separated via the Northbridge (1), the same trends have long been known to slow down, as predicted by Moore's law. We consider a NUMA region a unit where processors can uniformly access memory, typically present in a single CPU socket. To scale, adding more cores entails adding chiplets representing individual NUMA regions with memory and peripheral links (3). This creates hierarchical NUMA regions inside the same socket by adding more chips instead of further reducing their size.</p>
      <fig id="fig2">
        <caption>
          <p>Figure 2: Schema of a dual-socket Intel Xeon CPU MAX 9480 in Sub-NUMA Clustering (SNC-4) mode: four tiles per socket, each a NUMA region with 14 physical cores, 16GB of HBM, and DDR5 links.</p>
        </caption>
      </fig>
      <p>Comparatively, this NUMA phenomenon was seen and studied in Intel Xeon Phi Knights Landing (KNL) processors [26] about 5 years ago. However, the hierarchical NUMA design and chiplets are more critical nowadays for scaling, with major vendors providing their solutions. While AMD chiplet-based CPUs focus on scalability with uniform cores [23] to provide increasingly more cores, memory, and PCIe bandwidth, Intel's approach is to extend the design with accelerators, novel intrinsics, and high-bandwidth memory [25].</p>
      <p>Instead of simply adding more NUMA regions and conversely approaching this new hardware design with adaptive NUMA placement strategies [27], Intel Sapphire Rapids (Figure 2) introduces further hardware changes to this design space, tailored to contemporary use-cases and workloads. In particular, we present the schema of a dual-socket Intel Xeon CPU MAX 9480 [28] with 56 physical cores and 64GB of HBM per socket. Every NUMA region has 16GB of HBM and 14 physical cores (28 with hyperthreading).</p>
      <p>First, High-Bandwidth Memory (HBM) modules are added to each NUMA region, offering lower capacity than main memory (in this case, 16GB per region) but significantly higher bandwidth. For example, in comparison to DDR5 memory that reaches ∼60 GB/s per region, HBM2 [29] reaches ∼230 GB/s. In aggregate, HBM per socket reaches nearly 1 TB/s, with DRAM at 240 GB/s. HBM allows addressing the long-standing memory wall problem [30], and the technology is already deployed in high-end GPUs. With memory-bound analytics and with many random-access analytical patterns that are a natural fit for GPUs, such as hash joins [8], higher bandwidth combined with many cores can offer a different tradeoff, having the benefit of main memory locality. This has also been demonstrated in the analysis of using the Intel Xeon Phi Knights Landing CPU over TPC-H queries [4]. In contrast to GPUs and remote accelerators, whose interconnects reduce the transfers to a fraction of the available local memory bandwidth, CPU-integrated HBM effectively allows near-memory processing [31]. Since Intel Xeon KNL had an earlier variant of HBM and a chiplet-based architecture similar to Figure 2, there is existing applicable research on the topics of optimizing for chiplet-based hierarchical NUMA and HBM [32, 33, 26]. We study this behavior in Section 3.</p>
      <p>However, the next important distinction is that half-precision integer (int8) and floating point (fp16) operations are supported through hardware intrinsics, including a brain-float format (bf16), common in machine learning. Beyond being optimized for machine learning workloads or approximate query processing [34], this allows changing the relative throughput and memory footprint for general analytics and designing data structures that are more compact and make use of full hardware processing support. Half-precision types effectively enable twice the throughput of the single-precision types and, in conjunction with high-bandwidth memory, may change the access-pattern behavior and shift memory- to compute-bound workloads. We explore diverse access patterns and these novel tradeoffs that come with combining HBM and half-precision types in Section 4.</p>
      <p>Finally, in addition to the high-bandwidth memory and native half-precision intrinsics, core-local accelerators introduce processing heterogeneity and workload specialization. In conjunction with homogeneous compute cores, the Intel Sapphire Rapids CPU comes with specialized components such as tile registers and matrix multiplication units (AMX) aimed at machine learning workloads, data encryption and compression accelerators (QAT), and a data streaming accelerator (DSA) to provide hardware acceleration and offload the CPU. In particular, we focus on AMX and matrix operations as support for machine learning and vector-embedding-related analytical tasks. Currently, only a single AMX accelerator is available, being the Tile matrix multiply unit (TMUL) that processes matrix data stored in 8x1KB register files called tile registers. As accelerators allow offloading tasks from CPU cores to specialized hardware and achieve better efficiency, we briefly study the impact of AMX in comparison to CPU-only execution in Section 5.</p>
      <p>Overall, this CPU architecture introduces a novel design space and opportunities for changing existing tradeoffs by fusing high-bandwidth memory, accelerators, and half-precision processing in a single package. This extends the memory hierarchy, as it enables fast access to the data in HBM, in addition to the main memory or NVMe drives via PCIe links that allow comparable aggregate bandwidth [35]. This is especially important as modern analytics start to use embeddings and tensor data formulations [17, 19, 15], where CPU-local accelerators can better support not only machine learning workloads but equally data analytics, by having fast and immediate access to the data with better-tailored operations, such as half-precision processing or accelerated matrix multiplications, that are lightweight enough not to necessitate GPU processing and crossing the comparatively slower interconnect.</p>
    </sec>
    <sec id="sec-hbm">
      <title>3. High-Bandwidth Memory (HBM)</title>
      <p>The server is configured using Sub-NUMA Clustering 4 (SNC-4), as in Figure 2. Each socket has 56 physical cores, equally partitioned over four tiles interconnected using Intel's embedded multi-die interconnect bridge (EMIB) technology in a mesh. This yields 14 physical cores in a tile (NUMA region), with direct HBM (separate NUMA region) and DRAM access, exposed in the finest available granularity to the OS.</p>
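      <p>For the experiments that follow, we control placement per NUMA node. A minimal C++ sketch of such a setup using libnuma (the node numbers mirror the SNC-4 exposure described above and are an assumption of this example; the numactl utility offers the same control from the shell):</p>
      <preformat>// Bind a column to an HBM NUMA node and keep computation on a CPU node.
// Build with: clang++ -O3 bind.cpp -lnuma
#include &lt;numa.h&gt;
#include &lt;cstddef&gt;
#include &lt;cstdio&gt;

int main() {
    if (numa_available() &lt; 0) { std::fprintf(stderr, "NUMA unsupported\n"); return 1; }
    const std::size_t n = 1000000000;  // 1B FP32 tuples, ~4GB
    // HBM is exposed as a memory-only NUMA node (node 8 for tile 0 here).
    float* col = static_cast&lt;float*&gt;(numa_alloc_onnode(n * sizeof(float), 8));
    numa_run_on_node(0);               // run on the cores of NUMA node 0
    // ... fill and process col ...
    numa_free(col, n * sizeof(float));
    return 0;
}</preformat>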
      <sec id="sec-hbm-1">
        <title>3.1. Bandwidth and Latency</title>
        <p>With data movement becoming increasingly expensive, we evaluate the bandwidth and latency characteristics of DRAM and HBM using the Intel Memory Latency Checker [36].</p>
        <sec id="sec-hbm-1-1">
          <title>3.1.1. DRAM</title>
          <table-wrap id="fig3">
            <caption>
              <p>Figure 3: DRAM Bandwidth (GB/s) between NUMA regions.</p>
            </caption>
            <table>
              <thead>
                <tr><th/><th>0</th><th>1</th><th>2</th><th>3</th><th>4</th><th>5</th><th>6</th><th>7</th></tr>
              </thead>
              <tbody>
                <tr><th>0</th><td>60.04</td><td>60.69</td><td>60.16</td><td>60.25</td><td>58.56</td><td>58.69</td><td>59.24</td><td>58.52</td></tr>
                <tr><th>1</th><td>60.48</td><td>59.72</td><td>59.98</td><td>60.34</td><td>58.54</td><td>58.56</td><td>59.33</td><td>58.53</td></tr>
                <tr><th>2</th><td>60.34</td><td>60.39</td><td>59.47</td><td>60.58</td><td>58.61</td><td>58.58</td><td>59.48</td><td>58.55</td></tr>
                <tr><th>3</th><td>60.31</td><td>60.56</td><td>60.39</td><td>59.88</td><td>58.51</td><td>58.54</td><td>59.50</td><td>58.59</td></tr>
                <tr><th>4</th><td>59.55</td><td>58.67</td><td>58.27</td><td>58.89</td><td>59.61</td><td>60.32</td><td>60.47</td><td>60.06</td></tr>
                <tr><th>5</th><td>59.32</td><td>58.56</td><td>58.30</td><td>58.84</td><td>60.37</td><td>59.61</td><td>60.39</td><td>60.20</td></tr>
                <tr><th>6</th><td>59.27</td><td>58.54</td><td>58.49</td><td>58.76</td><td>60.19</td><td>60.03</td><td>60.08</td><td>60.47</td></tr>
                <tr><th>7</th><td>59.23</td><td>58.65</td><td>58.54</td><td>58.85</td><td>60.05</td><td>60.13</td><td>60.65</td><td>59.42</td></tr>
              </tbody>
            </table>
          </table-wrap>
        </sec>
        <sec id="sec-hbm-1-2">
          <title>3.1.2. HBM</title>
          <p>Next, we evaluate the HBM bandwidth and latency characteristics similarly. Figure 5 indicates the available bandwidth between socket-local NUMA regions. There is almost a 2x performance impact of locality; thus, judicious data placement and locality maintenance are imperative to achieve the full bandwidth. This contrasts the DRAM bandwidth matrix (Figure 3), where the impact is negligible or nonexistent. Socket-local NUMA regions are 0-3 and 4-7, with corresponding HBM NUMA regions 8-11 and 12-15, respectively. Figure 6 summarizes the relative bandwidth speedup (in percentages) of HBM over DRAM in the corresponding NUMA regions.</p>
          <fig id="fig5">
            <caption>
              <p>Figure 5: HBM bandwidth matrix (GB/s), the socket-local measurement for each CPU socket with 4 CPU tiles/NUMA regions.</p>
            </caption>
          </fig>
          <p>On the other hand, the socket-local HBM module latencies (Figure 7) indicate little to no impact due to the socket-local NUMA and HBM mesh enabled by Intel EMIB Packaging Technology [37], which is the socket-local interconnect of all the cores and multi-chip packaging components. Figure 8 summarizes the relative latency slowdown (in percentages) of accessing HBM over DRAM by cores in the indicated NUMA regions.</p>
          <p>The available NUMA granularity using SNC-4 allows better resource and workload control at the expense of more complex placement and potential interference between similar memory and HBM bandwidths. Less granular exposure of cores is possible using Quad mode (2S) or HBM in Caching or HBM-only mode.</p>
        </sec>
      </sec>
      <sec id="sec-hbm-2">
        <title>3.2. Data Access Patterns</title>
        <p>With HBM becoming a CPU-local, high-bandwidth, and low-latency part of the memory hierarchy rather than separated by an interconnect, it is a candidate for use in a traditional sense where sequential and random access patterns appear in various workloads.</p>
        <p>We abstract out three access patterns that bind the decision of paying the random access cost versus initiating a full-bandwidth scan [38]. For that reason, we evaluate three access patterns in Figure 9 for both HBM and DRAM to find an appropriate sweet spot between:</p>
        <list list-type="order">
          <list-item><p>sequential scan (SCAN), which has the benefit of using the full available bandwidth,</p></list-item>
          <list-item><p>random access (RANDM), which we simulate by generating random indexes to probe the original data with,</p></list-item>
          <list-item><p>sequential access with probing (SEQM), in which we generate an index column to probe the original data; however, this column is fully sorted, and we pay the indirection cost as an ideal probing cost, such as in late materialization approaches, with the benefit of partially scanning the data if possible.</p></list-item>
        </list>
          <p>We run the experiment using a single FP32 column
with 1 billion tuples and generate an integer index column
randomly and sequentially for RANDM and SEQM cases,
which are of reduced size depending on the indicated
selectivity factor. For HBM experiments, we bind the
allocations to NUMA node 8, and for DRAM experiments,
to NUMA node 0. Computation remains on NUMA node
0, with 14 physical/28 logical cores available.</p>
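        <p>A minimal C++ sketch of the three access patterns as scalar kernels (the names and the aggregation are ours for illustration; the measured implementations are parallelized over the thread pool):</p>
        <preformat>#include &lt;cstddef&gt;
#include &lt;cstdint&gt;

// SCAN: sequential aggregation, using the full available bandwidth.
float scan(const float* col, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i &lt; n; ++i) s += col[i];
    return s;
}

// RANDM: probe the column through a randomly generated index column.
float randm(const float* col, const std::uint32_t* idx, std::size_t m) {
    float s = 0.0f;
    for (std::size_t i = 0; i &lt; m; ++i) s += col[idx[i]];
    return s;
}

// SEQM: probe through a fully sorted index column; the indirection cost
// remains, but the data is touched in order (ideal late materialization).
float seqm(const float* col, const std::uint32_t* sorted_idx, std::size_t m) {
    float s = 0.0f;
    for (std::size_t i = 0; i &lt; m; ++i) s += col[sorted_idx[i]];
    return s;
}</preformat>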
        <p>Despite the higher bandwidth, the relative data access characteristics of DRAM versus HBM remain similar but accordingly shifted. Therefore, the tradeoff shifts toward the sequential option when selecting between random and sequential access. We also note that the general overheads of random access are reduced, allowing faster random access pattern processing than DRAM.</p>
        <p>We note that the evaluation in Figure 9 uses all the available threads (28 in total, 14 physical and 14 hyperthreads). The RANDM access patterns are expected to be slower using HBM according to the latency slowdown matrix (Figure 8), and we study this through a sensitivity analysis on the number of threads. While the effects of higher latency are visible using fewer cores, there is a crossing point at 14 threads where scalability continues in the case of HBM, compared to DRAM (Figure 10).</p>
      </sec>
      <sec id="sec-hbm-3">
        <title>3.3. Value Aggregation</title>
        <p>Besides data access and movement, the practical implication of HBM is that it, for now, breaks the existing DRAM bandwidth wall.</p>
        <p>Figure 11 illustrates this phenomenon: while the DRAM workload plateaus at 60GB/s (the NUMA-local available DDR5 bandwidth), HBM-local data continues scaling and is not bound by memory. We use two workloads, one with a pure sequential summation of elements (SUM), and one with a predicate that introduces branching (SUM-IF). Both the branching and sequential algorithms were sustained in either the DRAM or HBM case.</p>
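        <p>A minimal sketch of the two aggregation kernels (the predicate in SUM-IF is an illustrative assumption):</p>
        <preformat>#include &lt;cstddef&gt;

// SUM: branch-free sequential aggregation that saturates memory bandwidth.
double sum(const float* col, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i &lt; n; ++i) s += col[i];
    return s;
}

// SUM-IF: a per-element predicate introduces branching.
double sum_if(const float* col, std::size_t n, float threshold) {
    double s = 0.0;
    for (std::size_t i = 0; i &lt; n; ++i)
        if (col[i] &lt; threshold) s += col[i];
    return s;
}</preformat>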
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Half-Precision Computation</title>
      <p>While half-precision numerical support is common in GPUs and for machine learning and inference workloads, CPUs have typically not supported it with hardware intrinsics. With support for brain float 16 (BF16), half-precision floating point (FP16), and INT8 data types, higher processing throughput is expected due to more values fitting per cache line and in registers for processing. Such data formats have a lower memory footprint, and higher throughput is expected compared to full (FP32) and double (FP64) precision types. This consequently allows designing data structures and algorithms that can more flexibly use the available instructions and data layouts.</p>
      <p>It is instructive to mention that kernel and compiler support is necessary for using half-precision types. We use Rocky Linux 9.1 (Blue Onyx) with a Linux 6.3.1-1 kernel and Clang 16, and implement all our prototypes and experiments in C++. Standard practices of aligned memory allocation have been followed, including a thread pool with pinned cores and controlling the affinities using the numactl utility.</p>
      <sec id="sec-4-1">
        <title>4.1. Throughput and Access Patterns</title>
        <p>We briefly revisit the previously explained SEQM pattern, extending it with FP16 and FP64 experiments in Figure 12. While in highly selective queries the effect of using a wrong (larger than needed) data type is not large, the benefit of using HBM can be significantly diminished when processing most of the data. Hardware intrinsics now allow making even more fine-grained decisions between FP32 and FP16 data types, in contrast to prior support of FP32 and FP64 only.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. BF16 Conversion</title>
        <p>An important factor to consider with half-precision data types using current Intel Sapphire Rapids processor intrinsics is that BF16 is meant only as an intermediate data type for computation. Accordingly, there are no load and store instructions, just conversions from other data types and computations. In particular, two BF16 values can be created from a single FP32 value and then processed, for example, using a dot-product intrinsic.</p>
        <p>The dot product is a common operation for vector and matrix data processing, for example, in computing the cosine vector similarity. We, therefore, explore the practicality of using BF16 as an intermediate format and evaluate the conversion speed in Figure 13. The task consists of computing a dot product and summing it over two columns of data containing elements of different data types (FP64, FP32, FP16). We include the experiment where, starting from FP32 data using DRAM only, intermediate BF16 data is converted, and a specialized intrinsic is used for the dot-product computation. We use all the NUMA-local cores (28 in total using hyperthreading).</p>
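        <p>The conversion-and-compute path can be sketched with the AVX-512 BF16 intrinsics (a minimal single-threaded version; the threading, alignment, and tail handling of our prototypes are simplified here):</p>
        <preformat>// Dot product over two FP32 columns using BF16 intermediates.
// Build with: clang++ -O3 -mavx512bf16 dot.cpp
#include &lt;immintrin.h&gt;
#include &lt;cstddef&gt;

float dot_via_bf16(const float* a, const float* b, std::size_t n) {
    __m512 acc = _mm512_setzero_ps();
    std::size_t i = 0;
    for (; i + 32 &lt;= n; i += 32) {
        // Each conversion packs 2x16 FP32 values into 32 BF16 lanes.
        __m512bh va = _mm512_cvtne2ps_pbh(_mm512_loadu_ps(a + i + 16),
                                          _mm512_loadu_ps(a + i));
        __m512bh vb = _mm512_cvtne2ps_pbh(_mm512_loadu_ps(b + i + 16),
                                          _mm512_loadu_ps(b + i));
        // BF16 dot product accumulated in FP32:
        // acc += va[2k]*vb[2k] + va[2k+1]*vb[2k+1]
        acc = _mm512_dpbf16_ps(acc, va, vb);
    }
    float s = _mm512_reduce_add_ps(acc);
    for (; i &lt; n; ++i) s += a[i] * b[i];  // scalar tail
    return s;
}</preformat>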
        <p>While the computation consists only of the multiplication and addition of elements, the computational overhead of the FP64 data type is significantly higher due to data movement and computation requirements. Furthermore, we can notice that the BF16 execution time is similar to FP32. This is due to scanning the data in FP32 format and not benefiting from reduced data access, only from faster computations. Still, this overhead is comparably insignificant in more complex computations, such as computationally intensive machine learning operations.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. HBM + Half-Precision</title>
        <p>Still, with the availability of HBM, we are interested in whether higher bandwidth can further offset the FP32-to-BF16 conversion cost and bring the processing time closer to that of FP16. We keep the setup but run the experiments on local DRAM and HBM separately, using all 28 tile-local threads.</p>
        <p>Figure 14 indicates that the best computation time is indeed achieved when FP16 is used and allocated as such, benefiting from lower data access and computational costs.</p>
        <p>This is true for both DRAM and HBM. However, when able to use more than 14 threads and consequently more HBM bandwidth, the FP32 access cost diminishes, and the conversion pays off even compared to FP16 in DRAM. The breaking of the bandwidth wall between DRAM and HBM is also apparent at the 4-thread point, briefly demonstrating the importance of both half-precision data types and high-bandwidth memory.</p>
        <p>This is equally apparent in Figure 15, where we rerun the SUM-IF query using different-precision floating points in HBM and DRAM. So long as the computations can sustain the increased bandwidth, it will be useful and break through the existing 60GB/s NUMA-local DRAM bandwidth. Interestingly, corresponding data types have a similar bandwidth until reaching that wall, with the number of threads required to transition between being compute- and memory-bound. Still, to fully benefit from the added bandwidth and the impact of a data type change, the workload must be able to consume the available bandwidth or initially suffer from data movement bottlenecks. In this case, the relative bandwidth of consuming the same number of tuples between FP16, FP32, and FP64 is not purely memory-bound (i.e., the data consumption bandwidth does not scale linearly).</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. On-CPU Accelerators (AMX)</title>
      <p>Finally, we analyze the last on-CPU building block that allows efficient data processing through hardware specialization. While other accelerators (QAT, DSA) are specialized for other tasks, we focus on Advanced Matrix Extensions (AMX) as a building block for modern analytical and machine learning workload acceleration over vector data types.</p>
      <p>AMX introduces matrix multiplication intrinsics that operate over 8 tile registers of 1KB each, storing half-precision floats and integers and performing dot-product operations. In contrast, traditional AVX-512 intrinsics also have dot-product instructions specialized for half-precision computation, however working on fewer data points that fit in standard registers.</p>
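      <p>A minimal sketch of driving the TMUL unit from C++ (the tile shapes and the Linux permission request follow the AMX documentation; the pair-interleaved relayout of the B operand is assumed to have been done beforehand):</p>
      <preformat>// One 16x32 (BF16) by 32x16 tile multiply accumulated into 16x16 FP32.
// Build with: clang++ -O3 -mamx-tile -mamx-bf16 amx.cpp
#include &lt;immintrin.h&gt;
#include &lt;sys/syscall.h&gt;
#include &lt;unistd.h&gt;
#include &lt;cstdint&gt;

struct TileConfig {                  // 64-byte AMX tile palette
    std::uint8_t  palette_id, start_row, reserved[14];
    std::uint16_t colsb[16];         // bytes per tile row
    std::uint8_t  rows[16];          // rows per tile
};

bool enable_amx() {                  // Linux opt-in for the tile state
    constexpr int ARCH_REQ_XCOMP_PERM = 0x1023, XFEATURE_XTILEDATA = 18;
    return syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA) == 0;
}

void tile_matmul_bf16(const std::uint16_t* A, const std::uint16_t* B, float* C) {
    TileConfig cfg{};
    cfg.palette_id = 1;
    cfg.rows[0] = 16; cfg.colsb[0] = 64;  // tmm0: C, 16x16 FP32
    cfg.rows[1] = 16; cfg.colsb[1] = 64;  // tmm1: A, 16x32 BF16
    cfg.rows[2] = 16; cfg.colsb[2] = 64;  // tmm2: B, pair-interleaved BF16
    _tile_loadconfig(&amp;cfg);
    _tile_zero(0);
    _tile_loadd(1, A, 64);                // stride is in bytes
    _tile_loadd(2, B, 64);
    _tile_dpbf16ps(0, 1, 2);              // tmm0 += tmm1 x tmm2
    _tile_stored(0, C, 64);
    _tile_release();
}</preformat>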
      <sec id="sec-5-1">
        <title>5.1. Cores vs Accelerators</title>
        <p>The natural question to ask in the presence of specialized hardware alongside a general-purpose core is where the line between the two lies. In particular, we want to answer how many regular cores one AMX accelerator replaces. We generate 1 million tuples at random, each tuple being a 512-dimensional vector. The given workload is to perform a cosine distance computation over all the tuples against a single 512-dimensional vector. To answer the question, we run AMX on a single thread exclusively while varying the threads for other data types and AVX-512-supported execution.</p>
        <fig id="fig16">
          <caption>
            <p>Figure 16: 1 thread AMX vs. AVX-512 and parallel execution on CPU-tile-local cores (up to 28 with hyperthreading) on NUMA node 0 and local DRAM on NUMA node 0.</p>
          </caption>
        </fig>
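        <p>For reference, the AVX-512 side of this comparison can be sketched with the AVX512-FP16 intrinsics as a per-tuple cosine-distance kernel (a simplified, single-vector version under our naming):</p>
        <preformat>// Cosine distance between a probe and one tuple; 32 FP16 lanes per FMA.
// Build with: clang++ -O3 -mavx512fp16 cosine.cpp
#include &lt;immintrin.h&gt;
#include &lt;cmath&gt;
#include &lt;cstddef&gt;

float cosine_distance_fp16(const _Float16* x, const _Float16* y,
                           std::size_t dim /* = 512 in our workload */) {
    __m512h dot = _mm512_setzero_ph(), nx = dot, ny = dot;
    for (std::size_t i = 0; i &lt; dim; i += 32) {
        __m512h vx = _mm512_loadu_ph(x + i);
        __m512h vy = _mm512_loadu_ph(y + i);
        dot = _mm512_fmadd_ph(vx, vy, dot);
        nx  = _mm512_fmadd_ph(vx, vx, nx);
        ny  = _mm512_fmadd_ph(vy, vy, ny);
    }
    float d = _mm512_reduce_add_ph(dot);
    float a = _mm512_reduce_add_ph(nx);
    float b = _mm512_reduce_add_ph(ny);
    return 1.0f - d / std::sqrt(a * b);
}</preformat>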
        <p>We first start with DRAM and HBM execution over different precision data types in Figure 16, similar to the experimental setup of Section 4.3.</p>
        <p>The experiment is conclusive that care should be taken to use appropriate data type precision, as it significantly impacts the execution time and efficiency. While HBM can offset some of that cost, the computation would still dominate.</p>
        <p>In the case of DRAM, even in a single-threaded execution, AMX is almost as fast as all-core execution on FP16 and FP32 data types and faster than FP64, indicating high accelerator efficiency for the given task. This allows offloading other tasks to the available CPU cores.</p>
        <p>When HBM is used, higher bandwidth allows moving the overhead of repeatedly loading the data into AVX-512 registers closer to AMX, and the crossing point indicates that, in this case, AMX replaces about 6 CPU cores, as shown in Figure 17. This enables more efficient resource utilization in the presence of suitable workloads and reduces the reliance on GPUs and slow data transfers for matrix-specialized operations. Still, data locality should be orchestrated carefully to benefit from the added specialized accelerators, as accessing remote data, even on an adjacent CPU tile (socket-local NUMA), reduces the efficiency to about replacing 3 CPU cores (Figure 18).</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Fusion: HBM + Half-Precision + AMX</title>
      <p>At this point, we have individually presented the components, and now we analyze them together: HBM, native half-precision hardware support, and matrix-processing accelerators. For efficient use of hardware resources and with increasingly high heterogeneity, it is essential to evaluate the entire design space and component interactions. The experimental setup is as in Section 5: we generate 1 million tuples randomly, each tuple being a 512-dimensional vector. The given workload is to perform a cosine distance computation over all the tuples against a single 512-dimensional vector. We run AMX on a single thread exclusively while varying the threads for other data types and AVX-512-supported execution.</p>
      <p>The most granular NUMA configuration (SNC-4) allows fine-grained data movement and access control over HBM, DRAM, and processor resources on sockets and tiles. However, certain workloads, such as performing matrix multiplication using the FP64 datatype (Figure 19), are handled better natively than by the specialized accelerators, which would require converting or approximating the data, if that is possible at all.</p>
      <p>When there is a balance between data movement and computation (Figure 20), an equally balanced use of accelerators and general computational resources is possible, depending on particular optimization goals. The increased bandwidth of HBM helps in the data movement tasks and further reduces the previously memory-bound workload time.</p>
      <p>Finally, using all the hardware characteristics (HBM, FP16, and AMX) allows fine-tuning the particular use case to the available resources. When respecting NUMA locality, the traditional CPU cores are utilized best, and half-precision types reduce the stress on memory bandwidth and data movement. This allows using the remaining resources, such as HBM and available cores, for other parallel tasks, and pushes the global efficiency of the whole NUMA node in comparison to using a specialized AMX accelerator, as depicted in Figure 21.</p>
      <p>The increased hardware accelerator and memory hierarchy heterogeneity provide opportunities and solutions to address the existing bottlenecks. On the other hand, careful orchestration and tailoring to the workload are necessary to benefit from the fine-grained features and their interactions. Finally, this also spans data placement and memory management problems, as HBM capacity is limited, as well as specializing the computation using correct data types or CPU-specific intrinsics.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion and Opportunities</title>
      <p>The looming end of Moore's law has introduced novel CPU architectural solutions. We analyzed the 4th Generation Intel Xeon Scalable processors, known as Sapphire Rapids, as one proposed answer to continue CPU scalability. In particular, a combination of homogeneous cores, half-precision support, specialized accelerators, and high-bandwidth memory allows a unique fusion of features.</p>
      <p>Firstly, high-bandwidth memory (HBM) allows scaling the memory wall, accelerators (AMX) allow efficiency gains while alleviating general-purpose cores, and half-precision types allow tailoring the computation to modern workloads.</p>
      <p>To use the underlying hardware efficiently, systems must be tuned to their characteristics. With a new design and tradeoffs in comparison to prior CPUs, our study on Sapphire Rapids is an initial starting point for hardware-software codesign of future chiplet-based heterogeneous CPUs and processing units, and how they impact both traditional and novel data management challenges.</p>
    </sec>
    <sec id="sec-ack">
      <title>Acknowledgments</title>
      <p>We thank the anonymous reviewers for their insightful comments and detailed feedback that improved the paper. We also thank the Intel Developer Cloud staff for their support in facilitating access to hardware, especially the team at Intel DCAI Labs in Swindon, who generously provided access to the server equipped with the 4th Generation Intel Xeon Scalable CPU (Sapphire Rapids) for testing and evaluation.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <title>References</title>
      <ref id="ref11">
        <label>11</label>
        <mixed-citation>A. Raza, et al., GPU-accelerated data management under the test of time, in: 10th Conference on Innovative Data Systems Research, CIDR 2020, Amsterdam, The Netherlands, January 12-15, 2020, Online Proceedings, www.cidrdb.org, 2020. URL: http://cidrdb.org/cidr2020/papers/p18-raza-cidr20.pdf.</mixed-citation>
      </ref>
      <ref id="ref12">
        <label>12</label>
        <mixed-citation>C. Lutz, S. Breß, S. Zeuch, T. Rabl, V. Markl, Pump up the volume: Processing large data on GPUs with fast interconnects, in: Proceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020, ACM, 2020, pp. 1633-1649. doi:10.1145/3318464.3389705.</mixed-citation>
      </ref>
      <ref id="ref13">
        <label>13</label>
        <mixed-citation>T. Kraska, A. Beutel, E. H. Chi, J. Dean, N. Polyzotis, The case for learned index structures, in: Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference 2018, Houston, TX, USA, June 10-15, 2018, ACM, 2018, pp. 489-504. doi:10.1145/3183713.3196909.</mixed-citation>
      </ref>
      <ref id="ref14">
        <label>14</label>
        <mixed-citation>P. Sioulas, A. Ailamaki, Scalable multi-query execution using reinforcement learning, in: G. Li, Z. Li, S. Idreos, D. Srivastava (Eds.), SIGMOD '21: International Conference on Management of Data, Virtual Event, China, June 20-25, 2021, ACM, 2021, pp. 1651-1663. doi:10.1145/3448016.3452799.</mixed-citation>
      </ref>
      <ref id="ref15">
        <label>15</label>
        <mixed-citation>D. He, S. C. Nakandala, D. Banda, R. Sen, K. Saur, K. Park, C. Curino, J. Camacho-Rodríguez, K. Karanasos, M. Interlandi, Query processing on tensor computation runtimes, Proc. VLDB Endow. 15 (2022) 2811-2825. URL: https://www.vldb.org/pvldb/vol15/p2811-he.pdf.</mixed-citation>
      </ref>
      <ref id="ref16">
        <label>16</label>
        <mixed-citation>A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, et al., Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).</mixed-citation>
      </ref>
      <ref id="ref17">
        <label>17</label>
        <mixed-citation>R. Bordawekar, O. Shmueli, Using word embedding to enable semantic queries in relational databases, in: Proceedings of the 1st Workshop on Data Management for End-to-End Machine Learning, DEEM'17, Association for Computing Machinery, New York, NY, USA, 2017. doi:10.1145/3076246.3076251.</mixed-citation>
      </ref>
      <ref id="ref18">
        <label>18</label>
        <mixed-citation>R. Bordawekar, O. Shmueli, Enabling cognitive intelligence queries in relational databases using low-dimensional word embeddings, arXiv:1603.07185 (2016).</mixed-citation>
      </ref>
      <ref id="ref19">
        <label>19</label>
        <mixed-citation>V. Sanca, A. Ailamaki, Analytical engines with context-rich processing: Towards efficient next-generation analytics, in: 2023 IEEE 39th International Conference on Data Engineering (ICDE), 2023, pp. 3699-3707. doi:10.1109/ICDE55515.2023.00298.</mixed-citation>
      </ref>
      <ref id="ref20">
        <label>20</label>
        <mixed-citation>K. Park, K. Saur, D. Banda, R. Sen, M. Interlandi, K. Karanasos, End-to-end optimization of machine learning prediction queries, in: Z. G. Ives, A. Bonifati, A. El Abbadi (Eds.), SIGMOD '22: International Conference on Management of Data, Philadelphia, PA, USA, June 12-17, 2022, ACM, 2022, pp. 587-601. doi:10.1145/3514221.3526141.</mixed-citation>
      </ref>
      <ref id="ref21">
        <label>21</label>
        <mixed-citation>A. Sodani, R. Gramunt, J. Corbal, H.-S. Kim, K. Vinod, S. Chinthamani, S. Hutsell, R. Agarwal, Y.-C. Liu, Knights Landing: Second-generation Intel Xeon Phi product, IEEE Micro 36 (2016) 34-46. doi:10.1109/MM.2016.25.</mixed-citation>
      </ref>
      <ref id="ref22">
        <label>22</label>
        <mixed-citation>S. Naffziger, N. Beck, T. Burd, K. Lepak, G. H. Loh, et al., Pioneering chiplet technology and design for the AMD EPYC™ and Ryzen™ processor families: Industrial product, in: 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), IEEE, 2021, pp. 57-70.</mixed-citation>
      </ref>
      <ref id="ref23">
        <label>23</label>
        <mixed-citation>S. Naffziger, K. Lepak, M. Paraschou, M. Subramony, 2.2 AMD chiplet architecture for high-performance server and desktop products, in: 2020 IEEE International Solid-State Circuits Conference (ISSCC), IEEE, 2020, pp. 44-45.</mixed-citation>
      </ref>
      <ref id="ref24">
        <label>24</label>
        <mixed-citation>C. Kenyon, C. Capano, Apple silicon performance in scientific computing, in: 2022 IEEE High Performance Extreme Computing Conference (HPEC), IEEE, 2022, pp. 1-10.</mixed-citation>
      </ref>
      <ref id="ref25">
        <label>25</label>
        <mixed-citation>N. Nassif, A. O. Munch, C. L. Molnar, G. Pasdast, et al., Sapphire Rapids: The next-generation Intel Xeon scalable processor, in: 2022 IEEE International Solid-State Circuits Conference (ISSCC), volume 65, 2022, pp. 44-46. doi:10.1109/ISSCC42614.2022.9731107.</mixed-citation>
      </ref>
      <ref id="ref26">
        <label>26</label>
        <mixed-citation>S. Williams, L. Ionkov, M. Lang, NUMA distance for heterogeneous memory, in: Proceedings of the Workshop on Memory Centric Programming for HPC, MCHPC'17, Association for Computing Machinery, New York, NY, USA, 2017, pp. 30-34. doi:10.1145/3145617.3145620.</mixed-citation>
      </ref>
      <ref id="ref27">
        <label>27</label>
        <mixed-citation>I. Psaroudakis, T. Scheuer, N. May, A. Sellami, A. Ailamaki, Adaptive NUMA-aware data placement and task scheduling for analytical workloads in main-memory column-stores, Proc. VLDB Endow. 10 (2016) 37-48. doi:10.14778/3015274.3015275.</mixed-citation>
      </ref>
      <ref id="ref28">
        <label>28</label>
        <mixed-citation>Intel, Intel Xeon CPU Max 9480 processor specifications, 2023. URL: https://www.intel.com/content/www/us/en/products/sku/232592/intel-xeon-cpu-max-9480-processor-112-5m-cache-1-90-ghz/specifications.html.</mixed-citation>
      </ref>
      <ref id="ref29">
        <label>29</label>
        <mixed-citation>H. Jun, J. Cho, K. Lee, H.-Y. Son, K. Kim, H. Jin, K. Kim, HBM (high bandwidth memory) DRAM technology and architecture, in: 2017 IEEE International Memory Workshop (IMW), 2017, pp. 1-4. doi:10.1109/IMW.2017.7939084.</mixed-citation>
      </ref>
      <ref id="ref30">
        <label>30</label>
        <mixed-citation>W. A. Wulf, S. A. McKee, Hitting the memory wall: Implications of the obvious, ACM SIGARCH Computer Architecture News 23 (1995) 20-24.</mixed-citation>
      </ref>
      <ref id="ref31">
        <label>31</label>
        <mixed-citation>W. Cui, Q. Zhang, S. Blanas, J. Camacho-Rodríguez, B. Haynes, Y. Li, R. Ramamurthy, P. Cheng, R. Sen, M. Interlandi, Query processing on gaming consoles, in: Proceedings of the 19th International Workshop on Data Management on New Hardware, DaMoN '23, Association for Computing Machinery, New York, NY, USA, 2023, pp. 86-88. doi:10.1145/3592980.3595313.</mixed-citation>
      </ref>
      <ref id="ref32">
        <label>32</label>
        <mixed-citation>S. Ramos, T. Hoefler, Capability models for manycore memory systems: A case-study with Xeon Phi KNL, in: 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2017, pp. 297-306. doi:10.1109/IPDPS.2017.30.</mixed-citation>
      </ref>
      <ref id="ref33">
        <label>33</label>
        <mixed-citation>S. Jha, B. He, M. Lu, X. Cheng, H. P. Huynh, Improving main memory hash joins on Intel Xeon Phi processors: An experimental approach, Proc. VLDB Endow. 8 (2015) 642-653. doi:10.14778/2735703.2735704.</mixed-citation>
      </ref>
      <ref id="ref34">
        <label>34</label>
        <mixed-citation>V. Sanca, A. Ailamaki, Sampling-based AQP in modern analytical engines, in: S. Blanas, N. May (Eds.), International Conference on Management of Data, DaMoN 2022, Philadelphia, PA, USA, 13 June 2022, ACM, 2022, pp. 4:1-4:8. doi:10.1145/3533737.3535095.</mixed-citation>
      </ref>
      <ref id="ref35">
        <label>35</label>
        <mixed-citation>H. Nicholson, A. Raza, P. Chrysogelos, A. Ailamaki, HetCache: Synergising NVMe storage and GPU acceleration for memory-efficient analytics, in: 13th Conference on Innovative Data Systems Research, CIDR 2023, Amsterdam, The Netherlands, January 8-11, 2023, www.cidrdb.org, 2023. URL: https://www.cidrdb.org/cidr2023/papers/p84-nicholson.pdf.</mixed-citation>
      </ref>
      <ref id="ref36">
        <label>36</label>
        <mixed-citation>K. Viswanathan, Intel® Memory Latency Checker v3.9a, 2023. URL: https://www.intel.com/content/www/us/en/developer/articles/tool/intelr-memory-latency-checker.html.</mixed-citation>
      </ref>
      <ref id="ref37">
        <label>37</label>
        <mixed-citation>R. Mahajan, R. Sankman, N. Patel, D.-W. Kim, K. Aygun, Z. Qian, Y. Mekonnen, I. Salama, S. Sharan, D. Iyengar, et al., Embedded multi-die interconnect bridge (EMIB): a high density, high bandwidth packaging interconnect, in: 2016 IEEE 66th Electronic Components and Technology Conference (ECTC), IEEE, 2016, pp. 557-565.</mixed-citation>
      </ref>
      <ref id="ref38">
        <label>38</label>
        <mixed-citation>M. S. Kester, M. Athanassoulis, S. Idreos, Access path selection in main-memory optimized data systems: Should I scan or should I probe?, in: Proceedings of the 2017 ACM International Conference on Management of Data, 2017, pp. 715-730.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>