<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Multiplication-Like Tasks on GPUs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xincheng Xie</string-name>
          <email>xie.xincheng@columbia.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Junyoung Kim</string-name>
          <email>junyoung2@cs.columbia.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
<string-name>Kenneth Ross</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>GPU, Data Analytics, Relational Algebra</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Columbia University</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Workshop Proce dings</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>couver</institution>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <conference>
        <conf-name>Joint Workshops at the 49th International Conference on Very Large Data Bases (VLDBW'23): Workshop on Accelerating Analytics and Data Management Systems (ADMS'23)</conf-name>
        <conf-date>August 28 - September 1, 2023</conf-date>
        <conf-loc>Vancouver, Canada</conf-loc>
      </conference>
      <abstract>
<p>Data scientists are often eager to develop new operators for data science and machine learning applications, which can be written as matrix multiplication-like nested loops. However, these matrix multiplication-like tasks are not supported in standard libraries and often struggle to achieve high performance. Deep learning compilers can optimize some of these operations, but they require significant time to search for optimal configurations. To address these issues, we introduce operators in relational algebra to extend matrix multiplications from a database point of view and develop a matrix multiplication-like task library for GPUs. Our library, Gamut, can recognize and generate optimized code for a variety of tasks, and we propose a tile-based model to incorporate relational algebra operators. Through experiments, we demonstrate that Gamut achieves high performance with low compilation overhead compared to both matrix multiplication libraries and deep learning compilers.</p>
      </abstract>
      <kwd-group>
        <kwd>GPU</kwd>
        <kwd>Data Analytics</kwd>
        <kwd>Relational Algebra</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>As memory capacity rises, more data analytical tasks are entirely put into RAM in order to eliminate the bottle</title>
      </sec>
      <sec id="sec-1-2">
        <title>Memory (HBM) coupled with thousands of cores 1[], graphics processing units (GPUs) have increasingly been used for data science tasks2[]. It can provide these tasks</title>
        <p>for (i = 0; i &lt; M; i++)
for (j = 0; j &lt; N; j++)
for (k = 0; k &lt; K; k++)</p>
        <p>C[Pzip[i], Rzip[j]] += P[i, k] * R[k, j];
ence for eating at diferent restaurants.
neck of disk I/O. With the utilization of High-BandwidthFigure 2: Example 2. A task that calculates people’s
preferespecially machine and deep learning.</p>
        <p>CPU, which facilitates the development of data sciencel,earning application. Suppose tha[t, ∶]
with higher memory throughput and parallelism than thewhich is used to amplify strong signals in a machine
represents a
signal vector at locatio nwith features. Each feature
multiplication-like tasks (MMLTs).</p>
        <p>Matrix multiplication is one of the most basic build-corresponds to an intensity at each spectru m[∶., ]
ing blocks in data science tasks3[]. Many tasks can be is a weight vector of observation. Each observation is
represented as simple variants of standard matrix multia- weighted combination of the signal strength of each
plication, which share a similar nestedf“or” loop struc- spectrum. And  [, ]
ture. One typical example is convolution, which usually  at locatio n. This query filters out signals of spectra
is implemented as matrix multiplication using toeplitzwhose weighted intensity is less than the threshold at
matrices [4]. Here are two additional examples of matrix location , and doubles the stronger parts of the scores.
is the score for each observation</p>
      </sec>
      <sec id="sec-1-3">
        <title>The first example (Figure 1) comes from Amulet [ 5, 6],</title>
        <p>for (i = 0; i &lt; M; i++)
for (j = 0; j &lt; N; j++)
for (k = 0; k &lt; K; k++)</p>
        <p>N[i, j] += S[i, k]*W[k, j] +</p>
        <p>(S[i, k]*W[k, j] &gt; thres[i]) * (S[i, k]*W[k, j] - thres[i]);
and generates a weighted combination of signals in diferent
spectra.
nEvelop-O
(K. Ross)</p>
        <sec id="sec-1-3-1">
          <title>Joint Workshops at 49th International Conference on Very Large Data</title>
        </sec>
        <sec id="sec-1-3-2">
          <title>Bases (VLDBW’23) —Workshop on Accelerating Analytics and Data</title>
        </sec>
        <sec id="sec-1-3-3">
          <title>Management Systems (ADMS’23), August 28 - September 1, 2023, Van</title>
          <p>Consider another example (Figure2) from the
recommendation system, which analyzes the willingness of
people to eat at a restaura n[t,.∶]
corresponds to a
weight vector of people ’s taste. Each featur e can
represent people’s preference towards degree of
spiciness, degree of saltiness, etc.[∶, ]
is a style vector of
restauran t, which records the style of dishes this
restaurant advocates .
and 
are the zip code of
people’s addresses and restaurants’ addresses, respectively.
Therefore, [ [], []]</p>
          <p>shows that people living in
diferent areas prefer to eat which area’s restaurants.</p>
        </sec>
      </sec>
      <sec id="sec-1-4">
        <title>Although these tasks can be written by similar nested</title>
        <p>
          loops with standard matrix multiplications, they are not
as well-studied as standard matrix multiplications whose
eficient GPU algorithms have already been encapsulated
by libraries such as cuBLAS [
          <xref ref-type="bibr" rid="ref6">7</xref>
          ] and CUTLASS [8].
Tuning these libraries manually is not flexible enough to fit
of GPU architecture and the large code base of thoseincluding the introduction of relational algebra operators
libraries. that expand the scope of MMLTs. We then discuss the
de
        </p>
        <p>On the other hand, compilers specifically for deep sign of Gamut in Section4 and evaluate its performance
learning have been developed, such as TVM [9]. New in Section5.
operators can be passed to the compiler as nested loops,
and computation optimizations such as loop reordering
and tiling are applied to generate eficient GPU codes. 2. Background
Compared with libraries, compilers are more flexible but In this section, we describe Amulet5[, 6], the prior work
spend more time on compiling, from several minutes to of MMLT on CPU and TVM, a deep learning compiler.
hours depending on the problem size. And for each size, We also present a brief summary of GPU architecture.
codes need to be regenerated. Although some works,
such as ROLLER [10], try to reduce compilation time,
they limit tasks to those that can be expressed as lambd2a.1. Amulet
functions. Amulet [5, 6] is built on an open-source compiler that</p>
        <p>As a result, MMLTs face the trade-of between im- can identify and optimize MMLTs. It applies an adaptive
plementation flexibility and compiling performance strategy that finds the best configuration during
execu[11]. Matrix multiplication libraries can compile once tion. After recognition of MMLT, Amulet transforms
and run multiple times but with limited extendabilityt; he three-level nested loop into a parameterizable tiled
while deep learning compilers can compile more matrix loop. At runtime, a small subset of input data is used to
multiplication-like queries but with high compilationifnd the performant parameters, and then the remaining
time, and the compiled codes cannot be reused for queries computation continues with the found parameters.
with diferent operand sizes. These issues obstruct the ap- Amulet achieves speedups compared to the
state-ofplication of matrix multiplication-like operations in thethe-art compiler and decent performance compared to
machine learning community due to slow performance libraries, but it still has some limitations. As discussed in
and low scalability 5[, 6]. the previous section, Amulet only supports the MMLT</p>
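        <p>To make the transformation concrete, the following C++ sketch (ours, not Amulet's actual generated code) shows the shape of such a parameterizable tiled loop: the tile sizes TM, TN, and TK are the runtime-tunable parameters, and all names are illustrative.</p>
        <p>// A rough sketch of a parameterizable tiled MMLT loop on the CPU.
// TM, TN, TK are the tunable tile-size parameters.
void mmlt_tiled(const float *A, const float *B, float *C,
                int M, int N, int K, int TM, int TN, int TK) {
    for (int i0 = 0; i0 &lt; M; i0 += TM)
        for (int j0 = 0; j0 &lt; N; j0 += TN)
            for (int k0 = 0; k0 &lt; K; k0 += TK)
                // Inner loops touch one cache-friendly tile at a time;
                // the inner arithmetic is whatever the MMLT specifies.
                for (int i = i0; i &lt; i0 + TM &amp;&amp; i &lt; M; i++)
                    for (int j = j0; j &lt; j0 + TN &amp;&amp; j &lt; N; j++)
                        for (int k = k0; k &lt; k0 + TK &amp;&amp; k &lt; K; k++)
                            C[i * N + j] += A[i * K + k] * B[k * N + j];
}</p>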
      </sec>
      <sec id="sec-2-2">
        <title>2.2. TVM</title>
        <p>TVM [9] is a deep learning compiler that enables graph- and operator-level optimizations for a deep learning model on a wide range of different hardware platforms. Operators are translated into nested multi-level loops. After translation, TVM applies optimizations such as loop tiling, cache read and write, etc., to rearrange the nested loops. TVM also optimizes computation graphs via operator fusion and data layout transformation techniques. The combination of all these techniques introduces a large search space. To search efficiently, the hardware is abstracted into a cost-based model, and TVM uses machine learning-based models such as XGBoost [15] to search for optimal configurations. Despite all these search methods, TVM still needs a long time for tuning, ranging from days to weeks, depending on the size of a DNN model [10].</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. GPU Architecture</title>
        <p>GPUs provide the ability to run thousands of threads in parallel by leveraging a multi-core architecture and high-bandwidth memory [13, 14]. The simplified architecture is shown in Figure 3.</p>
        <p>Figure 3: Simplified GPU architecture: threads run on cores with private registers, thread blocks run on streaming multiprocessors (SMs) with shared memory, and all SMs share the L2 cache and global memory.</p>
        <p>To organize a large number of threads, the GPU uses a thread hierarchy. The smallest unit of execution is the thread. Thirty-two threads are automatically grouped into a warp by hardware. Threads within the same warp execute the same instruction, known as SIMT (Single Instruction Multiple Threads). Multiple warps are then grouped into a thread block. The thread block is the basic scheduling unit and runs on a streaming processor. Note that when the resources that a thread block requires exceed the resources a streaming processor can provide, it will fail to run. A program can launch multiple thread blocks to compute.</p>
        <p>Memory has a similar hierarchical structure to threads. The largest but slowest layer is global memory, which is shared across all streaming processors. Programmers can allocate and manage global memory. The L2 cache is also shared by all the thread blocks, but it is not programmable. It is smaller but faster than global memory. The L1 cache, also called shared memory in the programming model, is local to all the threads within one thread block. Threads of one thread block cannot access others' shared memory. Different from the CPU's L1 cache, shared memory is explicitly manageable by programmers. The smallest but fastest layer is the register file, which is local to each thread.</p>
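        <p>As a minimal illustration of this hierarchy (our example, not from the paper), the kernel below computes per-row sums: each thread block stages a tile of global memory in programmable shared memory, and each thread keeps its partial result in a register.</p>
        <p>__global__ void row_sums(const float *in, float *out, int ncols) {
    __shared__ float tile[256];   // shared memory: per-block, programmable
    float acc = 0.0f;             // register: private to this thread
    for (int c0 = 0; c0 &lt; ncols; c0 += blockDim.x) {
        int c = c0 + threadIdx.x;
        // Threads of one block cooperatively load a tile from global memory.
        tile[threadIdx.x] = (c &lt; ncols) ? in[blockIdx.x * ncols + c] : 0.0f;
        __syncthreads();          // block-wide barrier
        if (threadIdx.x == 0)     // one thread reduces the staged tile
            for (int t = 0; t &lt; blockDim.x; t++) acc += tile[t];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = acc;  // one row sum per block
}
// Launch with one block per row and 256 threads per block:
// row_sums&lt;&lt;&lt;nrows, 256&gt;&gt;&gt;(d_in, d_out, ncols);</p>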
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Extension of Matrix Multiplication</title>
      <p>In this section, we define matrix multiplication-like tasks and discuss how to apply relational algebra to MMLTs. We focus our discussion on two-dimensional matrices. For higher-dimensional matrix multiplication, the problem can be decomposed into multiple two-dimensional matrix multiplications, each of which can be launched as a CUDA stream or processed by separate GPUs.</p>
      <sec id="sec-3-1">
        <title>3.1. Generalizing Matrix Multiplication</title>
        <p>The simplest form of matrix multiplication-like tasks involves all three nested "for" loops, but this broad structure is too general to be optimized effectively. To address this, more specific characteristics are identified and utilized to generalize matrix multiplication. Figure 4 shows standard matrix multiplication (a) as a nested loop and (b) as an equivalent relational query:</p>
        <p>for (int i = 0; i &lt; M; i++)
  for (int j = 0; j &lt; N; j++)
    for (int k = 0; k &lt; K; k++)
      C[i, j] += A[i, k] * B[k, j];</p>
        <p>SELECT
  A.ind AS i, B.ind AS j,
  dot(A.a, B.b) AS C[i,j]
FROM A, B</p>
        <p>Figure 4: Standard matrix multiplication as (a) a nested loop and (b) a relational query.</p>
        <p>First, the three axes, typically denoted by i, j, and k, are independent. For instance, the starting point of the i axis is not related to the current state of the k axis. These axes can be categorized into two types: two spatial axes (i and j) and one reduction axis (k). The spatial axes correspond to the rows and columns of the resulting matrix. The reduction axis refers to the dimension that is collapsed during the calculation. In standard matrix multiplication, the reduction axis is typically the column of the first matrix and the row of the second matrix. By classifying the axes in this manner, we can categorize the input and output matrices of MMLTs into four types based on their indexing in the inner operator.</p>
        <p>1. Two spatial axes. It is indexed by i, j or j, i. And this is the only type of output matrix, since the reduction axis is always collapsed after calculation.
2. One spatial axis with one reduction axis. It is indexed by i, k or j, k and vice versa. The input matrices of standard matrix multiplication can be classified into this type.
3. Only one spatial axis. It is indexed by i or j. The input matrix is actually a vector.
4. Only one reduction axis. It is indexed by k.</p>
        <p>While standard matrix multiplication uses the "+=" operator in the inner operation to accumulate the elements along the reduction axis into a single scalar value, MMLTs are not limited to this method. The accumulation operation could take on various forms, such as "*=", maximum, average, and so on, as in the sketch below.</p>
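        <p>For instance, a max-accumulation MMLT looks as follows (an illustrative C++ sketch, not taken from the paper); note that the accumulator is initialized to the identity of "max" rather than to zero.</p>
        <p>#include &lt;algorithm&gt;
#include &lt;limits&gt;

// MMLT whose accumulation operator is "max" instead of "+=".
void mmlt_max(const float *A, const float *B, float *C, int M, int N, int K) {
    for (int i = 0; i &lt; M; i++)
        for (int j = 0; j &lt; N; j++) {
            // Initialize to the identity element of the accumulation operator.
            float acc = -std::numeric_limits&lt;float&gt;::infinity();
            for (int k = 0; k &lt; K; k++)
                acc = std::max(acc, A[i * K + k] * B[k * N + j]);
            C[i * N + j] = acc;
        }
}</p>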
      <sec id="sec-1-6">
        <title>To achieve further generality, we can view MMLT from a database perspective.</title>
        <p>Considering a matrix of size  × 
, we can treat it as
a database table wi th records and  attributes. In the
case of standard matrix multiplication, wher=e 
⊤
and  has size  ×</p>
        <p>, we can interpret this as each row of
the table being dot producted with each row of the 
for i in range(M):
for j in range(N):
for k in range(K):
SELECT
FROM A, B
SELECT
Ad.oti(ndA.aA,S Bi.,b)B.iAnSdc,AS j,
proj(c) AS R[i,j]
A.ind AS i, B.ind AS j,
dot(A.a, B.b) AS C[i,j]
FROM A, B
WHERE C[i,j] &gt; 10
SELECT
SELECT
FROM A, B
GROUP BY A.ai, B.bi
A.ind AS i, B.ind AS j,
dot(A.a, B.b) AS C[i,j]
FROM A, B
WHERE A.ai = B.bi
SUM(dot(A.a, B.b)) AS C[A.ai, B.bi]
same spatial information, for instance, max pooling in</p>
      </sec>
      <sec id="sec-1-7">
        <title>ML. To accomplish this, two additional indexing matri</title>
        <p>ces are required for each spatial axis, which indicate the
spatial information (e.g., column.
and column .
).</p>
      </sec>
      <sec id="sec-1-8">
        <title>The Join operator filters out computations where the</title>
        <p>operands do not satisfy the predicate, thus performing
a check before reduction takes place. It can be used to
cluster data in the same group or classification.</p>
      </sec>
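        <p>To make the Aggregate pattern concrete, the following C++ sketch (ours, reusing the zip-code example of Figure 2) shows how the two indexing arrays map many (i, j) pairs onto one output cell.</p>
        <p>// Aggregate via indexing matrices: Pzip and Rzip select the output group,
// so many (i, j) pairs accumulate into one cell of C (many-to-one).
void aggregate_by_zip(const float *P, const float *R,
                      const int *Pzip, const int *Rzip, float *C,
                      int M, int N, int K, int n_zip) {
    for (int i = 0; i &lt; M; i++)
        for (int j = 0; j &lt; N; j++) {
            float acc = 0.0f;
            for (int k = 0; k &lt; K; k++)
                acc += P[i * K + k] * R[k * N + j];
            C[Pzip[i] * n_zip + Rzip[j]] += acc;  // grouped output position
        }
}</p>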
      <sec id="sec-1-9">
        <title>We can also incorporate duplicate elimination by utilizing a hash table as an output data structure. For ordering a limited number of output records, we can employ a heap data structure.</title>
        <p>and nested loops.
3.2. Applying Relational Algebra
The form of MMLT described in the previous section only 4.1. Fast Matrix Multiplication on GPU
nested loops, shown in Figure5.
bra operators and discuss how to convert them back intoFigure 6: Fast Matrix Multiplication Algorithm on GPU.</p>
        <p>The Project operator corresponds to the inner
product between two row vectors, and can also denote the We first provide a brief overview of CUTLASS’s fast
operation applied to the resulting scalar output. In MmLatrix multiplication algorithm on GPU. The whole
proapplications, it can be used to integrate element-wisceess is shown in Figure 6
functions such as activation layers. ThSeelect
operator determines whether the resulting scalar output wishere  is an  × 
relevant. Typically, the predicate is selective, so the outis- an  × 
Assume that = 
has a problem size of  ×  × 
matrix. The reduction computation of each
put is often a sparse matrix. In ML, it can function as element of  is independent, allowing for the distribution
a filter to eliminate undesired results. TheAggregate</p>
        <p>of computation to separate threads. To maintain good
operator groups together output elements that share tlhoecality, each thread typically computes a block of
elements in  of size   ×   . As a result, a thread must load
gmem→smem</p>
        <p>smem→reg
(a) Loading
(b) Storing
1</p>
        <p>1

 
 
  stripes
smem→gmem
time. Similarly, a   ×   sized matrix is output from</p>
        <p>the shared memory to the global memory. The entire
process consists of  iterations.
a   ×  submatrix from  and a ×   submatrix from small data slices, and the results are accumulated
repeat . However, due to the limited size of registers within edly until the final result is computed and then output
a thread, it may not be possible to hold these matricesh,ierarchically by the storing iterator.
particularly i f is large. Therefore, each thread loads Due to the hierarchical structure of GPU memory, the
only a slice of the matrices at a time, which is a × 1 iterator consists of two components: thread block-level
vector from  and a1 ×   vector from  . data transfer between global memory and shared
mem</p>
        <p>Using the thread hierarchy of the GPU, the algorithmory, and thread-level data transfer between shared
memgroups adjacent threads into thread blocks, assumingory and register. Since the data is categorized into four
a block matrix size of  ×   . Consequently, there types in Section3.1, we can design an iterator for each
are   ×   threads within each thread block. To taketype.</p>
        <p>advantage of the GPU’s memory hierarchy, the algorithm
typically fetches data in a step-by-step manner. Initially,
one thread block fetches a  ×   submatrix from 
and a  ×   submatrix from  into the shared memory.</p>
        <p>Subsequently, each thread retrieves its own vectors to
registers and performs the computations.</p>
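        <p>A compact CUDA sketch of this structure (ours, with illustrative tile sizes, not Gamut's or CUTLASS's actual code): each thread block stages tiles of A and B in shared memory, and each thread accumulates a TM x TN register tile from per-iteration slices.</p>
        <p>#define BK 8
#define TM 4
#define TN 4
// Each 16x16-thread block computes a (16*TM) x (16*TN) output tile.
// Launch: dim3 grid(N / (16*TN), M / (16*TM)), block(16, 16);
__global__ void gemm_tiled(const float *A, const float *B, float *C,
                           int M, int N, int K) {
    __shared__ float As[16 * TM][BK];
    __shared__ float Bs[BK][16 * TN];
    float acc[TM][TN] = {};                       // register tile, init 0
    int row0 = blockIdx.y * 16 * TM + threadIdx.y * TM;
    int col0 = blockIdx.x * 16 * TN + threadIdx.x * TN;

    for (int k0 = 0; k0 &lt; K; k0 += BK) {          // K / BK main-loop steps
        // gmem -&gt; smem: threads cooperatively load the two input tiles.
        for (int m = 0; m &lt; TM; m++)
            for (int kk = threadIdx.x; kk &lt; BK; kk += 16)
                As[threadIdx.y * TM + m][kk] = A[(row0 + m) * K + k0 + kk];
        for (int n = 0; n &lt; TN; n++)
            for (int kk = threadIdx.y; kk &lt; BK; kk += 16)
                Bs[kk][threadIdx.x * TN + n] = B[(k0 + kk) * N + col0 + n];
        __syncthreads();

        // smem -&gt; reg: each step consumes a TM x 1 and a 1 x TN slice.
        for (int kk = 0; kk &lt; BK; kk++)
            for (int m = 0; m &lt; TM; m++)
                for (int n = 0; n &lt; TN; n++)
                    acc[m][n] += As[threadIdx.y * TM + m][kk] *
                                 Bs[kk][threadIdx.x * TN + n];
        __syncthreads();
    }
    // reg -&gt; gmem (staging the stores through smem is elided here).
    for (int m = 0; m &lt; TM; m++)
        for (int n = 0; n &lt; TN; n++)
            C[(row0 + m) * N + col0 + n] = acc[m][n];
}
// Assumes M, N, K are multiples of the tile sizes; boundary checks omitted.</p>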
        <p>After the computations are complete, storing the results is also performed in a step-by-step manner. Since each thread has an M_t x N_t sized register tile, each thread block produces an M_b x N_b sized result, which typically exceeds the size of shared memory. Therefore, only a 1 x N_t sized slice of the resulting matrix from each thread is output from the register to the shared memory at a time. Similarly, the resulting matrix is output in stripes from the shared memory to the global memory. The entire process consists of K / K_b iterations of the main loop.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. MMLT Code Construction</title>
        <p>To better adapt the standard fast algorithm to MMLTs, the algorithm is divided into four components: loading iterators, inner computation, a storing iterator, and a skeleton main loop. The pseudo-code for the main loop is shown in Algorithm 1, which remains unchanged during code generation. The loading iterator for each input loads a tile of data hierarchically from the global memory to the registers. The inner computation of one thread computes via small data slices, and the results are accumulated repeatedly until the final result is computed and then output hierarchically by the storing iterator.</p>
        <p>Algorithm 1 Main Loop Pseudo-Code
1: for k0 := 1 to K step K_b do
2:   Load tiles to shared memory.
3:   TileModelIn(block)
4:   for k := 1 to K_b step 1 do
5:     Load slices to register.
6:     InnerOp(slices)
7:   end for
8:   for r := 1 to M_t step 1 do
9:     Store one row to shared memory.
10:    TileModelOut(block)
11:    Store one block to global memory.
12:   end for
13: end for</p>
        <p>Due to the hierarchical structure of GPU memory, each iterator consists of two components: thread block-level data transfer between global memory and shared memory, and thread-level data transfer between shared memory and registers. Since the data is categorized into four types in Section 3.1, we can design an iterator for each type.</p>
        <p>1. One spatial axis with one reduction axis. This type can reuse the method that the standard matrix multiplication algorithm uses to load its data. Assuming the size of the spatial axis tile is M_b and the size of the reduction axis tile is K_b, the iterator of one thread block loads an M_b x K_b tile from the global memory to the shared memory. The iterator of one thread loads an M_t x 1 slice from the shared memory to the register.
2. Only one reduction axis. Instead of loading a K_b sized vector into the shared memory and then loading one element into the register, we can directly load a K_b sized vector into the register every iteration, since K_b is not large.
3. Only one spatial axis. Since this type of data is irrelevant to the reduction axis, only a single loading operation is required for all the iterations. We can directly load an M_t sized vector into the register once from the global memory in the first iteration. In the following iterations, we can read directly from the register.</p>
        <p>The inner computation then computes a block whose problem size is M_t x N_t x 1. It can directly read from the registers and output to a resulting register tile sized M_t x N_t. The initialization of the resulting register may differ based on the accumulation operator. For instance, summation initializes the register to all zeros, while maximum may initialize the register to the minimum number of the data precision, as sketched below.</p>
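        <p>A minimal sketch (ours, with illustrative tile sizes standing in for M_t and N_t) of this operator-dependent initialization of the per-thread register tile:</p>
        <p>#include &lt;cfloat&gt;   // FLT_MAX
#define MT 8   // stands in for the thread tile size M_t
#define NT 8   // stands in for the thread tile size N_t

// Initialize the accumulator tile to the identity of the accumulation
// operator: 0 for summation, -FLT_MAX (minimum of the precision) for max.
__device__ void init_accumulators(float (&amp;acc)[MT][NT], bool is_max) {
    for (int i = 0; i &lt; MT; i++)
        for (int j = 0; j &lt; NT; j++)
            acc[i][j] = is_max ? -FLT_MAX : 0.0f;
}</p>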
      <sec id="sec-1-10">
        <title>One simple approach for applying relational algebra op</title>
        <p>erators likeProject, Select, and Aggregate involves
storing the intermediate results back into global memory</p>
        <p>To better adapt the standard fast algorithm to MMLTbes,fore launching another kernel to process them.
Howthe algorithm is divided into four components: loading ever, given the opportunity presented by threads within
iterators, inner computation, a storing iterator, andaathread block cooperating to store a tile of data from
skeleton main loop. shared memory to global memory, it is possible to
per</p>
        <p>The pseudo-code for the main loop is shown in Algo- form data transformations within the iterator. To this
rithm 1, which remains unchanged during code genera- end, we propose a tile-based iterator model that
intetion. The loading iterator for each input loads a tile ogfrates these operators with the data loading and storing
data hierarchically from the global memory to the regis-process.
ters. The inner computation of one thread computes via
 [],   []] , 
.</p>
        <p>)</p>
        <p>▷ Compute prefix-sum.
ℎ ]) ▷ atomicAdd returns value before
.</p>
        <p>Category
Direct Mapping
Linear Mapping
Nonlinear Mapping</p>
        <p>After obtaining the intermediate results, we identify For the last case, where the data structure is nonlinear,
two types of mappings: one-to-one mapping and many- like a heap or hash map, a hierarchical structure is used.
to-one mapping. One-to-one mapping, likeProject, can This involves storing a local data structure for each thread
be directly computed in-register after reduction. Our block in shared memory, and a global data structure in
tile-based iterator model, on the other hand, is designedglobal memory, along with a local lock in shared memory
to address the many-to-one mapping. and a global lock in global memory. Each thread first</p>
        <p>Tile-based iterators are categorized based on how theyacquires the local lock of its thread block, outputs its data
index into the output, with the pseudo-code available itno shared memory, and then one thread in the thread
Table 1. The first category is direct mapping, as seen in block acquires the global lock and merges the data in the
Aggregate, where the output position of each elementlocal data structure into the global data structure.
is already known before computation. To facilitate this, We concentrate on natural join for thJeoin operator.
two slices of indexing information, one for each spatial Instead of using a hash map in hash join algorithms1[7,
axis, each sized  , are loaded into shared memory be- 18], a set of CUDA streams is created, with each stream
fore output. During output, we can use the informationfocusing on one key. The whole computation is divided
in shared memory to index, and because multiple ele- into multiple small blocked MMLTs. For one blocked
ments are accumulated into one output position, atomiMcMLT, we initially retrieve the indices of input matrices
primitives provided by CUDA can be directly used. that match the key for which this stream is responsible.</p>
        <p>The two remaining cases of the tile-based iteratorThus, we can select data from global memory based on
model do not have knowledge of the output positionthe computed indices instead of creating an additional
prior to computation. In one of these cases, where thecopy for each blocked MMLT.
data is output to a linear data structure, sucShelaesct,
a global counter is used to indicate the next output po4si.-4. Growing Library
tion. The number of outputs for each thread is computed
and stored in a local counter array in shared memory.We use a declarative syntax similar to AMULET5,[6], as
Then, a single thread computes the prefix sum [16] and illustrated in Figure7, but with the added requirement
reserves a block for the thread block in global memory for users to specify both the spatial and reduction axes
by incrementing the global counter usinga“tomicAdd”. explicitly. The range of   in the syntax denote[s  ,   ).
Each thread can then determine its output position bTyhe loop in Figure 2 is rewritten in declarative form in
adding the prefix sum of the corresponding thread to the Figure 8.
start pointer of the reserved block. Gamut generates code from the declarative syntax by</p>
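        <p>The kernel below is a minimal sketch (ours, not Gamut's actual iterator code) of this linear-mapping output path: each thread counts its qualifying results, one thread computes the prefix sum and reserves a contiguous output block with atomicAdd (which returns the counter value before the add), and every thread then writes at its own offset. All names are illustrative.</p>
        <p>// Launch with up to 256 threads per block to match the counter array.
__global__ void select_store(const float *vals, int n_per_thread,
                             float thres, float *out, int *global_counter) {
    __shared__ int counts[256];       // one slot per thread in the block
    __shared__ int base;              // start of the reserved output block
    int tid = threadIdx.x;

    // 1. Each thread counts how many of its results pass the predicate.
    int my_count = 0;
    for (int i = 0; i &lt; n_per_thread; i++)
        if (vals[(blockIdx.x * blockDim.x + tid) * n_per_thread + i] &gt; thres)
            my_count++;
    counts[tid] = my_count;
    __syncthreads();

    // 2. One thread turns counts into an exclusive prefix sum and reserves
    //    a block of output slots; atomicAdd returns the value before the add.
    if (tid == 0) {
        int running = 0;
        for (int t = 0; t &lt; blockDim.x; t++) {
            int c = counts[t];
            counts[t] = running;      // exclusive prefix sum
            running += c;
        }
        base = atomicAdd(global_counter, running);
    }
    __syncthreads();

    // 3. Each thread writes its qualifying results at its reserved offset.
    int pos = base + counts[tid];
    for (int i = 0; i &lt; n_per_thread; i++) {
        float v = vals[(blockIdx.x * blockDim.x + tid) * n_per_thread + i];
        if (v &gt; thres) out[pos++] = v;
    }
}</p>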
        <p>For the last case, where the data structure is nonlinear, like a heap or hash map, a hierarchical structure is used. This involves storing a local data structure for each thread block in shared memory and a global data structure in global memory, along with a local lock in shared memory and a global lock in global memory. Each thread first acquires the local lock of its thread block, outputs its data to shared memory, and then one thread in the thread block acquires the global lock and merges the data in the local data structure into the global data structure; the sketch at the end of this subsection illustrates the locking pattern.</p>
        <p>We concentrate on natural join for the Join operator. Instead of using a hash map as in hash join algorithms [17, 18], a set of CUDA streams is created, with each stream focusing on one key. The whole computation is divided into multiple small blocked MMLTs. For one blocked MMLT, we initially retrieve the indices of the input matrices that match the key for which this stream is responsible. Thus, we can select data from global memory based on the computed indices instead of creating an additional copy for each blocked MMLT.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Growing Library</title>
        <p>We use a declarative syntax similar to Amulet [5, 6], as illustrated in Figure 7, but with the added requirement for users to specify both the spatial and reduction axes explicitly. A range [s, e) in the syntax denotes the half-open interval from s (inclusive) to e (exclusive). The loop in Figure 2 is rewritten in declarative form in Figure 8.</p>
        <p>Gamut generates code from the declarative syntax by first parsing the inner computation and selecting iterators based on the indexing type. Checks are also performed during parsing to ensure the query conforms to the MMLT definition. These checks ensure that there are only three axes and that two spatial axes are associated with the output matrix. Additionally, input matrices are identified as different types, and each input matrix is assigned an input iterator based on its recognized type. Then, the library recognizes the relational algebra operator pattern and selects the appropriate algorithm for that operator, which is fused into the output iterators.</p>
        <p>The code generated by Gamut is parameterized by the set of parameters described in Section 4.1: M_b, N_b, K_b, M_t, and N_t. However, not all combinations can be run on a GPU due to limitations on the number of registers, the shared memory size, and the number of threads inside one thread block. To reduce the search space, we use a heuristic from CUTLASS that favors square-shaped thread tiles by fixing M_b : N_b = 1 : 1 and M_t : N_t = 1 : 1. Through experiments, we found that the problem size has little effect on the best configuration when the launched thread blocks exceed the number of multiprocessors in the GPU. The thread block under the best configuration tends to be close to the largest runnable setting. This is because larger thread blocks result in fewer launched thread blocks and fewer thread blocks running on each multiprocessor, which in turn reduces the overhead of context switching and minimizes cache thrashing while maximizing the throughput. Therefore, we use a scale-up-before-scale-out principle to avoid blind searching.</p>
        <p>Upon installation of the library, Gamut finds the best parameters for standard matrix multiplication and saves them as starting parameters, taking into account the resource differences among various GPUs. For a new MMLT query, the library searches for the best configuration starting from the saved parameters. If the starting parameters cannot fully occupy a streaming processor of the GPU, Gamut scales up by increasing the size of tiles. Otherwise, for applications that have more data in each thread block, Gamut decreases the tile footprint until the setting becomes runnable. The occupancy can be calculated using the compilation information provided by the compiler and the hardware specification [19]. If there are multiple optional configurations, Gamut launches one thread block for each configuration and selects the configuration with the best performance.</p>
        <p>GPU code compilation takes a large amount of time compared to computation, even if the parameters are fixed. For example, a standard matrix multiplication with a problem size of 4096 x 4096 x 4096 takes 5 seconds for compilation and only 80 milliseconds for computation. To amortize the compilation time, the parsing and parameter information, along with a hashing summary of the parse tree, are stored with the compiled object file into storage. This information is used to compare parse trees when encountering similar queries, and if a duplicate query is encountered, the compiled object is linked directly, thus saving compilation time.</p>
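        <p>A hypothetical C++ sketch of this growing-library lookup (ours; the class, fields, and helper below are illustrative, not Gamut's API): queries are keyed by a hash of their parse tree, and on a hit the stored object file is linked instead of recompiling.</p>
        <p>#include &lt;string&gt;
#include &lt;unordered_map&gt;

struct CompiledEntry {
    std::string object_path;   // compiled GPU object file on disk
    int mb, nb, kb, mt, nt;    // saved tile parameters
};

class GrowingLibrary {
    std::unordered_map&lt;size_t, CompiledEntry&gt; cache_;  // parse-tree hash -&gt; entry
public:
    // Returns a cached entry, or compiles, stores, and returns a new one.
    const CompiledEntry &amp;lookup_or_compile(size_t parse_tree_hash,
                                           const std::string &amp;query) {
        auto it = cache_.find(parse_tree_hash);
        if (it != cache_.end())
            return it-&gt;second;       // duplicate query: link object directly
        CompiledEntry e = compile_and_search(query);  // slow path (~seconds)
        return cache_.emplace(parse_tree_hash, e).first-&gt;second;
    }
private:
    CompiledEntry compile_and_search(const std::string &amp;query);  // assumed
};</p>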
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Evaluation</title>
      <p>The core of Gamut is implemented in CUDA C++, and it utilizes a custom parser built with ANTLR [20] and a search algorithm implemented with CuPy [21]. The code is available at https://github.com/xxcisxxc/GAMUT-release.</p>
      <p>All experiments are conducted on an Nvidia T4 GPU with an Intel Haswell CPU running Ubuntu 22.04 and CUDA 11.7. It is assumed that the GPU memory is sufficient to accommodate all data. Each experiment is repeated 10 times, and the average execution time is reported.</p>
      <p>The same performance measurement as Amulet, the Scaled Processing Rate (SPR), is used [5, 6]. For a problem size of M x N x K and an execution time of t seconds, SPR is calculated as MNK / (t * 10^9). A higher SPR indicates better performance.</p>
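      <p>As a concrete illustration of the metric (our arithmetic, using the compilation example from Section 4.4): a standard matrix multiplication of problem size 4096 x 4096 x 4096 that runs in 80 milliseconds has SPR = 4096^3 / (0.080 * 10^9), which is approximately 859.</p>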
      <sec id="sec-5-1">
        <title>5.1. Standard Matrix Multiplication Performance</title>
        <p>To begin with, we demonstrate that our library performs comparably to state-of-the-art matrix multiplication libraries on standard matrix multiplication tasks. We conducted experiments with square matrices (i.e., M = N = K) of orders 1024, 2048, 4096, 8192, 16384, and 32768, referred to as 1k, 2k, 4k, 8k, 16k, and 32k, respectively. Three baselines were used in the experiments: cuBLAS, which is the most widely used library for linear algebra on GPUs; CUTLASS, an open-source matrix multiplication library for GPUs with similar performance to cuBLAS; and TVM, a deep learning compiler. We test two different settings for TVM: one where we allow TVM to search from scratch with a fixed number of searches, and another where we provide TVM with CUTLASS's default configuration to make an apples-to-apples comparison with our library. As Gamut focuses on optimizing the standard cubic matrix multiplication algorithm on GPUs, we do not compare it with algorithms such as the Strassen algorithm.</p>
        <p>The performance comparison between our library and the three other baselines is shown in Figure 9 (SPR for matrix orders 1K through 32K, comparing cuBLAS, CUTLASS, GAMUT, and TVM with and without a given configuration). The results reveal that cuBLAS and CUTLASS perform slightly better than our library due to their optimization techniques, such as avoiding memory bank conflicts and software pipelining. These techniques, however, require a specific data layout and a larger shared memory footprint, resulting in reduced flexibility. TVM has reasonable performance when the configuration is given. For TVM without a configuration, its compilation time increases with the problem size, as shown by Table 2. As a result, our library is able to achieve high performance while keeping the compilation time constant and significantly lower than that of TVM.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Performance on General Queries</title>
        <p>To investigate the performance of other MMLT queries, we examined the set of queries presented in Figure 10. Queries 1 and 2 are the examples from the introduction. Query 3 calculates the number of people-restaurant pairs within two areas where the mutual preference exceeds a threshold; it utilizes the Project and Aggregate operators. Query 4 only considers people and restaurants in the same area and involves a Join operation. Query 5 outputs individual people-restaurant pairs whose mutual preference is above a threshold. We assume that the threshold is selective, meaning only a few pairs qualify, resulting in a sparse matrix. Query 6 is similar to Query 5, but with a different arithmetic, seeking restaurants with complementary styles, where the style vector is normalized. Therefore, two style vectors can be combined through the maximum of each feature, and the dot product of the combined vector will have a high absolute value if the two vectors are complementary. Finally, the last query returns the top 128 mutual preferences, so the heap size is 128.</p>
        <p>Query 1
where (i in [0, M], j in [0, N]) {
  accum = 0;
  reduce (k in [0, K]) {
    accum += S[i, k]*W[k, j] +
      (S[i, k]*W[k, j] &gt; thres[i]) * (S[i, k]*W[k, j] - thres[i]);
  }
  N[i, j] += accum;
}

Query 4
where (i in [0, M], j in [0, N]) when (Pzip[i] == Rzip[j]) {
  accum = 0;
  reduce (k in [0, K]) {
    accum += P[i, k] * R[k, j];
  }
  C[Pzip[i], Rzip[j]] += accum;
}

Query 5
where (i in [0, M], j in [0, N]) {
  accum = 0;
  reduce (k in [0, K]) {
    accum += P[i, k] * R[k, j];
  }
  accum &gt; thres ? C_sparse.add(accum);
}

Query 6
where (i in [0, M], j in [0, N]) {
  accum = 0;
  reduce (k in [0, K]) {
    accum += max(R[i, k], R[k, j]) * max(R[i, k], R[k, j]);
  }
  accum &gt; thres ? C_sparse.add(accum);
}

Query 7
where (i in [0, M], j in [0, N]) {
  accum = 0;
  reduce (k in [0, K]) {
    accum += P[i, k] * R[k, j];
  }
  min_heap_128.add(accum);
}</p>
        <p>Figure 10: MMLT queries used in the evaluation, written in Gamut's declarative syntax.</p>
        <p>Different from Amulet [5, 6], compilation of MMLTs on GPUs is more involved than on CPUs because performance requires the efficient use of parallel resources. Tools such as TVM provide infrastructure to help programmers avoid low-level parallel programming. However, due to the limitations of TVM, we were only able to compare Gamut with TVM on Query 1. This is because TVM does not provide access to atomic or lock primitives, as confirmed by the TVM community [22], preventing the compilation of the other queries.</p>
        <p>The results of all these queries are shown in Figure 11. They show that TVM without the default configuration has much lower performance. This may be because TVM does not optimize the loading of the extra threshold input. Query 1 is approximately 3 times slower than standard matrix multiplication, even though it only modifies the inner loop computation. This is most likely because the modified computation does not fit the normal matrix-multiply-add structure that is optimized by specialized hardware in the GPU, leading to lower instruction throughput. Query 4 has a different performance pattern compared to the other queries because it is broken down into multiple smaller matrix multiplications, resulting in less computation than full multiplications. Therefore, it has a much higher SPR and is slower to converge. In the case of Query 7, it exhibits low performance.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Acknowledgments</title>
      <sec id="sec-2-1">
        <title>This work was supported by the National Science Foundation under grant IIS-2008295 and by a gift from Oracle.</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <article-title>boosting system</article-title>
          ,
          <source>in: Proceedings of the 22nd</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <source>edge Discovery and Data Mining</source>
          , KDD '16, Asso-
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>USA</surname>
          </string-name>
          ,
          <year>2016</year>
          , p.
          <fpage>785</fpage>
          -
          <lpage>794</lpage>
          . URL: https://doi.org/10.1145/
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          2939672.2939785. doi:
          <volume>10</volume>
          .1145/2939672.2939785. [16]
          <string-name>
            <given-names>M.</given-names>
            <surname>Harris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sengupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Owens</surname>
          </string-name>
          , Parallel prefix
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <article-title>sum (scan) with cuda</article-title>
          ,
          <source>GPU gems 3</source>
          (
          <year>2007</year>
          )
          <fpage>851</fpage>
          -
          <lpage>876</lpage>
          . [17]
          <string-name>
            <given-names>C.</given-names>
            <surname>Balkesen</surname>
          </string-name>
          , G. Alonso,
          <string-name>
            <given-names>J.</given-names>
            <surname>Teubner</surname>
          </string-name>
          , M. T. Özsu,
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>volume 7</source>
          ,
          <string-name>
            <given-names>VLDB</given-names>
            <surname>Endowment</surname>
          </string-name>
          ,
          <year>2013</year>
          , pp.
          <fpage>85</fpage>
          -
          <lpage>96</lpage>
          . [18]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kaldewey</surname>
          </string-name>
          , G. Lohman,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mueller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Volk</surname>
          </string-name>
          , Gpu
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <source>ment on New Hardware</source>
          ,
          <year>2012</year>
          , pp.
          <fpage>55</fpage>
          -
          <lpage>62</lpage>
          . [19]
          <string-name>
            <surname>NVIDIA</surname>
          </string-name>
          , NVCC :: CUDA Toolkit Documen-
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>tation</surname>
          </string-name>
          ,
          <year>2023</year>
          . URL:https://docs.nvidia.com/cuda/
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <article-title>cuda-compiler-driver-nvcc/</article-title>
          . [20]
          <string-name>
            <given-names>T.</given-names>
            <surname>Parr</surname>
          </string-name>
          ,
          <article-title>Antlr (another tool for language recogni-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <source>tion)</source>
          ,
          <year>2023</year>
          . URL:https://www.antlr.org./ [21]
          <string-name>
            <given-names>R.</given-names>
            <surname>Okuta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Unno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nishino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hido</surname>
          </string-name>
          , C. Loomis,
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <source>tion Processing Systems (NIPS)</source>
          ,
          <year>2017</year>
          . URL:http://
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          learningsys.org/nips17/assets/papers/paper_16.pd.f [22]
          <string-name>
            <given-names>T.</given-names>
            <surname>Community</surname>
          </string-name>
          , Discussion in tvm community,
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>2022. URL: https://discuss.tvm.apache.org/t/</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>non-</surname>
          </string-name>
          predefined-reduction/13719.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>