<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Multiplication-Like Tasks on GPUs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xincheng Xie</string-name>
          <email>xie.xincheng@columbia.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Junyoung Kim</string-name>
          <email>junyoung2@cs.columbia.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
<string-name>Kenneth Ross</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>GPU, Data Analytics, Relational Algebra</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Columbia University</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Workshop Proce dings</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>couver</institution>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <conference>
        <conf-name>Joint Workshops at the 49th International Conference on Very Large Data Bases (VLDBW'23): Workshop on Accelerating Analytics and Data Management Systems (ADMS'23)</conf-name>
        <conf-date>August 28 - September 1, 2023</conf-date>
        <conf-loc>Vancouver, Canada</conf-loc>
      </conference>
      <abstract>
<p>Data scientists are often eager to develop new operators for data science and machine learning applications, which can be written as matrix multiplication-like nested loops. However, these matrix multiplication-like tasks are not supported in standard libraries and often struggle to achieve high performance. Deep learning compilers can optimize some of these operations, but they require significant time to search for optimal configurations. To address these issues, we introduce operators in relational algebra to extend matrix multiplications from a database point of view and develop a matrix multiplication-like task library for GPUs. Our library, Gamut, can recognize and generate optimized code for a variety of tasks, and we propose a tile-based model to incorporate relational algebra operators. Through experiments, we demonstrate that Gamut achieves high performance with low compilation overhead compared to both matrix multiplication libraries and deep learning compilers.</p>
      </abstract>
      <kwd-group>
        <kwd>GPU</kwd>
        <kwd>Data Analytics</kwd>
        <kwd>Relational Algebra</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>As memory capacity rises, more data analytical tasks are entirely put into RAM in order to eliminate the bottle</title>
      </sec>
      <sec id="sec-1-2">
        <title>Memory (HBM) coupled with thousands of cores 1[], graphics processing units (GPUs) have increasingly been used for data science tasks2[]. It can provide these tasks</title>
        <p>for (i = 0; i &lt; M; i++)
for (j = 0; j &lt; N; j++)
for (k = 0; k &lt; K; k++)</p>
        <p>C[Pzip[i], Rzip[j]] += P[i, k] * R[k, j];
ence for eating at diferent restaurants.
neck of disk I/O. With the utilization of High-BandwidthFigure 2: Example 2. A task that calculates people’s
preferespecially machine and deep learning.</p>
        <p>CPU, which facilitates the development of data sciencel,earning application. Suppose tha[t, ∶]
with higher memory throughput and parallelism than thewhich is used to amplify strong signals in a machine
represents a
signal vector at locatio nwith features. Each feature
multiplication-like tasks (MMLTs).</p>
        <p>Matrix multiplication is one of the most basic build-corresponds to an intensity at each spectru m[∶., ]
ing blocks in data science tasks3[]. Many tasks can be is a weight vector of observation. Each observation is
represented as simple variants of standard matrix multia- weighted combination of the signal strength of each
plication, which share a similar nestedf“or” loop struc- spectrum. And  [, ]
ture. One typical example is convolution, which usually  at locatio n. This query filters out signals of spectra
is implemented as matrix multiplication using toeplitzwhose weighted intensity is less than the threshold at
matrices [4]. Here are two additional examples of matrix location , and doubles the stronger parts of the scores.
is the score for each observation</p>
      </sec>
      <sec id="sec-1-3">
        <title>The first example (Figure 1) comes from Amulet [ 5, 6],</title>
        <p>for (i = 0; i &lt; M; i++)
for (j = 0; j &lt; N; j++)
for (k = 0; k &lt; K; k++)</p>
        <p>N[i, j] += S[i, k]*W[k, j] +</p>
        <p>(S[i, k]*W[k, j] &gt; thres[i]) * (S[i, k]*W[k, j] - thres[i]);
and generates a weighted combination of signals in diferent
spectra.
nEvelop-O
(K. Ross)</p>
        <sec id="sec-1-3-1">
          <title>Joint Workshops at 49th International Conference on Very Large Data</title>
        </sec>
        <sec id="sec-1-3-2">
          <title>Bases (VLDBW’23) —Workshop on Accelerating Analytics and Data</title>
        </sec>
        <sec id="sec-1-3-3">
          <title>Management Systems (ADMS’23), August 28 - September 1, 2023, Van</title>
          <p>Consider another example (Figure2) from the
recommendation system, which analyzes the willingness of
people to eat at a restaura n[t,.∶]
corresponds to a
weight vector of people ’s taste. Each featur e can
represent people’s preference towards degree of
spiciness, degree of saltiness, etc.[∶, ]
is a style vector of
restauran t, which records the style of dishes this
restaurant advocates .
and 
are the zip code of
people’s addresses and restaurants’ addresses, respectively.
Therefore, [ [], []]</p>
          <p>shows that people living in
diferent areas prefer to eat which area’s restaurants.</p>
        </sec>
      </sec>
      <sec id="sec-1-4">
        <title>Although these tasks can be written by similar nested</title>
        <p>
          loops with standard matrix multiplications, they are not
as well-studied as standard matrix multiplications whose
eficient GPU algorithms have already been encapsulated
by libraries such as cuBLAS [
          <xref ref-type="bibr" rid="ref6">7</xref>
          ] and CUTLASS [8].
Tuning these libraries manually is not flexible enough to fit
of GPU architecture and the large code base of thoseincluding the introduction of relational algebra operators
libraries. that expand the scope of MMLTs. We then discuss the
de
        </p>
        <p>On the other hand, compilers specifically for deep sign of Gamut in Section4 and evaluate its performance
learning have been developed, such as TVM [9]. New in Section5.
operators can be passed to the compiler as nested loops,
and computation optimizations such as loop reordering
and tiling are applied to generate eficient GPU codes. 2. Background
Compared with libraries, compilers are more flexible but In this section, we describe Amulet5[, 6], the prior work
spend more time on compiling, from several minutes to of MMLT on CPU and TVM, a deep learning compiler.
hours depending on the problem size. And for each size, We also present a brief summary of GPU architecture.
codes need to be regenerated. Although some works,
such as ROLLER [10], try to reduce compilation time,
they limit tasks to those that can be expressed as lambd2a.1. Amulet
functions. Amulet [5, 6] is built on an open-source compiler that</p>
        <p>As a result, MMLTs face the trade-of between im- can identify and optimize MMLTs. It applies an adaptive
plementation flexibility and compiling performance strategy that finds the best configuration during
execu[11]. Matrix multiplication libraries can compile once tion. After recognition of MMLT, Amulet transforms
and run multiple times but with limited extendabilityt; he three-level nested loop into a parameterizable tiled
while deep learning compilers can compile more matrix loop. At runtime, a small subset of input data is used to
multiplication-like queries but with high compilationifnd the performant parameters, and then the remaining
time, and the compiled codes cannot be reused for queries computation continues with the found parameters.
with diferent operand sizes. These issues obstruct the ap- Amulet achieves speedups compared to the
state-ofplication of matrix multiplication-like operations in thethe-art compiler and decent performance compared to
machine learning community due to slow performance libraries, but it still has some limitations. As discussed in
and low scalability 5[, 6]. the previous section, Amulet only supports the MMLT</p>
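        <p>To make the transformation concrete, the following C++ sketch (ours, not Amulet's actual generated code) shows the shape of such a parameterizable tiled loop: the tile sizes TM, TN, and TK are the runtime-tunable parameters, and all names are illustrative.</p>
        <p>// A rough sketch of a parameterizable tiled MMLT loop on the CPU.
// TM, TN, TK are the tunable tile-size parameters.
void mmlt_tiled(const float *A, const float *B, float *C,
                int M, int N, int K, int TM, int TN, int TK) {
    for (int i0 = 0; i0 &lt; M; i0 += TM)
        for (int j0 = 0; j0 &lt; N; j0 += TN)
            for (int k0 = 0; k0 &lt; K; k0 += TK)
                // Inner loops touch one cache-friendly tile at a time;
                // the inner arithmetic is whatever the MMLT specifies.
                for (int i = i0; i &lt; i0 + TM &amp;&amp; i &lt; M; i++)
                    for (int j = j0; j &lt; j0 + TN &amp;&amp; j &lt; N; j++)
                        for (int k = k0; k &lt; k0 + TK &amp;&amp; k &lt; K; k++)
                            C[i * N + j] += A[i * K + k] * B[k * N + j];
}</p>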
      </sec>
      <sec id="sec-2-2">
        <title>2.2. TVM</title>
        <p>TVM [9] is a deep learning compiler that enables graph- and operator-level optimizations for a deep learning model on a wide range of different hardware platforms. Operators are translated into nested multi-level loops. After translation, TVM applies optimizations such as loop tiling, cache read and write, etc., to rearrange the nested loops. TVM also optimizes computation graphs via operator fusion and data layout transformation techniques. The combination of all these techniques introduces a large search space. To search efficiently, the hardware is abstracted into a cost-based model, and TVM uses machine learning-based models such as XGBoost [15] to search for optimal configurations. Despite all these search methods, TVM still needs a long time for tuning, ranging from days to weeks, depending on the size of a DNN model [10].</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. GPU Architecture</title>
        <p>GPUs provide the ability to run thousands of threads in parallel by leveraging a multi-core architecture and high-bandwidth memory [13, 14]. The simplified architecture is shown in Figure 3.</p>
        <p>Figure 3: Simplified GPU architecture: threads run on cores with private registers, thread blocks run on streaming multiprocessors (SMs) with shared memory, and all SMs share the L2 cache and global memory.</p>
        <p>To organize a large number of threads, the GPU uses a thread hierarchy. The smallest unit of execution is the thread. Thirty-two threads are automatically grouped into a warp by hardware. Threads within the same warp execute the same instruction, known as SIMT (Single Instruction Multiple Threads). Multiple warps are then grouped into a thread block. The thread block is the basic scheduling unit and runs on a streaming processor. Note that when the resources that a thread block requires exceed the resources a streaming processor can provide, it will fail to run. A program can launch multiple thread blocks to compute.</p>
        <p>Memory has a similar hierarchical structure to threads. The largest but slowest layer is global memory, which is shared across all streaming processors. Programmers can allocate and manage global memory. The L2 cache is also shared by all the thread blocks, but it is not programmable. It is smaller but faster than global memory. The L1 cache, also called shared memory in the programming model, is local to all the threads within one thread block. Threads of one thread block cannot access others' shared memory. Different from the CPU's L1 cache, shared memory is explicitly manageable by programmers. The smallest but fastest layer is the register file, which is local to each thread.</p>
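        <p>As a minimal illustration of this hierarchy (our example, not from the paper), the kernel below computes per-row sums: each thread block stages a tile of global memory in programmable shared memory, and each thread keeps its partial result in a register.</p>
        <p>__global__ void row_sums(const float *in, float *out, int ncols) {
    __shared__ float tile[256];   // shared memory: per-block, programmable
    float acc = 0.0f;             // register: private to this thread
    for (int c0 = 0; c0 &lt; ncols; c0 += blockDim.x) {
        int c = c0 + threadIdx.x;
        // Threads of one block cooperatively load a tile from global memory.
        tile[threadIdx.x] = (c &lt; ncols) ? in[blockIdx.x * ncols + c] : 0.0f;
        __syncthreads();          // block-wide barrier
        if (threadIdx.x == 0)     // one thread reduces the staged tile
            for (int t = 0; t &lt; blockDim.x; t++) acc += tile[t];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = acc;  // one row sum per block
}
// Launch with one block per row and 256 threads per block:
// row_sums&lt;&lt;&lt;nrows, 256&gt;&gt;&gt;(d_in, d_out, ncols);</p>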
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Extension of Matrix Multiplication</title>
      <p>In this section, we define matrix multiplication-like tasks and discuss how to apply relational algebra to MMLTs. We focus our discussion on two-dimensional matrices. For higher-dimensional matrix multiplication, the problem can be decomposed into multiple two-dimensional matrix multiplications, each of which can be launched as a CUDA stream or processed by separate GPUs.</p>
      <sec id="sec-3-1">
        <title>3.1. Generalizing Matrix Multiplication</title>
        <p>The simplest form of matrix multiplication-like tasks involves all three nested "for" loops, but this broad structure is too general to be optimized effectively. To address this, more specific characteristics are identified and utilized to generalize matrix multiplication. Figure 4 shows standard matrix multiplication (a) as a nested loop and (b) as an equivalent relational query:</p>
        <p>for (int i = 0; i &lt; M; i++)
  for (int j = 0; j &lt; N; j++)
    for (int k = 0; k &lt; K; k++)
      C[i, j] += A[i, k] * B[k, j];</p>
        <p>SELECT
  A.ind AS i, B.ind AS j,
  dot(A.a, B.b) AS C[i,j]
FROM A, B</p>
        <p>Figure 4: Standard matrix multiplication as (a) a nested loop and (b) a relational query.</p>
        <p>First, the three axes, typically denoted by i, j, and k, are independent. For instance, the starting point of the i axis is not related to the current state of the k axis. These axes can be categorized into two types: two spatial axes (i and j) and one reduction axis (k). The spatial axes correspond to the rows and columns of the resulting matrix. The reduction axis refers to the dimension that is collapsed during the calculation. In standard matrix multiplication, the reduction axis is typically the column of the first matrix and the row of the second matrix. By classifying the axes in this manner, we can categorize the input and output matrices of MMLTs into four types based on their indexing in the inner operator.</p>
        <p>1. Two spatial axes. It is indexed by i, j or j, i. And this is the only type of output matrix, since the reduction axis is always collapsed after calculation.
2. One spatial axis with one reduction axis. It is indexed by i, k or j, k and vice versa. The input matrices of standard matrix multiplication can be classified into this type.
3. Only one spatial axis. It is indexed by i or j. The input matrix is actually a vector.
4. Only one reduction axis. It is indexed by k.</p>
        <p>While standard matrix multiplication uses the "+=" operator in the inner operation to accumulate the elements along the reduction axis into a single scalar value, MMLTs are not limited to this method. The accumulation operation could take on various forms, such as "*=", maximum, average, and so on, as in the sketch below.</p>
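        <p>For instance, a max-accumulation MMLT looks as follows (an illustrative C++ sketch, not taken from the paper); note that the accumulator is initialized to the identity of "max" rather than to zero.</p>
        <p>#include &lt;algorithm&gt;
#include &lt;limits&gt;

// MMLT whose accumulation operator is "max" instead of "+=".
void mmlt_max(const float *A, const float *B, float *C, int M, int N, int K) {
    for (int i = 0; i &lt; M; i++)
        for (int j = 0; j &lt; N; j++) {
            // Initialize to the identity element of the accumulation operator.
            float acc = -std::numeric_limits&lt;float&gt;::infinity();
            for (int k = 0; k &lt; K; k++)
                acc = std::max(acc, A[i * K + k] * B[k * N + j]);
            C[i * N + j] = acc;
        }
}</p>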
      <sec id="sec-1-6">
        <title>To achieve further generality, we can view MMLT from a database perspective.</title>
        <p>Considering a matrix of size  × 
, we can treat it as
a database table wi th records and  attributes. In the
case of standard matrix multiplication, wher=e 
⊤
and  has size  ×</p>
        <p>, we can interpret this as each row of
the table being dot producted with each row of the 
for i in range(M):
for j in range(N):
for k in range(K):
SELECT
FROM A, B
SELECT
Ad.oti(ndA.aA,S Bi.,b)B.iAnSdc,AS j,
proj(c) AS R[i,j]
A.ind AS i, B.ind AS j,
dot(A.a, B.b) AS C[i,j]
FROM A, B
WHERE C[i,j] &gt; 10
SELECT
SELECT
FROM A, B
GROUP BY A.ai, B.bi
A.ind AS i, B.ind AS j,
dot(A.a, B.b) AS C[i,j]
FROM A, B
WHERE A.ai = B.bi
SUM(dot(A.a, B.b)) AS C[A.ai, B.bi]
same spatial information, for instance, max pooling in</p>
      </sec>
      <sec id="sec-1-7">
        <title>ML. To accomplish this, two additional indexing matri</title>
        <p>ces are required for each spatial axis, which indicate the
spatial information (e.g., column.
and column .
).</p>
      </sec>
      <sec id="sec-1-8">
        <title>The Join operator filters out computations where the</title>
        <p>operands do not satisfy the predicate, thus performing
a check before reduction takes place. It can be used to
cluster data in the same group or classification.</p>
      </sec>
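        <p>To make the Aggregate pattern concrete, the following C++ sketch (ours, reusing the zip-code example of Figure 2) shows how the two indexing arrays map many (i, j) pairs onto one output cell.</p>
        <p>// Aggregate via indexing matrices: Pzip and Rzip select the output group,
// so many (i, j) pairs accumulate into one cell of C (many-to-one).
void aggregate_by_zip(const float *P, const float *R,
                      const int *Pzip, const int *Rzip, float *C,
                      int M, int N, int K, int n_zip) {
    for (int i = 0; i &lt; M; i++)
        for (int j = 0; j &lt; N; j++) {
            float acc = 0.0f;
            for (int k = 0; k &lt; K; k++)
                acc += P[i * K + k] * R[k * N + j];
            C[Pzip[i] * n_zip + Rzip[j]] += acc;  // grouped output position
        }
}</p>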
      <sec id="sec-1-9">
        <title>We can also incorporate duplicate elimination by utilizing a hash table as an output data structure. For ordering a limited number of output records, we can employ a heap data structure.</title>
        <p>and nested loops.
3.2. Applying Relational Algebra
The form of MMLT described in the previous section only 4.1. Fast Matrix Multiplication on GPU
nested loops, shown in Figure5.
bra operators and discuss how to convert them back intoFigure 6: Fast Matrix Multiplication Algorithm on GPU.</p>
        <p>The Project operator corresponds to the inner
product between two row vectors, and can also denote the We first provide a brief overview of CUTLASS’s fast
operation applied to the resulting scalar output. In MmLatrix multiplication algorithm on GPU. The whole
proapplications, it can be used to integrate element-wisceess is shown in Figure 6
functions such as activation layers. ThSeelect
operator determines whether the resulting scalar output wishere  is an  × 
relevant. Typically, the predicate is selective, so the outis- an  × 
Assume that = 
has a problem size of  ×  × 
matrix. The reduction computation of each
put is often a sparse matrix. In ML, it can function as element of  is independent, allowing for the distribution
a filter to eliminate undesired results. TheAggregate</p>
        <p>of computation to separate threads. To maintain good
operator groups together output elements that share tlhoecality, each thread typically computes a block of
elements in  of size   ×   . As a result, a thread must load
gmem→smem</p>
        <p>smem→reg
(a) Loading
(b) Storing
1</p>
        <p>1

 
 
  stripes
smem→gmem
time. Similarly, a   ×   sized matrix is output from</p>
        <p>the shared memory to the global memory. The entire
process consists of  iterations.
a   ×  submatrix from  and a ×   submatrix from small data slices, and the results are accumulated
repeat . However, due to the limited size of registers within edly until the final result is computed and then output
a thread, it may not be possible to hold these matricesh,ierarchically by the storing iterator.
particularly i f is large. Therefore, each thread loads Due to the hierarchical structure of GPU memory, the
only a slice of the matrices at a time, which is a × 1 iterator consists of two components: thread block-level
vector from  and a1 ×   vector from  . data transfer between global memory and shared
mem</p>
        <p>Using the thread hierarchy of the GPU, the algorithmory, and thread-level data transfer between shared
memgroups adjacent threads into thread blocks, assumingory and register. Since the data is categorized into four
a block matrix size of  ×   . Consequently, there types in Section3.1, we can design an iterator for each
are   ×   threads within each thread block. To taketype.</p>
        <p>advantage of the GPU’s memory hierarchy, the algorithm
typically fetches data in a step-by-step manner. Initially,
one thread block fetches a  ×   submatrix from 
and a  ×   submatrix from  into the shared memory.</p>
        <p>Subsequently, each thread retrieves its own vectors to
registers and performs the computations.</p>
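        <p>A compact CUDA sketch of this structure (ours, with illustrative tile sizes, not Gamut's or CUTLASS's actual code): each thread block stages tiles of A and B in shared memory, and each thread accumulates a TM x TN register tile from per-iteration slices.</p>
        <p>#define BK 8
#define TM 4
#define TN 4
// Each 16x16-thread block computes a (16*TM) x (16*TN) output tile.
// Launch: dim3 grid(N / (16*TN), M / (16*TM)), block(16, 16);
__global__ void gemm_tiled(const float *A, const float *B, float *C,
                           int M, int N, int K) {
    __shared__ float As[16 * TM][BK];
    __shared__ float Bs[BK][16 * TN];
    float acc[TM][TN] = {};                       // register tile, init 0
    int row0 = blockIdx.y * 16 * TM + threadIdx.y * TM;
    int col0 = blockIdx.x * 16 * TN + threadIdx.x * TN;

    for (int k0 = 0; k0 &lt; K; k0 += BK) {          // K / BK main-loop steps
        // gmem -&gt; smem: threads cooperatively load the two input tiles.
        for (int m = 0; m &lt; TM; m++)
            for (int kk = threadIdx.x; kk &lt; BK; kk += 16)
                As[threadIdx.y * TM + m][kk] = A[(row0 + m) * K + k0 + kk];
        for (int n = 0; n &lt; TN; n++)
            for (int kk = threadIdx.y; kk &lt; BK; kk += 16)
                Bs[kk][threadIdx.x * TN + n] = B[(k0 + kk) * N + col0 + n];
        __syncthreads();

        // smem -&gt; reg: each step consumes a TM x 1 and a 1 x TN slice.
        for (int kk = 0; kk &lt; BK; kk++)
            for (int m = 0; m &lt; TM; m++)
                for (int n = 0; n &lt; TN; n++)
                    acc[m][n] += As[threadIdx.y * TM + m][kk] *
                                 Bs[kk][threadIdx.x * TN + n];
        __syncthreads();
    }
    // reg -&gt; gmem (staging the stores through smem is elided here).
    for (int m = 0; m &lt; TM; m++)
        for (int n = 0; n &lt; TN; n++)
            C[(row0 + m) * N + col0 + n] = acc[m][n];
}
// Assumes M, N, K are multiples of the tile sizes; boundary checks omitted.</p>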
        <p>After the computations are complete, storing the results is also performed in a step-by-step manner. Since each thread has an M_t x N_t sized register tile, each thread block produces an M_b x N_b sized result, which typically exceeds the size of shared memory. Therefore, only a 1 x N_t sized slice of the resulting matrix from each thread is output from the register to the shared memory at a time. Similarly, the resulting matrix is output in stripes from the shared memory to the global memory. The entire process consists of K / K_b iterations of the main loop.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. MMLT Code Construction</title>
        <p>To better adapt the standard fast algorithm to MMLTs, the algorithm is divided into four components: loading iterators, inner computation, a storing iterator, and a skeleton main loop. The pseudo-code for the main loop is shown in Algorithm 1, which remains unchanged during code generation. The loading iterator for each input loads a tile of data hierarchically from the global memory to the registers. The inner computation of one thread computes via small data slices, and the results are accumulated repeatedly until the final result is computed and then output hierarchically by the storing iterator.</p>
        <p>Algorithm 1 Main Loop Pseudo-Code
1: for k0 := 1 to K step K_b do
2:   Load tiles to shared memory.
3:   TileModelIn(block)
4:   for k := 1 to K_b step 1 do
5:     Load slices to register.
6:     InnerOp(slices)
7:   end for
8:   for r := 1 to M_t step 1 do
9:     Store one row to shared memory.
10:    TileModelOut(block)
11:    Store one block to global memory.
12:   end for
13: end for</p>
        <p>Due to the hierarchical structure of GPU memory, each iterator consists of two components: thread block-level data transfer between global memory and shared memory, and thread-level data transfer between shared memory and registers. Since the data is categorized into four types in Section 3.1, we can design an iterator for each type.</p>
        <p>1. One spatial axis with one reduction axis. This type can reuse the method that the standard matrix multiplication algorithm uses to load its data. Assuming the size of the spatial axis tile is M_b and the size of the reduction axis tile is K_b, the iterator of one thread block loads an M_b x K_b tile from the global memory to the shared memory. The iterator of one thread loads an M_t x 1 slice from the shared memory to the register.
2. Only one reduction axis. Instead of loading a K_b sized vector into the shared memory and then loading one element into the register, we can directly load a K_b sized vector into the register every iteration, since K_b is not large.
3. Only one spatial axis. Since this type of data is irrelevant to the reduction axis, only a single loading operation is required for all the iterations. We can directly load an M_t sized vector into the register once from the global memory in the first iteration. In the following iterations, we can read directly from the register.</p>
        <p>The inner computation then computes a block whose problem size is M_t x N_t x 1. It can directly read from the registers and output to a resulting register tile sized M_t x N_t. The initialization of the resulting register may differ based on the accumulation operator. For instance, summation initializes the register to all zeros, while maximum may initialize the register to the minimum number of the data precision, as sketched below.</p>
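        <p>A minimal sketch (ours, with illustrative tile sizes standing in for M_t and N_t) of this operator-dependent initialization of the per-thread register tile:</p>
        <p>#include &lt;cfloat&gt;   // FLT_MAX
#define MT 8   // stands in for the thread tile size M_t
#define NT 8   // stands in for the thread tile size N_t

// Initialize the accumulator tile to the identity of the accumulation
// operator: 0 for summation, -FLT_MAX (minimum of the precision) for max.
__device__ void init_accumulators(float (&amp;acc)[MT][NT], bool is_max) {
    for (int i = 0; i &lt; MT; i++)
        for (int j = 0; j &lt; NT; j++)
            acc[i][j] = is_max ? -FLT_MAX : 0.0f;
}</p>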
      <sec id="sec-1-10">
        <title>One simple approach for applying relational algebra op</title>
        <p>erators likeProject, Select, and Aggregate involves
storing the intermediate results back into global memory</p>
        <p>To better adapt the standard fast algorithm to MMLTbes,fore launching another kernel to process them.
Howthe algorithm is divided into four components: loading ever, given the opportunity presented by threads within
iterators, inner computation, a storing iterator, andaathread block cooperating to store a tile of data from
skeleton main loop. shared memory to global memory, it is possible to
per</p>
        <p>The pseudo-code for the main loop is shown in Algo- form data transformations within the iterator. To this
rithm 1, which remains unchanged during code genera- end, we propose a tile-based iterator model that
intetion. The loading iterator for each input loads a tile ogfrates these operators with the data loading and storing
data hierarchically from the global memory to the regis-process.
ters. The inner computation of one thread computes via
 [],   []] , 
.</p>
        <p>)</p>
        <p>▷ Compute prefix-sum.
ℎ ]) ▷ atomicAdd returns value before
.</p>
        <p>Category
Direct Mapping
Linear Mapping
Nonlinear Mapping</p>
        <p>After obtaining the intermediate results, we identify For the last case, where the data structure is nonlinear,
two types of mappings: one-to-one mapping and many- like a heap or hash map, a hierarchical structure is used.
to-one mapping. One-to-one mapping, likeProject, can This involves storing a local data structure for each thread
be directly computed in-register after reduction. Our block in shared memory, and a global data structure in
tile-based iterator model, on the other hand, is designedglobal memory, along with a local lock in shared memory
to address the many-to-one mapping. and a global lock in global memory. Each thread first</p>
        <p>Tile-based iterators are categorized based on how theyacquires the local lock of its thread block, outputs its data
index into the output, with the pseudo-code available itno shared memory, and then one thread in the thread
Table 1. The first category is direct mapping, as seen in block acquires the global lock and merges the data in the
Aggregate, where the output position of each elementlocal data structure into the global data structure.
is already known before computation. To facilitate this, We concentrate on natural join for thJeoin operator.
two slices of indexing information, one for each spatial Instead of using a hash map in hash join algorithms1[7,
axis, each sized  , are loaded into shared memory be- 18], a set of CUDA streams is created, with each stream
fore output. During output, we can use the informationfocusing on one key. The whole computation is divided
in shared memory to index, and because multiple ele- into multiple small blocked MMLTs. For one blocked
ments are accumulated into one output position, atomiMcMLT, we initially retrieve the indices of input matrices
primitives provided by CUDA can be directly used. that match the key for which this stream is responsible.</p>
        <p>The two remaining cases of the tile-based iteratorThus, we can select data from global memory based on
model do not have knowledge of the output positionthe computed indices instead of creating an additional
prior to computation. In one of these cases, where thecopy for each blocked MMLT.
data is output to a linear data structure, sucShelaesct,
a global counter is used to indicate the next output po4si.-4. Growing Library
tion. The number of outputs for each thread is computed
and stored in a local counter array in shared memory.We use a declarative syntax similar to AMULET5,[6], as
Then, a single thread computes the prefix sum [16] and illustrated in Figure7, but with the added requirement
reserves a block for the thread block in global memory for users to specify both the spatial and reduction axes
by incrementing the global counter usinga“tomicAdd”. explicitly. The range of   in the syntax denote[s  ,   ).
Each thread can then determine its output position bTyhe loop in Figure 2 is rewritten in declarative form in
adding the prefix sum of the corresponding thread to the Figure 8.
start pointer of the reserved block. Gamut generates code from the declarative syntax by</p>
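        <p>The kernel below is a minimal sketch (ours, not Gamut's actual iterator code) of this linear-mapping output path: each thread counts its qualifying results, one thread computes the prefix sum and reserves a contiguous output block with atomicAdd (which returns the counter value before the add), and every thread then writes at its own offset. All names are illustrative.</p>
        <p>// Launch with up to 256 threads per block to match the counter array.
__global__ void select_store(const float *vals, int n_per_thread,
                             float thres, float *out, int *global_counter) {
    __shared__ int counts[256];       // one slot per thread in the block
    __shared__ int base;              // start of the reserved output block
    int tid = threadIdx.x;

    // 1. Each thread counts how many of its results pass the predicate.
    int my_count = 0;
    for (int i = 0; i &lt; n_per_thread; i++)
        if (vals[(blockIdx.x * blockDim.x + tid) * n_per_thread + i] &gt; thres)
            my_count++;
    counts[tid] = my_count;
    __syncthreads();

    // 2. One thread turns counts into an exclusive prefix sum and reserves
    //    a block of output slots; atomicAdd returns the value before the add.
    if (tid == 0) {
        int running = 0;
        for (int t = 0; t &lt; blockDim.x; t++) {
            int c = counts[t];
            counts[t] = running;      // exclusive prefix sum
            running += c;
        }
        base = atomicAdd(global_counter, running);
    }
    __syncthreads();

    // 3. Each thread writes its qualifying results at its reserved offset.
    int pos = base + counts[tid];
    for (int i = 0; i &lt; n_per_thread; i++) {
        float v = vals[(blockIdx.x * blockDim.x + tid) * n_per_thread + i];
        if (v &gt; thres) out[pos++] = v;
    }
}</p>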
        <p>For the last case, where the data structure is nonlinear, like a heap or hash map, a hierarchical structure is used. This involves storing a local data structure for each thread block in shared memory and a global data structure in global memory, along with a local lock in shared memory and a global lock in global memory. Each thread first acquires the local lock of its thread block, outputs its data to shared memory, and then one thread in the thread block acquires the global lock and merges the data in the local data structure into the global data structure; the sketch at the end of this subsection illustrates the locking pattern.</p>
        <p>We concentrate on natural join for the Join operator. Instead of using a hash map as in hash join algorithms [17, 18], a set of CUDA streams is created, with each stream focusing on one key. The whole computation is divided into multiple small blocked MMLTs. For one blocked MMLT, we initially retrieve the indices of the input matrices that match the key for which this stream is responsible. Thus, we can select data from global memory based on the computed indices instead of creating an additional copy for each blocked MMLT.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Growing Library</title>
        <p>We use a declarative syntax similar to Amulet [5, 6], as illustrated in Figure 7, but with the added requirement for users to specify both the spatial and reduction axes explicitly. A range [s, e) in the syntax denotes the half-open interval from s (inclusive) to e (exclusive). The loop in Figure 2 is rewritten in declarative form in Figure 8.</p>
        <p>Gamut generates code from the declarative syntax by first parsing the inner computation and selecting iterators based on the indexing type. Checks are also performed during parsing to ensure the query conforms to the MMLT definition. These checks ensure that there are only three axes and that two spatial axes are associated with the output matrix. Additionally, input matrices are identified as different types, and each input matrix is assigned an input iterator based on its recognized type. Then, the library recognizes the relational algebra operator pattern and selects the appropriate algorithm for that operator, which is fused into the output iterators.</p>
        <p>The code generated by Gamut is parameterized by the set of parameters described in Section 4.1: M_b, N_b, K_b, M_t, and N_t. However, not all combinations can be run on a GPU due to limitations on the number of registers, the shared memory size, and the number of threads inside one thread block. To reduce the search space, we use a heuristic from CUTLASS that favors square-shaped thread tiles by fixing M_b : N_b = 1 : 1 and M_t : N_t = 1 : 1. Through experiments, we found that the problem size has little effect on the best configuration when the launched thread blocks exceed the number of multiprocessors in the GPU. The thread block under the best configuration tends to be close to the largest runnable setting. This is because larger thread blocks result in fewer launched thread blocks and fewer thread blocks running on each multiprocessor, which in turn reduces the overhead of context switching and minimizes cache thrashing while maximizing the throughput. Therefore, we use a scale-up-before-scale-out principle to avoid blind searching.</p>
        <p>Upon installation of the library, Gamut finds the best parameters for standard matrix multiplication and saves them as starting parameters, taking into account the resource differences among various GPUs. For a new MMLT query, the library searches for the best configuration starting from the saved parameters. If the starting parameters cannot fully occupy a streaming processor of the GPU, Gamut scales up by increasing the size of tiles. Otherwise, for applications that have more data in each thread block, Gamut decreases the tile footprint until the setting becomes runnable. The occupancy can be calculated using the compilation information provided by the compiler and the hardware specification [19]. If there are multiple optional configurations, Gamut launches one thread block for each configuration and selects the configuration with the best performance.</p>
        <p>GPU code compilation takes a large amount of time compared to computation, even if the parameters are fixed. For example, a standard matrix multiplication with a problem size of 4096 x 4096 x 4096 takes 5 seconds for compilation and only 80 milliseconds for computation. To amortize the compilation time, the parsing and parameter information, along with a hashing summary of the parse tree, are stored with the compiled object file into storage. This information is used to compare parse trees when encountering similar queries, and if a duplicate query is encountered, the compiled object is linked directly, thus saving compilation time.</p>
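        <p>A hypothetical C++ sketch of this growing-library lookup (ours; the class, fields, and helper below are illustrative, not Gamut's API): queries are keyed by a hash of their parse tree, and on a hit the stored object file is linked instead of recompiling.</p>
        <p>#include &lt;string&gt;
#include &lt;unordered_map&gt;

struct CompiledEntry {
    std::string object_path;   // compiled GPU object file on disk
    int mb, nb, kb, mt, nt;    // saved tile parameters
};

class GrowingLibrary {
    std::unordered_map&lt;size_t, CompiledEntry&gt; cache_;  // parse-tree hash -&gt; entry
public:
    // Returns a cached entry, or compiles, stores, and returns a new one.
    const CompiledEntry &amp;lookup_or_compile(size_t parse_tree_hash,
                                           const std::string &amp;query) {
        auto it = cache_.find(parse_tree_hash);
        if (it != cache_.end())
            return it-&gt;second;       // duplicate query: link object directly
        CompiledEntry e = compile_and_search(query);  // slow path (~seconds)
        return cache_.emplace(parse_tree_hash, e).first-&gt;second;
    }
private:
    CompiledEntry compile_and_search(const std::string &amp;query);  // assumed
};</p>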
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Evaluation</title>
      <p>The core of Gamut is implemented in CUDA C++, and it utilizes a custom parser built with ANTLR [20] and a search algorithm implemented with CuPy [21]. The code is available at https://github.com/xxcisxxc/GAMUT-release.</p>
      <p>All experiments are conducted on an Nvidia T4 GPU with an Intel Haswell CPU running Ubuntu 22.04 and CUDA 11.7. It is assumed that the GPU memory is sufficient to accommodate all data. Each experiment is repeated 10 times, and the average execution time is reported.</p>
      <p>The same performance measurement as Amulet, the Scaled Processing Rate (SPR), is used [5, 6]. For a problem size of M x N x K and an execution time of t seconds, SPR is calculated as MNK / (t * 10^9). A higher SPR indicates better performance.</p>
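      <p>As a concrete illustration of the metric (our arithmetic, using the compilation example from Section 4.4): a standard matrix multiplication of problem size 4096 x 4096 x 4096 that runs in 80 milliseconds has SPR = 4096^3 / (0.080 * 10^9), which is approximately 859.</p>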
      <sec id="sec-5-1">
        <title>5.1. Standard Matrix Multiplication Performance</title>
        <p>To begin with, we demonstrate that our library performs comparably to state-of-the-art matrix multiplication libraries on standard matrix multiplication tasks. We conducted experiments with square matrices (i.e., M = N = K) of orders 1024, 2048, 4096, 8192, 16384, and 32768, referred to as 1k, 2k, 4k, 8k, 16k, and 32k, respectively. Three baselines were used in the experiments: cuBLAS, which is the most widely used library for linear algebra on GPUs; CUTLASS, an open-source matrix multiplication library for GPUs with similar performance to cuBLAS; and TVM, a deep learning compiler. We test two different settings for TVM: one where we allow TVM to search from scratch with a fixed number of searches, and another where we provide TVM with CUTLASS's default configuration to make an apples-to-apples comparison with our library. As Gamut focuses on optimizing the standard cubic matrix multiplication algorithm on GPUs, we do not compare it with algorithms such as the Strassen algorithm.</p>
        <p>The performance comparison between our library and the three other baselines is shown in Figure 9 (SPR for matrix orders 1K through 32K, comparing cuBLAS, CUTLASS, GAMUT, and TVM with and without a given configuration). The results reveal that cuBLAS and CUTLASS perform slightly better than our library due to their optimization techniques, such as avoiding memory bank conflicts and software pipelining. These techniques, however, require a specific data layout and a larger shared memory footprint, resulting in reduced flexibility. TVM has reasonable performance when the configuration is given. For TVM without a configuration, its compilation time increases with the problem size, as shown by Table 2. As a result, our library is able to achieve high performance while keeping the compilation time constant and significantly lower than that of TVM.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Performance on General Queries</title>
        <p>To investigate the performance of other MMLT queries, we examined the set of queries presented in Figure 10. Queries 1 and 2 are the examples from the introduction. Query 3 calculates the number of people-restaurant pairs within two areas where the mutual preference exceeds a threshold; it utilizes the Project and Aggregate operators. Query 4 only considers people and restaurants in the same area and involves a Join operation. Query 5 outputs individual people-restaurant pairs whose mutual preference is above a threshold. We assume that the threshold is selective, meaning only a few pairs qualify, resulting in a sparse matrix. Query 6 is similar to Query 5, but with a different arithmetic, seeking restaurants with complementary styles, where the style vector is normalized. Therefore, two style vectors can be combined through the maximum of each feature, and the dot product of the combined vector will have a high absolute value if the two vectors are complementary. Finally, the last query returns the top 128 mutual preferences, so the heap size is 128.</p>
        <p>Query 1
where (i in [0, M], j in [0, N]) {
  accum = 0;
  reduce (k in [0, K]) {
    accum += S[i, k]*W[k, j] +
      (S[i, k]*W[k, j] &gt; thres[i]) * (S[i, k]*W[k, j] - thres[i]);
  }
  N[i, j] += accum;
}

Query 4
where (i in [0, M], j in [0, N]) when (Pzip[i] == Rzip[j]) {
  accum = 0;
  reduce (k in [0, K]) {
    accum += P[i, k] * R[k, j];
  }
  C[Pzip[i], Rzip[j]] += accum;
}

Query 5
where (i in [0, M], j in [0, N]) {
  accum = 0;
  reduce (k in [0, K]) {
    accum += P[i, k] * R[k, j];
  }
  accum &gt; thres ? C_sparse.add(accum);
}

Query 6
where (i in [0, M], j in [0, N]) {
  accum = 0;
  reduce (k in [0, K]) {
    accum += max(R[i, k], R[k, j]) * max(R[i, k], R[k, j]);
  }
  accum &gt; thres ? C_sparse.add(accum);
}

Query 7
where (i in [0, M], j in [0, N]) {
  accum = 0;
  reduce (k in [0, K]) {
    accum += P[i, k] * R[k, j];
  }
  min_heap_128.add(accum);
}</p>
        <p>Figure 10: MMLT queries used in the evaluation, written in Gamut's declarative syntax.</p>
        <p>Different from Amulet [5, 6], compilation of MMLTs on GPUs is more involved than on CPUs because performance requires the efficient use of parallel resources. Tools such as TVM provide infrastructure to help programmers avoid low-level parallel programming. However, due to the limitations of TVM, we were only able to compare Gamut with TVM on Query 1. This is because TVM does not provide access to atomic or lock primitives, as confirmed by the TVM community [22], preventing the compilation of the other queries.</p>
        <p>The results of all these queries are shown in Figure 11. They show that TVM without the default configuration has much lower performance. This may be because TVM does not optimize the loading of the extra threshold input. Query 1 is approximately 3 times slower than standard matrix multiplication, even though it only modifies the inner loop computation. This is most likely because the modified computation does not fit the normal matrix-multiply-add structure that is optimized by specialized hardware in the GPU, leading to lower instruction throughput. Query 4 has a different performance pattern compared to the other queries because it is broken down into multiple smaller matrix multiplications, resulting in less computation than full multiplications. Therefore, it has a much higher SPR and is slower to converge. In the case of Query 7, it exhibits low performance.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Acknowledgments</title>
      <sec id="sec-2-1">
        <title>This work was supported by the National Science Foundation under grant IIS-2008295 and by a gift from Oracle.</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <article-title>boosting system</article-title>
          ,
          <source>in: Proceedings of the 22nd</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <source>edge Discovery and Data Mining</source>
          , KDD '16, Asso-
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>USA</surname>
          </string-name>
          ,
          <year>2016</year>
          , p.
          <fpage>785</fpage>
          -
          <lpage>794</lpage>
          . URL: https://doi.org/10.1145/
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          2939672.2939785. doi:
          <volume>10</volume>
          .1145/2939672.2939785. [16]
          <string-name>
            <given-names>M.</given-names>
            <surname>Harris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sengupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Owens</surname>
          </string-name>
          , Parallel prefix
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <article-title>sum (scan) with cuda</article-title>
          ,
          <source>GPU gems 3</source>
          (
          <year>2007</year>
          )
          <fpage>851</fpage>
          -
          <lpage>876</lpage>
          . [17]
          <string-name>
            <given-names>C.</given-names>
            <surname>Balkesen</surname>
          </string-name>
          , G. Alonso,
          <string-name>
            <given-names>J.</given-names>
            <surname>Teubner</surname>
          </string-name>
          , M. T. Özsu,
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>volume 7</source>
          ,
          <string-name>
            <given-names>VLDB</given-names>
            <surname>Endowment</surname>
          </string-name>
          ,
          <year>2013</year>
          , pp.
          <fpage>85</fpage>
          -
          <lpage>96</lpage>
          . [18]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kaldewey</surname>
          </string-name>
          , G. Lohman,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mueller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Volk</surname>
          </string-name>
          , Gpu
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <source>ment on New Hardware</source>
          ,
          <year>2012</year>
          , pp.
          <fpage>55</fpage>
          -
          <lpage>62</lpage>
          . [19]
          <string-name>
            <surname>NVIDIA</surname>
          </string-name>
          , NVCC :: CUDA Toolkit Documen-
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>tation</surname>
          </string-name>
          ,
          <year>2023</year>
          . URL:https://docs.nvidia.com/cuda/
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <article-title>cuda-compiler-driver-nvcc/</article-title>
          . [20]
          <string-name>
            <given-names>T.</given-names>
            <surname>Parr</surname>
          </string-name>
          ,
          <article-title>Antlr (another tool for language recogni-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <source>tion)</source>
          ,
          <year>2023</year>
          . URL:https://www.antlr.org./ [21]
          <string-name>
            <given-names>R.</given-names>
            <surname>Okuta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Unno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nishino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hido</surname>
          </string-name>
          , C. Loomis,
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <source>tion Processing Systems (NIPS)</source>
          ,
          <year>2017</year>
          . URL:http://
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          learningsys.org/nips17/assets/papers/paper_16.pd.f [22]
          <string-name>
            <given-names>T.</given-names>
            <surname>Community</surname>
          </string-name>
          , Discussion in tvm community,
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>2022. URL: https://discuss.tvm.apache.org/t/</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>non-</surname>
          </string-name>
          predefined-reduction/13719.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>