Evaluating CUDA-Aware Approximate Computing Techniques

Işıl Öz
Computer Engineering Department, Izmir Institute of Technology, Izmir, Turkey

RAW'24: The 3rd Workshop on Resource AWareness of Systems and Society (RAW 2024), July 02–05, 2024, Maribor, Slovenia
isiloz@iyte.edu.tr (I. Öz); ORCID: 0000-0002-8310-1143 (I. Öz)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

Abstract
Approximate computing techniques offer performance improvements by performing inexact computations. Moreover, CUDA programs written to be executed on GPU devices employ specific features to utilize the parallel computation units of heterogeneous GPU architectures. While generic software-level approximate computing techniques have been applied to heterogeneous CUDA programs, CUDA-specific approaches may introduce promising performance improvements without corrupting the target computations. In this work, we propose software approximation techniques for CUDA programs: kernel-aware loop perforation, partition-level synchronization, block-level atomic operations, and warp divergence elimination. We perform source code transformations on target benchmark programs by applying our techniques, and we evaluate the performance improvements obtained by trading off accuracy in the target computations. Our experimental results reveal that CUDA-aware approximation techniques offer significant performance improvements at the expense of acceptable accuracy loss.

Keywords
CUDA programming model, GPU computing, approximate computing

1. Introduction

Heterogeneous computer systems, combining general-purpose processors and GPU devices as accelerators, enable high-performance and energy-efficient executions. However, applications from various domains, such as AI acceleration, big-data processing, and high-performance computing (HPC), with large computing requirements make power consumption an important concern in these systems [1]. Since modern GPU architectures employ complex structures and the target workloads exploit the massively parallel resources, energy efficiency becomes critical for large-scale GPU executions [2, 3, 4].

To resolve the conflict between performance and energy efficiency, approximate computing maintains high performance and low power consumption for applications that can tolerate inexact computations. While architecture-level approximate computing techniques are enabled by modifying processor units and memory components, software solutions based on compiler transformations or manual code modifications also offer approximate computations [5, 6]. While inexact hardware units or voltage scaling constitute hardware-level solutions, techniques like loop perforation or relaxed synchronization offer performance-accuracy tradeoffs at the software level.

Since GPU systems serve applications from different domains, they employ approximate computing techniques to improve performance and energy efficiency by trading accuracy in the target computations. Besides inherently error-tolerant graphics and image processing applications [7], general-purpose GPU programs benefit from approximations with reasonable amounts of incorrect computation [8]. While some works reuse generic techniques like perforation [9], other methods utilize GPU-specific hardware or software components to employ approximations [10]. Additionally, simulation-based evaluations propose hardware modifications, either adding approximate units or supporting approximate computations [11, 12].

In this work, we propose software-based approximations for CUDA programs running on GPU architectures. Not only do we adapt existing techniques for GPU programs, but we also propose CUDA-specific methods that target parallel CUDA threads. Our main contributions are as follows:

- We propose kernel-aware loop perforation by adapting the loop perforation technique for CUDA programs.
- To reduce the synchronization overhead, we propose partition-level synchronization and block-level atomic operations based on CUDA cooperative groups and CUDA thread-scope atomic functions. Additionally, we eliminate warp divergence inside CUDA kernel functions to prevent serial execution caused by branch instructions.
- We modify the target codes by inserting compiler directives enabling our techniques, and generate our approximate versions based on the given compiler options.
- We perform an experimental study to evaluate the impact of the modifications introduced by our approximations. Our study includes applications from different domains to observe the performance and accuracy variations for the target executions. Our experimental results reveal that CUDA-aware approximation techniques offer significant performance improvements at the expense of acceptable accuracy loss.

The remainder of this paper is organized as follows: Section 2 presents background on approximate computing and the CUDA programming model. We explain our approximation methods in Section 3. Then, the experimental results are outlined in Section 4. Section 5 presents relevant studies about CUDA approximations. Finally, in Section 6, we summarize the work with some concluding remarks.

2. Background and Motivation

2.1. Approximate Computing

Approximate computing introduces acceptable inaccuracies into the computing process and promises significant performance and energy gains. Some techniques employ the loop perforation approach, which works by skipping some loop iterations to reduce computational overhead [9, 13]. Relaxed synchronization shortens the waiting time of the threads
that wait for the completion of other threads' work [14, 15]. Reduced-precision computation employs fewer compute cycles with insignificant value modifications for error-tolerant applications targeting low-precision executions [16].

2.2. GPU Programming Model

While modern GPU architectures evolve as the requirements of the target applications change, GPU devices employ SIMT (Single Instruction Multiple Threads) execution to accelerate data processing tasks in their parallel execution cores.

A program written in the CUDA programming model [17], which is a parallel programming model for NVIDIA GPU devices, starts its execution on a CPU, allocates memory space on the GPU, transfers data into GPU global memory, and starts a kernel function execution by creating thousands of threads. Each thread executes the same program (SIMT) by processing different parts of the given data. Threads that execute on the GPU are part of a compute kernel specified by a function. Besides data-parallel applications that can benefit from the many parallel execution units of GPUs, large-scale irregular computations utilize the massive degree of parallelism and the high memory bandwidth provided by GPUs. Listing 1 presents the CUDA kernel function and kernel launch configuration for the vector addition operation. For simplicity, we skip the memory allocation and copy operations. The execution launches the vectorAdd function by specifying the number of blocks and the number of threads in each block. The hardware scheduler schedules the blocks onto SM (Streaming Multiprocessor) units and thread groups (warps) onto GPU cores inside the SMs. Each thread executes the kernel function and performs the addition operation based on its global thread identifier.

__global__ void vectorAdd(double *a, double *b, double *c, int n) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n)
        c[id] = a[id] + b[id];
}

int main(int argc, char** argv) {
    double *d_a, *d_b, *d_c;
    int n;
    //... memory allocations and copy operations
    int blockSize = 1024; // Number of threads in each thread block
    int gridSize = (int)ceil((float)n / blockSize);
    vectorAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);
    //... memory copy and deallocations
}

Listing 1: Vector addition kernel function and its launch in CUDA.

3. Approximation for CUDA Programs

We propose three main approximations for target CUDA programs. Firstly, we exploit loop perforation by adapting the popular method for loop and loop-like structures in the CUDA code. Secondly, we replace synchronization primitives with relaxed versions: we propose partition-level synchronization for threads based on CUDA cooperative groups, and block-level atomic operations using CUDA thread scopes. Finally, we remove warp divergence, a serial bottleneck in GPU executions.

We evaluate target CUDA source codes and perform code transformations based on compiler directives. While our approach requires manual code analysis and modifications to introduce directives in code segments, the semi-automatic configuration enables us to generate target codes that employ approximations selectively by compiling the code with specific options.

3.1. Kernel-Aware Loop Perforation

While loop perforation skips some loop iterations in a serial program, the same technique can be applied to CUDA programs by adapting the perforation accordingly. We consider three approximation versions derived from loop perforation:

Kernel launch perforation: We skip the iterations of a loop that launches one or multiple CUDA kernel functions at each iteration. The perforation is simply a regular loop perforation with kernel launches per iteration. In the code segment given in Listing 2 (the Fdtd2d program from the Polybench suite [18]), we reduce the number of iterations, and consequently the kernel launches, by assigning a smaller upper bound (the _PB_TMAX variable in the example code).

...
for (int t = 0; t < _PB_TMAX; t++) {
    fdtd_step1_kernel<<<grid, block>>>(nx, ny, _fict_gpu, ex_gpu, ey_gpu, hz_gpu, t);
    cudaDeviceSynchronize();
    fdtd_step2_kernel<<<grid, block>>>(nx, ny, ex_gpu, ey_gpu, hz_gpu, t);
    cudaDeviceSynchronize();
    fdtd_step3_kernel<<<grid, block>>>(nx, ny, ex_gpu, ey_gpu, hz_gpu, t);
    cudaDeviceSynchronize();
}
...

Listing 2: The Fdtd2d code with kernel function calls inside a loop.

Kernel launch configuration perforation: In data-parallel CUDA programs, CUDA threads execute implicit loops in parallel by performing the computations that belong to one or more iterations of the serial program loop. We launch target kernel executions with a reduced number of threads in the configuration; hence, the original loop of the serial program is perforated. In the code segment given in Listing 3 (the Convolution2D program from the Polybench suite [18]), we modify the block or grid configuration parameters of the convolution2D_kernel kernel by reducing the X or Y dimensions of the grid.

...
dim3 block(DIM_THREAD_BLOCK_X, DIM_THREAD_BLOCK_Y);
dim3 grid(ceil(((float)NI) / ((float)block.x)), ceil(((float)NJ) / ((float)block.y)));
convolution2D_kernel<<<grid, block>>>(ni, nj, A_gpu, B_gpu);
...

Listing 3: The Convolution2D code kernel launch configuration.
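The two launch-level variants above can be sketched with preprocessor guards. The macro names follow the paper's directive names; how the perforation rate is encoded in the macro value, and the kernel itself, are illustrative assumptions rather than the paper's exact implementation.

```cuda
// Sketch: directive-guided kernel launch and grid perforation.
// Compile, e.g., with: nvcc -DKERNEL_LAUNCH_PERFORATION=90 sketch.cu
__global__ void step_kernel(float *data, int n) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n) data[id] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    int tmax = 100;
#ifdef KERNEL_LAUNCH_PERFORATION
    // Kernel launch perforation: a 90% rate keeps only 10% of the
    // loop iterations, and hence of the kernel launches.
    tmax = tmax * (100 - KERNEL_LAUNCH_PERFORATION) / 100;
#endif

    int blockSize = 256;
    int gridSize = (n + blockSize - 1) / blockSize;
#ifdef GRID_PERFORATION
    // Launch configuration perforation: launch fewer thread blocks;
    // the elements of the dropped blocks are simply never computed.
    gridSize = gridSize * (100 - GRID_PERFORATION) / 100;
#endif

    for (int t = 0; t < tmax; t++)
        step_kernel<<<gridSize, blockSize>>>(d_data, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```

Both guards compose: enabling several directives at once yields a version that perforates the iteration count and the thread count together.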
Intra-kernel loop perforation: We perform the standard loop perforation method on the code inside kernel functions. In the code segment given in Listing 4 (the Covariance program from the Polybench suite [18]), we reduce the number of loop iterations inside the mean_kernel kernel (the _PB_N variable in the example code).

...
__global__ void mean_kernel(int m, int n, DATA_TYPE *mean, DATA_TYPE *data)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;

    if (j < _PB_M) {
        mean[j] = 0.0;

        int i;
        for (i = 0; i < _PB_N; i++) {
            mean[j] += data[i * M + j];
        }
        mean[j] /= (DATA_TYPE)FLOAT_N;
    }
}
...

Listing 4: The Covariance code with loop structures inside a kernel function mean_kernel.

We modify each program code by inserting compiler directives for a set of loop perforation types. Specifically, we define four directives: KERNEL_LAUNCH_PERFORATION, GRID_PERFORATION, BLOCK_PERFORATION, and LOOP_PERFORATION, and compile the programs by enabling the directives with specific values, which represent the perforation rate as the reduction ratio of the target loop. By enabling the chosen perforation type(s) and rate(s) at compile time, we evaluate the impacts on the execution.

3.2. Relaxed Synchronization

Multiple CUDA threads require time-consuming synchronization to access shared data or to resolve data dependencies, utilizing atomic operations (like atomicAdd) and barrier operations (like __syncthreads()), respectively. Relaxed synchronization offers performance gains by synchronizing fewer threads in exchange for output accuracy loss. We consider two main relaxations based on CUDA cooperative groups and CUDA thread scopes:

Partition-level synchronization: CUDA threads within a block can cooperate by synchronizing their execution to coordinate memory accesses. The programmer can define synchronization points by calling the __syncthreads() function, which acts as a barrier and makes all threads wait. While CUDA employs block-level synchronization with the __syncthreads() function, the __syncwarp() function, which synchronizes the threads within a warp, became available in CUDA 9. This is important for porting code to modern GPU architectures after Volta, in which threads within a warp can be scheduled separately. Additionally, the Cooperative Groups API [19] provides a rich set of thread-synchronization primitives by forming partitions over a set of threads. Listing 5 presents different code snippets for organizing groups of threads. While the first group, blockgroup, represents all the threads in a thread block, warpgroup represents all the threads in a warp. If we synchronize the threads in those groups, the behavior is the same as with the __syncthreads() and __syncwarp() functions, respectively.

// Cooperative group for the current thread block
auto blockgroup = cooperative_groups::this_thread_block();
// Cooperative group for each warp in the thread block
auto warpgroup = cooperative_groups::tiled_partition<32>(blockgroup);
// Cooperative group for each 16 threads in the thread block
auto subwarp16 = cooperative_groups::tiled_partition<16>(blockgroup);
// Cooperative group for all currently coalesced threads in the warp
auto coalescedgroup = cooperative_groups::coalesced_threads();
// Thread block groups can sync
blockgroup.sync();

Listing 5: CUDA cooperative groups [19].

To implement the approximation, we first search for all __syncthreads() function calls in the target kernel functions and configure the synchronization level for each synchronization point. Specifically, we either completely skip __syncthreads() (SKIP) or replace it with a relaxed version. For relaxing synchronization, we choose __syncwarp() (WARP) or utilize cooperative thread groups (the details are given below). We modify each program code and inject #ifdef directives to guide the compiler based on user preferences. For each __syncthreads() code block, we define one directive and compile the code by specifying one or more directives.

For our partition-level approach, we define two thread partitions (as given in Listing 6): 1) 4TILE: cooperative thread groups with four threads in the corresponding warp, and 2) ACTIVE: the currently coalesced threads in the warp. When data-dependent conditional branches in the code cause threads within a warp to diverge, the SM disables (deactivates) the threads that do not take the branch; the threads that remain active on the path are referred to as coalesced.

//Replaced code version 1 (4TILE)
thread_group tile32 = tiled_partition(this_thread_block(), 32);
thread_group tile4 = tiled_partition(tile32, 4);
tile4.sync();

//Replaced code version 2 (ACTIVE)
thread_group active = coalesced_threads();
active.sync();

Listing 6: Partition-level synchronization configurations.

Block-level atomic operations: While the atomic operations in standard C or C++ are uniform, the CUDA programming model offers atomic functions at different scopes. A thread scope specifies the set of threads that can synchronize with each other using atomic operations. Atomic functions with the _system suffix (e.g., atomicAdd_system) are atomic at system scope, where the system refers to a system running multiple GPUs and CPUs. Atomic functions without a suffix (e.g., atomicAdd) are atomic at device scope, where the device refers to the target GPU device. Atomic functions with the _block suffix (e.g., atomicAdd_block) are atomic at thread block scope, which refers to synchronization among the threads executing in the same thread block.
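A single directive-guided synchronization point combining the relaxations above might look as follows. The directive spellings (SYNC_1_SKIP and so on) and the reduction kernel are illustrative; the paper describes the scheme but not its exact macro names.

```cuda
// Sketch: one relaxed synchronization point plus an atomic-scope relaxation.
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Assumes a fixed block size of 256 threads for the shared-memory tile.
__global__ void reduce_kernel(const float *data, float *sum, int n) {
    __shared__ float tile[256];
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (id < n) ? data[id] : 0.0f;

#if defined(SYNC_1_SKIP)
    // SKIP: no barrier at all; readers may observe stale tile entries.
#elif defined(SYNC_1_WARP)
    __syncwarp();                         // WARP: warp-level barrier only
#elif defined(SYNC_1_4TILE)
    cg::thread_block_tile<4> tile4 =
        cg::tiled_partition<4>(cg::this_thread_block());
    tile4.sync();                         // 4TILE: four-thread partition
#elif defined(SYNC_1_ACTIVE)
    cg::coalesced_group active = cg::coalesced_threads();
    active.sync();                        // ACTIVE: coalesced threads only
#else
    __syncthreads();                      // original block-level barrier
#endif

    if (threadIdx.x == 0) {
        float partial = 0.0f;
        for (int i = 0; i < blockDim.x; i++) partial += tile[i];
#ifdef ATOMIC_1_BLOCK
        // Block-scope atomic on a device-wide accumulator: updates from
        // different blocks may race, which is the accepted approximation.
        atomicAdd_block(sum, partial);
#else
        atomicAdd(sum, partial);          // original device-scope atomic
#endif
    }
}
```

Each synchronization point in a target kernel gets its own directive, so the relaxation level can be chosen independently per point at compile time.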
For block-level atomic operations, we arrange for threads to perform atomic operations at most at thread block scope. As with our synchronization approach, we search for all atomic functions in the target CUDA code and reduce the atomic scope accordingly. For instance, we replace atomicAdd function calls with atomicAdd_block, or we completely remove the function call. Hence, we obtain atomic operations involving fewer threads than the original code. Similarly, we could replace atomicAdd_system calls or skip atomicAdd_block calls.

3.3. Warp Divergence Elimination

CUDA threads are executed in groups of 32 threads (warps), and all threads in a warp execute the same instruction at the same time. Due to the SIMD execution model, when threads in the same warp need to perform different operations, the execution of the different branches is serialized, thus hurting the performance improvement that could be gained from parallelism. Figure 1 presents an example scenario for warp divergence. Eight threads (assuming an 8-thread warp size for simplicity) start the kernel execution; then, at the Branch point, there is an if statement that causes different path executions. While four threads execute the instruction at Path A, the other four continue the execution at Path B. When the first four threads execute Path A, the others must wait and perform no operation. The marked execution prevents full warp utilization by activating only four threads simultaneously in an 8-thread warp structure.

Figure 1: SIMD Warp Divergence [20].

To eliminate the divergence overhead, as an approximation method, we execute only one path when multiple paths exist in a warp. For instance, for the code given in Listing 7 (from the Grappolo application [21]), we configure the execution to take only Path 1, Path 2, or Path 3. Alternatively, we completely skip the divergent code segment in our evaluations.

...
if (currCId == FLAG_FREE) {
    /*Path 1*/
}
if (currCId == (1 + dataItem->cId)) {
    /*Path 2*/
}
else
{
    /*Path 3*/
}
...

Listing 7: SIMD warp divergence code example.
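The single-path strategy for Listing 7 can be sketched with compile-time guards. The macro names (FORCE_PATH_1, SKIP_DIVERGENT) and the surrounding device function are illustrative; the paper applies this transformation manually to Grappolo's kernels.

```cuda
// Sketch: directive-guided warp divergence elimination for a branch
// shaped like Listing 7. FLAG_FREE's value is an illustrative stand-in.
#define FLAG_FREE (-1)

__device__ int classify(int currCId, int dataCId) {
#if defined(FORCE_PATH_1)
    // Every thread takes Path 1, so the warp never diverges; threads
    // whose condition would have chosen Path 2/3 compute a wrong result.
    return /*Path 1*/ 0;
#elif defined(SKIP_DIVERGENT)
    // Skip the divergent segment entirely and fall back on the
    // previously computed value carried in by the caller.
    return dataCId;
#else
    // Original divergent code: threads in a warp may take different
    // paths, serializing the branch bodies.
    if (currCId == FLAG_FREE) {
        return /*Path 1*/ 0;
    }
    if (currCId == (1 + dataCId)) {
        return /*Path 2*/ 1;
    } else {
        return /*Path 3*/ 2;
    }
#endif
}
```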
4. Experimental Study

4.1. Experimental Setup

To evaluate our approximation methods, we select CUDA applications from the Polybench [18] and Gardenia [22] benchmark suites and utilize an optimized CUDA implementation of the Louvain community detection algorithm, namely Grappolo [21]. While the Polybench applications mostly employ data-parallel computations with multiple loop structures, Gardenia implements a set of graph algorithms that include synchronization primitives. Grappolo, with its computationally intensive and complex structures, includes code segments suitable for our evaluations on relaxed synchronization and warp divergence-based approximations.

We compile our programs with CUDA 12.1 [23] and run our approximation experiments on a system with an NVIDIA GeForce RTX 3050 Ti Mobile GPU device. The GPU device, built on the Ampere architecture [24], has 4 GB of GDDR6 memory.

4.2. Experimental Results

We evaluate our three main approximations for target CUDA programs separately. We execute both original and approximated versions, measure GPU execution times, and collect result outputs. By comparing execution times and output accuracies, we perform a tradeoff analysis for the target computations.

4.2.1. Kernel-Aware Loop Perforation

For our loop perforation techniques, we select six programs from the Polybench benchmark suite [18]. The programs have data-parallel characteristics, and each employs different loop structures. We execute Correlation, Covariance, Syrk, and Fdtd2d with STANDARD input sizes, and Jacobi-2D and 2DConv with LARGE input sizes to obtain longer execution times. We collect GPU execution times and count incorrect computations by comparing against the original output. Since the programs work with array structures and compute array elements as the final output, we evaluate the number of array elements that are computed incorrectly.

Figure 2 demonstrates performance improvement and inaccuracy values for the programs when our loop-perforation methods are applied. For each applicable method, namely kernel launch perforation, intra-kernel loop perforation, grid-level kernel launch configuration perforation, and block-level kernel launch configuration perforation, we apply 90% and 80% perforation rates. If a program does not support the target approximation (e.g., Correlation does not have a kernel launch inside a loop), we simply omit the corresponding result from our evaluation. The values in Figure 2 present 1/Speedup and the rate of incorrectly computed elements. We define performance in terms of speedup, the ratio of the compute time of the original execution to the time of the approximate execution, and report 1/Speedup values in our results. For instance, the original Correlation execution takes 1.785 milliseconds and computes 4194304 array elements. When we perforate the kernel function loops by 90% (Loop (90%)), we measure 1.302 milliseconds and 793356 incorrect computations. Therefore, the performance rate equals 1.302/1.785 = 0.73, and the rate of incorrect computations is 793356/4194304 = 0.19, as shown in Figure 2. By reporting performance improvement and inaccuracy values in this way, one can evaluate the performance gains and incorrect results of each approximation and make design decisions. Based on the program characteristics, each approximation affects the execution outcome differently. We can obtain up to 60% performance improvement (Loop (80%) for Correlation) in exchange for 40% of the elements being computed incorrectly. Some approximations offer good tradeoff points, like the grid-level kernel launch configuration perforations (Grid (90%) and Grid (80%)) in Fdtd2d, with 20% and 30% performance improvements at the cost of losing 30% and 50% of the correct computations. On the other hand, some cases show no performance improvement despite small inaccuracy values (like the kernel launch configuration perforations in Covariance) or intolerable output loss despite improved execution times (like the loop perforations in Syrk).

Figure 2: Speedup-Inaccuracy variation for loop perforation approximation methods.

4.2.2. Relaxed Synchronization

We evaluate Betweenness Centrality (bc) from the Gardenia benchmark suite [22], which has four different implementations. For a sample graph (soc-LiveJournal1 [25]), we execute each version and select the one with the lowest execution time. Since that version already employs optimizations, we apply our approximation methods to it for a fair comparison. The implementation (i.e., bc_topo_lb) has four main kernel functions with synchronization primitives (i.e., __syncthreads()). We apply our relaxed synchronization techniques to each of the seven __syncthreads() function calls in the target kernel functions and perform four specific modifications: 1) SKIP: remove __syncthreads(), 2) WARP: synchronize threads in the same warp, 3) 4TILE: synchronize four threads in the same cooperative group, 4) ACTIVE: synchronize coalesced threads. In total, we obtain 28 different versions, which we execute with 19 different datasets. We observe execution time and output differences for only a subset of our executions, specifically relaxations of three __syncthreads() function calls in only one kernel function. For these three synchronization points, we also consider relaxing their combinations.

Table 1 presents the execution times and the number of incorrect computations in the observed output for the specified graphs.

Table 1: Execution time and incorrect computations (out of the given expected correct values) for relaxed synchronization methods for the bc application.

                      ljournal-2008         socLiveJournal        cage15
                      Time    Incorrect     Time    Incorrect     Time    Incorrect
                              (5,363,260)           (4,847,571)           (5,154,859)
ORIGINAL              61.156  0             57.116  0             47.096  0
SYNC 1     SKIP       60.637  954           57.116  276           46.780  28,326
           WARP       60.199  793           53.611  355           47.109  70,611
           4TILE      60.579  754           57.205  227           46.939  76,002
           ACTIVE     60.205  1113          55.431  292           47.164  71,563
SYNC 2     SKIP       59.770  19,888        52.756  1636          47.425  486,970
           WARP       59.314  11,230        52.485  1693          47.811  498,751
           4TILE      61.364  14,087        52.300  1522          48.465  303,306
           ACTIVE     59.397  8345          52.532  1647          47.850  463,191
SYNC 3     SKIP       61.328  64            53.996  36            47.198  15,531
           WARP       60.643  1996          53.774  38            47.843  14,341
           4TILE      61.374  137           53.961  34            47.186  13,383
           ACTIVE     60.744  160           53.872  32            47.775  13,975
SYNC 1+2   SKIP       56.800  20,137        50.572  1739          46.420  98,691
           WARP       56.377  17,035        50.401  1633          46.879  104,232
           4TILE      57.779  12,311        51.199  1678          47.217  138,951
           ACTIVE     56.195  22,491        50.495  1657          46.908  84,210
SYNC 1+2+3 SKIP       57.088  20,346        50.871  1729          46.966  85,029
           WARP       56.747  20,719        50.774  1600          47.320  62,026
           4TILE      57.801  14,807        51.513  1649          47.678  59,262
           ACTIVE     56.608  17,248        50.880  1556          47.352  87,300

We observe that SKIP, WARP, and ACTIVE mostly outperform 4TILE, probably due to the overhead of fine-grained group creation. While relaxing individual synchronization points (i.e., SYNC 1, SYNC 2, SYNC 3) offers significant performance gains with insignificant accuracy loss, the combinations (i.e., SYNC 1+2, SYNC 1+2+3) further improve performance without hurting the output quality much. While the accuracy loss depends on the target dataset, we see the most promising relaxation options for performance gains (around 8%-10%) with the SYNC 1+2 version.

Since bc has no atomic operations, we consider another application to observe the impact of our approximation techniques for atomic operations. We utilize the Grappolo code [21], a highly optimized CUDA implementation of the Louvain community detection algorithm [26]. Louvain is a greedy graph processing method that assigns each vertex to a community so as to maximize the overall Modularity, and generates a new graph in which the communities become new vertices. Since the output metric, Modularity, does not present an exact result, trading output accuracy for performance improvement is an interesting evaluation for this execution. The interpretation of the Modularity metric depends on the application domain utilizing the community detection; however, a Modularity value close to 1 indicates higher-quality output. While Grappolo employs both synchronization and atomic operation primitives, we evaluate only the atomic operations and apply our relaxation methods to them. Specifically, for atomicAdd and atomicCAS function calls, we either replace them with the non-atomic operation or with the block-level atomic function calls (i.e., atomicAdd_block or atomicCAS_block).

Figure 3 presents execution time and Modularity values as the performance and accuracy metrics, respectively. Besides individual atomic operations, we relax combinations of the atomic operations to see the impact on the outcome. In our target program, we have five atomicAdd and three atomicCAS function calls. We label the relaxations by the index and type of the method. Specifically, we use SKIP or BLOCK as the prefix and the order of the corresponding function as the suffix.
For instance, SKIP_ATOMICADD_1 replaces the first atomicAdd with the non-atomic operation; BLOCK_ATOMICADD_3 replaces the third atomicAdd with atomicAdd_block. For the combined relaxations, we concatenate the index of each operation, such as SKIP_ATOMICADD_1_2. We select only a subset of the combinations, since it is not practical to execute all of them. While we conduct experiments for 19 datasets, we include the five of them that present the most interesting design points.

Figure 3: Execution time-Modularity values for relaxing atomic operations in Grappolo.

All five graphs demonstrate significant performance improvements with little Modularity loss. Depending on how the target domain utilizing the community detection evaluates Modularity, one can easily prefer the approximated versions. The executions that have large Modularity values in the original version exhibit significant performance improvements without hurting the Modularity very much. In particular, the SKIP_ATOMICADD_1_2 version promises up to 3x performance gains with a 0.01 Modularity loss.

4.2.3. Warp Divergence Elimination

We utilize Grappolo for our warp divergence elimination method due to its complex structure, which employs branch instructions inside kernel functions. We work with two kernel functions and perform different divergence eliminations. Firstly, we execute only one path out of three, but the execution does not terminate (infinite loop) for any path choice. We therefore apply a different strategy: we eliminate the code in the target path executions and return the previously computed value with no computation. With this method, the execution completes in a shorter time, at the cost of lower Modularity values.

Table 2 presents the execution time and Modularity values for the original execution and our approximate version. While we see a decrease in all execution times, the approximation also destroys the Modularity values, with one exception, namely the wb-edu dataset. Since this approximation completely eliminates some code segments, accuracy loss becomes inevitable in most cases, and it requires a more rigorous analysis of the target code.

Table 2: Execution time and Modularity values with warp divergence elimination for Grappolo.

                   Original              Approx.
Dataset            Time    Modularity    Time    Modularity
relat9             1.206   0.491         0.616   0.254
cage15             1.341   0.893         1.077   0.727
rel9               1.094   0.458         0.446   0.253
ljournal           2.034   0.759         1.573   0.588
rgg23              1.295   0.991         1.206   0.718
soc-LiveJournal1   2.042   0.753         1.606   0.603
wb-edu             3.468   0.995         1.621   0.980

5. Related Work

SAGE [7] presents a static compiler that generates a set of approximated CUDA kernels, together with a runtime system that employs selective discarding of atomic operations, data packing, and thread fusion optimizations. It yields 2.5x speedup with less than 10% quality loss for machine learning and image processing kernels. While SAGE proposes approximations for CUDA computations and significantly improves performance, it relies on generic approximation methods instead of CUDA-specific techniques.

Freytag et al. [27] propose efficient executions for scientific simulation applications by building multiple kernel implementations with different precision levels. They execute approximated kernel versions by switching from one version to another at runtime based on Target Output Quality (TOQ) scenarios. By employing execution configurations based on an analysis of the accuracy loss, the experiments reveal high-performance and energy-efficient executions for target precision levels. While the authors build application-layer approximations for the target code, they modify only the precision levels of the target code and do not introduce CUDA-specific methods.

Liu et al. [28] present cuSpAMM, a CUDA adaptation of the Sparse Approximate Matrix Multiply algorithm, utilizing thread parallelism, memory tiling, and the tensor cores of multiple GPU devices. While the proposed work implements an approximation algorithm by considering GPU-specific features, the implementation relies on GPU optimization techniques rather than approximation.

6. Conclusions and Future Work

In this work, we propose CUDA-specific approximation methods based on loop perforation, relaxed synchronization, and warp divergence elimination. We define the approximations as compiler directives and enable them for target executions. Our experimental results demonstrate that our approximation techniques promise good performance improvements without hurting output accuracy significantly.

While the compiler directives offer some level of automation, we can extend our work by building a fully automated tool that performs source-to-source compiler transformations, so that approximated versions can be generated easily. Moreover, a design space exploration technique could help choose the best design points considering performance improvements and inaccuracy values.

While approximate computing offers performance improvements, it is also essential to evaluate the power consumption of the target execution. We can extend our work by including energy measurements for GPU devices and adding that criterion to our resource-aware evaluation.

Acknowledgments

This work was supported by the Scientific and Technological Research Council of Turkey (TÜBİTAK), Grant No: 122E395. This work is partially supported by CERCIRAS COST Action CA19135 funded by the COST Association.

References

[1] S. Mittal, J. S. Vetter, A survey of methods for analyzing and improving GPU energy efficiency, ACM Computing Surveys 47 (2014). doi:10.1145/2636342.
[2] K. Ma, X. Li, W. Chen, C. Zhang, X. Wang, GreenGPU: A holistic approach to energy efficiency in GPU-CPU heterogeneous architectures, in: 2012 41st International Conference on Parallel Processing, 2012, pp. 48–57.
[3] Q. Zeng, Y. Du, K. Huang, K. K. Leung, Energy-efficient resource management for federated edge learning with CPU-GPU heterogeneous computing, IEEE Transactions on Wireless Communications 20 (2021) 7947–7962.
[4] V. Raca, S. W. Umboh, E. Mehofer, B. Scholz, Runtime and energy constrained work scheduling for heterogeneous systems, The Journal of Supercomputing 78 (2022) 17150–17177.
[5] S. Mittal, A survey of techniques for approximate computing, ACM Computing Surveys 48 (2016).
[6] P. Stanley-Marbell, A. Alaghi, M. Carbin, E. Darulova, L. Dolecek, A. Gerstlauer, G. Gillani, D. Jevdjic, T. Moreau, M. Cacciotti, A. Daglis, N. E. Jerger, B. Falsafi, S. Misailovic, A. Sampson, D. Zufferey, Exploiting errors for efficiency: A survey from circuits to applications, ACM Computing Surveys 53 (2020). doi:10.1145/3394898.
[7] M. Samadi, J. Lee, D. A. Jamshidi, A. Hormati, S. Mahlke, SAGE: Self-tuning approximation for graphics engines, in: 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2013.
[8] D. Peroni, M. Imani, H. Nejatollahi, N. Dutt, T. Rosing, ARGA: Approximate reuse for GPGPU acceleration, in: 2019 56th ACM/IEEE Design Automation Conference (DAC), 2019, pp. 1–6.
[9] D. Maier, B. Cosenza, B. Juurlink, Local memory-aware kernel perforation, in: Proceedings of the 2018 International Symposium on Code Generation and Optimization, CGO 2018, Association for Computing Machinery, New York, NY, USA, 2018, pp. 278–287. doi:10.1145/3168814.
[10] A. Li, S. L. Song, M. Wijtvliet, A. Kumar, H. Corporaal, SFU-driven transparent approximation acceleration on GPUs, in: Proceedings of the 2016 International Conference on Supercomputing, ICS '16, Association for Computing Machinery, New York, NY, USA, 2016.
[11] R. Garcia, F. Asgarinejad, B. Khaleghi, T. Rosing, M. Imani, TruLook: A framework for configurable GPU approximation, in: 2021 Design, Automation and Test in Europe Conference and Exhibition (DATE), 2021, pp. 487–490. doi:10.23919/DATE51398.2021.9474239.
[12] F. Vaverka, V. Mrazek, Z. Vasicek, L. Sekanina, TFApprox: Towards a fast emulation of DNN approximate hardware accelerators on GPU, in: 2020 Design, Automation and Test in Europe Conference and Exhibition (DATE), 2020, pp. 294–297. doi:10.23919/DATE48585.2020.9116299.
[13] D. Maier, B. Juurlink, Model-based loop perforation, in: Euro-Par 2021: Parallel Processing Workshops, Springer International Publishing, Cham, 2022, pp. 549–554.
[14] K. Iliakis, H. Timko, S. Xydis, P. Tsapatsaris, D. Soudris, Enabling large scale simulations for particle accelerators, IEEE Transactions on Parallel and Distributed Systems 33 (2022) 4425–4439. doi:10.1109/TPDS.2022.3192707.
[15] A. L. C. Bueno, N. de La Rocque Rodriguez, E. D. Sotelino, Adaptive relaxed synchronization through the use of supervised learning methods, Future Generation Computer Systems 106 (2020) 260–269.
[16] S. Cherubin, G. Agosta, Tools for reduced precision computation: A survey, ACM Computing Surveys 53 (2020). doi:10.1145/3381039.
[17] D. B. Kirk, W.-m. W. Hwu, Programming Massively Parallel Processors, Third Edition, Morgan Kaufmann, 2017.
[18] S. Grauer-Gray, L. Xu, R. Searles, S. Ayalasomayajula, J. Cavazos, Auto-tuning a high-level language targeted to GPU codes, in: 2012 Innovative Parallel Computing (InPar), 2012.
[19] M. Harris, K. Perelygin, Cooperative groups: Flexible CUDA thread programming, https://developer.nvidia.com/blog/cooperative-groups/, 2017.
[20] T. M. Aamodt, W. W. L. Fung, T. G. Rogers, M. Martonosi, General-Purpose Graphics Processor Architectures, 2018.
[21] M. Halappanavar, H. Lu, A. Kalyanaraman, S. Ghosh, Grappolo community detection, https://github.com/ECP-ExaGraph/grappolo, 2024.
[22] Z. Xu, X. Chen, J. Shen, Y. Zhang, C. Chen, C. Yang, GARDENIA: A graph processing benchmark suite for next-generation accelerators, ACM Journal on Emerging Technologies in Computing Systems 15 (2019). doi:10.1145/3283450.
[23] NVIDIA, CUDA Toolkit 12.1, https://developer.nvidia.com/cuda-12-1-0-download-archive, 2024.
[24] NVIDIA, NVIDIA Ampere GA102 GPU architecture white paper, https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.1.pdf, 2021.
[25] J. Leskovec, A. Krevl, SNAP Datasets: Stanford large network dataset collection, http://snap.stanford.edu/data, 2014.
[26] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, E. Lefebvre, Fast unfolding of communities in large networks, Journal of Statistical Mechanics: Theory and Experiment (2008).
[27] G. Freytag, C. A. Künas, P. Rech, P. O. A. Navaux, Interleaved execution of approximated CUDA kernels in iterative applications, in: 2024 32nd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), 2024, pp. 60–67. doi:10.1109/PDP62718.2024.00017.
[28] X. Liu, Y. Liu, H. Yang, M. Dun, B. Yin, Z. Luan, D. Qian, Accelerating approximate matrix multiplication for near-sparse matrices on GPUs, The Journal of Supercomputing 78 (2022) 11464–11491. doi:10.1007/s11227-022-04334-5.
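
Appendix. As an illustration of the CUDA-level relaxations evaluated in Section 4, the sketch below shows the atomic-operation variants (SKIP_ATOMICADD / BLOCK_ATOMICADD) and warp divergence elimination on a hypothetical kernel; the kernel, data layout, and helper names are illustrative assumptions, not code from the evaluated benchmarks.

```cuda
// Hypothetical histogram kernel illustrating the atomic relaxations.
__global__ void histogram(const int *bin, float *hist, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Original: device-wide atomic update (exact, slow under contention).
    atomicAdd(&hist[bin[i]], 1.0f);

    // BLOCK_ATOMICADD_1: thread-scope atomic (compute capability >= 6.0);
    // atomicity holds only within a thread block, so updates racing
    // across blocks may be lost.
    // atomicAdd_block(&hist[bin[i]], 1.0f);

    // SKIP_ATOMICADD_1: plain read-modify-write; fastest, but concurrent
    // updates to the same bin can be dropped entirely.
    // hist[bin[i]] += 1.0f;
}

// Warp divergence elimination: a data-dependent branch makes the warp
// serialize both paths. Always taking the cheap path removes the
// divergence at the cost of accuracy.
__device__ float edge_weight(float w, float t) {
    // Original (divergent):
    //   if (w > t) return expensive_correction(w);  // hypothetical helper
    //   else       return w;
    return w;  // approximated: single path for all 32 threads of the warp
}
```

Which relaxation is acceptable depends on the output metric: in the community detection case, for example, the modularity loss reported in Section 4.2 bounds how aggressively the atomics can be skipped.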