Evaluating CUDA-Aware Approximate Computing Techniques

Işıl Öz
Computer Engineering Department, Izmir Institute of Technology, Izmir, Turkey

RAW'24: The 3rd Workshop on Resource AWareness of Systems and Society (RAW 2024), July 02–05, 2024, Maribor, Slovenia
isiloz@iyte.edu.tr (I. Öz); ORCID: 0000-0002-8310-1143 (I. Öz)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

Abstract
Approximate computing techniques offer performance improvements by performing inexact computations. Moreover, CUDA programs written to be executed on GPU devices employ specific features to utilize the parallel computation units of heterogeneous GPU architectures. While generic software-level approximate computing techniques have been applied to heterogeneous CUDA programs, CUDA-specific approaches may introduce promising performance improvements without corrupting the target computations. In this work, we propose software approximation techniques for CUDA programs: kernel-aware loop perforation, partition-level synchronization, block-level atomic operations, and warp divergence elimination. We perform source code transformations on target benchmark programs by applying our techniques, and we evaluate the performance improvements obtained by trading off accuracy in the target computations. Our experimental results reveal that CUDA-aware approximation techniques offer significant performance improvements at the expense of acceptable accuracy loss.

Keywords
CUDA programming model, GPU computing, approximate computing

1. Introduction

Heterogeneous computer systems, combining general-purpose processors and GPU devices as accelerators, enable high-performance and energy-efficient executions. However, applications from various domains, such as AI acceleration, big-data processing, and high-performance computing (HPC), with large computing requirements make power consumption an important concern in these systems [1]. Since modern GPU architectures employ complex structures and the target workloads exploit the massively parallel resources, energy efficiency becomes critical for large-scale GPU executions [2, 3, 4].

To resolve the conflict between performance and energy efficiency, approximate computing maintains high performance and low power consumption for applications that can tolerate inexact computations. While architecture-level approximate computing techniques are enabled by modifying processor units and memory components, software solutions based on compiler transformations or manual code modifications also offer approximate computations [5, 6]. While inexact hardware units or voltage scaling constitute hardware-level solutions, techniques like loop perforation or relaxed synchronization offer performance-accuracy tradeoffs at the software level.

Since GPU systems serve applications from different domains, they employ approximate computing techniques to improve performance and energy efficiency by trading accuracy in the target computations. Besides inherently error-tolerant graphics and image processing applications [7], general-purpose GPU programs benefit from approximations with reasonable amounts of incorrect computation [8]. While some works reuse generic techniques like perforation [9], other methods utilize GPU-specific hardware or software components to employ approximations [10]. Additionally, simulation-based evaluations propose hardware modifications, either adding approximate units or supporting approximate computations [11, 12].

In this work, we propose software-based approximations for CUDA programs running on GPU architectures. Not only do we adapt existing techniques for GPU programs, but we also propose CUDA-specific methods that target parallel CUDA threads. Our main contributions are as follows:

- We propose kernel-aware loop perforation by adapting the loop perforation technique for CUDA programs.
- To reduce the synchronization overhead, we propose partition-level synchronization and block-level atomic operations based on CUDA cooperative groups and CUDA thread-scope atomic functions. Additionally, we eliminate warp divergence inside CUDA kernel functions to prevent serial execution caused by branch instructions.
- We modify the target codes by inserting compiler directives enabling our techniques, and generate our approximate versions based on the given compiler options.
- We perform an experimental study to evaluate the impact of the modifications introduced by our approximations. Our study includes applications from different domains to observe the performance and accuracy variations for the target executions. Our experimental results reveal that CUDA-aware approximation techniques offer significant performance improvements at the expense of acceptable accuracy loss.

The remainder of this paper is organized as follows: Section 2 presents background on approximate computing and the CUDA programming model. We explain our approximation methods in Section 3. Then, the experimental results are outlined in Section 4. Section 5 presents relevant studies about CUDA approximations. Finally, in Section 6, we summarize the work with some concluding remarks.

2. Background and Motivation

2.1. Approximate Computing

Approximate computing introduces acceptable inaccuracies into the computing process and promises significant performance and energy gains. Some techniques employ the loop perforation approach, which works by skipping some loop iterations to reduce computational overhead [9, 13]. Relaxed synchronization shortens the waiting time of the threads
that wait for the completion of other threads' work [14, 15]. Reduced-precision computation employs fewer compute cycles with insignificant value modifications for error-tolerant applications targeting low-precision executions [16].

2.2. GPU Programming Model

While modern GPU architectures evolve as the requirements of the target applications change, GPU devices employ SIMT (Single Instruction Multiple Threads) execution to accelerate data processing tasks in their parallel execution cores.

A program written in the CUDA programming model [17], which is a parallel programming model for NVIDIA GPU devices, starts its execution on a CPU, allocates memory space on the GPU, transfers data into GPU global memory, and starts a kernel function execution by creating thousands of threads. Each thread executes the same program (SIMT) by processing different parts of the given data. Threads that execute on the GPU are part of a compute kernel specified by a function. Besides data-parallel applications that can benefit from the many parallel execution units of GPUs, large-scale irregular computations utilize the massive degree of parallelism and the high memory bandwidth provided by GPUs. Listing 1 presents the CUDA kernel function and kernel launch configuration for the vector addition operation. For simplicity, we skip the memory allocation and copy operations. The execution launches the vectorAdd function by specifying the number of blocks and the number of threads in each block. The hardware scheduler schedules the blocks onto SM (Streaming Multiprocessor) units and thread groups (warps) onto GPU cores inside the SMs. Each thread executes the kernel function and performs the addition operation based on its global thread identifier.

__global__ void vectorAdd(double *a, double *b, double *c, int n) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n)
        c[id] = a[id] + b[id];
}

int main(int argc, char** argv) {
    double *d_a, *d_b, *d_c;
    int n;
    //... memory allocations and copy operations
    int blockSize = 1024; // Number of threads in each thread block
    int gridSize = (int)ceil((float)n / blockSize);
    vectorAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);
    //... memory copy and deallocations
}

Listing 1: Vector addition kernel function and its launch in CUDA.

3. Approximation for CUDA Programs

We propose three main approximations for target CUDA programs. Firstly, we exploit loop perforation by adapting the popular method for loop and loop-like structures in the CUDA code. Secondly, we replace synchronization primitives with relaxed versions: we propose partition-level synchronization for threads based on CUDA cooperative groups, and block-level atomic operations using CUDA thread scopes. Finally, we remove warp divergence, a serial bottleneck in GPU executions.

We evaluate target CUDA source codes and perform code transformations based on compiler directives. While our approach requires manual code analysis and modifications to introduce directives in code segments, the semi-automatic configuration enables us to generate target codes that employ approximations selectively by compiling the code with specific options.

3.1. Kernel-Aware Loop Perforation

While loop perforation skips some loop iterations in a serial program, the same technique can be applied to CUDA programs by adapting the perforation accordingly. We consider three approximation versions derived from loop perforation:

Kernel launch perforation: We skip the iterations of a loop that launches one or multiple CUDA kernel functions at each iteration. The perforation is simply a regular loop perforation with kernel launches per iteration. In the code segment given in Listing 2 (the Fdtd2d program from the Polybench suite [18]), we reduce the number of iterations, and consequently the kernel launches, by assigning a smaller upper bound (the _PB_TMAX variable in the example code).

...
for (int t = 0; t < _PB_TMAX; t++) {
    fdtd_step1_kernel<<<grid, block>>>(nx, ny, _fict_gpu, ex_gpu, ey_gpu, hz_gpu, t);
    cudaDeviceSynchronize();
    fdtd_step2_kernel<<<grid, block>>>(nx, ny, ex_gpu, ey_gpu, hz_gpu, t);
    cudaDeviceSynchronize();
    fdtd_step3_kernel<<<grid, block>>>(nx, ny, ex_gpu, ey_gpu, hz_gpu, t);
    cudaDeviceSynchronize();
}
...

Listing 2: The Fdtd2d code with kernel function calls inside a loop.

Kernel launch configuration perforation: In data-parallel CUDA programs, CUDA threads execute implicit loops in parallel by performing the computations that belong to one or more iterations of the serial program loop. We launch target kernel executions with a reduced number of threads in the configuration; hence, the original loop of the serial program is perforated. In the code segment given in Listing 3 (the Convolution2D program from the Polybench suite [18]), we modify the block or grid configuration parameters of the convolution2D_kernel kernel by reducing the X or Y dimensions of the grid.

...
dim3 block(DIM_THREAD_BLOCK_X, DIM_THREAD_BLOCK_Y);
dim3 grid(ceil(((float)NI) / ((float)block.x)), ceil(((float)NJ) / ((float)block.y)));
convolution2D_kernel<<<grid, block>>>(ni, nj, A_gpu, B_gpu);
...

Listing 3: The Convolution2D code kernel launch configuration.
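The two launch-level variants above can be sketched with preprocessor guards. The macro names follow the paper's directive names; how the perforation rate is encoded in the macro value, and the kernel itself, are illustrative assumptions rather than the paper's exact implementation.

```cuda
// Sketch: directive-guided kernel launch and grid perforation.
// Compile, e.g., with: nvcc -DKERNEL_LAUNCH_PERFORATION=90 sketch.cu
__global__ void step_kernel(float *data, int n) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n) data[id] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    int tmax = 100;
#ifdef KERNEL_LAUNCH_PERFORATION
    // Kernel launch perforation: a 90% rate keeps only 10% of the
    // loop iterations, and hence of the kernel launches.
    tmax = tmax * (100 - KERNEL_LAUNCH_PERFORATION) / 100;
#endif

    int blockSize = 256;
    int gridSize = (n + blockSize - 1) / blockSize;
#ifdef GRID_PERFORATION
    // Launch configuration perforation: launch fewer thread blocks;
    // the elements of the dropped blocks are simply never computed.
    gridSize = gridSize * (100 - GRID_PERFORATION) / 100;
#endif

    for (int t = 0; t < tmax; t++)
        step_kernel<<<gridSize, blockSize>>>(d_data, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```

Both guards compose: enabling several directives at once yields a version that perforates the iteration count and the thread count together.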
Intra-kernel loop perforation: We perform the standard loop perforation method on the code inside kernel functions. In the code segment given in Listing 4 (the Covariance program from the Polybench suite [18]), we reduce the number of loop iterations inside the mean_kernel kernel (the _PB_N variable in the example code).

...
__global__ void mean_kernel(int m, int n, DATA_TYPE *mean, DATA_TYPE *data)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;

    if (j < _PB_M) {
        mean[j] = 0.0;

        int i;
        for (i = 0; i < _PB_N; i++) {
            mean[j] += data[i * M + j];
        }
        mean[j] /= (DATA_TYPE)FLOAT_N;
    }
}
...

Listing 4: The Covariance code with loop structures inside a kernel function mean_kernel.

We modify each program code by inserting compiler directives for a set of loop perforation types. Specifically, we define four directives: KERNEL_LAUNCH_PERFORATION, GRID_PERFORATION, BLOCK_PERFORATION, and LOOP_PERFORATION, and compile the programs by enabling the directives with specific values, which represent the perforation rate as the reduction ratio of the target loop. By enabling the chosen perforation type(s) and rate(s) at compile time, we evaluate the impacts on the execution.

3.2. Relaxed Synchronization

Multiple CUDA threads require time-consuming synchronization to access shared data or to resolve data dependencies, utilizing atomic operations (like atomicAdd) and barrier operations (like __syncthreads()), respectively. Relaxed synchronization offers performance gains by synchronizing fewer threads in exchange for output accuracy loss. We consider two main relaxations based on CUDA cooperative groups and CUDA thread scopes:

Partition-level synchronization: CUDA threads within a block can cooperate by synchronizing their execution to coordinate memory accesses. The programmer can define synchronization points by calling the __syncthreads() function, which acts as a barrier and makes all threads wait. While CUDA employs block-level synchronization with the __syncthreads() function, the __syncwarp() function, which synchronizes the threads within a warp, became available in CUDA 9. This is important for porting code to modern GPU architectures after Volta, in which threads within a warp can be scheduled separately. Additionally, the Cooperative Groups API [19] provides a rich set of thread-synchronization primitives by forming partitions over a set of threads. Listing 5 presents different code snippets for organizing groups of threads. While the first group, blockgroup, represents all the threads in a thread block, warpgroup represents all the threads in a warp. If we synchronize the threads in those groups, the behavior is the same as with the __syncthreads() and __syncwarp() functions, respectively.

// Cooperative group for the current thread block
auto blockgroup = cooperative_groups::this_thread_block();
// Cooperative group for each warp in the thread block
auto warpgroup = cooperative_groups::tiled_partition<32>(blockgroup);
// Cooperative group for each 16 threads in the thread block
auto subwarp16 = cooperative_groups::tiled_partition<16>(blockgroup);
// Cooperative group for all currently coalesced threads in the warp
auto coalescedgroup = cooperative_groups::coalesced_threads();
// Thread block groups can sync
blockgroup.sync();

Listing 5: CUDA cooperative groups [19].

To implement the approximation, we first search for all __syncthreads() function calls in the target kernel functions and configure the synchronization level for each synchronization point. Specifically, we either completely skip __syncthreads() (SKIP) or replace it with a relaxed version. For relaxing synchronization, we choose __syncwarp() (WARP) or utilize cooperative thread groups (the details are given below). We modify each program code and inject #ifdef directives to guide the compiler based on user preferences. For each __syncthreads() code block, we define one directive and compile the code by specifying one or more directives.

For our partition-level approach, we define two thread partitions (as given in Listing 6): 1) 4TILE: cooperative thread groups with four threads in the corresponding warp, and 2) ACTIVE: the currently coalesced threads in the warp. When data-dependent conditional branches in the code cause threads within a warp to diverge, the SM disables (deactivates) the threads that do not take the branch; the threads that remain active on the path are referred to as coalesced.

//Replaced code version 1 (4TILE)
thread_group tile32 = tiled_partition(this_thread_block(), 32);
thread_group tile4 = tiled_partition(tile32, 4);
tile4.sync();

//Replaced code version 2 (ACTIVE)
thread_group active = coalesced_threads();
active.sync();

Listing 6: Partition-level synchronization configurations.

Block-level atomic operations: While the atomic operations in standard C or C++ are uniform, the CUDA programming model offers atomic functions at different scopes. A thread scope specifies the set of threads that can synchronize with each other using atomic operations. Atomic functions with the _system suffix (e.g., atomicAdd_system) are atomic at system scope, where the system refers to a system running multiple GPUs and CPUs. Atomic functions without a suffix (e.g., atomicAdd) are atomic at device scope, where the device refers to the target GPU device. Atomic functions with the _block suffix (e.g., atomicAdd_block) are atomic at thread block scope, which refers to synchronization among the threads executing in the same thread block.
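A single directive-guided synchronization point combining the relaxations above might look as follows. The directive spellings (SYNC_1_SKIP and so on) and the reduction kernel are illustrative; the paper describes the scheme but not its exact macro names.

```cuda
// Sketch: one relaxed synchronization point plus an atomic-scope relaxation.
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Assumes a fixed block size of 256 threads for the shared-memory tile.
__global__ void reduce_kernel(const float *data, float *sum, int n) {
    __shared__ float tile[256];
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (id < n) ? data[id] : 0.0f;

#if defined(SYNC_1_SKIP)
    // SKIP: no barrier at all; readers may observe stale tile entries.
#elif defined(SYNC_1_WARP)
    __syncwarp();                         // WARP: warp-level barrier only
#elif defined(SYNC_1_4TILE)
    cg::thread_block_tile<4> tile4 =
        cg::tiled_partition<4>(cg::this_thread_block());
    tile4.sync();                         // 4TILE: four-thread partition
#elif defined(SYNC_1_ACTIVE)
    cg::coalesced_group active = cg::coalesced_threads();
    active.sync();                        // ACTIVE: coalesced threads only
#else
    __syncthreads();                      // original block-level barrier
#endif

    if (threadIdx.x == 0) {
        float partial = 0.0f;
        for (int i = 0; i < blockDim.x; i++) partial += tile[i];
#ifdef ATOMIC_1_BLOCK
        // Block-scope atomic on a device-wide accumulator: updates from
        // different blocks may race, which is the accepted approximation.
        atomicAdd_block(sum, partial);
#else
        atomicAdd(sum, partial);          // original device-scope atomic
#endif
    }
}
```

Each synchronization point in a target kernel gets its own directive, so the relaxation level can be chosen independently per point at compile time.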
For block-level atomic operations, we arrange for threads to perform atomic operations at most at thread block scope. As with our synchronization approach, we search for all atomic functions in the target CUDA code and reduce the atomic scope accordingly. For instance, we replace atomicAdd function calls with atomicAdd_block, or we completely remove the function call. Hence, we obtain atomic operations involving fewer threads than the original code. Similarly, we could replace atomicAdd_system calls or skip atomicAdd_block calls.

3.3. Warp Divergence Elimination

CUDA threads are executed in groups of 32 threads (warps), and all threads in a warp execute the same instruction at the same time. Due to the SIMD execution model, when threads in the same warp need to perform different operations, the execution of the different branches is serialized, thus hurting the performance improvement that could be gained from parallelism. Figure 1 presents an example scenario for warp divergence. Eight threads (assuming an 8-thread warp size for simplicity) start the kernel execution; then, at the Branch point, there is an if statement that causes different path executions. While four threads execute the instruction at Path A, the other four continue the execution at Path B. When the first four threads execute Path A, the others must wait and perform no operation. The marked execution prevents full warp utilization by activating only four threads simultaneously in an 8-thread warp structure.

Figure 1: SIMD Warp Divergence [20].

To eliminate the divergence overhead, as an approximation method, we execute only one path when multiple paths exist in a warp. For instance, for the code given in Listing 7 (from the Grappolo application [21]), we configure the execution to take only Path 1, Path 2, or Path 3. Alternatively, we completely skip the divergent code segment in our evaluations.

...
if (currCId == FLAG_FREE) {
    /*Path 1*/
}
if (currCId == (1 + dataItem->cId)) {
    /*Path 2*/
}
else
{
    /*Path 3*/
}
...

Listing 7: SIMD warp divergence code example.
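The single-path strategy for Listing 7 can be sketched with compile-time guards. The macro names (FORCE_PATH_1, SKIP_DIVERGENT) and the surrounding device function are illustrative; the paper applies this transformation manually to Grappolo's kernels.

```cuda
// Sketch: directive-guided warp divergence elimination for a branch
// shaped like Listing 7. FLAG_FREE's value is an illustrative stand-in.
#define FLAG_FREE (-1)

__device__ int classify(int currCId, int dataCId) {
#if defined(FORCE_PATH_1)
    // Every thread takes Path 1, so the warp never diverges; threads
    // whose condition would have chosen Path 2/3 compute a wrong result.
    return /*Path 1*/ 0;
#elif defined(SKIP_DIVERGENT)
    // Skip the divergent segment entirely and fall back on the
    // previously computed value carried in by the caller.
    return dataCId;
#else
    // Original divergent code: threads in a warp may take different
    // paths, serializing the branch bodies.
    if (currCId == FLAG_FREE) {
        return /*Path 1*/ 0;
    }
    if (currCId == (1 + dataCId)) {
        return /*Path 2*/ 1;
    } else {
        return /*Path 3*/ 2;
    }
#endif
}
```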
4. Experimental Study

4.1. Experimental Setup

To evaluate our approximation methods, we select CUDA applications from the Polybench [18] and Gardenia [22] benchmark suites and utilize an optimized CUDA implementation of the Louvain community detection algorithm, namely Grappolo [21]. While the Polybench applications mostly employ data-parallel computations with multiple loop structures, Gardenia implements a set of graph algorithms that include synchronization primitives. Grappolo, with its computationally intensive and complex structures, includes code segments suitable for our evaluations on relaxed synchronization and warp divergence-based approximations.

We compile our programs with CUDA 12.1 [23] and run our approximation experiments on a system with an NVIDIA GeForce RTX 3050 Ti Mobile GPU device. The GPU device, built on the Ampere architecture [24], has 4 GB of GDDR6 memory.

4.2. Experimental Results

We evaluate our three main approximations for target CUDA programs separately. We execute both original and approximated versions, measure GPU execution times, and collect result outputs. By comparing execution times and output accuracies, we perform a tradeoff analysis for the target computations.

4.2.1. Kernel-Aware Loop Perforation

For our loop perforation techniques, we select six programs from the Polybench benchmark suite [18]. The programs have data-parallel characteristics, and each employs different loop structures. We execute Correlation, Covariance, Syrk, and Fdtd2d with STANDARD input sizes, and Jacobi-2D and 2DConv with LARGE input sizes to obtain longer execution times. We collect GPU execution times and count incorrect computations by comparing against the original output. Since the programs work with array structures and compute array elements as the final output, we evaluate the number of array elements that are computed incorrectly.

Figure 2 demonstrates performance improvement and inaccuracy values for the programs when our loop-perforation methods are applied. For each applicable method, namely kernel launch perforation, intra-kernel loop perforation, grid-level kernel launch configuration perforation, and block-level kernel launch configuration perforation, we apply 90% and 80% perforation rates. If a program does not support the target approximation (e.g., Correlation does not have a kernel launch inside a loop), we simply omit the corresponding result from our evaluation. The values in Figure 2 present 1/Speedup and the rate of incorrectly computed elements. We define performance in terms of speedup, the ratio of the compute time of the original execution to the time of the approximate execution, and report 1/Speedup values in our results. For instance, the original Correlation execution takes 1.785 milliseconds and computes 4194304 array elements. When we perforate the kernel function loops by 90% (Loop (90%)), we measure 1.302 milliseconds and 793356 incorrect computations. Therefore, the performance rate equals 1.302/1.785 = 0.73, and the rate of incorrect computations is 793356/4194304 = 0.19, as shown in Figure 2. By reporting performance improvement and inaccuracy values in this way, one can evaluate the performance gains and incorrect results of each approximation and make design decisions. Based on the program characteristics, each approximation affects the execution outcome differently. We can obtain up to 60% performance improvement (Loop (80%) for Correlation) in exchange for 40% of the elements being computed incorrectly. Some approximations offer good tradeoff points, like the grid-level kernel launch configuration perforations (Grid (90%) and Grid (80%)) in Fdtd2d, with 20% and 30% performance improvements at the cost of losing 30% and 50% of the correct computations. On the other hand, some cases show no performance improvement despite small inaccuracy values (like the kernel launch configuration perforations in Covariance) or intolerable output loss despite improved execution times (like the loop perforations in Syrk).

Figure 2: Speedup-Inaccuracy variation for loop perforation approximation methods.

4.2.2. Relaxed Synchronization

We evaluate Betweenness Centrality (bc) from the Gardenia benchmark suite [22], which has four different implementations. For a sample graph (soc-LiveJournal1 [25]), we execute each version and select the one with the lowest execution time. Since that version already employs optimizations, we apply our approximation methods to it for a fair comparison. The implementation (i.e., bc_topo_lb) has four main kernel functions with synchronization primitives (i.e., __syncthreads()). We apply our relaxed synchronization techniques to each of the seven __syncthreads() function calls in the target kernel functions and perform four specific modifications: 1) SKIP: remove __syncthreads(), 2) WARP: synchronize threads in the same warp, 3) 4TILE: synchronize four threads in the same cooperative group, 4) ACTIVE: synchronize coalesced threads. In total, we obtain 28 different versions, which we execute with 19 different datasets. We observe execution time and output differences for only a subset of our executions, specifically relaxations of three __syncthreads() function calls in only one kernel function. For these three synchronization points, we also consider relaxing their combinations.

Table 1 presents the execution times and the number of incorrect computations in the observed output for the specified graphs.

Table 1: Execution time and incorrect computations (out of the given expected correct values) for relaxed synchronization methods for the bc application.

                      ljournal-2008         socLiveJournal        cage15
                      Time    Incorrect     Time    Incorrect     Time    Incorrect
                              (5,363,260)           (4,847,571)           (5,154,859)
ORIGINAL              61.156  0             57.116  0             47.096  0
SYNC 1     SKIP       60.637  954           57.116  276           46.780  28,326
           WARP       60.199  793           53.611  355           47.109  70,611
           4TILE      60.579  754           57.205  227           46.939  76,002
           ACTIVE     60.205  1113          55.431  292           47.164  71,563
SYNC 2     SKIP       59.770  19,888        52.756  1636          47.425  486,970
           WARP       59.314  11,230        52.485  1693          47.811  498,751
           4TILE      61.364  14,087        52.300  1522          48.465  303,306
           ACTIVE     59.397  8345          52.532  1647          47.850  463,191
SYNC 3     SKIP       61.328  64            53.996  36            47.198  15,531
           WARP       60.643  1996          53.774  38            47.843  14,341
           4TILE      61.374  137           53.961  34            47.186  13,383
           ACTIVE     60.744  160           53.872  32            47.775  13,975
SYNC 1+2   SKIP       56.800  20,137        50.572  1739          46.420  98,691
           WARP       56.377  17,035        50.401  1633          46.879  104,232
           4TILE      57.779  12,311        51.199  1678          47.217  138,951
           ACTIVE     56.195  22,491        50.495  1657          46.908  84,210
SYNC 1+2+3 SKIP       57.088  20,346        50.871  1729          46.966  85,029
           WARP       56.747  20,719        50.774  1600          47.320  62,026
           4TILE      57.801  14,807        51.513  1649          47.678  59,262
           ACTIVE     56.608  17,248        50.880  1556          47.352  87,300

We observe that SKIP, WARP, and ACTIVE mostly outperform 4TILE, probably due to the overhead of fine-grained group creation. While relaxing individual synchronization points (i.e., SYNC 1, SYNC 2, SYNC 3) offers significant performance gains with insignificant accuracy loss, the combinations (i.e., SYNC 1+2, SYNC 1+2+3) further improve performance without hurting the output quality much. While the accuracy loss depends on the target dataset, we see the most promising relaxation options for performance gains (around 8%-10%) with the SYNC 1+2 version.

Since bc has no atomic operations, we consider another application to observe the impact of our approximation techniques for atomic operations. We utilize the Grappolo code [21], a highly optimized CUDA implementation of the Louvain community detection algorithm [26]. Louvain is a greedy graph processing method that assigns each vertex to a community so as to maximize the overall Modularity, and generates a new graph in which the communities become new vertices. Since the output metric, Modularity, does not present an exact result, trading output accuracy for performance improvement is an interesting evaluation for this execution. The interpretation of the Modularity metric depends on the application domain utilizing the community detection; however, a Modularity value close to 1 indicates higher-quality output. While Grappolo employs both synchronization and atomic operation primitives, we evaluate only the atomic operations and apply our relaxation methods to them. Specifically, for atomicAdd and atomicCAS function calls, we either replace them with the non-atomic operation or with the block-level atomic function calls (i.e., atomicAdd_block or atomicCAS_block).

Figure 3 presents execution time and Modularity values as the performance and accuracy metrics, respectively. Besides individual atomic operations, we relax combinations of the atomic operations to see the impact on the outcome. In our target program, we have five atomicAdd and three atomicCAS function calls. We label the relaxations by the index and type of the method. Specifically, we use SKIP or BLOCK as the prefix and the order of the corresponding function as the suffix.
For instance, SKIP_ATOMICADD_1 replaces the first atomicAdd with the non-atomic operation; BLOCK_ATOMICADD_3 replaces the third atomicAdd with atomicAdd_block. For the combined relaxations, we concatenate the index of each operation, such as SKIP_ATOMICADD_1_2. We select only a subset of the combinations, since it is not practical to execute all of them. While we conduct experiments for 19 datasets, we include the five of them that present the most interesting design points.

Figure 3: Execution time-Modularity values for relaxing atomic operations in Grappolo.

All five graphs demonstrate significant performance improvements with little Modularity loss. Depending on how the target domain utilizing the community detection evaluates Modularity, one can easily prefer the approximated versions. The executions that have large Modularity values in the original version exhibit significant performance improvements without hurting the Modularity very much. In particular, the SKIP_ATOMICADD_1_2 version promises up to 3x performance gains with a 0.01 Modularity loss.

4.2.3. Warp Divergence Elimination

We utilize Grappolo for our warp divergence elimination method due to its complex structure, which employs branch instructions inside kernel functions. We work with two kernel functions and perform different divergence eliminations. Firstly, we execute only one path out of three, but the execution does not terminate (infinite loop) for any path choice. We therefore apply a different strategy: we eliminate the code in the target path executions and return the previously computed value with no computation. With this method, the execution completes in a shorter time, at the cost of lower Modularity values.

Table 2 presents the execution time and Modularity values for the original execution and our approximate version. While we see a decrease in all execution times, the approximation also destroys the Modularity values, with one exception, namely the wb-edu dataset. Since this approximation completely eliminates some code segments, accuracy loss becomes inevitable in most cases, and it requires a more rigorous analysis of the target code.

Table 2: Execution time and Modularity values with warp divergence elimination for Grappolo.

                   Original              Approx.
Dataset            Time    Modularity    Time    Modularity
relat9             1.206   0.491         0.616   0.254
cage15             1.341   0.893         1.077   0.727
rel9               1.094   0.458         0.446   0.253
ljournal           2.034   0.759         1.573   0.588
rgg23              1.295   0.991         1.206   0.718
soc-LiveJournal1   2.042   0.753         1.606   0.603
wb-edu             3.468   0.995         1.621   0.980

5. Related Work

SAGE [7] presents a static compiler that generates a set of approximated CUDA kernels, together with a runtime system that employs selective discarding of atomic operations, data packing, and thread fusion optimizations. It yields 2.5x speedup with less than 10% quality loss for machine learning and image processing kernels. While SAGE proposes approximations for CUDA computations and significantly improves performance, it relies on generic approximation methods instead of CUDA-specific techniques.

Freytag et al. [27] propose efficient executions for scientific simulation applications by building multiple kernel implementations with different precision levels. They execute approximated kernel versions by switching from one version to another at runtime based on Target Output Quality (TOQ) scenarios. By employing execution configurations based on an analysis of the accuracy loss, the experiments reveal high-performance and energy-efficient executions for target precision levels. While the authors build application-layer approximations for the target code, they modify only the precision levels of the target code and do not introduce CUDA-specific methods.

Liu et al. [28] present cuSpAMM, a CUDA adaptation of the Sparse Approximate Matrix Multiply algorithm, utilizing thread parallelism, memory tiling, and the tensor cores of multiple GPU devices. While the proposed work implements an approximation algorithm by considering GPU-specific features, the implementation relies on GPU optimization techniques rather than approximation.

6. Conclusions and Future Work

In this work, we propose CUDA-specific approximation methods based on loop perforation, relaxed synchronization, and warp divergence elimination. We define the approximations as compiler directives and enable them for target executions. Our experimental results demonstrate that our approximation techniques promise good performance improvements without hurting output accuracy significantly.

While the compiler directives offer some level of automation, we can extend our work by building a fully automated tool that performs source-to-source compiler transformations, so that approximated versions can be generated easily. Moreover, a design space exploration technique could help choose the best design points considering performance improvements and inaccuracy values.

While approximate computing offers performance improvements, it is also essential to evaluate the power consumption of the target execution. We can extend our work by including energy measurements for GPU devices and adding that criterion to our resource-aware evaluation.

Acknowledgments

This work was supported by the Scientific and Technological Research Council of Turkey (TÜBİTAK), Grant No: 122E395. This work is partially supported by CERCIRAS COST Action CA19135 funded by the COST Association.

References

[1] S. Mittal, J. S. Vetter, A survey of methods for analyzing and improving GPU energy efficiency, ACM Computing Surveys 47 (2014). doi:10.1145/2636342.
[2] K. Ma, X. Li, W. Chen, C. Zhang, X. Wang, GreenGPU: A holistic approach to energy efficiency in GPU-CPU heterogeneous architectures, in: 2012 41st International Conference on Parallel Processing, 2012, pp. 48–57.
[3] Q. Zeng, Y. Du, K. Huang, K. K. Leung, Energy-efficient resource management for federated edge learning with CPU-GPU heterogeneous computing, IEEE Transactions on Wireless Communications 20 (2021) 7947–7962.
[4] V. Raca, S. W. Umboh, E. Mehofer, B. Scholz, Runtime and energy constrained work scheduling for heterogeneous systems, The Journal of Supercomputing 78 (2022) 17150–17177.
[5] S. Mittal, A survey of techniques for approximate computing, ACM Computing Surveys 48 (2016).
[6] P. Stanley-Marbell, A. Alaghi, M. Carbin, E. Darulova, L. Dolecek, A. Gerstlauer, G. Gillani, D. Jevdjic, T. Moreau, M. Cacciotti, A. Daglis, N. E. Jerger, B. Falsafi, S. Misailovic, A. Sampson, D. Zufferey, Exploiting errors for efficiency: A survey from circuits to applications, ACM Computing Surveys 53 (2020). doi:10.1145/3394898.
[7] M. Samadi, J. Lee, D. A. Jamshidi, A. Hormati, S. Mahlke, SAGE: Self-tuning approximation for graphics engines, in: 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2013.
[8] D. Peroni, M. Imani, H. Nejatollahi, N. Dutt, T. Rosing, ARGA: Approximate reuse for GPGPU acceleration, in: 2019 56th ACM/IEEE Design Automation Conference (DAC), 2019, pp. 1–6.
[9] D. Maier, B. Cosenza, B. Juurlink, Local memory-aware kernel perforation, in: Proceedings of the 2018 International Symposium on Code Generation and Optimization, CGO 2018, Association for Computing Machinery, New York, NY, USA, 2018, pp. 278–287. doi:10.1145/3168814.
[10] A. Li, S. L. Song, M. Wijtvliet, A. Kumar, H. Corporaal, SFU-driven transparent approximation acceleration on GPUs, in: Proceedings of the 2016 International Conference on Supercomputing, ICS '16, Association for Computing Machinery, New York, NY, USA, 2016.
[11] R. Garcia, F. Asgarinejad, B. Khaleghi, T. Rosing, M. Imani, TruLook: A framework for configurable GPU approximation, in: 2021 Design, Automation and Test in Europe Conference and Exhibition (DATE), 2021, pp. 487–490. doi:10.23919/DATE51398.2021.9474239.
[12] F. Vaverka, V. Mrazek, Z. Vasicek, L. Sekanina, TFApprox: Towards a fast emulation of DNN approximate hardware accelerators on GPU, in: 2020 Design, Automation and Test in Europe Conference and Exhibition (DATE), 2020, pp. 294–297. doi:10.23919/DATE48585.2020.9116299.
[13] D. Maier, B. Juurlink, Model-based loop perforation, in: Euro-Par 2021: Parallel Processing Workshops, Springer International Publishing, Cham, 2022, pp. 549–554.
[14] K. Iliakis, H. Timko, S. Xydis, P. Tsapatsaris, D. Soudris, Enabling large scale simulations for particle accelerators, IEEE Transactions on Parallel and Distributed Systems 33 (2022) 4425–4439. doi:10.1109/TPDS.2022.3192707.
[15] A. L. C. Bueno, N. de La Rocque Rodriguez, E. D. Sotelino, Adaptive relaxed synchronization through the use of supervised learning methods, Future Generation Computer Systems 106 (2020) 260–269.
[16] S. Cherubin, G. Agosta, Tools for reduced precision computation: A survey, ACM Computing Surveys 53 (2020). doi:10.1145/3381039.
[17] D. B. Kirk, W.-m. W. Hwu, Programming Massively Parallel Processors, Third Edition, Morgan Kaufmann, 2017.
[18] S. Grauer-Gray, L. Xu, R. Searles, S. Ayalasomayajula, J. Cavazos, Auto-tuning a high-level language targeted to GPU codes, in: 2012 Innovative Parallel Computing (InPar), 2012.
[19] M. Harris, K. Perelygin, Cooperative groups: Flexible CUDA thread programming, https://developer.nvidia.com/blog/cooperative-groups/, 2017.
[20] T. M. Aamodt, W. W. L. Fung, T. G. Rogers, M. Martonosi, General-Purpose Graphics Processor Architectures, 2018.
[21] M. Halappanavar, H. Lu, A. Kalyanaraman, S. Ghosh, Grappolo community detection, https://github.com/ECP-ExaGraph/grappolo, 2024.
[22] Z. Xu, X. Chen, J. Shen, Y. Zhang, C. Chen, C. Yang, GARDENIA: A graph processing benchmark suite for next-generation accelerators, ACM Journal on Emerging Technologies in Computing Systems 15 (2019). doi:10.1145/3283450.
[23] NVIDIA, CUDA Toolkit 12.1, https://developer.nvidia.com/cuda-12-1-0-download-archive, 2024.
[24] NVIDIA, NVIDIA Ampere GA102 GPU architecture white paper, https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.1.pdf, 2021.
[25] J. Leskovec, A. Krevl, SNAP Datasets: Stanford large network dataset collection, http://snap.stanford.edu/data, 2014.
[26] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, E. Lefebvre, Fast unfolding of communities in large networks, Journal of Statistical Mechanics: Theory and Experiment (2008).
[27] G. Freytag, C. A. Künas, P. Rech, P. O. A. Navaux, Interleaved execution of approximated CUDA kernels in iterative applications, in: 2024 32nd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), 2024, pp. 60–67. doi:10.1109/PDP62718.2024.00017.
[28] X. Liu, Y. Liu, H. Yang, M. Dun, B. Yin, Z. Luan, D. Qian, Accelerating approximate matrix multiplication for near-sparse matrices on GPUs, The Journal of Supercomputing 78 (2022) 11464–11491. doi:10.1007/s11227-022-04334-5.
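
Appendix. As an illustration of the CUDA-level relaxations evaluated in Section 4, the sketch below shows the atomic-operation variants (SKIP_ATOMICADD / BLOCK_ATOMICADD) and warp divergence elimination on a hypothetical kernel; the kernel, data layout, and helper names are illustrative assumptions, not code from the evaluated benchmarks.

```cuda
// Hypothetical histogram kernel illustrating the atomic relaxations.
__global__ void histogram(const int *bin, float *hist, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Original: device-wide atomic update (exact, slow under contention).
    atomicAdd(&hist[bin[i]], 1.0f);

    // BLOCK_ATOMICADD_1: thread-scope atomic (compute capability >= 6.0);
    // atomicity holds only within a thread block, so updates racing
    // across blocks may be lost.
    // atomicAdd_block(&hist[bin[i]], 1.0f);

    // SKIP_ATOMICADD_1: plain read-modify-write; fastest, but concurrent
    // updates to the same bin can be dropped entirely.
    // hist[bin[i]] += 1.0f;
}

// Warp divergence elimination: a data-dependent branch makes the warp
// serialize both paths. Always taking the cheap path removes the
// divergence at the cost of accuracy.
__device__ float edge_weight(float w, float t) {
    // Original (divergent):
    //   if (w > t) return expensive_correction(w);  // hypothetical helper
    //   else       return w;
    return w;  // approximated: single path for all 32 threads of the warp
}
```

Which relaxation is acceptable depends on the output metric: in the community detection case, for example, the modularity loss reported in Section 4.2 bounds how aggressively the atomics can be skipped.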