<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Evaluating CUDA-Aware Approximate Computing Techniques</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Işıl</forename><surname>Öz</surname></persName>
							<email>isiloz@iyte.edu.tr</email>
							<affiliation key="aff0">
								<orgName type="department">Computer Engineering Department</orgName>
								<orgName type="institution">Izmir Institute of Technology</orgName>
								<address>
									<settlement>Izmir</settlement>
									<country key="TR">Turkey</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Evaluating CUDA-Aware Approximate Computing Techniques</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">3E8357F0EBA3FA2A069DDE2736EBD7A1</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T20:06+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>CUDA programming model</term>
					<term>GPU computing</term>
					<term>approximate computing</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Approximate computing techniques offer performance improvements by performing inexact computations. Moreover, CUDA programs written to be executed on GPU devices employ specific features to utilize the parallel computation units of heterogeneous GPU architectures. While generic software-level approximate computing techniques have been applied to heterogeneous CUDA programs, CUDA-specific approaches may introduce promising performance improvements without corrupting the target computations. In this work, we propose software approximation techniques for CUDA programs: kernel-aware loop perforation, partition-level synchronization, block-level atomic operations, and warp divergence elimination. We perform source code transformations on target benchmark programs by applying our techniques. We evaluate performance improvements by trading off accuracy in our target computations. Our experimental results reveal that CUDA-aware approximation techniques offer significant performance improvements at the expense of acceptable accuracy loss.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Heterogeneous computer systems, combining general-purpose processors and GPU devices as accelerators, enable high-performance and energy-efficient executions. However, applications from various domains such as AI acceleration, big-data processing, and high-performance computing (HPC), with their large computing requirements, make power consumption an important concern in these systems <ref type="bibr" target="#b0">[1]</ref>. Since modern GPU architectures employ complex structures and the target workloads exploit the massively parallel resources, energy efficiency becomes critical for large-scale GPU executions <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b2">3,</ref><ref type="bibr" target="#b3">4]</ref>.</p><p>To resolve the tension between performance and energy efficiency, approximate computing maintains high performance and low power consumption for applications that can tolerate inexact computations. While architecture-level approximate computing techniques are enabled by modifying processor units and memory components, software solutions based on compiler transformations or manual code modifications also offer approximate computations <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6]</ref>. Inexact hardware units and voltage scaling constitute hardware-level solutions, while techniques like loop perforation or relaxed synchronization offer performance-accuracy tradeoffs at the software level.</p><p>Since GPU systems target applications from diverse domains, they employ approximate computing techniques to improve performance and energy efficiency by trading off accuracy in the target computations. 
Besides inherently error-tolerant graphics and image processing applications <ref type="bibr" target="#b6">[7]</ref>, general-purpose GPU programs benefit from approximations with a reasonable amount of incorrect computation <ref type="bibr" target="#b7">[8]</ref>. While some works reuse generic techniques like perforation <ref type="bibr" target="#b8">[9]</ref>, other methods utilize GPU-specific hardware or software components to employ approximations <ref type="bibr" target="#b9">[10]</ref>. Additionally, simulation-based evaluations propose hardware modifications, either adding approximate units or supporting approximate computations <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b11">12]</ref>.</p><p>In this work, we propose software-based approximations for CUDA programs running on GPU architectures. Not only do we adapt existing techniques for GPU programs, but we also propose CUDA-specific methods that target parallel CUDA threads. Our main contributions are as follows:</p><p>• We propose kernel-aware loop perforation by adapting the loop perforation technique for CUDA programs. To reduce the synchronization overhead, we propose partition-level synchronization and block-level atomic operations based on CUDA cooperative groups and CUDA thread-scope atomic functions. Additionally, we eliminate warp divergence inside CUDA kernel functions to prevent the serial execution caused by branch instructions. • We modify the target codes by inserting compiler directives enabling our techniques, and generate our approximate versions based on the given compiler options.</p><p>• We perform an experimental study to evaluate the impact of the modifications introduced by our approximations.</p><p>Our experimental study includes applications from different domains to observe the performance and accuracy variations for the target executions. 
Our experimental results reveal that CUDA-aware approximation techniques offer significant performance improvements at the expense of acceptable accuracy loss.</p><p>The remainder of this paper is organized as follows: Section 2 presents background on approximate computing and the CUDA programming model. We explain our approximation methods in Section 3. Then, the experimental results are outlined in Section 4. Section 5 presents relevant studies on CUDA approximations. Finally, in Section 6, we summarize the work with some concluding remarks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Background and Motivation</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Approximate Computing</head><p>Approximate computing introduces acceptable inaccuracies into the computing process and promises significant performance and energy gains. Some techniques employ the loop perforation approach, which works by skipping some loop iterations to reduce computational overhead <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b12">13]</ref>. Relaxed synchronization shortens the waiting time of threads that wait for the completion of other threads' work <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b14">15]</ref>. Reduced precision computation employs fewer compute cycles with insignificant value modifications for error-tolerant applications targeting low-precision executions <ref type="bibr" target="#b15">[16]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">GPU Programming Model</head><p>While modern GPU architectures evolve as the requirements of the target applications change, GPU devices employ SIMT (Single Instruction Multiple Threads) execution to accelerate data processing tasks in their parallel execution cores.</p><p>A program written in the CUDA programming model <ref type="bibr" target="#b16">[17]</ref>, a parallel programming model for NVIDIA GPU devices, starts its execution on a CPU, allocates memory space on the GPU, transfers data into GPU global memory, and starts a kernel function execution by creating thousands of threads. Each thread executes the same program (SIMT) by processing a different part of the given data. Threads that execute on the GPU are part of a compute kernel specified by a function. Besides data-parallel applications that can benefit from the many parallel execution units of GPUs, large-scale irregular computations utilize the massive degree of parallelism and the high memory bandwidth provided by GPUs. Listing 1 presents the CUDA kernel function and kernel launch configuration for the vector addition operation. For simplicity, we skip the memory allocation and copy operations. The execution launches the vectorAdd function by specifying the number of blocks and the number of threads in each block. The hardware scheduler schedules the blocks onto SM (Streaming Multiprocessor) units and thread groups (warps) onto GPU cores inside SMs. Each thread executes the kernel function and performs the addition operation based on its global thread identifier.</p></div>
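The body of Listing 1 did not survive extraction; a minimal sketch of a vector-addition kernel and its launch is given below. The `vectorAdd` name is taken from the text, but the parameters, the 256-thread block size, and the host helper are illustrative assumptions, not the paper's exact code.

```cuda
// Hedged reconstruction of a vector-addition kernel (Listing 1 names
// vectorAdd; the signature here is our assumption).
__global__ void vectorAdd(const float *a, const float *b, float *c, int n) {
    // Global thread identifier: block offset plus in-block thread offset.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                  // guard threads that fall past the data
        c[i] = a[i] + b[i];
}

// Host-side launch: enough 256-thread blocks to cover all n elements.
// (Memory allocation and host<->device copies are skipped, as in the paper.)
void launchVectorAdd(const float *d_a, const float *d_b, float *d_c, int n) {
    int threadsPerBlock = 256;
    int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<numBlocks, threadsPerBlock>>>(d_a, d_b, d_c, n);
}
```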
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Approximation for CUDA Programs</head><p>We propose three main approximations for target CUDA programs. Firstly, we exploit loop perforation by adapting this popular method to loops and loop-like structures in the CUDA code. Secondly, we replace synchronization primitives with relaxed versions, proposing partition-level synchronization for threads based on CUDA cooperative groups and block-level atomic operations using CUDA thread scopes. Finally, we remove warp divergence, a serialization bottleneck in GPU executions. We evaluate target CUDA source codes and perform code transformations based on compiler directives. While our approach requires manual code analysis and modifications to introduce directives in code segments, the semi-automatic configuration enables us to generate target codes that employ approximations selectively by compiling the code with specific options.</p></div>
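The directive-based configuration described above can be sketched as follows. The directive name matches the LOOP_PERFORATION symbol used later in the paper, but the kernel, the nvcc invocation, and the interpretation of the value as the percentage of iterations retained are our assumptions.

```cuda
// Sketch of the semi-automatic, directive-based workflow: each approximation
// site is guarded by a preprocessor symbol, so an approximate variant is
// generated purely by recompiling, e.g.
//   nvcc -DLOOP_PERFORATION=90 app.cu -o app_approx
// (We assume the value is the percentage of iterations kept; the exact
// encoding of the rate is not spelled out in the text.)
#ifdef LOOP_PERFORATION
  #define PERF_BOUND(n) (((n) * LOOP_PERFORATION) / 100)  // perforated bound
#else
  #define PERF_BOUND(n) (n)                               // exact version
#endif

// Illustrative kernel whose inner loop is perforated when the option is set.
__global__ void sum_kernel(const float *data, float *out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid == 0) {
        float acc = 0.0f;
        for (int i = 0; i < PERF_BOUND(n); i++)  // fewer iterations when enabled
            acc += data[i];
        *out = acc;
    }
}
```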
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Kernel-Aware Loop Perforation</head><p>While loop perforation skips some loop iterations in a serial program, the same technique can be applied to CUDA programs by adapting the perforation accordingly. We consider three approximation versions derived from loop perforation: Kernel launch perforation: We skip iterations of a loop that launches one or more CUDA kernel functions at each iteration. The perforation is simply a regular loop perforation with kernel launches per iteration. In the code segment given in Listing 2 (the Fdtd2d program from the Polybench suite <ref type="bibr" target="#b17">[18]</ref>), we reduce the number of iterations and, consequently, the number of kernel launches by assigning a smaller upper bound (the _PB_TMAX variable in the example code).</p></div>
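A hedged sketch of kernel launch perforation on an Fdtd2d-style time loop. The `_PB_TMAX` bound and the KERNEL_LAUNCH_PERFORATION directive come from the text; the step kernel, its body, the default bound, and the value-as-percentage-kept interpretation are placeholders of our own.

```cuda
#ifndef _PB_TMAX
#define _PB_TMAX 500                  // illustrative default time-step count
#endif

// Kernel launch perforation: shrink the upper bound of the loop that issues
// the kernel launches (we assume the directive value is the % of iterations kept).
#ifdef KERNEL_LAUNCH_PERFORATION
  #define TMAX_APPROX ((_PB_TMAX * KERNEL_LAUNCH_PERFORATION) / 100)
#else
  #define TMAX_APPROX _PB_TMAX
#endif

// Placeholder standing in for Fdtd2d's per-time-step kernels.
__global__ void fdtd_step_kernel(float *field, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) field[i] += 1.0f;      // illustrative field update
}

void run_fdtd2d(float *d_field, int n) {
    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);
    for (int t = 0; t < TMAX_APPROX; t++)   // perforated launch loop
        fdtd_step_kernel<<<grid, block>>>(d_field, n);
}
```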
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Kernel launch configuration perforation:</head><p>In data-parallel CUDA programs, CUDA threads execute implicit loops in parallel by performing the computations that belong to one or more iterations of the serial program loop. We launch target kernel executions with a reduced number of threads in the configuration; hence, the original loop in the serial program is perforated. In the code segment given in Listing 3 (the Convolution2D program from the Polybench suite <ref type="bibr" target="#b17">[18]</ref>), we modify the block or grid configuration parameters of the convolution2D_kernel kernel by reducing the X or Y dimensions of the grid.</p></div>
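Based on the Convolution2D launch shown in Listing 3, grid-level configuration perforation might be applied as sketched below; treating the GRID_PERFORATION value as the percentage of blocks kept is our assumption.

```cuda
// Original Convolution2D launch configuration (from Listing 3), followed by
// an optional grid-dimension reduction. Blocks that are no longer launched
// leave their output elements uncomputed.
dim3 block(DIM_THREAD_BLOCK_X, DIM_THREAD_BLOCK_Y);
dim3 grid((unsigned)ceil(((float)NI) / ((float)block.x)),
          (unsigned)ceil(((float)NJ) / ((float)block.y)));
#ifdef GRID_PERFORATION
// Keep only GRID_PERFORATION percent of the blocks along the Y dimension;
// reducing X instead would be the symmetric variant.
grid.y = (grid.y * GRID_PERFORATION) / 100;
#endif
convolution2D_kernel<<<grid, block>>>(ni, nj, A_gpu, B_gpu);
```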
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Intra-kernel loop perforation:</head><p>We perform the standard loop perforation method on the code inside kernel functions. In the code segment given in Listing 4 (the Covariance program from the Polybench suite <ref type="bibr" target="#b17">[18]</ref>), we reduce the number of loop iterations inside the mean_kernel kernel (the _PB_N variable in the example code).</p><p>We modify each program code by inserting compiler directives for a set of loop perforation types. Specifically, we define four directives: KERNEL_LAUNCH_PERFORATION, GRID_PERFORATION, BLOCK_PERFORATION, and LOOP_PERFORATION, and compile the programs by enabling the directives with specific values, which represent the perforation rate as the reduction ratio of the target loop. By enabling the chosen perforation type(s) and rate(s) at compile time, we evaluate the impacts on the execution.</p></div>
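Applied to the mean_kernel shown in Listing 4, intra-kernel loop perforation could look like the following; the kernel body is taken from the listing, while the directive wiring and the value-as-percentage-kept interpretation are our assumptions.

```cuda
#ifdef LOOP_PERFORATION
  #define PB_N_APPROX ((_PB_N * LOOP_PERFORATION) / 100)  // perforated bound
#else
  #define PB_N_APPROX _PB_N                               // exact bound
#endif

__global__ void mean_kernel(int m, int n, DATA_TYPE *mean, DATA_TYPE *data) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < _PB_M) {
        mean[j] = 0.0;
        for (int i = 0; i < PB_N_APPROX; i++)   // fewer rows summed
            mean[j] += data[i * M + j];
        // The divisor is left unchanged, so perforation biases each mean
        // toward zero; whether to rescale it is an accuracy/design choice.
        mean[j] /= (DATA_TYPE)FLOAT_N;
    }
}
```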
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Relaxed Synchronization</head><p>Multiple CUDA threads require time-consuming synchronization to access shared data or resolve data dependencies, utilizing atomic operations (like atomicAdd) and barrier operations (like __syncthreads()), respectively. Relaxed synchronization offers performance gains by synchronizing fewer threads in exchange for output accuracy loss. We consider two main relaxations based on CUDA cooperative groups and CUDA thread scopes: Partition-level synchronization: CUDA threads within a block can cooperate by synchronizing their execution to coordinate memory accesses. The programmer can define synchronization points by calling the __syncthreads() function, which acts as a barrier at which all threads must wait. While CUDA employs block-level synchronization via the __syncthreads() function, the __syncwarp() function, which synchronizes the threads within a warp, became available in CUDA 9. This is important for porting code to modern GPU architectures since Volta, in which threads within a warp can be scheduled separately. Additionally, the Cooperative Groups API <ref type="bibr" target="#b18">[19]</ref> provides a rich set of thread-synchronization primitives by forming partitions with a set of threads. Listing 5 presents different code snippets to organize groups of threads. While the first group, blockgroup, represents all the threads in a thread block, warpgroup represents all the threads in a warp. If we synchronize the threads in those groups, the behavior is the same as with the __syncthreads() and __syncwarp() functions, respectively.</p><p>To implement the approximation, we first search for all __syncthreads() function calls in the target kernel functions and configure the synchronization level for each synchronization point. Specifically, we either completely skip __syncthreads() (SKIP) or replace it with a relaxed version. 
For relaxing synchronization, we choose __syncwarp() (WARP) or utilize cooperative thread groups (the details are given below). We modify each program code and inject #ifdef directives to guide the compiler based on user preferences. For each __syncthreads() code block, we define one directive and compile the code by specifying one or more directives. For our partition-level approach, we define two thread partitions (as given in Listing 6): 1) 4TILE: cooperative thread groups with four threads in the corresponding warp; 2) ACTIVE: the currently coalesced threads in the warp. When data-dependent conditional branches in the code cause threads within a warp to diverge, the SM disables (deactivates) the threads that do not take the branch. The threads that remain active on the path are referred to as coalesced.</p><p>Block-level atomic operations: While the atomic operations in standard C or C++ are uniform, the CUDA programming model offers atomic functions at different scopes. A thread scope specifies the set of threads that can synchronize with each other using atomic operations. Atomic functions with the _system suffix (e.g., atomicAdd_system) are atomic at system scope, where the system refers to a system running on multiple GPUs and CPUs. Atomic functions without a suffix (e.g., atomicAdd) are atomic at device scope, where the device refers to the target GPU device. Atomic functions with the _block suffix (e.g., atomicAdd_block) are atomic at thread block scope, which refers to the synchronization of the threads executing in the same thread block.</p><p>With block-level atomic operations, we restrict atomic operations to at most thread block scope. As in our synchronization approach, we search for all atomic functions in the target CUDA code and reduce the atomic scope accordingly. For instance, we replace atomicAdd function calls with atomicAdd_block, or we completely remove the function call. 
Hence, atomic operations involve fewer threads than in the original code. Similarly, we could replace atomicAdd_system calls or skip atomicAdd_block calls.</p></div>
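A combined sketch of the relaxations in this section. The SKIP/WARP/4TILE/ACTIVE and SKIP/BLOCK labels mirror the text, but the exact directive spellings (SYNC_1_SKIP, SKIP_ATOMICADD_1, ...) and the per-site wrapper functions are our illustrative assumptions.

```cuda
#include <cooperative_groups.h>
using namespace cooperative_groups;

// One guarded synchronization site: the compile-time option selects the
// original barrier or one of the relaxed variants described above.
__device__ void sync_site_1() {
#if defined(SYNC_1_SKIP)
    // SKIP: no synchronization at all.
#elif defined(SYNC_1_WARP)
    __syncwarp();                     // WARP: threads of the current warp only
#elif defined(SYNC_1_4TILE)
    thread_group tile32 = tiled_partition(this_thread_block(), 32);
    thread_group tile4  = tiled_partition(tile32, 4);
    tile4.sync();                     // 4TILE: four-thread cooperative tile
#elif defined(SYNC_1_ACTIVE)
    coalesced_threads().sync();       // ACTIVE: currently coalesced threads
#else
    __syncthreads();                  // original block-wide barrier
#endif
}

// One guarded atomic site: skip atomicity entirely, narrow it to the
// thread block, or keep the original device-scope atomic.
__device__ void add_site_1(float *addr, float v) {
#if defined(SKIP_ATOMICADD_1)
    *addr += v;                       // non-atomic: lost updates are tolerated
#elif defined(BLOCK_ATOMICADD_1)
    atomicAdd_block(addr, v);         // atomic within the thread block only
#else
    atomicAdd(addr, v);               // original device-scope atomic
#endif
}
```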
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Warp Divergence Elimination</head><p>CUDA threads are executed in groups of 32 threads (warps), and all threads in a warp execute the same instruction at the same time. Due to this lockstep execution model, when threads in the same warp need to perform different operations, the execution of the different branches is serialized, hurting the performance gains expected from parallelism. Figure <ref type="figure" target="#fig_3">1</ref> illustrates this divergence. To eliminate the divergence overhead, as an approximation method, we execute only one path when a warp contains multiple paths. For instance, for the code given in Listing 7 (from the Grappolo application <ref type="bibr" target="#b20">[21]</ref>), we configure the execution to take Path 1, Path 2, or Path 3. Alternatively, we completely skip the divergent code segment in our evaluations. </p></div>
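The compile-time path selection can be sketched as below; since Listing 7's content is not reproduced here, the branch condition, the path bodies, and the directive names are placeholders of our own.

```cuda
// Forward declarations of placeholder per-path computations.
__device__ int path1(int v);
__device__ int path2(int v);
__device__ int path3(int v);

__global__ void divergent_kernel(int *state, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
#if defined(EXEC_PATH_1)
    state[tid] = path1(state[tid]);   // every thread forced down Path 1
#elif defined(EXEC_PATH_2)
    state[tid] = path2(state[tid]);   // every thread forced down Path 2
#elif defined(EXEC_PATH_3)
    state[tid] = path3(state[tid]);   // every thread forced down Path 3
#elif defined(SKIP_DIVERGENT)
    // Divergent segment removed entirely; state keeps its previous value.
#else
    // Original divergent code: a warp whose threads take different paths
    // serializes the three branch bodies.
    if (state[tid] > 0)      state[tid] = path1(state[tid]);
    else if (state[tid] < 0) state[tid] = path2(state[tid]);
    else                     state[tid] = path3(state[tid]);
#endif
}
```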
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experimental Study</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Experimental Setup</head><p>To evaluate our approximation methods, we select CUDA applications from Polybench <ref type="bibr" target="#b17">[18]</ref> and Gardenia <ref type="bibr" target="#b21">[22]</ref> benchmark suites and utilize an optimized CUDA implementation of the Louvain community detection algorithm, namely Grappolo <ref type="bibr" target="#b20">[21]</ref>. While Polybench applications mostly employ data-parallel computations with multiple loop structures, Gardenia implements a set of graph algorithms that include synchronization primitives. Grappolo, with computationally intensive and complex structures, includes code segments for our evaluations on relaxed synchronization and warp divergence-based approximations.</p><p>We compile our programs with CUDA 12.1 <ref type="bibr" target="#b22">[23]</ref> and run our approximation experiments in a system with an NVIDIA GeForce RTX 3050 Ti Mobile GPU device. The GPU device, built on Ampere architecture <ref type="bibr" target="#b23">[24]</ref>, has 4 GB GDDR6 memory.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Experimental Results</head><p>We evaluate our three main approximations for target CUDA programs separately. We execute both original and approximated versions, measure GPU execution times, and collect result outputs. By comparing execution time and output accuracies, we perform a tradeoff analysis for target computations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.1.">Kernel-Aware Loop Perforation</head><p>For our loop perforation techniques, we select six programs from the Polybench benchmark suite <ref type="bibr" target="#b17">[18]</ref>. The programs have data-parallel characteristics, and each employs different loop structures. We execute Correlation, Covariance, Syrk, and Fdtd2d with STANDARD input sizes, and Jacobi-2D and 2DConv with LARGE input sizes to obtain longer execution times. We collect GPU execution times and count incorrect computations by comparing the outputs with the original ones. Since the programs work with array structures and compute array elements as the final output, we evaluate the number of array elements that are computed incorrectly.</p><p>Figure <ref type="figure" target="#fig_5">2</ref> demonstrates performance improvement and inaccuracy values for the programs when our loop-perforation methods are applied. For each applicable method, namely kernel launch perforation, intra-kernel loop perforation, grid-level kernel launch configuration perforation, and block-level kernel launch configuration perforation, we apply perforation rates of 90% and 80%. If a program does not support the target approximation (e.g., Correlation does not have a kernel launch inside a loop), we simply do not have the corresponding result in our evaluation. The values in Figure <ref type="figure" target="#fig_5">2</ref> present 1/Speedup and the rate of incorrectly computed elements. We define performance in terms of speedup, the ratio of the compute time for the original execution to the time for the approximate execution, and report the 1/Speedup values in our results. For instance, the execution time for the original Correlation execution is 1.785 milliseconds, and it computes 4194304 array elements. When we perforate the kernel function loops by 90% (Loop (90%)), we have 1.302 milliseconds and 793356 incorrect computations. 
Therefore, the 1/Speedup value equals 1.302/1.785 = 0.73, and the rate of incorrect computations is 793356/4194304 = 0.19, as shown in Figure <ref type="figure" target="#fig_5">2</ref>. By reporting performance improvement and inaccuracy values in this way, one can evaluate the performance gains and incorrect results for each approximation and make design decisions. Based on the program characteristics, each approximation affects the execution outcome differently. We can obtain up to 60% performance improvement (Loop (80%) for Correlation) in exchange for 40% of the elements being computed incorrectly. Some approximations offer good tradeoff points, like the grid-level kernel launch configuration perforations (Grid (90%) and Grid (80%)) in Fdtd2d: we can obtain 20% and 30% performance improvements by losing 30% and 50% of the correct computations, respectively. On the other hand, some cases show no performance improvement despite small inaccuracy values (like the kernel launch configuration perforations in Covariance) or intolerable output loss despite improved execution times (like the loop perforations in Syrk).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.2.">Relaxed Synchronization</head><p>We evaluate Betweenness Centrality (bc) from the Gardenia benchmark suite <ref type="bibr" target="#b21">[22]</ref>, which has four different implementations. For a sample graph (soc-LiveJournal1 <ref type="bibr" target="#b24">[25]</ref>), we execute each version and select the one with the lowest execution time. Since that version already employs optimizations, we apply our approximation methods to it for a fair comparison. The implementation (i.e., bc_topo_lb) has four main kernel functions with synchronization primitives (i.e., __syncthreads()). We apply our relaxed synchronization techniques to each of the seven __syncthreads() function calls in the target kernel functions and perform four specific modifications: 1) SKIP: remove __syncthreads(); 2) WARP: synchronize the threads in the same warp; 3) 4TILE: synchronize four threads in the same cooperative group; 4) ACTIVE: synchronize the coalesced threads. Finally, we have 28 different versions. We execute those versions with 19 different datasets. We observe execution time and output differences for only a subset of our executions, specifically, relaxations for three __syncthreads() function calls in only one kernel function. For the three synchronization points, we also consider relaxing their combinations.</p><p>Table <ref type="table" target="#tab_2">1</ref> presents the execution times and the number of incorrect computations in the observed output for the specified graphs. We can observe that SKIP, WARP, and ACTIVE mostly outperform 4TILE, probably due to the overhead of fine-grained group creation. Relaxing individual synchronization points (i.e., SYNC 1, SYNC 2, SYNC 3) offers significant performance gains with insignificant accuracy loss. Since bc has no atomic operations, we consider another application to observe the impact of our approximation techniques for atomic operations. 
We utilize the Grappolo code <ref type="bibr" target="#b20">[21]</ref>, a highly optimized CUDA implementation of the Louvain community detection algorithm <ref type="bibr" target="#b25">[26]</ref>. Louvain is a greedy graph processing method that assigns each vertex to a community, maximizing the overall Modularity, and generates a new graph in which the communities become new vertices. Since the output metric, Modularity, does not present an exact result, trading output accuracy for performance improvement is an interesting evaluation for this execution. The assessment of the Modularity metric depends on the application domain utilizing the community detection; however, a Modularity value close to 1 indicates higher-quality output. While Grappolo employs both synchronization and atomic operation primitives, we evaluate only the atomic operations and apply our relaxation methods. Specifically, for atomicAdd and atomicCAS function calls, we replace them with either the non-atomic operation or the block-level atomic function calls (i.e., atomicAdd_block or atomicCAS_block).</p><p>Figure <ref type="figure" target="#fig_6">3</ref> presents execution time and modularity values as the performance and accuracy metrics, respectively. Besides individual atomic operations, we relax combinations of the atomic operations to see the impact on the outcome. In our target program, we have five atomicAdd and three atomicCAS function calls. We label the relaxations by considering the index and type of the method. Specifically, we use SKIP or BLOCK as the prefix and the order of the corresponding function as the suffix. For instance, SKIP_ATOMICADD_1 replaces the first atomicAdd with the non-atomic operation; BLOCK_ATOMICADD_3 replaces the third atomicAdd with atomicAdd_block. For the combined relaxations, we concatenate the index of each operation, such as SKIP_ATOMICADD_1_2. We select only a subset of the combinations since it is not practical to execute all of them. 
While we conduct experiments for 19 datasets, we include the five of them that present the most interesting design points. All five graphs demonstrate significant performance improvements with small modularity losses. Depending on the modularity requirements of the target domain utilizing the community detection, one can easily prefer the approximated versions. The executions that have large modularity values in the original version exhibit significant performance improvements without hurting the modularity very much. Notably, the SKIP_ATOMICADD_1_2 version promises up to 3x performance gains with only 0.01 modularity loss.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.3.">Warp Divergence Elimination</head><p>We utilize Grappolo for our warp divergence elimination method due to its complex structure, which employs branch instructions inside kernel functions. We work with two kernel functions and perform different divergence eliminations. Firstly, we execute only one path out of three, but the execution does not terminate (it enters an infinite loop) for any path choice. We then apply a different strategy by eliminating the code in the target path executions and returning the previously computed value with no computation. With this method, the execution completes in a shorter time with lower Modularity values.</p><p>Table <ref type="table" target="#tab_3">2</ref> presents execution time and modularity values for the original execution and our approximate version. While we see a decrease in all execution times, the approximation also destroys the modularity values, with one exception, namely the wb-edu dataset. Since this approximation completely eliminates some code segments, accuracy loss becomes inevitable in most cases, and it requires a more rigorous analysis of the target code.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Related Work</head><p>SAGE <ref type="bibr" target="#b6">[7]</ref> presents a static compiler that generates a set of approximated CUDA kernels and a runtime system that employs selective discarding of atomic operations, data packing, and thread fusion optimizations. It yields 2.5× speedup with less than 10% quality loss for machine learning and image processing kernels. While SAGE proposes approximations for CUDA computations and significantly improves performance, it relies on generic approximation methods instead of CUDA-specific techniques.</p><p>Freytag et al. <ref type="bibr" target="#b26">[27]</ref> propose efficient executions for scientific simulation applications by building multiple kernel implementations with different precision levels. They execute approximated kernel versions by switching from one version to another at runtime based on Target Output Quality (TOQ) scenarios. By employing execution configurations based on an analysis of the accuracy loss, the experiments reveal high-performance and energy-efficient executions for target precision levels. While the authors build application-layer approximations, they modify the precision levels of the target code without introducing CUDA-specific methods. Liu et al. <ref type="bibr" target="#b27">[28]</ref> present cuSpAMM, a CUDA adaptation of the Sparse Approximate Matrix Multiply algorithm, utilizing thread parallelism, memory tiling, and the tensor cores of multiple GPU devices. While the proposed work implements an approximation algorithm by considering GPU-specific features, the implementation relies on GPU optimization techniques rather than approximation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusions and Future Work</head><p>In this work, we propose CUDA-specific approximation methods based on loop perforation, relaxed synchronization, and warp divergence elimination. We define the approximations as compiler directives and enable them for target executions. Our experimental results demonstrate that our approximation techniques promise good performance improvements without hurting output accuracy significantly.</p><p>Our approximations are enabled by compiler directives. While the directives offer some level of automation, we can extend our work by building a fully automated tool that performs source-to-source compiler transformations, so that we can easily generate our approximated versions. Moreover, a design space exploration technique could help choose the best design points considering performance improvements and inaccuracy values.</p><p>While approximate computing offers performance improvements, it is essential to evaluate the power consumption of the target execution. We can extend our work by including energy measurements for GPU devices and including that criterion in our resource-aware evaluation.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Listing 2 :Listing 3 :Listing 4 :</head><label>234</label><figDesc>The Fdtd2d code with kernel function calls inside a loop. ... dim3 block(DIM_THREAD_BLOCK_X, DIM_THREAD_BLOCK_Y); dim3 grid(ceil(((float)NI) / ((float)block.x)), ceil(((float)NJ) / ((float)block.y))); convolution2D_kernel &lt;&lt;&lt; grid,block &gt;&gt;&gt; (ni, nj, A_gpu,B_gpu); ... The Convolution2D code kernel launch configuration. ... 
__global__ void mean_kernel(int m, int n, DATA_TYPE *mean, DATA_TYPE *data) {
  int j = blockIdx.x * blockDim.x + threadIdx.x;
  if (j &lt; _PB_M) {
    mean[j] = 0.0;
    int i;
    for (i = 0; i &lt; _PB_N; i++) {
      mean[j] += data[i * M + j];
    }
    mean[j] /= (DATA_TYPE)FLOAT_N;
  }
}
The Covariance code with loop structures inside a kernel function mean_kernel.</figDesc></figure>
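The mean_kernel above reduces over all _PB_N rows for each column. A loop-perforated variant of such a reduction can be sketched as follows; the perforation factor PERF_RATE, the kernel name, and the rescaling step are illustrative assumptions, not the paper's exact directive-generated code:

```cuda
// Hedged sketch: perforated version of a column-mean reduction kernel.
// PERF_RATE is a hypothetical perforation factor: visiting only every
// PERF_RATE-th row and rescaling the partial sum approximates the mean
// while cutting the loop's iteration count (and memory traffic) in half.
#define PERF_RATE 2

__global__ void mean_kernel_perforated(int m, int n, DATA_TYPE *mean,
                                       DATA_TYPE *data) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < _PB_M) {
        mean[j] = 0.0;
        // Skip PERF_RATE - 1 out of every PERF_RATE iterations.
        for (int i = 0; i < _PB_N; i += PERF_RATE) {
            mean[j] += data[i * M + j];
        }
        // Compensate for the skipped iterations before normalizing.
        mean[j] = mean[j] * PERF_RATE / (DATA_TYPE)FLOAT_N;
    }
}
```

The rescaling keeps the approximate mean unbiased when the sampled rows are representative of the skipped ones.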
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Listing 5:Listing 6:</head><label>56</label><figDesc>// Cooperative group for the current thread block
auto blockgroup = cooperative_groups::this_thread_block();
// Cooperative group for each warp in the thread block
auto warpgroup = cooperative_groups::tiled_partition&lt;32&gt;(blockgroup);
// Cooperative group for each 16 threads in the thread block
auto subwarp16 = cooperative_groups::tiled_partition&lt;16&gt;(blockgroup);
// Cooperative group for all currently coalesced threads in the warp
auto coalescedgroup = cooperative_groups::coalesced_threads();
// Thread block groups can sync
blockgroup.sync();
CUDA cooperative groups [19].
// Replaced code version 1 (4TILE)
thread_group tile32 = tiled_partition(this_thread_block(), 32);
thread_group tile4 = tiled_partition(tile32, 4);
tile4.sync();
// Replaced code version 2 (ACTIVE)
thread_group active = coalesced_threads();
active.sync();
Partition-level synchronization configurations.</figDesc></figure>
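A kernel that relaxes a block-wide barrier down to the 4TILE configuration above can be sketched as below; the kernel body and names are illustrative, not taken from the evaluated benchmarks:

```cuda
#include <cooperative_groups.h>
using namespace cooperative_groups;

// Hedged sketch: relaxed synchronization via cooperative groups.
// A full-block __syncthreads() is replaced by a 4-thread tile barrier,
// so only the threads inside each tile wait for each other; ordering of
// writes across tiles is no longer guaranteed, which is the source of
// the (bounded) inaccuracy.
__global__ void relaxed_sync_kernel(float *buf) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    buf[tid] = buf[tid] * 2.0f;

    // Original: __syncthreads();   // full thread-block barrier
    thread_group tile32 = tiled_partition(this_thread_block(), 32);
    thread_group tile4  = tiled_partition(tile32, 4);
    tile4.sync();  // relaxed: synchronize only within each 4-thread tile

    // Reads of a neighbor inside the same 4-thread tile remain ordered;
    // reads that cross tile boundaries become approximate.
    buf[tid] += buf[tid ^ 1];
}
```

The narrower barrier is cheaper because it never stalls on threads outside its own tile, which is where the performance gains of the SYNC variants come from.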
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head></head><label></label><figDesc>presents an example scenario for warp divergence. Eight threads (assuming an 8-thread warp size for simplicity) start the kernel execution; at the branch point, an if statement causes different execution paths. While four threads execute the instruction at Path A, the other four continue the execution at Path B. When the first four threads execute Path A, the others must wait and perform no operation. This execution pattern prevents full warp utilization by activating only four threads simultaneously in an 8-thread warp structure.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: SIMD Warp Divergence [20].</figDesc><graphic coords="4,337.46,525.19,157.95,164.23" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Listing 7 :</head><label>7</label><figDesc>SIMD warp divergence code example.</figDesc></figure>
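The code of Listing 7 is not reproduced in this extraction; a representative divergent branch, and its elimination by forcing all lanes down one path as the approximation, might look like this (an illustrative sketch, not the paper's listing):

```cuda
// Hedged sketch of SIMD warp divergence and its approximate elimination.
__global__ void divergent_kernel(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Divergent: even and odd lanes take different paths (Path A / Path B),
    // so each half of the warp idles while the other half executes.
    if ((i & 1) == 0)
        out[i] = in[i] * 2.0f;   // Path A
    else
        out[i] = in[i] + 1.0f;   // Path B
}

__global__ void approx_uniform_kernel(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Approximation: every lane executes Path A, eliminating the divergent
    // branch and keeping all warp lanes active, at the cost of incorrect
    // results on the lanes that should have taken Path B.
    out[i] = in[i] * 2.0f;
}
```

Whether the error introduced on the Path B lanes is acceptable depends on the application's output-quality metric, which is what the inaccuracy measurements in the evaluation quantify.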
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Speedup-Inaccuracy variation for loop perforation approximation methods.</figDesc><graphic coords="6,103.28,191.51,128.21,124.91" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Execution time-Modularity values for relaxing atomic operations in Grappolo.</figDesc><graphic coords="7,157.72,181.68,138.89,108.57" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 1</head><label>1</label><figDesc>Execution time and incorrect computations (out of the given expected correct values) for relaxed synchronization methods for the bc application.</figDesc><table><row><cell></cell><cell></cell><cell>ljournal-2008</cell><cell></cell><cell>socLiveJournal</cell><cell></cell><cell>cage15</cell><cell></cell></row><row><cell></cell><cell></cell><cell>Time</cell><cell>Incorrect</cell><cell>Time</cell><cell>Incorrect</cell><cell>Time</cell><cell>Incorrect</cell></row><row><cell></cell><cell></cell><cell></cell><cell>(5,363,260)</cell><cell></cell><cell>(4,847,571)</cell><cell></cell><cell>(5,154,859)</cell></row><row><cell>ORIGINAL</cell><cell></cell><cell>61.156</cell><cell>0</cell><cell>57.116</cell><cell>0</cell><cell>47.096</cell><cell>0</cell></row><row><cell></cell><cell>SKIP</cell><cell>60.637</cell><cell>954</cell><cell>57.116</cell><cell>276</cell><cell>46.780</cell><cell>28,326</cell></row><row><cell>SYNC 1</cell><cell>WARP</cell><cell>60.199</cell><cell>793</cell><cell>53.611</cell><cell>355</cell><cell>47.109</cell><cell>70,611</cell></row><row><cell></cell><cell>4TILE</cell><cell>60.579</cell><cell>754</cell><cell>57.205</cell><cell>227</cell><cell>46.939</cell><cell>76,002</cell></row><row><cell></cell><cell>ACTIVE</cell><cell>60.205</cell><cell>1113</cell><cell>55.431</cell><cell>292</cell><cell>47.164</cell><cell>71,563</cell></row><row><cell></cell><cell>SKIP</cell><cell>59.770</cell><cell>19,888</cell><cell>52.756</cell><cell>1636</cell><cell>47.425</cell><cell>486,970</cell></row><row><cell>SYNC 2</cell><cell>WARP</cell><cell>59.314</cell><cell>11,230</cell><cell>52.485</cell><cell>1693</cell><cell>47.811</cell><cell>498,751</cell></row><row><cell></cell><cell>4TILE</cell><cell>61.364</cell><cell>14,087</cell><cell>52.300</cell><cell>1522</cell><cell>48.465</cell><cell>303,306</cell></row><row><cell></cell><cell>ACTIVE</cell><cell>59.397</cell><cell>8345</cell><cell>52.532</cell><cell>1647</cell><cell>47.850</cell><cell>463,191</cell></row><row><cell></cell><cell>SKIP</cell><cell>61.328</cell><cell>64</cell><cell>53.996</cell><cell>36</cell><cell>47.198</cell><cell>15,531</cell></row><row><cell>SYNC 3</cell><cell>WARP</cell><cell>60.643</cell><cell>1996</cell><cell>53.774</cell><cell>38</cell><cell>47.843</cell><cell>14,341</cell></row><row><cell></cell><cell>4TILE</cell><cell>61.374</cell><cell>137</cell><cell>53.961</cell><cell>34</cell><cell>47.186</cell><cell>13,383</cell></row><row><cell></cell><cell>ACTIVE</cell><cell>60.744</cell><cell>160</cell><cell>53.872</cell><cell>32</cell><cell>47.775</cell><cell>13,975</cell></row><row><cell></cell><cell>SKIP</cell><cell>56.800</cell><cell>20,137</cell><cell>50.572</cell><cell>1739</cell><cell>46.420</cell><cell>98,691</cell></row><row><cell>SYNC 1+2</cell><cell>WARP</cell><cell>56.377</cell><cell>17,035</cell><cell>50.401</cell><cell>1633</cell><cell>46.879</cell><cell>104,232</cell></row><row><cell></cell><cell>4TILE</cell><cell>57.779</cell><cell>12,311</cell><cell>51.199</cell><cell>1678</cell><cell>47.217</cell><cell>138,951</cell></row><row><cell></cell><cell>ACTIVE</cell><cell>56.195</cell><cell>22,491</cell><cell>50.495</cell><cell>1657</cell><cell>46.908</cell><cell>84,210</cell></row><row><cell></cell><cell>SKIP</cell><cell>57.088</cell><cell>20,346</cell><cell>50.871</cell><cell>1729</cell><cell>46.966</cell><cell>85,029</cell></row><row><cell>SYNC 1+2+3</cell><cell>WARP</cell><cell>56.747</cell><cell>20,719</cell><cell>50.774</cell><cell>1600</cell><cell>47.320</cell><cell>62,026</cell></row><row><cell></cell><cell>4TILE</cell><cell>57.801</cell><cell>14,807</cell><cell>51.513</cell><cell>1649</cell><cell>47.678</cell><cell>59,262</cell></row><row><cell></cell><cell>ACTIVE</cell><cell>56.608</cell><cell>17,248</cell><cell>50.880</cell><cell>1556</cell><cell>47.352</cell><cell>87,300</cell></row><row><cell cols="8">…accuracy loss, the combinations (i.e., SYNC 1+2, SYNC 1+2+3) further improve the performance without hurting the output quality much. While the accuracy loss depends on the target dataset, we see the most promising relaxation options for performance gains (around 8%-10%) with the SYNC 1+2 version.</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 2</head><label>2</label><figDesc>Execution time and Modularity values with warp divergence elimination for Grappolo.</figDesc><table><row><cell></cell><cell>Original</cell><cell></cell><cell>Approx.</cell><cell></cell></row><row><cell>Dataset</cell><cell>Time</cell><cell>Modularity</cell><cell>Time</cell><cell>Modularity</cell></row><row><cell>relat9</cell><cell>1.206</cell><cell>0.491</cell><cell>0.616</cell><cell>0.254</cell></row><row><cell>cage15</cell><cell>1.341</cell><cell>0.893</cell><cell>1.077</cell><cell>0.727</cell></row><row><cell>rel9</cell><cell>1.094</cell><cell>0.458</cell><cell>0.446</cell><cell>0.253</cell></row><row><cell>ljournal</cell><cell>2.034</cell><cell>0.759</cell><cell>1.573</cell><cell>0.588</cell></row><row><cell>rgg23</cell><cell>1.295</cell><cell>0.991</cell><cell>1.206</cell><cell>0.718</cell></row><row><cell>soc-LiveJournal1</cell><cell>2.042</cell><cell>0.753</cell><cell>1.606</cell><cell>0.603</cell></row><row><cell>wb-edu</cell><cell>3.468</cell><cell>0.995</cell><cell>1.621</cell><cell>0.980</cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This work was supported by the Scientific and Technological Research Council of Turkey (TÜBİTAK) under Grant No. 122E395, and partially supported by the CERCIRAS COST Action CA19135 funded by the COST Association.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">A survey of methods for analyzing and improving GPU energy efficiency</title>
		<author>
			<persName><forename type="first">S</forename><surname>Mittal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">S</forename><surname>Vetter</surname></persName>
		</author>
		<idno type="DOI">10.1145/2636342</idno>
		<ptr target="https://doi.org/10.1145/2636342" />
	</analytic>
	<monogr>
		<title level="j">ACM Computing Surveys</title>
		<imprint>
			<biblScope unit="volume">47</biblScope>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">GreenGPU: A holistic approach to energy efficiency in GPU-CPU heterogeneous architectures</title>
		<author>
			<persName><forename type="first">K</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2012 41st International Conference on Parallel Processing</title>
				<imprint>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="48" to="57" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Energy-efficient resource management for federated edge learning with CPU-GPU heterogeneous computing</title>
		<author>
			<persName><forename type="first">Q</forename><surname>Zeng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">K</forename><surname>Leung</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Wireless Communications</title>
		<imprint>
			<biblScope unit="volume">20</biblScope>
			<biblScope unit="page" from="7947" to="7962" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Runtime and energy constrained work scheduling for heterogeneous systems</title>
		<author>
			<persName><forename type="first">V</forename><surname>Raca</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">W</forename><surname>Umboh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Mehofer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Scholz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">The Journal of Supercomputing</title>
		<imprint>
			<biblScope unit="volume">78</biblScope>
			<biblScope unit="page" from="17150" to="17177" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">A survey of techniques for approximate computing</title>
		<author>
			<persName><forename type="first">S</forename><surname>Mittal</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Computing Surveys</title>
		<imprint>
			<biblScope unit="volume">48</biblScope>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Exploiting errors for efficiency: A survey from circuits to applications</title>
		<author>
			<persName><forename type="first">P</forename><surname>Stanley-Marbell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Alaghi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Carbin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Darulova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Dolecek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Gerstlauer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Gillani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Jevdjic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Moreau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Cacciotti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Daglis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">E</forename><surname>Jerger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Falsafi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Misailovic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sampson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Zufferey</surname></persName>
		</author>
		<idno type="DOI">10.1145/3394898</idno>
	</analytic>
	<monogr>
		<title level="j">ACM Computing Surveys</title>
		<imprint>
			<biblScope unit="volume">53</biblScope>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">SAGE: Self-tuning approximation for graphics engines</title>
		<author>
			<persName><forename type="first">M</forename><surname>Samadi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">A</forename><surname>Jamshidi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Hormati</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mahlke</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)</title>
				<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Arga: Approximate reuse for GPGPU acceleration</title>
		<author>
			<persName><forename type="first">D</forename><surname>Peroni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Imani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Nejatollahi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Dutt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Rosing</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">56th ACM/IEEE Design Automation Conference (DAC)</title>
				<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="1" to="6" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Local memory-aware kernel perforation</title>
		<author>
			<persName><forename type="first">D</forename><surname>Maier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Cosenza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Juurlink</surname></persName>
		</author>
		<idno type="DOI">10.1145/3168814</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2018 International Symposium on Code Generation and Optimization, CGO 2018</title>
				<meeting>the 2018 International Symposium on Code Generation and Optimization, CGO 2018<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="278" to="287" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">SFU-driven transparent approximation acceleration on GPUs</title>
		<author>
			<persName><forename type="first">A</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">L</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wijtvliet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Corporaal</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2016 International Conference on Supercomputing, ICS &apos;16</title>
				<meeting>the 2016 International Conference on Supercomputing, ICS &apos;16<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">TruLook: A framework for configurable GPU approximation</title>
		<author>
			<persName><forename type="first">R</forename><surname>Garcia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Asgarinejad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Khaleghi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Rosing</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Imani</surname></persName>
		</author>
		<idno type="DOI">10.23919/DATE51398.2021.9474239</idno>
	</analytic>
	<monogr>
		<title level="m">2021 Design, Automation and Test in Europe Conference and Exhibition</title>
				<meeting><address><addrLine>DATE</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="487" to="490" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">TFApprox: Towards a fast emulation of DNN approximate hardware accelerators on GPU</title>
		<author>
			<persName><forename type="first">F</forename><surname>Vaverka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Mrazek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Vasicek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Sekanina</surname></persName>
		</author>
		<idno type="DOI">10.23919/DATE48585.2020.9116299</idno>
	</analytic>
	<monogr>
		<title level="m">2020 Design, Automation and Test in Europe Conference and Exhibition (DATE)</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="294" to="297" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Model-based loop perforation</title>
		<author>
			<persName><forename type="first">D</forename><surname>Maier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Juurlink</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Euro-Par 2021: Parallel Processing Workshops</title>
				<meeting><address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer International Publishing</publisher>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="549" to="554" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Enabling large scale simulations for particle accelerators</title>
		<author>
			<persName><forename type="first">K</forename><surname>Iliakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Timko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Xydis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Tsapatsaris</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Soudris</surname></persName>
		</author>
		<idno type="DOI">10.1109/TPDS.2022.3192707</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Parallel and Distributed Systems</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="4425" to="4439" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Adaptive relaxed synchronization through the use of supervised learning methods</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">L C</forename><surname>Bueno</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>De La Rocque Rodriguez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">D</forename><surname>Sotelino</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Future Generation Computer Systems</title>
		<imprint>
			<biblScope unit="volume">106</biblScope>
			<biblScope unit="page" from="260" to="269" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Tools for reduced precision computation: A survey</title>
		<author>
			<persName><forename type="first">S</forename><surname>Cherubin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Agosta</surname></persName>
		</author>
		<idno type="DOI">10.1145/3381039</idno>
		<ptr target="https://doi.org/10.1145/3381039" />
	</analytic>
	<monogr>
		<title level="j">ACM Computing Surveys</title>
		<imprint>
			<biblScope unit="volume">53</biblScope>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">Programming Massively Parallel Processors</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">B</forename><surname>Kirk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Mei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Hwu</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2017">2017</date>
			<publisher>Morgan Kaufmann</publisher>
		</imprint>
	</monogr>
	<note>Third Edition</note>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Grauer-Gray</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Searles</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ayalasomayajula</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Cavazos</surname></persName>
		</author>
		<title level="m">Auto-tuning a high-level language targeted to GPU codes</title>
				<meeting><address><addrLine>InPar</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
	<note>Innovative Parallel Computing</note>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Harris</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Perelygin</surname></persName>
		</author>
		<ptr target="https://developer.nvidia.com/blog/cooperative-groups/" />
		<title level="m">Cooperative groups: Flexible CUDA thread programming</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">M</forename><surname>Aamodt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">W L</forename><surname>Fung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">G</forename><surname>Rogers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Martonosi</surname></persName>
		</author>
		<title level="m">General-Purpose Graphics Processor Architecture</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Halappanavar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ghosh</surname></persName>
		</author>
		<ptr target="https://github.com/ECP-ExaGraph/grappolo" />
		<title level="m">Grappolo community detection</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Gardenia: A graph processing benchmark suite for next-generation accelerators</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Yang</surname></persName>
		</author>
		<idno type="DOI">10.1145/3283450</idno>
		<ptr target="https://doi.org/10.1145/3283450" />
	</analytic>
	<monogr>
		<title level="j">ACM Journal on Emerging Technologies in Computing Systems</title>
		<imprint>
			<biblScope unit="volume">15</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<ptr target="https://developer.nvidia.com/cuda-12-1-0-download-archive" />
		<title level="m">CUDA toolkit 12</title>
				<imprint>
			<date type="published" when="2024">2024</date>
			<biblScope unit="volume">1</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<ptr target="https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2" />
		<title level="m">NVIDIA Ampere GA102 GPU architecture whitepaper</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Leskovec</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Krevl</surname></persName>
		</author>
		<ptr target="http://snap.stanford.edu/data" />
		<title level="m">SNAP Datasets: Stanford large network dataset collection</title>
				<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Fast unfolding of communities in large networks</title>
		<author>
			<persName><forename type="first">V</forename><surname>Blondel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Guillaume</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Lambiotte</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Lefebvre</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Statistical Mechanics: Theory and Experiment</title>
		<imprint>
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Interleaved execution of approximated CUDA kernels in iterative applications</title>
		<author>
			<persName><forename type="first">G</forename><surname>Freytag</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">A</forename><surname>Künas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rech</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">O A</forename><surname>Navaux</surname></persName>
		</author>
		<idno type="DOI">10.1109/PDP62718.2024.00017</idno>
	</analytic>
	<monogr>
		<title level="m">Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)</title>
				<imprint>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="60" to="67" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Accelerating approximate matrix multiplication for near-sparse matrices on GPUs</title>
		<author>
			<persName><forename type="first">X</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Yin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Luan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Qian</surname></persName>
		</author>
		<idno type="DOI">10.1007/s11227-022-04334-5</idno>
		<ptr target="https://doi.org/10.1007/s11227-022-04334-5" />
	</analytic>
	<monogr>
		<title level="j">Journal of Supercomputing</title>
		<imprint>
			<biblScope unit="volume">78</biblScope>
			<biblScope unit="page" from="11464" to="11491" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
