<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluating Performance and Reliability of Selective Redundant Multithreading for GPGPU Applications</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ercüment Kaya</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ömer Faruk Karadaş</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Işil Öz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Engineering Department, Izmir Institute of Technology</institution>
          ,
          <addr-line>Izmir</addr-line>
          ,
          <country country="TR">Turkey</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Electrical Electronics Engineering Department, Izmir Institute of Technology</institution>
          ,
          <addr-line>Izmir</addr-line>
          ,
          <country country="TR">Turkey</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>With the widespread use of GPU architectures in general-purpose computations, evaluating the soft error vulnerability of GPGPU programs and employing efficient fault tolerance techniques for more reliable execution become more prominent. Performing full redundancy, based on the redundant execution of the complete program, results in resource consumption and performance loss as well as energy inefficiency. Therefore, determining the most error-prone regions of the target program code and replicating only those parts maintains both high performance and acceptable error rates. In this study, we propose a partial redundant multithreading mechanism based on the soft error vulnerability of GPGPU applications and perform a trade-off analysis between performance and reliability. Firstly, we perform fault injection experiments to evaluate the SDC rates for each kernel function. Then, based on the outcome of the fault injection experiments, we determine the kernel function to be replicated. According to the pragmas denoting the redundancy points in the source code, our custom LLVM pass generates the code that enables the redundant execution for the specified code region. We evaluate both the reliability and performance of the redundant execution scenarios by measuring the execution time of the redundant program generated by our compiler-managed redundancy technique. Our results demonstrate that protecting only the most vulnerable kernel functions enables high reliability without hurting the performance significantly.</p>
      </abstract>
      <kwd-group>
        <kwd>Soft error reliability</kwd>
        <kwd>Fault injection</kwd>
        <kwd>Redundant execution</kwd>
        <kwd>GPGPU programs</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Heterogeneous computing systems offer high performance and lower energy consumption by
combining a wide range of device structures and configurations. Building heterogeneous
systems by bringing together general-purpose multi-core processors (CPUs) and data-parallel
graphics processing units (GPUs) enables efficient computation, in terms of both performance and energy
consumption, in large-scale computing platforms. Recently, the high computation power of
the GPU architectures has been largely utilized for general-purpose computations as well as
graphics applications [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Therefore, the soft error vulnerability becomes a more significant
design concern for the target general-purpose programs than the inherently error-tolerant
graphics computations.
      </p>
      <p>
        To deal with the effects of hardware errors on the target programs, fault tolerance
techniques can be employed by executing the code redundantly. Compiler-level redundancy,
employed for both CPU and GPU applications, is one of the widely used approaches in the
reliability-aware computing domain. The traditional approach, SWIFT [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], enables
instruction-level redundancy, providing high-level protection. Moreover, nZDC [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] targets near-zero
silent data corruption by removing the limitations of SWIFT, such as incompletely
protected branches. Wadden et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] propose a compiler-level redundant multithreading
technique for OpenCL-based GPU applications. The proposed method duplicates the work-groups
instead of the functions or the instructions, thereby eliminating the overheads of instruction-based
replication. Our work targets CUDA-based programs and kernel function replication.
      </p>
      <p>
        While the redundant approaches achieve high fault coverage, performing full redundancy,
based on the redundant execution of the complete program, results in resource consumption
and performance loss as well as energy inefficiency. Therefore, determining the most
error-prone regions of the target program code and replicating only those parts maintains both high
performance and acceptable error rates. Bohman et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] present compiler-assisted software
fault tolerance (COAST) for microcontrollers, which are widely used in task-based programming
and/or in extraordinary environments. While COAST uses selective replication based on
code annotations, its instruction-based replication strategy increases the code size significantly
and causes a memory bottleneck. To eliminate this overhead, we target the replication of
functions instead of fine-grained instructions. The recent work, ArmorALL [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], presents a
selective compiler-level solution to protect GPUs against soft errors. ArmorALL provides three
compiler-based redundancy schemes, including Address Armor, Value Armor, and Hybrid Armor.
While Address Armor protects only the addresses used by memory instructions, Value Armor
protects the values by duplicating all instructions that participate in the value computation.
Hybrid Armor protects both of them. Even though ArmorALL uses selective redundancy
based on its redundancy schemes, it does not allow the programmer to define the specific code
regions for the redundant execution. To reduce the overhead and give more control to the
programmer, our work uses selective redundancy based on the annotations. Moreover, Yang
et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] recently propose a selective replication scheme by remapping threads with the same
vulnerability behavior into the same warps. They evaluate individual thread vulnerabilities
and combine reliable and unreliable threads by placing similar ones into the same warps. Their
technique relies on the fact that having reliable and unreliable threads in the same warp can hurt
the performance. The replication of an entire warp that includes only unreliable threads is
more efficient than replicating the individual threads in a mixed warp. In our work, we evaluate
kernel-level soft error vulnerabilities instead of thread-level ones, which is more cost-efficient. While
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] requires architectural modifications to enable thread remapping, our work offers compiler
support, and the programmer directly controls the replication by utilizing the annotations.
      </p>
      <p>In this study, we propose a partial redundant multithreading mechanism based on the soft
error vulnerability of GPGPU applications and evaluate the performance and reliability of the
target executions. Firstly, in our fault injection framework, we evaluate the most vulnerable
code regions of the target applications by considering kernel functions. Then, based on the
outcome of the fault injection experiments, we determine the kernel function to be replicated.
According to the pragmas denoting the redundancy points in the source code, our compiler
generates the code that enables the redundant execution for the specified code region.</p>
      <p>
        Our debugger-based regional fault injection tool [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] generates fault injection points for each
kernel function based on the information gathered during the profiling phase. It evaluates the
silent data corruption (SDC) rates to get the soft error vulnerability of each kernel function in
the target GPGPU program. Additionally, our LLVM-based compiler framework generates the
target executable including redundant code sections for the specified kernel functions marked
by the programmer, who utilizes the feedback from our fault injection analysis. We perform an
experimental study to reveal the efficiency of our approach for a set of GPGPU applications. Our
fault injection experiments demonstrate that the code regions inside GPGPU programs exhibit
different characteristics in terms of soft error vulnerability, which motivates partial redundancy
to balance performance and reliability. Based on the recommendations of our soft error
vulnerability analysis, we perform redundant executions that replicate only the most vulnerable
parts of the target programs. We evaluate both the reliability and performance of the redundant
execution scenarios by conducting fault injection experiments and measuring the execution
time of the redundant program generated by our compiler-managed redundancy technique.
Our results demonstrate that protecting only the most vulnerable kernel functions enables high
reliability without hurting the performance significantly.
      </p>
      <p>The remainder of this paper is organized as follows: Section 2 presents some background
on soft error reliability and redundant multithreading. We explain our selective redundant
multithreading methodology in Section 3. Then the experimental results are outlined in Section
4. Finally, in Section 5, we summarize the work with some concluding remarks.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Motivation</title>
      <sec id="sec-2-1">
        <title>2.1. Soft Error Reliability in GPGPUs</title>
        <p>
          Soft errors, resulting from single-bit flips in computer system structures, are caused by alpha
particles, cosmic rays, thermal neutrons, or other environmental causes [
          <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
          ]. If a soft error
hits a register or a memory location, it may cause data corruption or program crash when
consumed by the application. While the fault may be masked because it is ignored by the program
or corrected by an error correction mechanism, it might affect the program outcome by means
of silent data corruption (SDC) or detected unrecoverable error (DUE). Silent data corruption is
the most critical one since it produces incorrect program outputs while the program seems to
end successfully.
        </p>
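        <p>The fault outcomes described above can be illustrated with a minimal host-side Python sketch. It is not part of our tool: the names Outcome, flip_bit, classify, and toy_program are hypothetical, and the toy program is an assumption chosen only so that some bit flips are masked while others corrupt the output.</p>
        <preformat><![CDATA[
```python
from enum import Enum


class Outcome(Enum):
    MASKED = "masked"  # output matches the fault-free run
    SDC = "sdc"        # run completes, but the output is silently wrong
    DUE = "due"        # run crashes or hangs, so the error is detected


def flip_bit(word: int, bit: int) -> int:
    """Invert one bit of a 32-bit word, modeling a single-bit soft error."""
    if not 0 <= bit < 32:
        raise ValueError("bit index must be in the range 0..31")
    return (word ^ (1 << bit)) & 0xFFFFFFFF


def classify(golden: int, observed: int, crashed: bool) -> Outcome:
    """Compare a faulty run against the fault-free (golden) output."""
    if crashed:
        return Outcome.DUE
    return Outcome.MASKED if observed == golden else Outcome.SDC


def toy_program(register: int) -> int:
    """Hypothetical program: only the low byte of the register reaches the output."""
    return register & 0xFF


golden = toy_program(0x12345678)
# A flip in bit 31 never reaches the output, a flip in bit 0 does.
high_bit = classify(golden, toy_program(flip_bit(0x12345678, 31)), crashed=False)
low_bit = classify(golden, toy_program(flip_bit(0x12345678, 0)), crashed=False)
```
]]></preformat>
        <p>In this toy setting, the flip in the unused high bit is masked, whereas the flip in the low bit changes the output without any visible failure, which is the SDC case.</p>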
        <p>As GPUs are increasingly being utilized for the acceleration of general-purpose
computations, their soft error resilience is becoming more critical than before, when they were used
only for graphics and considered inherently fault-tolerant. Hence, in recent years, the soft error
problem for GPU systems has become a first-class design challenge. Consequently,
we focus on soft error evaluation of GPU systems in this study.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Redundant Multithreading</title>
        <p>
          To deal with soft errors in computer systems, redundant multithreading (RMT) approaches
have been implemented at various levels [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. Based on the replication of both the code and
data for the target execution, RMT enables the execution of two redundant copies of the
program. The key concept in redundant multithreading is the set of components included in the
redundant execution, namely the sphere of replication (SOR). The SOR identifies the components
to be replicated, while the components outside the SOR must be protected by some other fault
tolerance mechanism.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>Our selective redundant multithreading framework consists of three main components, as
shown in Figure 1:
        <list list-type="bullet">
          <list-item><p>Regional soft error reliability evaluation: We perform fault injection experiments for
each kernel function and use the resulting SDC rates to determine the most vulnerable
kernel function (Section 3.1).</p></list-item>
          <list-item><p>Code annotation: After selecting the kernel function to execute redundantly, we insert
the #pragma directive that notifies the compiler about the redundant execution and
provides additional information to the compiler. Our custom compiler supports this
directive and enables redundant execution for the given function with the essential
parameters.</p></list-item>
          <list-item><p>Redundant code generation and execution: After getting the target kernel function
from the #pragma directive, our compiler framework generates the code for redundant
execution by replicating both the kernel function calls and the output data to be generated by the
redundant copies. Moreover, it adds the majority voting function to obtain the corrected
result in case of any error.</p></list-item>
        </list>
      </p>
      <sec id="sec-3-1">
        <title>3.1. Regional Soft Error Reliability Evaluation</title>
        <p>
          We utilize the fault injection tool that enables regional vulnerability analysis for GPGPU
programs [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. The debugger-based fault injector targets the execution of the specified kernel functions
as fault injection points and enables us to evaluate the fault rates for the target kernel
functions. By utilizing its Configuration interface, we specify the kernel functions that we target
for the fault injection. Then, the Profiling phase collects information about the target
code, and the Fault Generation phase determines the specific instructions in the given function and
the target register bits. Finally, the Fault Injection phase flips the specified register bit during the
execution of the specified instruction selected in the fault generation phase. We perform fault
injection experiments for each kernel function in the given GPGPU program and obtain silent
data corruption (SDC) rates to quantify the soft error vulnerabilities. In this way, we can get
the vulnerability values for each kernel function and compare the fault characteristics to identify
the kernel function with the largest vulnerability for the redundant execution.
        </p>
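        <p>The selection step at the end of this procedure can be sketched in Python as follows. This is an illustrative sketch rather than the code of our fault injection tool: the function names, the kernel names, and the outcome counts are all hypothetical, and the outcome labels follow the Masked, Crash, and SDC categories reported by the experiments.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def sdc_rate(outcomes):
    """Fraction of fault injection runs that ended in silent data corruption."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one fault injection run is required")
    return counts["sdc"] / total


def most_vulnerable(campaign):
    """Return the kernel with the highest SDC rate and the per-kernel rates."""
    rates = {kernel: sdc_rate(outcomes) for kernel, outcomes in campaign.items()}
    return max(rates, key=rates.get), rates


# Hypothetical campaign: one outcome label per injected fault.
campaign = {
    "kernel_a": ["masked"] * 70 + ["sdc"] * 20 + ["crash"] * 10,
    "kernel_b": ["masked"] * 40 + ["sdc"] * 50 + ["crash"] * 10,
}
selected, rates = most_vulnerable(campaign)
```
]]></preformat>
        <p>With these made-up counts, kernel_b is selected because half of its injected faults end in SDC, compared to one fifth for kernel_a.</p>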
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Compiler-Level Redundant Execution</title>
        <p>
          We build our compiler-based redundant execution framework on top of the LLVM framework [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
As shown in Figure 2, LLVM consists of three major components: Front-end, Optimizer, and
Back-end.
        </p>
        <p>The Front-end component is language dependent: it takes the source code as an input and
generates the LLVM IR code. The Back-end component is architecture dependent: it takes the
LLVM IR code as an input and generates the machine code. The Optimizer component takes
LLVM IR code and generates the optimized LLVM IR code. Our general compilation flow, based
on the LLVM framework and shown in Figure 3, consists of three parts: 1) Generating the LLVM IR
code, 2) Generating the new LLVM IR code for both the host and the device using the custom
LLVM pass, 3) Generating the executable by using the generated LLVM IR codes.</p>
        <p>Our compiler-level RMT scheme replicates the kernel functions specified by our directive
and the output data, implements the majority function, and enables the redundant execution of
the code marked by the programmer. There are four major components in our implementation:
1) Compiler directive, 2) Output replication, 3) Kernel function replication, 4) Majority voting
implementation. We perform both host and device code modifications in the compilation phase
given in Figure 3. While the implementation of the Output Replication and the Kernel Function
Call Replication components requires modifications in the host code, we update both the host
code and the device code for the Majority Voting. Listing 3 and Listing 4 present an example
code with our annotation and the target code to be generated by our compiler, respectively.
In the following part, we will explain our implementation details by presenting them in the
example code snippet.</p>
        <preformat>void mm3Cuda(int ni,
             ... // Other parameters
            )
{
  DATA_TYPE *A_gpu;
  ... // Other declarations

  cudaMalloc((void **)&amp;A_gpu, sizeof(DATA_TYPE) * NI * NK);
  ... // Other allocations
  cudaMemcpy(A_gpu, A, sizeof(DATA_TYPE) * NI * NK,
             cudaMemcpyHostToDevice);
  ... // Other memory transfers
}</preformat>
        <p>Listing 1: Annotated CUDA code</p>
        <preformat>%call42 = call i32 @cudaConfigureCall(i64 %120, i32 %122,
                                      i64 %126, i32 %128,
                                      i64 0,
                                      %struct.CUstream_st* null)
%tobool43 = icmp ne i32 %call42, 0
br i1 %tobool43, label %kcall.end45,
                 label %kcall.configok44
kcall.configok44:
  %139 = load i32, i32* %ni.addr, align 4
  %140 = load i32, i32* %nj.addr, align 4
  %141 = load i32, i32* %nk.addr, align 4
  %142 = load i32, i32* %nl.addr, align 4
  %143 = load i32, i32* %nm.addr, align 4
  %144 = load float*, float** %C_gpu, align 8
  %145 = load float*, float** %D_gpu, align 8
  %146 = load float*, float** %F_gpu, align 8
  call void @_Z11mm3_kernel2iiiiiPfS_S_(i32 %139, i32 %140,
                                        i32 %141, i32 %142,
                                        i32 %143, float* %144,
                                        float* %145, float* %146),
                                        !Redundancy !3
  br label %kcall.end45
...
!3 = !{!"Inputs &amp;C_gpu&amp;D_gpu Outputs &amp;F_gpu"}</preformat>
        <p>Listing 2: Compiled code</p>
        <p>Listing 3: Initial code</p>
        <p>Listing 4: Target code</p>
        <sec id="sec-3-2-1">
          <title>3.2.1. Compiler Directive</title>
          <p>In order to get the kernel function to be replicated from the programmer, our compiler scheme
defines a compiler directive. By utilizing the directive, the programmer can annotate the function
as well as its input and output variables. The syntax of the directive is as follows:</p>
          <preformat>#pragma redundant in &lt;input&gt; out &lt;output&gt;</preformat>
          <p>Listing 2 presents how our customized Clang generates the annotated function call for the target
code given in Listing 1. The annotated function call carries a metadata node, Redundancy, which indicates
both the inputs and the outputs of the function call. The attached metadata can be used in the
optimizer phase.</p>
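          <p>To make the content of the Redundancy metadata concrete, the following Python sketch splits the metadata string of Listing 2 into input and output variable names. It is only an illustration: our pass operates on LLVM IR through the LLVM API, the function name parse_redundancy_metadata is hypothetical, and the string grammar is an assumption inferred from the single example in Listing 2.</p>
          <preformat><![CDATA[
```python
def parse_redundancy_metadata(text: str):
    """Split 'Inputs &a&b Outputs &c' into (['a', 'b'], ['c']).

    Assumed grammar (inferred from one example): the keyword 'Inputs',
    a list of names each prefixed by '&', the keyword 'Outputs', and a
    second list in the same form.
    """
    head, keyword, tail = text.partition("Outputs")
    head = head.strip()
    if not keyword or not head.startswith("Inputs"):
        raise ValueError("expected 'Inputs ... Outputs ...'")
    inputs = [name.strip() for name in head[len("Inputs"):].split("&") if name.strip()]
    outputs = [name.strip() for name in tail.split("&") if name.strip()]
    return inputs, outputs


inputs, outputs = parse_redundancy_metadata("Inputs &C_gpu&D_gpu Outputs &F_gpu")
```
]]></preformat>
          <p>For the string attached to the call in Listing 2, this yields C_gpu and D_gpu as inputs and F_gpu as the output to be replicated.</p>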
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. Output Replication</title>
          <p>In order to store the result values for all the function executions, that is, one original and two
redundant executions, we replicate the output variable of the annotated GPU kernel (given as
① in Listing 4). This phase consists of three steps:
            <list list-type="order">
              <list-item><p>Variable declaration: We define two more variables of the same type as the output variable of
the kernel function to be replicated in the CPU code (Line 36 and Line 47 in Listing 4).</p></list-item>
              <list-item><p>Memory allocation: We allocate space on the GPU by utilizing cudaMalloc function calls
(Line 37 and Line 48 in Listing 4).</p></list-item>
              <list-item><p>Initialization: Since the target function executions use the initial values of the output
variables, we copy the data values from the original
output variable into the redundant copies (Line 39 and Line 50 in Listing 4). Since
the initialized values already reside in the GPU memory, we utilize the device-to-device memory
copy operation (via the cudaMemcpyDeviceToDevice parameter) to avoid the overhead of
copying from the host CPU to the GPU device.</p></list-item>
            </list>
          </p>
        </sec>
        <sec id="sec-3-2-3">
          <title>3.2.3. Kernel Function Call Replication</title>
          <p>Since our aim is to execute the target kernel function redundantly, we include two more function
calls in addition to the original one (given as ② in Listing 4). We provide the redundant copies
of the output variables, created in the previous step, as the output parameters of the redundant
function calls. The other parameters, including the grid size, the block size, and the input
variables, remain the same as in the original function call.</p>
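          <p>The combined effect of output replication and kernel call replication can be emulated on the host with a short Python sketch. It is not the generated CUDA code: Python lists stand in for GPU buffers, list copies stand in for the device-to-device copies, and the names run_redundantly and accumulate_kernel are hypothetical.</p>
          <preformat><![CDATA[
```python
def run_redundantly(kernel, inputs, output):
    """Run `kernel` three times, each time writing to its own output buffer.

    The two extra buffers are initialized from the original output, which
    mirrors the device-to-device copy of the initialization step.
    """
    outputs = [output, list(output), list(output)]
    for out in outputs:
        # Same inputs for every call; only the output buffer differs.
        kernel(*inputs, out)
    return outputs


def accumulate_kernel(a, b, out):
    """Hypothetical kernel that reads the initial output values (out += a * b)."""
    for i in range(len(out)):
        out[i] += a[i] * b[i]


results = run_redundantly(accumulate_kernel, ([1, 2, 3], [4, 5, 6]), [10, 10, 10])
```
]]></preformat>
          <p>Because the hypothetical kernel accumulates into its output, the three results agree only if every copy starts from the same initial values, which is why the initialization step is needed.</p>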
        </sec>
        <sec id="sec-3-2-4">
          <title>3.2.4. Majority Voting Implementation</title>
          <p>After executing the redundant copies of the kernel functions with redundant output variables,
we need to compare their results to detect and correct the potential errors. Therefore, we
implement a majority voting function, which produces a single output by comparing the three
outputs generated by the redundant function executions (given as ③ in Listing 4).</p>
          <p>We implement the majority function as a GPU kernel since the outputs are already in the
GPU global memory, and also the parallel execution can accelerate the comparison operations.
In our compiler implementation, we need to modify both the host code and the device code.
While the device code is updated by the addition of the function body, the modified host code
includes the majority function call with the argument setup. Since a GPU application may
contain multiple data types, we need multiple majority function implementations. We use the
built-in type ID of the type of the output in the name of the majority voting function and call
the appropriate one depending on the output’s data type.</p>
          <p>Our majority voting function takes five parameters. The first three are the outputs
of the kernel function executions. The remaining two are the final output and the size of the marked
output. Our majority voting function has two steps: Thread ID calculation and
Comparison. Thread ID calculation is straightforward, as can be observed on line 6 in Listing
4. The next step, Comparison, relies on the assumption that at least two of the three values are equal to each
other. Therefore, a single comparison is sufficient. The comparison is between the first
and second outputs. If they are equal, we assign the first value to the output. If
they are not, the third value must be equal to one of them under our assumption.
Therefore, we can assign the third value to the output without further comparison.</p>
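          <p>The single-comparison voting logic can be sketched in Python as follows. This is a host-side illustration, not our GPU kernel: each loop iteration plays the role of one GPU thread, the function returns a new list instead of writing to a fourth buffer, and the name majority_vote is hypothetical.</p>
          <preformat><![CDATA[
```python
def majority_vote(out1, out2, out3):
    """Element-wise vote over three redundant outputs with one comparison.

    Assumes at least two of the three values agree at every index, so a
    mismatch between out1 and out2 implies that out3 holds the majority value.
    """
    if not (len(out1) == len(out2) == len(out3)):
        raise ValueError("all outputs must have the same size")
    result = []
    for tid in range(len(out1)):  # one iteration per (emulated) thread
        if out1[tid] == out2[tid]:
            result.append(out1[tid])
        else:
            result.append(out3[tid])
    return result


correct = [1.0, 2.0, 3.0]
fault_in_first = majority_vote([9.0, 2.0, 3.0], correct, correct)
fault_in_second = majority_vote(correct, [1.0, 7.0, 3.0], correct)
fault_in_third = majority_vote(correct, correct, [1.0, 2.0, 5.0])
```
]]></preformat>
          <p>All three calls recover the correct values. Note that this scheme relies on the stated assumption: if all three copies disagree at some index, the third value is returned without the disagreement being detected.</p>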
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Study</title>
      <p>In this section, we present the experiments for our redundant executions after providing the
SDC rates obtained from our fault injection experiments. Firstly, we explain our experimental
setup and evaluation methodology, and then we present our results.</p>
      <sec id="sec-4-1">
        <title>4.1. Experimental Setup</title>
        <p>
          We utilize the GPU programs from PolyBench benchmark suite [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] and evaluate the effects of
the redundant execution on the kernel functions of the target programs. Specifically, we choose
five applications: Bicg, Correlation, Covariance, Fdtd2d, and Gramschmidt. Firstly, we
perform fault injection experiments for 15 kernel functions from those five programs and
obtain the SDC rates as the soft error vulnerability metric. Then, we compile the codes that are
annotated to specify the functions to be replicated.
        </p>
        <p>
          We choose the NVIDIA Pascal architecture [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] as our evaluation platform. We build our
compilation framework on Clang compiler version 10.0 and generate the target binaries for
our redundant execution scenarios by compiling with the modified Clang version. To collect the
execution times for the kernel functions and the specific operations (e.g., memory copy, kernel
function execution) during the program execution, we utilize nvprof [16] and NVIDIA Nsight
Compute [17], which are NVIDIA tools for CUDA programs. On the other hand,
for performance and compatibility reasons, we compile our target programs with
the nvcc compiler [18] to perform the fault injection experiments. We use 1000 fault injections per
kernel function, following a statistical approach [19] with a confidence level of 95% and
an error margin of 3%.
        </p>
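        <p>As a rough plausibility check of the sample size, the margin of error for 1000 injections can be estimated with the normal approximation for a proportion. This sketch is an assumption on our side rather than the exact formula of [19]: it takes the worst-case proportion p = 0.5, uses z = 1.96 for the 95% confidence level, and ignores any finite-population correction. The function name margin_of_error is hypothetical.</p>
        <preformat><![CDATA[
```python
import math


def margin_of_error(n: int, z: float = 1.96, p: float = 0.5) -> float:
    """Half-width of the confidence interval for a proportion estimated from n runs.

    Uses e = z * sqrt(p * (1 - p) / n), the normal approximation without a
    finite-population correction (an assumption of this sketch).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return z * math.sqrt(p * (1.0 - p) / n)


error_1000 = margin_of_error(1000)
```
]]></preformat>
        <p>Under these assumptions, 1000 injections give a margin of roughly 3.1%, which is in line with the 3% error margin stated above.</p>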
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Experimental Results</title>
        <p>We perform fault injection experiments to obtain the most vulnerable kernel function(s) in our
target programs. While our experiments report Masked, Crash, and SDC rates, we only utilize
the SDC rates as the soft error vulnerability metric of the execution. Figure 4 presents the SDC
rates for each kernel function revealed from our fault injection experiments. Additionally, Table
1 presents the execution times of the corresponding kernels, which we refer to in our redundant
execution discussion.</p>
        <p>After determining the vulnerability of the kernel functions, we select the most vulnerable
one with the highest SDC rate for each GPU application, which is marked with red in Figure 4.
Then we perform redundant executions with different redundancy schemes by employing our
RMT framework. Specifically, we execute our target programs with the following scenarios:
1) No Redundancy, where we run the program as it is, 2) Full Redundancy, where we run all
the kernel functions in the program redundantly, that is, three times, 3) One kernel redundant,
where we select only one kernel and run that redundantly.</p>
        <p>Figure 5 shows the normalized execution times of our target applications with different
redundancy schemes. For each program, we mark in the figure the execution scenario in which the kernel function with the highest
SDC rate is replicated, e.g., bicg_kernel2 in Bicg, corr_kernel in
Correlation. If we look at Bicg, where bicg_kernel2 has the highest SDC rate, we can
see that the redundant execution in which bicg_kernel2 is replicated takes significantly less
time than the full redundancy case. Compared to the full redundancy, our selective scheme yields
a performance gain. On the other hand, Correlation and Covariance do not behave in the same
way. The corr_kernel of Correlation, which has the highest SDC rate, also has the highest execution
time, which dominates the other kernels. Therefore, the difference between
Full redundancy and only corr_kernel redundancy is negligible, as can be observed in Figure
5. Even though corr_kernel has the highest SDC rate, the SDC rate of std_kernel is not
much different (see Figure 4). Since the execution time of std_kernel is small in comparison to
the execution time of corr_kernel (see Table 1), it would be beneficial to replicate both of
them in order to decrease the SDC rate without significant performance degradation. Similarly,
in Covariance, covar_kernel dominates the whole application since the execution times of the
other kernels are negligible. On the other hand, mean_kernel has an SDC rate close to that of
covar_kernel, as can be observed in Figure 4. If we replicate both of those functions, we achieve
potentially lower vulnerability with a negligible execution time difference. We must keep in mind
that the time difference between Full redundancy and only covar_kernel redundancy is tiny.
Therefore, we can say that the best redundancy option is Full redundancy. By observing the
trends in Correlation and Covariance, we can say that for programs in which one kernel function
dominates the others in terms of execution time, our selective scheme does not help, and
applying full redundancy can be a good option.</p>
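        <p>The effect of a dominant kernel on the normalized execution time can be illustrated with a simple model. The sketch below is not derived from our measurements: it assumes that the redundant copies run one after another, it ignores the majority voting and memory copy overheads, and the kernel names and times are hypothetical.</p>
        <preformat><![CDATA[
```python
def normalized_time(kernel_times, replicated):
    """Execution time relative to the non-redundant run.

    Each replicated kernel is assumed to run three times sequentially, so it
    adds twice its own time on top of the baseline.
    """
    baseline = sum(kernel_times.values())
    extra = 2.0 * sum(kernel_times[name] for name in replicated)
    return (baseline + extra) / baseline


# Hypothetical program in which one kernel dominates the execution time.
times = {"small_kernel": 1.0, "dominant_kernel": 99.0}
full = normalized_time(times, list(times))
only_dominant = normalized_time(times, ["dominant_kernel"])
only_small = normalized_time(times, ["small_kernel"])
```
]]></preformat>
        <p>In this model, replicating only the dominant kernel costs almost as much as full redundancy, whereas replicating only the small kernel is nearly free. This mirrors the qualitative behavior discussed for Correlation and Covariance.</p>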
        <p>Unlike the other applications, in Fdtd2d and Gramschmidt the execution time and the SDC
rate are inversely related. For example, while fdtd2d_step2_kernel has the
highest SDC rate (see Figure 4), its execution time is the shortest (see Table 1) among the
kernel functions in the application. Among all the redundancy options of Fdtd2d shown in Figure
5, the option with fdtd2d_step2_kernel redundant has the lowest execution time. Since the
kernel has the highest SDC rate, we get a considerable SDC rate reduction by sacrificing
relatively little execution time. On the other hand, the SDC rate of fdtd2d_step3_kernel is
close to that of fdtd2d_step2_kernel, and the difference between their execution times is not large
(see Table 1). Therefore, it would be beneficial to replicate both of them.
Gramschmidt has similar characteristics. For instance, gramschmidt_kernel2 has the highest
SDC rate, yet it has the lowest execution time among the kernels of the application.
Therefore, replicating only this function, or two functions with similar execution times, namely gramschmidt_kernel1
and gramschmidt_kernel2, can be beneficial in terms of both
performance and reliability.</p>
        <p>Having discussed the performance of the redundant execution schemes, we also
want to analyze the performance and reliability gains together. To evaluate the
performance-reliability tradeoff between the different replication schemes, we show the change (in
percentage) for both the SDC rates as the vulnerability metric and the execution times in Figure
6. We assume that our redundancy method, based on triple replication, provides full protection,
and that our redundantly executed code does not produce any erroneous outcome because the faults are masked.
We calculate the SDC rates accordingly. We utilize two redundant execution scenarios for
each application: the replication of only the most vulnerable kernel function and the
replication of the two most vulnerable kernel functions. For Bicg, essentially, we present the
Full redundancy and the most vulnerable function redundancy schemes since it consists of
two kernel functions. If we perform Full redundancy, where we replicate both kernel functions,
the SDC rate drops to zero (100% decrease). However, the reliability gain obtained from the
replication of only bicg_kernel2 is limited (∼65%). Therefore, if there is no
time limitation and/or no tolerance for SDCs, it would be more useful to apply full redundancy.
Otherwise, using only bicg_kernel2 redundancy would be beneficial. For both Correlation and
Covariance, as discussed earlier, the execution time does not differ while the vulnerability
gain gets larger for the two-function replication case. Even though, in Figure 6, there is a
large percentage difference between the option with only fdtd2d_step2_kernel redundant and the
option with both fdtd2d_step2_kernel and fdtd2d_step3_kernel redundant, the absolute difference is not
significant since the execution times of the kernels are short (see Table 1). However, when
Fdtd2d is executed with a very large amount of data and the execution takes
longer, the performance gain obtained from the replication of only the most vulnerable
function becomes significant. For Gramschmidt, we can clearly see the trade-off between the
vulnerability and the performance for the alternative redundancy schemes. The replication
of only the most vulnerable function, i.e., gramschmidt_kernel2, provides almost the same
(∼30%) performance and vulnerability gains. Therefore, one needs to consider the requirements
of the system, including the execution time and the reliability, and decide on the
redundancy level accordingly.</p>
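        <p>The trade-off discussed above can be illustrated with a minimal sketch. It assumes, as stated in the text, that a replicated kernel is fully protected, and it further assumes that every SDC observed during fault injection is attributed to exactly one kernel. The exact formula behind Figure 6 is not spelled out here, so this is one plausible reading rather than the authors' procedure. The function names and all numbers are hypothetical; the SDC counts are chosen only to mirror the roughly 65% reduction reported for bicg_kernel2.</p>
        <preformat><![CDATA[
```python
def sdc_reduction_percent(sdc_counts, protected):
    """Percentage drop in SDCs when the kernels in `protected` are assumed fully masked."""
    total = sum(sdc_counts.values())
    if total == 0:
        return 0.0  # no SDCs were observed, so there is nothing to reduce
    removed = sum(n for kernel, n in sdc_counts.items() if kernel in protected)
    return 100.0 * removed / total


def time_increase_percent(baseline_time, redundant_time):
    """Percentage increase in execution time relative to the unprotected run."""
    if baseline_time <= 0:
        raise ValueError("baseline_time must be positive")
    return 100.0 * (redundant_time - baseline_time) / baseline_time


# Hypothetical per-kernel SDC counts from a fault injection campaign.
bicg_sdc = {"bicg_kernel1": 35, "bicg_kernel2": 65}

# Hypothetical execution times in milliseconds (not taken from Table 1).
baseline_ms = 10.0
kernel2_only_ms = 16.0

# Reliability gain of protecting only the most vulnerable kernel.
print(sdc_reduction_percent(bicg_sdc, {"bicg_kernel2"}))
# Reliability gain of full redundancy (every kernel protected).
print(sdc_reduction_percent(bicg_sdc, set(bicg_sdc)))
# Performance cost of the partial scheme.
print(time_increase_percent(baseline_ms, kernel2_only_ms))
```
]]></preformat>
        <p>Comparing the two percentages for each candidate scheme gives the kind of side-by-side view that the text uses to choose a redundancy level for a given time budget and SDC tolerance.</p>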
        <p>Figure 7 presents the execution time profile of each function for the redundancy scenarios.
For each redundant execution case, we measure the percentage of the operations performed
during the execution. Specifically, we profile the kernel function executions, the majority
function, and the memory copy operations including the copy of the input from CPU to GPU
(CUDA memcpy HtoD), the copy of the output from GPU to CPU (CUDA memcpy DtoH), and
the redundant output copy operations from GPU to GPU (CUDA memcpy DtoD). We can see
that most of the time is spent during the kernel function executions. While the redundant
output copy operations also take significant time in Fdtd2d (Figure 7b) and Gramschmidt (Figure
7c), the percentage of those operations is small for the other programs. We provide the small
percentages in a more detailed view for Bicg (see Figure 7a); however, we omit the details
for Correlation and Covariance, which both spend almost all the time in the dominant kernel
function executions, corr_kernel and covar_kernel, respectively.</p>
        <p>Although we include additional memory operations and a majority function as well as
redundant kernel functions in our redundant scenarios, the main reason for the increase in the
execution time is the replicated function executions. The majority voting function and the
memory operations do not take significant time in comparison to the kernel executions. Therefore,
we need to focus on reducing the time spent on the redundant kernel executions. It is possible
to utilize the parallel execution units of the GPU by executing the redundant copies in parallel,
either by using streams or by multiplying the number of threads so that extra threads work on the redundant copies.</p>
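        <p>The majority function mentioned above compares the outputs of the three kernel executions. Its details are not given in this section, so the following is only a host-side sketch of element-wise voting, assuming three output copies of equal length and exact equality as the comparison. The name majority_vote and the handling of the no-majority case are assumptions for illustration, not the code generated by the LLVM pass.</p>
        <preformat><![CDATA[
```python
def majority_vote(copy_a, copy_b, copy_c):
    """Element-wise majority over three output copies of equal length.

    Returns the voted output and the indices where all three copies disagree.
    """
    if not (len(copy_a) == len(copy_b) == len(copy_c)):
        raise ValueError("output copies must have the same length")

    voted = []
    no_majority = []
    for i, (a, b, c) in enumerate(zip(copy_a, copy_b, copy_c)):
        if a == b or a == c:
            voted.append(a)          # copy A agrees with at least one other copy
        elif b == c:
            voted.append(b)          # copy A is the outlier
        else:
            voted.append(a)          # assumption: fall back to copy A and flag it
            no_majority.append(i)
    return voted, no_majority


# Hypothetical outputs: the second copy has one corrupted element.
golden = [1.0, 2.0, 3.0]
corrupted = [1.0, 99.0, 3.0]
result, flagged = majority_vote(golden, corrupted, golden)
print(result, flagged)
```
]]></preformat>
        <p>With a single faulty copy, the vote recovers the correct value at every index, which is the masking behaviour assumed for triple replication in the reliability analysis.</p>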
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this study, we propose a partial redundant multithreading mechanism based on the soft error
vulnerability of GPGPU applications and perform a trade-off analysis between performance and
reliability. Firstly, we run fault injection experiments and collect SDC rates for the programs.
Based on the SDC rates, we determine the kernel function to be replicated. Using our custom
pragma annotation, our custom LLVM pass generates the additional kernel function calls, the required
memory operations, and the majority voting function. Our experiments indicate that the majority
voting function and the memory operations such as cudaMemcpy do not consume significant
time. The main reason for the performance overhead is the execution time of the kernel
functions. Therefore, for future work, we will focus on increasing the execution overlaps of the
redundant kernel function executions.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>We would like to thank the anonymous reviewers for their comments. This work was supported by
the Scientific and Technological Research Council of Turkey (TÜBİTAK), Grant No: 119E011.
This work is partially supported by CERCIRAS COST Action CA19135 funded by COST
Association.</p>
      <p>[16] Nvidia nvprof profiling tool, 2021. URL: https://docs.nvidia.com/cuda/profiler-users-guide/index.html#nvprof-overview.</p>
      <p>[17] Nvidia nsight compute, 2021. URL: https://developer.nvidia.com/nsight-compute.</p>
      <p>[18] Nvidia, cuda llvm compiler, 2021. URL: https://developer.nvidia.com/cuda-llvm-compiler.</p>
      <p>[19] R. Leveugle, A. Calvez, P. Maistri, P. Vanhauwaert, Statistical fault injection: Quantified
error and confidence, Proceedings of the Conference on Design, Automation and Test in
Europe (DATE), 2009.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T. M.</given-names>
            <surname>Aamodt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. W. L.</given-names>
            <surname>Fung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. G.</given-names>
            <surname>Rogers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Martonosi</surname>
          </string-name>
          ,
          <source>General-Purpose Graphics Processor Architecture</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Reis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Vachharajani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rangan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>August</surname>
          </string-name>
          ,
          <article-title>Swift: software implemented fault tolerance</article-title>
          ,
          <source>in: International Symposium on Code Generation and Optimization</source>
          ,
          <year>2005</year>
          , pp.
          <fpage>243</fpage>
          -
          <lpage>254</lpage>
          . doi: 10.1109/CGO.2005.34.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Didehban</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shrivastava</surname>
          </string-name>
          ,
          <article-title>Nzdc: A compiler technique for near zero silent data corruption</article-title>
          ,
          <source>in: Proceedings of the 53rd Annual Design Automation Conference</source>
          , DAC '16,
          Association for Computing Machinery, New York, NY, USA,
          <year>2016</year>
          . URL: https://doi.org/10.1145/2897937.2898054. doi: 10.1145/2897937.2898054.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wadden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lyashevsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gurumurthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sridharan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Skadron</surname>
          </string-name>
          ,
          <article-title>Real-world design and evaluation of compiler-managed gpu redundant multithreading</article-title>
          ,
          <source>in: 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA)</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>73</fpage>
          -
          <lpage>84</lpage>
          . doi: 10.1109/ISCA.2014.6853227.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bohman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>James</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Wirthlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Quinn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Goeders</surname>
          </string-name>
          ,
          <article-title>Microcontroller compiler-assisted software fault tolerance</article-title>
          ,
          <source>IEEE Transactions on Nuclear Science</source>
          <volume>66</volume>
          (
          <year>2019</year>
          )
          <fpage>223</fpage>
          -
          <lpage>232</lpage>
          . doi: 10.1109/TNS.2018.2886094.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Kalra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Previlon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rubin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kaeli</surname>
          </string-name>
          ,
          <article-title>Armorall: Compiler-based resilience targeting gpu applications</article-title>
          ,
          <source>ACM Trans. Archit. Code Optim</source>
          .
          <volume>17</volume>
          (
          <year>2020</year>
          ). URL: https://doi.org/10.1145/3382132. doi: 10.1145/3382132.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jog</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Smirni</surname>
          </string-name>
          ,
          <article-title>Enabling software resilience in gpgpu applications via partial thread protection</article-title>
          ,
          <source>in: 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>1248</fpage>
          -
          <lpage>1259</lpage>
          . doi: 10.1109/ICSE43902.2021.00114.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>I.</given-names>
            <surname>Oz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. F.</given-names>
            <surname>Karadas</surname>
          </string-name>
          ,
          <article-title>Regional soft error vulnerability and error propagation analysis for gpgpu applications</article-title>
          ,
          <source>Journal of Supercomputing</source>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>1</lpage>
          . doi: 10.1007/s11227-021-04026-6.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P.</given-names>
            <surname>Shivakumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kistler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Keckler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Burger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Alvisi</surname>
          </string-name>
          ,
          <article-title>Modeling the effect of technology trends on the soft error rate of combinational logic</article-title>
          ,
          <source>in: Proceedings of International Conference on Dependable Systems and Networks (DSN)</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>C.</given-names>
            <surname>Weaver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Emer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Mukherjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Reinhardt</surname>
          </string-name>
          ,
          <article-title>Techniques to reduce the soft error rate of a high-performance microprocessor</article-title>
          ,
          <source>in: 31st Annual International Symposium on Computer Architecture (ISCA)</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>I.</given-names>
            <surname>Oz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Arslan</surname>
          </string-name>
          ,
          <article-title>A survey on multithreading alternatives for soft error fault tolerance</article-title>
          ,
          <source>ACM Comput. Surv</source>
          .
          <volume>52</volume>
          (
          <year>2019</year>
          ). URL: https://doi.org/10.1145/3302255.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>C.</given-names>
            <surname>Lattner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Adve</surname>
          </string-name>
          ,
          <article-title>LLVM: a compilation framework for lifelong program analysis &amp; transformation</article-title>
          , in:
          <source>International Symposium on Code Generation and Optimization (CGO 2004)</source>
          ,
          <year>2004</year>
          , pp.
          <fpage>75</fpage>
          -
          <lpage>86</lpage>
          . doi: 10.1109/CGO.2004.1281665.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>C.</given-names>
            <surname>Lattner</surname>
          </string-name>
          , LLVM,
          <year>2011</year>
          . URL: https://www.aosabook.org/en/llvm.html, last accessed 27 August 2021.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>L.-N.</given-names>
            <surname>Pouchet</surname>
          </string-name>
          , PolyBench/C,
          <year>2016</year>
          . URL: https://web.cse.ohio-state.edu/%7epouchet.2/software/polybench/, last accessed 27 August 2021.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Nvidia</surname>
          </string-name>
          , Pascal architecture whitepaper,
          <year>2021</year>
          . URL: https://www.nvidia.com/en-us/data-center/resources/pascal-architecture-whitepaper.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>