Towards a Safety Case for Hardware Fault Tolerance in Convolutional Neural Networks Using Activation Range Supervision

Florian Geissler1*, Syed Qutub1, Sayanta Roychowdhury1, Ali Asgari2, Yang Peng1, Akash Dhamasia1, Ralf Graefe1, Karthik Pattabiraman2 and Michael Paulitsch1
1 Intel, Germany
2 University of British Columbia, Canada

* corresponding author, Email: florian.geissler@intel.com.
Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Convolutional neural networks (CNNs) have become an established part of numerous safety-critical computer vision applications, including human robot interactions and automated driving. Real-world implementations will need to guarantee their robustness against hardware soft errors corrupting the underlying platform memory. Based on the previously observed efficacy of activation clipping techniques, we build a prototypical safety case for classifier CNNs by demonstrating that range supervision represents a highly reliable fault detector and mitigator with respect to relevant bit flips, adopting an eight-exponent floating point data representation. We further explore novel, non-uniform range restriction methods that effectively suppress the probability of silent data corruptions and uncorrectable errors. As a safety-relevant end-to-end use case, we showcase the benefit of our approach in a vehicle classification scenario, using ResNet-50 and the traffic camera data set MIOVision. The quantitative evidence provided in this work can be leveraged to inspire further and possibly more complex CNN safety arguments.

[Figure 1, reconstructed here as a list of its nodes:
G1: System is sufficiently safe in the presence of soft errors.
C1: Operational design domain: Inference of pretrained classifier networks with protection layers, input represented by given dataset.
C2: An appropriate independent dataset for bound extraction exists.
C3: "Sufficiently safe" is well defined by the end user and is proportional to the overall risk.
C4: The chance of a soft error event to occur can be given.
C5: The simulated weight/neuron fault model appropriately represents realistic soft errors.
C6: The data representation has eight exponent bits (FP32, BF16).
C7: A fallback system/re-execution can be used for uncorrectable errors.
G2: System detects critical soft errors.
G3: System mitigates soft errors.
G4: System does not increase the error severity.
E2a: SDC/DUE events appear in conjunction with oob events with a high conditional probability.
E2b: Oob events are detected by threshold-based protection layers.
E2c: DUE events can further be detected by NaN/Inf monitoring.
E3a: The probability of SDC/DUE events is significantly reduced by restricting oob activations in protection layers.
E3b: DUE events can further be mitigated by referring to a fallback system or via re-execution.
E4a: DUE events can be handled with negligible risk for any error severity.
E4b: The severity of residual SDC depends on the application. As an example, we study the scenario of MIOVision and ResNet-50 and find that the average severity of errors is comparable or lower.]

Figure 1: Structured safety argument for the fault tolerance of a CNN in the presence of soft errors, using range restrictions. The notation follows [10] including goals (G), context (C), and evidence (E). "Oob" denotes "out-of-bounds".
1 Motivation

With the widespread use of convolutional neural networks (CNN) across many safety-critical domains such as automated robots and cars, one of the most prevailing challenges is the establishment of a safety certification for such artificial intelligence (AI) components, e.g., with respect to ISO 26262 [1] or ISO/PAS 21448 (SOTIF) [2]. This certification requires not only a high fault tolerance of the trained network against unknown or adversarial input, but also efficient protection against hardware faults of the underlying platform [3, 4]. Importantly, this includes transient soft errors, meaning disturbances originating from events such as cosmic neutron radiation, isotopes emitting alpha particles, or electromagnetic leakage on the computer circuitry itself. Soft errors typically manifest as single or multiple bit upsets in the platform's memory elements [5]. As a consequence, network parameters (weight faults) or local computational states (neuron faults) can be altered during inference time and invalidate the network prediction in a safety-critical way, for example, by misclassifying a person as a background image in an automated driving context [6–8]. This has led to a search for strategies to verify CNN-based systems against hardware faults at the inference stage [9]. With chip technology nodes scaling to smaller sizes and larger memory density per area, future platforms are expected to be even more susceptible to soft errors [5].

In this paper, we evaluate range restriction techniques in CNNs exposed to platform soft errors with respect to the key elements of a prototypical safety case. This means that we formulate arguments (in the form of "goals") that constitute essential parts of a complete safety case, and provide quantitative evidence to support these goals in the studied context (see Fig. 1). Individual safety arguments can be reused as building blocks of more complex safety cases. The structure of our goals is based on the probabilistic, high-level safety objective of minimizing the overall risk [10], expressed as:

    P_loss(i) = P_failure(i) · [ (1 − P_detection(i)) + (1 − P_mitigation(i)) ],
    Risk = Σ_i P_loss(i) · Severity(i).                                        (1)

Explicitly, for a fault type i, this includes the sub-goals of efficient error detection and mitigation, as well as a consideration of the fault severity in a given use case. On the other hand, the probability of occurrence of a soft error (i.e., P_failure in Eq. 1) is assumed to be a constant system property that cannot be controlled by run-time monitoring methods such as activation range supervision.
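For illustration, the risk objective of Eq. (1) can be evaluated in a few lines of Python; the probabilities and severities below are hypothetical placeholders, not values measured in this work:

    # Hypothetical illustration of Eq. (1); all probabilities are made-up
    # placeholders, not measurements from this paper.
    fault_types = {
        # fault type: (P_failure, P_detection, P_mitigation, Severity)
        "weight_fault": (1e-5, 0.99, 0.99, 1.0),
        "neuron_fault": (1e-5, 0.98, 0.97, 1.0),
    }

    risk = 0.0
    for name, (p_fail, p_det, p_mit, severity) in fault_types.items():
        # P_loss(i) = P_failure(i) * [(1 - P_detection(i)) + (1 - P_mitigation(i))]
        p_loss = p_fail * ((1.0 - p_det) + (1.0 - p_mit))
        risk += p_loss * severity

    print(f"Overall risk: {risk:.2e}")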
In a nutshell, range restriction builds on the observation that silent data corruption (SDC) and detected uncorrectable errors (DUE; e.g., NaN and Inf occurrences) stem primarily from those bit flips that cause very large values, for example in high exponential bits [6]. Those events result in large activation peaks that typically grow even more during forward propagation due to the monotonicity of most neural network operations [11]. To suppress the propagation of such corrupted values, additional range restriction layers are inserted in the network at strategic positions following the approach of Chen et al. [8] (see Fig. 2 for an example; a minimal code sketch is given at the end of this section). At inference time, the protection layers then compare the intermediate activations against previously extracted interval thresholds in order to detect and reset anomalously large values. Derivative approaches have been shown to be efficient in recovering network performance [6–8, 12] and, advantageously, do not require the retraining of CNN parameters nor computationally expensive functional duplications.

The focus of this paper is to examine alternative restriction schemes for optimized soft error mitigation. In a CNN, the output of every kernel is represented as a two-dimensional (2D) feature map, where the activation magnitudes encode specific features, on which the network bases its prediction. Soft errors will manifest as distortions of feature maps in all subsequent layers that make use of the corrupted value, as shown in Fig. 2(a)-(b). The problem of mitigating soft errors in a CNN can therefore be rephrased as restoring the fault-free topology of feature maps.

[Figure 2 shows feature map visualizations of a LeNet-5 network (Conv1, ReLU, MaxPool, Conv2, ReLU, MaxPool, Reshape, FC1-FC3, with Ranger layers inserted after activation, pooling, and reshape operations), with panels (a) no protection, no fault; (b) no protection, weight fault; (c) Ranger; (d) Clipping; (e) Rescaling; (f) Backflip; (g) Fmap average.]

Figure 2: Visualization example of the impact of a weight fault using LeNet-5 and the MNIST data set. Range restriction layers ("Ranger") are inserted following [8] (top left). The rows represent the feature maps of the individual network layers after range restriction was applied, where linear layers (FC1-FC3) were reshaped to a 2D feature map as well for visualization purposes. In (b)-(g), a large weight fault value is injected in the second filter of the first convolutional layer. For the unprotected model (b), this leads to an SDC event ("0" gets changed to "7"). The columns (c)-(g) then illustrate the effect of the different investigated range restriction methods.

Previous analyses have adopted uniform range restriction schemes that truncate out-of-bound values to a finite threshold [7, 8], e.g., Fig. 2(c)-(d). We instead follow the intuition that optimized, non-uniform range restriction methods that attempt to reconstruct feature maps (see Fig. 2(e)-(g), and details in Sec. 5) can not only reduce SDC to a comparable or even lower level, but may also lead to less critical misclassifications in the case of an SDC. This is because classes with more similar attributes will display more similar high-level features (e.g., pedestrian and biker will both exhibit an upright silhouette, in contrast to car and truck classes).

Finally, a safety analysis has to consider that not all SDC events pose an equal risk to the user. We study a safety-critical use case evaluating cluster-wise class confusions in a vehicle classification scenario (Sec. 6). The example shows that range supervision reduces the severe confusions proportionally with the overall number of confusions, meaning that the total risk is indeed mitigated.

In summary, this paper makes the following contributions:
• Fault detection: We quantify the correlation between SDC events and the occurrence of out-of-bound activations to demonstrate the high efficiency of fault detection by monitoring intermediate activations,
• Fault mitigation: We explore three novel range restriction methods that build on the preservation of the feature map topologies instead of mere value truncation,
• Fault severity: We demonstrate the benefit of range supervision in an end-to-end use case of vehicle classification where high and low severities are estimated by the generic safety-criticality of class confusions.
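As a minimal illustration of the insertion scheme of Fig. 2, a threshold-based protection layer can be realized in PyTorch as follows (a simplified sketch with placeholder bounds, not our exact implementation):

    import torch
    import torch.nn as nn

    class Ranger(nn.Module):
        """Range restriction layer: saturates activations at pre-extracted bounds."""
        def __init__(self, t_low: float, t_up: float):
            super().__init__()
            self.t_low, self.t_up = t_low, t_up

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return torch.clamp(x, min=self.t_low, max=self.t_up)

    # Toy LeNet-like stack with protection layers after activation and pooling;
    # the bounds (0.0, 4.0) are placeholders, extracted per layer in practice (Sec. 3.2).
    model = nn.Sequential(
        nn.Conv2d(1, 6, 5), nn.ReLU(), Ranger(0.0, 4.0),
        nn.MaxPool2d(2), Ranger(0.0, 4.0),
        nn.Flatten(),
        nn.Linear(6 * 12 * 12, 10),
    )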
The article is structured as follows: Section 2 reviews relevant previous work, while Section 3 describes the setup used in this paper. Subsequently, Sections 4, 5, and 6 discuss error detection, mitigation, and an exemplary risk analysis, respectively, before Section 7 concludes the paper.

2 Related work

Parity or error-correcting code (ECC) can protect memory elements against single soft errors [5, 13]. However, due to the high compute and area overhead, this is typically done only for selected critical memory blocks. Component replication techniques such as triple modular redundancy can be used for the full CNN execution at the cost of a large overhead. Selective hardening of hardware elements with the most salient parameters can improve the robustness of program execution in the presence of underlying faults [6, 14]. On a software level, the estimation of the CNN's vulnerable feature maps (fmaps) and the selective protection by duplicated computations [15], or the assertive re-execution with stored, healthy reference values [16], has been investigated. Approaches using algorithm-based fault tolerance (ABFT) [17] seek to protect networks against soft errors by checking invariants that are characteristic for a specific operation (e.g., matrix multiplication). Symptom-based error detection may for example include the interpretation of feature map traces by a secondary companion network [18]. The restriction of intermediate ranges was explored [6, 12] in the form of modified (layer-insensitive) activation functions such as tanh or ReLU6. This concept was extended to find specific uniform protection thresholds for neuron faults [8] or clipping bounds for weight faults [7]. An alternative line of research is centered around fault-aware retraining [19].

3 Experimental setup

3.1 Models, data sets, and system

CNNs are the most commonly used network variant for computer vision tasks such as object classification and detection. We compare the three standard classifier CNNs ResNet-50 [20], VGG-16 [21], and AlexNet [22] together with the test data set ImageNet [23], and use MIOVision [24] for the investigation of a safety-critical example use case. Since fault injection is compute-intensive, we rescale our test data set for ImageNet to a subset of 1000 images representing 20 randomly selected classes. For MIOVision, a subset of 1100 images (100 per class) that were correctly classified in the absence of faults was chosen. All experiments adopt a single-precision floating point format (FP32) according to the IEEE 754 standard [25]. Our conclusions apply as well to other floating point formats with the same number of exponent bits, such as BF16 [26], since no relevant effect was observed from fault injections in mantissa bits (Sec. 4).

Experiments were performed in PyTorch (version 1.8.0) deploying torchvision models (version 0.9.0). For MIOVision, the ResNet-50 model was retrained [27]. We used Intel® Core™ i9 CPUs, with inferences running on GeForce RTX 2080, Titan RTX, and RTX 3090 GPUs.

3.2 Protection layers and bound extraction

We insert protection layers at strategic positions in the network, such as after activation, pooling, reshape, or concatenate layers, according to the model of Chen et al. [8]. Each protection layer requires specific bound values for the expected activation ranges as a parameter. We extract those by monitoring the minimal and maximal activations from a separate test input, which is taken from the training data sets of ImageNet (143K images used) and MIOVision (83K images used), respectively. This step has to be performed only once. Bound extraction depends on the data set and will in general impact the safety argument (see Fig. 1). To check the suitability of the bounds, we verify that no out-of-bound events were detected during the test phase in the absence of faults, so the baseline accuracy is the same with and without protection. While all minimum bounds are zero in the studied setup, the maximum activation values for ImageNet vary by layer in a range of 1 < Tup < 45 for ResNet-50, 20 < Tup < 360 for VGG-16, and 65 < Tup < 170 for AlexNet (see also Sec. 5). For MIOVision and ResNet-50, we find maximum bounds between 1 < Tup < 19.
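This monitoring step can be sketched with standard PyTorch forward hooks (a simplified illustration; the monitored layer types and the naming are placeholder choices, not our exact procedure):

    import torch
    import torch.nn as nn

    @torch.no_grad()
    def extract_bounds(model: nn.Module, loader, monitored=(nn.ReLU, nn.MaxPool2d)):
        """Record global min/max activations per monitored layer over a data set."""
        bounds = {}

        def make_hook(name):
            def hook(module, inputs, output):
                lo, hi = output.min().item(), output.max().item()
                old_lo, old_hi = bounds.get(name, (float("inf"), float("-inf")))
                bounds[name] = (min(old_lo, lo), max(old_hi, hi))
            return hook

        handles = [m.register_forward_hook(make_hook(n))
                   for n, m in model.named_modules() if isinstance(m, monitored)]
        model.eval()
        for images, _ in loader:  # e.g., a DataLoader over the bound-extraction set
            model(images)
        for h in handles:
            h.remove()
        return bounds  # {layer name: (T_low, T_up)}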
3.3 Fault model and injection

In line with previous investigations, we distinguish two different manifestations of memory bit flips, referred to here as weight faults and neuron faults. The former represent soft errors affecting memory elements that store the learned network parameters, while the latter refer to errors in memory that holds temporary states such as intermediate network layer outputs. While neuron faults may also impact states used for logical instructions, it was demonstrated that bit flip injections in the output of the affected layer are generally a good model approximation [28]. Memory elements can be protected against single bit flips by mechanisms such as parity and ECC [5, 13]. However, this kind of protection is not always available due to the associated compute and area overhead. Further, ECC typically cannot correct multi-bit flips.

We inject faults either directly in the weights of CNN layers (weight faults) or in the output of the latter (neuron faults), using a customized fault injection framework based on PytorchFI [29]. To speed up the experiments, we focus on bit flips in the most relevant bit positions 0−8 (sign bit and exponential bits, neglecting mantissa) unless stated otherwise. Fault locations (i.e., layer index, kernel index, channel, etc.) in the network are randomly chosen with an equal weight, i.e., without further constraints on the selection process, to reflect the arbitrary occurrence of soft errors. As weights are typically stored in the main memory and loaded only once for a given application, we keep the same weight faults for one entire epoch, running all tested input images. In total, we run 500 epochs, i.e., fault configurations, each one applied to 1K images. Neuron faults, on the other hand, apply to memory representing temporary states that are overwritten for each new input. Therefore, we inject new neuron faults for each new input and run 100 epochs, resulting in 100K fault configurations, each one applied to a single image.
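For illustration, a single bit flip in an FP32 weight can be emulated by reinterpreting the value as an integer and toggling one bit (a simplified, self-contained sketch; our experiments use the customized PytorchFI-based framework mentioned above):

    import torch

    @torch.no_grad()
    def flip_weight_bit(weight: torch.Tensor, index: tuple, bit: int):
        """Flip one bit (0 = sign, 1-8 = exponent bits) of a single FP32 weight in place."""
        assert weight.dtype == torch.float32 and 0 <= bit <= 31
        as_int = weight.view(torch.int32)  # reinterpret the bits, shares memory
        # Build the mask in int64 first so bit 0 (value 2^31) does not overflow int32.
        mask = torch.tensor(1 << (31 - bit), dtype=torch.int64).to(torch.int32)
        as_int[index] = as_int[index] ^ mask  # XOR toggles the selected bit

    # Example: flip the most significant exponent bit (position 1) of one conv2d weight.
    conv = torch.nn.Conv2d(3, 8, 3)
    flip_weight_bit(conv.weight, (0, 0, 0, 0), bit=1)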
3.4 Evaluation

To quantify the impact of faults on the system safety, we measure the rate of SDC events. Throughout, we consider the Top-1 prediction to determine SDC. In line with previous work [6, 8], SDC is defined as the ratio of the number of images that are misclassified in the presence of faults (without exceptions) but correctly classified in the absence of faults, to the overall number of images, p(sdc) = N_incorrect/N_test,correct (Fig. 3).

[Figure 3 shows the event filtering as a funnel: no faults → inject faults → faults → DUE → SDC.]

Figure 3: Illustration of SDC and DUE events. Errors are detected or missed in the case of out-of-bound (oob) or in-bound (ib) events, respectively. (Green) Samples of the data set that form the subset of a given filtering stage, (Yellow) samples of the data set that are discarded at the given stage, (White) samples that were filtered out at a previous stage.

During the forward pass, non-numerical exceptions in the form of Inf and NaN values can be encountered, due to the following reasons: i) Inf values occur if large activation values accumulate (for example during conv2d, linear, or avgpool2d operations) until they exceed the maximum of the data representation. This effect becomes particularly apparent when flips of the most significant bit (MSB, position index 1) are injected. ii) NaN values are found when denominators are undetermined or multiple Inf values get added, e.g., in BatchNorm2d layers. iii) NaN values can be generated directly via bit flips in conv2d layers, due to the fact that FP32 encodes NaN as all eight exponent bits being in state "1". In the studied classifier networks, the latter effect is very rare for single bit flips in weights (see Sec. 4), but not necessarily for single neuron bit flips or multiple flips of either type.

The creation of the above exceptions is found to differ slightly between CPU and GPU executions, as well as between experiments with different batch sizes on the accelerator. We attribute this observation to algorithmic optimizations on the GPU that are not necessarily IEEE 754-compliant and thus affect the floating point precision [30]. To mitigate the effect of exception handling, we monitor the occurrences of Inf and NaN in the output of any network layer. All forward passes with an exception are separated and define the detected uncorrectable error (DUE) rate, p(due) = N_exceptions/N_test,correct, see Fig. 3.

In a real system, DUE events can be readily monitored, and the execution is typically halted on detection. However, due to the non-numerical nature of these errors, we cannot apply the same mitigation strategy that is adopted for SDC events. We therefore make the assumption that either a fallback system (e.g., alternative classifier, emergency stop of vehicle, etc.) can be leveraged or a timely re-execution is possible to recover from transient DUE events. This in turn assumes that DUEs do not impact the system safety but may compromise the system availability when occurring frequently.

4 Error detection coverage

To effectively protect the network against faults, we first verify the error detection coverage for silent errors. Those errors are detected by a given protection layer if the activation values exceed (fall short of) the upper (lower) bound. If at least one protection layer is triggered per inference run, we register an out-of-bound (oob) event. Otherwise, we have an in-bound (ib) event. In addition, we quantify the probabilities of SDC and regular correct classification (cl) events, as well as the respective conditional probabilities that correct and incorrect classifications occur given that oob or ib events were detected. This allows us to define true positive (Tp), false positive (Fp), and false negative (Fn) SDC detection rates as

    Tp = p(sdc|oob) · p(oob),
    Fp = p(cl|oob) · p(oob),                                                   (2)
    Fn = p(sdc|ib) · p(ib).

The fault detector then is characterized by precision, P = Tp/(Tp + Fp), and recall, R = Tp/(Tp + Fn).

Table 1 displays the chances of oob and sdc events resulting from a single fault per image in the absence of range protection. For weight faults, we find that all three CNNs showcase a high correlation between oob situations and either SDC or DUE events (p(sdc|oob) + p(due|oob) > 0.99), which can be associated with the chance of a successful error detection, P_detection (see Eq. 1). The chance of finding SDC after ib events is very small (< 1e−3), leading to a very high precision and recall performance (> 0.99). For neuron faults, while the recall remains very high, the precision is reduced (in particular for VGG-16 and AlexNet) due to additional Fp events where non-MSB oob events still get classified correctly.

                   Weight faults       Neuron faults
ResNet-50:
  p(sdc)           0.018 ± 0.001       0.013 ± 6e−4
  p(oob)           0.019 ± 0.001       0.013 ± 6e−4
  p(sdc|oob)       0.981 ± 0.008       0.974 ± 0.008
  p(sdc|ib)        5e−5 ± 4e−5         0.0 ± 0.0
  p(MSB|sdc)       0.998 ± 0.002       0.961 ± 0.012
  P                0.997 ± 0.002       0.980 ± 0.006
  R                0.997 ± 0.002       1.0 ± 0.0
  p(due)           3e−4 ± 1e−4         5e−4 ± 1e−4
  p(due|oob)       0.016 ± 0.008       0.006 ± 0.005
  p(MSB|due)       1.0 ± 0.0           1.0 ± 0.0
VGG-16:
  p(sdc)           0.024 ± 0.001       0.016 ± 9e−4
  p(oob)           0.027 ± 0.001       0.020 ± 0.001
  p(sdc|oob)       0.893 ± 0.010       0.778 ± 0.016
  p(sdc|ib)        7e−5 ± 7e−5         0.0 ± 0.0
  p(MSB|sdc)       0.997 ± 0.003       0.397 ± 0.017
  P                0.999 ± 0.001       0.820 ± 0.014
  R                0.997 ± 0.003       1.0 ± 0.0
  p(due)           0.003 ± 4e−4        0.006 ± 4e−4
  p(due|oob)       0.106 ± 0.011       0.051 ± 0.012
  p(MSB|due)       1.0 ± 0.0           1.0 ± 0.0
AlexNet:
  p(sdc)           0.022 ± 0.001       0.013 ± 0.001
  p(oob)           0.024 ± 0.001       0.015 ± 0.001
  p(sdc|oob)       0.907 ± 0.012       0.877 ± 0.023
  p(sdc|ib)        2e−4 ± 1e−4         9e−5 ± 5e−5
  p(MSB|sdc)       0.995 ± 0.003       0.245 ± 0.031
  P                1.0 ± 0.0           0.913 ± 0.025
  R                0.989 ± 0.005       0.994 ± 0.004
  p(due)           0.003 ± 3e−4        0.005 ± 3e−4
  p(due|oob)       0.093 ± 0.012       0.040 ± 0.011
  p(MSB|due)       1.0 ± 0.0           1.0 ± 0.0

Table 1: Statistical absolute and conditional probabilities of SDC and DUE events, and the related precision and recall of the fault detector. Experiments of 10K fault injections were repeated 10 times, where a single fault per image was injected in any of the 32 bits for each image (from ImageNet, using a batch size of one). We further list what proportion of SDC or DUE events were caused by MSB flips.

We further verify that SDC events from single weight faults are attributed almost exclusively to flips of the MSB. This can be explained with the distribution of parameters in the studied networks (Fig. 4). The weight values are closely centered around zero, and thus exhibit characteristic properties when represented in an eight-exponent data format. In the fault-free case, the MSB always has state "0", while the exponent bits 2 to 4 are almost always in state "1". This means that among the relevant exponential bits all single bit flips of the MSB will produce large values, while those of the other exponential bits will either be from "1" → "0" or will be too small to have a significant effect.

Figure 4: Bit distribution across all weight parameters in conv2d layers. Values are represented in FP32, where only the sign bit (0) and the exponent bits (1−8) are shown.

For neuron faults, on the other hand, the distribution of fault-free values is input-dependent and broader, leading in general to a smaller quota of MSB flips among SDC events, in favor of flips of other exponential bits and the sign bit. No SDC due to mantissa bit flips was observed for either weight or neuron faults. DUE events are unlikely (< 0.01) for a single bit flip, as there are not multiple large values to add up. Further, network weights are usually < 1, meaning that at least two exponent bits are in state "0", and hence at least two bit flips are needed to directly generate a NaN value.
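Given per-image out-of-bound flags and SDC outcomes, the detection metrics of Eq. (2) follow directly (a sketch with illustrative variable names; the inputs are boolean arrays over the faulty runs of originally correctly classified images):

    import numpy as np

    def detector_metrics(oob: np.ndarray, sdc: np.ndarray):
        """oob[i]: any protection layer triggered in run i; sdc[i]: run i misclassified."""
        tp = np.mean(sdc & oob)    # Tp = p(sdc|oob) * p(oob)
        fp = np.mean(~sdc & oob)   # Fp = p(cl|oob)  * p(oob)
        fn = np.mean(sdc & ~oob)   # Fn = p(sdc|ib)  * p(ib)
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return precision, recall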
5 Range restriction methods for error mitigation

5.1 Model

We refer to a subset of the tensor given by a specific index in the batch and channel dimensions as a 2D feature map, denoted by f. Let x be an activation value from a given feature map tensor f ∈ {f1, f2, ..., f_Cout}. Further, Tup and Tlow denote the upper and lower activation bounds assigned to the protection layer, respectively.

Ranger: For a given set of (f, Tup, Tlow), Ranger [8] maps out-of-bound values to the expected interval (see Fig. 2c),

    r_ranger(x) =  Tup     if x > Tup,
                   Tlow    if x < Tlow,                                        (3)
                   x       otherwise.

Clipper: In a similar way, clipping truncates activations that are out of bound to zero [7],

    r_clipping(x) =  0     if x > Tup or x < Tlow,                             (4)
                     x     otherwise.

The intuition is that it can be favorable to eliminate corrupted elements rather than to re-establish finite activations.

FmapRescale: While uniform restriction methods help in eliminating large out-of-bound values, the information encoded in relative differences of activation magnitudes is lost when all out-of-bound values are flattened to the same value. The idea of rescaling is to linearly map all large out-of-bound values back onto the interval [Tlow, Tup], implying that smaller out-of-bound values are reduced more. This follows the intuition that the out-of-bound values can originate from the entire spectrum of in-bound values.

    r_rescale(x) =  (x − min(f))(Tup − Tlow)/(max(f) − min(f)) + Tlow   if x > Tup,
                    Tlow                                                 if x < Tlow,   (5)
                    x                                                    otherwise.

Backflip: We analyze the underlying bit flips that may have caused out-of-bound values. This reasoning holds for neuron faults, where we may assume that a specific activation value is bit-flipped directly. For weight faults, on the other hand, the observed out-of-bound output activation is the result of a multiply-and-accumulate operation of an input tensor with a bit-flipped weight value. However, we argue that the presented back-flip operation will recover a representative product, given that the input component is of the order of magnitude of one. To restore a flipped value, we distinguish the following cases:

    r_backflip(x) =  0      if x > Tup · 2^64,
                     2      if Tup · 2^64 > x > Tup · 2,
                     Tup    if Tup · 2 > x > Tup,                              (6)
                     Tlow   if x < Tlow,
                     x      otherwise.

The above thresholds are motivated by the following logic: Given appropriate bounds, an activation is < Tup before a bit flip. Any flip of an exponential bit i ∈ {1...8} from "0" to "1" effectively multiplies the value by a factor of 2^(2^(8−i)). Hence, any value beyond Tup · 2^64 must have originated from a flip "0" → "1" of the MSB, meaning that the original value was between 0 and 2. We then set back all out-of-bound values in this regime to zero, assuming that lower reset values represent a more conservative choice in eliminating faults. Next, flipped values that are between Tup · 2^64 > x > Tup · 2 can possibly originate from a flip of any exponential bit. Given that Tup is typically > 1, a bit flip has to produce a corrupted absolute value > 2 in this regime. This is possible only if either the MSB is flipped from "0" → "1", or the MSB is already at "1" and another exponential bit is flipped "0" → "1". In all variants of the latter case, the original value had to be already > 2 itself, and hence we conservatively reset out-of-bound values to 2. Finally, corrupted values of Tup · 2 > x > Tup may originate from any non-sign bit flip. Lower exponential or even fraction bit flips result from already large values close to Tup in this case, which is why we set back those values to the upper bound. As in Ranger, values that are too small are reset to Tlow.

FmapAvg: The last proposed range restriction technique uses the remaining, healthy fmaps of a convolutional layer to reconstruct a corrupted fmap. The intuition behind this approach is as follows: Every filter in a given conv2d layer tries to establish characteristic features of the input image. Typically, there is a certain redundancy in the topology of fmaps, since not all features the network was trained to recognize may be strongly pronounced for a given image (instead, mixtures of potential features may form), or because multiple features resemble each other at the given processing stage. Therefore, replacing a corrupted fmap with a non-corrupted fmap from a different kernel can help to obtain an estimate of the original topology. We average all healthy (i.e., not containing out-of-bound activations) fmaps by

    ind = {i = 1...Cout | max(fi) ≤ Tup, min(fi) ≥ Tlow},
    favg = (1/|ind|) Σ_{i ∈ ind} fi.                                           (7)

If there are no healthy feature maps, favg will be the zero tensor. Subsequently, we replace oob values in a corrupted fmap with their counterparts from the estimate of Eq. (7),

    r_favg(x) =  favg(x)   if x > Tup or x < Tlow,                             (8)
                 x         otherwise.
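These restriction rules can be expressed as element-wise tensor operations. The following PyTorch sketch of Eqs. (4), (6), (7), and (8) is illustrative only; Ranger, Eq. (3), corresponds to a single torch.clamp call as in the sketch of Sec. 1:

    import torch

    def clipper(x, t_low, t_up):  # Eq. (4): truncate out-of-bound values to zero
        oob = (x > t_up) | (x < t_low)
        return x.masked_fill(oob, 0.0)

    def backflip(x, t_low, t_up):  # Eq. (6): reset according to the implied bit flip
        y = torch.where(x > t_up * 2.0**64, torch.zeros_like(x), x)
        y = torch.where((x <= t_up * 2.0**64) & (x > 2.0 * t_up),
                        torch.full_like(x, 2.0), y)
        y = torch.where((x <= 2.0 * t_up) & (x > t_up), torch.full_like(x, t_up), y)
        return torch.where(x < t_low, torch.full_like(x, t_low), y)

    def fmap_avg(x, t_low, t_up):  # Eqs. (7) and (8); x has shape (N, C, H, W)
        per_map_max = x.amax(dim=(2, 3), keepdim=True)
        per_map_min = x.amin(dim=(2, 3), keepdim=True)
        healthy = (per_map_max <= t_up) & (per_map_min >= t_low)  # (N, C, 1, 1)
        n_healthy = healthy.sum(dim=1, keepdim=True).clamp(min=1)
        # Average of the healthy fmaps; the zero tensor if no fmap is healthy.
        f_avg = (x * healthy).sum(dim=1, keepdim=True) / n_healthy
        oob = (x > t_up) | (x < t_low)
        return torch.where(oob, f_avg.expand_as(x), x)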
[Figure 5 shows bar charts of the SDC rates (in %) of ResNet-50, VGG-16, and AlexNet under 1 and 10 fault injections, comparing No_protection, Ranger, Clipper, BackFlip, FmapAvg, and FmapRescale, with panels (a) weight faults and (b) neuron faults.]

Figure 5: SDC rates for weight (a) and neuron (b) faults using different range supervision techniques. Note that compared to Tab. 1 the rates are around 4× higher, since we inject only in the bits 0−8 here.

5.2 Results

In Fig. 5 we present results for the SDC mitigation experiments with different range supervision methods. Comparing 1 and 10 fault injections per input image, we note that the unprotected models are dramatically corrupted with an increasing fault rate (the SDC rate becomes ≥ 0.50 for weights and ≥ 0.32 for neurons in the presence of 10 faults). We can associate the SDC rate with the chance of unsuccessful mitigation, 1 − P_mitigation, in Eq. 1. Weight faults have a higher impact than neuron faults since they directly corrupt a multitude of activations in a layer's fmap output (in contrast to individual activations for neuron faults) and thus propagate faster than neuron faults.

All the studied range restriction methods reduce the SDC rate by a significant margin, but perform differently for weight and neuron fault types: For weight faults, we observe that Clipper, Backflip, and FmapAvg are highly efficient in all three networks, with SDC rates suppressed to values of ≲ 0.01 (an SDC reduction of > 50×). Ranger provides a much weaker protection, in particular in the more shallow networks VGG-16 and AlexNet. FmapRescale performs better than Ranger but worse than the aforementioned methods. The deepest studied network, ResNet-50, benefits the most from any type of range restriction in the presence of weight faults.
When it comes to neuron faults (Fig. 5b), we see that Clipper and Backflip provide the best protection (the SDC rate is suppressed to < 0.005, a reduction of > 38×), followed by the also very effective Ranger (except for AlexNet). FmapAvg appears to be less efficient for higher fault rates in this scenario, while FmapRescale again falls behind all the above. Overall, we conclude that the pruning-inspired mitigation techniques Clipper and Backflip represent the best choices among the investigated range supervision methods, as they succeed in mitigating both weight and neuron faults to very small residual SDC rates.

In the experiments of Fig. 5, the encountered DUE rates for 1 weight or neuron fault (0.003 for ResNet-50, 0.03 for VGG-16 or AlexNet) are only slightly reduced by range restrictions. However, for a fault rate of 10 we find the following trends: i) For weights, the DUE rate is significantly reduced in ResNet-50 (from 0.15 to 0.002), while the rates in VGG-16 (0.22) and AlexNet (0.26) remain. ii) For neurons, Ranger, Clipper, and Backflip suppress the DUE rate by a factor of up to 2× in all networks.

The studied range restriction techniques require different compute costs due to the different number of additional graph operations. In PyTorch, though, not all needed functions can be implemented with the same efficiency. For example, Ranger is executed with a single clamp operation, while no equivalent formulation is available for Clipper, and instead three operations are necessary (two masks to select oob values greater and smaller than the threshold, and a masked-fill operation to clip to zero). As a consequence, measured latencies are framework-dependent and a fair comparison cannot be made at this point. Given the complexity of the protection operations, we may instead give a qualitative performance ranking of the described methods: FmapRescale appears to be the most expensive restriction method due to the needed number of operations, followed by FmapAvg and Backflip. Clipper and Ranger are the least complex, with the latter outperforming the former in the used framework, due to its more efficient use of optimized built-in operations.

6 Analysis of traffic camera use case

As a selected safety-critical use case, we study object classification in the presence of soft errors with a retrained ResNet-50 and the MIOVision data set [24]. The data contains images of 11 classes, including for example pedestrian, bike, car, or background, that were taken by traffic cameras. The correct identification of an object type or category can be safety-critical, for example, to an automated vehicle that uses the support of infrastructure sensors for augmented perception [31]. However, not every class confusion is equally harmful.

To estimate the severity of an error-induced misclassification, we establish three clusters of vulnerable as well as non-vulnerable road users (VRU or non-VRU), and background, see Fig. 6. Misclassifications that lead to the prediction of a class in a less vulnerable cluster are assumed to be safety-critical (Severity ≈ 1 in Eq. 1, e.g., a pedestrian is misclassified as background), while confusions within the same cluster or towards a more vulnerable cluster are considered non-critical (Severity ≈ 0), as they typically lead only to a similar or a more cautious behavior. This binary estimation allows us to quantify the overall risk as the portion of SDC events associated with the respective critical class confusions.

[Figure 6 shows the three clusters: the VRU cluster (Pedestrian, Bicycle, Motorcycle), the non-VRU cluster (Articulated truck, Bus, Car, Pickup truck, Single-unit truck, Work van, Non-motorized vehicle), and the background cluster. Confusions towards a less vulnerable cluster are marked as safety-critical, all others as non-critical.]

Figure 6: Formation of class clusters in MIOVision (VRU denotes vulnerable road user). We make the assumption here that confusions towards less vulnerable clusters are the most safety-critical ones.
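Under the cluster assignment of Fig. 6, the binary severity reduces to a lookup of vulnerability levels. The following sketch illustrates the criticality rule with class names as reconstructed from Fig. 6; it is an illustration, not our actual evaluation code:

    # Vulnerability level per cluster of Fig. 6 (higher = more vulnerable).
    CLUSTER_LEVEL = {
        "pedestrian": 2, "bicycle": 2, "motorcycle": 2,                   # VRU
        "articulated_truck": 1, "bus": 1, "car": 1, "pickup_truck": 1,   # non-VRU
        "single_unit_truck": 1, "work_van": 1, "non_motorized_vehicle": 1,
        "background": 0,
    }

    def is_critical(true_class: str, predicted_class: str) -> bool:
        # Critical (Severity ~ 1): the prediction lands in a less vulnerable cluster.
        return CLUSTER_LEVEL[predicted_class] < CLUSTER_LEVEL[true_class]

    assert is_critical("pedestrian", "background")   # VRU -> background: critical
    assert not is_critical("car", "pedestrian")      # towards more vulnerable: non-critical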
[Figure 7 shows bar charts of the SDC rate (in %) for 1 and 10 faults, comparing No_protection, Ranger, Clipper, FmapRescale, Backflip, and FmapAvg, with panels (a) weight faults and (b) neuron faults.]

Figure 7: SDC rates for ResNet-50 and MIOVision. We inject 1 and 10 faults targeting bits 0−8 in the network weights (a) and neurons (b). The portion of safety-critical SDC events according to Fig. 6 is displayed as a dark-color overlay.

From our results in Fig. 7 we make the following observations: i) The relative proportion of critical confusions is lower for weight than for neuron faults in the unprotected and most protected models. For weight faults, the most frequent confusions are from other classes to the class "car" (the most robust class of MIOVision, with the most images in the training set), which are statistically mostly non-critical. Neuron faults, on the other hand, distort feature maps in a way that induces with the highest frequency misclassifications towards the class "background". Those events are all safety-critical (see Fig. 6), leading to a high critical-to-total SDC ratio. ii) Range supervision is not only effective in reducing the overall SDC count, but also suppresses the critical SDC count proportionally. For example, we observe that the most frequent critical class confusion caused by 1 or 10 weight faults is from the class "pedestrian" to "car" (≈ 0.2 of all critical SDC cases), where > 0.99 of those cases can be mitigated by Clipper or Backflip. For neuron faults, the largest critical SDC contribution is from "pedestrian" to "background" (1 fault) or "car" to "background" (10 faults), both in about 0.1 of all critical SDC cases. Clipper or Backflip are able to suppress > 0.91 of those events.

As a consequence, all studied range-restricted models exhibit a critical-to-total SDC ratio that is similar to or lower than that of the unprotected network (< 0.41 for weight, < 0.78 for neuron faults), meaning that faults in the presence of range supervision have on average a similar or lower severity than faults that do not face range restrictions. A lower ratio can be interpreted as a better preservation of the feature map topology: If the reconstructed features are more similar to the original features, there is a higher chance of the incorrect class being similar to the original class and thus staying within the same cluster. The total probability of critical SDC events – and therefore the relative risk according to Eq. 1 – is negligible in the studied setup in the presence of Clipper or Backflip range protection.

The mean DUE rates in the unprotected model are 0.0 (0.02) for 1 weight (neuron) fault and 0.11 (0.17) for 10 faults. Using any of the protection methods, the system's availability increases, as DUE rates are negligible for 1 fault and reduce to < 0.03 (< 0.05) for 10 weight (neuron) faults.

7 Conclusion

In this paper, we investigated the efficacy of range supervision techniques for constructing a safety case for computer vision AI applications that use convolutional neural networks (CNNs) in the presence of platform soft errors. In the given experimental setup, we demonstrated that the implementation of activation bounds allows for a highly efficient detection of SDC-inducing faults, most importantly featuring a recall of > 0.99. Furthermore, we found that the range restriction layers can mitigate the once-detected faults effectively by mapping out-of-bound values back to the expected intervals. Exploring distinct restriction methods, we observed that Clipper and Backflip perform best for both weight and neuron faults and can reduce the residual SDC rate to ≲ 0.01 (a reduction by a factor of > 38×). Finally, we studied the selected use case of vehicle classification to quantify the impact of range restriction on the severity of SDC events (represented by cluster-wise class confusions). All discussed techniques reduce critical and non-critical events proportionally, meaning that the average severity of SDC is not increased. Therefore, we conclude that the presented approach reduces the overall risk and thus enhances the safety of the user in the presence of platform soft errors.

Acknowledgment

Our research was partially funded by the Federal Ministry of Transport and Digital Infrastructure of Germany in the project Providentia++ (01MM19008). Further, this research was partially supported by a grant from the Natural Sciences and Engineering Research Council of Canada (NSERC), and a research gift from Intel to UBC.

References

[1] International Organization for Standardization, "ISO 26262," Tech. Rep., 2018. [Online]. Available: https://www.iso.org/standard/68383.html
[2] ——, "Road vehicles - Safety of the intended functionality," Tech. Rep., 2019. [Online]. Available: https://www.iso.org/standard/70939.html
[3] J. Athavale, A. Baldovin, R. Graefe, M. Paulitsch, and R. Rosales, "AI and Reliability Trends in Safety-Critical Autonomous Systems on Ground and Air," in Proceedings - 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN-W 2020, pp. 74–77, 2020.
[4] H. D. Dixit, S. Pendharkar, M. Beadon, C. Mason, T. Chakravarthy, B. Muthiah, and S. Sankar, "Silent Data Corruptions at Scale," Association for Computing Machinery, 2021, vol. 1, no. 1. [Online]. Available: arxiv.org/abs/2102.11245
[5] A. Neale and M. Sachdev, "Neutron Radiation Induced Soft Error Rates for an Adjacent-ECC Protected SRAM in 28 nm CMOS," IEEE Transactions on Nuclear Science, vol. 63, no. 3, pp. 1912–1917, 2016.
[6] G. Li, S. K. S. Hari, M. Sullivan, T. Tsai, K. Pattabiraman, J. Emer, and S. W. Keckler, "Understanding error propagation in Deep Learning Neural Network (DNN) accelerators and applications," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2017, 2017.
[7] L.-H. Hoang, M. A. Hanif, and M. Shafique, "FT-ClipAct: Resilience Analysis of Deep Neural Networks and Improving their Fault Tolerance using Clipped Activation," 2019. [Online]. Available: https://arxiv.org/abs/1912.00941
[8] Z. Chen, G. Li, and K. Pattabiraman, "Ranger: Boosting Error Resilience of Deep Neural Networks through Range Restriction," 2020. [Online]. Available: https://arxiv.org/abs/2003.13874
[9] J. M. Cluzeau, X. Henriquel, G. Rebender, G. Soudain, L. van Dijk, A. Gronskiy, D. Haber, C. Perret-Gentil, and R. Polak, "Concepts of Design Assurance for Neural Networks (CoDANN)," Public Report Extract Version 1.0, pp. 1–104, 2020. [Online]. Available: https://www.easa.europa.eu/document-library/general-publications/concepts-design-assurance-neural-networks-codann
[10] P. Koopman and B. Osyk, "Safety argument considerations for public road testing of autonomous vehicles," SAE Technical Papers, vol. 2019-April, no. April, 2019.
[11] Z. Chen, G. Li, K. Pattabiraman, and N. Debardeleben, "BinFI: An efficient fault injector for safety-critical machine learning systems," in International Conference for High Performance Computing, Networking, Storage and Analysis, SC, 2019.
[12] S. Hong, P. Frigo, Y. Kaya, C. Giuffrida, and T. Dumitras, "Terminal brain damage: Exposing the graceless degradation in deep neural networks under hardware fault attacks," in Proceedings of the 28th USENIX Security Symposium, 2019.
[13] A. Lotfi, S. Hukerikar, K. Balasubramanian, P. Racunas, N. Saxena, R. Bramley, and Y. Huang, "Resiliency of automotive object detection networks on GPU architectures," in Proceedings - International Test Conference, vol. 2019-November, pp. 1–9, 2019.
[14] M. A. Hanif and M. Shafique, "SalvageDNN: Salvaging deep neural network accelerators with permanent faults through saliency-driven fault-aware mapping," Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 2020. [Online]. Available: https://royalsocietypublishing.org/doi/10.1098/rsta.2019.0164
[15] A. Mahmoud, S. K. Sastry Hari, C. W. Fletcher, S. V. Adve, C. Sakr, N. Shanbhag, P. Molchanov, M. B. Sullivan, T. Tsai, and S. W. Keckler, "HarDNN: Feature map vulnerability evaluation in CNNs," 2020.
[16] J. Ponader, S. Kundu, and Y. Solihin, "MILR: Mathematically Induced Layer Recovery for Plaintext Space Error Correction of CNNs," 2020. [Online]. Available: http://arxiv.org/abs/2010.14687
[17] K. Zhao, S. Di, S. Li, X. Liang, Y. Zhai, J. Chen, K. Ouyang, F. Cappello, and Z. Chen, "FT-CNN: Algorithm-Based Fault Tolerance for Convolutional Neural Networks," IEEE Transactions on Parallel and Distributed Systems, vol. 32, no. 7, pp. 1677–1689, 2021.
[18] C. Schorn, A. Guntoro, and G. Ascheid, "Efficient On-Line Error Detection and Mitigation for Deep Neural Network Accelerators," in SAFECOMP 2018, vol. 11093 LNCS, 2018.
[19] L. Yang and B. Murmann, "SRAM voltage scaling for energy-efficient convolutional neural networks," in Proceedings - International Symposium on Quality Electronic Design, ISQED. IEEE Computer Society, May 2017, pp. 7–12.
[20] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2016-December, pp. 770–778, 2016.
[21] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings, 2015.
[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, vol. 2, pp. 1097–1105, 2012.
[23] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[24] Z. Luo, F. B. Charron, C. Lemaire, J. Konrad, S. Li, A. Mishra, A. Achkar, J. Eichel, and P.-M. Jodoin, "MIO-TCD: A new benchmark dataset for vehicle classification and localization," IEEE Transactions on Image Processing, 2018.
[25] IEEE, "754-2019 - IEEE Standard for Floating-Point Arithmetic," Tech. Rep., 2019.
[26] Intel Corporation, "bfloat16 - Hardware Numerics Definition," Tech. Rep., 2018. [Online]. Available: https://software.intel.com/content/www/us/en/develop/download/bfloat16-hardware-numerics-definition.html
[27] R. Theagarajan, F. Pala, and B. Bhanu, "EDeN: Ensemble of Deep Networks for Vehicle Classification," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2017.
[28] C. K. Chang, S. Lym, N. Kelly, M. B. Sullivan, and M. Erez, "Evaluating and accelerating high-fidelity error injection for HPC," in Proceedings - International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018, pp. 577–589, 2019.
[29] A. Mahmoud, N. Aggarwal, A. Nobbe, J. R. Sanchez Vicarte, S. V. Adve, C. W. Fletcher, I. Frosio, and S. K. S. Hari, "PyTorchFI: A Runtime Perturbation Tool for DNNs," in DSN-DSML, 2020.
[30] Nvidia, "CUDA toolkit documentation," 2021. [Online]. Available: https://docs.nvidia.com/cuda/floating-point/index.html
[31] A. Krämmer, C. Schöller, D. Gulati, and A. Knoll, "Providentia - A large scale sensing system for the assistance of autonomous vehicles," arXiv, 2019. [Online]. Available: arxiv:1906.06789