=Paper=
{{Paper
|id=Vol-2916/paper_6
|storemode=property
|title=Towards a Safety Case for Hardware Fault Tolerance in Convolutional Neural
Networks Using Activation Range Supervision
|pdfUrl=https://ceur-ws.org/Vol-2916/paper_6.pdf
|volume=Vol-2916
|authors=Florian Geissler,Syed Qutub,Sayanta Roychowdhury,Ali Asgari,Yang Peng,Akash Dhamasia,Ralf Graefe,Karthik Pattabiraman,Michael Paulitsch
|dblpUrl=https://dblp.org/rec/conf/ijcai/GeisslerQRAPDGP21
}}
==Towards a Safety Case for Hardware Fault Tolerance in Convolutional Neural
Networks Using Activation Range Supervision==
Florian Geissler1,∗, Syed Qutub1, Sayanta Roychowdhury1, Ali Asgari2, Yang Peng1, Akash Dhamasia1, Ralf Graefe1, Karthik Pattabiraman2 and Michael Paulitsch1
1 Intel, Germany
2 University of British Columbia, Canada
Abstract

Convolutional neural networks (CNNs) have become an established part of numerous safety-critical computer vision applications, including human-robot interaction and automated driving. Real-world implementations will need to guarantee their robustness against hardware soft errors corrupting the underlying platform memory. Based on the previously observed efficacy of activation clipping techniques, we build a prototypical safety case for classifier CNNs by demonstrating that range supervision represents a highly reliable fault detector and mitigator with respect to relevant bit flips, adopting an eight-exponent floating point data representation. We further explore novel, non-uniform range restriction methods that effectively suppress the probability of silent data corruptions and uncorrectable errors. As a safety-relevant end-to-end use case, we showcase the benefit of our approach in a vehicle classification scenario, using ResNet-50 and the traffic camera data set MIOVision. The quantitative evidence provided in this work can be leveraged to inspire further and possibly more complex CNN safety arguments.

∗ Corresponding author. Email: florian.geissler@intel.com.
Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Figure 1: Structured safety argument for the fault tolerance of a CNN in the presence of soft errors, using range restrictions. The notation follows [10], including goals (G), context (C), and evidence (E). "Oob" denotes "out-of-bounds". The argument reads:
- G1: System is sufficiently safe in the presence of soft errors.
- C1: Operational design domain: inference of pretrained classifier networks with protection layers, input represented by given dataset.
- C2: An appropriate independent dataset for bound extraction exists.
- C3: "Sufficiently safe" is well defined by the end user and is proportional to the overall risk.
- C4: The chance of a soft error event to occur can be given.
- C5: The simulated weight/neuron fault model appropriately represents realistic soft errors.
- C6: The data representation has eight exponent bits (FP32, BF16).
- C7: A fallback system/re-execution can be used for uncorrectable errors.
- G2: System detects critical soft errors. E2a: SDC/DUE events appear in conjunction with oob events with a high conditional probability. E2b: Oob events are detected by threshold-based protection layers. E2c: DUE events can further be detected by NaN/Inf monitoring.
- G3: System mitigates soft errors. E3a: The probability of SDC/DUE events is significantly reduced by restricting oob activations in protection layers. E3b: DUE events can further be mitigated by referring to a fallback system or via re-execution.
- G4: System does not increase the error severity. E4a: DUE events can be handled with negligible risk for any error severity. E4b: The severity of residual SDC depends on the application. As an example, we study the scenario of MIOVision and ResNet50 and find that the average severity of errors is comparable or lower.

1 Motivation

With the widespread use of convolutional neural networks (CNNs) across many safety-critical domains such as automated robots and cars, one of the most prevailing challenges is the establishment of a safety certification for such artificial intelligence (AI) components, e.g., with respect to ISO 26262 [1] or ISO/PAS 21448 (SOTIF) [2]. This certification requires not only a high fault tolerance of the trained network against unknown or adversarial input, but also efficient protection against hardware faults of the underlying platform [3, 4]. Importantly, this includes transient soft errors, meaning disturbances originating from events such as cosmic neutron radiation, isotopes emitting alpha particles, or electromagnetic leakage on the computer circuitry itself.

Soft errors typically manifest as single or multiple bit upsets in the platform's memory elements [5]. As a consequence, network parameters (weight faults) or local computational states (neuron faults) can be altered during inference time and invalidate the network prediction in a safety-critical way, for example, by misclassifying a person as a background image in an automated driving context [6–8]. This has led to a search for strategies to verify CNN-based systems against hardware faults at the inference stage [9]. With chip technology nodes scaling to smaller sizes and larger memory density per area, future platforms are expected to be even more susceptible to soft errors [5].

In this paper, we evaluate range restriction techniques in CNNs exposed to platform soft errors with respect to the key elements of a prototypical safety case. This means that we formulate arguments (in the form of "goals") that constitute essential parts of a complete safety case, and provide quantitative evidence to support these goals in the studied context (see Fig. 1). Individual safety arguments can be reused as building blocks of more complex safety cases. The structure of our goals is based on the probabilistic, high-level safety
objective of minimizing the overall risk [10], expressed as:

    P_loss(i) = P_failure(i) · (1 − P_detection(i)) + (1 − P_mitigation(i)),
    Risk = Σ_i P_loss(i) · Severity(i).        (1)

Explicitly, for a fault type i, this includes the sub-goals of efficient error detection and mitigation, as well as a consideration of the fault severity in a given use case. On the other hand, the probability of occurrence of a soft error (i.e., P_failure in Eq. 1) is assumed to be a constant system property that cannot be controlled by run-time monitoring methods such as activation range supervision.

In a nutshell, range restriction builds on the observation that silent data corruption (SDC) and detected uncorrectable errors (DUE, e.g., NaN and Inf occurrences) stem primarily from those bit flips that cause very large values, for example in high exponential bits [6]. Those events result in large activation peaks that typically grow even more during forward propagation due to the monotonicity of most neural network operations [11]. To suppress the propagation of such corrupted values, additional range restriction layers are inserted in the network at strategic positions following the approach of Chen et al. [8] (see Fig. 2 for an example). At inference time, the protection layers then compare the intermediate activations against previously extracted interval thresholds in order to detect and reset anomalously large values. Derivative approaches have been shown to be efficient in recovering network performance [6–8, 12] and, advantageously, require neither the retraining of CNN parameters nor computationally expensive functional duplications.

Figure 2: Visualization example of the impact of a weight fault using LeNet-5 and the MNIST data set. Range restriction layers ("Ranger") are inserted following [8] (top left; layer sequence: Conv1, Relu, Ranger, MaxPool, Ranger, Conv2, Relu, Ranger, MaxPool, Ranger, Reshape, Ranger, FC1, Relu, Ranger, FC2, Relu, Ranger, FC3, Classes). The rows represent the feature maps (activation magnitudes) of the individual network layers after range restriction was applied, where linear layers (FC1-FC3) were reshaped to a 2D feature map as well for visualization purposes. In (b)-(g), a large weight fault value is injected in the second filter of the first convolutional layer. For the unprotected model (b), this leads to an SDC event ("0" gets changed to "7"). The columns (a) no protection without fault, (b) no protection, (c) Ranger, (d) Clipping, (e) Rescaling, (f) Backflip, and (g) Fmap average then illustrate the effect of the different investigated range restriction methods.

The focus of this paper is to examine alternative restriction schemes for optimized soft error mitigation. In a CNN, the output of every kernel is represented as a two-dimensional (2D) feature map, where the activation magnitudes encode specific features on which the network bases its prediction. Soft errors will manifest as distortions of feature maps in all subsequent layers that make use of the corrupted value, as shown in Fig. 2(a)-(b). The problem of mitigating soft errors in a CNN can therefore be rephrased as restoring the fault-free topology of feature maps.

Previous analyses have adopted uniform range restriction schemes that truncate out-of-bound values to a finite threshold [7, 8], e.g., Fig. 2(c)-(d). We instead follow the intuition that optimized, non-uniform range restriction methods that attempt to reconstruct feature maps (see Fig. 2(e)-(g), and details in Sec. 5) can not only reduce SDC to a comparable or even lower level, but may also lead to less critical misclassifications in the case of an SDC. This is because classes with more similar attributes will display more similar high-level features (e.g., pedestrian and biker will both exhibit an upright silhouette, in contrast to car and truck classes).

Finally, a safety analysis has to consider that not all SDC events pose an equal risk to the user. We study a safety-critical use case evaluating cluster-wise class confusions in a vehicle classification scenario (Sec. 6). The example shows that range supervision reduces the severe confusions proportionally with the overall number of confusions, meaning that the total risk is indeed mitigated.

In summary, this paper makes the following contributions:
• Fault detection: We quantify the correlation between SDC events and the occurrence of out-of-bound activations to demonstrate the high efficiency of fault detection by monitoring intermediate activations,
• Fault mitigation: We explore three novel range restriction methods that build on the preservation of the feature map topologies instead of mere value truncation,
• Fault severity: We demonstrate the benefit of range supervision in an end-to-end use case of vehicle classification where high and low severities are estimated by the generic safety-criticality of class confusions.

The article is structured as follows: Section 2 reviews relevant previous work, while Section 3 describes the setup used in this paper. Subsequently, Sections 4, 5, and 6 discuss error detection, mitigation, and an exemplary risk analysis, respectively, before Section 7 concludes the paper.
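The loss and risk terms of Eq. (1) can be written out in a few lines of code. The sketch below mirrors Eq. (1) as printed; the helper names and any example numbers are illustrative only, not values from this work:

```python
def p_loss(p_failure, p_detection, p_mitigation):
    # Eq. (1), first line: residual loss probability for one fault type i,
    # combining undetected failures and unmitigated faults.
    return p_failure * (1.0 - p_detection) + (1.0 - p_mitigation)

def risk(fault_types):
    # Eq. (1), second line: overall risk as the severity-weighted sum of
    # the loss probabilities over all fault types.
    return sum(
        p_loss(ft["p_failure"], ft["p_detection"], ft["p_mitigation"]) * ft["severity"]
        for ft in fault_types
    )
```

With perfect detection and mitigation (P_detection = P_mitigation = 1), the residual risk vanishes; range supervision targets the detection and mitigation factors, while P_failure is treated as a fixed system property.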
2 Related work

Parity or error-correcting code (ECC) can protect memory elements against single soft errors [5, 13]. However, due to the high compute and area overhead, this is typically done only for selected critical memory blocks. Component replication techniques such as triple modular redundancy can be used for the full CNN execution at the cost of a large overhead. Selective hardening of the hardware elements holding the most salient parameters can improve the robustness of program execution in the presence of underlying faults [6, 14]. On a software level, the estimation of the CNN's vulnerable feature maps (fmaps) and their selective protection by duplicated computations [15], or by assertive re-execution with stored, healthy reference values [16], has been investigated. Approaches using algorithm-based fault tolerance (ABFT) [17] seek to protect networks against soft errors by checking invariants that are characteristic for a specific operation (e.g., matrix multiplication). Symptom-based error detection may, for example, include the interpretation of feature map traces by a secondary companion network [18]. The restriction of intermediate ranges was explored [6, 12] in the form of modified (layer-insensitive) activation functions such as tanh or ReLU6. This concept was extended to find specific uniform protection thresholds for neuron faults [8] or clipping bounds for weight faults [7]. An alternative line of research is centered around fault-aware retraining [19].

3 Experimental setup

3.1 Models, data sets, and system

CNNs are the most commonly used network variant for computer vision tasks such as object classification and detection. We compare the three standard classifier CNNs ResNet-50 [20], VGG-16 [21], and AlexNet [22] together with the test data sets ImageNet [23] and MIOVision [24] for the investigation of a safety-critical example use case. Since fault injection is compute-intensive, we rescale our test data set for ImageNet to a subset of 1000 images representing 20 randomly selected classes. For MIOVision, a subset of 1100 images (100 per class) that were correctly classified in the absence of faults was chosen. All experiments adopt a single-precision floating point format (FP32) according to the IEEE 754 standard [25]. Our conclusions apply as well to other floating point formats with the same number of exponent bits, such as BF16 [26], since no relevant effect was observed from fault injections in mantissa bits (Sec. 4).

Experiments were performed in PyTorch (version 1.8.0) deploying torchvision models (version 0.9.0). For MIOVision, the ResNet-50 model was retrained [27]. We used Intel® Core™ i9 CPUs, with inferences running on GeForce RTX 2080, Titan RTX, and RTX 3090 GPUs.

3.2 Protection layers and bound extraction

We insert protection layers at strategic positions in the network, such as after activation, pooling, reshape, or concatenate layers, according to the model of Chen et al. [8]. Each protection layer requires specific bound values for the expected activation ranges as a parameter. We extract those by monitoring the minimal and maximal activations from a separate test input, which is taken from the training data sets of ImageNet (143K images used) and MIOVision (83K images used), respectively. This step has to be performed only once. Bound extraction depends on the data set and will in general impact the safety argument (see Fig. 1). To check the suitability of the bounds, we verify that no out-of-bound events were detected during the test phase in the absence of faults, so the baseline accuracy is the same with and without protection. While all minimum bounds are zero in the studied setup, the maximum activation values for ImageNet vary by layer in a range of 1 < Tup < 45 for ResNet-50, 20 < Tup < 360 for VGG-16, and 65 < Tup < 170 for AlexNet (see also Sec. 5). For MIOVision and ResNet-50, we find maximum bounds between 1 < Tup < 19.

Figure 3: Illustration of SDC and DUE events (stages: no faults, inject faults, faults, DUE, SDC). Errors are detected or missed in the case of out-of-bound (oob) or in-bound (ib) events, respectively. (Green) Samples of the data set that form the subset of a given filtering stage, (yellow) samples of the data set that are discarded at the given stage, (white) samples that were filtered out at a previous stage.

3.3 Fault model and injection

In line with previous investigations, we distinguish two different manifestations of memory bit flips, referred to here as weight faults and neuron faults. The former represent soft errors affecting memory elements that store the learned network parameters, while the latter refer to errors in memory that holds temporary states such as intermediate network layer outputs. While neuron faults may also impact states used for logical instructions, it was demonstrated that bit flip injections in the output of the affected layer are generally a good model approximation [28]. Memory elements can be protected against single bit flips by mechanisms such as parity and ECC [5, 13]. However, this kind of protection is not always available due to the associated compute and area overhead. Further, ECC typically cannot correct multi-bit flips.

We inject faults either directly in the weights of CNN layers (weight faults) or in the output of the latter (neuron faults), using a customized fault injection framework based on PytorchFI [29]. To speed up the experiments, we focus on bit flips in the most relevant bit positions 0−8 (sign bit and exponential bits, neglecting mantissa) unless stated otherwise.
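A single memory bit upset of this fault model can be emulated by toggling one bit of the FP32 representation. The helper below is an illustrative sketch (not the customized PyTorchFI-based framework used in the experiments); it shows why flips of the leading exponent bit are the dangerous ones:

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit of an FP32 value; bit 0 is the sign, bits 1-8 the
    exponent, bits 9-31 the mantissa (counted from the most significant)."""
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    as_int ^= 1 << (31 - bit)
    (flipped,) = struct.unpack("<f", struct.pack("<I", as_int))
    return flipped

# Flipping the MSB of the exponent (position 1) of a small value yields a
# huge activation, which range supervision is designed to catch:
assert flip_bit(0.5, 1) == 2.0 ** 127
```

A flip of exponent bit i scales the value by 2^(2^(8−i)), which is the magnitude hierarchy exploited by the Backflip heuristic in Sec. 5.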
Fault locations (i.e., layer index, kernel index, channel, etc.) in the network are randomly chosen with equal weight, i.e., without further constraints on the selection process, to reflect the arbitrary occurrence of soft errors. As weights are typically stored in the main memory and loaded only once for a given application, we keep the same weight faults for one entire epoch, running all tested input images. In total, we run 500 epochs, i.e., fault configurations, each one applied to 1K images. Neuron faults, on the other hand, apply to memory representing temporary states that are overwritten for each new input. Therefore, we inject new neuron faults for each new input and run 100 epochs, resulting in 100K fault configurations, each one applied to a single image.

3.4 Evaluation

To quantify the impact of faults on the system safety, we measure the rate of SDC events. Throughout, we consider the Top-1 prediction to determine SDC. In line with previous work [6, 8], SDC is defined as the ratio of the number of images that are misclassified in the presence of faults (without exceptions) but correctly classified in the absence of faults, and the overall number of images, p(sdc) = N_incorrect / N_test,correct (Fig. 3).

During the forward pass, non-numerical exceptions in the form of Inf and NaN values can be encountered, due to the following reasons: i) Inf values occur if large activation values accumulate (for example during conv2d, linear, avgpool2d operations) until they exceed the maximum of the data representation. This effect becomes particularly apparent when flips of the most significant bit (MSB, position index 1) are injected. ii) NaN values are found when denominators are undetermined or multiple Inf values get added, e.g., in BatchNorm2d layers. iii) NaN values can be generated directly via bit flips in conv2d layers, due to the fact that FP32 encodes NaN as all eight exponent bits being in state "1". In the studied classifier networks, the latter effect is very rare for single bit flips in weights (see Sec. 4) but not necessarily for single neuron bit flips or multiple flips of either type.

The creation of the above exceptions is found to differ slightly between CPU and GPU executions, as well as between experiments with different batch sizes on the accelerator. We attribute this observation to algorithmic optimizations on the GPU that are not necessarily IEEE754-compliant and thus affect the floating point precision [30]. To mitigate the effect of exception handling, we monitor the occurrences of Inf and NaN in the output of any network layer. All forward passes with an exception are separated and define the detected uncorrectable error (DUE) rate, p(due) = N_exceptions / N_test,correct, see Fig. 3.

In a real system, DUE events can be readily monitored, and the execution is typically halted on detection. However, due to the non-numerical nature of these errors, we cannot apply the same mitigation strategy that is adopted for SDC events. We therefore make the assumption that either a fallback system (e.g., an alternative classifier, emergency stop of the vehicle, etc.) can be leveraged or a timely re-execution is possible to recover from transient DUE events. This in turn assumes that DUEs do not impact the system safety but may compromise the system availability when occurring frequently.

Figure 4: Bit distribution across all weight parameters in conv2d layers. Values are represented in FP32, where only the sign bit (0) and the exponent bits (1−8) are shown.

4 Error detection coverage

To effectively protect the network against faults, we first verify the error detection coverage for silent errors. Those errors are detected by a given protection layer if the activation values exceed (fall short of) the upper (lower) bound. If at least one protection layer is triggered per inference run, we register an out-of-bound (oob) event. Otherwise, we have an in-bound (ib) event. In addition, we quantify the probabilities of SDC and regular correct classification (cl) events, as well as the respective conditional probabilities that correct and incorrect classifications occur given that oob or ib events were detected. This allows us to define true positive (Tp), false positive (Fp), and false negative (Fn) SDC detection rates as

    Tp = p(sdc|oob) · p(oob),
    Fp = p(cl|oob) · p(oob),        (2)
    Fn = p(sdc|ib) · p(ib).

The fault detector is then characterized by precision, P = Tp/(Tp + Fp), and recall, R = Tp/(Tp + Fn).

Tab. 1 displays the chances of oob and sdc events resulting from a single fault per image in the absence of range protection. For weight faults, we find that all three CNNs showcase a high correlation between oob situations and either SDC or DUE events (p(sdc|oob) + p(due|oob) > 0.99), which can be associated with the chance of a successful error detection, P_detection (see Eq. 1). The chance of finding SDC after ib events is very small (< 1e−3), leading to a very high precision and recall performance (> 0.99). For neuron faults, while the recall remains very high, the precision is reduced (in particular for VGG-16 and AlexNet) due to additional Fp events where non-MSB oob events still get classified correctly.

We further verify that SDC events from single weight faults are attributed almost exclusively to flips of the MSB. This can be explained with the distribution of parameters in the studied networks (Fig. 4). The weight values are closely centered around zero, and thus exhibit characteristic properties when
represented in an eight-exponent data format. In the fault-free case, the MSB always has state "0", while the exponent bits 2 to 4 are almost always in state "1". This means that among the relevant exponential bits, all single bit flips of the MSB will produce large values, while those of the other exponential bits will either be from "1" → "0" or will be too small to have a significant effect.

For neuron faults, on the other hand, the distribution of fault-free values is input-dependent and broader, leading in general to a smaller quota of MSB flips to SDC, in favor of flips of other exponential bits and the sign bit. No SDC due to mantissa bit flips was observed for either weight or neuron faults. DUE events are unlikely (< 0.01) for a single bit flip, as there are not multiple large values to add up. Further, network weights are usually < 1, meaning that at least two exponent bits are in state "0", and hence at least two bit flips are needed to directly generate a NaN value.

Table 1: Statistical absolute and conditional probabilities of SDC and DUE events and the related precision and recall of the fault detector. Experiments of 10K fault injections were repeated 10 times, where a single fault per image was injected in any of the 32 bits for each image (from ImageNet, using a batch size of one). We further list what proportion of SDC or DUE events were caused by MSB flips.

                  Weight faults       Neuron faults
ResNet-50:
p(sdc)            0.018 ± 0.001       0.013 ± 6e−4
p(oob)            0.019 ± 0.001       0.013 ± 6e−4
p(sdc|oob)        0.981 ± 0.008       0.974 ± 0.008
p(sdc|ib)         5e−5 ± 4e−5         0.0 ± 0.0
p(MSB|sdc)        0.998 ± 0.002       0.961 ± 0.012
P                 0.997 ± 0.002       0.980 ± 0.006
R                 0.997 ± 0.002       1.0 ± 0.0
p(due)            3e−4 ± 1e−4         5e−4 ± 1e−4
p(due|oob)        0.016 ± 0.008       0.006 ± 0.005
p(MSB|due)        1.0 ± 0.0           1.0 ± 0.0
VGG-16:
p(sdc)            0.024 ± 0.001       0.016 ± 9e−4
p(oob)            0.027 ± 0.001       0.020 ± 0.001
p(sdc|oob)        0.893 ± 0.010       0.778 ± 0.016
p(sdc|ib)         7e−5 ± 7e−5         0.0 ± 0.0
p(MSB|sdc)        0.997 ± 0.003       0.397 ± 0.017
P                 0.999 ± 0.001       0.820 ± 0.014
R                 0.997 ± 0.003       1.0 ± 0.0
p(due)            0.003 ± 4e−4        0.006 ± 4e−4
p(due|oob)        0.106 ± 0.011       0.051 ± 0.012
p(MSB|due)        1.0 ± 0.0           1.0 ± 0.0
AlexNet:
p(sdc)            0.022 ± 0.001       0.013 ± 0.001
p(oob)            0.024 ± 0.001       0.015 ± 0.001
p(sdc|oob)        0.907 ± 0.012       0.877 ± 0.023
p(sdc|ib)         2e−4 ± 1e−4         9e−5 ± 5e−5
p(MSB|sdc)        0.995 ± 0.003       0.245 ± 0.031
P                 1.0 ± 0.0           0.913 ± 0.025
R                 0.989 ± 0.005       0.994 ± 0.004
p(due)            0.003 ± 3e−4        0.005 ± 3e−4
p(due|oob)        0.093 ± 0.012       0.040 ± 0.011
p(MSB|due)        1.0 ± 0.0           1.0 ± 0.0

5 Range restriction methods for error mitigation

5.1 Model

We refer to a subset of the tensor given by a specific index in the batch and channel dimensions as a 2D feature map, denoted by f. Let x be an activation value from a given feature map tensor f ∈ {f1, f2, ..., f_Cout}. Further, Tup and Tlow denote the upper and lower activation bounds assigned to the protection layer, respectively.

Ranger: For a given set of (f, Tup, Tlow), Ranger [8] maps out-of-bound values to the expected interval (see Fig. 2c),

    r_ranger(x) = { Tup   if x > Tup,
                    Tlow  if x < Tlow,        (3)
                    x     otherwise.

Clipper: In a similar way, clipping truncates activations that are out of bound to zero [7],

    r_clipping(x) = { 0   if x > Tup or x < Tlow,        (4)
                      x   otherwise.

The intuition is that it can be favorable to eliminate corrupted elements rather than to re-establish finite activations.

FmapRescale: While uniform restriction methods help in eliminating large out-of-bound values, the information encoded in relative differences of activation magnitudes is lost when all out-of-bound values are flattened to the same value. The idea of rescaling is to linearly map all large out-of-bound values back onto the interval [Tlow, Tup], implying that smaller out-of-bound values are reduced more. This follows the intuition that the out-of-bound values can originate from the entire spectrum of in-bound values.

    r_rescale(x) = { (x − min(f))(Tup − Tlow) / (max(f) − min(f)) + Tlow  if x > Tup,
                     Tlow                                                 if x < Tlow,        (5)
                     x                                                    otherwise.

Backflip: We analyze the underlying bit flips that may have caused out-of-bound values. This reasoning holds for neuron faults, where we may assume that a specific activation value is bit-flipped directly. For weight faults, on the other hand, the observed out-of-bound output activation is the result of a multiply-and-accumulate operation of an input tensor with a bit-flipped weight value. However, we argue that the presented back-flip operation will recover a representative product, given that the input component is of the order of magnitude of one. To restore a flipped value, we distinguish the following cases:

    r_backflip(x) = { 0     if x > Tup · 2^64,
                      2     if Tup · 2^64 > x > Tup · 2,
                      Tup   if Tup · 2 > x > Tup,        (6)
                      Tlow  if x < Tlow,
                      x     otherwise.
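The restriction rules of Eqs. (3)-(6) reduce to element-wise tensor operations. The following NumPy sketch is a simplified illustration of these rules (the experiments in this paper insert them as PyTorch layers instead), applied to one feature map f with bounds (Tlow, Tup):

```python
import numpy as np

def ranger(f, t_low, t_up):
    # Eq. (3): saturate out-of-bound activations at the interval edges.
    return np.clip(f, t_low, t_up)

def clipper(f, t_low, t_up):
    # Eq. (4): zero out any out-of-bound activation.
    return np.where((f > t_up) | (f < t_low), 0.0, f)

def fmap_rescale(f, t_low, t_up):
    # Eq. (5): linearly map overshooting activations of the feature map f
    # back onto [t_low, t_up] (assumes max(f) > min(f)).
    rescaled = (f - f.min()) * (t_up - t_low) / (f.max() - f.min()) + t_low
    out = np.where(f > t_up, rescaled, f)
    return np.where(f < t_low, t_low, out)

def backflip(f, t_low, t_up):
    # Eq. (6): reset overshoots according to the exponent-bit flip that most
    # plausibly produced them (applied in increasing-threshold order).
    out = np.where(f > t_up, t_up, f)               # T_up < x <= 2*T_up
    out = np.where(f > 2.0 * t_up, 2.0, out)        # 2*T_up < x <= 2^64*T_up
    out = np.where(f > 2.0 ** 64 * t_up, 0.0, out)  # x > 2^64*T_up
    return np.where(f < t_low, t_low, out)
```

In a protection layer, the chosen function would be applied to every 2D feature map of the layer output, with the bounds extracted per layer as in Sec. 3.2.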
49.9 ResNet-50 reconstruct a corrupted fmap. The intuition behind this ap-
1 FI
SDC rate (in %)
40 10 FI proach is as follows: Every filter in a given conv2d layer tries
20 8.7 to establish characteristic features of the input image. Typi-
2.5 2.3 0.0 0.0 0.0 0.0 0.0 0.2 0.4 0.2
0 cally, there is a certain redundancy in the topology of fmaps,
65.0 VGG-16 since not all features the network was trained to recognize
60 53.5 1 FI
may be strongly pronounced for a given image (instead mix-
SDC rate (in %)
10 FI
40
11.8
tures of potential features may form), or because multiple
20 8.3 9.0
0.0 0.5 0.0 0.5 0.0 0.5 0.8 features resemble each other at the given processing stage.
0
62.4 AlexNet Therefore, replacing a corrupted fmap with a non-corrupted
54.9 1 FI
60 fmap from a different kernel can help to obtain an estimate of
SDC rate (in %)
10 FI
40
21.1 the original topology. We average all healthy (i.e., not con-
20 10.9
5.0 0.1 1.2 0.1 1.2 0.1 1.2 1.6 taining out-of-bound activations) fmaps by
0
No_protection Ranger Clipper BackFlip FmapAvg FmapRescale
ind = {i = 1 . . .Cout | max( fi ) ≤ Tup , min( fi ) ≥ Tlow },
(a) Weight faults
37.2 ResNet-50
1
favg = ∑ fi . (7)
30
1 FI
|ind| j∈ind
SDC rate (in %)
10 FI
20 16.1
10 4.7
0.0 0.2 0.0 0.0 0.0 0.0 0.1
3.8
0.3 If there are no healthy feature maps, favg will be the zero-
0 tensor. Subsequently, we replace oob values in a corrupted
40.4 VGG-16
40 1 FI fmap with their counterparts from the estimate of Eq. (7),
SDC rate (in %)
10 FI
20
8.2 f (x) if x > Tup or x < Tlow ,
5.0
0.0 0.3 0.0 0.1 0.0 0.1 0.0 1.3 0.2 rfavg (x) = avg (8)
0 x otherwise.
31.9 AlexNet
30 1 FI 24.2
SDC rate (in %)
20
10 FI
16.4 5.2 Results
10 3.8
0.3 2.7 0.1 0.4 0.0 0.4 0.5 1.0
In Fig. 5 we present results for the SDC mitigation exper-
0
No_protection Ranger Clipper BackFlip FmapAvg FmapRescale
iments with different range supervision methods. Compar-
ing 1 and 10 fault injections per input image, we note that
(b) Neuron faults the unprotected models are dramatically corrupted with an
increasing fault rate (SDC rate becomes ≥ 0.50 for weights,
Figure 5: SDC rates for weight (a) and neuron (b) faults using dif-
≥ 0.32 for neurons in the presence of 10 faults). We can asso-
ferent range supervision techniques. Note that compared to Tab. 1
rates are around 4× higher since we inject only in the bits 0 − 8 here. ciate the SDC rate with the chance of unsuccessful mitigation,
1 − Pmitigation , in Eq. 1. Weight faults have a higher impact
than neuron faults since they directly corrupt a multitude of
activations in a layer’s fmap output (in contrast to individual
The above thresholds are motivated by the following logic:
activations for neuron faults) and thus propagate faster than
Given appropriate bounds, an activation is < Tup before a bit
neuron faults.
flip. Any flip of an exponential bit i ∈ {1 . . . 8} effectively
All the studied range restriction methods reduce the SDC
multiplies a factor of 2^(2^(8−i)). Hence, any value beyond Tup · 2^64 must have originated from a flip "0" → "1" of the MSB, meaning that the original value was between 0 and 2. We then set back all out-of-bound values in this regime to zero, assuming that lower reset values represent a more conservative choice in eliminating faults. Next, flipped values in the range Tup · 2 < x < Tup · 2^64 can possibly originate from a flip of any exponential bit. Given that Tup is typically > 1, a bit flip has to produce a corrupted absolute value > 2 in this regime. This is possible only if either the MSB is flipped from "0" → "1", or the MSB is already at "1" and another exponential bit is flipped "0" → "1". In all variants of the latter case, the original value had to be > 2 itself, and hence we conservatively reset out-of-bound values to 2. Finally, corrupted values in the range Tup < x < Tup · 2 may originate from any non-sign bit flip. In this case, lower exponential or even fraction bit flips result from already large values close to Tup, which is why we set back those values to the upper bound. As in Ranger, values that are too small are reset to Tlow.

FmapAvg: The last proposed range restriction technique uses the remaining, healthy fmaps of a convolutional layer to construct a substitute for a corrupted fmap, replacing out-of-bound values with an average of the unaffected fmaps.

The range restriction methods reduce the SDC rate by a significant margin, but perform differently for weight and neuron fault types: For weight faults, we observe that Clipper, Backflip, and FmapAvg are highly efficient in all three networks, with SDC rates suppressed to values of ≲ 0.01 (an SDC reduction of > 50×). Ranger provides a much weaker protection, in particular in the shallower networks VGG-16 and AlexNet. FmapRescale performs better than Ranger but worse than the aforementioned methods. The deepest studied network, ResNet-50, benefits the most from any type of range restriction in the presence of weight faults. When it comes to neuron faults (Fig. 5b), we see that Clipper and Backflip provide the best protection (the SDC rate is suppressed to < 0.005, a reduction of > 38×), followed by the also very effective Ranger (except for AlexNet). FmapAvg appears to be less efficient for higher fault rates in this scenario, while FmapRescale again falls behind all the above. Overall, we conclude that the pruning-inspired mitigation techniques Clipper and Backflip represent the best choices among the investigated range supervision methods, as they succeed in mitigating both weight and neuron faults to very small residual SDC rates.
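The regime-wise reset rules of Backflip described above can be condensed into a short sketch. The following NumPy function is our illustrative reconstruction, not the authors' implementation; it handles only the positive out-of-bound side, and `t_low`, `t_up` correspond to the bounds Tlow and Tup:

```python
import numpy as np

def backflip(x, t_low, t_up):
    """Illustrative Backflip restriction: map out-of-bound activations back to
    plausible pre-fault values, based on which exponent bit likely flipped.
    Only the positive out-of-bound side is sketched here."""
    y = x.copy()
    # x > t_up * 2^64: only a "0" -> "1" flip of the exponent MSB can produce
    # this; the original value was in [0, 2), so conservatively reset to zero.
    y[x > t_up * 2.0**64] = 0.0
    # t_up * 2 < x <= t_up * 2^64: the original value had to be > 2; reset to 2.
    y[(x > t_up * 2.0) & (x <= t_up * 2.0**64)] = 2.0
    # t_up < x <= t_up * 2: likely a low exponent or fraction bit flip of a
    # value close to the bound, so reset to the upper bound itself.
    y[(x > t_up) & (x <= t_up * 2.0)] = t_up
    # As in Ranger, values that are too small are reset to the lower bound.
    y[x < t_low] = t_low
    return y
```

For example, with Tup = 4, a corrupted value of 1e30 is reset to zero, 10.0 is reset to 2, and 5.0 is reset to the bound 4, while in-range values pass through unchanged.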
[Figure 6 (cluster diagram): VRU cluster: Pedestrian, Bicycle. Non-VRU cluster: Articulated truck, Bus, Car, Motorcycle, Pickup truck, Single-unit truck, Work-van, Non-motorized vehicle. Background cluster: Background. Confusions towards a less vulnerable cluster are labeled safety-critical; all others non-critical.]
Figure 6: Formation of class clusters in MIOVision (VRU denotes vulnerable road user). We make the assumption here that confusions towards less vulnerable clusters are the most safety-critical ones.
[Figure 7 (bar charts): panels (a) Weight faults, (b) Neuron faults; y-axis: SDC rate with safety-critical confusions (in %); x-axis: faults per epoch (a) / faults per image (b); legend: No_protection, Ranger, Clipper, FmapRescale, Backflip, FmapAvg.]
Figure 7: SDC rates for ResNet-50 and MIOVision. We inject 1 and 10 faults targeting bits 0−8 in the network weights (a) and neurons (b). The portion of safety-critical SDC events according to Fig. 6 is displayed as a dark-color overlay.

In the experiments of Fig. 5, the encountered DUE rates for 1 weight or neuron fault (0.003 for ResNet, 0.03 for VGG-16 or AlexNet) are only slightly reduced by range restrictions. However, for a fault rate of 10 we find the following trends: i) For weights, the DUE rate is significantly reduced in ResNet (from 0.15 to 0.002), while the rates in VGG (0.22) and AlexNet (0.26) remain. ii) For neurons, Ranger, Clipper, and Backflip suppress the DUE rate by a factor of up to 2× in all networks.

The studied range restriction techniques require different compute costs due to the different number of additional graph operations. In PyTorch, however, not all needed functions can be implemented with the same efficiency. For example, Ranger is executed with a single clamp operation, while no equivalent formulation is available for Clipper; instead, three operations are necessary (two masks to select out-of-bound values greater and smaller than the threshold, and a masked-fill operation to clip to zero). As a consequence, measured latencies are framework-dependent and a fair comparison cannot be made at this point. Given the complexity of the protection operations, we may instead give a qualitative performance ranking of the described methods: FmapRescale appears to be the most expensive restriction method due to the needed number of operations, followed by FmapAvg and Backflip. Clipper and Ranger are the least complex, with the latter outperforming the former in the used framework, due to its more efficient use of optimized built-in operations.
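To make the operation-count argument concrete, here is a minimal NumPy sketch of the two restriction variants; in PyTorch, the Ranger variant would map to a single `torch.clamp`, while the Clipper variant needs the described mask and fill steps. The function names are ours, not the paper's:

```python
import numpy as np

def ranger_restrict(x, t_low, t_up):
    # Ranger: one clamp saturates out-of-bound activations to [t_low, t_up].
    return np.clip(x, t_low, t_up)

def clipper_restrict(x, t_low, t_up):
    # Clipper: two comparison masks select out-of-bound values,
    # and one fill operation clips them to zero.
    oob = (x > t_up) | (x < t_low)
    y = x.copy()
    y[oob] = 0.0
    return y
```

The behavioral difference is that Ranger preserves a saturated version of the signal, while Clipper zeroes suspicious activations entirely.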
6 Analysis of traffic camera use case

As a selected safety-critical use case, we study object classification in the presence of soft errors with a retrained ResNet-50 and the MIOVision data set [24]. The data contains images of 11 classes, including, for example, pedestrian, bike, car, or background, that were taken by traffic cameras. The correct identification of an object type or category can be safety-critical, for example, to an automated vehicle that uses the support of infrastructure sensors for augmented perception [31]. However, not every class confusion is equally harmful.

To estimate the severity of an error-induced misclassification, we establish three clusters of vulnerable as well as non-vulnerable road users (VRU or non-VRU), and background, see Fig. 6. Misclassifications that lead to the prediction of a class in a less vulnerable cluster are assumed to be safety-critical (Severity ≈ 1 in Eq. 1, e.g., a pedestrian is misclassified as background), while confusions within the same cluster or towards a more vulnerable cluster are considered non-critical (Severity ≈ 0), as they typically lead only to similar or more cautious behavior. This binary estimation allows us to quantify the overall risk as the portion of SDC events associated with the respective critical class confusions.
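This binary severity model can be sketched as follows. The cluster table is abridged from Fig. 6, and all names (`is_critical`, `critical_sdc_portion`, the example confusions) are our own illustration rather than the paper's code:

```python
# Vulnerability ranking per Fig. 6: VRU most vulnerable, then non-VRU,
# then background. The class-to-cluster table is abridged for illustration.
VULNERABILITY = {"VRU": 2, "non-VRU": 1, "background": 0}

CLUSTER = {
    "pedestrian": "VRU", "bicycle": "VRU",
    "car": "non-VRU", "bus": "non-VRU",
    "background": "background",
}

def is_critical(true_cls, pred_cls):
    # A confusion is safety-critical (Severity ~ 1) iff the predicted class
    # belongs to a strictly less vulnerable cluster than the true class.
    return VULNERABILITY[CLUSTER[pred_cls]] < VULNERABILITY[CLUSTER[true_cls]]

def critical_sdc_portion(confusions):
    # confusions: list of (true_class, predicted_class) SDC events; the
    # returned portion is the risk proxy used in the analysis.
    critical = sum(is_critical(t, p) for t, p in confusions)
    return critical / len(confusions)
```

Under this model, "pedestrian" → "background" counts as critical, while "car" → "pedestrian" (towards a more vulnerable cluster) and "car" → "bus" (within a cluster) do not.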
From our results in Fig. 7 we make the following observations: i) The relative proportion of critical confusions is lower for weight than for neuron faults in the unprotected and most protected models. For weight faults, the most frequent confusions are from other classes to the class "car" (the most robust class of MIOVision, with the most images in the training set), which are statistically mostly non-critical. Neuron faults, on the other hand, distort feature maps in a way that most frequently induces misclassifications towards the class "background". Those events are all safety-critical (see Fig. 6), leading to a high critical-to-total SDC ratio. ii) Range supervision is not only effective in reducing the overall SDC count, but also suppresses the critical SDC count proportionally. For example, we observe that the most frequent critical class confusion caused by 1 or 10 weight faults is from the class "pedestrian" to "car" (≈ 0.2 of all critical SDC cases), where > 0.99 of those cases can be mitigated by Clipper or Backflip. For neuron faults, the largest critical SDC contribution is from "pedestrian" to "background" (1 fault) or "car" to "background" (10 faults), both in about 0.1 of all critical SDC cases. Clipper or Backflip are able to suppress > 0.91 of those events.

As a consequence, all studied range-restricted models exhibit a critical-to-total SDC ratio that is similar to or lower than that of the unprotected network (< 0.41 for weight, < 0.78 for neuron faults), meaning that faults in the presence of range supervision have on average a similar or lower severity than faults that do not face range restrictions. A lower ratio can be interpreted as a better preservation of the feature map topology: if the reconstructed features are more similar to the original features, there is a higher chance of the incorrect class being similar to the original class and thus staying within the same cluster. The total probability of critical SDC events – and therefore the relative risk according to Eq. 1 – is negligible in the studied setup in the presence of Clipper or Backflip range protection.

The mean DUE rates in the unprotected model are 0.0 (0.02) for 1 weight (neuron) fault and 0.11 (0.17) for 10 faults. Using any of the protection methods, the system's availability increases, as DUE rates are negligible for 1 fault and reduce to < 0.03 (< 0.05) for 10 weight (neuron) faults.

7 Conclusion

In this paper, we investigated the efficacy of range supervision techniques for constructing a safety case for computer vision AI applications that use convolutional neural networks (CNNs) in the presence of platform soft errors. In the given experimental setup, we demonstrated that the implementation of activation bounds allows for a highly efficient detection of SDC-inducing faults, most importantly featuring a recall of > 0.99. Furthermore, we found that the range restriction layers can effectively mitigate the once-detected faults by mapping out-of-bound values back to the expected intervals. Exploring distinct restriction methods, we observed that Clipper and Backflip perform best for both weight and neuron faults and can reduce the residual SDC rate to ≲ 0.01 (a reduction by a factor of > 38×). Finally, we studied the selected use case of vehicle classification to quantify the impact of range restriction on the severity of SDC events (represented by cluster-wise class confusions). All discussed techniques reduce critical and non-critical events proportionally, meaning that the average severity of SDC is not increased. Therefore, we conclude that the presented approach reduces the overall risk and thus enhances the safety of the user in the presence of platform soft errors.

Acknowledgment

Our research was partially funded by the Federal Ministry of Transport and Digital Infrastructure of Germany in the project Providentia++ (01MM19008). Further, this research was partially supported by a grant from the Natural Sciences and Engineering Research Council of Canada (NSERC), and a research gift from Intel to UBC.

References

[1] International Organization for Standardization, "ISO 26262," Tech. Rep., 2018. [Online]. Available: https://www.iso.org/standard/68383.html
[2] ——, "Road vehicles - Safety of the intended functionality," Tech. Rep., 2019. [Online]. Available: https://www.iso.org/standard/70939.html
[3] J. Athavale, A. Baldovin, R. Graefe, M. Paulitsch, and R. Rosales, "AI and Reliability Trends in Safety-Critical Autonomous Systems on Ground and Air," Proceedings - 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN-W 2020, pp. 74–77, 2020.
[4] H. D. Dixit, S. Pendharkar, M. Beadon, C. Mason, T. Chakravarthy, B. Muthiah, and S. Sankar, Silent Data Corruptions at Scale. Association for Computing Machinery, 2021, vol. 1, no. 1. [Online]. Available: https://arxiv.org/abs/2102.11245
[5] A. Neale and M. Sachdev, "Neutron Radiation Induced Soft Error Rates for an Adjacent-ECC Protected SRAM in 28 nm CMOS," IEEE Transactions on Nuclear Science, vol. 63, no. 3, pp. 1912–1917, 2016.
[6] G. Li, S. K. S. Hari, M. Sullivan, T. Tsai, K. Pattabiraman, J. Emer, and S. W. Keckler, "Understanding error propagation in Deep Learning Neural Network (DNN) accelerators and applications," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2017, 2017.
[7] L.-H. Hoang, M. A. Hanif, and M. Shafique, "FT-ClipAct: Resilience Analysis of Deep Neural Networks and Improving their Fault Tolerance using Clipped Activation," 2019. [Online]. Available: https://arxiv.org/abs/1912.00941
[8] Z. Chen, G. Li, and K. Pattabiraman, "Ranger: Boosting Error Resilience of Deep Neural Networks through Range Restriction," 2020. [Online]. Available: https://arxiv.org/abs/2003.13874
[9] J. M. Cluzeau, X. Henriquel, G. Rebender, G. Soudain, L. van Dijk, A. Gronskiy, D. Haber, C. Perret-Gentil, and R. Polak, "Concepts of Design Assurance for Neural Networks (CoDANN)," Public Report Extract Version 1.0, pp. 1–104, 2020. [Online]. Available: https://www.easa.europa.eu/document-library/general-publications/concepts-design-assurance-neural-networks-codann
[10] P. Koopman and B. Osyk, "Safety argument considerations for public road testing of autonomous vehicles," SAE Technical Papers, vol. 2019-April, no. April, 2019.
[11] Z. Chen, G. Li, K. Pattabiraman, and N. Debardeleben, "BinFI: An efficient fault injector for safety-critical machine learning systems," International Conference for High Performance Computing, Networking, Storage and Analysis, SC, 2019.
[12] S. Hong, P. Frigo, Y. Kaya, C. Giuffrida, and T. Dumitras, "Terminal brain damage: Exposing the graceless degradation in deep neural networks under hardware fault attacks," in Proceedings of the 28th USENIX Security Symposium, 2019.
[13] A. Lotfi, S. Hukerikar, K. Balasubramanian, P. Racunas, N. Saxena, R. Bramley, and Y. Huang, "Resiliency of automotive object detection networks on GPU architectures," Proceedings - International Test Conference, vol. 2019-Novem, pp. 1–9, 2019.
[14] M. A. Hanif and M. Shafique, "SalvageDNN: Salvaging deep neural network accelerators with permanent faults through saliency-driven fault-aware mapping," Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 2020. [Online]. Available: https://royalsocietypublishing.org/doi/10.1098/rsta.2019.0164
[15] A. Mahmoud, S. K. Sastry Hari, C. W. Fletcher, S. V. Adve, C. Sakr, N. Shanbhag, P. Molchanov, M. B. Sullivan, T. Tsai, and S. W. Keckler, "HarDNN: Feature map vulnerability evaluation in CNNs," 2020.
[16] J. Ponader, S. Kundu, and Y. Solihin, "MILR: Mathematically Induced Layer Recovery for Plaintext Space Error Correction of CNNs," 2020. [Online]. Available: http://arxiv.org/abs/2010.14687
[17] K. Zhao, S. Di, S. Li, X. Liang, Y. Zhai, J. Chen, K. Ouyang, F. Cappello, and Z. Chen, "FT-CNN: Algorithm-Based Fault Tolerance for Convolutional Neural Networks," IEEE Transactions on Parallel and Distributed Systems, vol. 32, no. 7, pp. 1677–1689, 2021.
[18] C. Schorn, A. Guntoro, and G. Ascheid, "Efficient On-Line Error Detection and Mitigation for Deep Neural Network Accelerators," in Safecomp 2018, vol. 11093 LNCS, 2018.
[19] L. Yang and B. Murmann, "SRAM voltage scaling for energy-efficient convolutional neural networks," in Proceedings - International Symposium on Quality Electronic Design, ISQED. IEEE Computer Society, May 2017, pp. 7–12.
[20] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2016-Decem, pp. 770–778, 2016.
[21] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings, 2015.
[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, vol. 2, pp. 1097–1105, 2012.
[23] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[24] Z. Luo, F. B. Charron, C. Lemaire, J. Konrad, S. Li, A. Mishra, A. Achkar, J. Eichel, and P.-M. Jodoin, "MIO-TCD: A new benchmark dataset for vehicle classification and localization," IEEE Transactions on Image Processing, 2018.
[25] IEEE, "754-2019 - IEEE Standard for Floating-Point Arithmetic," Tech. Rep., 2019.
[26] Intel Corporation, "bfloat16 - Hardware Numerics Definition," Tech. Rep., 2018. [Online]. Available: https://software.intel.com/content/www/us/en/develop/download/bfloat16-hardware-numerics-definition.html
[27] R. Theagarajan, F. Pala, and B. Bhanu, "EDeN: Ensemble of Deep Networks for Vehicle Classification," IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2017.
[28] C. K. Chang, S. Lym, N. Kelly, M. B. Sullivan, and M. Erez, "Evaluating and accelerating high-fidelity error injection for HPC," Proceedings - International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018, pp. 577–589, 2019.
[29] A. Mahmoud, N. Aggarwal, A. Nobbe, J. R. Sanchez Vicarte, S. V. Adve, C. W. Fletcher, I. Frosio, and S. K. S. Hari, "PyTorchFI: A Runtime Perturbation Tool for DNNs," in DSN-DSML, 2020.
[30] Nvidia, "CUDA toolkit documentation," 2021. [Online]. Available: https://docs.nvidia.com/cuda/floating-point/index.html
[31] A. Krämmer, C. Schöller, D. Gulati, and A. Knoll, "Providentia - A large scale sensing system for the assistance of autonomous vehicles," arXiv, 2019. [Online]. Available: https://arxiv.org/abs/1906.06789