Utilizing Class Separation Distance for the Evaluation of Corruption Robustness of Machine Learning Classifiers
Georg Siedel1,2,βˆ—, Silvia Vock1, Andrey Morozov2 and Stefan Voß1
1 Federal Institute for Occupational Safety and Health (BAuA), Germany
2 University of Stuttgart, Germany

Abstract
Robustness is a fundamental pillar of Machine Learning (ML) classifiers, substantially determining their reliability. Methods for assessing classifier robustness are therefore essential. In this work, we address the challenge of evaluating corruption robustness in a way that allows comparability and interpretability on a given dataset. We propose a test data augmentation method that uses a robustness distance Ξ΅ derived from the dataset's minimal class separation distance. The resulting MSCR (minimal separation corruption robustness) metric allows a dataset-specific comparison of different classifiers with respect to their corruption robustness. The MSCR value is interpretable, as it represents the classifier's avoidable loss of accuracy due to statistical corruptions. On 2D and image data, we show that the metric reflects different levels of classifier robustness. Furthermore, we observe unexpected optima in classifiers' robust accuracy when training and testing classifiers with different levels of noise. While researchers have frequently reported a significant loss of accuracy when training robust models, we strengthen the view that a tradeoff between accuracy and corruption robustness is not inherent. Our results indicate that robustness training through simple data augmentation can already slightly improve accuracy.

Keywords: corruption robustness, classifier, class separation, metric, accuracy-robustness-tradeoff

1. Introduction
ML functions are deployed to an increasing extent over various industries including machinery engineering. Within the European domestic market, machinery products are subject to regulation of the Machinery Directive, which demands a risk assessment1. Risk assessment includes risk estimation and evaluation, where risk is defined as a combination of probability and severity of a hazardous event. Therefore, once ML functions are deployed in machinery products, where their failure may lead to a hazardous event, being able to quantify the probability and severity of their failures becomes mandatory. However, there still exists a gap between the regulative and normative requirements for safety-critical software and the existing methods to assess ML safety [1].
of the output to a misclassification. The property of a classifier resistant to any such input corruptions is called robustness2. A classifier is a function that assigns a class to any d-dimensional input x ∈ R^d. Classifier g is robust at a point x within a distance Ξ΅ > 0, if g(x) = g(x') holds for all perturbed points x' that satisfy dist(x βˆ’ x') ≀ Ξ΅ [2, 3]. The dist-function can e.g. be an Lp-norm distance, while Ξ΅ can be defined based on physical observations of e.g. which perturbations are imperceptible for humans. Robustness is considered a desirable property since, intuitively, a slightly perturbed input (e.g. an imperceptibly changed image) should not lead to a classifier changing its corresponding prediction. In essence, a robustness requirement demands that within a certain input parameter space around x, all points x' have to share the same class. This way, a robustness requirement adds addi-
tional information on how the classifier should behave This work targets ML classifiers, the failures of which near ground truth data points. Authors therefore argue are misclassifications. Our focus is on the evaluation the importance of robustness, being a fundamental pillar of failure probability specifically, not on failure severity. of reliability [4] and quality [5] of ML models. We address one specific failure mode of ML classifiers: However, popular robustness training methods show Corrupted or perturbed data inputs that cause a change significantly lowered test accuracy compared to standard training, which has lead to some authors discussing an The IJCAI-ECAI-22 Workshop on Artificial Intelligence Safety (AISafety inherent, i.e. inevitable tradeoff between accuracy and 2022), July 24-25, 2022, Vienna, Austria robustness (see Section 2.2). βˆ— Corresponding author. Two types of robustness need to be clearly distin- Envelope-Open siedel.georg@baua.bund.de (G. Siedel) guished [5, 6, 7]: adversarial robustness and corruption GLOBE https://github.com/georgsiedel/ minimal-separation-corruption-robustness (G. Siedel) robustness. Β© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). 2 CEUR CEUR Workshop Proceedings (CEUR-WS.org) Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 Robustness includes resistance to any corruption-caused class 1 Machinery Directive, Directive 2006/42/EC of the European Parlia- change, which may not be a failure mode when the original point ment and of the Council of 17 May 2006. was already misclassified (cf. footnote 4). Adversarial inputs are perturbed data deliberately op- Ξ΅ timized to fool a classifier into changing its output class. Class 1 Corruption robustness (sometimes: statistical robust- Input Parameter 2 ness) describes a model’s output stability not against Class 2 such worst-case, but against statistically distributed in- put corruptions. The two types of robustness require different training methods and are differently hard to Low achieve depending on the data dimension [6]. In practice, Robustness training a model for one of the two robustness types only Classifier shows limited or selective improvement for the other type [7, 8, 9]. In the field of research towards ML robustness, most Input Parameter 1 of the attention has been given to adversarial attack and Figure 1: A robustness requirement (here: 𝐿2 -norm balls with defense methods. However, from the perspective of ma- maximum distance πœ€) assigned to the data points (stars) of a chinery safety and risk assessment, adversarial robust- 2D binary dataset (2 input parameters, 2 classes). The shown ness is mainly a security concern and therefore not in classifier is not robust, since its dotted decision boundary the scope of this article. [10] argue that instead of ad- violates the robustness requirement. To evaluate this, addi- tional points (dots) are augmented within Ξ΅ of each original versarial robustness evaluation, a corruption robustness point. On those points, the robust accuracy of the classifier is evaluation is often more applicable to obtain a real-world measured – for this classifier, some errors arise. robustness measure and it can be used to estimate a prob- ability of failure on potentially perturbed inputs for the overall system. of correct classification on original test data (β€œclean ac- Contribution: In this paper, we investigate corruption curacy/error”). 
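To make the Ξ΅-ball robustness definition above concrete, the following Python sketch (our own illustration; the function names and the sampling strategy are not taken from the paper) approximates the "for all perturbed points" condition by random sampling, which matches the statistical, distribution-based view of corruption robustness rather than a worst-case adversarial guarantee.

import numpy as np

def sample_eps_ball(x, eps, n_samples, norm="linf", rng=None):
    """Draw perturbed copies x' of x with dist(x - x') <= eps."""
    rng = np.random.default_rng(rng)
    d = x.shape[0]
    if norm == "linf":
        return x + rng.uniform(-eps, eps, size=(n_samples, d))
    # L2 ball: uniform direction, radius scaled so points are uniform in the ball
    v = rng.normal(size=(n_samples, d))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    r = eps * rng.uniform(size=(n_samples, 1)) ** (1.0 / d)
    return x + r * v

def is_corruption_robust_at(predict, x, eps, n_samples=100, norm="linf"):
    """Empirical check: does g(x') == g(x) hold for all sampled x' in the eps-ball?"""
    y_ref = predict(x[None, :])[0]
    y_pert = predict(sample_eps_ball(x, eps, n_samples, norm))
    return bool(np.all(y_pert == y_ref))

Here, predict stands for any classifier's batch prediction function; with a finite number of samples such a check can only refute robustness at a point, never prove it.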
Robust accuracy represents a combined robustness using data augmentation for testing and train- measure for accuracy and robustness4 . A useful way to ing3 . Our key contributions are twofold: obtain a measure of robustness only is by subtracting β€’ We propose the ”𝑀𝑆𝐢𝑅” metric to evaluate and robust accuracy/error and clean accuracy/error [8, 11]. compare classifiers corruption robustness. The In most cases, the corrupted test dataset is derived approach is independent of prior knowledge from an original test dataset through data augmentation. about corruption distances, but utilizes proper- One or multiple corruptions out of some distribution are ties of the underlying dataset, giving the metric a added to every original data points. Figure 1 explains this distinct interpretable meaning. We show experi- procedure of data augmentation with corruptions (dots) mentally, that the metric captures different levels being added to a test dataset (stars) with 2 parameters of classifier corruption robustness. and 2 classes. It illustrates how a 100% accurate but non- robust classifier achieves lower robust accuracy on the β€’ We evaluate the tradeoff between accuracy and augmented data points. robustness from the perspective of corruption The corruption distribution can be defined e.g. based robustness and present arguments against the on physical observations. For the example of image data, tradeoff being inherent. [8, 12] add corruptions like brightness, blur and contrast, After giving an overview of related work, we present our while [13, 14] use special weather or sensor corruptions. approach for the MSCR metric in section 3.1. We then [8] created robustness benchmarks for the most popular test our approach on simple 2D as well as image data image datasets based on such physical corruptions. with the setup described in section 3.2. We present and Corruption distributions can also be defined without discuss the results in sections 4 and 5. physical representations by adding e.g. Gaussian-, salt- and-pepper-, or uniformly distributed noise of certain magnitude to the inputs [8, 11, 14, 10]. Figure 1 exem- 2. Related Work plary demonstrates uniformly distributed noise within 𝐿2 -norm distance πœ€ (in 2D, 𝐿2 -norm is a circle) of the data 2.1. Measuring corruption robustness points. Corruption robustness of classifiers can be numerically With PROVEN, [15] propose a framework that uses evaluated by testing the ratio of correctly/incorrectly statistical data augmentation to estimate bounds on ad- classified inputs from a corrupted test dataset. This ratio is called robust accuracy/error, in contrast to the ratio 4 The term astuteness can be used for robust accuracy to differentiate the term from robustness, see [3]. Throughout this work, we use the popular term robustness to describe our metric for consistency 3 Code available on Github, see front page. with works like [8] and [10]. versarial robustness of a model, essentially combining the Ξ΅min evaluation of both adversarial and corruption robustness. 2r Accuracy = [4] take a robustness evaluation approach different Robust Accuracy from measuring robust accuracy. The authors augment οƒ  MSCR = 0 the entire input space with uniformly distributed data Input Parameter 2 points, independent of a test dataset. They divide the Accuracy > input space into cells, the size of which is based on the Robust Accuracy r-separation distance described in [3] and in section 2.2. 
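The augmentation-based evaluation sketched in Figure 1 can be written down compactly as follows. This is an illustrative implementation under our own naming, not the authors' released code, and it uses uniform L∞ noise for simplicity, whereas Figure 1 illustrates L2-balls.

import numpy as np

def augment_test_set(X, y, eps, k, rng=None):
    """Create k uniformly corrupted copies of every test point, keeping its label."""
    rng = np.random.default_rng(rng)
    X_rep = np.repeat(X, k, axis=0)
    X_aug = X_rep + rng.uniform(-eps, eps, size=X_rep.shape)
    return X_aug, np.repeat(y, k)

def clean_and_robust_accuracy(predict, X_test, y_test, eps, k=10, rng=None):
    """Clean accuracy on the original points, robust accuracy on the corrupted copies."""
    clean_acc = float(np.mean(predict(X_test) == y_test))
    X_aug, y_aug = augment_test_set(X_test, y_test, eps, k, rng)
    robust_acc = float(np.mean(predict(X_aug) == y_aug))
    # the difference isolates robustness from accuracy, in the spirit of [8, 11]
    return clean_acc, robust_acc, robust_acc - clean_acc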
οƒ  MSCR < 0 This way, they can assign a conflict free ground truth class to each cell and evaluate the misclassification ratio Accuracy < on all added data points. The approach allows for statis- Robust Accuracy tical testing of the entire input space, but does not scale οƒ  MSCR > 0 well to high dimensions. An analytical way of measuring the robustness of a Input Parameter 1 classifier is through describing the characteristics of its Figure 2: The MSCR concept, demonstrated on 2D test data. decision boundary. One possibility is to estimate the Data augmentation is carried out like in Figure 1. The distance local Lipschitzness, i.e. a tightened continuity property (πœ€π‘šπ‘–π‘› ) is determined by the minimal distance (2π‘Ÿ) of original of models in proximity to data points. To the best of our points from different classes (black and grey). This way, aug- knowledge however, Lipschitzness has only been used to mented points of different classes are still separated and classi- fiers can be both accurate and robust. The decision boundaries investigate adversarial, not corruption robustness [2, 3]. of 3 hypothetical classifiers are shown to demonstrate differ- Both the measure in [4] and Lipschitzness values lack ent levels of robustness and their resulting MSCR value. distinct interpretability in terms of what the calculated value represents exactly. a classifier can be both robust and accurate as long as 2.2. The Accuracy-Robustness-Tradeoff πœ€β‰€π‘Ÿ (1) Significant effort has recently been put into increasing classifier robustness, commonly targeting adversarial ro- holds, where πœ€ is the corruption distance for which ro- bustness, e.g. in [9, 16, 17, 18, 19]. All these methods bustness is evaluated and π‘Ÿ is half this minimal class sep- cause a significant drop in clean accuracy. aration distance. We adopt this notation and set πœ€π‘šπ‘–π‘› = π‘Ÿ [11, 20] and [10] observe a clear tradeoff between as our corner case corruption distance (see Figure 2). The corruption robustness and accuracy for different train- value πœ€π‘šπ‘–π‘› is not related to any prior physical knowledge ing methods using data augmentation. The two former of e.g. which corruptions are imperceptible, but is specific works then propose specialized training methods for miti- for the given dataset, i.e. it is based on the fundamental gating parts of this tradeoff on the popular image datasets property of minimal class separation. Accordingly, we CIFAR-10 and ImageNet. call our metric β€œMinimal Separation Corruption Robust- Based on such research, [19] and [21] discuss a tradeoff ness” (MSCR). between accuracy and robustness, while [22] even argue that the cause for this tradeoff is inherent, i.e. inevitable. 3.1. MSCR metric A counterargument is presented by [3], who argue that To measure corruption robustness, we carry out data aug- accuracy and robustness are not necessarily at odds as mentation on the test data with uniformly distributed long as data points from different classes are separated corruptions, generated by a random sampling algorithm, far enough from each other (see section 3). The authors similar to the method shown by [10]. 
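A brute-force computation of the class separation distance 2r and the resulting Ξ΅min could look as follows (a sketch under our own naming and assumptions; for large, high-dimensional datasets such as CIFAR-10 the O(nΒ²) pairwise distance computation would have to be chunked or replaced by nearest-neighbor search).

import numpy as np
from scipy.spatial.distance import cdist

def epsilon_min(X, y, norm="linf"):
    """Half the minimal distance between any two points of different classes (2r / 2)."""
    metric = "chebyshev" if norm == "linf" else "euclidean"   # chebyshev == L-infinity
    X = X.reshape(len(X), -1)                                 # flatten e.g. image tensors
    min_sep = np.inf
    for c in np.unique(y):
        d = cdist(X[y == c], X[y != c], metric=metric)        # all cross-class distances
        min_sep = min(min_sep, float(d.min()))
    return min_sep / 2.0

Applied to the same data and distance function, this corresponds to the Ξ΅min values later reported in Table 1.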
In contrast to [10], measure this β€œr-separation” between different classes on we set the upper bound of the distance πœ€π‘‘π‘’π‘ π‘‘ , within which various image datasets and find it to be high enough the augmented noise is distributed, to πœ€π‘šπ‘–π‘› , as required in for classifiers to be both accurate and robust for typical Equation 1 (see Figure 2 for an illustration). We measure perturbation distances. robust accuracy on the augmented data, which corre- sponds to a combination of clean accuracy and corruption 3. Method robustness. However, we want to quantify robustness independent of clean accuracy for comparability, so we Our robustness evaluation approach is based on this same subtract the clean accuracy (π΄π‘π‘π‘π‘™π‘’π‘Žπ‘› ) from the robust idea by [3], who measure the distance 2r for a dataset, accuracy on πœ€π‘šπ‘–π‘› -augmented test data (π΄π‘π‘π‘Ÿπ‘œπ‘βˆ’πœ€π‘šπ‘–π‘› ) and which is the minimal distance between any two points of normalize by the clean accuracy: different classes (2r in Figure 2). The authors argue that 𝑀𝑆𝐢𝑅 = (π΄π‘π‘π‘Ÿπ‘œπ‘βˆ’πœ€π‘šπ‘–π‘› βˆ’ π΄π‘π‘π‘π‘™π‘’π‘Žπ‘› )/π΄π‘π‘π‘π‘™π‘’π‘Žπ‘› (2) According to [3], a classifier can in principle be robust on Algorithm 1: MSCR calculation such augmented noise of magnitude πœ€π‘šπ‘–π‘› while maintain- Data: classification dataset ing accuracy. This can be seen from Figure 2, where the {𝑋 (π‘₯1 , …, π‘₯𝑛 ), π‘Œ (𝑦1 , …, 𝑦𝑛 )} circles of radius πœ€π‘šπ‘–π‘› in which data is augmented, never Parameters: π‘šπ‘œπ‘‘π‘’π‘™π‘  = {π‘šπ‘œπ‘‘π‘’π‘™1 , …, π‘šπ‘œπ‘‘π‘’π‘™π‘š }, overlap for different classes. We use an identical radius π‘Ÿ = {1, …, π‘Ÿπ‘’π‘›π‘ }, π‘˜, πœ€π‘‘π‘’π‘ π‘‘ = {0, πœ€π‘šπ‘–π‘› } πœ€π‘šπ‘–π‘› for all classes, assuming that the separation of data Output: 𝑀𝑆𝐢𝑅 = {𝑀𝑆𝐢𝑅1 , …, π‘€π‘†πΆπ‘…π‘š } points from the classifiers decision boundary is equally 1 πœ€π‘šπ‘–π‘› = ( min {𝑑𝑖𝑠𝑑(π‘₯𝑗 βˆ’ π‘₯𝑖 )|𝑦𝑖 β‰  𝑦𝑗 })/2 important for all classes. For this noise level πœ€π‘šπ‘–π‘› , any π‘₯π‘–βˆˆπ‘› ,π‘₯π‘—βˆˆπ‘› non-robust behavior is theoretically avoidable, since a 2 for π‘šπ‘œπ‘‘π‘’π‘™π‘  do classifiers decision boundary can separate the classes 3 for π‘Ÿ do even with augmented data, as long as the ML algorithm 4 Train π‘šπ‘œπ‘‘π‘’π‘™π‘š is capable of learning the exact function. The MSCR 5 Test model with original test data metric therefore measures the (relative) win or loss in (πœ€π‘‘π‘’π‘ π‘‘ = 0) β†’ return π΄π‘π‘π‘π‘™π‘’π‘Žπ‘› accuracy when testing on such noisy data that any loss 6 For every test data point: Uniform random is just about avoidable. Figure 2 illustrates the impact of sample π‘˜ points within 𝑑𝑖𝑠𝑑(πœ€π‘šπ‘–π‘› ) and the proposed metric using three corner cases: augment the test data 7 Test model with data from step 6 β†’ return β€’ 𝑀𝑆𝐢𝑅 = 0, π΄π‘π‘π‘Ÿπ‘œπ‘βˆ’πœ€π‘šπ‘–π‘› = π΄π‘π‘π‘π‘™π‘’π‘Žπ‘› , solid line in π΄π‘π‘π‘Ÿπ‘œπ‘βˆ’πœ€π‘šπ‘–π‘› Figure 2: A classifier that is as robust as possible 8 π‘€π‘†πΆπ‘…π‘Ÿ = (π΄π‘π‘π‘Ÿπ‘œπ‘βˆ’πœ€π‘šπ‘–π‘› βˆ’ π΄π‘π‘π‘π‘™π‘’π‘Žπ‘› )/π΄π‘π‘π‘π‘™π‘’π‘Žπ‘› for the given class separation of the dataset. It not π‘Ÿπ‘’π‘›π‘  only correctly classifies the original data points, 9 π‘€π‘†πΆπ‘…π‘š = (βˆ‘π‘Ÿ=1 π‘€π‘†πΆπ‘…π‘Ÿ )/π‘Ÿπ‘’π‘›π‘  but also all augmented data points. 
β€’ 𝑀𝑆𝐢𝑅 < 0, π΄π‘π‘π‘Ÿπ‘œπ‘βˆ’πœ€π‘šπ‘–π‘› < π΄π‘π‘π‘π‘™π‘’π‘Žπ‘› , dotted line in Figure 2: A classifier that is not perfectly robust. πœ€π‘‘π‘’π‘ π‘‘ , models trained with πœ€π‘‘π‘Ÿπ‘Žπ‘–π‘› = πœ€π‘‘π‘’π‘ π‘‘ are expected to It correctly classifies all original data points, but perform best [10]. misclassifies a number of augmented data points As demonstrated in Figure 2, corruption levels below due to low robustness. πœ€π‘šπ‘–π‘› theoretically allow a classifier to be robust while not β€’ 𝑀𝑆𝐢𝑅 > 0, π΄π‘π‘π‘Ÿπ‘œπ‘βˆ’πœ€π‘šπ‘–π‘› > π΄π‘π‘π‘π‘™π‘’π‘Žπ‘› , dashed line losing test accuracy. We investigate this theoretical claim in Figure 2: A classifier misclassifies some orig- by [3] additionally to the MSCR metric by evaluating inal data points, but correctly classifies some of changes in robust accuracy when augmenting multiple their augmentations. Especially for classifiers corruption levels πœ€π‘‘π‘’π‘ π‘‘ to the test dataset. In contrast to that trained to be very robust, we expect this re- the work of [10], we extensively evaluate more corrup- sult to be possible. tion levels below, around and including πœ€π‘šπ‘–π‘› specifically. Algorithm 1 shows the MSCR calculation procedure. In In contrast to the work of [11] and [20], we use simple step 1, different distance functions (e.g. 𝐿∞ -norm) can uniformly distributed data augmentation with a fixed be applied. We account for randomness in the data split- upper bound of noise for the entire dataset instead of ting, model training and data augmentation procedures Gaussian noise. This allows us the comparison of the by carrying out multiple runs of the same experiment noise levels with the class separation distances. It shall be and reporting average values and 95%-confidence inter- noted however that in contrast to Gaussian noise, where vals over all runs. The reasonable number of augmented density decreases with distance, uniform noise does not points k per original data point varies depending on the reflect the higher uncertainty in a class assignment when dataset (see section 3.2). Within the respective for-loop, the distance from a ground truth data point increases. variable π‘šπ‘œπ‘‘π‘’π‘™π‘  runs through the list of all classifier mod- Even though our data augmentation method is simple, els to be compared, while π‘Ÿ counts up to (the overall we still expect to find counterexamples for the accuracy- number of) π‘Ÿπ‘’π‘›π‘ . robustness-tradeoff, based solely on the class-separation theory. We believe that the case of finding such coun- terexamples with less advanced methods than e.g. [11] 3.2. Experimental details represents even more credible evidence for the argument Additionally to test data augmentation, we train multiple of [3] against an inherent accuracy-robustness-tradeoff. models on datasets augmented with different corruption We carry out the experiments on 3 binary class 2D distances πœ€π‘‘π‘Ÿπ‘Žπ‘–π‘› . Increasing a model’s πœ€π‘‘π‘Ÿπ‘Žπ‘–π‘› should lead to datasets as were used and provided by [4]. For clarity, a growing MSCR value, as it is expected that the model we only report results with 𝐿∞ -corruptions on one of robustness grows. This way, we evaluate the trend of those datasets, which is shown in Figure 3 and features the MSCR value for models with different corruption 4674 data points. Experiments with the other 2D datasets robustness levels. 
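A compact rendition of Algorithm 1, including the training-noise augmentation discussed above, is sketched below. It is a simplified illustration under assumptions of ours (a scikit-learn interface, one train/test split per run, noisy copies replacing the clean training points, a normal-approximation confidence interval); the paper's experiments use far more runs (1200 for the 2D data) and a wide residual network for CIFAR-10.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def add_uniform_linf_noise(X, eps, rng):
    return X + rng.uniform(-eps, eps, size=X.shape)

def mscr(make_model, X, y, eps_min, eps_train=0.0, k=10, runs=20, seed=0):
    """Sketch of Algorithm 1 for one model family: MSCR averaged over several runs."""
    rng = np.random.default_rng(seed)
    per_run = []
    for _ in range(runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=int(rng.integers(2**31 - 1)))
        if eps_train > 0:                      # robustness training via data augmentation
            X_tr = add_uniform_linf_noise(X_tr, eps_train, rng)
        model = make_model().fit(X_tr, y_tr)
        acc_clean = model.score(X_te, y_te)    # step 5: eps_test = 0
        X_aug = add_uniform_linf_noise(np.repeat(X_te, k, axis=0), eps_min, rng)
        acc_rob = model.score(X_aug, np.repeat(y_te, k))   # steps 6-7: eps_test = eps_min
        per_run.append((acc_rob - acc_clean) / acc_clean)  # step 8: Equation 2
    mean = float(np.mean(per_run))
    ci95 = 1.96 * float(np.std(per_run, ddof=1)) / np.sqrt(runs)
    return mean, ci95

# e.g. a random forest as in section 3.2 (eps_min = 0.004013 from Table 1):
# print(mscr(lambda: RandomForestClassifier(n_estimators=100), X, y,
#            eps_min=0.004013, eps_train=0.007, k=10))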
Also, on test data corrupted with large and 𝐿2 -corruptions exhibit similar fundamental results, Figure 4: Effect of hyperparameter π‘˜ on robust accuracy and Figure 3: Data points in the binary class 2D dataset. its deviation. 2D dataset, πœ€π‘‘π‘Ÿπ‘Žπ‘–π‘› , πœ€π‘‘π‘’π‘ π‘‘ = 0.001. which can also be found in our Github repository (see Table 1 frontpage). The two input parameters π‘₯[0] and π‘₯[1] are Minimal 𝐿∞ class separation and corresponding πœ€π‘šπ‘–π‘› normalized to the interval [0, 1]. For classification, we Dataset 2π‘Ÿ(𝐿∞ ) πœ€π‘šπ‘–π‘› use a random forest (RF) algorithm with 100 trees. We also compare this classifier with a 1-nearest-neighbor 2D dataset 0.008026 0.004013 model, which is known to be inherently robust, since it CIFAR-10 (train and test set) 0.211765 0.105882 classifies based on distance to the 1 nearest data point. We choose π‘˜ = 10 augmented data points per original data point, as we found higher numbers of π‘˜ not signifi- 4. Results cantly improving the resulting robust accuracy and its standard deviation. This effect of different values for the Table 2 displays the matrix of test accuracies for the 2D hyperparameter π‘˜ is displayed in Figure 4. In order to dataset for different values of both πœ€π‘‘π‘Ÿπ‘Žπ‘–π‘› (representing achieve statistically representative results, we evaluate different models, along columns) and πœ€π‘‘π‘’π‘ π‘‘ (along rows). how the average test accuracy converges over multiple The bold values highlight the best model for every level runs and accordingly choose 1200 runs. of test noise. As can be seen, the optima of the accuracy The experiments are additionally run in a more applied do not actually match with the matrix diagonal, where image classification setting using benchmark dataset training and test noise are equal (highlighted in light CIFAR-10. We adopt the classifier architecture from [10], grey). Instead, when testing with lower noise levels and using a 28-10 wide residual network with SGD optimizer, even with clean test data, the model trained on πœ€π‘‘π‘Ÿπ‘Žπ‘–π‘› = 0.3 dropout rate, training batch size 32 and 30 epochs with 0.007 performs best. The maximum overall accuracy is a 3-step decreasing learning rate. All pixel values are nor- achieved with a model trained on πœ€π‘‘π‘Ÿπ‘Žπ‘–π‘› = 0.007 that is malized to [0, 1] and random horizontal flips and random also tested on πœ€π‘‘π‘’π‘ π‘‘ = 0.001 corruptions. For higher noise crops with 4px padding are used for training generaliza- levels, the optimum robust accuracies are achieved with tion. For CIFAR-10 we choose π‘˜ = 1, since [10] report πœ€π‘‘π‘Ÿπ‘Žπ‘–π‘› ≀ πœ€π‘‘π‘’π‘ π‘‘ , displaying the opposite trend compared to one augmented point to be sufficient. We suspect that low noise levels. this is due to the multiple epochs of the training process, The results on CIFAR-10 in Table 3 show a similar which allows to train the model on multiple augmenta- trend, albeit less pronounced. For low noise levels, train- tions per training data point. We choose 20 runs due to ing with πœ€π‘‘π‘Ÿπ‘Žπ‘–π‘› = 0.01 appears to be optimal for clean computational feasibility of all training procedures. Ta- accuracy. The maximum overall accuracy is achieved ble 1 shows the minimal class separation distances 2π‘Ÿ and with πœ€π‘‘π‘Ÿπ‘Žπ‘–π‘› = 0.02 and πœ€π‘‘π‘’π‘ π‘‘ = 0.01. 
the corresponding Ξ΅min values, measured in L∞ distance, for both datasets. For intuition, the CIFAR-10 Ξ΅min value translates to a maximum color grade change of 27/255 on all pixels. Higher values for 2r are to be expected for image data, since the L∞ norm evaluates the maximum distance in any of the 3072 dimensions of the CIFAR-10 input data.
For higher levels of test noise, similarly to the 2D data, it appears beneficial to use Ξ΅train ≀ Ξ΅test. In contrast to the 2D data, where the optimum Ξ΅train for Ξ΅test = 0 is higher than the Ξ΅min value, for CIFAR-10 it is ∼10 times lower than Ξ΅min. The optimum Ξ΅train = 0.01 translates to a 2.5/255 color grade corruption for every pixel.
For both datasets it is visible from the last rows of Tables 2 and 3 that the MSCR value steadily increases with higher levels of training noise Ξ΅train. For both datasets, the MSCR increases from negative values on less robust trained models to zero and even positive values for more robust trained models. For CIFAR-10, the MSCR values are overall much larger than for the 2D data. This effect correlates with the Ξ΅min noise level, which is about 26 times larger in absolute values.

Table 2 (top, 2D dataset) and Table 3 (bottom, CIFAR-10 dataset): Clean accuracies (first row, Ξ΅test = 0) and robust accuracies in percent plus the MSCR value (last row) for various models (columns), Β± the 95% confidence intervals. Models are trained and tested with different levels of L∞ noise (Ξ΅train along columns, Ξ΅test along rows). In the original typesetting, bold accuracies mark the best model for every noise level, the bold MSCR value marks the highest model robustness, the last-row color scale highlights the constant increase of MSCR with increasing Ξ΅train, light grey marks models trained and tested on the same noise level (Ξ΅train = Ξ΅test), and dark grey marks the maximum overall accuracy.

Table 2 (2D dataset):
Ξ΅test \ Ξ΅train | 0 | 0.001 | 0.002 | Ξ΅min | 0.007 | 0.01 | 0.015 | 0.02 | 0.03
0     | 99.531Β±0.014 | 99.652Β±0.011 | 99.699Β±0.011 | 99.748Β±0.010 | 99.784Β±0.009 | 99.769Β±0.010 | 99.504Β±0.016 | 98.990Β±0.028 | 97.347Β±0.060
0.001 | 99.515Β±0.013 | 99.640Β±0.011 | 99.689Β±0.010 | 99.746Β±0.009 | 99.785Β±0.009 | 99.775Β±0.009 | 99.524Β±0.014 | 99.017Β±0.025 | 97.380Β±0.057
0.002 | 99.495Β±0.013 | 99.607Β±0.011 | 99.660Β±0.010 | 99.732Β±0.009 | 99.777Β±0.008 | 99.768Β±0.009 | 99.528Β±0.013 | 99.026Β±0.024 | 97.411Β±0.055
Ξ΅min  | 99.405Β±0.013 | 99.525Β±0.011 | 99.583Β±0.010 | 99.669Β±0.009 | 99.729Β±0.008 | 99.716Β±0.008 | 99.486Β±0.012 | 99.005Β±0.022 | 97.435Β±0.052
0.007 | 99.167Β±0.014 | 99.287Β±0.012 | 99.360Β±0.011 | 99.461Β±0.010 | 99.536Β±0.009 | 99.535Β±0.010 | 99.319Β±0.013 | 98.871Β±0.022 | 97.380Β±0.049
0.01  | 98.782Β±0.017 | 98.899Β±0.015 | 98.977Β±0.014 | 99.083Β±0.014 | 99.175Β±0.013 | 99.191Β±0.014 | 99.014Β±0.017 | 98.615Β±0.024 | 97.238Β±0.047
0.015 | 97.871Β±0.025 | 97.979Β±0.025 | 98.044Β±0.025 | 98.134Β±0.025 | 98.222Β±0.025 | 98.265Β±0.026 | 98.197Β±0.029 | 97.921Β±0.033 | 96.810Β±0.049
0.02  | 96.771Β±0.036 | 96.847Β±0.037 | 96.896Β±0.037 | 96.966Β±0.037 | 97.040Β±0.038 | 97.092Β±0.038 | 97.105Β±0.040 | 96.962Β±0.043 | 96.198Β±0.053
0.03  | 94.397Β±0.058 | 94.423Β±0.059 | 94.456Β±0.059 | 94.500Β±0.060 | 94.547Β±0.061 | 94.593Β±0.061 | 94.668Β±0.061 | 94.698Β±0.062 | 94.448Β±0.066
MSCR  | -0.126Β±0.007 | -0.127Β±0.006 | -0.116Β±0.006 | -0.080Β±0.005 | -0.055Β±0.005 | -0.053Β±0.006 | -0.018Β±0.010 | 0.015Β±0.015 | 0.090Β±0.024

Table 3 (CIFAR-10 dataset):
Ξ΅test \ Ξ΅train | 0 | 0.01 | 0.02 | 0.03 | 0.05 | 0.07 | Ξ΅min | 0.15
0     | 91.681Β±0.304 | 91.932Β±0.318 | 91.917Β±0.417 | 91.311Β±0.470 | 90.428Β±0.427 | 88.645Β±0.454 | 86.051Β±0.800 | 81.989Β±0.855
0.01  | 91.453Β±0.338 | 91.900Β±0.311 | 91.964Β±0.408 | 91.351Β±0.474 | 90.472Β±0.421 | 88.678Β±0.463 | 86.085Β±0.798 | 81.995Β±0.858
0.02  | 90.675Β±0.429 | 91.527Β±0.338 | 91.868Β±0.428 | 91.459Β±0.442 | 90.577Β±0.420 | 88.817Β±0.457 | 86.158Β±0.763 | 82.082Β±0.844
0.03  | 89.181Β±0.595 | 91.606Β±0.385 | 91.606Β±0.462 | 91.479Β±0.422 | 90.690Β±0.403 | 88.983Β±0.421 | 86.306Β±0.766 | 82.165Β±0.838
0.05  | 84.062Β±1.033 | 89.273Β±0.611 | 89.273Β±0.570 | 90.530Β±0.483 | 90.832Β±0.346 | 89.513Β±0.362 | 86.745Β±0.737 | 82.540Β±0.802
0.07  | 76.706Β±1.396 | 84.086Β±0.999 | 84.086Β±0.831 | 87.303Β±0.672 | 90.065Β±0.324 | 89.907Β±0.332 | 87.322Β±0.621 | 83.051Β±0.789
Ξ΅min  | 59.261Β±1.868 | 67.534Β±1.814 | 67.534Β±1.695 | 74.137Β±1.367 | 83.784Β±0.714 | 88.260Β±0.374 | 88.194Β±0.418 | 84.181Β±0.666
0.15  | 37.458Β±2.034 | 42.765Β±2.071 | 42.765Β±2.286 | 49.685Β±1.991 | 65.285Β±1.842 | 77.257Β±1.083 | 86.130Β±0.433 | 85.352Β±0.511
MSCR  | -35.362Β±0.020 | -33.977Β±0.020 | -26.527Β±0.018 | -18.808Β±0.014 | -7.347Β±0.009 | -0.434Β±0.006 | 2.490Β±0.006 | 2.674Β±0.003

Figure 5 shows a comparison on the 2D dataset between the 1NN model and the RF model with regards to clean accuracy (Fig. 5a) and MSCR (Fig. 5b). Both models are trained on the various Ξ΅train values. While for the RF model, both metrics increase with increasing training noise up to the optimum of Ξ΅train = 0.007, the 1NN model shows constant (and superior) metrics up to this training noise. This illustrates the inherent robustness of the 1NN model. The comparison also shows that this inherent robustness is indeed advantageous regarding accuracy on our dataset.
Figures 6a (2D dataset) and 6b (CIFAR-10) display the accuracy-robustness-tradeoff for the models trained with different Ξ΅train by contrasting MSCR versus clean accuracy values. Both figures in principle show a tradeoff curve. However, it is visible that for Ξ΅train ≀ 0.007 on 2D data and Ξ΅train ≀ 0.01 on CIFAR-10, both clean accuracy and robustness increase compared to the baseline model with Ξ΅train = 0. The tradeoff is overcome for these models (arguably also for Ξ΅train = 0.01 for 2D data and Ξ΅train = 0.02 for CIFAR-10).

5. Discussion

5.1. Applicability of the MSCR metric
Our results from the experiments indicate that the relative difference between the noise-augmented robust accuracy and the clean accuracy is a measure for corruption robustness of models. For Ξ΅test = Ξ΅min in particular, this relative difference that we named MSCR steadily increases with higher corruption robustness of the RF model on 2D data and the wide residual network on CIFAR-10.
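As a sanity check of Equation 2 (our own illustration, not part of the paper), the MSCR row of Table 2 can be recomputed from its clean-accuracy row (Ξ΅test = 0) and its Ξ΅min row. Small last-digit deviations are expected because the reported MSCR is averaged per run rather than computed from the averaged accuracies.

# clean accuracies (eps_test = 0) and robust accuracies (eps_test = eps_min)
# for the nine models of Table 2, in percent:
acc_clean  = [99.531, 99.652, 99.699, 99.748, 99.784, 99.769, 99.504, 98.990, 97.347]
acc_robust = [99.405, 99.525, 99.583, 99.669, 99.729, 99.716, 99.486, 99.005, 97.435]

mscr = [100 * (r - c) / c for c, r in zip(acc_clean, acc_robust)]   # Equation 2, in percent
print([round(v, 3) for v in mscr])
# [-0.127, -0.127, -0.116, -0.079, -0.055, -0.053, -0.018, 0.015, 0.09]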
[Figure 5 plots, for the RF and the 1NN model, (a) clean accuracy [%] and (b) robustness (MSCR) over the training noise level Ξ΅train (0 to 0.03, including Ξ΅min).]
Figure 5: Model comparison on the 2D dataset with regards to clean accuracy and robustness (MSCR): RF versus 1NN model with different Ξ΅train.

[Figure 6 plots robustness (MSCR) over clean accuracy for (a) the 2D dataset and (b) the CIFAR-10 dataset, marking the baseline model (Ξ΅train = 0) and the models trained with increasing Ξ΅train.]
Figure 6: Accuracy-robustness-tradeoff for models trained with different levels of augmented training noise Ξ΅train, compared to the baseline model with Ξ΅train = 0. Models with both higher MSCR and higher clean accuracy (when the curve evolves towards the top right corner) contradict the inherent tradeoff.

This way, we verify the metric's capability to reflect the corruption robustness of different models. However, this claim is based on the assumption that increasing corruption robustness of our models can be generated through training with higher noise levels. This seems evident based on research by [10], but requires future validation like in [11], who confirm that their Gaussian robustness metric is strongly correlated with the popular physical corruptions benchmark by [8].
On the 2D dataset, the 1NN model shows a constant, superior MSCR value compared to the RF model for all Ξ΅train ≀ 0.007, where classes are still predominantly separated. This is the performance expected from an inherently robust model such as 1NN, which fits its decision boundary based on maximum class separation. The MSCR values are able to correctly display this interrelation.

5.2. Disadvantages and advantages of the MSCR metric
In our experiments, the steady robustness increase for higher Ξ΅train also holds for other levels of testing noise than Ξ΅min. The MSCR value, which uses Ξ΅min-corruptions as the underlying robustness requirement, is only one particular case of this robustness calculation approach. It has to be emphasized that from our results in Tables 2 and 3, we cannot observe any conspicuities for Ξ΅test ∼ Ξ΅min. For example, there is no indication that models perform well below this noise level while massively dropping off at higher noise levels, as could be presumed from the r-separation theory. It is therefore evident to conclude that measuring corruption robustness works with other Ξ΅test-values. In practice, if specific corruptions are known
for an application, those corruptions should also be used The suggestion that some πœ€π‘‘π‘Ÿπ‘Žπ‘–π‘› > 0 leads to higher for testing, e.g. through benchmarks [8]. clean accuracy than πœ€π‘‘π‘Ÿπ‘Žπ‘–π‘› = 0 has theoretical relevance. However, we emphasize that the MSCR metric is ad- It supports the claim made, but not practically proven by vantageous in two ways: First, it does not require prior [3], that accuracy and robustness are not in an inherent physical knowledge to define corruption distributions, tradeoff as long as the noise level πœ€ fulfills Equation 1. like e.g. [8] does. Instead, it only requires measuring the The result also seems relevant from a practical per- actual class separation from any classification dataset. spective, since developers may try some πœ€π‘‘π‘Ÿπ‘Žπ‘–π‘› for training Second, the MSCR can be interpreted with a clear contex- data augmentation, which increases robustness without tual meaning, since the robustness requirement is derived drawbacks regarding accuracy. We emphasize that this from the dataset: It measures β€œthe theoretically avoidable practical implication is only valid for the very limited loss (or win) of accuracy due to statistical corruptions”. model architectures, datasets and augmentation distribu- tions we tested. For example, our experiments show that 5.3. On achieving high MSCR values noise training below πœ€π‘šπ‘–π‘› has no effect on an inherently robust model such as 1NN. This is due to the fact that this Clearly, avoiding any loss of accuracy on πœ€π‘šπ‘–π‘› -noise is model type maximizes the class separation of its decision hard to achieve in practice on high-dimensional data. For boundary in training anyways. CIFAR-10, 𝑀𝑆𝐢𝑅 = 0 can be achieved, but only with On the one hand, overcoming the tradeoff for small πœ€π‘‘π‘Ÿπ‘Žπ‘–π‘› = 0.07, where the clean accuracy declines by 3 per- πœ€π‘‘π‘Ÿπ‘Žπ‘–π‘› is not entirely surprising, since it is well known centage points compared to πœ€π‘‘π‘Ÿπ‘Žπ‘–π‘› = 0. We also verify our that data transformations and data augmentations can conjecture that 𝑀𝑆𝐢𝑅 > 0 is possible for some robust increase generalization of models (in fact, we also used trained models. For this behavior, we find the discovery random flips and crops for CIFAR-10 training). [11] and in [23] a convincing technical explanation. Misclassified [20] also manage to overcome the tradeoff with more ad- data points tend to lie closer to the decision boundary vanced training methods. On the other hand, our results than correctly classified data points. The data augmen- are surprising considering this drawback-free increase in tations on a misclassified data point therefore have a robust accuracy is quite significant for the RF model on high chance of causing a favorable class change. At the 2D data (less than halving the classification error). Also, same time, data augmentations on correctly classified uniform 𝐿∞ data augmentation is a very simple method points have a lower chance of causing an unfavorable and less contextually relevant compared to physically class change when their distance to the decision bound- derived augmentations. An explanation may be that the ary is high, which is what a robust model is trained for. uniform 𝐿𝑝 -norm noise allows a stricter coverage of the input parameter space near data points compared to phys- 5.4. The accuracy-robustness-tradeoff ical data augmentations, enforcing a smooth model that is less prone to overfitting the corruptions. 
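For the image setting, the Ξ΅train augmentation could be plugged into a standard training pipeline as an extra transform. The sketch below is our own illustration of that idea, not the authors' code; in particular, clipping the corrupted pixels back to [0, 1] is an assumption on our side, consistent with the normalization described in section 3.2 but not stated in the paper.

import torch

class UniformLinfNoise:
    """Add per-pixel uniform noise in [-eps, eps] and clip to the valid [0, 1] range."""

    def __init__(self, eps: float):
        self.eps = eps

    def __call__(self, img: torch.Tensor) -> torch.Tensor:
        noise = torch.empty_like(img).uniform_(-self.eps, self.eps)
        return torch.clamp(img + noise, 0.0, 1.0)   # clipping is our assumption

# typical CIFAR-10 usage after the flips/crops mentioned in section 3.2, using torchvision:
# transforms.Compose([transforms.RandomHorizontalFlip(),
#                     transforms.RandomCrop(32, padding=4),
#                     transforms.ToTensor(),
#                     UniformLinfNoise(eps=0.01)])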
Besides our investigation of the MSCR metric, we report on findings regarding the tradeoff between accuracy and corruption robustness. For both 2D and CIFAR-10 datasets we observe higher clean and robust accuracy on any test noise when training a model with a specific level of uniform noise (Ξ΅train = 0.007 for 2D, Ξ΅train = 0.01 for CIFAR-10), compared to standard training. For the 2D data, this optimum Ξ΅train value is even higher than Ξ΅min, the value which the r-separation theory suggests to be beneficial for robustness while not hurting accuracy. This could be due to the major proportion of minimal distances of data points to other classes being significantly bigger than Ξ΅min. Our results are statistically significant for the 2D dataset experiment. For 20 runs per trained model on CIFAR-10, we emphasize that claiming higher mean clean accuracy for any Ξ΅train > 0 compared to Ξ΅train = 0 does not achieve 95%-confidence in a pairwise statistical comparison. More than 20 runs are necessary to obtain statistically significant results, which we could not achieve due to limited computational resources. Hence, we only treat our results on CIFAR-10 regarding the accuracy-robustness-tradeoff as suggestions.

5.5. Class separation distance for model training
From our results we also need to conclude that in practice, the Ξ΅min value has only limited expressiveness when trying to find the optimal Ξ΅train with regards to (robust) accuracy. This is visible in Figures 6a and 6b, where based solely on the r-separation theory, we may have expected the curve to reverse its trend along the x-axis when Ξ΅train = Ξ΅min. In reality, the best overall accuracy for the 2D data is achieved for Ξ΅train ∼ 2Β·Ξ΅min, while on CIFAR-10 it is achieved for Ξ΅train < Ξ΅min/5. We suspect that high-dimensional datasets are notoriously hard to train with regards to high robust accuracy, at least for such Ξ΅ levels as their high L∞ class separation distance Ξ΅min inevitably entails. We suspect that on other datasets Ξ΅min may be even greater and further away from the optimum Ξ΅train. Additional research is needed on various distance measures, dataset dimensions and model types in order to utilize class separation distances for optimizing robust accuracy.

5.6. Optima of Ξ΅train vs. Ξ΅test
Another interesting finding from the accuracy matrix of both datasets is that the best Ξ΅train value for models evaluated with certain Ξ΅test deviates from the expected diagonal. For example, Ξ΅train = 0.03 is not the best choice to prepare for Ξ΅test = 0.03. In Figure 7, the accuracy matrix for CIFAR-10 from Table 3 is visualized in a 3D plot, which shows how the optima in (robust) accuracy deviate from the diagonal. It appears that for low noise levels the best choice is Ξ΅train > Ξ΅test, while for higher noise levels Ξ΅train < Ξ΅test is more favorable. This suspected dependency needs further investigation.

Figure 7: CIFAR-10 (robust) accuracies for different Ξ΅train and Ξ΅test. The optima, marked with points, deviate from the diagonal (white line where Ξ΅train = Ξ΅test): towards higher Ξ΅train for lower noise levels and towards lower Ξ΅train for higher noise levels.
6. Conclusion
In this article we evaluated a data augmentation method in order to obtain a comparable, interpretable measure of corruption robustness for classifiers. We measured the relative difference between the robust accuracy on corrupted test data and the clean accuracy. We proposed to use half the minimal class separation distance measured from the dataset as the maximum distance Ξ΅min of the augmented test noise. This robustness requirement does not presume any prior knowledge about real corruption distances. It theoretically allows a classifier to be fully robust while not losing accuracy. The class separation distance therefore gives our metric a distinct meaning: It represents any "avoidable" loss (or win) in accuracy due to corruptions. We experimentally showed that our metric is able to reflect various degrees of model robustness.
From training classifiers with different levels of noise we found that classifiers with the highest robust accuracy on a certain level of noise are not strictly those that are trained on this same level of noise. We also presented indications that a tradeoff between accuracy and corruption robustness is not inherent: In our experiments, simple augmentation training on significant random uniform noise could improve test accuracy of classifiers additionally to their robustness, compared with normal training. However, the minimal class separation distance could in practice not guide us towards the optimal values of training noise. These findings regarding the accuracy-robustness-tradeoff could in our opinion be useful in practice.
Our work seems to fit into a gap between those researchers optimizing test accuracy and those optimizing robustness. Our future work will include further investigations of data augmentation training and testing using other dataset types, distance metrics and corruption distributions. It would be of additional interest whether some increase in adversarial robustness can be obtained without losing accuracy. Our findings emphasize the potential and encourage the development of advanced training procedures mitigating the accuracy-robustness-tradeoff, since the combination of both properties is essential from a risk assessment perspective.

References
[1] G. Siedel, S. Voß, S. Vock, An overview of the research landscape in the field of safe machine learning, in: Volume 13: Safety Engineering, Risk, and Reliability Analysis; Research Posters, American Society of Mechanical Engineers, 2021. doi:10.1115/IMECE2021-69390.
[2] T.-W. Weng, H. Zhang, P.-Y. Chen, J. Yi, D. Su, Y. Gao, C.-J. Hsieh, L. Daniel, Evaluating the robustness of neural networks: An extreme value theory approach, International Conference on Learning Representations (ICLR) (2018).
[3] Y.-Y. Yang, C. Rashtchian, H. Zhang, R. R. Salakhutdinov, K. Chaudhuri, A closer look at accuracy vs. robustness, Advances in neural information processing systems 33 (2020) 8588–8601.
[4] X. Zhao, W. Huang, V. Bharti, Y. Dong, V. Cox, A. Banks, S. Wang, S. Schewe, X. Huang, Reliability assessment and safety arguments for machine learning components in assuring learning-enabled autonomous systems, arXiv preprint arXiv:2112.00646 (2021).
P. S. Liang, Unlabeled data improves adversarial robustness, Advances in neural information processing systems 32 (2019).
[17] J. Cohen, E. Rosenfeld, Z. Kolter, Certified adversarial robustness via randomized smoothing, in: In-
ternational Conference on Machine Learning, 2019, [5] Deutsches Institut fΓΌr Normung, Din spec 92001- pp. 1310–1320. 2: Artificial intelligence – life cycle processes and [18] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, quality requirements: Part 2: Robustness, 2020. A. Vladu, Towards deep learning models resistant [6] A. Fawzi, O. Fawzi, P. Frossard, Analysis of classi- to adversarial attacks, International Conference on fiers’ robustness to adversarial perturbations, Ma- Learning Representations (ICLR) (2018). chine learning 107 (2018) 481–508. [19] H. Zhang, Y. Yu, J. Jiao, E. Xing, L. El Ghaoui, M. Jor- [7] J. Gilmer, N. Ford, N. Carlini, E. Cubuk, Adversarial dan, Theoretically principled trade-off between examples are a natural consequence of test error robustness and accuracy, in: International Confer- in noise, in: International Conference on Machine ence on Machine Learning, 2019, pp. 7472–7482. Learning, 2019, pp. 2280–2289. [20] D. Hendrycks, N. Mu, E. D. Cubuk, B. Zoph, [8] D. Hendrycks, T. Dietterich, Benchmarking neural J. Gilmer, B. Lakshminarayanan, Augmix: A simple network robustness to common corruptions and data processing method to improve robustness and perturbations, International Conference on Learn- uncertainty, International Conference on Learning ing Representations (ICLR) (2019). Representations (ICLR) (2020). [9] E. Rusak, L. Schott, R. S. Zimmermann, J. Bitterwolf, [21] A. Raghunathan, S. M. Xie, F. Yang, J. Duchi, O. Bringmann, M. Bethge, W. Brendel, A simple P. Liang, Understanding and mitigating the trade- way to make neural networks robust against diverse off between robustness and accuracy, International image corruptions, in: European Conference on Conference on Machine Learning (ICML) (2020). Computer Vision, 2020, pp. 53–69. [22] D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, [10] B. Wang, S. Webb, T. Rainforth, Statistically robust A. Madry, Robustness may be at odds with accu- neural network classification, in: Uncertainty in racy, International Conference on Learning Repre- Artificial Intelligence (UAI), 2021, pp. 1735–1745. sentations (ICLR) (2019). [11] R. G. Lopes, D. Yin, B. Poole, J. Gilmer, E. D. Cubuk, [23] D. Mickisch, F. Assion, F. Greßner, W. GΓΌnther, Improving robustness without sacrificing accuracy M. Motta, Understanding the decision boundary of with patch gaussian augmentation, arXiv preprint deep neural networks: An empirical study, arXiv arXiv:1906.02611 (2019). preprint arXiv:2002.01810 (2020). [12] C. Paterson, H. Wu, J. Grese, R. Calinescu, C. S. Pasareanu, C. Barrett, Deepcert: Verification of contextually relevant robustness for neural network image classifiers, in: International Conference on Computer Safety, Reliability, and Security, 2021, pp. 3–17. [13] O. Molokovich, A. Morozov, N. Yusupova, K. Jan- schek, Evaluation of graphic data corruptions im- pact on artificial intelligence applications, in: IOP Conference Series: Materials Science and Engineer- ing, volume 1069, 2021, p. 012010. [14] P. Schwerdtner, F. Greßner, N. Kapoor, F. Assion, R. Sass, W. GΓΌnther, F. HΓΌger, P. Schlicht, Risk as- sessment for machine learning models, NeurIPS 2020 Virtual Workshop: Machine Learning for Au- tonomous Driving (2020). URL: https://arxiv.org/ pdf/2011.04328.pdf. [15] L. Weng, P.-Y. Chen, L. Nguyen, M. Squillante, A. Boopathy, I. Oseledets, L. Daniel, Proven: Veri- fying robustness of neural networks with a proba- bilistic approach, in: International Conference on Machine Learning, 2019, pp. 6727–6736. [16] Y. Carmon, A. 
Raghunathan, L. Schmidt, J. C. Duchi,