Utilizing Class Separation Distance for the Evaluation of Corruption Robustness of Machine Learning Classifiers
Georg Siedel1,2,βˆ—, Silvia Vock1, Andrey Morozov2 and Stefan Voß1
1 Federal Institute for Occupational Safety and Health (BAuA), Germany
2 University of Stuttgart, Germany

Abstract
Robustness is a fundamental pillar of Machine Learning (ML) classifiers, substantially determining their reliability. Methods for assessing classifier robustness are therefore essential. In this work, we address the challenge of evaluating corruption robustness in a way that allows comparability and interpretability on a given dataset. We propose a test data augmentation method that uses a robustness distance Ξ΅ derived from the dataset's minimal class separation distance. The resulting MSCR (minimal separation corruption robustness) metric allows a dataset-specific comparison of different classifiers with respect to their corruption robustness. The MSCR value is interpretable, as it represents the classifier's avoidable loss of accuracy due to statistical corruptions. On 2D and image data, we show that the metric reflects different levels of classifier robustness. Furthermore, we observe unexpected optima in classifiers' robust accuracy when training and testing classifiers with different levels of noise. While researchers have frequently reported a significant loss of accuracy when training robust models, we strengthen the view that a tradeoff between accuracy and corruption robustness is not inherent. Our results indicate that robustness training through simple data augmentation can already slightly improve accuracy.

Keywords: corruption robustness, classifier, class separation, metric, accuracy-robustness-tradeoff

1. Introduction
ML functions are deployed to an increasing extent over various industries including machinery engineering. Within the European domestic market, machinery products are subject to regulation of the Machinery Directive, which demands a risk assessment1. Risk assessment includes risk estimation and evaluation, where risk is defined as a combination of probability and severity of a hazardous event. Therefore, once ML functions are deployed in machinery products, where their failure may lead to a hazardous event, being able to quantify the probability and severity of their failures becomes mandatory. However, there still exists a gap between the regulative and normative requirements for safety-critical software and the existing methods to assess ML safety [1].
of the output to a misclassification. The property of a classifier resistant to any such input corruptions is called robustness2. A classifier is a function that assigns a class to any d-dimensional input x ∈ R^d. Classifier g is robust at a point x within a distance Ξ΅ > 0, if g(x) = g(x') holds for all perturbed points x' that satisfy dist(x βˆ’ x') ≀ Ξ΅ [2, 3]. The dist-function can e.g. be an Lp-norm distance, while Ξ΅ can be defined based on physical observations of e.g. which perturbations are imperceptible for humans. Robustness is considered a desirable property since, intuitively, a slightly perturbed input (e.g. an imperceptibly changed image) should not lead to a classifier changing its corresponding prediction. In essence, a robustness requirement demands that within a certain input parameter space around x, all points x' have to share the same class. This way, a robustness requirement adds addi-
tional information on how the classifier should behave This work targets ML classifiers, the failures of which near ground truth data points. Authors therefore argue are misclassifications. Our focus is on the evaluation the importance of robustness, being a fundamental pillar of failure probability specifically, not on failure severity. of reliability [4] and quality [5] of ML models. We address one specific failure mode of ML classifiers: However, popular robustness training methods show Corrupted or perturbed data inputs that cause a change significantly lowered test accuracy compared to standard training, which has lead to some authors discussing an The IJCAI-ECAI-22 Workshop on Artificial Intelligence Safety (AISafety inherent, i.e. inevitable tradeoff between accuracy and 2022), July 24-25, 2022, Vienna, Austria robustness (see Section 2.2). βˆ— Corresponding author. Two types of robustness need to be clearly distin- Envelope-Open siedel.georg@baua.bund.de (G. Siedel) guished [5, 6, 7]: adversarial robustness and corruption GLOBE https://github.com/georgsiedel/ minimal-separation-corruption-robustness (G. Siedel) robustness. Β© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). 2 CEUR CEUR Workshop Proceedings (CEUR-WS.org) Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 Robustness includes resistance to any corruption-caused class 1 Machinery Directive, Directive 2006/42/EC of the European Parlia- change, which may not be a failure mode when the original point ment and of the Council of 17 May 2006. was already misclassified (cf. footnote 4). Adversarial inputs are perturbed data deliberately op- Ξ΅ timized to fool a classifier into changing its output class. Class 1 Corruption robustness (sometimes: statistical robust- Input Parameter 2 ness) describes a model’s output stability not against Class 2 such worst-case, but against statistically distributed in- put corruptions. The two types of robustness require different training methods and are differently hard to Low achieve depending on the data dimension [6]. In practice, Robustness training a model for one of the two robustness types only Classifier shows limited or selective improvement for the other type [7, 8, 9]. In the field of research towards ML robustness, most Input Parameter 1 of the attention has been given to adversarial attack and Figure 1: A robustness requirement (here: 𝐿2 -norm balls with defense methods. However, from the perspective of ma- maximum distance πœ€) assigned to the data points (stars) of a chinery safety and risk assessment, adversarial robust- 2D binary dataset (2 input parameters, 2 classes). The shown ness is mainly a security concern and therefore not in classifier is not robust, since its dotted decision boundary the scope of this article. [10] argue that instead of ad- violates the robustness requirement. To evaluate this, addi- tional points (dots) are augmented within Ξ΅ of each original versarial robustness evaluation, a corruption robustness point. On those points, the robust accuracy of the classifier is evaluation is often more applicable to obtain a real-world measured – for this classifier, some errors arise. robustness measure and it can be used to estimate a prob- ability of failure on potentially perturbed inputs for the overall system. of correct classification on original test data (β€œclean ac- Contribution: In this paper, we investigate corruption curacy/error”). 
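To make the Ξ΅-ball robustness definition above concrete, the following Python sketch (our own illustration; the function names and the sampling strategy are not taken from the paper) approximates the "for all perturbed points" condition by random sampling, which matches the statistical, distribution-based view of corruption robustness rather than a worst-case adversarial guarantee.

import numpy as np

def sample_eps_ball(x, eps, n_samples, norm="linf", rng=None):
    """Draw perturbed copies x' of x with dist(x - x') <= eps."""
    rng = np.random.default_rng(rng)
    d = x.shape[0]
    if norm == "linf":
        return x + rng.uniform(-eps, eps, size=(n_samples, d))
    # L2 ball: uniform direction, radius scaled so points are uniform in the ball
    v = rng.normal(size=(n_samples, d))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    r = eps * rng.uniform(size=(n_samples, 1)) ** (1.0 / d)
    return x + r * v

def is_corruption_robust_at(predict, x, eps, n_samples=100, norm="linf"):
    """Empirical check: does g(x') == g(x) hold for all sampled x' in the eps-ball?"""
    y_ref = predict(x[None, :])[0]
    y_pert = predict(sample_eps_ball(x, eps, n_samples, norm))
    return bool(np.all(y_pert == y_ref))

Here, predict stands for any classifier's batch prediction function; with a finite number of samples such a check can only refute robustness at a point, never prove it.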
Robust accuracy represents a combined robustness using data augmentation for testing and train- measure for accuracy and robustness4 . A useful way to ing3 . Our key contributions are twofold: obtain a measure of robustness only is by subtracting β€’ We propose the ”𝑀𝑆𝐢𝑅” metric to evaluate and robust accuracy/error and clean accuracy/error [8, 11]. compare classifiers corruption robustness. The In most cases, the corrupted test dataset is derived approach is independent of prior knowledge from an original test dataset through data augmentation. about corruption distances, but utilizes proper- One or multiple corruptions out of some distribution are ties of the underlying dataset, giving the metric a added to every original data points. Figure 1 explains this distinct interpretable meaning. We show experi- procedure of data augmentation with corruptions (dots) mentally, that the metric captures different levels being added to a test dataset (stars) with 2 parameters of classifier corruption robustness. and 2 classes. It illustrates how a 100% accurate but non- robust classifier achieves lower robust accuracy on the β€’ We evaluate the tradeoff between accuracy and augmented data points. robustness from the perspective of corruption The corruption distribution can be defined e.g. based robustness and present arguments against the on physical observations. For the example of image data, tradeoff being inherent. [8, 12] add corruptions like brightness, blur and contrast, After giving an overview of related work, we present our while [13, 14] use special weather or sensor corruptions. approach for the MSCR metric in section 3.1. We then [8] created robustness benchmarks for the most popular test our approach on simple 2D as well as image data image datasets based on such physical corruptions. with the setup described in section 3.2. We present and Corruption distributions can also be defined without discuss the results in sections 4 and 5. physical representations by adding e.g. Gaussian-, salt- and-pepper-, or uniformly distributed noise of certain magnitude to the inputs [8, 11, 14, 10]. Figure 1 exem- 2. Related Work plary demonstrates uniformly distributed noise within 𝐿2 -norm distance πœ€ (in 2D, 𝐿2 -norm is a circle) of the data 2.1. Measuring corruption robustness points. Corruption robustness of classifiers can be numerically With PROVEN, [15] propose a framework that uses evaluated by testing the ratio of correctly/incorrectly statistical data augmentation to estimate bounds on ad- classified inputs from a corrupted test dataset. This ratio is called robust accuracy/error, in contrast to the ratio 4 The term astuteness can be used for robust accuracy to differentiate the term from robustness, see [3]. Throughout this work, we use the popular term robustness to describe our metric for consistency 3 Code available on Github, see front page. with works like [8] and [10]. versarial robustness of a model, essentially combining the Ξ΅min evaluation of both adversarial and corruption robustness. 2r Accuracy = [4] take a robustness evaluation approach different Robust Accuracy from measuring robust accuracy. The authors augment οƒ  MSCR = 0 the entire input space with uniformly distributed data Input Parameter 2 points, independent of a test dataset. They divide the Accuracy > input space into cells, the size of which is based on the Robust Accuracy r-separation distance described in [3] and in section 2.2. 
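The augmentation-based evaluation sketched in Figure 1 can be written down compactly as follows. This is an illustrative implementation under our own naming, not the authors' released code, and it uses uniform L∞ noise for simplicity, whereas Figure 1 illustrates L2-balls.

import numpy as np

def augment_test_set(X, y, eps, k, rng=None):
    """Create k uniformly corrupted copies of every test point, keeping its label."""
    rng = np.random.default_rng(rng)
    X_rep = np.repeat(X, k, axis=0)
    X_aug = X_rep + rng.uniform(-eps, eps, size=X_rep.shape)
    return X_aug, np.repeat(y, k)

def clean_and_robust_accuracy(predict, X_test, y_test, eps, k=10, rng=None):
    """Clean accuracy on the original points, robust accuracy on the corrupted copies."""
    clean_acc = float(np.mean(predict(X_test) == y_test))
    X_aug, y_aug = augment_test_set(X_test, y_test, eps, k, rng)
    robust_acc = float(np.mean(predict(X_aug) == y_aug))
    # the difference isolates robustness from accuracy, in the spirit of [8, 11]
    return clean_acc, robust_acc, robust_acc - clean_acc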
οƒ  MSCR < 0 This way, they can assign a conflict free ground truth class to each cell and evaluate the misclassification ratio Accuracy < on all added data points. The approach allows for statis- Robust Accuracy tical testing of the entire input space, but does not scale οƒ  MSCR > 0 well to high dimensions. An analytical way of measuring the robustness of a Input Parameter 1 classifier is through describing the characteristics of its Figure 2: The MSCR concept, demonstrated on 2D test data. decision boundary. One possibility is to estimate the Data augmentation is carried out like in Figure 1. The distance local Lipschitzness, i.e. a tightened continuity property (πœ€π‘šπ‘–π‘› ) is determined by the minimal distance (2π‘Ÿ) of original of models in proximity to data points. To the best of our points from different classes (black and grey). This way, aug- knowledge however, Lipschitzness has only been used to mented points of different classes are still separated and classi- fiers can be both accurate and robust. The decision boundaries investigate adversarial, not corruption robustness [2, 3]. of 3 hypothetical classifiers are shown to demonstrate differ- Both the measure in [4] and Lipschitzness values lack ent levels of robustness and their resulting MSCR value. distinct interpretability in terms of what the calculated value represents exactly. a classifier can be both robust and accurate as long as 2.2. The Accuracy-Robustness-Tradeoff πœ€β‰€π‘Ÿ (1) Significant effort has recently been put into increasing classifier robustness, commonly targeting adversarial ro- holds, where πœ€ is the corruption distance for which ro- bustness, e.g. in [9, 16, 17, 18, 19]. All these methods bustness is evaluated and π‘Ÿ is half this minimal class sep- cause a significant drop in clean accuracy. aration distance. We adopt this notation and set πœ€π‘šπ‘–π‘› = π‘Ÿ [11, 20] and [10] observe a clear tradeoff between as our corner case corruption distance (see Figure 2). The corruption robustness and accuracy for different train- value πœ€π‘šπ‘–π‘› is not related to any prior physical knowledge ing methods using data augmentation. The two former of e.g. which corruptions are imperceptible, but is specific works then propose specialized training methods for miti- for the given dataset, i.e. it is based on the fundamental gating parts of this tradeoff on the popular image datasets property of minimal class separation. Accordingly, we CIFAR-10 and ImageNet. call our metric β€œMinimal Separation Corruption Robust- Based on such research, [19] and [21] discuss a tradeoff ness” (MSCR). between accuracy and robustness, while [22] even argue that the cause for this tradeoff is inherent, i.e. inevitable. 3.1. MSCR metric A counterargument is presented by [3], who argue that To measure corruption robustness, we carry out data aug- accuracy and robustness are not necessarily at odds as mentation on the test data with uniformly distributed long as data points from different classes are separated corruptions, generated by a random sampling algorithm, far enough from each other (see section 3). The authors similar to the method shown by [10]. 
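A brute-force computation of the class separation distance 2r and the resulting Ξ΅min could look as follows (a sketch under our own naming and assumptions; for large, high-dimensional datasets such as CIFAR-10 the O(nΒ²) pairwise distance computation would have to be chunked or replaced by nearest-neighbor search).

import numpy as np
from scipy.spatial.distance import cdist

def epsilon_min(X, y, norm="linf"):
    """Half the minimal distance between any two points of different classes (2r / 2)."""
    metric = "chebyshev" if norm == "linf" else "euclidean"   # chebyshev == L-infinity
    X = X.reshape(len(X), -1)                                 # flatten e.g. image tensors
    min_sep = np.inf
    for c in np.unique(y):
        d = cdist(X[y == c], X[y != c], metric=metric)        # all cross-class distances
        min_sep = min(min_sep, float(d.min()))
    return min_sep / 2.0

Applied to the same data and distance function, this corresponds to the Ξ΅min values later reported in Table 1.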
In contrast to [10], measure this β€œr-separation” between different classes on we set the upper bound of the distance πœ€π‘‘π‘’π‘ π‘‘ , within which various image datasets and find it to be high enough the augmented noise is distributed, to πœ€π‘šπ‘–π‘› , as required in for classifiers to be both accurate and robust for typical Equation 1 (see Figure 2 for an illustration). We measure perturbation distances. robust accuracy on the augmented data, which corre- sponds to a combination of clean accuracy and corruption 3. Method robustness. However, we want to quantify robustness independent of clean accuracy for comparability, so we Our robustness evaluation approach is based on this same subtract the clean accuracy (π΄π‘π‘π‘π‘™π‘’π‘Žπ‘› ) from the robust idea by [3], who measure the distance 2r for a dataset, accuracy on πœ€π‘šπ‘–π‘› -augmented test data (π΄π‘π‘π‘Ÿπ‘œπ‘βˆ’πœ€π‘šπ‘–π‘› ) and which is the minimal distance between any two points of normalize by the clean accuracy: different classes (2r in Figure 2). The authors argue that 𝑀𝑆𝐢𝑅 = (π΄π‘π‘π‘Ÿπ‘œπ‘βˆ’πœ€π‘šπ‘–π‘› βˆ’ π΄π‘π‘π‘π‘™π‘’π‘Žπ‘› )/π΄π‘π‘π‘π‘™π‘’π‘Žπ‘› (2) According to [3], a classifier can in principle be robust on Algorithm 1: MSCR calculation such augmented noise of magnitude πœ€π‘šπ‘–π‘› while maintain- Data: classification dataset ing accuracy. This can be seen from Figure 2, where the {𝑋 (π‘₯1 , …, π‘₯𝑛 ), π‘Œ (𝑦1 , …, 𝑦𝑛 )} circles of radius πœ€π‘šπ‘–π‘› in which data is augmented, never Parameters: π‘šπ‘œπ‘‘π‘’π‘™π‘  = {π‘šπ‘œπ‘‘π‘’π‘™1 , …, π‘šπ‘œπ‘‘π‘’π‘™π‘š }, overlap for different classes. We use an identical radius π‘Ÿ = {1, …, π‘Ÿπ‘’π‘›π‘ }, π‘˜, πœ€π‘‘π‘’π‘ π‘‘ = {0, πœ€π‘šπ‘–π‘› } πœ€π‘šπ‘–π‘› for all classes, assuming that the separation of data Output: 𝑀𝑆𝐢𝑅 = {𝑀𝑆𝐢𝑅1 , …, π‘€π‘†πΆπ‘…π‘š } points from the classifiers decision boundary is equally 1 πœ€π‘šπ‘–π‘› = ( min {𝑑𝑖𝑠𝑑(π‘₯𝑗 βˆ’ π‘₯𝑖 )|𝑦𝑖 β‰  𝑦𝑗 })/2 important for all classes. For this noise level πœ€π‘šπ‘–π‘› , any π‘₯π‘–βˆˆπ‘› ,π‘₯π‘—βˆˆπ‘› non-robust behavior is theoretically avoidable, since a 2 for π‘šπ‘œπ‘‘π‘’π‘™π‘  do classifiers decision boundary can separate the classes 3 for π‘Ÿ do even with augmented data, as long as the ML algorithm 4 Train π‘šπ‘œπ‘‘π‘’π‘™π‘š is capable of learning the exact function. The MSCR 5 Test model with original test data metric therefore measures the (relative) win or loss in (πœ€π‘‘π‘’π‘ π‘‘ = 0) β†’ return π΄π‘π‘π‘π‘™π‘’π‘Žπ‘› accuracy when testing on such noisy data that any loss 6 For every test data point: Uniform random is just about avoidable. Figure 2 illustrates the impact of sample π‘˜ points within 𝑑𝑖𝑠𝑑(πœ€π‘šπ‘–π‘› ) and the proposed metric using three corner cases: augment the test data 7 Test model with data from step 6 β†’ return β€’ 𝑀𝑆𝐢𝑅 = 0, π΄π‘π‘π‘Ÿπ‘œπ‘βˆ’πœ€π‘šπ‘–π‘› = π΄π‘π‘π‘π‘™π‘’π‘Žπ‘› , solid line in π΄π‘π‘π‘Ÿπ‘œπ‘βˆ’πœ€π‘šπ‘–π‘› Figure 2: A classifier that is as robust as possible 8 π‘€π‘†πΆπ‘…π‘Ÿ = (π΄π‘π‘π‘Ÿπ‘œπ‘βˆ’πœ€π‘šπ‘–π‘› βˆ’ π΄π‘π‘π‘π‘™π‘’π‘Žπ‘› )/π΄π‘π‘π‘π‘™π‘’π‘Žπ‘› for the given class separation of the dataset. It not π‘Ÿπ‘’π‘›π‘  only correctly classifies the original data points, 9 π‘€π‘†πΆπ‘…π‘š = (βˆ‘π‘Ÿ=1 π‘€π‘†πΆπ‘…π‘Ÿ )/π‘Ÿπ‘’π‘›π‘  but also all augmented data points. 
β€’ 𝑀𝑆𝐢𝑅 < 0, π΄π‘π‘π‘Ÿπ‘œπ‘βˆ’πœ€π‘šπ‘–π‘› < π΄π‘π‘π‘π‘™π‘’π‘Žπ‘› , dotted line in Figure 2: A classifier that is not perfectly robust. πœ€π‘‘π‘’π‘ π‘‘ , models trained with πœ€π‘‘π‘Ÿπ‘Žπ‘–π‘› = πœ€π‘‘π‘’π‘ π‘‘ are expected to It correctly classifies all original data points, but perform best [10]. misclassifies a number of augmented data points As demonstrated in Figure 2, corruption levels below due to low robustness. πœ€π‘šπ‘–π‘› theoretically allow a classifier to be robust while not β€’ 𝑀𝑆𝐢𝑅 > 0, π΄π‘π‘π‘Ÿπ‘œπ‘βˆ’πœ€π‘šπ‘–π‘› > π΄π‘π‘π‘π‘™π‘’π‘Žπ‘› , dashed line losing test accuracy. We investigate this theoretical claim in Figure 2: A classifier misclassifies some orig- by [3] additionally to the MSCR metric by evaluating inal data points, but correctly classifies some of changes in robust accuracy when augmenting multiple their augmentations. Especially for classifiers corruption levels πœ€π‘‘π‘’π‘ π‘‘ to the test dataset. In contrast to that trained to be very robust, we expect this re- the work of [10], we extensively evaluate more corrup- sult to be possible. tion levels below, around and including πœ€π‘šπ‘–π‘› specifically. Algorithm 1 shows the MSCR calculation procedure. In In contrast to the work of [11] and [20], we use simple step 1, different distance functions (e.g. 𝐿∞ -norm) can uniformly distributed data augmentation with a fixed be applied. We account for randomness in the data split- upper bound of noise for the entire dataset instead of ting, model training and data augmentation procedures Gaussian noise. This allows us the comparison of the by carrying out multiple runs of the same experiment noise levels with the class separation distances. It shall be and reporting average values and 95%-confidence inter- noted however that in contrast to Gaussian noise, where vals over all runs. The reasonable number of augmented density decreases with distance, uniform noise does not points k per original data point varies depending on the reflect the higher uncertainty in a class assignment when dataset (see section 3.2). Within the respective for-loop, the distance from a ground truth data point increases. variable π‘šπ‘œπ‘‘π‘’π‘™π‘  runs through the list of all classifier mod- Even though our data augmentation method is simple, els to be compared, while π‘Ÿ counts up to (the overall we still expect to find counterexamples for the accuracy- number of) π‘Ÿπ‘’π‘›π‘ . robustness-tradeoff, based solely on the class-separation theory. We believe that the case of finding such coun- terexamples with less advanced methods than e.g. [11] 3.2. Experimental details represents even more credible evidence for the argument Additionally to test data augmentation, we train multiple of [3] against an inherent accuracy-robustness-tradeoff. models on datasets augmented with different corruption We carry out the experiments on 3 binary class 2D distances πœ€π‘‘π‘Ÿπ‘Žπ‘–π‘› . Increasing a model’s πœ€π‘‘π‘Ÿπ‘Žπ‘–π‘› should lead to datasets as were used and provided by [4]. For clarity, a growing MSCR value, as it is expected that the model we only report results with 𝐿∞ -corruptions on one of robustness grows. This way, we evaluate the trend of those datasets, which is shown in Figure 3 and features the MSCR value for models with different corruption 4674 data points. Experiments with the other 2D datasets robustness levels. 
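A compact rendition of Algorithm 1, including the training-noise augmentation discussed above, is sketched below. It is a simplified illustration under assumptions of ours (a scikit-learn interface, one train/test split per run, noisy copies replacing the clean training points, a normal-approximation confidence interval); the paper's experiments use far more runs (1200 for the 2D data) and a wide residual network for CIFAR-10.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def add_uniform_linf_noise(X, eps, rng):
    return X + rng.uniform(-eps, eps, size=X.shape)

def mscr(make_model, X, y, eps_min, eps_train=0.0, k=10, runs=20, seed=0):
    """Sketch of Algorithm 1 for one model family: MSCR averaged over several runs."""
    rng = np.random.default_rng(seed)
    per_run = []
    for _ in range(runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=int(rng.integers(2**31 - 1)))
        if eps_train > 0:                      # robustness training via data augmentation
            X_tr = add_uniform_linf_noise(X_tr, eps_train, rng)
        model = make_model().fit(X_tr, y_tr)
        acc_clean = model.score(X_te, y_te)    # step 5: eps_test = 0
        X_aug = add_uniform_linf_noise(np.repeat(X_te, k, axis=0), eps_min, rng)
        acc_rob = model.score(X_aug, np.repeat(y_te, k))   # steps 6-7: eps_test = eps_min
        per_run.append((acc_rob - acc_clean) / acc_clean)  # step 8: Equation 2
    mean = float(np.mean(per_run))
    ci95 = 1.96 * float(np.std(per_run, ddof=1)) / np.sqrt(runs)
    return mean, ci95

# e.g. a random forest as in section 3.2 (eps_min = 0.004013 from Table 1):
# print(mscr(lambda: RandomForestClassifier(n_estimators=100), X, y,
#            eps_min=0.004013, eps_train=0.007, k=10))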
Also, on test data corrupted with large and 𝐿2 -corruptions exhibit similar fundamental results, Figure 4: Effect of hyperparameter π‘˜ on robust accuracy and Figure 3: Data points in the binary class 2D dataset. its deviation. 2D dataset, πœ€π‘‘π‘Ÿπ‘Žπ‘–π‘› , πœ€π‘‘π‘’π‘ π‘‘ = 0.001. which can also be found in our Github repository (see Table 1 frontpage). The two input parameters π‘₯[0] and π‘₯[1] are Minimal 𝐿∞ class separation and corresponding πœ€π‘šπ‘–π‘› normalized to the interval [0, 1]. For classification, we Dataset 2π‘Ÿ(𝐿∞ ) πœ€π‘šπ‘–π‘› use a random forest (RF) algorithm with 100 trees. We also compare this classifier with a 1-nearest-neighbor 2D dataset 0.008026 0.004013 model, which is known to be inherently robust, since it CIFAR-10 (train and test set) 0.211765 0.105882 classifies based on distance to the 1 nearest data point. We choose π‘˜ = 10 augmented data points per original data point, as we found higher numbers of π‘˜ not signifi- 4. Results cantly improving the resulting robust accuracy and its standard deviation. This effect of different values for the Table 2 displays the matrix of test accuracies for the 2D hyperparameter π‘˜ is displayed in Figure 4. In order to dataset for different values of both πœ€π‘‘π‘Ÿπ‘Žπ‘–π‘› (representing achieve statistically representative results, we evaluate different models, along columns) and πœ€π‘‘π‘’π‘ π‘‘ (along rows). how the average test accuracy converges over multiple The bold values highlight the best model for every level runs and accordingly choose 1200 runs. of test noise. As can be seen, the optima of the accuracy The experiments are additionally run in a more applied do not actually match with the matrix diagonal, where image classification setting using benchmark dataset training and test noise are equal (highlighted in light CIFAR-10. We adopt the classifier architecture from [10], grey). Instead, when testing with lower noise levels and using a 28-10 wide residual network with SGD optimizer, even with clean test data, the model trained on πœ€π‘‘π‘Ÿπ‘Žπ‘–π‘› = 0.3 dropout rate, training batch size 32 and 30 epochs with 0.007 performs best. The maximum overall accuracy is a 3-step decreasing learning rate. All pixel values are nor- achieved with a model trained on πœ€π‘‘π‘Ÿπ‘Žπ‘–π‘› = 0.007 that is malized to [0, 1] and random horizontal flips and random also tested on πœ€π‘‘π‘’π‘ π‘‘ = 0.001 corruptions. For higher noise crops with 4px padding are used for training generaliza- levels, the optimum robust accuracies are achieved with tion. For CIFAR-10 we choose π‘˜ = 1, since [10] report πœ€π‘‘π‘Ÿπ‘Žπ‘–π‘› ≀ πœ€π‘‘π‘’π‘ π‘‘ , displaying the opposite trend compared to one augmented point to be sufficient. We suspect that low noise levels. this is due to the multiple epochs of the training process, The results on CIFAR-10 in Table 3 show a similar which allows to train the model on multiple augmenta- trend, albeit less pronounced. For low noise levels, train- tions per training data point. We choose 20 runs due to ing with πœ€π‘‘π‘Ÿπ‘Žπ‘–π‘› = 0.01 appears to be optimal for clean computational feasibility of all training procedures. Ta- accuracy. The maximum overall accuracy is achieved ble 1 shows the minimal class separation distances 2π‘Ÿ and with πœ€π‘‘π‘Ÿπ‘Žπ‘–π‘› = 0.02 and πœ€π‘‘π‘’π‘ π‘‘ = 0.01. 
the corresponding Ξ΅min values, measured in L∞ distance, for both datasets. For intuition, the CIFAR-10 Ξ΅min value translates to a maximum color grade change of 27/255 on all pixels. Higher values for 2r are to be expected for image data, since the L∞ norm evaluates the maximum distance in any of the 3072 dimensions of the CIFAR-10 input data.
For higher levels of test noise, similarly to the 2D data, it appears beneficial to use Ξ΅train ≀ Ξ΅test. In contrast to the 2D data, where the optimum Ξ΅train for Ξ΅test = 0 is higher than the Ξ΅min value, for CIFAR-10 it is ∼10 times lower than Ξ΅min. The optimum Ξ΅train = 0.01 translates to a 2.5/255 color grade corruption for every pixel.
For both datasets it is visible from the last rows of Tables 2 and 3 that the MSCR value steadily increases with higher levels of training noise Ξ΅train. For both datasets, the MSCR increases from negative values on less robust trained models to zero and even positive values for more robust trained models. For CIFAR-10, the MSCR values are overall much larger than for the 2D data. This effect correlates with the Ξ΅min noise level, which is about 26 times larger in absolute values.

Table 2 (top, 2D dataset) and Table 3 (bottom, CIFAR-10 dataset): Clean accuracies (first row, Ξ΅test = 0) and robust accuracies in percent plus the MSCR value (last row) for various models (columns), Β± the 95% confidence intervals. Models are trained and tested with different levels of L∞ noise (Ξ΅train along columns, Ξ΅test along rows). In the original typesetting, bold accuracies mark the best model for every noise level, the bold MSCR value marks the highest model robustness, the last-row color scale highlights the constant increase of MSCR with increasing Ξ΅train, light grey marks models trained and tested on the same noise level (Ξ΅train = Ξ΅test), and dark grey marks the maximum overall accuracy.

Table 2 (2D dataset):
Ξ΅test \ Ξ΅train | 0 | 0.001 | 0.002 | Ξ΅min | 0.007 | 0.01 | 0.015 | 0.02 | 0.03
0     | 99.531Β±0.014 | 99.652Β±0.011 | 99.699Β±0.011 | 99.748Β±0.010 | 99.784Β±0.009 | 99.769Β±0.010 | 99.504Β±0.016 | 98.990Β±0.028 | 97.347Β±0.060
0.001 | 99.515Β±0.013 | 99.640Β±0.011 | 99.689Β±0.010 | 99.746Β±0.009 | 99.785Β±0.009 | 99.775Β±0.009 | 99.524Β±0.014 | 99.017Β±0.025 | 97.380Β±0.057
0.002 | 99.495Β±0.013 | 99.607Β±0.011 | 99.660Β±0.010 | 99.732Β±0.009 | 99.777Β±0.008 | 99.768Β±0.009 | 99.528Β±0.013 | 99.026Β±0.024 | 97.411Β±0.055
Ξ΅min  | 99.405Β±0.013 | 99.525Β±0.011 | 99.583Β±0.010 | 99.669Β±0.009 | 99.729Β±0.008 | 99.716Β±0.008 | 99.486Β±0.012 | 99.005Β±0.022 | 97.435Β±0.052
0.007 | 99.167Β±0.014 | 99.287Β±0.012 | 99.360Β±0.011 | 99.461Β±0.010 | 99.536Β±0.009 | 99.535Β±0.010 | 99.319Β±0.013 | 98.871Β±0.022 | 97.380Β±0.049
0.01  | 98.782Β±0.017 | 98.899Β±0.015 | 98.977Β±0.014 | 99.083Β±0.014 | 99.175Β±0.013 | 99.191Β±0.014 | 99.014Β±0.017 | 98.615Β±0.024 | 97.238Β±0.047
0.015 | 97.871Β±0.025 | 97.979Β±0.025 | 98.044Β±0.025 | 98.134Β±0.025 | 98.222Β±0.025 | 98.265Β±0.026 | 98.197Β±0.029 | 97.921Β±0.033 | 96.810Β±0.049
0.02  | 96.771Β±0.036 | 96.847Β±0.037 | 96.896Β±0.037 | 96.966Β±0.037 | 97.040Β±0.038 | 97.092Β±0.038 | 97.105Β±0.040 | 96.962Β±0.043 | 96.198Β±0.053
0.03  | 94.397Β±0.058 | 94.423Β±0.059 | 94.456Β±0.059 | 94.500Β±0.060 | 94.547Β±0.061 | 94.593Β±0.061 | 94.668Β±0.061 | 94.698Β±0.062 | 94.448Β±0.066
MSCR  | -0.126Β±0.007 | -0.127Β±0.006 | -0.116Β±0.006 | -0.080Β±0.005 | -0.055Β±0.005 | -0.053Β±0.006 | -0.018Β±0.010 | 0.015Β±0.015 | 0.090Β±0.024

Table 3 (CIFAR-10 dataset):
Ξ΅test \ Ξ΅train | 0 | 0.01 | 0.02 | 0.03 | 0.05 | 0.07 | Ξ΅min | 0.15
0     | 91.681Β±0.304 | 91.932Β±0.318 | 91.917Β±0.417 | 91.311Β±0.470 | 90.428Β±0.427 | 88.645Β±0.454 | 86.051Β±0.800 | 81.989Β±0.855
0.01  | 91.453Β±0.338 | 91.900Β±0.311 | 91.964Β±0.408 | 91.351Β±0.474 | 90.472Β±0.421 | 88.678Β±0.463 | 86.085Β±0.798 | 81.995Β±0.858
0.02  | 90.675Β±0.429 | 91.527Β±0.338 | 91.868Β±0.428 | 91.459Β±0.442 | 90.577Β±0.420 | 88.817Β±0.457 | 86.158Β±0.763 | 82.082Β±0.844
0.03  | 89.181Β±0.595 | 91.606Β±0.385 | 91.606Β±0.462 | 91.479Β±0.422 | 90.690Β±0.403 | 88.983Β±0.421 | 86.306Β±0.766 | 82.165Β±0.838
0.05  | 84.062Β±1.033 | 89.273Β±0.611 | 89.273Β±0.570 | 90.530Β±0.483 | 90.832Β±0.346 | 89.513Β±0.362 | 86.745Β±0.737 | 82.540Β±0.802
0.07  | 76.706Β±1.396 | 84.086Β±0.999 | 84.086Β±0.831 | 87.303Β±0.672 | 90.065Β±0.324 | 89.907Β±0.332 | 87.322Β±0.621 | 83.051Β±0.789
Ξ΅min  | 59.261Β±1.868 | 67.534Β±1.814 | 67.534Β±1.695 | 74.137Β±1.367 | 83.784Β±0.714 | 88.260Β±0.374 | 88.194Β±0.418 | 84.181Β±0.666
0.15  | 37.458Β±2.034 | 42.765Β±2.071 | 42.765Β±2.286 | 49.685Β±1.991 | 65.285Β±1.842 | 77.257Β±1.083 | 86.130Β±0.433 | 85.352Β±0.511
MSCR  | -35.362Β±0.020 | -33.977Β±0.020 | -26.527Β±0.018 | -18.808Β±0.014 | -7.347Β±0.009 | -0.434Β±0.006 | 2.490Β±0.006 | 2.674Β±0.003

Figure 5 shows a comparison on the 2D dataset between the 1NN model and the RF model with regards to clean accuracy (Fig. 5a) and MSCR (Fig. 5b). Both models are trained on the various Ξ΅train values. While for the RF model, both metrics increase with increasing training noise up to the optimum of Ξ΅train = 0.007, the 1NN model shows constant (and superior) metrics up to this training noise. This illustrates the inherent robustness of the 1NN model. The comparison also shows that this inherent robustness is indeed advantageous regarding accuracy on our dataset.
Figures 6a (2D dataset) and 6b (CIFAR-10) display the accuracy-robustness-tradeoff for the models trained with different Ξ΅train by contrasting MSCR versus clean accuracy values. Both figures in principle show a tradeoff curve. However, it is visible that for Ξ΅train ≀ 0.007 on 2D data and Ξ΅train ≀ 0.01 on CIFAR-10, both clean accuracy and robustness increase compared to the baseline model with Ξ΅train = 0. The tradeoff is overcome for these models (arguably also for Ξ΅train = 0.01 for 2D data and Ξ΅train = 0.02 for CIFAR-10).

5. Discussion

5.1. Applicability of the MSCR metric
Our results from the experiments indicate that the relative difference between the noise-augmented robust accuracy and the clean accuracy is a measure for corruption robustness of models. For Ξ΅test = Ξ΅min in particular, this relative difference that we named MSCR steadily increases with higher corruption robustness of the RF model on 2D data and the wide residual network on CIFAR-10.
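As a sanity check of Equation 2 (our own illustration, not part of the paper), the MSCR row of Table 2 can be recomputed from its clean-accuracy row (Ξ΅test = 0) and its Ξ΅min row. Small last-digit deviations are expected because the reported MSCR is averaged per run rather than computed from the averaged accuracies.

# clean accuracies (eps_test = 0) and robust accuracies (eps_test = eps_min)
# for the nine models of Table 2, in percent:
acc_clean  = [99.531, 99.652, 99.699, 99.748, 99.784, 99.769, 99.504, 98.990, 97.347]
acc_robust = [99.405, 99.525, 99.583, 99.669, 99.729, 99.716, 99.486, 99.005, 97.435]

mscr = [100 * (r - c) / c for c, r in zip(acc_clean, acc_robust)]   # Equation 2, in percent
print([round(v, 3) for v in mscr])
# [-0.127, -0.127, -0.116, -0.079, -0.055, -0.053, -0.018, 0.015, 0.09]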
[Figure 5 plots, for the RF and the 1NN model, (a) clean accuracy [%] and (b) robustness (MSCR) over the training noise level Ξ΅train (0 to 0.03, including Ξ΅min).]
Figure 5: Model comparison on the 2D dataset with regards to clean accuracy and robustness (MSCR): RF versus 1NN model with different Ξ΅train.

[Figure 6 plots robustness (MSCR) over clean accuracy for (a) the 2D dataset and (b) the CIFAR-10 dataset, marking the baseline model (Ξ΅train = 0) and the models trained with increasing Ξ΅train.]
Figure 6: Accuracy-robustness-tradeoff for models trained with different levels of augmented training noise Ξ΅train, compared to the baseline model with Ξ΅train = 0. Models with both higher MSCR and higher clean accuracy (when the curve evolves towards the top right corner) contradict the inherent tradeoff.

This way, we verify the metric's capability to reflect the corruption robustness of different models. However, this claim is based on the assumption that increasing corruption robustness of our models can be generated through training with higher noise levels. This seems evident based on research by [10], but requires future validation like in [11], who confirm that their Gaussian robustness metric is strongly correlated with the popular physical corruptions benchmark by [8].
On the 2D dataset, the 1NN model shows a constant, superior MSCR value compared to the RF model for all Ξ΅train ≀ 0.007, where classes are still predominantly separated. This is the performance expected from an inherently robust model such as 1NN, which fits its decision boundary based on maximum class separation. The MSCR values are able to correctly display this interrelation.

5.2. Disadvantages and advantages of the MSCR metric
In our experiments, the steady robustness increase for higher Ξ΅train also holds for other levels of testing noise than Ξ΅min. The MSCR value, which uses Ξ΅min-corruptions as the underlying robustness requirement, is only one particular case of this robustness calculation approach. It has to be emphasized that from our results in Tables 2 and 3, we cannot observe any conspicuities for Ξ΅test ∼ Ξ΅min. For example, there is no indication that models perform well below this noise level while massively dropping off at higher noise levels, as could be presumed from the r-separation theory. It is therefore evident to conclude that measuring corruption robustness works with other Ξ΅test-values. In practice, if specific corruptions are known
for an application, those corruptions should also be used The suggestion that some πœ€π‘‘π‘Ÿπ‘Žπ‘–π‘› > 0 leads to higher for testing, e.g. through benchmarks [8]. clean accuracy than πœ€π‘‘π‘Ÿπ‘Žπ‘–π‘› = 0 has theoretical relevance. However, we emphasize that the MSCR metric is ad- It supports the claim made, but not practically proven by vantageous in two ways: First, it does not require prior [3], that accuracy and robustness are not in an inherent physical knowledge to define corruption distributions, tradeoff as long as the noise level πœ€ fulfills Equation 1. like e.g. [8] does. Instead, it only requires measuring the The result also seems relevant from a practical per- actual class separation from any classification dataset. spective, since developers may try some πœ€π‘‘π‘Ÿπ‘Žπ‘–π‘› for training Second, the MSCR can be interpreted with a clear contex- data augmentation, which increases robustness without tual meaning, since the robustness requirement is derived drawbacks regarding accuracy. We emphasize that this from the dataset: It measures β€œthe theoretically avoidable practical implication is only valid for the very limited loss (or win) of accuracy due to statistical corruptions”. model architectures, datasets and augmentation distribu- tions we tested. For example, our experiments show that 5.3. On achieving high MSCR values noise training below πœ€π‘šπ‘–π‘› has no effect on an inherently robust model such as 1NN. This is due to the fact that this Clearly, avoiding any loss of accuracy on πœ€π‘šπ‘–π‘› -noise is model type maximizes the class separation of its decision hard to achieve in practice on high-dimensional data. For boundary in training anyways. CIFAR-10, 𝑀𝑆𝐢𝑅 = 0 can be achieved, but only with On the one hand, overcoming the tradeoff for small πœ€π‘‘π‘Ÿπ‘Žπ‘–π‘› = 0.07, where the clean accuracy declines by 3 per- πœ€π‘‘π‘Ÿπ‘Žπ‘–π‘› is not entirely surprising, since it is well known centage points compared to πœ€π‘‘π‘Ÿπ‘Žπ‘–π‘› = 0. We also verify our that data transformations and data augmentations can conjecture that 𝑀𝑆𝐢𝑅 > 0 is possible for some robust increase generalization of models (in fact, we also used trained models. For this behavior, we find the discovery random flips and crops for CIFAR-10 training). [11] and in [23] a convincing technical explanation. Misclassified [20] also manage to overcome the tradeoff with more ad- data points tend to lie closer to the decision boundary vanced training methods. On the other hand, our results than correctly classified data points. The data augmen- are surprising considering this drawback-free increase in tations on a misclassified data point therefore have a robust accuracy is quite significant for the RF model on high chance of causing a favorable class change. At the 2D data (less than halving the classification error). Also, same time, data augmentations on correctly classified uniform 𝐿∞ data augmentation is a very simple method points have a lower chance of causing an unfavorable and less contextually relevant compared to physically class change when their distance to the decision bound- derived augmentations. An explanation may be that the ary is high, which is what a robust model is trained for. uniform 𝐿𝑝 -norm noise allows a stricter coverage of the input parameter space near data points compared to phys- 5.4. The accuracy-robustness-tradeoff ical data augmentations, enforcing a smooth model that is less prone to overfitting the corruptions. 
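For the image setting, the Ξ΅train augmentation could be plugged into a standard training pipeline as an extra transform. The sketch below is our own illustration of that idea, not the authors' code; in particular, clipping the corrupted pixels back to [0, 1] is an assumption on our side, consistent with the normalization described in section 3.2 but not stated in the paper.

import torch

class UniformLinfNoise:
    """Add per-pixel uniform noise in [-eps, eps] and clip to the valid [0, 1] range."""

    def __init__(self, eps: float):
        self.eps = eps

    def __call__(self, img: torch.Tensor) -> torch.Tensor:
        noise = torch.empty_like(img).uniform_(-self.eps, self.eps)
        return torch.clamp(img + noise, 0.0, 1.0)   # clipping is our assumption

# typical CIFAR-10 usage after the flips/crops mentioned in section 3.2, using torchvision:
# transforms.Compose([transforms.RandomHorizontalFlip(),
#                     transforms.RandomCrop(32, padding=4),
#                     transforms.ToTensor(),
#                     UniformLinfNoise(eps=0.01)])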
Besides our investigation of the MSCR metric, we report on findings regarding the tradeoff between accuracy and corruption robustness. For both 2D and CIFAR-10 datasets we observe higher clean and robust accuracy on any test noise when training a model with a specific level of uniform noise (Ξ΅train = 0.007 for 2D, Ξ΅train = 0.01 for CIFAR-10), compared to standard training. For the 2D data, this optimum Ξ΅train value is even higher than Ξ΅min, the value which the r-separation theory suggests to be beneficial for robustness while not hurting accuracy. This could be due to the major proportion of minimal distances of data points to other classes being significantly bigger than Ξ΅min. Our results are statistically significant for the 2D dataset experiment. For 20 runs per trained model on CIFAR-10, we emphasize that claiming higher mean clean accuracy for any Ξ΅train > 0 compared to Ξ΅train = 0 does not achieve 95%-confidence in a pairwise statistical comparison. More than 20 runs are necessary to obtain statistically significant results, which we could not achieve due to limited computational resources. Hence, we only treat our results on CIFAR-10 regarding the accuracy-robustness-tradeoff as suggestions.

5.5. Class separation distance for model training
From our results we also need to conclude that in practice, the Ξ΅min value has only limited expressiveness when trying to find the optimal Ξ΅train with regards to (robust) accuracy. This is visible in Figures 6a and 6b, where based solely on the r-separation theory, we may have expected the curve to reverse its trend along the x-axis when Ξ΅train = Ξ΅min. In reality, the best overall accuracy for the 2D data is achieved for Ξ΅train ∼ 2Β·Ξ΅min, while on CIFAR-10 it is achieved for Ξ΅train < Ξ΅min/5. We suspect that high-dimensional datasets are notoriously hard to train with regards to high robust accuracy, at least for such Ξ΅ levels as their high L∞ class separation distance Ξ΅min inevitably entails. We suspect that on other datasets Ξ΅min may be even greater and further away from the optimum Ξ΅train. Additional research is needed on various distance measures, dataset dimensions and model types in order to utilize class separation distances for optimizing robust accuracy.

5.6. Optima of Ξ΅train vs. Ξ΅test
Another interesting finding from the accuracy matrix of both datasets is that the best Ξ΅train value for models evaluated with certain Ξ΅test deviates from the expected diagonal. For example, Ξ΅train = 0.03 is not the best choice to prepare for Ξ΅test = 0.03. In Figure 7, the accuracy matrix for CIFAR-10 from Table 3 is visualized in a 3D plot, which shows how the optima in (robust) accuracy deviate from the diagonal. It appears that for low noise levels the best choice is Ξ΅train > Ξ΅test, while for higher noise levels Ξ΅train < Ξ΅test is more favorable. This suspected dependency needs further investigation.

Figure 7: CIFAR-10 (robust) accuracies for different Ξ΅train and Ξ΅test. The optima, marked with points, deviate from the diagonal (white line where Ξ΅train = Ξ΅test): towards higher Ξ΅train for lower noise levels and towards lower Ξ΅train for higher noise levels.
6. Conclusion
In this article we evaluated a data augmentation method in order to obtain a comparable, interpretable measure of corruption robustness for classifiers. We measured the relative difference between the robust accuracy on corrupted test data and the clean accuracy. We proposed to use half the minimal class separation distance measured from the dataset as the maximum distance Ξ΅min of the augmented test noise. This robustness requirement does not presume any prior knowledge about real corruption distances. It theoretically allows a classifier to be fully robust while not losing accuracy. The class separation distance therefore gives our metric a distinct meaning: It represents any "avoidable" loss (or win) in accuracy due to corruptions. We experimentally showed that our metric is able to reflect various degrees of model robustness.
From training classifiers with different levels of noise we found that classifiers with the highest robust accuracy on a certain level of noise are not strictly those that are trained on this same level of noise. We also presented indications that a tradeoff between accuracy and corruption robustness is not inherent: In our experiments, simple augmentation training on significant random uniform noise could improve test accuracy of classifiers additionally to their robustness, compared with normal training. However, the minimal class separation distance could in practice not guide us towards the optimal values of training noise. These findings regarding the accuracy-robustness-tradeoff could in our opinion be useful in practice.
Our work seems to fit into a gap between those researchers optimizing test accuracy and those optimizing robustness. Our future work will include further investigations of data augmentation training and testing using other dataset types, distance metrics and corruption distributions. It would be of additional interest whether some increase in adversarial robustness can be obtained without losing accuracy. Our findings emphasize the potential and encourage the development of advanced training procedures mitigating the accuracy-robustness-tradeoff, since the combination of both properties is essential from a risk assessment perspective.

References
[1] G. Siedel, S. Voß, S. Vock, An overview of the research landscape in the field of safe machine learning, in: Volume 13: Safety Engineering, Risk, and Reliability Analysis; Research Posters, American Society of Mechanical Engineers, 2021. doi:10.1115/IMECE2021-69390.
[2] T.-W. Weng, H. Zhang, P.-Y. Chen, J. Yi, D. Su, Y. Gao, C.-J. Hsieh, L. Daniel, Evaluating the robustness of neural networks: An extreme value theory approach, International Conference on Learning Representations (ICLR) (2018).
[3] Y.-Y. Yang, C. Rashtchian, H. Zhang, R. R. Salakhutdinov, K. Chaudhuri, A closer look at accuracy vs. robustness, Advances in neural information processing systems 33 (2020) 8588–8601.
[4] X. Zhao, W. Huang, V. Bharti, Y. Dong, V. Cox, A. Banks, S. Wang, S. Schewe, X. Huang, Reliability assessment and safety arguments for machine learning components in assuring learning-enabled autonomous systems, arXiv preprint arXiv:2112.00646 (2021).
P. S. Liang, Unlabeled data improves adversarial robustness, Advances in neural information processing systems 32 (2019).
[17] J. Cohen, E. Rosenfeld, Z. Kolter, Certified adversarial robustness via randomized smoothing, in: In-
ternational Conference on Machine Learning, 2019, [5] Deutsches Institut fΓΌr Normung, Din spec 92001- pp. 1310–1320. 2: Artificial intelligence – life cycle processes and [18] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, quality requirements: Part 2: Robustness, 2020. A. Vladu, Towards deep learning models resistant [6] A. Fawzi, O. Fawzi, P. Frossard, Analysis of classi- to adversarial attacks, International Conference on fiers’ robustness to adversarial perturbations, Ma- Learning Representations (ICLR) (2018). chine learning 107 (2018) 481–508. [19] H. Zhang, Y. Yu, J. Jiao, E. Xing, L. El Ghaoui, M. Jor- [7] J. Gilmer, N. Ford, N. Carlini, E. Cubuk, Adversarial dan, Theoretically principled trade-off between examples are a natural consequence of test error robustness and accuracy, in: International Confer- in noise, in: International Conference on Machine ence on Machine Learning, 2019, pp. 7472–7482. Learning, 2019, pp. 2280–2289. [20] D. Hendrycks, N. Mu, E. D. Cubuk, B. Zoph, [8] D. Hendrycks, T. Dietterich, Benchmarking neural J. Gilmer, B. Lakshminarayanan, Augmix: A simple network robustness to common corruptions and data processing method to improve robustness and perturbations, International Conference on Learn- uncertainty, International Conference on Learning ing Representations (ICLR) (2019). Representations (ICLR) (2020). [9] E. Rusak, L. Schott, R. S. Zimmermann, J. Bitterwolf, [21] A. Raghunathan, S. M. Xie, F. Yang, J. Duchi, O. Bringmann, M. Bethge, W. Brendel, A simple P. Liang, Understanding and mitigating the trade- way to make neural networks robust against diverse off between robustness and accuracy, International image corruptions, in: European Conference on Conference on Machine Learning (ICML) (2020). Computer Vision, 2020, pp. 53–69. [22] D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, [10] B. Wang, S. Webb, T. Rainforth, Statistically robust A. Madry, Robustness may be at odds with accu- neural network classification, in: Uncertainty in racy, International Conference on Learning Repre- Artificial Intelligence (UAI), 2021, pp. 1735–1745. sentations (ICLR) (2019). [11] R. G. Lopes, D. Yin, B. Poole, J. Gilmer, E. D. Cubuk, [23] D. Mickisch, F. Assion, F. Greßner, W. GΓΌnther, Improving robustness without sacrificing accuracy M. Motta, Understanding the decision boundary of with patch gaussian augmentation, arXiv preprint deep neural networks: An empirical study, arXiv arXiv:1906.02611 (2019). preprint arXiv:2002.01810 (2020). [12] C. Paterson, H. Wu, J. Grese, R. Calinescu, C. S. Pasareanu, C. Barrett, Deepcert: Verification of contextually relevant robustness for neural network image classifiers, in: International Conference on Computer Safety, Reliability, and Security, 2021, pp. 3–17. [13] O. Molokovich, A. Morozov, N. Yusupova, K. Jan- schek, Evaluation of graphic data corruptions im- pact on artificial intelligence applications, in: IOP Conference Series: Materials Science and Engineer- ing, volume 1069, 2021, p. 012010. [14] P. Schwerdtner, F. Greßner, N. Kapoor, F. Assion, R. Sass, W. GΓΌnther, F. HΓΌger, P. Schlicht, Risk as- sessment for machine learning models, NeurIPS 2020 Virtual Workshop: Machine Learning for Au- tonomous Driving (2020). URL: https://arxiv.org/ pdf/2011.04328.pdf. [15] L. Weng, P.-Y. Chen, L. Nguyen, M. Squillante, A. Boopathy, I. Oseledets, L. Daniel, Proven: Veri- fying robustness of neural networks with a proba- bilistic approach, in: International Conference on Machine Learning, 2019, pp. 6727–6736. [16] Y. Carmon, A. 
Raghunathan, L. Schmidt, J. C. Duchi,