Beyond Test Accuracy: The Effects of Model Compression on CNNs

Adrian Schwaiger, Kristian Schwienbacher, Karsten Roscher
Fraunhofer IKS
{firstname.lastname}@iks.fraunhofer.de

Copyright © 2022 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Model compression is widely employed to deploy convolutional neural networks on devices with limited computational resources or power limitations. For high-stakes applications, such as autonomous driving, it is, however, important that compression techniques do not impair the safety of the system. In this paper, we therefore investigate the changes introduced by three compression methods – post-training quantization, global unstructured pruning, and the combination of both – that go beyond the test accuracy. To this end, we trained three image classifiers on two datasets and compared them regarding their performance on the class level and regarding their attention to different input regions. Although the deviations in test accuracy were minimal, our results show that the considered compression techniques introduce substantial changes to the models that are reflected in the quality of the predictions of individual classes and in the salience of input regions. While we did not observe the introduction of systematic errors or biases towards certain classes, these changes can significantly impact the failure modes of CNNs and are thus highly relevant for safety analyses. We therefore conclude that it is important to be aware of the changes caused by model compression and to consider them already in the early stages of the development process.

1 Introduction

Deep Neural Networks (DNNs) enable many complex applications such as autonomous vehicles or automated manufacturing processes. Especially for perception tasks, Convolutional Neural Networks (CNNs) have shown impressive results and have been adopted widely. However, to achieve a high degree of performance, these networks often have millions of parameters that require significant computing power for inference, impeding the deployment on edge or low-power mobile devices (Cheng et al. 2018). One way to approach this problem is to compress the models, e.g., via pruning – i.e., removing parts of the network with a low contribution to the predictions – or quantization – i.e., reducing the number of bits required for each parameter. These methods reduce the memory footprint, increase computational efficiency, and in turn also decrease the power demands, enabling the deployment of DNNs on low-power devices.

Compressing models using pruning or quantization techniques can significantly reduce their size without severely impacting the overall performance regarding test accuracy. However, especially for safety-critical applications, test accuracy on its own is not sufficient, and underlying negative effects, e.g., on the long tail of data distributions, have been shown (Hooker et al. 2021). Furthermore, since model compression is often not explicitly addressed in development and assurance frameworks, such as Assurance of Machine Learning for use in Autonomous Systems (AMLAS) (Hawkins et al. 2021), the introduction of additional failure modes by compression techniques might lead to additional development effort if the effects are considered too late during the process, or in the worst case might lead to failures during operations if not considered at all. To this end, in this paper we investigate the effects of model compression when using global unstructured pruning, post-training quantization, and their combination. We thereby aim to provide insights into the question of what changes occur on a deeper level, to which extent, and how they could potentially impact efforts towards arguing the safety of the system, by making the following contributions:

• We investigate the effects of model compression on the class and sample level regarding the predictive quality of three different models on two datasets.

• Additionally, for model pruning we investigate the effects on the attention of the models compared to the initial models by analyzing their saliency maps.
2 Related Work

In the following, we present the related work regarding CNN compression and its relevance towards arguing the safety of ML-based systems.

2.1 Pruning

Model pruning is a common technique for various ML algorithms, such as decision trees (Mingers 1989) and inductive logic programming (Kazmi, Schüller, and Saygin 2017), not only for compression but also to improve generalization capabilities. For neural networks, pruning is not a novel idea (LeCun, Denker, and Solla 1989), but it has gained interest in recent years due to the increased popularity of neural networks and the need to deploy them on computationally restricted devices (Cheng et al. 2018). Pruning generally can be performed either in a structured (He et al. 2018) or unstructured (Han, Mao, and Dally 2016) manner. The former removes – based on a norm for scoring the importance of the individual elements – connected groups of parameters, e.g., on a per-channel or per-filter basis. The structured approach therefore not only provides improvements regarding memory usage, but also reduces inference times on regular hardware. Compared to structured pruning, unstructured pruning removes individual parameters, allowing model sizes to be decreased significantly more while retaining test accuracy. Since only individual parameters are removed, the overall network structure does not change and sparsity is introduced. Therefore, specialized hardware is required to benefit from inference speedups beyond the improvements in memory requirements (Luo and Wu 2020). ML frameworks, such as PyTorch or TensorFlow, come with implementations for the most common pruning techniques. Beyond that, research continues in this domain: for instance, AutoPruner (Luo and Wu 2020) improves significantly upon the state of the art by combining pruning and fine-tuning steps, EagleEye (Li et al. 2020) proposes an efficient evaluation strategy to identify the best performing subnetworks as pruning candidates, and with ShrinkBench (Blalock et al. 2020) a benchmarking framework has been proposed to facilitate the comparison of pruning techniques.
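As a concrete example of this built-in support, the following minimal sketch applies global unstructured pruning with PyTorch's torch.nn.utils.prune module, scoring parameters by their L1 magnitude. The ResNet-18 stand-in and the 70% pruning ratio are illustrative assumptions, not prescriptions from the works cited above.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune
from torchvision.models import resnet18  # stand-in model for illustration

model = resnet18(num_classes=43)

# Collect every weight tensor of the conv/linear layers as pruning targets.
parameters_to_prune = [
    (module, "weight")
    for module in model.modules()
    if isinstance(module, (nn.Conv2d, nn.Linear))
]

# Remove the 70% of parameters with the smallest absolute value,
# scored globally across all layers (L1 norm).
prune.global_unstructured(
    parameters_to_prune,
    pruning_method=prune.L1Unstructured,
    amount=0.7,
)

# Make the pruning permanent: drop the mask re-parametrization and keep
# the masked weights as the layers' plain weight tensors.
for module, name in parameters_to_prune:
    prune.remove(module, name)
```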
2.2 Quantization

Another widespread model compression technique is quantization, which aims to reduce the number of bits required to represent the parameters of a DNN. DNNs are usually trained on hardware accelerators, such as GPUs or TPUs, that use floating-point representations, usually 32-bit or 16-bit, for the parameters. A common technique is to quantize the parameters to 8-bit integers, effectively reducing the model size by a factor of 4 or 2, respectively, and allowing the exploitation of 8-bit optimized computations on mobile CPUs, while having minimal impact on the model performance (Wu et al. 2016). In practice, post-training quantization and quantization-aware training are common. With post-training quantization, the parameters of a model are quantized after the training phase without requiring any fine-tuning. In contrast, quantization-aware training models the parameter quantization during training and is able to achieve even lower bit-widths. As with pruning, both variants have implementations in common ML frameworks. Research in this domain focuses on achieving lower bit-widths while only minimally impacting the performance of models (Banner, Nahshan, and Soudry 2019; Hubara et al. 2018), or on further simplifying the quantization process, e.g., by eliminating the need for calibration data (Cai et al. 2020).
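To make the idea of reducing bit-widths concrete, the following sketch implements a simplified symmetric uniform quantizer for a single weight tensor. The function names and the per-tensor symmetric scheme are our own simplifications for illustration; practical schemes, including per-channel variants, calibrate scales (and zero-points) more carefully.

```python
import torch

def quantize_symmetric(x: torch.Tensor, bits: int = 8):
    """Uniformly quantize a float tensor to signed integers with `bits` precision."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    # Map the largest magnitude in the tensor to qmax (clamped to avoid
    # division by zero for all-zero tensors).
    scale = x.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), qmin, qmax).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Reconstruct an approximation of the original float values.
    return q.float() * scale

w = torch.randn(64, 3, 3, 3)  # e.g., a conv weight tensor
q, scale = quantize_symmetric(w, bits=8)
# The reconstruction error is bounded by roughly scale / 2 per element.
print((w - dequantize(q, scale)).abs().max())
```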
2.3 Further Model Compression Techniques

Besides pruning and quantization, other model compression techniques have been proposed. For instance, N2N learning (Ashok et al. 2018) removes parts of a network and afterwards shrinks them using a reinforcement learning approach. With knowledge distillation, one or more networks are trained to serve as teacher models which a smaller student model is trained to mimic (Hinton, Vinyals, and Dean 2015). Approaches based on low-rank factorization, such as (Swaminathan et al. 2020), use matrix decomposition to reconstruct linear transformations of a network into counterparts with less redundancy and therefore fewer parameters. Lastly, although not necessarily a compression technique, neural architecture search can be utilized to find efficient architectures, as is done, e.g., in MnasNet (Tan et al. 2019), which optimizes towards the real-world inference latency of DNNs.

2.4 Effects of Model Compression on Robustness

Compressed models have been extensively studied regarding their robustness against adversarial attacks. For instance, (Bernhard, Moellic, and Dutertre 2019) concluded that post-training quantization and quantization-aware training slightly improve the robustness of a network against attacks. Similarly, (Duncan et al. 2020) found that quantization can reduce the transferability of adversarial examples by up to 50%. The adversarially trained model compression framework (Gui et al. 2019) incorporates objectives regarding adversarial robustness in the compression process to further improve upon it. Apart from adversarial examples, some research has been conducted regarding other aspects of robustness. For instance, (Ferianc et al. 2021) demonstrated that a uniform quantization scheme does not considerably impact the quality of uncertainty quantification in Bayesian neural networks. (Hooker et al. 2021) studied the effects of model compression beyond test accuracy and found that a small subset of the data is systematically more impacted and that the sensitivity towards distributional shifts correlates significantly with model sparsity.

2.5 Safety Assurance for ML-based Systems

Arguing the safety of ML-based systems is an emerging field and highly relevant to enable the use of ML in safety-critical applications. One promising direction is holistic assurance strategies (Burton, Gauerhof, and Heinzemann 2017) that incorporate an analysis of the operational domain and the system, as well as a sound validation and verification strategy, to design confidence arguments that provide evidence towards the safety of the system (Burton et al. 2019). The approach itself is domain agnostic and has so far been applied to, e.g., the automotive (Burton et al. 2021a) and medical (Picardi et al. 2019) domains. While it provides a general framework towards arguing the safety of ML-based systems, and frameworks such as AMLAS (Hawkins et al. 2021) provide additional guidance, further research regarding the design of safe ML algorithms and effective testing methods is required to provide sufficient evidence for the assurance case.

3 Evaluation

In this section, we discuss our results and observed findings regarding the changes beyond test accuracy introduced when compressing image classifiers with pruning or quantization techniques.

3.1 Design of Experiments

To analyze the influence of the model architecture, we considered three different networks: a ResNet-18 (~11m parameters) (He et al. 2016) for its widespread usage, a SqueezeNet (~750k parameters) (Iandola et al. 2016) for its computational efficiency, and a LeNet-5 (~62k parameters) (Lecun et al. 1998) for its small size. We trained the models on CIFAR-10 (Krizhevsky 2009) and the German Traffic Sign Recognition Benchmark (GTSRB) (Stallkamp et al. 2011). CIFAR-10 consists of 60,000 32x32px images, equally divided into 10 classes, e.g., cat, dog, automobile, or ship. GTSRB contains 51,839 images of 43 different German traffic signs that we rescaled to 32x32px. The distribution of the traffic signs is imbalanced, with the most frequent sign, Speed limit (50 km/h), occurring more than 10 times as often as the least frequent one, Dangerous curve to the left. The class imbalance within GTSRB allows us to study whether any negative biases towards the underrepresented classes are introduced by the model compression techniques.

We trained each model by minimizing the negative log-likelihood using Adam as the optimizer. To prevent overfitting, we stopped the training after the loss did not decrease for 40 epochs. To improve the base accuracy on CIFAR-10, we transformed each image at each epoch by randomly flipping it horizontally and by randomly cropping it back to 32x32px after adding a 4px padding on each side.
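A minimal sketch of this training setup in PyTorch is shown below. The batch size is an assumption, the cross-entropy loss is used as the equivalent of minimizing the negative log-likelihood over softmax outputs, and we monitor the epoch training loss for early stopping since the text does not state which loss is tracked.

```python
import torch
from torch import nn, optim
from torchvision import datasets, transforms
from torchvision.models import resnet18

# Augmentation as described: random horizontal flip and a random 32x32 crop
# after 4px padding (used for CIFAR-10 only).
train_tf = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),
    transforms.ToTensor(),
])
train_set = datasets.CIFAR10("data", train=True, download=True, transform=train_tf)
loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

model = resnet18(num_classes=10)          # one of the three architectures
optimizer = optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()         # negative log-likelihood of softmax outputs

best_loss, epochs_without_improvement = float("inf"), 0
while epochs_without_improvement < 40:    # stop after 40 epochs without improvement
    epoch_loss = 0.0
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    if epoch_loss < best_loss:
        best_loss, epochs_without_improvement = epoch_loss, 0
    else:
        epochs_without_improvement += 1
```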
To compress the models, we used the implementations for pruning and quantization provided by the ML framework PyTorch. We chose global unstructured pruning, using the L1 norm to score the parameters of the model, whereby the lowest-scored ones are removed. We applied no subsequent fine-tuning, as this yielded the best results in our experiments. Compared to structured pruning, unstructured pruning is less applicable in practice, as without sparse tensor computations it only affects the memory requirements of the model. However, unstructured pruning is widely considered in academia (LeCun, Denker, and Solla 1989; Renda, Frankle, and Carbin 2019) and, with improvements in sparse tensor support on embedded hardware, might become the predominant method for practical applications in the future. After the training phase, we pruned each model with the target of maximizing the number of dropped connections while maintaining a level of accuracy comparable to the original model.

For quantization, we chose a non-intrusive post-training approach with per-channel bit allocation, as it gave the best results in our experiments. We chose to quantize the weights of all models once to 8-bit and once to 4-bit. The first case enables the utilization of integer-based hardware accelerators, while the second one would require specialized hardware to gain additional benefits, apart from increased memory efficiency, compared to the 8-bit variant. The activation precision was kept at 8-bit for both cases, as values below that severely impacted the accuracy of the models. Finally, we also combined both compression approaches by quantizing the pruned models with 8-bit precision for weights and activations. Table 1 lists all models and their compressed variants, stating their test accuracy and memory footprint. We do not provide a measure of the inference time, as it greatly depends on the execution platform, e.g., whether it can exploit the sparseness of the pruned models or whether it is optimized towards floating point or integer computations.

Architecture | Dataset  | Percentage Pruned | Uncompressed Accuracy | Pruning Acc. Diff. | 4-bit Quant. Acc. Diff. | 8-bit Quant. Acc. Diff. | Pruning + 8-bit Quant. Acc. Diff.
LeNet        | GTSRB    | 54.8%             | 92.3%                 | -0.4pp             | -0.4pp                  | -0.2pp                  | -0.7pp
LeNet        | CIFAR-10 | 39.8%             | 74.8%                 | -0.5pp             | *-7.9pp                 | -0.1pp                  | -0.6pp
SqueezeNet   | GTSRB    | 49.4%             | 93.0%                 | -0.8pp             | *-2.2pp                 | -0.2pp                  | -0.8pp
SqueezeNet   | CIFAR-10 | 49.4%             | 84.5%                 | -0.4pp             | *-4.0pp                 | 0pp                     | -0.4pp
ResNet-18    | GTSRB    | 67.4%             | 95.4%                 | 0pp                | -0.2pp                  | -0.1pp                  | -0.1pp
ResNet-18    | CIFAR-10 | 72.4%             | 86.5%                 | 0pp                | -0.9pp                  | -0.1pp                  | -0.2pp

Table 1: Accuracies on the test dataset for each model and its compressed variants. The column Percentage Pruned states how much of the respective network was removed and only applies to the Pruning and Pruning + Quantization (8-bit) variants. Significant deviations in accuracy of more than 1pp from the uncompressed model are marked with an asterisk.
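The following sketch outlines how such a non-intrusive 8-bit post-training quantization could be realized with PyTorch's eager-mode API, assuming a simple sequential stand-in model and random calibration data; architectures with, e.g., residual additions would need further adaptations (module fusion, FloatFunctional).

```python
import torch
import torch.nn as nn

# A small sequential CNN stands in for a trained float model.
float_model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2), nn.Flatten(),
    nn.Linear(16 * 16 * 16, 10),
)
float_model.eval()

# QuantWrapper adds the quant/dequant entry and exit points required by the
# eager-mode API; the 'fbgemm' qconfig uses per-channel 8-bit weights and
# 8-bit activations.
model = torch.quantization.QuantWrapper(float_model)
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")
prepared = torch.quantization.prepare(model)

# Calibration: run a few batches of representative data through the inserted
# observers to determine activation ranges (random data only as a stand-in).
with torch.no_grad():
    for _ in range(8):
        prepared(torch.randn(32, 3, 32, 32))

quantized = torch.quantization.convert(prepared)  # int8 weights/activations
```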
For the evaluation, we additionally generated saliency maps by computing the gradients for each input pixel with respect to the target class and normalizing them to the range [0; 1], following (Simonyan, Vedaldi, and Zisserman 2014). Since PyTorch does not support gradient calculation for quantized tensors, we only generated saliency maps for the original and pruned variants of the models.

3.2 Results and Discussion

Table 1 shows the test accuracies of all configurations. Most configurations show only a slight drop in accuracy of less than 1 percentage point (pp), with the exception of some networks quantized with 4-bit weight precision. These are not considered further in the following sections, as their substantial drop in accuracy already implies significant changes.

3.3 Changes at the Class Level

Pruning. The accuracy on the entire test dataset did not reduce significantly after applying pruning for most configurations, as Table 1 shows. However, we observe significant changes at the class level for many configurations, especially for GTSRB. For instance, Figure 1 shows the difference between the confusion matrices for the original and pruned ResNet-18 on GTSRB. While the test accuracy stayed the same, the accuracies and confusions of a few individual classes change significantly. As an example, class 0 (Speed limit 20km/h) is confused ~8pp more often with class 1 (Speed limit 30km/h), but on the other hand, class 40 (Roundabout mandatory) is mistaken ~11pp less often for class 37 (Go straight or left). Furthermore, class 40 is predicted more accurately by ~11pp, whereas the accuracy of class 0 drops by ~10pp. Many more classes show changes in accuracy or confusion in the range of up to 4pp that – depending on the concrete application – might also be relevant.

Figure 1: Difference (in percentage points) between the confusion matrices for the uncompressed and pruned ResNet-18 trained on GTSRB. Targets and predictions are ordered by the frequency of the respective class, with class 2 being the most frequent and class 0 being the most infrequent one.

Upon closer inspection, we find that for class 40 the correct predictions of the uncompressed model are a proper subset of those of the pruned network. For the remaining samples that were only predicted correctly by the pruned model, we find that the confidence is between 18pp and 24pp higher for the pruned model. This means that for these particular samples there is a significant difference in how much support each of the networks generates, and the increase in correct predictions for this class by the pruned model is not just based on slight differences. Figure 2 summarizes the difference in confidence for the predicted class between the uncompressed and pruned ResNet-18. While the vast majority of predictions show a similar (≤ 2.5pp) confidence, for some samples the confidence changes significantly, by up to 27.5pp. Although this only affects a small subset of all samples, and the overall share of samples that are predicted differently is small at 0.6%, it is something to be aware of, since it might have been caused by the introduction of additional failure modes.

Figure 2: Difference (in percentage points) between the confidence in the target class for the uncompressed and pruned ResNet-18 trained on GTSRB. The upper plot shows the difference where the predictions of the two networks differ, the lower one where they are equal. In parentheses we state the number of samples underlying the respective diagram, e.g., the n = 77 in the upper plot shows that for 77 samples the uncompressed and pruned model yielded a different classification result.

Depending on the stage of the system development at which model compression is considered, this might have several implications. In the worst case, if model compression is performed immediately prior to deployment without an extensive verification phase afterwards, these failure modes are not addressed, potentially leading to system failures during operations. But even in cases where it is considered before model verification, it can significantly impact the development. As additional failure modes must be met with proper mitigation measures – e.g., in the form of safety monitors or considerations regarding the operational domain – the development process can be prolonged if model compression is not considered as an integral part of the system development.
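For illustration, the sketch below shows how per-sample confidence differences of the kind summarized in Figure 2 (and Figure 3 below) could be computed; the function and the use of softmax confidences over logits are our assumptions, not the paper's exact evaluation code.

```python
import torch

@torch.no_grad()
def confidence_differences(model_a, model_b, loader):
    """Per-sample difference (in pp) of the softmax confidence in the target
    class between two models (assumed to be in eval mode), split by whether
    their predictions agree."""
    agree, disagree = [], []
    for x, y in loader:
        conf_a = torch.softmax(model_a(x), dim=1)
        conf_b = torch.softmax(model_b(x), dim=1)
        idx = torch.arange(len(y))
        diff = (conf_a[idx, y] - conf_b[idx, y]).abs() * 100  # percentage points
        same = conf_a.argmax(1) == conf_b.argmax(1)
        agree.append(diff[same])
        disagree.append(diff[~same])
    return torch.cat(agree), torch.cat(disagree)
```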
For GTSRB, the overall effects regarding pruning are similar but more pronounced for LeNet and SqueezeNet compared to the ResNet-18. Here, for some classes the change in confusion or accuracy is somewhat more noticeable, with up to 15pp, and the proportion of samples that are predicted differently is higher, at 3.5% and 3% respectively. Also, the mean difference in the confidence for each sample is significantly increased, even to the extent where for a small fraction of samples one variant generates full support for the target class and the other one none, as Figure 3 highlights. The increase in the observed effects is likely due to the smaller initial sizes of LeNet and SqueezeNet compared to the ResNet-18. Although 72% of the ResNet-18's connections were pruned, it still has more than 7 times (SqueezeNet) and 89 times (LeNet) the number of parameters, potentially still containing redundant features.

Figure 3: Difference between the confidence in the target class for the uncompressed and pruned SqueezeNet trained on GTSRB. The upper plot shows the difference where the predictions of the two networks differ, the lower one where they are equal.

One important finding on the imbalanced GTSRB is that pruning did not introduce significant biases against the infrequent classes for any of the networks. This is also evident when considering the diagonal in Figure 1, where no correlation between the change in accuracy and the frequency of the class is present. It is worth mentioning that the significant drop in accuracy for the most infrequent class 0 is only present for ResNet-18; it does not occur for the other networks.

On CIFAR-10, the overall effects of pruning are similar to GTSRB but significantly less pronounced, as Figure 4 shows. For SqueezeNet and LeNet, the change in class-wise accuracies does not exceed 2.5pp and 4pp respectively. However, for these two networks it should be noted that the overall share of samples that are predicted differently by the uncompressed and pruned variant is higher, at 4.2% and 7.2% respectively. Referring to Figure 4, this effect can be explained as a result of previously wrong predictions that, after applying pruning to the network, are still predicted incorrectly but towards another class. Here, it is worth highlighting that this effect is most prominent for the classes airplane and horse, with the former being predicted less often and the latter more often overall by the pruned network, therefore introducing a slight bias against or towards these classes respectively.

Figure 4: Difference between the confusion matrices for the uncompressed and pruned LeNet trained on CIFAR-10.
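A sketch of how the confusion-matrix differences visualized in Figures 1, 4, and 5 could be computed is given below; the array names and the sign convention (compressed minus uncompressed) are illustrative assumptions.

```python
import numpy as np

def confusion_pct(targets: np.ndarray, preds: np.ndarray, n_classes: int) -> np.ndarray:
    """Row-normalized confusion matrix in percent: entry [t, p] is the share of
    samples of target class t predicted as class p (assumes every class occurs
    at least once in `targets`)."""
    cm = np.zeros((n_classes, n_classes))
    np.add.at(cm, (targets, preds), 1)
    return 100 * cm / cm.sum(axis=1, keepdims=True)

# Difference in percentage points between compressed and uncompressed model:
# positive off-diagonal entries indicate increased confusion, positive
# diagonal entries indicate increased class-wise accuracy.
# y_true, y_pred_pruned, y_pred_orig are hypothetical arrays of test labels
# and the two models' predictions.
diff = confusion_pct(y_true, y_pred_pruned, 43) - confusion_pct(y_true, y_pred_orig, 43)
```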
Overall, the increase in the intra-class confusion compared to GTSRB can likely be attributed to the different complexity of the tasks. While GTSRB has only very limited inter-class variance – a Speed limit (20km/h) sign always has the same shape and surface, and the differences in the images stem from different lighting, viewing angles, etc. – for CIFAR-10, samples from the same class can vary greatly, increasing the likelihood of confusion.

Lastly, considering ResNet-18 on CIFAR-10, virtually no difference is observable between the uncompressed and pruned network. Only two samples are predicted differently, and neither the uncompressed nor the pruned variant arrives at the correct prediction. Additionally, the confidence difference regarding the predicted class between both variants is ≤ 2.5pp in all cases. The likely reason for this similarity is that although 72.4% of the network has been pruned, the pruned network still contains enough redundancy to mimic the initial model, and further pruning would be required to elicit any effects. In turn, this also highlights that if pruning is performed conservatively and not to the absolute limit, i.e., until even slight changes in the overall accuracy are noticeable, a compressed variant might be achievable that virtually mimics the initial network.

Quantization. Referring to Table 1, 8-bit quantization shows very limited impact on the overall accuracy, while 4-bit quantization (with 8-bit activation precision) shows a significant drop in accuracy in half of the experiments, a common observation regarding the low precision in combination with the static post-training quantization scheme (Banner, Nahshan, and Soudry 2019). For GTSRB, we again observe changes on the class level introduced by the quantization – but to a slightly lesser extent compared to pruning, although in most cases quantization also reduces the test accuracy less – as Figure 5 depicts. The same observation can be made for SqueezeNet and LeNet, where the maximum extent of the change in confusion or accuracy is up to 6pp and 12pp respectively. Comparing the differences in the confusion matrices for pruned and quantized networks, it is also evident that classes are not necessarily affected in the same way by both compression methods. This hints towards the finding that not only samples that are challenging for the uncompressed model are affected, but that the compression techniques can potentially affect any sample. Regarding CIFAR-10, we again report overall similar but significantly reduced effects compared to GTSRB, as we already observed for pruning, with the exception that the ResNet-18 is also slightly affected, with changes in confusion and accuracy of up to 0.6pp.

Figure 5: Difference between the confusion matrices for the uncompressed and 8-bit quantized ResNet-18 trained on GTSRB.

Comparing 4-bit quantization (with 8-bit activation precision) to 8-bit quantization for the configurations without a significant drop in accuracy, we find that the observed effects are the same but more pronounced for 4-bit quantization. As already observed when comparing pruning and 8-bit quantization, 4-bit quantization and 8-bit quantization show no consistent patterns regarding the impacted samples, further supporting the hypothesis that any sample or class can be affected.
Combined Pruning and Quantization. Lastly, we combined pruning and quantization by applying 8-bit quantization to the pruned models. As Table 1 shows, this has a slightly higher impact on the overall drop in accuracy than if the compression techniques were applied individually, which is to be expected. Overall, the same general effects are observable as when pruning or quantization is applied individually, as Figure 6 shows. Regarding the effect on individual classes, patterns present in both compression techniques are combined, sometimes amplifying, other times canceling out the effects. Potentially, this could lead to drastic effects; however, in our experiments we did not observe any. Also, it is noticeable that the number of samples where the uncompressed and the compressed model disagree is generally slightly higher than for either of the single compression variants. Generally speaking, however, the combination of both compression techniques shows no peculiarities and does not introduce significant additional effects.

Figure 6: Difference between the confusion matrices for the uncompressed and the pruned + 8-bit quantized ResNet-18 trained on GTSRB.

3.4 Differences in the Relevance of Input Regions

In addition to the quantitative analysis performed in the previous section, we also qualitatively investigated the changes introduced by model compression. For this, we generated saliency maps that highlight the salient input regions for a model's decision regarding the target class. Since statically quantized models in PyTorch do not support gradient calculation, we only analyze the changes between the uncompressed and pruned model variants. In order to keep the number of images to analyze manageable, for each configuration we selected the 20 samples with the biggest difference in the saliency maps. To compare two saliency maps, we first performed a 3x3 average pooling with stride 3 over the saliency maps – to reduce the sensitivity towards pixel-level changes in the attention, putting a stronger emphasis on higher-level features – and afterwards computed the Mean Absolute Deviation (MAD) between the reduced saliency maps of the uncompressed and pruned model.
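A minimal sketch of this comparison procedure follows, assuming gradient saliency as in (Simonyan, Vedaldi, and Zisserman 2014) with a maximum over color channels, a common choice the paper does not spell out.

```python
import torch
import torch.nn.functional as F

def saliency_map(model, x, target):
    """Gradient saliency: absolute input gradient of the target-class score,
    reduced over color channels and normalized to [0, 1]."""
    x = x.clone().requires_grad_(True)
    model(x.unsqueeze(0))[0, target].backward()
    s = x.grad.abs().amax(dim=0)  # max over channels -> HxW map
    return (s - s.min()) / (s.max() - s.min() + 1e-12)

def pooled_mad(model_a, model_b, x, target):
    """MAD between 3x3/stride-3 average-pooled saliency maps, as used to
    compare the attention of the uncompressed and pruned models."""
    sa = saliency_map(model_a, x, target)
    sb = saliency_map(model_b, x, target)
    pool = lambda s: F.avg_pool2d(s[None, None], kernel_size=3, stride=3)
    return (pool(sa) - pool(sb)).abs().mean().item()
```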
Figure 7 shows five selected saliency maps that summarize the observed findings. Overall, we did not observe any systematic changes between any model and its pruned variant. For example, in some instances the pruned model focuses better on the foreground, giving the correct prediction, while the original model focuses on the background, classifying incorrectly, as Figure 7a representatively shows. However, this is not consistent behavior, and the opposite effect can be observed as well, e.g., in Figure 7b, where the pruned model puts too much attention on the sky, classifying the image as an airplane. Furthermore, even for samples where both networks predict the same class, we can observe significant changes in the salience of different input regions, i.e., the models weight features differently or even rely on different ones. While in the previous sections we found virtually no differences between the uncompressed and pruned ResNet-18 on CIFAR-10, regarding their attention we could find noticeable differences, as Figures 7d and 7e show.

Figure 7: Comparison of saliency maps for the original and pruned variants of the LeNet and ResNet-18 trained on CIFAR-10. On the left is the original image with the target class annotated. Following that are the saliency maps for the original (middle) and pruned (right) model, each annotated with the prediction of the respective model. Panels: (a) LeNet, target horse, predictions airplane/horse (MAD=13%); (b) LeNet, target ship, predictions ship/airplane (MAD=12%); (c) LeNet, target frog, predictions frog/frog (MAD=12%); (d) ResNet-18, target frog, predictions frog/frog (MAD=16%); (e) ResNet-18, target automobile, predictions automobile/automobile (MAD=15%).

This effect is also not limited to our selected samples, as Figure 8 shows. With an average MAD of 7.7% between the saliency maps of the uncompressed and pruned ResNet-18, it highlights that although the effects of model compression might not be noticeable at the level of dataset or even class-wise accuracy, it is definitely important to consider them in safety analyses, as compression might, for example, introduce additional failures in corner cases where a model bases its decision on the wrong features.

Figure 8: Distribution of the mean absolute deviations between the saliency maps of the uncompressed and pruned ResNet-18 trained on CIFAR-10.

4 Conclusions and Future Work

In this paper, we investigated changes in the predictions of networks compressed with either post-training quantization, global unstructured pruning, or a combination of both. While the deviations from the test accuracy of the uncompressed model were minimal, we observed that the compression techniques still caused significant changes in the predictions. For one thing, we found that the accuracy of individual classes can change greatly – in our experiments by up to 15pp – and that the confusion between classes can vary to the same extent. For another, our investigation showed that the confidence regarding the target class can also change significantly, with extreme cases where the uncompressed model has zero confidence in the target class while the compressed variant has full confidence, and vice versa. Lastly, our comparison of saliency maps for uncompressed and pruned models revealed significant differences, hinting towards the two variants relying on or weighting features differently. It is worth mentioning, however, that we did not observe the introduction of systematic errors, e.g., in the form of biases against infrequent classes. Nonetheless, based on the effects we observed, we strongly suggest viewing model compression as an integral part of any ML development cycle and considering it in early development stages. Model compression can cause substantial changes in the predictions of a network and thereby bears the potential to introduce additional failure modes.
These must be addressed in the system development, and the earlier they are known, the better mitigation measures can be integrated into the system, overall facilitating the development process.

Regarding future work, we suggest expanding our experiments to other model architectures, datasets, and tasks, and investigating other compression techniques, e.g., quantization-aware training, structured pruning, or knowledge distillation, as these are also highly relevant in practice. Furthermore, we deem it highly important to further develop methods for systematically and rigorously analyzing machine learning systems that go beyond averaging metrics, as these hide many peculiarities that bear the potential for failures. Lastly, we deem it equally important to continue research in the direction of continuous safety assurance (Burton et al. 2021b) in order to consider safety an integral part of the development of ML-based systems, addressing issues such as potentially negative effects due to model compression early on.

References

Ashok, A.; Rhinehart, N.; Beainy, F.; and Kitani, K. M. 2018. N2N Learning: Network to Network Compression via Policy Gradient Reinforcement Learning. In Proc. ICLR.

Banner, R.; Nahshan, Y.; and Soudry, D. 2019. Post Training 4-Bit Quantization of Convolutional Networks for Rapid-Deployment. In Proc. NeurIPS, 7950–7958. Red Hook, NY, USA: Curran Associates Inc.

Bernhard, R.; Moellic, P.-A.; and Dutertre, J.-M. 2019. Impact of Low-Bitwidth Quantization on the Adversarial Robustness for Embedded Neural Networks. In 2019 International Conference on Cyberworlds (CW), 308–315.

Blalock, D. W.; Ortiz, J. J. G.; Frankle, J.; and Guttag, J. V. 2020. What Is the State of Neural Network Pruning? In Proc. MLSys.

Burton, S.; Gauerhof, L.; and Heinzemann, C. 2017. Making the Case for Safety of Machine Learning in Highly Automated Driving. In Computer Safety, Reliability, and Security, LNCS, 5–16. Cham: Springer International Publishing.

Burton, S.; Gauerhof, L.; Sethy, B. B.; Habli, I.; and Hawkins, R. 2019. Confidence Arguments for Evidence of Performance in Machine Learning for Highly Automated Driving Functions. In Computer Safety, Reliability, and Security, LNCS, 365–377. Cham: Springer International Publishing.

Burton, S.; Kurzidem, I.; Schwaiger, A.; Schleiß, P.; Unterreiner, M.; Graeber, T.; and Becker, P. 2021a. Safety Assurance of Machine Learning for Chassis Control Functions. In Computer Safety, Reliability, and Security, LNCS. Cham: Springer International Publishing.

Burton, S.; McDermid, J. A.; Garnett, P.; and Weaver, R. 2021b. Safety, Complexity, and Automated Driving: Holistic Perspectives on Safety Assurance. Computer, 54(8): 22–32.

Cai, Y.; Yao, Z.; Dong, Z.; Gholami, A.; Mahoney, M. W.; and Keutzer, K. 2020. ZeroQ: A Novel Zero Shot Quantization Framework. In Proc. CVPR, 13169–13178.

Cheng, Y.; Wang, D.; Zhou, P.; and Zhang, T. 2018. Model Compression and Acceleration for Deep Neural Networks: The Principles, Progress, and Challenges. IEEE Signal Process. Mag., 35(1): 126–136.

Duncan, K.; Komendantskaya, E.; Stewart, R.; and Lones, M. 2020. Relative Robustness of Quantized Neural Networks Against Adversarial Attacks. In Proc. IJCNN, 1–8.

Ferianc, M.; Maji, P.; Mattina, M.; and Rodrigues, M. 2021. On the Effects of Quantisation on Model Uncertainty in Bayesian Neural Networks. arXiv:2102.11062 [cs, stat].

Gui, S.; Wang, H. N.; Yang, H.; Yu, C.; Wang, Z.; and Liu, J. 2019. Model Compression with Adversarial Robustness: A Unified Optimization Framework. In Proc. NeurIPS, volume 32. Curran Associates, Inc.

Han, S.; Mao, H.; and Dally, W. J. 2016. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. In Proc. ICLR.

Hawkins, R.; Paterson, C.; Picardi, C.; Jia, Y.; Calinescu, R.; and Habli, I. 2021. Guidance on the Assurance of Machine Learning in Autonomous Systems (AMLAS). CoRR, abs/2102.01564.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep Residual Learning for Image Recognition. In Proc. CVPR, 770–778.

He, Y.; Kang, G.; Dong, X.; Fu, Y.; and Yang, Y. 2018. Soft Filter Pruning for Accelerating Deep Convolutional Neural Networks. In Proc. IJCAI.

Hinton, G.; Vinyals, O.; and Dean, J. 2015. Distilling the Knowledge in a Neural Network. arXiv:1503.02531 [cs, stat].

Hooker, S.; Courville, A.; Clark, G.; Dauphin, Y.; and Frome, A. 2021. What Do Compressed Deep Neural Networks Forget? arXiv:1911.05248 [cs, stat].

Hubara, I.; Courbariaux, M.; Soudry, D.; El-Yaniv, R.; and Bengio, Y. 2018. Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations. Journal of Machine Learning Research, 18(187): 1–30.

Iandola, F. N.; Moskewicz, M. W.; Ashraf, K.; Han, S.; Dally, W. J.; and Keutzer, K. 2016. SqueezeNet: AlexNet-Level Accuracy with 50x Fewer Parameters and <1MB Model Size. CoRR, abs/1602.07360.

Kazmi, M.; Schüller, P.; and Saygin, Y. 2017. Improving Scalability of Inductive Logic Programming via Pruning and Best-Effort Optimisation. Expert Systems With Applications, 87: 291–303.

Krizhevsky, A. 2009. Learning Multiple Layers of Features from Tiny Images. Technical report, University of Toronto.
Lecun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-Based Learning Applied to Document Recognition. Proc. IEEE, 86(11): 2278–2324.

LeCun, Y.; Denker, J. S.; and Solla, S. A. 1989. Optimal Brain Damage. In Touretzky, D. S., ed., Proc. NIPS, 598–605. Morgan Kaufmann.

Li, B.; Wu, B.; Su, J.; and Wang, G. 2020. EagleEye: Fast Sub-Net Evaluation for Efficient Neural Network Pruning. In Proc. ECCV, LNCS, 639–654. Cham: Springer International Publishing.

Luo, J.-H.; and Wu, J. 2020. AutoPruner: An End-to-End Trainable Filter Pruning Method for Efficient Deep Model Inference. Pattern Recognition, 107: 107461.

Mingers, J. 1989. An Empirical Comparison of Pruning Methods for Decision Tree Induction. Machine Learning, 4(2): 227–243.

Picardi, C.; Hawkins, R.; Paterson, C.; and Habli, I. 2019. A Pattern for Arguing the Assurance of Machine Learning in Medical Diagnosis Systems. In Computer Safety, Reliability, and Security, LNCS, 165–179. Cham: Springer International Publishing.

Renda, A.; Frankle, J.; and Carbin, M. 2019. Comparing Rewinding and Fine-Tuning in Neural Network Pruning. In Proc. ICLR.

Simonyan, K.; Vedaldi, A.; and Zisserman, A. 2014. Deep inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. In Proc. ICLR.

Stallkamp, J.; Schlipsing, M.; Salmen, J.; and Igel, C. 2011. The German Traffic Sign Recognition Benchmark: A Multi-Class Classification Competition. In Proc. IJCNN, 1453–1460.

Swaminathan, S.; Garg, D.; Kannan, R.; and Andres, F. 2020. Sparse Low Rank Factorization for Deep Neural Network Compression. Neurocomputing, 398: 185–196.

Tan, M.; Chen, B.; Pang, R.; Vasudevan, V.; Sandler, M.; Howard, A.; and Le, Q. V. 2019. MnasNet: Platform-Aware Neural Architecture Search for Mobile. In Proc. CVPR, 2820–2828.

Wu, J.; Leng, C.; Wang, Y.; Hu, Q.; and Cheng, J. 2016. Quantized Convolutional Neural Networks for Mobile Devices. In Proc. CVPR, 4820–4828.
5 Acknowledgments

This work was funded by the Bavarian Ministry for Economic Affairs, Regional Development and Energy as part of a project to support the thematic development of the Institute for Cognitive Systems.