Beyond Test Accuracy: The Effects of Model Compression on CNNs

Adrian Schwaiger, Kristian Schwienbacher, Karsten Roscher
Fraunhofer IKS
{firstname.lastname}@iks.fraunhofer.de

Copyright © 2022 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Model compression is widely employed to deploy convolutional neural networks on devices with limited computational resources or power limitations. For high-stakes applications, such as autonomous driving, it is, however, important that compression techniques do not impair the safety of the system. In this paper, we therefore investigate the changes introduced by three compression methods – post-training quantization, global unstructured pruning, and the combination of both – that go beyond the test accuracy. To this end, we trained three image classifiers on two datasets and compared them regarding their performance on the class level and regarding their attention to different input regions. Although the deviations in test accuracy were minimal, our results show that the considered compression techniques introduce substantial changes to the models that are reflected in the quality of the predictions of individual classes and in the salience of input regions. While we did not observe the introduction of systematic errors or biases towards certain classes, these changes can significantly impact the failure modes of CNNs and are thus highly relevant for safety analyses. We therefore conclude that it is important to be aware of the changes caused by model compression and to consider them already in the early stages of the development process.

1 Introduction

Deep Neural Networks (DNNs) enable many complex applications such as autonomous vehicles or automated manufacturing processes. Especially for perception tasks, Convolutional Neural Networks (CNNs) have shown impressive results and have been adopted widely. However, to achieve a high degree of performance, these networks often have millions of parameters that require significant computing power for inference, impeding the deployment on edge or low-power mobile devices (Cheng et al. 2018). One way to approach this problem is to compress the models, e.g., via pruning – i.e., removing parts of the network with a low contribution to the predictions – or quantization – i.e., reducing the number of bits required for each parameter. These methods reduce the memory footprint, increase computational efficiency, and in turn also decrease the power demands, enabling the deployment of DNNs on low-power devices.

Compressing models using pruning or quantization techniques can significantly reduce their size without severely impacting the overall performance regarding test accuracy. However, especially for safety-critical applications, test accuracy on its own is not sufficient, and underlying negative effects, e.g., on the long tail of data distributions, have been shown (Hooker et al. 2021). Furthermore, since model compression is often not explicitly addressed in development and assurance frameworks, such as Assurance of Machine Learning for use in Autonomous Systems (AMLAS) (Hawkins et al. 2021), the introduction of additional failure modes by compression techniques might lead to additional development effort if the effects are considered too late during the process, or in the worst case might lead to failures during operations if not considered at all. To this end, in this paper we investigate the effects of model compression when using global unstructured pruning, post-training quantization, and their combination. We thereby aim to provide insights into the question of what changes occur on a deeper level, to which extent, and how they could potentially impact efforts towards arguing the safety of the system, by making the following contributions:

• We investigate the effects of model compression on the class and sample level regarding the predictive quality of three different models on two datasets.

• Additionally, for model pruning we investigate the effects on the attention of the models compared to the initial models by analyzing their saliency maps.
2 Related Work

In the following, we present the related work regarding CNN compression and its relevance towards arguing the safety of ML-based systems.

2.1 Pruning

Model pruning is a common technique for various ML algorithms, such as decision trees (Mingers 1989) and inductive logic programming (Kazmi, Schüller, and Saygin 2017), not only for compression but also to improve generalization capabilities. For neural networks, pruning is not a novel idea (LeCun, Denker, and Solla 1989), but it has gained interest in recent years due to the increased popularity of neural networks and the need to deploy them on computationally restricted devices (Cheng et al. 2018). Pruning generally can be performed either in a structured (He et al. 2018) or unstructured (Han, Mao, and Dally 2016) manner. The former removes – based on a norm for scoring the importance of the individual elements – connected groups of parameters, e.g., on a per-channel or per-filter basis. The structured approach therefore not only provides improvements regarding memory usage, but also reduces inference times on regular hardware. Compared to structured pruning, unstructured pruning removes individual parameters, allowing model sizes to be decreased significantly more while retaining test accuracy. Since only individual parameters are removed, the overall network structure does not change and sparsity is introduced. Therefore, specialized hardware is required to benefit from inference speedups beyond the improvements in memory requirements (Luo and Wu 2020). ML frameworks, such as PyTorch or TensorFlow, come with implementations for the most common pruning techniques. Beyond that, research continues in this domain: for instance, AutoPruner (Luo and Wu 2020) improves significantly upon the state of the art by combining pruning and fine-tuning steps, EagleEye (Li et al. 2020) proposes an efficient evaluation strategy to identify the best performing subnetworks as pruning candidates, and with ShrinkBench (Blalock et al. 2020) a benchmarking framework has been proposed to facilitate the comparison of pruning techniques.
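As a concrete example of this built-in support, the following minimal sketch applies global unstructured pruning with PyTorch's torch.nn.utils.prune module, scoring parameters by their L1 magnitude. The ResNet-18 stand-in and the 70% pruning ratio are illustrative assumptions, not prescriptions from the works cited above.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune
from torchvision.models import resnet18  # stand-in model for illustration

model = resnet18(num_classes=43)

# Collect every weight tensor of the conv/linear layers as pruning targets.
parameters_to_prune = [
    (module, "weight")
    for module in model.modules()
    if isinstance(module, (nn.Conv2d, nn.Linear))
]

# Remove the 70% of parameters with the smallest absolute value,
# scored globally across all layers (L1 norm).
prune.global_unstructured(
    parameters_to_prune,
    pruning_method=prune.L1Unstructured,
    amount=0.7,
)

# Make the pruning permanent: drop the mask re-parametrization and keep
# the masked weights as the layers' plain weight tensors.
for module, name in parameters_to_prune:
    prune.remove(module, name)
```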
2.2 Quantization

Another widespread model compression technique is quantization, which aims to reduce the number of bits required to represent the parameters of a DNN. DNNs are usually trained on hardware accelerators, such as GPUs or TPUs, that use floating-point representations, usually 32-bit or 16-bit, for the parameters. A common technique is to quantize the parameters to 8-bit integers, effectively reducing the model size by a factor of 4 or 2, respectively, and allowing the exploitation of 8-bit optimized computations on mobile CPUs, while having minimal impact on the model performance (Wu et al. 2016). In practice, post-training quantization and quantization-aware training are common. With post-training quantization, the parameters of a model are quantized after the training phase without requiring any fine-tuning. In contrast, quantization-aware training models the parameter quantization during training and is able to achieve even lower bit-widths. As with pruning, both variants have implementations in common ML frameworks. Research in this domain focuses on achieving lower bit-widths while only minimally impacting the performance of models (Banner, Nahshan, and Soudry 2019; Hubara et al. 2018), or on further simplifying the quantization process, e.g., by eliminating the need for calibration data (Cai et al. 2020).
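To make the idea of reducing bit-widths concrete, the following sketch implements a simplified symmetric uniform quantizer for a single weight tensor. The function names and the per-tensor symmetric scheme are our own simplifications for illustration; practical schemes, including per-channel variants, calibrate scales (and zero-points) more carefully.

```python
import torch

def quantize_symmetric(x: torch.Tensor, bits: int = 8):
    """Uniformly quantize a float tensor to signed integers with `bits` precision."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    # Map the largest magnitude in the tensor to qmax (clamped to avoid
    # division by zero for all-zero tensors).
    scale = x.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), qmin, qmax).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Reconstruct an approximation of the original float values.
    return q.float() * scale

w = torch.randn(64, 3, 3, 3)  # e.g., a conv weight tensor
q, scale = quantize_symmetric(w, bits=8)
# The reconstruction error is bounded by roughly scale / 2 per element.
print((w - dequantize(q, scale)).abs().max())
```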
2.3 Further Model Compression Techniques

Besides pruning and quantization, other model compression techniques have been proposed. For instance, N2N learning (Ashok et al. 2018) removes parts of a network and afterwards shrinks them using a reinforcement learning approach. With knowledge distillation, one or more networks are trained to serve as teacher models which a smaller student model is trained to mimic (Hinton, Vinyals, and Dean 2015). Approaches based on low-rank factorization, such as (Swaminathan et al. 2020), use matrix decomposition to reconstruct linear transformations of a network into counterparts with less redundancy and therefore fewer parameters. Lastly, although not necessarily a compression technique, neural architecture search can be utilized to find efficient architectures, as is done, e.g., in MnasNet (Tan et al. 2019), which optimizes towards the real-world inference latency of DNNs.

2.4 Effects of Model Compression on Robustness

Compressed models have been extensively studied regarding their robustness against adversarial attacks. For instance, (Bernhard, Moellic, and Dutertre 2019) concluded that post-training quantization and quantization-aware training slightly improve the robustness of a network against attacks. Similarly, (Duncan et al. 2020) found that quantization can reduce the transferability of adversarial examples by up to 50%. The adversarially trained model compression framework (Gui et al. 2019) incorporates objectives regarding adversarial robustness in the compression process to further improve upon it. Apart from adversarial examples, some research has been conducted regarding other aspects of robustness. For instance, (Ferianc et al. 2021) demonstrated that a uniform quantization scheme does not considerably impact the quality of uncertainty quantification in Bayesian neural networks. (Hooker et al. 2021) studied the effects of model compression beyond test accuracy and found that a small subset of the data is systematically more impacted and that the sensitivity towards distributional shifts correlates significantly with model sparsity.

2.5 Safety Assurance for ML-based Systems

Arguing the safety of ML-based systems is an emerging field and highly relevant to enable the use of ML in safety-critical applications. One promising direction is holistic assurance strategies (Burton, Gauerhof, and Heinzemann 2017) that incorporate an analysis of the operational domain and the system, as well as a sound validation and verification strategy, to design confidence arguments that provide evidence towards the safety of the system (Burton et al. 2019). The approach itself is domain agnostic and has so far been applied to, e.g., the automotive (Burton et al. 2021a) and medical (Picardi et al. 2019) domains. While it provides a general framework towards arguing the safety of ML-based systems, and frameworks such as AMLAS (Hawkins et al. 2021) provide additional guidance, further research regarding the design of safe ML algorithms and effective testing methods is required to provide sufficient evidence for the assurance case.

3 Evaluation

In this section, we discuss our results and observed findings regarding the changes beyond test accuracy introduced when compressing image classifiers with pruning or quantization techniques.

3.1 Design of Experiments

To analyze the influence of the model architecture, we considered three different networks: a ResNet-18 (~11m parameters) (He et al. 2016) for its widespread usage, a SqueezeNet (~750k parameters) (Iandola et al. 2016) for its computational efficiency, and a LeNet-5 (~62k parameters) (Lecun et al. 1998) for its small size. We trained the models on CIFAR-10 (Krizhevsky 2009) and the German Traffic Sign Recognition Benchmark (GTSRB) (Stallkamp et al. 2011). CIFAR-10 consists of 60,000 32x32px images, equally divided into 10 classes, e.g., cat, dog, automobile, or ship. GTSRB contains 51,839 images of 43 different German traffic signs that we rescaled to 32x32px. The distribution of the traffic signs is imbalanced, with the most frequent sign, Speed limit (50 km/h), occurring more than 10 times as often as the least frequent one, Dangerous curve to the left. The class imbalance within GTSRB allows us to study whether any negative biases towards the underrepresented classes are introduced by the model compression techniques.

We trained each model by minimizing the negative log-likelihood using Adam as the optimizer. To prevent overfitting, we stopped the training after the loss did not decrease for 40 epochs. To improve the base accuracy on CIFAR-10, we transformed each image at each epoch by randomly flipping it horizontally and by randomly cropping it back to 32x32px after adding a 4px padding on each side.
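A minimal sketch of this training setup in PyTorch is shown below. The batch size is an assumption, the cross-entropy loss is used as the equivalent of minimizing the negative log-likelihood over softmax outputs, and we monitor the epoch training loss for early stopping since the text does not state which loss is tracked.

```python
import torch
from torch import nn, optim
from torchvision import datasets, transforms
from torchvision.models import resnet18

# Augmentation as described: random horizontal flip and a random 32x32 crop
# after 4px padding (used for CIFAR-10 only).
train_tf = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),
    transforms.ToTensor(),
])
train_set = datasets.CIFAR10("data", train=True, download=True, transform=train_tf)
loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

model = resnet18(num_classes=10)          # one of the three architectures
optimizer = optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()         # negative log-likelihood of softmax outputs

best_loss, epochs_without_improvement = float("inf"), 0
while epochs_without_improvement < 40:    # stop after 40 epochs without improvement
    epoch_loss = 0.0
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    if epoch_loss < best_loss:
        best_loss, epochs_without_improvement = epoch_loss, 0
    else:
        epochs_without_improvement += 1
```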
To compress the models, we used the implementations for pruning and quantization provided by the ML framework PyTorch. We chose global unstructured pruning, using the L1 norm to score the parameters of the model, whereby the lowest-scored ones are removed. We applied no subsequent fine-tuning, as this yielded the best results in our experiments. Compared to structured pruning, unstructured pruning is less applicable in practice, as without sparse tensor computations it only affects the memory requirements of the model. However, unstructured pruning is widely considered in academia (LeCun, Denker, and Solla 1989; Renda, Frankle, and Carbin 2019) and, with improvements in sparse tensor support on embedded hardware, might become the predominant method for practical applications in the future. After the training phase, we pruned each model with the target of maximizing the number of dropped connections while maintaining a level of accuracy comparable to the original model.

For quantization, we chose a non-intrusive post-training approach with per-channel bit allocation, as it gave the best results in our experiments. We chose to quantize the weights of all models once to 8-bit and once to 4-bit. The first case enables the utilization of integer-based hardware accelerators, while the second one would require specialized hardware to gain additional benefits, apart from increased memory efficiency, compared to the 8-bit variant. The activation precision was kept at 8-bit for both cases, as values below that severely impacted the accuracy of the models. Finally, we also combined both compression approaches by quantizing the pruned models with 8-bit precision for weights and activations. Table 1 lists all models and their compressed variants, stating their test accuracy and memory footprint. We do not provide a measure of the inference time, as it greatly depends on the execution platform, e.g., whether it can exploit the sparseness of the pruned models or whether it is optimized towards floating point or integer computations.

Architecture | Dataset  | Percentage Pruned | Uncompressed Accuracy | Pruning Acc. Diff. | 4-bit Quant. Acc. Diff. | 8-bit Quant. Acc. Diff. | Pruning + 8-bit Quant. Acc. Diff.
LeNet        | GTSRB    | 54.8%             | 92.3%                 | -0.4pp             | -0.4pp                  | -0.2pp                  | -0.7pp
LeNet        | CIFAR-10 | 39.8%             | 74.8%                 | -0.5pp             | *-7.9pp                 | -0.1pp                  | -0.6pp
SqueezeNet   | GTSRB    | 49.4%             | 93.0%                 | -0.8pp             | *-2.2pp                 | -0.2pp                  | -0.8pp
SqueezeNet   | CIFAR-10 | 49.4%             | 84.5%                 | -0.4pp             | *-4.0pp                 | 0pp                     | -0.4pp
ResNet-18    | GTSRB    | 67.4%             | 95.4%                 | 0pp                | -0.2pp                  | -0.1pp                  | -0.1pp
ResNet-18    | CIFAR-10 | 72.4%             | 86.5%                 | 0pp                | -0.9pp                  | -0.1pp                  | -0.2pp

Table 1: Accuracies on the test dataset for each model and its compressed variants. The column Percentage Pruned states how much of the respective network was removed and only applies to the Pruning and Pruning + Quantization (8-bit) variants. Significant deviations in accuracy of more than 1pp from the uncompressed model are marked with an asterisk.
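The following sketch outlines how such a non-intrusive 8-bit post-training quantization could be realized with PyTorch's eager-mode API, assuming a simple sequential stand-in model and random calibration data; architectures with, e.g., residual additions would need further adaptations (module fusion, FloatFunctional).

```python
import torch
import torch.nn as nn

# A small sequential CNN stands in for a trained float model.
float_model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2), nn.Flatten(),
    nn.Linear(16 * 16 * 16, 10),
)
float_model.eval()

# QuantWrapper adds the quant/dequant entry and exit points required by the
# eager-mode API; the 'fbgemm' qconfig uses per-channel 8-bit weights and
# 8-bit activations.
model = torch.quantization.QuantWrapper(float_model)
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")
prepared = torch.quantization.prepare(model)

# Calibration: run a few batches of representative data through the inserted
# observers to determine activation ranges (random data only as a stand-in).
with torch.no_grad():
    for _ in range(8):
        prepared(torch.randn(32, 3, 32, 32))

quantized = torch.quantization.convert(prepared)  # int8 weights/activations
```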
For the evaluation, we additionally generated saliency maps by computing the gradients for each input pixel with respect to the target class and normalizing them to the range [0; 1], following (Simonyan, Vedaldi, and Zisserman 2014). Since PyTorch does not support gradient calculation for quantized tensors, we only generated saliency maps for the original and pruned variants of the models.

3.2 Results and Discussion

Table 1 shows the test accuracies of all configurations. Most configurations show only a slight drop in accuracy of less than 1 percentage point (pp), with the exception of some networks quantized with 4-bit weight precision. These are not considered further in the following sections, as their substantial drop in accuracy already implies significant changes.

3.3 Changes at the Class Level

Pruning. The accuracy on the entire test dataset did not reduce significantly after applying pruning for most configurations, as Table 1 shows. However, we observe significant changes at the class level for many configurations, especially for GTSRB. For instance, Figure 1 shows the difference between the confusion matrices for the original and pruned ResNet-18 on GTSRB. While the test accuracy stayed the same, the accuracies and confusions of a few individual classes change significantly. As an example, class 0 (Speed limit 20km/h) is confused ~8pp more often with class 1 (Speed limit 30km/h), but on the other hand, class 40 (Roundabout mandatory) is mistaken ~11pp less often for class 37 (Go straight or left). Furthermore, class 40 is predicted more accurately by ~11pp, whereas the accuracy of class 0 drops by ~10pp. Many more classes show changes in accuracy or confusion in the range of up to 4pp that – depending on the concrete application – might also be relevant.

Figure 1: Difference (in percentage points) between the confusion matrices for the uncompressed and pruned ResNet-18 trained on GTSRB. Targets and predictions are ordered by the frequency of the respective class, with class 2 being the most frequent and class 0 being the most infrequent one.

Upon closer inspection, we find that for class 40 the correct predictions of the uncompressed model are a proper subset of those of the pruned network. For the remaining samples that were only predicted correctly by the pruned model, we find that the confidence is between 18pp and 24pp higher for the pruned model. This means that for these particular samples there is a significant difference in how much support each of the networks generates, and the increase in correct predictions for this class by the pruned model is not just based on slight differences. Figure 2 summarizes the difference in confidence for the predicted class between the uncompressed and pruned ResNet-18. While the vast majority of predictions show a similar (≤ 2.5pp) confidence, for some samples the confidence changes significantly, by up to 27.5pp. Although this only affects a small subset of all samples, and the overall share of samples that are predicted differently is small at 0.6%, it is something to be aware of, since it might have been caused by the introduction of additional failure modes.

Figure 2: Difference (in percentage points) between the confidence in the target class for the uncompressed and pruned ResNet-18 trained on GTSRB. The upper plot shows the difference where the predictions of the two networks differ, the lower one where they are equal. In parentheses we state the number of samples underlying the respective diagram, e.g., the n = 77 in the upper plot shows that for 77 samples the uncompressed and pruned model yielded a different classification result.

Depending on the stage of the system development at which model compression is considered, this might have several implications. In the worst case, if model compression is performed immediately prior to deployment without an extensive verification phase afterwards, these failure modes are not addressed, potentially leading to system failures during operations. But even in cases where it is considered before model verification, it can significantly impact the development. As additional failure modes must be met with proper mitigation measures – e.g., in the form of safety monitors or considerations regarding the operational domain – the development process can be prolonged if model compression is not considered as an integral part of the system development.
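For illustration, the sketch below shows how per-sample confidence differences of the kind summarized in Figure 2 (and Figure 3 below) could be computed; the function and the use of softmax confidences over logits are our assumptions, not the paper's exact evaluation code.

```python
import torch

@torch.no_grad()
def confidence_differences(model_a, model_b, loader):
    """Per-sample difference (in pp) of the softmax confidence in the target
    class between two models (assumed to be in eval mode), split by whether
    their predictions agree."""
    agree, disagree = [], []
    for x, y in loader:
        conf_a = torch.softmax(model_a(x), dim=1)
        conf_b = torch.softmax(model_b(x), dim=1)
        idx = torch.arange(len(y))
        diff = (conf_a[idx, y] - conf_b[idx, y]).abs() * 100  # percentage points
        same = conf_a.argmax(1) == conf_b.argmax(1)
        agree.append(diff[same])
        disagree.append(diff[~same])
    return torch.cat(agree), torch.cat(disagree)
```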
For GTSRB, the overall effects regarding pruning are similar but more pronounced for LeNet and SqueezeNet compared to the ResNet-18. Here, for some classes the change in confusion or accuracy is somewhat more noticeable, with up to 15pp, and the proportion of samples that are predicted differently is higher, at 3.5% and 3% respectively. Also, the mean difference in the confidence for each sample is significantly increased, even to the extent where for a small fraction of samples one variant generates full support for the target class and the other one none, as Figure 3 highlights. The increase in the observed effects is likely due to the smaller initial sizes of LeNet and SqueezeNet compared to the ResNet-18. Although 72% of the ResNet-18's connections were pruned, it still has more than 7 times (SqueezeNet) and 89 times (LeNet) the number of parameters, potentially still containing redundant features.

Figure 3: Difference between the confidence in the target class for the uncompressed and pruned SqueezeNet trained on GTSRB. The upper plot shows the difference where the predictions of the two networks differ, the lower one where they are equal.

One important finding on the imbalanced GTSRB is that pruning did not introduce significant biases against the infrequent classes for any of the networks. This is also evident when considering the diagonal in Figure 1, where no correlation between the change in accuracy and the frequency of the class is present. It is worth mentioning that the significant drop in accuracy for the most infrequent class 0 is only present for ResNet-18; it does not occur for the other networks.

On CIFAR-10, the overall effects of pruning are similar to GTSRB but significantly less pronounced, as Figure 4 shows. For SqueezeNet and LeNet, the change in class-wise accuracies does not exceed 2.5pp and 4pp respectively. However, for these two networks it should be noted that the overall share of samples that are predicted differently by the uncompressed and pruned variant is higher, at 4.2% and 7.2% respectively. Referring to Figure 4, this effect can be explained as a result of previously wrong predictions that, after applying pruning to the network, are still predicted incorrectly but towards another class. Here, it is worth highlighting that this effect is most prominent for the classes airplane and horse, with the former being predicted less often and the latter more often overall by the pruned network, therefore introducing a slight bias against or towards these classes respectively.

Figure 4: Difference between the confusion matrices for the uncompressed and pruned LeNet trained on CIFAR-10.
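A sketch of how the confusion-matrix differences visualized in Figures 1, 4, and 5 could be computed is given below; the array names and the sign convention (compressed minus uncompressed) are illustrative assumptions.

```python
import numpy as np

def confusion_pct(targets: np.ndarray, preds: np.ndarray, n_classes: int) -> np.ndarray:
    """Row-normalized confusion matrix in percent: entry [t, p] is the share of
    samples of target class t predicted as class p (assumes every class occurs
    at least once in `targets`)."""
    cm = np.zeros((n_classes, n_classes))
    np.add.at(cm, (targets, preds), 1)
    return 100 * cm / cm.sum(axis=1, keepdims=True)

# Difference in percentage points between compressed and uncompressed model:
# positive off-diagonal entries indicate increased confusion, positive
# diagonal entries indicate increased class-wise accuracy.
# y_true, y_pred_pruned, y_pred_orig are hypothetical arrays of test labels
# and the two models' predictions.
diff = confusion_pct(y_true, y_pred_pruned, 43) - confusion_pct(y_true, y_pred_orig, 43)
```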
Overall, the increase in the intra-class confusion compared to GTSRB can likely be attributed to the different complexity of the tasks. While GTSRB has only very limited inter-class variance – a Speed limit (20km/h) sign always has the same shape and surface, and the differences in the images stem from different lighting, viewing angles, etc. – for CIFAR-10, samples from the same class can vary greatly, increasing the likelihood of confusion.

Lastly, considering ResNet-18 on CIFAR-10, virtually no difference is observable between the uncompressed and pruned network. Only two samples are predicted differently, and neither the uncompressed nor the pruned variant arrives at the correct prediction. Additionally, the confidence difference regarding the predicted class between both variants is ≤ 2.5pp in all cases. The likely reason for this similarity is that although 72.4% of the network has been pruned, the pruned network still contains enough redundancy to mimic the initial model, and further pruning would be required to elicit any effects. In turn, this also highlights that if pruning is performed conservatively and not to the absolute limit, i.e., until even slight changes in the overall accuracy are noticeable, a compressed variant might be achievable that virtually mimics the initial network.

Quantization. Referring to Table 1, 8-bit quantization shows very limited impact on the overall accuracy, while 4-bit quantization (with 8-bit activation precision) shows a significant drop in accuracy in half of the experiments, a common observation regarding the low precision in combination with the static post-training quantization scheme (Banner, Nahshan, and Soudry 2019). For GTSRB, we again observe changes on the class level introduced by the quantization – but to a slightly lesser extent compared to pruning, although in most cases quantization also reduces the test accuracy less – as Figure 5 depicts. The same observation can be made for SqueezeNet and LeNet, where the maximum extent of the change in confusion or accuracy is up to 6pp and 12pp respectively. Comparing the differences in the confusion matrices for pruned and quantized networks, it is also evident that classes are not necessarily affected in the same way by both compression methods. This hints towards the finding that not only samples that are challenging for the uncompressed model are affected, but that the compression techniques can potentially affect any sample. Regarding CIFAR-10, we again report overall similar but significantly reduced effects compared to GTSRB, as we already observed for pruning, with the exception that the ResNet-18 is also slightly affected, with changes in confusion and accuracy of up to 0.6pp.

Figure 5: Difference between the confusion matrices for the uncompressed and 8-bit quantized ResNet-18 trained on GTSRB.

Comparing 4-bit quantization (with 8-bit activation precision) to 8-bit quantization for the configurations without a significant drop in accuracy, we find that the observed effects are the same but more pronounced for 4-bit quantization. As already observed when comparing pruning and 8-bit quantization, 4-bit quantization and 8-bit quantization show no consistent patterns regarding the impacted samples, further supporting the hypothesis that any sample or class can be affected.
Combined Pruning and Quantization. Lastly, we combined pruning and quantization by applying 8-bit quantization to the pruned models. As Table 1 shows, this has a slightly higher impact on the overall drop in accuracy than if the compression techniques were applied individually, which is to be expected. Overall, the same general effects are observable as when pruning or quantization is applied individually, as Figure 6 shows. Regarding the effect on individual classes, patterns present in both compression techniques are combined, sometimes amplifying, other times canceling out the effects. Potentially, this could lead to drastic effects; however, in our experiments we did not observe any. Also, it is noticeable that the number of samples where the uncompressed and the compressed model disagree is generally slightly higher than for either of the single compression variants. Generally speaking, however, the combination of both compression techniques shows no peculiarities and does not introduce significant additional effects.

Figure 6: Difference between the confusion matrices for the uncompressed and the pruned + 8-bit quantized ResNet-18 trained on GTSRB.

3.4 Differences in the Relevance of Input Regions

In addition to the quantitative analysis performed in the previous section, we also qualitatively investigated the changes introduced by model compression. For this, we generated saliency maps that highlight the salient input regions for a model's decision regarding the target class. Since statically quantized models in PyTorch do not support gradient calculation, we only analyze the changes between the uncompressed and pruned model variants. In order to keep the number of images to analyze manageable, for each configuration we selected the 20 samples with the biggest difference in the saliency maps. To compare two saliency maps, we first performed a 3x3 average pooling with stride 3 over the saliency maps – to reduce the sensitivity towards pixel-level changes in the attention, putting a stronger emphasis on higher-level features – and afterwards computed the Mean Absolute Deviation (MAD) between the reduced saliency maps of the uncompressed and pruned model.
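A minimal sketch of this comparison procedure follows, assuming gradient saliency as in (Simonyan, Vedaldi, and Zisserman 2014) with a maximum over color channels, a common choice the paper does not spell out.

```python
import torch
import torch.nn.functional as F

def saliency_map(model, x, target):
    """Gradient saliency: absolute input gradient of the target-class score,
    reduced over color channels and normalized to [0, 1]."""
    x = x.clone().requires_grad_(True)
    model(x.unsqueeze(0))[0, target].backward()
    s = x.grad.abs().amax(dim=0)  # max over channels -> HxW map
    return (s - s.min()) / (s.max() - s.min() + 1e-12)

def pooled_mad(model_a, model_b, x, target):
    """MAD between 3x3/stride-3 average-pooled saliency maps, as used to
    compare the attention of the uncompressed and pruned models."""
    sa = saliency_map(model_a, x, target)
    sb = saliency_map(model_b, x, target)
    pool = lambda s: F.avg_pool2d(s[None, None], kernel_size=3, stride=3)
    return (pool(sa) - pool(sb)).abs().mean().item()
```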
Figure 7 shows five selected saliency maps that summarize the observed findings. Overall, we did not observe any systematic changes between any model and its pruned variant. For example, in some instances the pruned model focuses better on the foreground, giving the correct prediction, while the original model focuses on the background, classifying incorrectly, as Figure 7a representatively shows. However, this is not consistent behavior, and the opposite effect can be observed as well, e.g., in Figure 7b, where the pruned model puts too much attention on the sky, classifying the image as an airplane. Furthermore, even for samples where both networks predict the same class, we can observe significant changes in the salience of different input regions, i.e., the models weight features differently or even rely on different ones. While in the previous sections we found virtually no differences between the uncompressed and pruned ResNet-18 on CIFAR-10, regarding their attention we could find noticeable differences, as Figures 7d and 7e show.

Figure 7: Comparison of saliency maps for the original and pruned variants of the LeNet and ResNet-18 trained on CIFAR-10. On the left is the original image with the target class annotated. Following that are the saliency maps for the original (middle) and pruned (right) model, each annotated with the prediction of the respective model. Panels: (a) LeNet, target horse, predictions airplane/horse (MAD=13%); (b) LeNet, target ship, predictions ship/airplane (MAD=12%); (c) LeNet, target frog, predictions frog/frog (MAD=12%); (d) ResNet-18, target frog, predictions frog/frog (MAD=16%); (e) ResNet-18, target automobile, predictions automobile/automobile (MAD=15%).

This effect is also not limited to our selected samples, as Figure 8 shows. With an average MAD of 7.7% between the saliency maps of the uncompressed and pruned ResNet-18, it highlights that although the effects of model compression might not be noticeable at the level of dataset or even class-wise accuracy, it is definitely important to consider them in safety analyses, as compression might, for example, introduce additional failures in corner cases where a model bases its decision on the wrong features.

Figure 8: Distribution of the mean absolute deviations between the saliency maps of the uncompressed and pruned ResNet-18 trained on CIFAR-10.

4 Conclusions and Future Work

In this paper, we investigated changes in the predictions of networks compressed with either post-training quantization, global unstructured pruning, or a combination of both. While the deviations from the test accuracy of the uncompressed model were minimal, we observed that the compression techniques still caused significant changes in the predictions. For one thing, we found that the accuracy of individual classes can change greatly – in our experiments by up to 15pp – and that the confusion between classes can vary to the same extent. For another, our investigation showed that the confidence regarding the target class can also change significantly, with extreme cases where the uncompressed model has zero confidence in the target class while the compressed variant has full confidence, and vice versa. Lastly, our comparison of saliency maps for uncompressed and pruned models revealed significant differences, hinting towards the two variants relying on or weighting features differently. It is worth mentioning, however, that we did not observe the introduction of systematic errors, e.g., in the form of biases against infrequent classes. Nonetheless, based on the effects we observed, we strongly suggest viewing model compression as an integral part of any ML development cycle and considering it in early development stages. Model compression can cause substantial changes in the predictions of a network and thereby bears the potential to introduce additional failure modes.
These must be addressed in the system development, and the earlier they are known, the better mitigation measures can be integrated into the system, overall facilitating the development process.

Regarding future work, we suggest expanding our experiments to other model architectures, datasets, and tasks, and investigating other compression techniques, e.g., quantization-aware training, structured pruning, or knowledge distillation, as these are also highly relevant in practice. Furthermore, we deem it highly important to further develop methods for systematically and rigorously analyzing machine learning systems that go beyond averaging metrics, as these hide many peculiarities that bear the potential for failures. Lastly, we deem it equally important to continue research in the direction of continuous safety assurance (Burton et al. 2021b) in order to consider safety an integral part of the development of ML-based systems, addressing issues such as potentially negative effects due to model compression early on.

References

Ashok, A.; Rhinehart, N.; Beainy, F.; and Kitani, K. M. 2018. N2N Learning: Network to Network Compression via Policy Gradient Reinforcement Learning. In Proc. ICLR.

Banner, R.; Nahshan, Y.; and Soudry, D. 2019. Post Training 4-Bit Quantization of Convolutional Networks for Rapid-Deployment. In Proc. NeurIPS, 7950–7958. Red Hook, NY, USA: Curran Associates Inc.

Bernhard, R.; Moellic, P.-A.; and Dutertre, J.-M. 2019. Impact of Low-Bitwidth Quantization on the Adversarial Robustness for Embedded Neural Networks. In 2019 International Conference on Cyberworlds (CW), 308–315.

Blalock, D. W.; Ortiz, J. J. G.; Frankle, J.; and Guttag, J. V. 2020. What Is the State of Neural Network Pruning? In Proc. MLSys.

Burton, S.; Gauerhof, L.; and Heinzemann, C. 2017. Making the Case for Safety of Machine Learning in Highly Automated Driving. In Computer Safety, Reliability, and Security, LNCS, 5–16. Cham: Springer International Publishing.

Burton, S.; Gauerhof, L.; Sethy, B. B.; Habli, I.; and Hawkins, R. 2019. Confidence Arguments for Evidence of Performance in Machine Learning for Highly Automated Driving Functions. In Computer Safety, Reliability, and Security, LNCS, 365–377. Cham: Springer International Publishing.

Burton, S.; Kurzidem, I.; Schwaiger, A.; Schleiß, P.; Unterreiner, M.; Graeber, T.; and Becker, P. 2021a. Safety Assurance of Machine Learning for Chassis Control Functions. In Computer Safety, Reliability, and Security, LNCS. Cham: Springer International Publishing.

Burton, S.; McDermid, J. A.; Garnett, P.; and Weaver, R. 2021b. Safety, Complexity, and Automated Driving: Holistic Perspectives on Safety Assurance. Computer, 54(8): 22–32.

Cai, Y.; Yao, Z.; Dong, Z.; Gholami, A.; Mahoney, M. W.; and Keutzer, K. 2020. ZeroQ: A Novel Zero Shot Quantization Framework. In Proc. CVPR, 13169–13178.

Cheng, Y.; Wang, D.; Zhou, P.; and Zhang, T. 2018. Model Compression and Acceleration for Deep Neural Networks: The Principles, Progress, and Challenges. IEEE Signal Process. Mag., 35(1): 126–136.

Duncan, K.; Komendantskaya, E.; Stewart, R.; and Lones, M. 2020. Relative Robustness of Quantized Neural Networks Against Adversarial Attacks. In Proc. IJCNN, 1–8.

Ferianc, M.; Maji, P.; Mattina, M.; and Rodrigues, M. 2021. On the Effects of Quantisation on Model Uncertainty in Bayesian Neural Networks. arXiv:2102.11062 [cs, stat].

Gui, S.; Wang, H. N.; Yang, H.; Yu, C.; Wang, Z.; and Liu, J. 2019. Model Compression with Adversarial Robustness: A Unified Optimization Framework. In Proc. NeurIPS, volume 32. Curran Associates, Inc.

Han, S.; Mao, H.; and Dally, W. J. 2016. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. In Proc. ICLR.

Hawkins, R.; Paterson, C.; Picardi, C.; Jia, Y.; Calinescu, R.; and Habli, I. 2021. Guidance on the Assurance of Machine Learning in Autonomous Systems (AMLAS). CoRR, abs/2102.01564.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep Residual Learning for Image Recognition. In Proc. CVPR, 770–778.

He, Y.; Kang, G.; Dong, X.; Fu, Y.; and Yang, Y. 2018. Soft Filter Pruning for Accelerating Deep Convolutional Neural Networks. In Proc. IJCAI.

Hinton, G.; Vinyals, O.; and Dean, J. 2015. Distilling the Knowledge in a Neural Network. arXiv:1503.02531 [cs, stat].

Hooker, S.; Courville, A.; Clark, G.; Dauphin, Y.; and Frome, A. 2021. What Do Compressed Deep Neural Networks Forget? arXiv:1911.05248 [cs, stat].

Hubara, I.; Courbariaux, M.; Soudry, D.; El-Yaniv, R.; and Bengio, Y. 2018. Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations. Journal of Machine Learning Research, 18(187): 1–30.

Iandola, F. N.; Moskewicz, M. W.; Ashraf, K.; Han, S.; Dally, W. J.; and Keutzer, K. 2016. SqueezeNet: AlexNet-Level Accuracy with 50x Fewer Parameters and <1MB Model Size. CoRR, abs/1602.07360.

Kazmi, M.; Schüller, P.; and Saygin, Y. 2017. Improving Scalability of Inductive Logic Programming via Pruning and Best-Effort Optimisation. Expert Systems With Applications, 87: 291–303.

Krizhevsky, A. 2009. Learning Multiple Layers of Features from Tiny Images. Technical report, University of Toronto.
Lecun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-Based Learning Applied to Document Recognition. Proc. IEEE, 86(11): 2278–2324.

LeCun, Y.; Denker, J. S.; and Solla, S. A. 1989. Optimal Brain Damage. In Touretzky, D. S., ed., Proc. NIPS, 598–605. Morgan Kaufmann.

Li, B.; Wu, B.; Su, J.; and Wang, G. 2020. EagleEye: Fast Sub-Net Evaluation for Efficient Neural Network Pruning. In Proc. ECCV, LNCS, 639–654. Cham: Springer International Publishing.

Luo, J.-H.; and Wu, J. 2020. AutoPruner: An End-to-End Trainable Filter Pruning Method for Efficient Deep Model Inference. Pattern Recognition, 107: 107461.

Mingers, J. 1989. An Empirical Comparison of Pruning Methods for Decision Tree Induction. Machine Learning, 4(2): 227–243.

Picardi, C.; Hawkins, R.; Paterson, C.; and Habli, I. 2019. A Pattern for Arguing the Assurance of Machine Learning in Medical Diagnosis Systems. In Computer Safety, Reliability, and Security, LNCS, 165–179. Cham: Springer International Publishing.

Renda, A.; Frankle, J.; and Carbin, M. 2019. Comparing Rewinding and Fine-Tuning in Neural Network Pruning. In Proc. ICLR.

Simonyan, K.; Vedaldi, A.; and Zisserman, A. 2014. Deep inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. In Proc. ICLR.

Stallkamp, J.; Schlipsing, M.; Salmen, J.; and Igel, C. 2011. The German Traffic Sign Recognition Benchmark: A Multi-Class Classification Competition. In Proc. IJCNN, 1453–1460.

Swaminathan, S.; Garg, D.; Kannan, R.; and Andres, F. 2020. Sparse Low Rank Factorization for Deep Neural Network Compression. Neurocomputing, 398: 185–196.

Tan, M.; Chen, B.; Pang, R.; Vasudevan, V.; Sandler, M.; Howard, A.; and Le, Q. V. 2019. MnasNet: Platform-Aware Neural Architecture Search for Mobile. In Proc. CVPR, 2820–2828.

Wu, J.; Leng, C.; Wang, Y.; Hu, Q.; and Cheng, J. 2016. Quantized Convolutional Neural Networks for Mobile Devices. In Proc. CVPR, 4820–4828.
5 Acknowledgments

This work was funded by the Bavarian Ministry for Economic Affairs, Regional Development and Energy as part of a project to support the thematic development of the Institute for Cognitive Systems.