Elite BackProp: Training Sparse Interpretable Neurons

Theodoros Kasioumis, Joe Townsend and Hiroya Inakoshi

Abstract

In this paper we present a method called Elite BackProp (EBP) to train more interpretable convolutional neural networks (CNNs) by introducing class-wise activation sparsity; after training, each class will be associated with a small set of elite filters that fire rarely but highly activate on visual primitives from images of that class. Our method is broadly applicable as it does not require additional object part annotations during training. We demonstrate experimentally that EBP realizes high degrees of activation sparsity with no accuracy loss and enhances the performance of a rule extraction algorithm that distils the knowledge from a CNN, by inducing more compact rules that use fewer atoms to describe the decisions of a CNN while maintaining high fidelity compared with other solutions. This happens because EBP induces sparse compositional representations that reuse and combine primitive filters. Such representations can assist in understanding the reasoning behind a CNN and build trust in its decisions.

Keywords

neural-symbolic integration, training interpretable CNNs, activation sparsity, rule extraction

1. Introduction

Training neural networks to be interpretable [1, 2], or interpreting their decisions [3, 4, 5, 6], has received a great deal of attention in recent years. Multiple rule extraction algorithms have been proposed that aim to distil the knowledge from a CNN and interpret its decisions [7, 8, 9, 10, 11]. Many of these rely on thresholding filter activations to determine whether a filter can be considered active, i.e. whether it detects a specific pattern across images (a process known as quantisation [9, 12]). After combining active filters, rules are formed to explain the classification decision of the CNN, where each atom used in explanations corresponds to an active filter.
One can regard compact explanations as more interpretable as there is less information for the reader to assimilate. Thus, ideal explanations are composed of smaller sets of rules which are in turn composed of smaller sets of atoms that can be reused across the rule set and across different classes in different combinations. Such compactness is more difficult to achieve if the original CNN itself encodes a large number of redundant representations, i.e. if different convolutional filters co-activate on the same concepts.

NeSy'20/21: 15th International Workshop on Neural-Symbolic Learning and Reasoning, October 25–27, 2021, Virtual
Email: theodoros.kasioumis@fujitsu.com (T. Kasioumis); joseph.townsend@fujitsu.com (J. Townsend); hiroya.inakoshi@fujitsu.com (H. Inakoshi)
ORCID: 0000-0003-2008-5817 (T. Kasioumis); 0000-0002-5478-0028 (J. Townsend); 0000-0003-4405-8952 (H. Inakoshi)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

We aim to tackle the aforementioned inefficiencies by introducing an algorithm called Elite BackProp (EBP) that enforces class-wise activation sparsity. That is, EBP trains CNNs to associate each class with a handful of elite filters that fire rarely but highly activate on images from that class. By filter we mean the set of weights that make up a single channel in the convolutional layer. Every filter is assigned a probability of being active on each class based on its frequency and magnitude of activation on images from that class. Filters are ranked for each class according to their activation probability and the top-K filters for a class form its elite. If a filter activates on an image from a class that has low probability of activating that filter, then a penalty inversely proportional to the filter's ranking is applied.
Since only the elite filters will have strong activations for each class, the model is incentivised to learn more compositional representations by re-using and combining primitive filters to describe more complex concepts rather than learning redundant multiple filters for each class separately.

Activation sparsity has been used extensively in the literature in many different forms [13, 14, 15, 16, 17, 18, 19, 20, 21, 22] and it has also been hypothesized that sparsely activated neurons are more interpretable than neurons that activate frequently [23]. To the best of our knowledge, EBP is the first method to induce class-wise activation sparsity as we define it above. Our experiments show that models trained with EBP maintain their original accuracy while realizing high degrees of sparsity. Furthermore, we demonstrate that the group activation sparsity induced by EBP benefits a rule extraction algorithm that distils the knowledge from a trained CNN and explains the output classification in an interpretable logical language over quantized filter activations represented as logical atoms. We show that EBP yields extracted programs with high fidelity (accuracy in approximating the original model) that use far fewer atoms compared with other activation sparsity methods. Moreover, we present qualitative examples of how the receptive field of filters trained with EBP is much more interpretable than that of filters trained without it, as it consistently detects a specific pattern across different images.

This paper is organized as follows: in Section 2 we review prior work in activation sparsity and interpretability. In Section 3 we present EBP, and in Section 4 we conduct an experimental evaluation of EBP. Section 5 concludes by summarizing our results and discussing future work.

2. Related Work

Work in the field of neural-symbolic integration [3, 8, 12, 24, 25] concerns both post-hoc and self-interpretable models of explainability for neural networks, and employs symbolic representations such as decision trees or logic programs to explain a network's behavior. ProtoPNet [1] trains interpretable CNNs by partitioning the input image into sub-regions and associating each region with a prototypical part of some class that "looks like" the extracted region. Another idea introduced in [2] uses a "filter loss" that pushes a filter to represent an object part of a particular class, but not others. Both lines of work differ from EBP in that they add a new structural layer whereas EBP only adds a penalty to the loss that enforces class-wise activation sparsity.

Another line of research focuses on sparsifying connections or activations in neural networks in order to reduce overfitting and redundancy in representations while optimizing the model in terms of accuracy and speed. The literature on weight sparsification and compression is vast, see for example [26] and references therein. However, our work is only concerned with activity sparsity, and not weight sparsity.

Sparse activity was initially inspired by sparse coding [23] and has been used in the literature in many different forms. The commonly used ReLU activation function [13] induces activation sparsity in a fraction of neurons and recently the authors of [18] proposed three activation functions to induce spike-like sparsity. A k-Winners-Take-All (k-WTA) activation function, which retains only the k highest activations from a layer and sets all the rest to zero, was used in [17] to improve adversarial robustness. This is a natural generalization of the boolean K-Winners-Take-All network [27], which was motivated by biological neural circuits and had boolean outputs. The authors of [28] also utilized k-WTA in k-Sparse Autoencoders.
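To make the contrast with EBP concrete, the k-WTA operation described above can be sketched in a few lines of NumPy. This is only an illustration of the general idea of keeping the k largest activations; the function name `k_wta`, the `density` argument and the rounding rule are assumptions made here, not details taken from [17].

```python
import numpy as np

def k_wta(x, density):
    """Keep the k largest activations in each row and zero the rest.

    `density` is the fraction of units kept (hypothetical parameter name);
    k is obtained by rounding density * n and is at least 1.
    Ties at the k-th value are all kept, so more than k units may survive.
    """
    x = np.asarray(x, dtype=float)
    k = max(1, int(round(density * x.shape[-1])))
    # k-th largest value per row, kept as a trailing axis for broadcasting
    kth = np.partition(x, -k, axis=-1)[..., -k][..., None]
    return np.where(x >= kth, x, 0.0)
```

Note that this modifies the forward pass itself, whereas EBP leaves the forward pass unchanged and acts only through a loss penalty.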
Another winner-take-all method is Duty Cycle [19], which sparsifies activations in a layer like k-WTA. In addition, it sparsifies the connections preceding that layer by initializing them from a sparse random distribution, and it introduces a boosting term that favors units that have not been recently active, in order to encourage every unit to be equally active and hence maintain the representational power of the model. Our method differs from these works in that it associates images of each class with a common group of winners/elites without modifying the forward pass or pruning connections. Instead it penalizes filters that activate on images from a class that has low probability of activating those filters.

Dropout [15] has been widely used to prevent co-adaptation of neurons by randomly dropping them during training with uniform probability. Sparseout [16] extended Dropout by imposing an $L_q$ penalty on the activations, allowing one to choose the level of activation sparsity. Recently DASNet [20] introduced a dynamic activation sparsity method, utilizing a winners-take-all dropout technique. DASNet behaves as a mask between layers and prunes low-ranking neurons in terms of their activation magnitude at run-time for computational speedups. In contrast, EBP is not concerned with pruning and the activations are not masked at runtime.

Other solutions induce activation sparsity by applying regularization penalties; [29] penalizes the deviation of the expected activation of the hidden units from a low fixed level $p$ to achieve a sparsity level $p$ layer-wise, and [14] added a cross entropy term between the average probability of unit activation and the desired sparsity level $p$ that encourages the activation probability of a neuron to be close to $p$. [30] used a clustering based regularization approach to obtain sparse representations.
Recently [21] exploited an $L_1$ activation regularizer for computational benefits and [22] improved the results using a variant of ReLU in conjunction with a regularizer based on the Hoyer sparsity metric [31, 32]. Both [21, 22] induce activation sparsity layer-wise, i.e., a certain percentage of neuron activations in each layer is retained, and [22] introduces a novel approach to activation pruning for computational speedups. In contrast, EBP induces class-wise sparsity, meaning that images from a specific class will be associated with the same group of elite filters, and no pruning is performed.

3. Method

We denote by $C$ the number of classes in a dataset $\mathcal{X}$ and by $c_i$ the ground truth class of an image $X_i \in \mathcal{X}$. Given a conv-layer $l$ of a CNN with filters $\{f_1^{(l)}, \dots, f_{M_l}^{(l)}\}$ and a batch of images $\{X_1, \dots, X_N\}$, let $F_i^{(l)} = (F_{i1}^{(l)}, F_{i2}^{(l)}, \dots, F_{iM_l}^{(l)})$ be the feature map output of $X_i$ at the $l$-th layer, where each $F_{ij}^{(l)}$, $j = 1, \dots, M_l$, is a 2D activation matrix that is the output of convolving the feature map of layer $l - 1$ with the $j$-th filter for the $i$-th image, followed by ReLU and maxpooling (if present). $F_i^{(0)} = X_i$ denotes the input image. The activation $A_{ij}^{(l)}$ of filter $f_j^{(l)}$ for the $i$-th image in the batch at layer $l$ is defined as the spatial average of activations:

$$A_{ij}^{(l)} = \frac{1}{H_{il} W_{il}} \sum_{r=1}^{H_{il}} \sum_{s=1}^{W_{il}} \left| \left(F_{ij}^{(l)}\right)_{rs} \right|, \tag{1}$$

where $H_{il}$, $W_{il}$ denote the height and width of the feature map at layer $l$ respectively for the image $X_i$ and $(\cdot)_{rs}$ stands for the $(r, s)$ spatial coordinates. A filter $f_j^{(l)}$ is said to be active for an image $X_i$ if its activation $A_{ij}^{(l)} > \theta$, where $\theta$ is a specified threshold.

EBP associates each class $c$ with a handful of $K_c$ elite filters that activate rarely and have strong activation magnitude on images from that class. This is accomplished by assigning each filter a probability of being active for each class based on the frequency and magnitude of activations on images from that class during training.
Then, for each class $c$, filters are ranked and the top-$K_c$ form its elite. $K_c$ controls the degree of sparsity and the representation power of the model (more sparse for lower $K_c$). Filters that activate on images from a class but do not belong to its elite are penalized, with penalties inversely proportional to their ranking.

On each iteration and for each image $X_i$ of class $c_i$ in the batch, EBP accumulates the $M_l$-dimensional vector $(A_{i1}^{(l)}, \dots, A_{iM_l}^{(l)})$ of average filter activations (Eq. 1) from layer $l$ in a vector $D^{c_i}$ as follows:

$$D^{c_i} \leftarrow D^{c_i} + \left(A_{i1}^{(l)}, \dots, A_{iM_l}^{(l)}\right). \tag{2}$$

Afterwards, for each class $c$, filters are ranked based on their history of accumulated activations:

$$f_{r_i}^{c} \succ f_{r_{i+1}}^{c} \ \text{ iff } \ A_{r_i}^{c} > A_{r_{i+1}}^{c}, \quad \text{for } i \in \{1, \dots, M_l - 1\}, \tag{3}$$

where $D^c = (A_1^c, \dots, A_{M_l}^c)$ and $A_r^c = \sum_i A_{ir}^c$, $r = 1, \dots, M_l$, denotes the accumulated activations of the $r$-th filter $f_r^c$ for class $c$ (the layer superscript $(l)$ is dropped for notational convenience). The $K_c$ highest activated filters for each class $c$ form its elite and are stored in a set $E^c = \{f_{r_1}^c, \dots, f_{r_{K_c}}^c\}$. For each filter $f_j^c$ we define a probability $p_j^c$ of being active for class $c$ as in Equation 4, and we penalize filters which activate on an image $X_i$ of class $c_i$, with penalties $R(W_{1:l})$ proportional to $(1 - p_j^{c_i})$ (i.e., the probability of the filter being inactive):

$$R(W_{1:l}) = \sum_{i=1}^{N} \sum_{j=1}^{M_l} \left(1 - p_j^{c_i}\right) A_{ij}^{(l)}, \quad \text{where } p_j^c = 1 - \delta_j^c \, \frac{D_j^c}{A_{r_{K_c}}^c}, \tag{4}$$

where $W_{1:l}$ denotes the set of weights from layer 1 up to $l$, $\delta_j^c = 1$ if $f_j \notin E^c$ and 0 otherwise, $D_j^c$ is the $j$-th entry of $D^c$ and $A_{r_{K_c}}^c$ is the $K_c$-th sorted accumulated activation (Eq. 3) for class $c$. The total loss we optimize in each batch is a combination of the cross entropy loss and $R$:

$$L_W(y, \hat{y}) = -\sum_{i=1}^{N} \sum_{c=1}^{C} y_{ic} \log \hat{y}_{ic} + \lambda R(W_{1:l}), \tag{5}$$

where $y_{ic} = 1$ if the $i$-th observation $X_i$ is of class $c$ and 0 otherwise, $\hat{y}_{ic}$ is the predicted probability that $X_i$ is of class $c$ and $\lambda \in \mathbb{R}$ controls the regularization strength.
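The bookkeeping in Eqs. 1 to 5 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, feature maps are assumed to be channels-last arrays of shape (N, H, W, M), the same $K$ is used for every class, and a small `eps` guards against division by zero. The sketch only computes the value of the penalty; in a training framework the gradient would flow through the activations, with the probabilities presumably treated as constants.

```python
import numpy as np

def filter_activations(feature_maps):
    """Eq. 1: spatial average of |F| per filter; (N, H, W, M) -> (N, M)."""
    return np.abs(feature_maps).mean(axis=(1, 2))

def accumulate(D, A, labels):
    """Eq. 2: add each image's activation vector to the history of its class.
    D has shape (C, M) and is updated in place."""
    np.add.at(D, np.asarray(labels), A)
    return D

def elite_probabilities(D, K, eps=1e-12):
    """Eqs. 3-4: rank filters per class, mark the top-K as elite and
    compute p[c, j] = 1 - delta[c, j] * D[c, j] / (K-th largest entry of D[c])."""
    order = np.argsort(-D, axis=1)                    # filter indices, best first
    elite = np.zeros(D.shape, dtype=bool)
    np.put_along_axis(elite, order[:, :K], True, axis=1)
    kth = np.take_along_axis(D, order[:, K - 1:K], axis=1)   # shape (C, 1)
    delta = ~elite                                    # 1 for non-elite filters
    p = 1.0 - delta * D / np.maximum(kth, eps)
    return p, elite

def ebp_penalty(A, labels, p):
    """Eq. 4: R = sum_i sum_j (1 - p[c_i, j]) * A[i, j] for one batch."""
    return float(((1.0 - p[np.asarray(labels)]) * A).sum())

def total_loss(y_onehot, y_prob, A, labels, p, lam):
    """Eq. 5: cross entropy plus lambda times the EBP penalty."""
    cross_entropy = -float((y_onehot * np.log(y_prob + 1e-12)).sum())
    return cross_entropy + lam * ebp_penalty(A, labels, p)
```

With this reading of Eq. 4, elite filters receive probability 1 and contribute nothing to the penalty, while a non-elite filter is weighted by its accumulated activation relative to the $K$-th ranked filter of the image's class.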
In our experiments we have chosen for simplicity the number of elites $K_c$ per class $c$ to be equal to $K$ for all classes, and we leave experimenting with different $K_c$ for each class for future work. We refer to the Appendix (Section 6) for further discussion and visualization of the class probabilities of images before and after applying EBP.

4. Experiments

We conduct experiments on two image classification datasets: a 6-class animal subset [2] of Pascal VOC [33] and a toy 3-class subset of the Places365 dataset [34] consisting of the classes forest road, highway and street, which we will refer to as the road dataset. Regarding (train, valid, test) splits, we further split the original train-val set to get a test set because annotations were not available for the original test set. We use (3341, 594, 1399) for the 6-class Pascal subset and (10444, 1500, 3054) for the 3-class road dataset. We quantitatively compare EBP in terms of accuracy, activation sparsity and rule extraction (Section 4.1) with several activation sparsity methods from the literature. We show that EBP results in higher sparsity without sacrificing accuracy and also that rules extracted from a CNN trained with EBP use fewer atoms and have higher fidelity. Qualitative examples of extracted rules in the road dataset using EBP are shown in Fig. 3.

From k-winner-take-all methods we compare with k-WTA and Duty Cycle, and from regularization methods we compare with [29], which we refer to as EASR (Expected Activation Sparsity Regularization). We use Sparseout to investigate the effect of dropout-like methods. Since our method is concerned with sparsifying existing layer activations, we do not compare with [1, 2] because they require additional, specialised neural layers to enforce interpretability. For each method, we initialized VGG-16 with weights pretrained on ImageNet and followed the preprocessing and augmentations in [35].
We finetuned the last dense layer for 50 epochs with the Adam optimizer [36], ridge regularization parameter 0.005 and learning rate $5 \times 10^{-5}$ using an NVIDIA GPU 1080 Ti 12GB and TensorFlow (footnote 1). Afterwards, we applied each method after the conv13 layer and resumed training all layers for 100 epochs with learning rate $10^{-6}$. We select the conv13 layer because filters in deeper layers of CNNs tend to represent object parts and more semantic concepts than earlier layers [37, 38].

Regarding the hyperparameters, for EBP we trained with different $K \in \{10, 25, 50, 100, 200, 300\}$ (same $K$ for all classes) and regularization values (footnote 2) $\lambda \in \Lambda = \{0.1, 0.01, 0.001, 0.0001\}$. For EASR we used $\lambda \in \{0.0005, 0.0001, 0.00005, 0.00002\}$ and for k-WTA we trained with density ratio $k \in \{0.05, 0.1, 0.2, 0.5, 0.8, 0.9, 0.95\}$. For Duty Cycle we tune the density coefficient $\hat{a}_l$, which controls the percentage of neurons that are expected to be active, in $\{0.1, 0.2, 0.5, 0.7, 0.9\}$ and the boosting coefficient $\beta$ from 1 up to 512, incrementing it in powers of 2. For Sparseout we experimented with values $p \in \{0.3, 0.5\}$ and $q \in \{0.5, 1, 1.5, 2, 2.5, 3\}$.

Footnote 1: https://www.tensorflow.org/
Footnote 2: For smaller $K$ values and $\lambda = 0.01$ the convergence was slower. Training for 100 epochs was necessary to determine the best model, but with higher $K$ values and lower $\lambda$ the models converged in 30–50 epochs.

Figure 1: Comparing accuracy, fidelity and size of the extracted program for each method using different hyperparameters and average tree depth 6 for road (1a-c) and 7 for the 6-class subset of Pascal (2a-c) (each point represents the average over three trials). EBP is associated with high activation sparsity without sacrificing accuracy. Rules extracted from a CNN trained with EBP use fewer atoms in explanations.

Table 1: Road dataset. Columns Extracted Acc. to Avg. Depth report rule extraction; columns Hoyer Spars. to Activ. Spars. report activity sparsity.

| Sparsity Method | Extracted Acc. | Fidelity | Atoms | Rules | Avg. Depth | Hoyer Spars. | Life. Kurt. | Popul. Kurt. | T-R Popul. | T-R Life. | Activ. Spars. | CNN Acc. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EBP | 84.48 | 89.87 | 12 | 36 | 6 | 0.9287 | 1175.3 | 212.34 | 0.9868 | 0.9807 | 0.9791 | 89.47 |
| Vanilla | 78.95 | 82.35 | 44 | 56 | 6 | 0.5593 | 32.02 | 28.26 | 0.7789 | 0.7366 | 0.8367 | 87.46 |
| Sparseout | 80.64 | 84.84 | 36 | 46 | 6 | 0.6331 | 57.02 | 27.24 | 0.84 | 0.769 | 0.8746 | 89.03 |
| DutyCycle | 82.16 | 87.11 | 28 | 39 | 6 | 0.7413 | 260.36 | 46.45 | 0.9141 | 0.8508 | 0.9126 | 89.16 |
| k-WTA | 80.85 | 86.46 | 31 | 45 | 6 | 0.6436 | 61.83 | 28.54 | 0.8483 | 0.7671 | 0.8803 | 89.32 |
| EASR | 79.04 | 84.38 | 19 | 36 | 6 | 0.9269 | 967.08 | 223.91 | 0.9862 | 0.9792 | 0.9789 | 88.93 |

For each method we have chosen hyperparameters that yielded the best CNN accuracy and sparsity on the road dataset (see Appendix 6). For an ablation study of different hyperparameters see Fig. 1. CNNs trained with EBP achieve high activity sparsity and the program extracted from the CNN using EBP has higher fidelity and uses fewer rules and atoms in explanations. The best depth for each method was chosen based on sklearn's cost-complexity pruning and results are averaged across 3 different runs. Notice also that all sparse activity methods achieve better accuracy than the baseline.

Table 2: 6-class subset of Pascal. Columns are as in Table 1.

| Sparsity Method | Extracted Acc. | Fidelity | Atoms | Rules | Avg. Depth | Hoyer Spars. | Life. Kurt. | Popul. Kurt. | T-R Popul. | T-R Life. | Activ. Spars. | CNN Acc. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EBP | 71.48 | 72.69 | 26 | 32 | 7 | 0.861 | 286.64 | 140.93 | 0.9508 | 0.9674 | 0.9584 | 89.2 |
| Vanilla | 57.18 | 57.68 | 54 | 63 | 7 | 0.454 | 17.32 | 18.88 | 0.6983 | 0.6659 | 0.7856 | 86.49 |
| Sparseout | 66.98 | 68.26 | 39 | 51 | 7 | 0.488 | 16.84 | 19.55 | 0.7189 | 0.6598 | 0.7994 | 92.2 |
| DutyCycle | 53.45 | 67.33 | 44 | 58 | 7 | 0.619 | 71.55 | 46.4 | 0.8104 | 0.8032 | 0.8527 | 87.53 |
| k-WTA | 70.41 | 71.19 | 48 | 53 | 7 | 0.483 | 15.73 | 18.45 | 0.7069 | 0.6459 | 0.7909 | 91.92 |
| EASR | 58.12 | 59.25 | 42 | 41 | 7 | 0.731 | 71.87 | 78.69 | 0.9061 | 0.882 | 0.9125 | 87.92 |

Similar analysis as in Table 1 for rule extraction and sparsity of each method on the 6-class subset of Pascal. An ablation study for different hyperparameters is depicted in Fig. 1 (bottom). Results consistently indicate that EBP realizes high activity sparsity and the program extracted from the CNN using EBP has higher fidelity and uses fewer rules and atoms in explanations.

Figure 2: Comparison of a filter's highly activated feature maps (as defined in [2]) in an ordinary CNN (top) and those after EBP (bottom) in the road dataset (col. 1-4) and Pascal (col. 5-8). Filters trained with EBP capture more compact semantic regions and it is easier to assess the activation pattern.

Figure 3: Visualization of rules extracted after training with EBP for depth 3 trees and the receptive field of filters. A green frame denotes the presence and a red frame the absence of a pattern. Notice how concepts like trees and buildings are shared between classes, promoting more compositional representations that use fewer atoms.

4.1. Sparsity and Rule Extraction

We measure the activity sparsity after training with each method using various metrics, namely the Hoyer measure [32], Lifetime and Population kurtosis, Treves-Rolls lifetime (T-R Life.), Treves-Rolls population (T-R Popul.) and activity sparseness [39], on images from the road dataset and the 6-class subset of Pascal. The rationale behind choosing this toy road dataset to assess the sparsity and benefits in a simple rule extraction algorithm (Section 4.1) is that scenes contain topics that are shared between different classes and we don't want to learn separate filters for each of them. For example, trees appear in all classes, hence it is preferable to learn one filter that fires in response to trees (in all classes) rather than having a filter per class. Results in Tables 1 and 2 indicate that EBP achieves high activity sparsity without sacrificing accuracy. Fig. 2 shows that filters trained with EBP consistently detect a specific pattern across images and that they fire in response to smaller and more compact semantic regions, which are more interpretable and have clearer activation patterns.
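As an illustration of one of the sparsity metrics above, the Hoyer measure [32] of a non-negative activation vector can be sketched as below. The formula is the standard one, $(\sqrt{n} - \|x\|_1 / \|x\|_2) / (\sqrt{n} - 1)$; the function names, the handling of degenerate inputs, and the choice to average the measure over images are assumptions made for this sketch, since the exact aggregation used for Tables 1 and 2 is not spelled out in the text.

```python
import numpy as np

def hoyer_sparsity(x):
    """Hoyer measure in [0, 1]: 0 for a uniform vector, 1 for a one-hot vector."""
    x = np.abs(np.asarray(x, dtype=float)).ravel()
    n = x.size
    l2 = np.sqrt((x ** 2).sum())
    if n < 2 or l2 == 0.0:
        return 0.0  # convention chosen here for degenerate input
    return float((np.sqrt(n) - x.sum() / l2) / (np.sqrt(n) - 1.0))

def mean_hoyer_over_images(A):
    """Average Hoyer measure of the per-image activation vectors A[i, :] (Eq. 1)."""
    return float(np.mean([hoyer_sparsity(row) for row in np.asarray(A)]))
```

A layer in which each image activates only a few filters strongly gives values close to 1, which is the regime reported for EBP and EASR on the road dataset.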
Though there is no consensus in the literature regarding a metric for interpretability, the size of an explanation has been proposed as an option [8, 12]. To assess the extent to which EBP benefits rule extraction by inducing parsimonious compositional representations that re-use and combine primitive filters to form new concepts, we show how a binary decision tree trained as a symbolic approximation of a CNN's behavior is more compact when that CNN is trained with EBP than without. To perform rule extraction we first quantize each filter activation. A filter is considered active if its activation (Eq. 1) is above $(\mu + \sigma)$, where $\mu$ and $\sigma$ denote the mean and standard deviation of filter activations after the conv13 layer across all images. The tree takes the thresholded filter activations of the conv13 layer as the input and the CNN's output classification as the target, so that for a given input instance the induced decision path from the root to the leaf node provides a symbolic explanation of the original CNN's classification of that instance. Each such path can be regarded as a separate rule over a set of atoms, e.g. $A \wedge \neg B \wedge C \rightarrow \text{highway}$ (Fig. 3 top-left) states that if filters A and C are active and filter B is inactive, then the input is a highway. We approximated VGG-16 on the road dataset and the 6-class subset of Pascal by generating decision trees using the DecisionTreeClassifier class of the sklearn library (footnote 3) with the entropy criterion and cost-complexity pruning to choose the optimal depth. Fig. 1 (1a-c, 2a-c) shows the tradeoff between sparsity, fidelity and accuracy after training using different hyperparameter values for each method. Results in Table 1 and Fig. 1 demonstrate that programs extracted from a CNN trained with EBP have higher fidelity and use consistently fewer unique atoms and rules in explanations compared to other approaches.
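The rule extraction procedure described above can be sketched with scikit-learn as follows. This is a hedged illustration rather than the authors' code: the helper names are hypothetical, the threshold $\mu + \sigma$ is assumed here to be computed per filter over all images (the text could also be read as a single global threshold), the pruning strength `ccp_alpha` is left as a free parameter, atoms are counted as the unique filters used in splits, rules as the number of leaves, and fidelity is measured on the same images the tree was fitted on.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def quantize(A):
    """Filter j is active on image i if A[i, j] > mu_j + sigma_j.
    Assumption: mu and sigma are computed per filter over all images."""
    return (A > A.mean(axis=0) + A.std(axis=0)).astype(np.int8)

def extract_rules(A, cnn_pred, ccp_alpha=0.0, seed=0):
    """Fit a decision tree on quantized activations to mimic the CNN's predictions.

    Returns the tree, its fidelity to the CNN, the number of unique atoms
    (filters used in splits) and the number of rules (root-to-leaf paths).
    """
    atoms = quantize(A)
    tree = DecisionTreeClassifier(criterion="entropy",
                                  ccp_alpha=ccp_alpha,
                                  random_state=seed)
    tree.fit(atoms, cnn_pred)
    fidelity = float((tree.predict(atoms) == cnn_pred).mean())
    split_features = tree.tree_.feature            # negative entries mark leaves
    n_atoms = int(np.unique(split_features[split_features >= 0]).size)
    n_rules = int(tree.get_n_leaves())
    return tree, fidelity, n_atoms, n_rules
```

Candidate values for `ccp_alpha` can be obtained from the classifier's `cost_complexity_pruning_path` method and selected on held-out data; how the paper selected them beyond "cost-complexity pruning" is not specified.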
Fewer atoms and rules at higher fidelity indicate that EBP induces more compositional representations that re-use and combine filters instead of using redundant multiple filters for each class separately. Fig. 3 shows some rules extracted from a CNN trained with EBP and the receptive fields of the filters' activations. For visualization purposes we used a tree of depth 3.

5. Conclusion and future work

In this paper we proposed a method that promotes group activation sparsity to encourage more parsimonious and interpretable representations in a CNN. We demonstrated that models trained with EBP realize high degrees of activation sparsity without sacrificing accuracy and benefit a rule extraction algorithm that distils the knowledge from a CNN; fewer atoms are used in explanations while maintaining high fidelity. In future work we aim to conduct experiments with different $K_c$ values for each class $c$, further analysis of different layers and rule extraction, and possibly an application with emphasis on studying the benefits of compositionality. Another future direction is to embed rules into neural networks in order to enhance their interpretability and generalization capabilities. Rule embedding can assist in neural-symbolic integration, which is an important step towards bridging the gap between neural networks and symbolic representations.

Footnote 3: https://scikit-learn.org/stable/

References

[1] C. Chen, O. Li, D. Tao, A. Barnett, C. Rudin, J. K. Su, This looks like that: Deep learning for interpretable image recognition, in: Advances in Neural Information Processing Systems, volume 32, Curran Associates, Inc., 2019.
[2] Q. Zhang, Y. N. Wu, S.-C. Zhu, Interpretable convolutional neural networks, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 8827–8836.
[3] J. Townsend, T. Kasioumis, H.
Inakoshi, ERIC: extracting relations inferred from convolutions, in: 15th Asian Conference on Computer Vision, Kyoto, Japan, Revised Selected Papers, Part III, volume 12624 of Lecture Notes in Computer Science, Springer, Nov. 30 - Dec. 4, 2020, pp. 206–222.
[4] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, D. Pedreschi, A survey of methods for explaining black box models, ACM Comput. Surv. 51 (2018).
[5] S. M. Lundberg, S.-I. Lee, A unified approach to interpreting model predictions, in: Advances in Neural Information Processing Systems, volume 30, Curran Associates, Inc., 2017.
[6] M. T. Ribeiro, S. Singh, C. Guestrin, "Why should I trust you?": Explaining the predictions of any classifier, in: The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego California, USA, June 12-17, 2016, pp. 97–101.
[7] N. Frosst, G. E. Hinton, Distilling a neural network into a soft decision tree, ArXiv abs/1711.09784 (2017).
[8] R. Andrews, J. Diederich, A. B. Tickle, Survey and critique of techniques for extracting rules from trained artificial neural networks, Knowledge-Based Systems 8 (1995) 373–389.
[9] H. Jacobsson, Rule extraction from recurrent neural networks: A taxonomy and review, Neural Computation 17 (2005) 1223–1263.
[10] Q. Zhang, R. Cao, F. Shi, Y. N. Wu, S.-C. Zhu, Interpreting CNN knowledge via an explanatory graph, Proceedings of the AAAI Conference on Artificial Intelligence 32 (2018).
[11] Q. Zhang, Y. Yang, H. Ma, Y. N. Wu, Interpreting CNNs via decision trees, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[12] J. Townsend, T. Chaton, J. M. Monteiro, Extracting relational explanations from deep neural networks: A survey from a neural-symbolic perspective, IEEE Transactions on Neural Networks and Learning Systems 31 (2020) 3456–3470.
[13] X. Glorot, A. Bordes, Y. Bengio, Deep sparse rectifier neural networks, in: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, Fort Lauderdale, USA, 2011, pp. 315–323.
[14] V. Nair, G. E. Hinton, 3D object recognition with deep belief nets, in: Advances in Neural Information Processing Systems, volume 22, Curran Associates, Inc., 2009.
[15] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, J. Mach. Learn. Res. 15 (2014) 1929–1958.
[16] N. Khan, I. Stavness, Sparseout: Controlling sparsity in deep networks, in: Advances in Artificial Intelligence - 32nd Canadian Conference on Artificial Intelligence, Canadian AI 2019, Kingston, ON, Canada, May 28-31, 2019, Proceedings, volume 11489 of Lecture Notes in Computer Science, Springer, 2019, pp. 296–307.
[17] C. Xiao, P. Zhong, C. Zheng, Enhancing adversarial defense by k-winners-take-all, International Conference on Learning Representations (ICLR) (2020).
[18] P. Bizopoulos, D. Koutsouris, Sparsely activated networks, IEEE Transactions on Neural Networks and Learning Systems 32 (2021) 1304–1313.
[19] S. Ahmad, L. Scheinkman, How can we be so dense? The benefits of using highly sparse representations, ICML Workshop on Uncertainty and Robustness in Deep Learning (2019).
[20] Q. Yang, J. Mao, Z. Wang, H. Li, DASNet: Dynamic activation sparsity for neural network efficiency improvement, in: 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI), 2019, pp. 1401–1405.
[21] G. Georgiadis, Accelerating convolutional neural networks via activation map compression, in: Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 7078–7088.
[22] M. Kurtz, J. Kopinsky, R. Gelashvili, A. Matveev, J. Carr, M. Goin, W. Leiserson, S. Moore, N. Shavit, D. Alistarh, Inducing and exploiting activation sparsity for fast inference on deep neural networks, in: Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, 2020, pp. 5533–5543.
[23] B. A. Olshausen, D. J. Field, Sparse coding with an overcomplete basis set: A strategy employed by V1?, Vision Research 37 (1997) 3311–3325.
[24] I. Donadello, L. Serafini, A. D. Garcez, Logic tensor networks for semantic image interpretation, IJCAI'17, AAAI Press, 2017, pp. 1596–1602.
[25] A. S. d. Garcez, D. M. Gabbay, K. B. Broda, Neural-Symbolic Learning System: Foundations and Applications, Springer-Verlag, Berlin, Heidelberg, 2002.
[26] R. Ma, L. Niu, A survey of sparse-learning methods for deep neural networks, in: 2018 IEEE/WIC/ACM International Conference on Web Intelligence (WI), 2018, pp. 647–650.
[27] E. Majani, R. Erlanson, Y. Abu-Mostafa, On the k-winners-take-all network, in: Advances in Neural Information Processing Systems, volume 1, Morgan-Kaufmann, 1989.
[28] A. Makhzani, B. Frey, K-sparse autoencoders, arXiv preprint arXiv:1312.5663 (2013).
[29] H. Lee, C. Ekanadham, A. Y. Ng, Sparse deep belief net model for visual area V2, in: Proceedings of the 20th International Conference on Neural Information Processing Systems, NIPS'07, Curran Associates Inc., Red Hook, NY, USA, 2007, pp. 873–880.
[30] R. Liao, A. Schwing, R. S. Zemel, R. Urtasun, Learning deep parsimonious representations, in: Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, Curran Associates Inc., Red Hook, NY, USA, 2016, pp. 5083–5091.
[31] K. Kimura, T. Yoshida, Non-negative matrix factorization with sparse features, in: 2011 IEEE International Conference on Granular Computing, 2011, pp. 324–329.
[32] P. Hoyer, Non-negative matrix factorization with sparseness constraints, Journal of Machine Learning Research (2004) 1457–1459.
[33] M. Everingham, L. Van Gool, C. K. I. Williams, J.
Winn, A. Zisserman, The pascal visual object classes challenge 2012 (voc2012) results, 2012. URL: http://www.pascal-network. org/challenges/VOC/voc2012/workshop/index.html. [34] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, A. Torralba, Places: A 10 million image database for scene recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (2018) 1452–1464. [35] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, volume 25, Curran Associates, Inc., 2012. [36] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015. [37] D. Bau, B. Zhou, A. Khosla, A. Oliva, A. Torralba, Network dissection: Quantifying interpretability of deep visual representations, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017) 3319–3327. [38] S. Odense, A. Garcez, Layerwise knowledge extraction from deep convolutional networks, ArXiv abs/2003.09000 (2020). [39] B. Willmore, D. Tolhurst, Characterizing the sparseness of neural codes, Network (Bristol, England) 12 (2001) 255—270. 6. Appendix 6.1. Best hyperparameters for each method For each method in Tables 1 and 2 we select hyperparameter that yield the best accuracy on vali- ation set. If different hyparameters yielded similar accuracy (less than 0.5% accuracy deviation) we choose the parameters that yielded much higher Hoyer sparsity. The best parameters for each dataset are shown in Table 3. Fig. 1 shows the results on different hyperparameter runs. For EBP higher values for 𝐾 resulted in lower sparsity, higher number of atoms used in explanations and lower fidelity. This is intuitive because most of the kernels in conv13 layer are redundant for classification on road dataset or on the 6-class subset of Pascal. Moreover from Fig. 
1 it is evident that as 𝐾 increases EBP performs similarly to other methods from the literature, whereas it performs better for lower 𝐾.

Method       Road dataset            6-class subset of Pascal dataset
EBP          𝐾 = 25, 𝜆 = 0.001       𝐾 = 25, 𝜆 = 0.001
Sparseout    𝑝 = 0.3, 𝑞 = 2.5        𝑝 = 0.3, 𝑞 = 2
DutyCycle    𝑎^𝑙 = 0.7, 𝛽 = 8        𝑎^𝑙 = 0.7, 𝛽 = 8
𝑘-WTA        𝑘 = 0.8                 𝑘 = 0.2
EASR         𝜆 = 0.00005             𝜆 = 0.00001

Table 3: Hyperparameters that yielded the best accuracy and Hoyer sparsity on the road dataset and the 6-class subset of Pascal. For DutyCycle, 𝑎^𝑙 and 𝛽 refer to the density and boost coefficients; for 𝑘-WTA, 𝑘 stands for the sparse ratio. The hyperparameters 𝑝 and 𝑞 control the percentage of neurons dropped in Sparseout, and the hyperparameter 𝜆 controls the regularization strength in the loss.

6.2. Probabilities of filter activations for each class

In Section 3 we defined the probability 𝑝𝑗𝑐 of a filter 𝑓𝑗 being active for a particular class 𝑐 as in Equation 4. This probability is computed dynamically during training from the history of activations. Fig. 4 (top) shows the evolution of a filter's activation for each class before and after applying EBP. Before applying EBP, the filter fired in response to a collection of patterns (trees, road, cyclists), hence the probability of filter activation was spread across different classes (blue histogram in Fig. 4). Moreover, it was difficult to assess the cause of its activation.

Figure 4: Visualization of the receptive field and probability of filter activation per class before (second column) and after applying EBP (third column). The histogram on the right-hand side shows the corresponding probabilities of activation of the filter for each class.

After training with EBP, the filter fires in response to a specific pattern (trees), which is shared between classes but mostly present in forest road. Fig. 4 (bottom) shows the evolution of another filter that is shared between classes. Before applying EBP, the filter fired in response to traffic signs and trees.
After training with EBP, the filter fires in response to traffic signs only, which are present mostly in the highway and street classes (hence the probability for those classes has been boosted in the orange histogram). Observe that before applying EBP the probability of activation for forest road was higher (since the combined presence of trees and traffic signs is higher in forest road than in the street and highway classes).

6.3. Discussion of proposed method and regularization

In the proposed algorithm in Section 3, elite filters were not penalized during training, since the assigned probability from Equation 4 is 1. This can be altered to introduce penalties for all 𝐷𝑐 filters (elite included) by defining 𝑝𝑗𝑐 = 1 − 𝐴𝑐𝑗 . Penalizing all filters may induce even more 𝑟1 activity sparsity. In our experiments we used the setup described in Section 3.

Furthermore, we recommend applying ridge weight regularization on layer 𝑙 + 1, which follows the EBP layer 𝑙, to constrain its weights to a small Euclidean ball. The reason is that EBP penalizes activations on layer 𝑙 according to a probability distribution; if we do not impose any constraints on the weights following that layer, the model is free to learn arbitrarily large weights on layer 𝑙 + 1 and possibly negate our penalization. In our experiments we did not use additional regularization after conv13 (where EBP was applied) in order to make our method directly comparable with other approaches from the literature. However, our preliminary experiments show no significant difference after adding this additional regularization on the road dataset and the 6-class subset of Pascal.
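
The scaling argument behind the ridge recommendation can be illustrated with a minimal sketch. The toy two-layer ReLU network below, the helper names (forward, activation_penalty, ridge_penalty, rescale) and all numeric values are assumptions made purely for illustration and are not taken from the experiments; a plain L1 activation penalty stands in for the class-wise EBP penalty. Dividing the weights of layer 𝑙 by a factor s > 0 and multiplying the weights of layer 𝑙 + 1 by s leaves the output unchanged while shrinking the activation penalty, whereas a ridge term on layer 𝑙 + 1 grows with s squared and so discourages this escape.

```python
# Illustrative toy example only: the network, names and numbers are hypothetical.
# A generic L1 activation penalty is used as a stand-in for the EBP penalty.

def relu(x):
    return max(0.0, x)

def forward(x, w_l, w_next):
    """Toy network: layer l (ReLU units) followed by a linear layer l+1."""
    acts = [relu(sum(w * xi for w, xi in zip(row, x))) for row in w_l]
    out = [sum(w * a for w, a in zip(row, acts)) for row in w_next]
    return acts, out

def activation_penalty(acts):
    """L1 penalty on the activations of layer l."""
    return sum(abs(a) for a in acts)

def ridge_penalty(weights):
    """Squared Euclidean norm of a weight matrix (ridge regularization)."""
    return sum(w * w for row in weights for w in row)

def rescale(w_l, w_next, s):
    """Shrink layer-l weights by 1/s and grow layer-(l+1) weights by s (s > 0)."""
    shrunk = [[w / s for w in row] for row in w_l]
    grown = [[w * s for w in row] for row in w_next]
    return shrunk, grown

x = [1.0, 2.0]
w_l = [[0.5, 0.25], [-1.0, 1.0]]
w_next = [[2.0, -1.0]]

acts, out = forward(x, w_l, w_next)
w_l_s, w_next_s = rescale(w_l, w_next, s=4.0)
acts_s, out_s = forward(x, w_l_s, w_next_s)

print("output unchanged:", out, out_s)
print("activation penalty:", activation_penalty(acts), "->", activation_penalty(acts_s))
print("ridge penalty on layer l+1:", ridge_penalty(w_next), "->", ridge_penalty(w_next_s))
```

In this sketch the rescaled network produces the same output with a four times smaller activation penalty, but its ridge penalty on layer 𝑙 + 1 is sixteen times larger, which is the trade-off the additional regularization is meant to enforce.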