=Paper=
{{Paper
|id=Vol-2742/short6
|storemode=property
|title=Understanding Deep Learning with Activation Pattern Diagrams
|pdfUrl=https://ceur-ws.org/Vol-2742/short6.pdf
|volume=Vol-2742
|authors=Francesco Craighero,Alex Graudenzi,Fabrizio Angaroni,Fabio Stella,Marco Antoniotti
|dblpUrl=https://dblp.org/rec/conf/aiia/CraigheroGASA20
}}
==Understanding Deep Learning with Activation Pattern Diagrams==
Understanding Deep Learning with Activation Pattern Diagrams

Francesco Craighero 1, Fabrizio Angaroni 1, Alex Graudenzi 2,†,∗, Fabio Stella 1,†, and Marco Antoniotti 1,†

1 Department of Informatics, Systems and Communication, University of Milan-Bicocca, Milan, Italy
2 Institute of Molecular Bioimaging and Physiology, Consiglio Nazionale delle Ricerche (IBFM-CNR), Segrate, Italy
∗ corresponding author: alex.graudenzi@unimib.it
† co-senior authors

Abstract. The growing demand for machine learning tools to solve hard tasks, from natural language processing to image understanding, has recently shifted attention towards understanding, and possibly explaining, the behaviour of deep learning. Deep neural networks today represent the state of the art in many applications that have been shown to be solvable by data-driven approaches. However, they are also well known for their complexity, which hinders the interpretation of their functioning. To address this issue, researchers have lately focused either on understanding the optimization algorithms or on extracting information from a trained model; in this context we propose the Activation Pattern Diagram (APD) as a new tool to analyse neural networks by focusing mainly on the input data. The APD is a graphical representation of how a dataset is learned by a neural network with piecewise linear activation functions, such as the ReLU activation. By analysing the evolution of the diagram during the training procedure, the APD sheds light on the learning process and on how data influences it. Additionally, we introduce a way to plot the APD that helps the visualization and interpretation of the diagram.

Keywords: Activation Patterns · Piecewise Linear Functions · Neural Networks · Visualization.

1 Introduction

Deep neural networks (DNNs) have achieved remarkably good results in a broad range of tasks, including Computer Vision [7,8,15], Natural Language Processing [2] and game playing [14]. Nevertheless, due to the complexity of these models, many phenomena are still only partially understood, such as their ability to generalize well with over-parameterized models [17] or their fragility to adversarial attacks [16]. Moreover, the ever-growing adoption of black-box models has fueled the need for explainable systems, in order to gain the trust of users and improve confidence in safety-critical applications [3].

In order to explain neural networks, a number of techniques provide justifications for the predictions [10], such as sensitivity analysis [15]. On the other hand, other methods have been proposed to investigate the properties of DNNs, e.g., to estimate the expressiveness of the model [5,6,13], to analyse the behaviour of DNNs under different optimization techniques [18] or to characterize input data complexity [1]. The inspection of a deep neural network can be simplified by employing piecewise linear activation functions, such as the ReLU activation [4]. These activations partition the input space into linear regions in order to learn complex functions [11]; thus properties of those regions, such as their number or size, can be exploited to better understand the learned function [1,5,6,13,18]. In [1] we defined a novel data structure, the Activation Pattern Diagram (APD), that can be used to understand and visualize how data is transformed by a neural network with piecewise linear activations.
Additionally, we introduced a method to estimate the input data complexity of a dataset, given the function learned by a DNN with ReLU activations. More in detail, we showed that the distribution of the input instances among the linear regions, summarised by the APD, can be used to estimate the confidence of the model in predicting the label of a given instance. Briefly, linear regions identify the transformation applied by the neural network; if many instances share the same linear region, then we expect them to be “more common”, and easier, than an instance that has its own linear region. In fact, linear regions are denser around decision boundaries [18].

In order to further explore the properties of the APD, we aim at investigating its evolution during the training process. To this end, in the following we will:

– introduce a proof-of-concept for a novel tool to visualize the APD on a selected subset of instances, providing a new strategy to interpret DNNs;
– show preliminary results on the evolution of the APD during learning.

2 From Activation Patterns to the APD

Let $N_\theta(x_0)$ be a Deep Neural Network with input $x_0 \in \mathbb{R}^{n_0}$ and trainable parameters $\theta$. A layer $h_l$ of size $n_l$, for $l \in \{1, \dots, L\}$, is defined by neurons $h_{l,i} = g_{l,i} \circ f_{l,i}$, for $i \in \{1, \dots, n_l\}$, where $f_{l,i}$ is a linear preactivation function and $g_{l,i}$ a nonlinear activation function. Let $x_l$ be the output of the $l$-th layer (and the input data to the network for $l = 0$); then we define $f_{l,i}(x_{l-1}) = W_{l,i}\, x_{l-1} + b_{l,i}$, where both $W_{l,i} \in \mathbb{R}^{n_{l-1}}$ and $b_{l,i} \in \mathbb{R}$ belong to the trainable parameters $\theta$. Regarding activation functions, we will focus on the ReLU activation, i.e., $g_{l,i}(x) = \max\{0, x\}$. Finally, we can represent the DNN $N_\theta$ as a function $N_\theta : \mathbb{R}^{n_0} \to \mathbb{R}^{out}$ that can be decomposed as

$$N_\theta(x) = (f_{out} \circ h_L \circ \dots \circ h_1)(x), \qquad (1)$$

where $f_{out}$ is the output layer (e.g., softmax, sigmoid, ...). Moreover, given a dataset $D \subseteq \mathbb{R}^{n_0}$, we define the activation pattern $A_l(x_0)$ of layer $l$ given input $x_0 \in D$ as the following (binary) vector:

$$A_l(x_0) = [\, a_i \mid a_i = 1 \text{ if } h_{l,i}(x_{l-1}) > 0 \text{ else } a_i = 0,\ \forall i = 1, \dots, n_l \,]. \qquad (2)$$

In order to distinguish activation patterns by the layer to which they belong, let us adjust the notation as follows:

$$A^*_l(x_0) = (l, A_l(x_0)), \quad \forall x_0 \in D,\ l \in \{1, \dots, L\}. \qquad (3)$$

Then, let us define the set of activation patterns of layer $l$ for all instances in $D$ as $A^*_l(D) = \{A^*_l(x_0) \mid x_0 \in D\}$, $l \in \{1, \dots, L\}$, where $|A^*_l(D)|$ denotes its cardinality. Lastly, the activation pattern diagram (APD) of dataset $D$ is a directed acyclic graph $APD_{N_\theta}(D) = (V, E)$, where

– $V$ is the set of vertices defined by the activation patterns of all the layers, i.e. $V = \bigcup_{l=1}^{L} A^*_l(D)$;
– $E$ is the set of edges defined by the activations of each input instance $x_0 \in D$, i.e. $(A^*_{l-1}(x_0), A^*_l(x_0)) \in E$ for $l \in \{2, \dots, L\}$.

Note that the APD has the same depth as the network. In the following we will consider an extended version of the APD, in which we add a node for each predicted label and, for each input instance, an edge $(A^*_L(x_0), N_\theta(x_0))$, where $N_\theta(x_0)$ is the predicted label for $x_0$.

Example 1. Let us consider a network $N_\theta$ with $L = 2$ and $n_1 = n_2 = 2$. Given a dataset with one instance $x_0$, we may have $A^*_1(x_0) = (1, [0, 0])$, $A^*_2(x_0) = (2, [1, 0])$ and $N_\theta(x_0) = y_0$, i.e. $y_0$ is the predicted label for $x_0$. Then the APD is defined as:

$$V = \{(1, [0, 0]), (2, [1, 0]), y_0\}, \quad E = \{\big((1, [0, 0]), (2, [1, 0])\big), \big((2, [1, 0]), y_0\big)\}.$$
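To make the definitions above concrete, the following is a minimal NumPy sketch (not the implementation used for the experiments): the network sizes, weights `Ws`, biases `bs` and the dataset `X` are hypothetical placeholders. It computes the activation patterns of Equation (2) and assembles the extended APD as sets of nodes and edges.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-layer ReLU network (L = 3, 4 neurons per layer) on 2-D inputs.
sizes = [2, 4, 4, 4]
Ws = [rng.standard_normal((sizes[l + 1], sizes[l])) for l in range(3)]
bs = [rng.standard_normal(sizes[l + 1]) for l in range(3)]
W_out = rng.standard_normal((3, sizes[-1]))          # 3 output classes

def activation_patterns(x0):
    """Return the binary activation pattern A_l(x0) of every hidden layer, Eq. (2)."""
    x, patterns = x0, []
    for W, b in zip(Ws, bs):
        pre = W @ x + b
        patterns.append(tuple((pre > 0).astype(int)))  # 1 iff the neuron fires
        x = np.maximum(pre, 0.0)                       # ReLU
    y_hat = int(np.argmax(W_out @ x))                  # predicted label
    return patterns, y_hat

def build_apd(X):
    """Extended APD: nodes are (layer, pattern) pairs plus predicted labels."""
    nodes, edges = set(), set()
    for x0 in X:
        pats, y_hat = activation_patterns(x0)
        starred = [(l + 1, p) for l, p in enumerate(pats)]  # A*_l(x0), Eq. (3)
        nodes.update(starred)
        nodes.add(("label", y_hat))
        edges.update(zip(starred[:-1], starred[1:]))        # layer-to-layer edges
        edges.add((starred[-1], ("label", y_hat)))          # last layer -> prediction
    return nodes, edges

X = rng.standard_normal((500, 2))          # hypothetical dataset D
V, E = build_apd(X)
print(len(V), "nodes,", len(E), "edges")
```

With this representation, counting the unique activation patterns of layer l, as done in the next section, reduces to counting the nodes of V whose first component equals l.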
3 APD evolution during training

In this section we show results obtained with a neural network with $L = 3$ layers of 40 neurons each, trained on the MNIST dataset [9] with SGD and a fixed learning rate of 0.001. Figure 1 shows, on the left, the loss during training (with a 0.1 train/validation split of the 60,000 total instances) and, on the right, the evolution of the number of unique activation patterns, i.e. $|A^*_l(D)|$ for $l \in \{1, \dots, 3\}$, where $D$ is the training set. We can see that the number of unique patterns at each epoch decreases with the layer's depth; moreover, the number of activation patterns of each layer is always far below the theoretical upper bound of $2^{40}$ possible patterns (see [6] for an explanation of this phenomenon) and below the 54,000 training instances, thus activation patterns are shared between instances.

Fig. 1: (Left) Train (orange) and validation (blue) loss; the best validation loss is achieved at epoch 309. (Right) Number of unique activation patterns per layer, i.e. $|A^*_l(D)|$, with regard to the network at a given epoch. All the layers steadily increase their number of activation patterns during training. Furthermore, the first layer (blue) has 4 times the number of activation patterns of the second layer, while the third layer (green) does not reach more than 4,000 activation patterns in all the 500 epochs. [Figure: two panels, log loss (left) and number of patterns per layer (right), both plotted against the training epoch.]

In figure 2 we plot the APD obtained by performing predictions for the same 500 instances of label “1” with the model learned at epochs 10, 50, 150 and 300. The plots are Sankey diagrams drawn with the Plotly library [12], where:

– the blue rectangles, from left to right, represent the activation patterns of the layers and the predicted labels, with height and color intensity proportional to the number of instances activating that pattern or receiving that predicted label; as an example, in each APD, the upper-right rectangle represents all correctly predicted labels;
– the color of an edge corresponds to the proportion of wrongly classified instances belonging to that edge, and its size is proportional to the number of instances following it.

From figure 2 we can observe that activation patterns are shared more in deeper layers, as also emerges from figure 1. As a consequence, from epoch 150 onwards there are some clear flows of instances that share the same activation patterns. Lastly, wrongly classified instances, with regard to the chosen subset of instances, mostly belong to activation patterns that are not shared by many instances.

Fig. 2: Four different APDs with regard to the same 500 instances of label “1” and the model learned at epochs 10, 50, 150 and 300. The APD is composed of four levels of blue nodes; the nodes of each level identify, from left to right, the activation patterns of the three layers of the network and the predicted labels. Additionally, the nodes have height and color intensity proportional to the number of instances they represent. The first layer, as expected from figure 1, always has more patterns, while in the level of predicted labels there is a tall node on top representing the most frequently predicted label, i.e. “1”. The edges have size proportional to the number of instances they represent, and different colors depending on the proportion of errors. [Figure: four Sankey diagrams, panels (a)–(d) for epochs 10, 50, 150 and 300, with edges colored by the percentage of errors.]
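A Sankey-style APD plot of this kind can be approximated with a few lines of Plotly. The sketch below is a simplified, hypothetical illustration (not the plotting code behind Figure 2): it reuses `X` and `activation_patterns` from the previous sketch, sets link widths proportional to the number of instances, and omits the error-based coloring of Figure 2.

```python
from collections import Counter
import plotly.graph_objects as go

# Count how many instances follow each APD edge (same conventions as build_apd).
edge_counts = Counter()
for x0 in X:
    pats, y_hat = activation_patterns(x0)
    starred = [(l + 1, p) for l, p in enumerate(pats)] + [("label", y_hat)]
    for src, dst in zip(starred[:-1], starred[1:]):
        edge_counts[(src, dst)] += 1

# Assign a Sankey node index to every APD node appearing in some edge.
nodes = sorted({n for edge in edge_counts for n in edge}, key=str)
idx = {n: i for i, n in enumerate(nodes)}

fig = go.Figure(go.Sankey(
    node=dict(label=[str(n) for n in nodes], pad=10, thickness=12),
    link=dict(
        source=[idx[s] for s, _ in edge_counts],
        target=[idx[t] for _, t in edge_counts],
        value=list(edge_counts.values()),  # link width proportional to instance count
    ),
))
fig.show()
```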
The trend observed in the previous figures is confirmed by figure 3. We first clustered instances based on the patterns of both the second and the last layer, i.e. if $x_0, x_1$ belong to cluster (or “flow”) $C$, then $A_2(x_0) = A_2(x_1)$ and $A_3(x_0) = A_3(x_1)$. Note that such clusters correspond to paths from the second to the third layer in figure 2. Then, we observed the distribution of all (second row) and wrongly classified (first row) instances among the clusters with regard to two measures: purity, i.e. the proportion of instances of the most frequently predicted class in the cluster, and strength, i.e. the cluster size. We can observe that there is a number of instances belonging to clusters with high purity and high strength, which are almost always correctly classified from epoch 150 onwards, while wrongly classified instances usually belong to small clusters or to clusters with low purity.

Fig. 3: Distribution of wrongly classified instances (first row) and of all instances (second row) among the clusters (“flows”) defined by the activation patterns of the second and third layer. Clusters correspond to the edges in the APD between the activation patterns of the second and third layer, respectively. On the x-axis we have the strength of the clusters, with log-bins, i.e. the number of instances belonging to each cluster. On the y-axis we have the purity of each cluster, i.e. the proportion of instances belonging to the most frequently predicted label in the cluster. [Figure: heatmaps of binned purity versus binned strength at epochs 10, 50, 150 and 300, for the error and full-instance distributions, with log-scaled counts.]

4 Concluding remarks

In the previous section we showed some of the possible observations resulting from the analysis of how data flows through the APD during the training process. Additionally, we introduced a novel visualization tool to plot the APD for a given set of input instances. We are able to cluster data based on how the neural network performs the task, paving the way for further experiments aimed both at studying how the characteristics of the input data influence the learning process and at providing an interpretation for the function learned by a neural network. As an example, the APD can provide a way to quickly assess when a trained DNN is straying from what it was trained on, potentially providing early warnings “in the field”, when it behaves in ways that were not expected or foreseen.

Among the possible future research avenues, we want to investigate topological measures to quantify the information contained in the APD and to study the influence of hyperparameters, such as the chosen architecture or optimization algorithm, on the shape of the diagram. Lastly, in our experiments we used all the neurons in each layer of the APD, but additional research may introduce new ways to identify only the relevant parts of activation patterns. Furthermore, we here introduced a visualization of the APD based on the Plotly library [12], which might represent a new tool for researchers or users who want to understand the inner functioning of a DNN.
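As a complement to the flow-based analysis of Section 3, a minimal, hypothetical sketch of the clustering step and of the purity and strength measures could look as follows; it reuses `X`, `rng` and `activation_patterns` from the first sketch and assumes ground-truth labels `y_true`, which are not part of the original setup.

```python
from collections import defaultdict

y_true = rng.integers(0, 3, size=len(X))  # hypothetical ground-truth labels

# Group instances into "flows": same activation pattern on both the 2nd and 3rd layer.
flows = defaultdict(list)
for x0, y in zip(X, y_true):
    pats, y_hat = activation_patterns(x0)
    flows[(pats[1], pats[2])].append((y_hat, int(y)))

for key, members in flows.items():
    preds = [y_hat for y_hat, _ in members]
    strength = len(members)                            # cluster size
    majority = max(set(preds), key=preds.count)        # most frequently predicted label
    purity = preds.count(majority) / strength          # fraction of the majority prediction
    errors = sum(y_hat != y for y_hat, y in members)   # wrongly classified instances
    print(f"strength={strength:3d}  purity={purity:.2f}  errors={errors}")
```

Under this reading, instances falling into flows with high strength and high purity correspond to the densely shared, “easier” linear regions discussed in Section 3, while errors concentrate in small or low-purity flows.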
References

1. Craighero, F., Angaroni, F., Graudenzi, A., Stella, F., Antoniotti, M.: Investigating the Compositional Structure of Deep Neural Networks. In: Proceedings of the Sixth International Conference on Machine Learning, Optimization, and Data Science, LOD 2020. Siena, Italy (2020). (preprint: https://arxiv.org/abs/2002.06967)
2. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Burstein, J., Doran, C., Solorio, T. (eds.) Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers). pp. 4171–4186. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/n19-1423
3. Gilpin, L.H., Bau, D., Yuan, B.Z., Bajwa, A., Specter, M., Kagal, L.: Explaining Explanations: An Overview of Interpretability of Machine Learning. In: 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA). pp. 80–89 (Oct 2018). https://doi.org/10.1109/DSAA.2018.00018
4. Glorot, X., Bordes, A., Bengio, Y.: Deep Sparse Rectifier Neural Networks. In: AISTATS (2011)
5. Hanin, B., Rolnick, D.: Complexity of Linear Regions in Deep Networks. In: International Conference on Machine Learning. pp. 2596–2604 (2019)
6. Hanin, B., Rolnick, D.: Deep ReLU networks have surprisingly few activation patterns. In: Wallach, H.M., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E.B., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada. pp. 359–368 (2019)
7. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Jun 2016)
8. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Commun. ACM 60(6), 84–90 (May 2017). https://doi.org/10.1145/3065386
9. LeCun, Y., Cortes, C., Burges, C.: MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist 2 (2010)
10. Montavon, G., Samek, W., Müller, K.R.: Methods for interpreting and understanding deep neural networks. Digital Signal Processing 73, 1–15 (Feb 2018). https://doi.org/10.1016/j.dsp.2017.10.011
11. Montúfar, G.F., Pascanu, R., Cho, K., Bengio, Y.: On the number of linear regions of deep neural networks. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada. pp. 2924–2932 (2014)
12. Plotly Technologies Inc.: Collaborative data science. https://plot.ly (2015)
13. Raghu, M., Poole, B., Kleinberg, J.M., Ganguli, S., Sohl-Dickstein, J.: On the Expressive Power of Deep Neural Networks. In: Precup, D., Teh, Y.W. (eds.) Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017. Proceedings of Machine Learning Research, vol. 70, pp. 2847–2854. PMLR (2017)
14. Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T.P., Leach, M., Kavukcuoglu, K., Graepel, T., Hassabis, D.: Mastering the game of Go with deep neural networks and tree search. Nature 529(7587), 484–489 (2016). https://doi.org/10.1038/nature16961
15. Simonyan, K., Zisserman, A.: Very Deep Convolutional Networks for Large-Scale Image Recognition. In: Bengio, Y., LeCun, Y. (eds.) 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings (2015)
16. Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I.J., Fergus, R.: Intriguing properties of neural networks. In: Bengio, Y., LeCun, Y. (eds.) 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings (2014)
17. Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O.: Understanding deep learning requires rethinking generalization. In: 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net (2017)
18. Zhang, X., Wu, D.: Empirical Studies on the Properties of Linear Regions in Deep Neural Networks. In: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net (2020)