    Understanding Deep Learning with Activation
                 Pattern Diagrams

Francesco Craighero1 , Fabrizio Angaroni1 , Alex Graudenzi2,†,∗ , Fabio Stella1,† ,
                           and Marco Antoniotti1,†
              1
               Department of Informatics, Systems and Communication,
                       University of Milan-Bicocca, Milan, Italy
                 2
                   Institute of Molecular Bioimaging and Physiology,
            Consiglio Nazionale delle Ricerche (IBFM-CNR), Segrate, Italy
                 ∗
                   corresponding author: alex.graudenzi@unimib.it
                                  † co-senior authors



        Abstract. The growing demand for machine learning tools to solve hard
        tasks, from natural language processing to image understanding, has
        recently shifted attention to understanding, and possibly explaining,
        the behaviour of deep learning. Deep neural networks today represent
        the state of the art in many applications that have been shown to be
        solvable by data-driven approaches. However, they are also well known
        for their complexity, which hinders the interpretation of their
        functioning. To address this issue, researchers have lately focused
        either on understanding the optimization algorithms or on extracting
        information from a trained model; in this context, we propose the
        Activation Pattern Diagram (APD) as a new tool to analyse neural
        networks by focusing mainly on the input data. The APD is a graphical
        representation of how a dataset is learned by a neural network with
        piecewise linear activation functions, such as the ReLU activation.
        By analysing the evolution of the diagram during the training
        procedure, the APD sheds light on the learning process and on how data
        influences it. Additionally, we introduce a way to plot the APD that
        helps the visualization and interpretation of the diagram.

        Keywords: Activation Patterns · Piecewise Linear Functions · Neural
        Networks · Visualization.


1     Introduction

Deep neural networks (DNNs) have achieved remarkably good results in a broad
range of tasks, including Computer Vision [7,8,15], Natural Language Processing
[2] and game playing [14]. Nevertheless, due to the complexity of these models,
many phenomena are still only partially understood, such as their ability to
generalize well with over-parameterized models [17] or their fragility to adver-
sarial attacks [16]. Moreover, the ever-growing adoption of black-box models
    Copyright © 2020 for this paper by its authors. Use permitted under Creative Com-
    mons License Attribution 4.0 International (CC BY 4.0).
has fueled the need for explainable systems, in order to gain the trust of users
and to improve confidence in safety-critical applications [3].
    In order to explain neural networks, a number of techniques provide justifica-
tions for their predictions [10], such as sensitivity analysis [15]. On the other hand,
other methods have been proposed to investigate the properties of DNNs, e.g.,
to estimate the expressiveness of the model [5,6,13], to analyse the behaviour of
DNNs under different optimization techniques [18], or to characterize input data
complexity [1].
    The inspection of a deep neural network can be simplified by employing
piecewise linear activation functions, such as the ReLU activation [4]. These
activations partition the input space into linear regions in order to learn complex
functions [11]; thus, properties of these regions, such as their number or size, can
be exploited to better understand the learned function [1, 5, 6, 13, 18].
    In [1] we defined a novel data structure, the Activation Pattern Diagram
(APD), that can be used to understand and visualize how data is transformed by
a neural network with piecewise linear activations. Additionally, we introduced
a method to estimate the input data complexity of a dataset, given the function
learned by a DNN with ReLU activations. In more detail, we showed that the
distribution of the input instances among the linear regions, summarised by the
APD, can be used to estimate the confidence of the model in predicting the label
of a given instance. Briefly, linear regions identify the transformation applied
by the neural network; if many instances share the same linear region, then we
expect them to be "more common", and easier, than an instance that has a
linear region of its own. In fact, linear regions are denser around decision
boundaries [18].
    In order to further explore the APD properties, we aim at investigating its
evolution during the training process. To this end, in the following we will:
 – introduce a proof-of-concept for a novel tool to visualize the APD on a
   selected subset of instances, providing a new strategy to interpret DNNs;
 – show preliminary results of the evolution of the APD during learning.


2    From Activation Patterns to the APD
Let $N_\theta(x_0)$ be a Deep Neural Network with input $x_0 \in \mathbb{R}^{n_0}$ and trainable parameters $\theta$. A layer $h_l$ of size $n_l$, for $l \in \{1, \dots, L\}$, is defined by neurons $h_{l,i} = g_{l,i} \circ f_{l,i}$, for $i \in \{1, \dots, n_l\}$, where $f_{l,i}$ is a linear preactivation function and $g_{l,i}$ a nonlinear activation function.
    Let $x_l$ be the output of the $l$-th layer, with $x_0$ denoting the input data to the network. Then, we define $f_{l,i}(x_{l-1}) = W_{l,i}\, x_{l-1} + b_{l,i}$, where both $W_{l,i} \in \mathbb{R}^{n_{l-1}}$ and $b_{l,i} \in \mathbb{R}$ belong to the trainable parameters $\theta$. Regarding activation functions, we will focus on the ReLU activation, i.e., $g_{l,i}(x) = \max\{0, x\}$.
    Finally, we can represent the DNN $N_\theta$ as a function $N_\theta : \mathbb{R}^{n_0} \to \mathbb{R}^{n_{\mathrm{out}}}$ that can be decomposed as

$$N_\theta(x) = (f_{\mathrm{out}} \circ h_L \circ \cdots \circ h_1)(x), \qquad (1)$$

where $f_{\mathrm{out}}$ is the output layer (e.g., softmax, sigmoid, ...).
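
To make the notation concrete, the following is a minimal NumPy sketch of the decomposition in Eq. (1) for a fully connected ReLU network; the paper provides no code, so the softmax output layer and all shapes here are illustrative assumptions.

```python
import numpy as np

def relu(x):
    # g_{l,i}(x) = max{0, x}, applied elementwise
    return np.maximum(0.0, x)

def forward(x0, weights, biases, W_out, b_out):
    """Sketch of Eq. (1): N_theta(x) = (f_out o h_L o ... o h_1)(x).

    weights[l] has shape (n_{l+1}, n_l) and biases[l] shape (n_{l+1},),
    so each row of weights[l] is the per-neuron vector W_{l,i}.
    """
    x = x0
    for W, b in zip(weights, biases):
        x = relu(W @ x + b)           # layer h_l = g_l o f_l
    logits = W_out @ x + b_out        # output layer f_out (pre-softmax)
    logits -= logits.max()            # numerical stability
    return np.exp(logits) / np.exp(logits).sum()   # softmax
```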
    Moreover, given a dataset $D \subseteq \mathbb{R}^{n_0}$, we define the activation pattern $A_l(x_0)$ of layer $l$ given input $x_0 \in D$ as the following (binary) vector:

$$A_l(x_0) = [\, a_i \mid a_i = 1 \text{ if } h_{l,i}(x_{l-1}) > 0 \text{ else } a_i = 0,\ \forall i = 1, \dots, n_l \,]. \qquad (2)$$

In order to distinguish activation patterns by the layer to which they belong, let
us adjust the notation as follows:

$$A^*_l(x_0) = (l, A_l(x_0)), \quad \forall x_0 \in D,\ l \in \{1, \dots, L\}. \qquad (3)$$

Then, let us define the set of activation patterns of layer $l$ over all instances in $D$
as:

$$A^*_l(D) = \{\, A^*_l(x_0) \mid x_0 \in D \,\}, \quad l \in \{1, \dots, L\},$$

where $|A^*_l(D)|$ denotes its cardinality.
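
As an illustration, a possible way to compute the activation patterns of Eqs. (2)-(3) and the per-layer counts $|A^*_l(D)|$ is sketched below; this is our own NumPy sketch, not code from the paper.

```python
import numpy as np

def activation_patterns(x0, weights, biases):
    """Return [A*_1(x0), ..., A*_L(x0)], each as a pair (l, A_l(x0))."""
    patterns, x = [], x0
    for l, (W, b) in enumerate(zip(weights, biases), start=1):
        pre = W @ x + b                          # preactivations f_{l,i}(x_{l-1})
        A_l = tuple(int(p > 0) for p in pre)     # Eq. (2): a_i = 1 iff output > 0
        patterns.append((l, A_l))                # Eq. (3): A*_l = (l, A_l)
        x = np.maximum(0.0, pre)                 # ReLU output feeds layer l+1
    return patterns

def unique_pattern_counts(D, weights, biases):
    """|A*_l(D)| for every layer l, with D an iterable of input vectors."""
    layer_sets = {}
    for x0 in D:
        for l, A_l in activation_patterns(x0, weights, biases):
            layer_sets.setdefault(l, set()).add(A_l)
    return {l: len(s) for l, s in sorted(layer_sets.items())}
```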
    Lastly, the activation pattern diagram (APD) of dataset $D$ is a directed
acyclic graph $APD_{N_\theta}(D) = (V, E)$, where:

 – $V$ is the set of vertices defined by the activation patterns of all the layers,
   i.e.:

$$V = \bigcup_{l=1}^{L} A^*_l(D);$$

 – $E$ is the set of edges defined by the activations of each input instance $x_0 \in D$,
   i.e., $(A^*_{l-1}(x_0), A^*_l(x_0)) \in E$ for $l \in \{2, \dots, L\}$.

Note that the APD has the same depth as the network. In the following we
will consider an extended version of the APD, in which we add a node for each
predicted label and, for each input instance, an edge $(A^*_L(x_0), N_\theta(x_0))$, where
$N_\theta(x_0)$ is the predicted label for $x_0$.
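
Under the definitions above, the extended APD can be built by following each instance through its per-layer patterns. A minimal sketch, assuming the hypothetical `activation_patterns` helper from the previous snippet, a `predict` function returning $N_\theta(x_0)$, and the `networkx` library (any DAG representation would do):

```python
import networkx as nx

def build_apd(D, weights, biases, predict):
    """Extended APD: one node per A*_l(x0), one node per predicted label,
    and an edge for every consecutive pair along each instance's path."""
    G = nx.DiGraph()
    for x0 in D:
        path = activation_patterns(x0, weights, biases)  # [(1, A_1), ..., (L, A_L)]
        path.append(("label", predict(x0)))              # node for N_theta(x0)
        for u, v in zip(path, path[1:]):
            count = G.get_edge_data(u, v, default={}).get("count", 0)
            G.add_edge(u, v, count=count + 1)            # instances per edge
    return G
```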

Example 1. Let us consider a network $N_\theta$ with $L = 2$ and $n_1 = n_2 = 2$. Given
a dataset with one instance $x_0$, we may have $A^*_1(x_0) = (1, [0, 0])$, $A^*_2(x_0) = (2, [1, 0])$ and $N_\theta(x_0) = y_0$, i.e., $y_0$ is the predicted label for $x_0$. Then the APD is
defined as:

$$V = \{(1, [0, 0]),\ (2, [1, 0]),\ y_0\}, \quad E = \{\big((1, [0, 0]), (2, [1, 0])\big),\ \big((2, [1, 0]), y_0\big)\}.$$


3    APD evolution during training
In this section we show results obtained with a neural network with $L = 3$
layers of 40 neurons each, trained on the MNIST dataset [9] with SGD and a
fixed learning rate of 0.001.
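
For reference, a minimal sketch of this experimental setup, assuming a standard PyTorch implementation (which the paper does not specify); data loading and the training loop are omitted:

```python
import torch
import torch.nn as nn

# Three hidden ReLU layers of 40 neurons on flattened 28x28 MNIST images,
# trained with SGD at a fixed learning rate of 0.001.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 40), nn.ReLU(),
    nn.Linear(40, 40), nn.ReLU(),
    nn.Linear(40, 40), nn.ReLU(),
    nn.Linear(40, 10),               # output layer f_out: 10 digit classes
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
loss_fn = nn.CrossEntropyLoss()      # applies softmax internally
```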
    Figure 1 shows the loss during training (with a 0.1 train/validation split
of the 60,000 total instances) on the left and, on the right, the evolution of the
number of unique activation patterns, i.e., $|A^*_l(D)|$ for $l \in \{1, 2, 3\}$, where $D$
is the training set.
[Figure 1 here]

Fig. 1: (Left) Train (orange) and validation (blue) loss; the best validation loss
is achieved at epoch 309. (Right) Number of unique activation patterns per layer,
i.e., $|A^*_l(D)|$, for the network at a given epoch. All layers steadily increase their
number of activation patterns during training. Furthermore, the first layer (blue)
has four times as many activation patterns as the second layer, while the third
layer (green) never exceeds 4,000 activation patterns over the 500 epochs.


We can see that the number of unique patterns at each epoch decreases with the
layer's depth; moreover, the number of activation patterns of each layer always
stays far below the theoretical upper bound of $2^{40}$ possible patterns (see [6] for
an explanation of this phenomenon) and below the 54,000 training instances;
thus, activation patterns are shared between instances.
    In Figure 2 we plot the APDs obtained by performing predictions on the
same 500 instances of label "1" with the models learned at epochs 10, 50, 150
and 300. The plots are Sankey diagrams drawn with the Plotly library [12], where:

 – the blue rectangles, from left to right, represent the activation patterns of the
   layers and the predicted labels, with height and color intensity proportional
   to the number of instances activating that pattern or receiving that label.
   As an example, in each APD the upper-right rectangle represents all correctly
   predicted labels;
 – the color of each edge corresponds to the proportion of wrongly classified
   instances belonging to that edge, and its size is proportional to the number
   of instances following it.
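
A minimal sketch of how such a Sankey diagram can be produced with Plotly, assuming the hypothetical APD graph `G` built as in the snippet of Section 2 (error-based edge colouring is omitted for brevity):

```python
import plotly.graph_objects as go

def plot_apd(G):
    """Render an APD as a Plotly Sankey diagram."""
    nodes = list(G.nodes)
    index = {v: i for i, v in enumerate(nodes)}      # Sankey wants integer ids
    edges = list(G.edges(data=True))
    fig = go.Figure(go.Sankey(
        node=dict(label=[str(v) for v in nodes], color="blue"),
        link=dict(
            source=[index[u] for u, _, _ in edges],
            target=[index[v] for _, v, _ in edges],
            value=[d["count"] for _, _, d in edges],  # edge width = #instances
        ),
    ))
    fig.show()
```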

From Figure 2 we can observe that activation patterns are shared more in deeper
layers, as also emerges from Figure 1. As a consequence, from epoch 150 onward
there are clear flows of instances that share the same activation patterns. Lastly,
wrongly classified instances, within the chosen subset, mostly belong to activation
patterns that are shared by few instances.
    The trend observed in the previous figures is confirmed by Figure 3. We first
clustered instances based on the patterns of both the second and the third (last)
layer, i.e., if $x_0, x_1$ belong to cluster (or "flow") $C$, then $A_2(x_0) = A_2(x_1)$ and
$A_3(x_0) = A_3(x_1)$. Note that such clusters correspond to paths from the second
to the third layer in Figure 2.
[Figure 2 here: four APD Sankey diagrams, panels (a) Epoch 10, (b) Epoch 50,
(c) Epoch 150, (d) Epoch 300; edge color scale: % errors.]

Fig. 2: Four different APDs for the same 500 instances of label "1" and the
models learned at epochs 10, 50, 150 and 300. Each APD is composed of four
levels of blue nodes; the nodes of each level identify, from left to right, the
activation patterns of the three layers of the network and the predicted labels.
Node height and color intensity are proportional to the number of instances
each node represents. The first layer, as expected from Figure 1, always has
more patterns, while in the level of predicted labels there is a tall node on top
representing the most frequently predicted label, i.e., "1". Edge size is
proportional to the number of instances an edge represents, and edge color
depends on the proportion of errors.
[Figure 3 here: heatmaps for epochs 10, 50, 150 and 300; first row: errors
distribution, second row: instance distribution; x-axis: binned strength (log
bins), y-axis: binned purity.]


Fig. 3: Distribution of wrongly classified instances (first row) and of all instances
(second row) among the clusters ("flows") defined by the activation patterns of
the second and third layers. Clusters correspond to the edges of the APD between
the activation patterns of the second and third layer, respectively. The x-axis
shows the strength of each cluster, with log bins, i.e., the number of instances
belonging to that cluster. The y-axis shows the purity of each cluster, i.e., the
proportion of instances belonging to the most frequently predicted label in the
cluster.


Then, we observed the distribution of all (second row) or wrongly classified
(first row) instances among the clusters with regard to two measures: purity,
i.e., the proportion of instances of the most frequently predicted class in the
cluster, and strength, i.e., the cluster size. We can observe that a number of
instances belong to clusters with both high purity and high strength, and these
are almost always correctly classified from epoch 150 onward, while wrongly
classified instances usually belong to small clusters or to clusters with low purity.
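
A possible sketch of this clustering and of the two measures, reusing the hypothetical `activation_patterns` and `predict` helpers introduced above (our own illustration, not the paper's code):

```python
from collections import defaultdict

def flow_statistics(D, weights, biases, predict):
    """Group instances into flows by (A_2(x0), A_3(x0)) and compute, for
    each flow, its strength (size) and purity (share of the most
    frequently predicted label)."""
    flows = defaultdict(list)
    for x0 in D:
        pats = dict(activation_patterns(x0, weights, biases))  # {l: A_l(x0)}
        flows[(pats[2], pats[3])].append(predict(x0))
    stats = {}
    for flow, labels in flows.items():
        strength = len(labels)                        # cluster size
        most_common = max(set(labels), key=labels.count)
        stats[flow] = (strength, labels.count(most_common) / strength)
    return stats
```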


4    Concluding remarks

In the previous section we showed some of the possible observations resulting
from the analysis of how data flows through the APD during the training process.
Additionally, we introduced a novel visualization tool to plot the APD for a given
set of input instances.
    We are able to cluster data based on how the neural network performs its
task, paving the way for further experiments aimed both at studying how the
characteristics of the input data influence the learning process and at providing
an interpretation of the function learned by a neural network. As an example,
the APD can provide a way to quickly assess when a trained DNN is straying
from what it was trained on, potentially providing early warnings "in the field"
when it behaves in ways that were not expected or foreseen.
    Among the possible future research avenues, we want to investigate topological
measures to quantify the information contained in the APD and to study
experimentally how hyperparameters, such as the chosen architecture or
optimization algorithm, influence the shape of the diagram. Lastly, in our
experiments we used all the neurons of each layer to build the APD, but
additional research may introduce new ways to identify only the relevant parts
of activation patterns.
    Furthermore, we here introduced a visualization of the APD based on the
Plotly library [12], which might represent a new tool for researchers or users
who want to understand the inner functioning of a DNN.

References
 1. Craighero, F., Angaroni, F., Graudenzi, A., Stella, F., Antoniotti, M.: Investigat-
    ing the Compositional Structure Of Deep Neural Networks. In: Proceedings of the
    Sixth International Conference on Machine Learning, Optimization, and Data Sci-
    ence. LOD 2020. Siena, Italy (2020), (preprint: https://arxiv.org/abs/2002.06967)
 2. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep
    bidirectional transformers for language understanding. In: Burstein, J., Doran,
    C., Solorio, T. (eds.) Proceedings of the 2019 Conference of the North Ameri-
    can Chapter of the Association for Computational Linguistics: Human Language
    Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Vol-
    ume 1 (Long and Short Papers). pp. 4171–4186. Association for Computational
    Linguistics (2019). https://doi.org/10.18653/v1/n19-1423
 3. Gilpin, L.H., Bau, D., Yuan, B.Z., Bajwa, A., Specter, M., Kagal, L.: Explaining
    Explanations: An Overview of Interpretability of Machine Learning. In: 2018 IEEE
    5th International Conference on Data Science and Advanced Analytics (DSAA).
    pp. 80–89 (Oct 2018). https://doi.org/10.1109/DSAA.2018.00018
 4. Glorot, X., Bordes, A., Bengio, Y.: Deep Sparse Rectifier Neural Networks. In:
    AISTATS (2011)
 5. Hanin, B., Rolnick, D.: Complexity of Linear Regions in Deep Networks. In: Inter-
    national Conference on Machine Learning. pp. 2596–2604 (2019)
 6. Hanin, B., Rolnick, D.: Deep ReLU networks have surprisingly few activation pat-
    terns. In: Wallach, H.M., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox,
    E.B., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32:
    Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019,
    8-14 December 2019, Vancouver, BC, Canada. pp. 359–368 (2019)
 7. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In:
    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
    (CVPR) (Jun 2016)
 8. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep
    convolutional neural networks. Commun. ACM 60(6), 84–90 (May 2017).
    https://doi.org/10.1145/3065386
 9. LeCun, Y., Cortes, C., Burges, C.: MNIST handwritten digit database. ATT Labs
    [Online]. Available: http://yann.lecun.com/exdb/mnist 2 (2010)
10. Montavon, G., Samek, W., Müller, K.R.: Methods for interpreting and under-
    standing deep neural networks. Digital Signal Processing 73, 1–15 (Feb 2018).
    https://doi.org/10.1016/j.dsp.2017.10.011
11. Montúfar, G.F., Pascanu, R., Cho, K., Bengio, Y.: On the number of linear regions
    of deep neural networks. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence,
    N.D., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems
    27: Annual Conference on Neural Information Processing Systems 2014, December
    8-13 2014, Montreal, Quebec, Canada. pp. 2924–2932 (2014)
12. Plotly Technologies Inc.: Collaborative data science. https://plot.ly (2015)
13. Raghu, M., Poole, B., Kleinberg, J.M., Ganguli, S., Sohl-Dickstein, J.: On the
    Expressive Power of Deep Neural Networks. In: Precup, D., Teh, Y.W. (eds.) Pro-
    ceedings of the 34th International Conference on Machine Learning, ICML 2017,
    Sydney, NSW, Australia, 6-11 August 2017. Proceedings of Machine Learning Re-
    search, vol. 70, pp. 2847–2854. PMLR (2017)
14. Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., van den Driessche,
    G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Diele-
    man, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T.P.,
    Leach, M., Kavukcuoglu, K., Graepel, T., Hassabis, D.: Mastering the game of
    Go with deep neural networks and tree search. Nature 529(7587), 484–489 (2016).
    https://doi.org/10.1038/nature16961
15. Simonyan, K., Zisserman, A.: Very Deep Convolutional Networks for Large-Scale
    Image Recognition. In: Bengio, Y., LeCun, Y. (eds.) 3rd International Conference
    on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015,
    Conference Track Proceedings (2015)
16. Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I.J., Fer-
    gus, R.: Intriguing properties of neural networks. In: Bengio, Y., LeCun, Y. (eds.)
    2nd International Conference on Learning Representations, ICLR 2014, Banff, AB,
    Canada, April 14-16, 2014, Conference Track Proceedings (2014)
17. Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O.: Understanding deep learn-
    ing requires rethinking generalization. In: 5th International Conference on Learning
    Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track
    Proceedings. OpenReview.net (2017)
18. Zhang, X., Wu, D.: Empirical Studies on the Properties of Linear Regions in Deep
    Neural Networks. In: 8th International Conference on Learning Representations,
    ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net (2020)