<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Ital-IA 2023: 3rd National Conference on Artificial Intelligence</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>A round-trip journey in pruned artificial neural networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrea Bragagnolo</string-name>
          <email>andrea.bragagnolo@synesthesia.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Enzo Tartaglione</string-name>
          <email>enzo.tartaglione@telecom-paris.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gianluca Dalmasso</string-name>
          <email>gianluca.dalmasso@unito.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Grangetto</string-name>
          <email>marco.grangetto@unito.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Deep Learning, Pruning, Eficiency</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Dept., University of Turin</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>LTCI, Télécom Paris, Institut Polytechnique de Paris</institution>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Synesthesia s.r.l</institution>
          ,
          <addr-line>Turin</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <abstract>
        <p>In the last decade, deep learning models have competed for performance at the price of tremendous computational costs. This critical aspect has recently attracted attention for both the training and the inference phases. The complexity of inference is obviously orders of magnitude lower than that of training; on the other hand, inference is performed many times, so its cost dominates efficiency on edge or embedded devices. Inference can be made efficient through neural network pruning, which consists of removing parameters and neurons from the model's topology while maintaining the model's accuracy. This results in reduced resource and energy requirements for the models. This paper describes two pruning procedures for lowering the operations required during the inference phase and a method to exploit the resulting sparsity. Since the same procedures cannot be applied directly at training time, we also show that it is possible to borrow similar ideas to reduce the cost of gradient backpropagation by disabling the computation for selected neurons.</p>
      </abstract>
      <kwd-group kwd-group-type="author">
        <kwd>Deep Learning</kwd>
        <kwd>Pruning</kwd>
        <kwd>Efficiency</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Deep neural networks are widely used in various tasks, such as speech recognition and computer vision. However, modern architectures require many parameters to generalize well, resulting in large model sizes, high computational and memory resources, and significant energy consumption during training and inference.</p>
      <p>In this paper, we present our research on neural
network pruning, which involves removing the less essential
elements of the network to reduce the model resource
requirements. Specifically, we explore the design of
pruning procedures (Sec. 2 and Sec. 3), the effect of pruning on
network features (Sec. 4), and the practical application of
pruned networks to reduce energy consumption (Sec. 5).</p>
      <p>
        We present two pruning techniques capable of
squeezing the model size incrementally during training:
LOBSTER [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], an unstructured approach that uses parameter
sensitivity as a regularizer, and SeReNe [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], a structured
procedure that evaluates the contribution of neurons to
the network’s output. Pruned networks obtained with
these techniques were used to assess the benefits of
pruning at inference time.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. LOBSTER</title>
      <p>
        In this section, we present LOBSTER [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] (LOss-Based
SensiTivity rEgulaRization), an unstructured and gradual
pruning procedure.
      </p>
      <p>LOBSTER uses a sensitivity-based regularization to
promote sparsity in the network topology. Specifically,
we define the sensitivity of a network parameter as the
derivative of the loss function with respect to that
parameter. Parameters with low sensitivity have little
impact on the loss function when perturbed and can be
pruned without compromising performance. LOBSTER
achieves sparsity by gradually shrinking parameters with
low sensitivity using a regularize-and-prune approach.</p>
      <p>
        The sensitivity is defined as
        <disp-formula id="eq1"><tex-math>S(\mathcal{L}, w_{n,i}) = \left| \frac{\partial \mathcal{L}}{\partial w_{n,i}} \right|, \quad (1)</tex-math></disp-formula>
        with ℒ representing the loss function and w_{n,i} a parameter of the network.
      </p>
      <p>[Figure 1: results of the pruning procedure for (a) LeNet-5 trained on MNIST and (b) ResNet-18 trained on ImageNet.]</p>
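      <p>
        To make the regularize-and-prune loop concrete, the following is a minimal PyTorch sketch of a sensitivity-driven update in the spirit of Eq. (1); the function name, the hyper-parameters, and the exact shrinking rule are illustrative assumptions of ours, not the released LOBSTER implementation.
      </p>
      <preformat>
import torch

def lobster_style_step(model, loss, lam=1e-4, threshold=1e-3):
    """One regularize-and-prune step driven by Eq. (1).

    Hypothetical sketch: parameters whose loss-based sensitivity
    |dL/dw| is small are shrunk proportionally to their insensitivity,
    then weights that fell below `threshold` are zeroed (pruned).
    """
    loss.backward()  # populates p.grad with dL/dw for every parameter
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            sensitivity = p.grad.abs()                    # Eq. (1)
            insensitivity = (1.0 - sensitivity).clamp(min=0.0)
            p.add_(-lam * p * insensitivity)              # gradual shrinking
            p.masked_fill_(p.abs() &lt; threshold, 0.0)      # prune tiny weights
      </preformat>
      <p>
        In a real training loop this step would run alongside the usual optimizer update, with a validation check guarding against excessive performance loss before the zeroed parameters are permanently removed.
      </p>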
      <p>
        LOBSTER allows training a network from scratch, thanks to its loss-based sensitivity formulation. Moreover, it avoids additional derivative computations or second-order derivatives, unlike other sensitivity-based approaches. Experiments on multiple architectures and datasets demonstrate that LOBSTER outperforms several competitors in multiple tasks. It achieves competitive compression ratios with minimal computational overhead and without compromising performance. The results of the pruning procedure for LeNet-5 trained on the MNIST dataset and ResNet-18 trained on ImageNet are shown in Figure 1. LOBSTER achieves state-of-the-art sparsification and classification errors for both architectures. Sparse VD [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] slightly outperforms all other methods in the LeNet5-MNIST experiment at higher compression rates.
      </p>
    </sec>
    <sec id="sec-serene">
      <title>3. SeReNe</title>
      <p>
        Although LOBSTER can achieve high sparsity rates, the sparsity is unstructured, meaning that entire neurons may not be removed from the architecture, and the resulting model can only be accelerated using specialized hardware and software. SeReNe [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] solves this issue by producing sparse network topologies with fewer neurons and, therefore, fewer operations during inference. Our approach involves driving all the parameters of a neuron toward zero, allowing us to prune entire neurons from the network. To achieve this, we leverage the concept of neuron's sensitivity, defined as the variation of the network output with respect to the neuron's activity:
        <disp-formula id="eq2"><tex-math>S(\mathbf{y}, \theta_{n,i}) = \frac{1}{C} \sum_{c=1}^{C} \left| \frac{\partial y_c}{\partial \theta_{n,i}} \right|, \quad (2)</tex-math></disp-formula>
        where y represents the network's output and θ_{n,i} the activity of the i-th neuron of the n-th layer.
      </p>
      <p>During training, all the parameters of low-sensitivity neurons are shrunk, making it possible to remove them from the network. When the ℓ2 norm of a neuron's parameters approaches zero, the neuron no longer emits signals (except for the bias) and can be pruned. We propose an iterative two-step procedure to prune parameters belonging to low-sensitivity neurons. We ensure controlled performance loss for the original architecture using a cross-validation strategy.</p>
      <p>Our approach allows us to learn network topologies that are not only sparse, i.e., with few non-zero parameters, but also with fewer neurons. This can speed up the network execution by better using cache locality and memory access patterns. We demonstrate the effectiveness of SeReNe on multiple learning tasks and network architectures, outperforming state-of-the-art references.</p>
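      <p>
        The sketch below illustrates how the neuron sensitivity of Eq. (2) could be computed with standard autograd, and how whole neurons with low sensitivity can be driven to zero; it is a simplified stand-in for the actual SeReNe procedure, and the helper names are ours.
      </p>
      <preformat>
import torch

def neuron_sensitivity(outputs, activations):
    """Eq. (2): average |d y_c / d theta_{n,i}| over the C network outputs.

    `activations` is one layer's post-activation tensor of shape
    (batch, neurons), kept in the autograd graph; `outputs` has shape
    (batch, C). Returns one sensitivity score per neuron.
    """
    grads = []
    for c in range(outputs.shape[1]):
        g, = torch.autograd.grad(outputs[:, c].sum(), activations,
                                 retain_graph=True)
        grads.append(g.abs())
    # average over outputs (Eq. 2), then over the batch
    return torch.stack(grads).mean(dim=0).mean(dim=0)

def zero_low_sensitivity_neurons(layer, sensitivity, tau=1e-3):
    """Blank every ingoing weight of the low-sensitivity neurons; the bias
    is left untouched (see the text). Once a neuron's parameters have
    (near) zero l2 norm, it can be removed from the topology."""
    with torch.no_grad():
        layer.weight[sensitivity &lt; tau, :] = 0.0
      </preformat>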
      <p>Finally, we show that structured sparsity provides benefits when storing the neural network topology and parameters. Table 1 shows the results obtained applying SeReNe to the LeNet-5 architecture trained on MNIST. SeReNe achieves a high compression ratio and pruned neurons, outperforming the considered references. The structured sparsity results in a significant decrease in the uncompressed network storage footprint, with only a slight 0.12% performance drop after compression. We also tested our method on more challenging architectures and datasets: Table 2 shows the results for ResNet-101 trained on ImageNet. The pruning procedure results in around 86% of the parameters being pruned, and the resulting network size is reduced from 156.67 MB to only 27.84 MB.</p>
    </sec>
    <sec id="sec-low-power">
      <title>4. Structured pruning for low-power devices</title>
      <p>
        In this section, we present some empirical results that demonstrate how pruning (especially structured pruning) can produce a network model that requires fewer resources to perform inference. To achieve this, we built the simplify library [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], a PyTorch-compatible tool that automates the process of optimizing the inference code for pruned neural networks by removing the zeroed neurons from the architecture. The resulting models do not require any particular software or hardware to speed up their inference.
      </p>
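      <p>
        As a usage illustration, the snippet below first zeroes whole convolutional channels with PyTorch's structured pruning utilities and then invokes simplify to rebuild a smaller dense model. The call pattern follows the library's documentation [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] as we recall it; the exact API should be checked against the released package.
      </p>
      <preformat>
import torch
import torch.nn.utils.prune as prune
from torchvision import models
from simplify import simplify  # library from [9]

model = models.resnet18()
model.eval()

# Zero 50% of the output channels (i.e., neurons) of every convolution,
# selected by their l2 norm, and make the zeros permanent.
for module in model.modules():
    if isinstance(module, torch.nn.Conv2d):
        prune.ln_structured(module, name="weight", amount=0.5, n=2, dim=0)
        prune.remove(module, "weight")

# Propagate the zeroed neurons out of the topology: the simplified model
# computes the same function with fewer channels per layer.
dummy_input = torch.zeros(1, 3, 224, 224)
simplify(model, dummy_input)
      </preformat>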
      <p>
        We were able to perform benchmarks for both mobile devices [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and FPGA platforms [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], which demonstrates the effectiveness of our approach. Specifically, we evaluated the performance of the pruned neural networks on a range of devices, varying in terms of processing power and memory capacity. Our results show that the combination of pruning and Simplify optimization outperforms the other techniques in terms of both inference speed and memory footprint. Table 3 and Table 4 show the results for the pruned and simplified networks on mobile devices and FPGAs, respectively.
      </p>
      <p>
        [Tables 3 and 4: experimental results for different network architectures and pruning strategies. Left: percentage of pruned parameters, size of the simplified network topology, and size of the compressed bitstream. Right: inference time on different embedded devices: Raspberry Pi 3B (RPi 3B), Huawei P20 (P20), Xiaomi MI 9 (MI9), and Samsung Galaxy S6 Lite (S6L). Source: [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].]
      </p>
      <p>Overall, these results demonstrate the feasibility of deploying pruned neural networks on various resource-constrained devices, opening up new opportunities for bringing deep learning to the edge.</p>
    </sec>
    <sec id="sec-neq">
      <title>5. Neurons at Equilibrium (NEq)</title>
      <p>
        All the works presented up to this point focused on reducing the neural network's inference time. In this section, instead, we present NEq [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], an approach that enables us to shrink the cost of training by reducing the number of neurons updated at each training step. For each neuron, NEq tracks
        <disp-formula id="eq3"><tex-math>\phi_i^t = \sum_{x \in \Xi} \sum_{c=1}^{C} \hat{y}_{i,x,c}^{\,t} \cdot \hat{y}_{i,x,c}^{\,t-1},</tex-math></disp-formula>
        which is the cosine similarity between all the outputs of the i-th neuron at time t and at time t−1 for the whole validation set Ξ. We can say that the i-th neuron is at equilibrium when it satisfies |Δφ_i^t| &lt; ε for some ε ≥ 0; the gradient computation for neurons at equilibrium is disabled.
      </p>
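      <p>
        A minimal sketch of the equilibrium test follows, assuming the per-neuron outputs over the validation set have been flattened into matrices; the function names are ours, and the mechanics of masking the gradients of frozen neurons are omitted.
      </p>
      <preformat>
import torch
import torch.nn.functional as F

def phi(y_t, y_prev):
    """Cosine similarity between each neuron's validation outputs at
    epoch t and epoch t-1. y_t, y_prev: (num_val_samples, num_neurons);
    returns one similarity value per neuron."""
    return (F.normalize(y_t, dim=0) * F.normalize(y_prev, dim=0)).sum(dim=0)

def at_equilibrium(phi_t, phi_prev, eps=1e-3):
    """A neuron is at equilibrium when its similarity stops changing
    (|delta phi| below eps); its gradients can then be skipped, e.g. by
    zeroing the matching rows of the weight gradient in a backward hook."""
    return (phi_t - phi_prev).abs() &lt; eps
      </preformat>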
      <p>We evaluated the saving in FLOPs and the network's generalization capability at the end of training. Our results show that NEq consistently reduces the number of FLOPs with minimal or no performance drop. While the amount of saved computation is similar for the stochastic approach with fixed probabilities in all the considered scenarios, the loss in performance varies depending on the architecture and dataset. In contrast, NEq adapts to the particular setup and saves the largest FLOPs for a given performance, with a lower performance loss even when the stochastic approach is tested with the same FLOPs saving.</p>
    </sec>
    <sec id="sec-3">
      <title>6. Conclusion</title>
      <p>In this paper, we shared the research experiences we developed in the context of compressing large neural models. Our story began with classical unstructured pruning of model parameters, e.g. connections between neurons, where the target is the highest sparsification with the lowest performance impairment. This approach, while very sound from a theoretical point of view, does not guarantee significant efficiency gains in the inference phase, when the model is deployed on actual devices. Therefore, we described structured pruning alternatives that aim at removing whole neurons, thus uncovering the real pruning potential in saving memory and reducing latency. Finally, we showed that pruning can also be exploited at training time to cut the cost of backward propagation. In particular, we introduced NEq, a technique to disable the computation of gradients of neurons that have reached equilibrium: this amounts to pruning the backpropagation graph and decreasing the number of operations during training. This technique can reduce the cost of training modern neural networks.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Tartaglione</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bragagnolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fiandrotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Grangetto</surname>
          </string-name>
          ,
          <article-title>Loss-based sensitivity regularization: Towards deep sparse neural networks</article-title>
          ,
          <source>Neural Networks</source>
          <volume>146</volume>
          (
          <year>2022</year>
          )
          <fpage>230</fpage>
          -
          <lpage>237</lpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S0893608021004706. doi:10.1016/j.neunet.2021.11.029.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>E.</given-names>
            <surname>Tartaglione</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bragagnolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Odierna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fiandrotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Grangetto</surname>
          </string-name>
          , Serene:
          <article-title>Sensitivity-based regularization of neurons for structured sparsity in neural networks</article-title>
          ,
          <source>IEEE Transactions on Neural Networks and Learning Systems</source>
          <volume>33</volume>
          (
          <year>2022</year>
          )
          <fpage>7237</fpage>
          -
          <lpage>7250</lpage>
          .
          . doi:10.1109/TNNLS.2021.3084527.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bragagnolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tartaglione</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Grangetto</surname>
          </string-name>
          ,
          <article-title>To update or not to update? neurons at equilibrium in deep models</article-title>
          , in: A. H. Oh, A. Agarwal, D. Belgrave, K. Cho (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          ,
          <year>2022</year>
          . URL: https://openreview.net/forum?id=LGDfv0U7MJR.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Molchanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ashukha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Vetrov</surname>
          </string-name>
          ,
          <article-title>Variational dropout sparsifies deep neural networks</article-title>
          ,
          <source>in: Proceedings of the 34th International Conference on Machine Learning-Volume 70, JMLR. org</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>2498</fpage>
          -
          <lpage>2507</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pool</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dally</surname>
          </string-name>
          ,
          <article-title>Learning both weights and connections for efficient neural network</article-title>
          ,
          <source>in: Advances in neural information processing systems</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>1135</fpage>
          -
          <lpage>1143</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K.</given-names>
            <surname>Ullrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Welling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Meeds</surname>
          </string-name>
          ,
          <article-title>Soft weight-sharing for neural network compression</article-title>
          ,
          <source>5th International Conference on Learning Representations, ICLR 2017 - Conference Track Proceedings</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>E.</given-names>
            <surname>Tartaglione</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lepsøy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fiandrotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Francini</surname>
          </string-name>
          ,
          <article-title>Learning sparse neural networks via sensitivity-driven regularization</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>3878</fpage>
          -
          <lpage>3888</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Dynamic network surgery for efficient DNNs</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          (
          <year>2016</year>
          )
          <fpage>1387</fpage>
          -
          <lpage>1395</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bragagnolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Barbano</surname>
          </string-name>
          ,
          <article-title>Simplify: A python library for optimizing pruned neural networks</article-title>
          ,
          <source>SoftwareX</source>
          <volume>17</volume>
          (
          <year>2022</year>
          )
          <fpage>100907</fpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S2352711021001576. doi:10.1016/j.softx.2021.100907.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bragagnolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tartaglione</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fiandrotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Grangetto</surname>
          </string-name>
          ,
          <article-title>On the role of structured pruning for neural network compression</article-title>
          ,
          <source>2021 IEEE International Conference on Image Processing (ICIP), IEEE</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>3527</fpage>
          -
          <lpage>3531</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Flich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Medina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Catalán</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hernández</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bragagnolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Auzanneau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Briand</surname>
          </string-name>
          ,
          <article-title>Efficient inference of image-based neural network models in reconfigurable systems with pruning and quantization</article-title>
          ,
          <source>2022 IEEE International Conference on Image Processing (ICIP)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>2491</fpage>
          -
          <lpage>2495</lpage>
          . doi:10.1109/ICIP46576.2022.9897752.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>