Proceedings of the 11th International Conference on Applied Informatics, Eger, Hungary, January 29–31, 2020, published at http://ceur-ws.org

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Finding Dense Supermasks in Randomly Initialized Neural Networks

Csanád Sándor a,b
a Faculty of Mathematics and Informatics, Babeş-Bolyai University, Cluj-Napoca, Romania, csanad.sandor@cs.ubbcluj.ro
b Robert Bosch SRL, Cluj-Napoca, Romania

Abstract

Recent works on network pruning [16, 12] showed that randomly weighted neural networks contain supermasks (subnetworks) whose performance is comparable with that of similarly sized, trained networks. These results underpin the Lottery Ticket Hypothesis [3]: the effectiveness of deep neural networks relies on lucky initialization. While these works define sparse supermasks, we demonstrate that dense supermasks can also be found by applying structured pruning. We remove components from randomly weighted neural networks – neurons from fully connected layers – such that the loss of the networks decreases continuously. This results in smaller, dense networks whose accuracy is higher than that of their initial version.

Keywords: deep learning, network pruning, supermasks, subnetworks

1. Introduction

The trend to improve the accuracy of modern deep neural networks is to increase the number of parameters and the number of layers [7, 15, 5]. This practice, with the help of bigger datasets, more memory and computing power, increased the accuracy of neural networks on challenges such as the ImageNet Large Scale Visual Recognition Challenge [13]. However, increasing the network size demands gigabytes of memory and billions of floating point multiplications during inference, which are usually not available on resource limited devices. Moreover, different works pointed out that these neural networks are over-parameterized [2, 1]: some of the parameters are redundant, and eliminating them from the network causes no drop in accuracy.

Since there is an increasing need to apply neural networks on resource limited devices (autonomous drones, cars, mobile phones or other embedded devices), pruning methods have been developed whose aim is to eliminate redundant parameters [4, 10, 6, 14, 11]. These techniques estimate the importance of parameters (or groups of parameters) and remove the less important ones such that the loss does not increase. Network pruning can be summarized in the following steps: (1) a randomly initialized network is trained on a dataset using some optimization method (e.g. stochastic gradient descent); (2) importance values are estimated and assigned to the trained weights (e.g. the magnitudes of the parameters are used as importance values); (3) the less important parameters are eliminated from the network; (4) the pruned network is retrained to regain its original accuracy. Steps (2)–(4) can be repeated multiple times to further decrease the network size.
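As a concrete illustration of steps (2) and (3), the following sketch scores the neurons of one fully connected layer by the L2 norm of their weights (the magnitude criterion detailed in Section 2, in its structured, neuron-level form) and removes the lowest-scoring fraction together with the matching inputs of the next layer. This is a minimal NumPy sketch rather than the exact procedure of any of the cited works; the layer sizes and the pruning ratio p are arbitrary example values.

```python
# Minimal sketch of steps (2)-(3): structured magnitude pruning of one
# fully connected layer of a small network.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.1, (300, 784))        # 784 -> 300 layer, one row per neuron
W_next = rng.normal(0.0, 0.1, (100, 300))   # following 300 -> 100 layer

# Step (2): the importance of neuron i is the L2 norm of its incoming weights.
importance = np.linalg.norm(W, axis=1)

# Step (3): drop the p% least important neurons; the corresponding input
# columns of the next layer are removed as well, so the network stays dense.
p = 0.2
keep = np.sort(np.argsort(importance)[int(p * W.shape[0]):])
W_pruned = W[keep, :]
W_next_pruned = W_next[:, keep]
print(W_pruned.shape, W_next_pruned.shape)  # (240, 784) (100, 240)
```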
Pruning results show that smaller networks are able to represent complex distributions, but current optimization methods fail to tune their weights adequately. However, optimizing in a higher dimensional space and eliminating the parameters afterwards can solve the problem.

Contrary to this, a recent work pointed out that smaller, sparse networks can be trained as well as their dense counterparts if their weights are initialized properly [3]. The authors argue that optimization methods have more success on large networks only because the chance of containing luckily initialized subnetworks is higher. In the paper these subnetworks are called lottery tickets. Training these luckily initialized subnetworks leads to the same accuracy as that of their dense counterparts.

Following the work of [3], [16] demonstrates that randomly initialized networks already contain subnetworks – called supermasks – with accuracy far better than chance. The authors highlight that the parameters of these subnetworks are not trained at all: they freeze the network weights at their initial values and find subnetworks with high accuracy. Moreover, [12] presents the edge-popup method that is able to efficiently find subnetworks (in randomly initialized, untrained networks) whose accuracy is comparable with that of their dense, trained counterparts.

Our work is inspired by the results of [16, 12], but instead of examining sparse subnetworks, we demonstrate that randomly initialized, untrained networks contain dense subnetworks with an accuracy far from chance. We experimentally demonstrate that a randomly initialized LeNet-300-100 architecture contains dense subnetworks whose accuracy is around 40% on the MNIST dataset. This value is far better than the 10% accuracy of its unpruned counterpart. Moreover, we show that a randomly initialized Wide-LeNet (a 2-layer fully connected network with increased size) contains dense subnetworks with an accuracy above 60%.

While this work studies fully connected networks on a small dataset, we would like to extend these experiments to convolutional neural networks and larger datasets as well.

2. Pruning methods

To prune neurons from a network, we compare magnitude based pruning with the linear filter ensembles (LFE) method. As a control case, we also apply random pruning, where neurons are removed randomly from the network. In the following we briefly present the magnitude and LFE pruning methods.

Magnitude pruning uses the parameter magnitude as importance value: parameters with small magnitude have less importance and their elimination chance is higher. In the case of structured pruning, the importance of the $i$-th neuron is estimated by calculating its L2 norm:

$$\theta_i = \Big[ \sum_j |w_{i,j}|^2 \Big]^{1/2},$$

where $w_{i,j}$ denotes the $j$-th parameter of the $i$-th neuron and $\theta_i$ is the estimated importance.

Linear filter ensembles [14] estimate the importance of different architectural components by considering the network loss. Based on the filter importance values, $p\%$ of the components are removed from the network. While [14] computes the optimal $p$ by evaluating the pruned network on a validation set, we simply set it to a constant value; therefore, in each pruning iteration $p\%$ of the neurons are removed from the layer.

LFE estimates the importance of the filters (or neurons) by evaluating the network loss using filter ensembles. A filter ensemble is a group of filters (in the case of fully connected layers, a group of neurons) from a given layer. Each ensemble is represented by a binary vector $z_i$ that defines which filters should be used during the network evaluation. To each filter ensemble a score $s_i$ is assigned that is calculated from the network loss: ensembles with a small loss get a high score, while a large loss produces a smaller score. Given the $\{z_i, s_i\}$ pairs, the matrix $Z = [z_1, \ldots, z_N]^T$ and the vector $s = [s_1, \ldots, s_N]^T$ are constructed and the equation $Z \cdot \theta = s$ is solved by minimizing the Euclidean 2-norm

$$\mathcal{L} = \| s - Z \cdot \theta \|.$$

Finally, the solution $\theta$ – one entry per filter of the layer – is used as the vector of importance indicators, where a larger $\theta_i$ means higher importance.
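The following NumPy sketch illustrates this estimation step for one hidden layer; it is a simplified stand-in rather than the exact recipe of [14]. The frozen two-layer network, the data batch, the number of ensembles and the loss-to-score mapping (here simply the maximum loss minus the loss of the ensemble) are placeholder choices for the example.

```python
# Sketch of LFE importance estimation: evaluate random binary neuron masks on a
# frozen random network, turn the losses into scores and solve Z * theta = s
# in the least-squares sense.
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the frozen, randomly initialized network and a data batch.
n_in, n_hidden, n_out, n_batch = 20, 50, 10, 64
W1 = rng.normal(0.0, 0.1, (n_in, n_hidden))
W2 = rng.normal(0.0, 0.1, (n_hidden, n_out))
x = rng.normal(size=(n_batch, n_in))
y = rng.integers(0, n_out, size=n_batch)

def ensemble_loss(z):
    """Cross entropy of the frozen network when the mask z switches hidden units on/off."""
    h = np.maximum(x @ W1, 0.0) * z                   # ReLU hidden layer, masked
    logits = h @ W2
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_p[np.arange(n_batch), y].mean()

# Build the ensemble matrix Z and the score vector s.
n_ensembles, keep_prob = 200, 0.5
Z = (rng.random((n_ensembles, n_hidden)) < keep_prob).astype(float)
losses = np.array([ensemble_loss(z) for z in Z])
s = losses.max() - losses                             # small loss -> high score (assumed mapping)

# Least-squares solution of Z * theta = s; theta[i] is the importance of unit i.
theta, *_ = np.linalg.lstsq(Z, s, rcond=None)

# Remove the p% least important units (structured pruning keeps the layer dense).
p = 0.1
pruned_units = np.argsort(theta)[: int(p * n_hidden)]
```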
3. Experiments

We experiment with LFE, magnitude and random pruning on the LeNet-300-100 [8] architecture and the MNIST dataset [9].

Figure 1: The accuracy of LeNet-300-100 as more and more neurons are removed from the network. The curves present the results of iterative (LFE, magnitude, random) and one-shot (LFE one-shot) pruning, using the linear filter ensembles, magnitude and random pruning methods.

Experimental setup

MNIST is a dataset that contains grayscale images of handwritten digits ranging from 0 to 9. The dataset contains 60 000 samples in total: 50 000 images belong to the training set and 10 000 to the test set. Each image has a size of 28 × 28 pixels and the digits are size-normalized and centered on them.

We experiment with a 3-layer fully connected network, called LeNet-300-100 [8], on the MNIST dataset. This network has an input layer that contains 784 units – one unit for each pixel of an MNIST image. The input layer is followed by two fully connected layers, containing 300 and 100 units respectively, which use the ReLU activation function. The output layer contains 10 units and uses the softmax activation function. Since we are measuring the performance of randomly initialized, untrained networks, we do not apply any training to them (we do not modify the parameters, only remove them if required). The weight values are the initial values, sampled from a normal distribution with 0 mean and 0.1 standard deviation. All biases are set to 0.

Using the pruning methods (without training the network), in each iteration we remove 6 units from the first and 2 units from the second hidden layer; therefore, the network size shrinks from 100% to 1.77%. During each iteration, we measure the network loss and accuracy on the test set. We also experiment with one-shot pruning using LFE: instead of removing the neurons iteratively, the network is pruned to the target size in a single step. For each pruning method we repeat the process 5 times and report the average loss and accuracy, as well as the minimum and maximum of these values.

To apply LFE pruning, we insert mask layers after the hidden layers. These mask layers define the active and inactive units (neurons) of the previous layer: units with a 0 mask are turned off, so the weights of those units cannot contribute to the network output. We use the cross entropy loss function to calculate the scores of the different filter ensembles.
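To make this setup concrete, here is a small NumPy sketch of the masked, frozen LeNet-300-100 as described above: weights drawn from a normal distribution with 0 mean and 0.1 standard deviation, zero biases, and 0/1 mask vectors after the two hidden layers, where only the masks change between pruning iterations. The randomly generated test batch and the random choice of which units to switch off are placeholders for the real MNIST data and for the pruning criterion (LFE, magnitude or random).

```python
# Sketch of the evaluation setup: a frozen, randomly initialized LeNet-300-100
# whose hidden units can be switched off through 0/1 mask vectors.
import numpy as np

rng = np.random.default_rng(0)
sizes = [784, 300, 100, 10]
weights = [rng.normal(0.0, 0.1, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]            # all biases are set to 0

def forward(x, masks):
    """Forward pass; masks[k] zeroes out the units of the k-th hidden layer."""
    h = x
    for k, (W, b) in enumerate(zip(weights, biases)):
        h = h @ W + b
        if k < len(masks):                           # hidden layers: ReLU, then mask
            h = np.maximum(h, 0.0) * masks[k]
    return h                                         # logits; softmax is not needed for argmax

def accuracy(x, y, masks):
    return float((forward(x, masks).argmax(axis=1) == y).mean())

# Placeholder batch standing in for the MNIST test set.
x_test = rng.random((1000, 784))
y_test = rng.integers(0, 10, size=1000)

masks = [np.ones(300), np.ones(100)]                 # every unit active at the start
print(accuracy(x_test, y_test, masks))

# One pruning iteration: a criterion selects 6 units of the first and 2 units of
# the second hidden layer to switch off; a random choice is used here.
masks[0][rng.choice(300, size=6, replace=False)] = 0.0
masks[1][rng.choice(100, size=2, replace=False)] = 0.0
print(accuracy(x_test, y_test, masks))
```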
Figure 2: The loss of LeNet-300-100 as more and more neurons are removed from the network. The curves present the results of iterative (LFE, magnitude, random) and one-shot (LFE one-shot) pruning, using the linear filter ensembles, magnitude and random pruning methods.

3.1. LeNet-300-100 results

First, we experiment with the LeNet-300-100 fully connected network. The results of the experiment are presented in Figure 1 and Figure 2.

As Figure 1 shows, the network accuracy with random pruning remains around the initial 10%. This value is expected in the case of a classification problem with 10 categories, and corresponds to the intuition that a randomly initialized network remains a smaller but still random network after random pruning.

Surprisingly, magnitude based pruning could not increase the network accuracy at all. While [16] reports a large accuracy increase with magnitude based, unstructured pruning, this does not apply to magnitude based structured pruning: the accuracy curve is similar to the curve of random pruning. However, this behavior is expected as well. The parameters are initialized from a normal distribution, resulting in nearly equal L2 norms per unit, and nearly equal importance values are not useful for pruning.

In contrast to this, LFE pruning (both iterative and one-shot) significantly increases the network accuracy as more and more units are removed from the network. Applying one-shot pruning, the network accuracy peaks around 29%, where 80% of the parameters are removed from the network. This increase is more significant in the case of iterative LFE: the network accuracy reaches 37% ± 4% when the network size is only 35.34% of the initial network (108 and 36 neurons in the first and second layers respectively). Beyond this point, the accuracy quickly decreases and falls back to the initial 11.9% at 1.77% of the initial network size.

Figure 2 shows the loss of the networks during parameter pruning. With LFE pruning (both iterative and one-shot) the network loss behaves as expected: it decreases, reaches a minimum value and starts to increase afterwards. However, random and magnitude pruning produce a continuously decreasing loss, which is in contrast with the accuracy curve. One would expect that randomly removing parameters from a random network will not influence the network loss. However, as more and more parameters are eliminated, the network output becomes more and more uniform (the network assigns a probability near 0.1 to each of the 10 digits). This results in a decreasing cross entropy loss that asymptotically approaches the value $\mathcal{L} = -\log(0.1) \approx 2.3026$.

3.2. Wide-LeNet results

Figure 3: Change of accuracy as more and more parameters are removed from the untrained Wide-LeNet: a 2-layer fully connected network with 3010 and 1010 hidden neurons.

Motivated by the results of [12], we also experiment with pruning an (untrained) Wide-LeNet: a 2-layer fully connected network with 3010 and 1010 units in the first and second hidden layers. Since this network has far more parameters, it should have a higher chance of containing luckily initialized subnetworks. Based on the success of LFE pruning, we iteratively remove 60 and 20 units per iteration and report the results in Figure 3. After pruning 55.44% of the parameters, the network produces 63.47% ± 1.56% accuracy on the MNIST test set, while at 8.98% network size – 250 and 90 neurons – the network accuracy is still above 60%.

These results validate the hypothesis that randomly initialized networks contain supermasks – with accuracy far from random – not only in sparse form but in dense form as well.

4. Conclusion

In this work we experimentally showed that randomly initialized, untrained neural networks contain dense subnetworks with an accuracy far better than chance. Using magnitude and LFE pruning, we iteratively pruned the randomly initialized LeNet-300-100 architecture and measured the network loss and accuracy. With LFE pruning, we found dense subnetworks with an accuracy of 37% on the MNIST dataset. Moreover, applying the same pruning method to Wide-LeNet, we achieved 63.5% accuracy without training the parameters at all. In the future we would like to extend these experiments to bigger networks and real world datasets as well.
References

[1] Dauphin, Y. N., and Bengio, Y. Big neural networks waste capacity. In 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings (2013), Y. Bengio and Y. LeCun, Eds.

[2] Denil, M., Shakibi, B., Dinh, L., Ranzato, M., and de Freitas, N. Predicting parameters in deep learning. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2 (Red Hook, NY, USA, 2013), NIPS'13, Curran Associates Inc., pp. 2148–2156.

[3] Frankle, J., and Carbin, M. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019 (2019), OpenReview.net.

[4] Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings (2016), Y. Bengio and Y. LeCun, Eds.

[5] He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016 (2016), IEEE Computer Society, pp. 770–778.

[6] He, Y., Kang, G., Dong, X., Fu, Y., and Yang, Y. Soft filter pruning for accelerating deep convolutional neural networks. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18 (July 2018), International Joint Conferences on Artificial Intelligence Organization, pp. 2234–2240.

[7] Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1 (Red Hook, NY, USA, 2012), NIPS'12, Curran Associates Inc., pp. 1097–1105.

[8] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 11 (1998), 2278–2324.

[9] LeCun, Y., Cortes, C., and Burges, C. MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist 2 (2010).

[10] Li, H., Kadav, A., Durdanovic, I., Samet, H., and Graf, H. P. Pruning filters for efficient convnets. CoRR abs/1608.08710 (2016).

[11] Louizos, C., Welling, M., and Kingma, D. P. Learning sparse neural networks through L_0 regularization. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings (2018), OpenReview.net.

[12] Ramanujan, V., Wortsman, M., Kembhavi, A., Farhadi, A., and Rastegari, M. What's hidden in a randomly weighted neural network? arXiv e-prints (Nov. 2019), arXiv:1911.13299.

[13] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115, 3 (2015), 211–252.

[14] Sándor, C., Pável, S., and Csató, L. Pruning CNN's with linear filter ensembles. arXiv e-prints (Jan. 2020), arXiv:2001.08142.

[15] Simonyan, K., and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv e-prints (2014), arXiv:1409.1556.

[16] Zhou, H., Lan, J., Liu, R., and Yosinski, J. Deconstructing lottery tickets: Zeros, signs, and the supermask. In Advances in Neural Information Processing Systems 32.
Curran Associates, Inc., 2019, pp. 3592–3602.