<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Hebbian Learning for Deep Neural Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gabriele Lagani</string-name>
          <email>gabriele.lagani@phd.unipi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Machine Learning, Hebbian Learning, Deep Neural Networks, Computer Vision</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ISTI-CNR</institution>
          ,
          <addr-line>Pisa, 56124</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Pisa</institution>
          ,
          <addr-line>Pisa, 56127</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <fpage>19</fpage>
      <lpage>22</lpage>
      <abstract>
<p>Deep learning is becoming increasingly popular for extracting information from multimedia data for indexing and query processing. In recent contributions, we have explored a biologically inspired strategy for Deep Neural Network (DNN) training, based on the Hebbian principle from neuroscience. We studied hybrid approaches in which unsupervised Hebbian learning was used for a pre-training stage, followed by supervised fine-tuning based on Stochastic Gradient Descent (SGD). The resulting semi-supervised strategy exhibited encouraging results on computer vision datasets, motivating further interest towards applications in the domain of large-scale multimedia content-based retrieval.</p>
      </abstract>
      <kwd-group>
        <kwd>Machine Learning</kwd>
        <kwd>Hebbian Learning</kwd>
        <kwd>Deep Neural Networks</kwd>
        <kwd>Computer Vision</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In the past few years, Deep Neural Networks (DNNs) have emerged as a powerful technology
in the domain of computer vision [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. Consequently, DNNs also started gaining popularity
in the domain of large-scale multimedia content-based retrieval, replacing handcrafted feature
extractors [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. Learning algorithms for DNNs are typically based on supervised
end-to-end Stochastic Gradient Descent (SGD) training with error backpropagation (backprop). This
approach is considered biologically implausible by neuroscientists [5], who instead propose
Hebbian learning as a biologically plausible model of synaptic plasticity [6].
      </p>
      <p>Backprop-based algorithms need a large number of labeled training samples in order to
achieve high performance, and labeled samples, as opposed to unlabeled ones, are expensive to gather.</p>
      <p>The idea behind our contribution [7, 8] is to tackle the sample efficiency problem by taking
inspiration from biology and Hebbian learning. Since Hebbian approaches are mainly
unsupervised, we propose to use them for an unsupervised pre-training stage on all the
available data, in a semi-supervised setting, followed by end-to-end backprop fine-tuning on
the labeled data only. In the rest of this paper, we illustrate the proposed methodology, and we
show experimental results in computer vision. The results are promising, motivating further
interest in the application of our approach to large-scale multimedia content-based retrieval.</p>
      <p>The remainder of this paper is structured as follows: Section 2 gives background on
Hebbian learning and semi-supervised training; Section 3 delves deeper into the semi-supervised
approach based on Hebbian learning; Section 4 illustrates our experimental results and
discusses the conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and related work</title>
      <p>Several variants of Hebbian learning rules were developed over the years. Some examples are:
Hebbian learning with Winner-Takes-All (WTA) competition [9], Hebbian learning for Principal
Component Analysis (PCA) [6, 10], Hebbian/anti-Hebbian learning [11]. A brief overview is
given in Section 3. However, it was only recently that Hebbian learning started gaining attention
in the context of DNN training [12, 13, 14, 15, 16].</p>
      <p>In [14], a Hebbian learning rule based on inhibitory competition was used to train a neural
network composed of fully connected layers. The approach was validated on object recognition
tasks. Instead, the Hebbian/anti-Hebbian learning rule developed in [11] was applied in [13] to
train convolutional feature extractors. The resulting features were shown to be effective for
classification. Convolutional layers were also considered in [12], where a Hebbian approach
based on WTA competition was employed.</p>
      <p>However, the previous approaches were based on relatively shallow network architectures
(2-3 layers). A further step was taken in [15, 16], where Hebbian learning rules were applied for
training a 6-layer Convolutional Neural Network (CNN).</p>
      <p>It is known that a pre-training phase allows network weights to be initialized in a region near
a good local optimum [17, 18]. Previous papers investigated the idea of enhancing neural
network training with an unsupervised learning objective [19, 20]. In [19], Variational
AutoEncoders (VAEs) were used for an unsupervised pre-training phase, so that the subsequent
supervised phase could succeed with a limited amount of labeled samples. Also, [20] relied on autoencoding architectures to
augment supervised training with unsupervised reconstruction objectives, showing that joint
optimization of supervised and unsupervised losses helped to regularize the learning process.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Hebbian learning strategies and sample efficiency</title>
      <p>Consider a single neuron with weight vector $w$ and input $x$. Call $y = w^T x$ the neuron output.
A learning rule defines a weight update as follows:
$$w_{new} = w_{old} + \Delta w \qquad (1)$$
where $w_{new}$ is the updated weight vector, $w_{old}$ is the old weight vector, and $\Delta w$ is the weight
update.</p>
      <p>The Hebbian learning rule, in its simplest form, can be expressed as $\Delta w = \eta \, y \, x$ (where
$\eta$ is the learning rate) [6]. Basically, this rule states that the weight on a given synapse is
reinforced when the input on that synapse and the output of the neuron are simultaneously
high. Therefore, connections between neurons whose activations are correlated are reinforced.
In order to prevent the weights from growing unbounded, a weight decay term is generally added.
In the context of competitive learning [9], this is obtained as follows:</p>
      <p>$$\Delta w_i = \eta \, y_i \, x - \eta \, y_i \, w_i = \eta \, y_i \, (x - w_i) \qquad (2)$$
where the subscript $i$ refers to the i'th neuron in a given network layer. Moreover, the output $y_i$
can be replaced with the result $r_i$ of a competitive nonlinearity, which allows the activity of
different neurons to be decorrelated. In the Winner-Takes-All (WTA) approach [9], at each training
step, the neuron which produces the strongest activation for a given input is called the winner.
In this case, $r_i = 1$ if the i'th neuron is the winner and $r_i = 0$ otherwise. In other words, only the
winner is allowed to perform the weight update, so that it will be more likely for the same
neuron to win again if a similar input is presented again in the future. In this way, different
neurons are induced to specialize on different patterns. In soft-WTA [21], $r_i$ is computed as
$r_i = y_i / \sum_{j=1}^{N} y_j$, where $N$ is the number of neurons in the layer. We found this
formulation to work poorly in practice, because there is no tunable parameter to cope with the
variance of the activations. For this reason, we introduced a variant of this approach that uses a
softmax operation in order to compute $r_i$:
$$r_i = \frac{e^{y_i / T}}{\sum_{j=1}^{N} e^{y_j / T}} \qquad (3)$$
where $T$ is called the temperature hyperparameter. The advantage of this formulation is that we
can tune the temperature in order to obtain the best performance on a given task, depending on
the distribution of the activations.</p>
      <p>The Hebbian Principal Component Analysis (HPCA) learning rule, in the case of nonlinear
neurons, is obtained by minimizing the so-called representation error $L(w_i)$:
$$L(w_i) = E\left[ \left( x - \sum_{j=1}^{i} f(y_j) \, w_j \right)^2 \right] \qquad (4)$$
where $f(\cdot)$ is the neuron activation function. Minimization of this objective leads to the
nonlinear HPCA rule [10]:
$$\Delta w_i = \eta \, f(y_i) \left( x - \sum_{j=1}^{i} f(y_j) \, w_j \right) \qquad (5)$$</p>
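      <p>As a concrete illustration of the above rules, the following minimal NumPy sketch (our own illustration, not the code released with the paper; function names such as soft_wta_update and hpca_update are ours) applies the softmax-competition update of Eqs. (2)-(3) and the nonlinear HPCA update of Eq. (5) to a single layer of neurons:</p>
      <preformat>
import numpy as np

def soft_wta_update(W, x, eta=0.01, T=0.5):
    """One Hebbian step with softmax competition, Eqs. (2)-(3).
    W: (num_neurons, num_inputs) weights; x: (num_inputs,) input."""
    y = W @ x                                    # outputs y_i = w_i . x
    r = np.exp(y / T) / np.sum(np.exp(y / T))    # softmax competition, Eq. (3)
    return W + eta * r[:, None] * (x[None, :] - W)   # Eq. (2) with y_i -> r_i

def hpca_update(W, x, eta=0.01, f=np.tanh):
    """One step of the nonlinear HPCA rule, Eq. (5).
    Neuron i reconstructs x from the outputs of neurons 1..i."""
    y = f(W @ x)                                 # nonlinear outputs f(y_j)
    recon = np.cumsum(y[:, None] * W, axis=0)    # cumulative sums of f(y_j) w_j
    return W + eta * y[:, None] * (x - recon)

# Toy usage: 10 neurons on 32-dimensional inputs.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(10, 32))
for _ in range(100):
    x = rng.normal(size=32)
    W = soft_wta_update(W, x)
      </preformat>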
      <p>It can be noticed that these learning rules do not require supervision, and that they are local
to each network layer, i.e. they do not require backpropagation.</p>
      <p>In order to contextualize our approach in a scenario with scarce data, let's define the labeled
set $T_L$ as a collection of elements for which the corresponding label is known. Conversely, the
unlabeled set $T_U$ is a collection of elements whose labels are unknown. The whole training set
$T$ is given by the union of $T_L$ and $T_U$. All the samples from $T$ are assumed to be drawn from
the same statistical distribution. In a sample efficiency scenario, the number of samples in $T_L$ is
typically much smaller than the total number of samples in $T$. In particular, in an $s\%$-sample
efficiency regime, the size of the labeled set is $s\%$ that of the whole training set.</p>
      <p>To tackle this scenario, we considered a semi-supervised approach in two phases. During the
first phase, latent representations are obtained from the hidden layers of a DNN, which are trained
using unsupervised Hebbian learning. This unsupervised pre-training is performed on all the
available training samples. During the second phase, a final linear classifier is placed on top of
the features extracted from the deep network layers. Classifier and deep layers are then fine-tuned
in a supervised fashion, by running an end-to-end SGD optimization procedure using only
the few labeled samples at our disposal.</p>
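      <p>The following toy sketch shows the two phases end to end in PyTorch. It is an illustrative outline under our own simplifying assumptions (random stand-in data instead of real images, a single fully connected Hebbian layer, and the softmax-competition rule of Eqs. (2)-(3)), not the paper's actual pipeline:</p>
      <preformat>
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in data (the paper uses CIFAR10): 1000 samples, 5% labeled.
X = torch.randn(1000, 64)
y = torch.randint(0, 10, (1000,))
labeled = torch.randperm(1000)[:50]

encoder = nn.Linear(64, 128, bias=False)
classifier = nn.Linear(128, 10)

# Phase 1: unsupervised Hebbian pre-training on ALL samples,
# here with the softmax-competition rule of Eqs. (2)-(3).
eta, T = 0.01, 0.5
with torch.no_grad():
    for x in X:
        r = torch.softmax((encoder.weight @ x) / T, dim=0)              # Eq. (3)
        encoder.weight += eta * r[:, None] * (x[None, :] - encoder.weight)  # Eq. (2)

# Phase 2: supervised end-to-end SGD fine-tuning on the labeled subset only.
model = nn.Sequential(encoder, nn.ReLU(), classifier)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for epoch in range(20):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(X[labeled]), y[labeled])
    loss.backward()
    opt.step()
      </preformat>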
    </sec>
    <sec id="sec-4">
      <title>4. Results and conclusions</title>
      <p>In order to validate our method, we performed experiments on various image datasets and in
various sample efficiency regimes (code available at:
https://github.com/GabrieleLagani/HebbianPCA/tree/hebbpca). For the sake of brevity, but without
loss of generality, we present here the results on CIFAR10 [22], in sample efficiency regimes where
the amount of labeled samples was respectively 1%, 5%, 10%, and 100% of the whole training set.
Further results can be found in [7, 8].</p>
      <p>We considered a six-layer neural network as shown in Fig. 1: five deep layers plus a final
linear classifier. The various layers were interleaved with other processing stages (such as
ReLU nonlinearities, max-pooling, etc.). We first performed unsupervised pre-training with
a chosen algorithm. Then, we cut the network at a given layer, and we
attached a new classifier on top of the features extracted from that layer. Deep layers and
classifier were then fine-tuned with supervision in an end-to-end fashion, and the resulting
accuracy was evaluated. This was done for each layer, in order to evaluate the network on a
layer-by-layer basis, and for each sample efficiency regime. For the unsupervised pre-training
of the deep layers, we considered both the HPCA and the soft-WTA strategies. In addition, as a
baseline for comparison, we considered another popular unsupervised method for pre-training,
namely the Variational Auto-Encoder (VAE) [23] (also considered in [19]). Note that the VAE is
unsupervised, but still backprop-based.</p>
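      <p>The layer-by-layer evaluation protocol can be sketched as follows. Again, this is an illustrative outline with toy data and fully connected stages standing in for the pre-trained convolutional layers; the helper name finetune_from_layer is ours:</p>
      <preformat>
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for the pre-trained deep layers (five stages in the paper,
# here tiny fully connected blocks) and for the labeled data.
stages = [nn.Sequential(nn.Linear(64, 64), nn.ReLU()) for _ in range(5)]
X, y = torch.randn(256, 64), torch.randint(0, 10, (256,))

def finetune_from_layer(k, epochs=20):
    """Cut the network after stage k, attach a fresh linear classifier,
    fine-tune end-to-end on the labeled data, and return training accuracy.
    Stages are deep-copied so every run restarts from the pre-trained weights."""
    trunk = copy.deepcopy(nn.Sequential(*stages[:k]))
    model = nn.Sequential(trunk, nn.Linear(64, 10))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(model(X), y)
        loss.backward()
        opt.step()
    return (model(X).argmax(dim=1) == y).float().mean().item()

# Evaluate the network on a layer-by-layer basis.
for k in range(1, 6):
    print(f"stages kept: {k}, accuracy: {finetune_from_layer(k):.3f}")
      </preformat>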
      <p>The results are shown in Tab. 1, together with 95% confidence intervals obtained from five
independent repetitions of the experiments. In summary, the results suggest that our
semi-supervised approach based on unsupervised Hebbian pre-training generally performs better
than VAE pre-training, especially in low sample efficiency regimes, in which only a small portion
of the training set (between 1% and 10%) is assumed to be labeled. In particular, the HPCA
approach appears to perform generally better than soft-WTA. Concerning the computational
cost of Hebbian learning, the approach converged in just 1-2 epochs of training, while backprop-based
approaches required 10-20 epochs, showing promise towards scaling to large-scale scenarios.</p>
      <p>In future work, we plan to investigate the combination of Hebbian approaches with alternative
semi-supervised methods, namely pseudo-labeling and consistency-based methods [24, 25],
which do not exclude unsupervised pre-training, but rather can be integrated with it. Moreover,
we are currently conducting more thorough explorations of Hebbian algorithms in the domain
of large-scale multimedia content-based retrieval, and the results are promising [26].</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work was partially supported by the H2020 projects AI4EU (GA 825619) and AI4Media
(GA 951911).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          , I. Sutskever,
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <article-title>Imagenet classification with deep convolutional neural networks</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Ren,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Deep residual learning for image recognition</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. C. H.</given-names>
            <surname>Hoi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Deep learning for content-based image retrieval: A comprehensive study</article-title>
          ,
          <source>in: Proceedings of the 22nd ACM international conference on Multimedia</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>157</fpage>
          -
          <lpage>166</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Babenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Slesarev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chigorin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Lempitsky</surname>
          </string-name>
          ,
          <article-title>Neural codes for image retrieval</article-title>
          ,
          <source>in: European conference on computer vision</source>
          , Springer,
          <year>2014</year>
          , pp.
          <fpage>584</fpage>
          -
          <lpage>599</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] R. C. O’Reilly, Y. Munakata, Computational explorations in cognitive neuroscience: Understanding the mind by simulating the brain, MIT Press, 2000.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] S. Haykin, Neural networks and learning machines, 3rd ed., Pearson, 2009.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] G. Lagani, F. Falchi, C. Gennaro, G. Amato, Hebbian semi-supervised learning in a sample efficiency setting, Neural Networks 143 (2021) 719-731.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] G. Lagani, F. Falchi, C. Gennaro, G. Amato, Evaluating Hebbian learning in a semi-supervised setting, in: International Conference on Machine Learning, Optimization, and Data Science, Springer, 2021, pp. 365-379.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] S. Grossberg, Adaptive pattern classification and universal recoding: I. Parallel development and coding of neural feature detectors, Biological Cybernetics 23 (1976) 121-134.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] J. Karhunen, J. Joutsensalo, Generalizations of principal component analysis, optimization problems, and neural networks, Neural Networks 8 (1995) 549-562.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] C. Pehlevan, T. Hu, D. B. Chklovskii, A Hebbian/anti-Hebbian neural network for linear subspace learning: A derivation from multidimensional scaling of streaming data, Neural Computation 27 (2015) 1461-1495.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] A. Wadhwa, U. Madhow, Bottom-up deep learning using the Hebbian principle, 2016.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] Y. Bahroun, A. Soltoggio, Online representation learning with single and multi-layer Hebbian networks for image classification, in: International Conference on Artificial Neural Networks, Springer, 2017, pp. 354-363.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] D. Krotov, J. J. Hopfield, Unsupervised learning by competing hidden units, Proceedings of the National Academy of Sciences 116 (2019) 7723-7731.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] G. Lagani, F. Falchi, C. Gennaro, G. Amato, Training convolutional neural networks with competitive Hebbian learning approaches, in: International Conference on Machine Learning, Optimization, and Data Science, Springer, 2021, pp. 25-40.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] G. Lagani, F. Falchi, C. Gennaro, G. Amato, Comparing the performance of Hebbian against backpropagation learning using convolutional neural networks, Neural Computing and Applications (2022) 1-17.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle, Greedy layer-wise training of deep networks, in: Advances in Neural Information Processing Systems, 2007, pp. 153-160.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] H. Larochelle, Y. Bengio, J. Louradour, P. Lamblin, Exploring strategies for training deep neural networks, Journal of Machine Learning Research 10 (2009).</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] D. P. Kingma, S. Mohamed, D. Jimenez Rezende, M. Welling, Semi-supervised learning with deep generative models, Advances in Neural Information Processing Systems 27 (2014) 3581-3589.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] Y. Zhang, K. Lee, H. Lee, Augmenting supervised neural networks with unsupervised objectives for large-scale image classification, in: International Conference on Machine Learning, 2016, pp. 612-621.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] S. J. Nowlan, Maximum likelihood competitive learning, in: Advances in Neural Information Processing Systems, 1990, pp. 574-582.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] A. Krizhevsky, G. Hinton, Learning multiple layers of features from tiny images, 2009.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] D. P. Kingma, M. Welling, Auto-encoding variational Bayes, arXiv preprint arXiv:1312.6114 (2013).</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] A. Iscen, G. Tolias, Y. Avrithis, O. Chum, Label propagation for deep semi-supervised learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 5070-5079.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] P. Sellars, A. I. Aviles-Rivero, C.-B. Schönlieb, LaplaceNet: A hybrid energy-neural model for deep semi-supervised classification, arXiv preprint arXiv:2106.04527 (2021).</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] G. Lagani, D. Bacciu, C. Gallicchio, F. Falchi, C. Gennaro, G. Amato, Deep features for CBIR with scarce data using Hebbian learning, submitted at CBMI 2022 (2022). URL: https://arxiv.org/abs/2205.08935.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>