Implementation of Evolutionary Algorithms for
              Deep Architectures

                           Sreenivas Sremath Tirumala


      Abstract. Deep learning is becoming an increasingly interesting and
      powerful machine learning method with successful applications in many
      domains, such as natural language processing, image recognition, and
      hand-written character recognition. Despite of its eminent success, lim-
      itations of traditional learning approach may still prevent deep learning
      from achieving a wide range of realistic learning tasks. Due to the flexi-
      bility and proven eﬀectiveness of evolutionary learning techniques, they
      may therefore play a crucial role towards unleashing the full potential of
      deep learning in practice. Unfortunately, many researchers with a strong
      background on evolutionary computation are not fully aware of the state-
      of-the-art research on deep learning. To close this knowledge gap and to
      promote the research on evolutionary inspired deep learning techniques,
      this paper presents a comprehensive review of the latest deep architec-
      tures and surveys important evolutionary algorithms that can potentially
      be explored for training these deep architectures.


Index terms — Deep Architectures, Deep Learning, Evolutionary Algorithms


1   Introduction

Deep Learning is a topic of high interest with its extensive application in nat-
ural language processing, image recognition [1] [2] and computer vision. Cor-
porate giants like Google, Microsoft, Apple, Facebook, Yahoo etc. established
their deep learning research groups for implementing this concept in their prod-
ucts. Applications based on deep learning have won numerous machine learning
competitions in ICML and NIPS with considerable margins which were earlier
dominated by other machine learning approaches like Support Vector Machines.
In 2013 it has topped in Chinese Handwriting Recognition Competition, Galaxy
Zoo Competition, MICCAI 2013 Challenge, Merck Drug Discovery Competition,
Dogs versus Cats Competition etc. Deep Learning is rated as the most interesting
topic of research interests by Massachusetts Institute of Technology (MIT).
    The importance of studying deep architectures is motivated from the deep
architecture found in human brain. It is a common practice to reduce a high
level problem into a set of low level problems in a hierarchical manner with
easiest problem at the bottom. Interestingly, deep architectures based systems
can achieve the learning that a shallow architecture can, but the vice versa is
not feasible [3]. A Deep Neural Network (DNN) is an Artificial Neural Network


                               Page 164 of 171
(ANN) with multiple hidden layers. One of major problems of DNNs is overfit-
ting which was unaddressed till 2014 [4]. Further, due to the extensive use of
gradient descent based learning techniques, DNNs may easily be trapped into
local optima, resulting in undesirable learning performance. Moreover, the initial
topology of DNN is often determined through a seemingly arbitrary trial and
error process. However, the fixed topology thus created may seriously aﬀect the
learning flexibility and practical applicability of DNNs. Deep learning has been
applied on other machine learning paradigms like Support Vector Machines and
Reinforcement Learning.
    In this paper, we argue that Evolutionary Computation (EC) techniques can,
to a large extent, present satisfactory and eﬀective solutions to above mentioned
problems. In fact, several Neuroevolutoinary systems have been successfully de-
veloped to solve various challenging learning tasks with remarkably better per-
formance than traditional learning techniques. Unfortunately, many researchers
with a strong background in evolutionary computation are still not fully aware
of the state-of-the-art research on deep learning. To meet this knowledge gap
and to promote the research on evolutionary inspired deep learning techniques,
this paper presents a review of latest deep architectures and surveys important
evolutionary algorithms that can potentially be explored for training these deep
architectures. This paper is divided into 5 sections. Section 1 details the his-
tory of deep architectures. Section 2 provides a detailed study on various deep
architectures. Recent implementations of evolutionary algorithms on deep archi-
tectures are explored in section 3. Section 4 summarizes the paper with outcomes
and conclusion.


2   Deep Architectures

Deep architecture is a hierarchical structure of multiple layers with each layer
being self-trained to learn from the output of its preceding layer. This learning
process i.e., ’deep learning’ is based on distributed representation learning with
multiple levels of representation for various layers. In simple terms, each layer
learns a new feature from its preceding layer which makes the learning process
concrete. Thus, the learning process is hierarchical with low level feature at the
bottom and very high level feature at the top with intermediate features in the
middle that can also be utilized. From these features, greedy-layer-wise training
mechanism enables to extract only those features that are useful for learning.
Along with this, a pre-unsupervised training with unlabelled data makes deep
learning more eﬀective.
    Shallow architectures have only two levels of computation and learning el-
ements which makes them ineﬃcient to handle training data [5]. Deep archi-
tectures require fewer computational units that allow non-local generalization
which result in increased comprehensibility and eﬃciency that has been proved
with its success in Natural Language Processing (NLP) and image processing.
According to complexity theory of circuits, deep architectures can be exponen-
tially more eﬃcient than traditional narrow architectures in terms of functional


                              Page 165 of 171
representation for problem solving [5]. Traditional Artificial Neural Networks
(ANNs) are considered to be most suitable for implementing deep architectures.
    In 1980 Fukushima proposed Neocognition using Convolutional Neural Net-
works (ConvNets) [6] which served as a successful model for later works on deep
architectures which later been improved by Lecun [7]. The theoretical concepts
of deep architecture were proposed in 1998 by Lecun [8]. The Breakthrough in
the research of training deep architectures was achieved in 2006 when Lecun,
G.E. Hinton and Yoshua Bengio proposed 3 diﬀerent types of deep architectures
with eﬃcient training mechanism. Lecun implemented eﬃcient training mecha-
nism for ConvNets [9] in which he was not successful earlier. Hinton implemented
Deep Belief Networks (DBNs) [10] and Yoshua Bengio proposed Stacked Auto-
encoders [11].
    A simple form of deep architecture implementation is DNNs, feed-forward
ANNs with more than one hidden layer units that make them more eﬃcient than
a normal ANNs [12]. DNNs are trained with BP by discriminative probabilistic
models that calculate the diﬀerence between target outputs and actual outputs.
The weights in the DNNs are updated using stochastic gradient descent defined
                                 ∂C
as ∆wij (t + 1) = ∆wij (t) + η ∂w  ij
                                      , where η represents the learning rate, C
is the associated cost function and wij represents weight. For larger training
sets, DNNs may be trained in multiple batches of small sizes without losing the
eﬃciency [13]. However it is very complex to train DNNs with many layers and
many hidden units since the number of parameters to be optimized are very
high.

2.1   Convolutional Neural Networks (ConvNets)
ConvNets are a special type of feed-
forward ANNs that performs feature
extraction by applying convolution
and sub sampling. The principle ap-
plication of ConvNets is feature iden-
tification. ConvNets are biologically
inspired MLPs based on virtual cortex
principle [14] and the earliest imple- Fig. 1. ConvNets Structure proposed by
mentation is by Fukushima in 1980 [6] Lecun [9]
for pattern recognition followed by Le-
cun in 1998 [8]. ConvNets diversify by
applying local connections, sub sampling and sharing the weights which is sim-
ilar to the principle approach of ANNs in early 60s. In ConvNets each unit in
the layer receives input from set of units in small groups from its neighbouring
layer which is similar to earlier MLP model. Using local connections for feature
extraction has been proven successful, especially for extracting edges, end points
and corners. These features extracted at the initial layer will be combined subse-
quently at the later layers to achieve higher or better features. The features that
are detected at the initial stages may also be used at the subsequent stages. The
training procedure of the ConvNets is shown in Fig. 1. The first layer takes a raw


                               Page 166 of 171
pixel with 32 x 32 from the input image. The second layer consists of 6 kernels
with 5 x 5 local window. From this, a sub sampling is done in the 3rd layer
(sub sampling) layer. For the 4th layer, another ConvNets with 16 kernels was
exploited with the same 5 x 5 windows. Then the 5th layer is also constructed
using sub sampling. This procedure continues till the last layer and the entire
structure is developed as Gaussian connections.

2.2   Deep Belief Networks
Deep Belief Network (DBN) is a type of DNN proposed by Hinton in 2006 [15].
DBN is based on MLP model with greedy layer-wise training. DBN consists of
multiple interconnected hidden layers with each layer acting as an input to the
next layer and is visible only to the next layer. Each layer in a DBN has no
lateral connection between its nodes present in that layer. The nodes of DBN
are probabilistic logic nodes thus allowing the possibility of using activation
function. Restricted Boltzmann machine (RBM) is stochastic ANN with input
and hidden units with each and every connection connecting a hidden and visible
unit. RBMs act as the building blocks of DBNs because of their capability of
learning probabilistic distributions on their inputs. Initially the first layer of the
DBNs is trained as RBM that transforms input into output. The output thus
received is used as data for the second layer which is treated as a RBM for
the next level of training and the process continues. Similarly the output of the
second layer will be the input for the third layer and the process continues as
shown in Fig. 2 .The transformation of data is done using activation function
or sampling. In this way the subsequent hidden layer becomes a visible layer for
current hidden layer so as to train it as a RBM. An RBM with two layers, a
visible layer as layer 1 and a hidden layer as layer 2 is the simplest form of DBN.
The units of the visible layer are used to represent data and the units (hidden
with no connection between them) will learn to represent features. If a hidden
layer 3 is added to this, then layer 2 will be visible to only layer 3 (still hidden
to layer 1) and now the RBM will transform the data from layer 2 to layer 3.
This process is illustrated in Fig. 2.
                                                     In 2006, Hinton proposed a
                                                 greedy layer-wise unsupervised
                                                 pre-learning algorithm for train-
                                                 ing that addresses the problem
                                                 of training multilayer ANNs [10].
                                                 In DBNs, the lower level fea-
                                                 tures of the input are extracted as
                                                 lower layers and an abstract rep-
                                                 resentation (high level features)
 Fig. 2. Structure of Deep Belief Networks [15]
                                                 of the input is performed at the
                                                 higher layers. The training pro-
                                                 cedure of DBNs is carried out in
three phrases. Each layer of the DBN is pre-trained with greedy layer wise train-
ing followed by unsupervised learning for each layer and finally training the


                                Page 167 of 171
entire network with supervised training. The significance of this training proce-
dure is determined by the generative weights. After learning, the values of the
latent variables in every layer can be inferred by a single, bottom-up pass that
starts with observed data vector in the bottom layer using generative weights
in the reverse direction. DBNs proved to be the most eﬃcient in image recog-
nition [10], Face Recognition [16], Character Recognition [11] and various other
applications.

2.3   Stacked Auto-encoders
The idea of auto-encoders is evolved from the process of reducing dimensionality
of data by identifying eﬃcient method to transform complex high dimensional
data into lower dimensional code using an encoding multilayer ANN. A decoder
network will be used to recover the data from the code. Initially both encoder
and decoder networks are assigned with random weights and trained by observ-
ing the discrepancy between original data and output obtained from encoding
and decoding. After this the error is back propagated first through the decoder
network followed by encoder network and this entire system is named as auto-
encoders [15].
                                                                    1
     An auto-encoder with input x ∈ Rd is ”encoded” as h ∈ Rd using deter-
ministic function defined as fθ = σ(W x + b), θ = W, b. To ”decode”, a reverse
mapping of f : y = fθ1 (h) = σW 1 h + b1 with θ = (W 1 , b1 ) and W 1 = W T
with encoding and decoding with the same inputs. This process continues for
every training patten. For i training xi is mapped to hi with a reconstruction
yi . Parameter optimization is achieved by minimizing the cost function over the
training set. However, optimizing an auto-encoder network with multiple hidden
layers is diﬃcult. Being similar to DBN greedy layer wise training procedure,
this approach replaces RBMs by auto-encoders that perform learning by repro-
ducing every data vector from its own feature activation [5]. The considerable
change that has been applied in this model by Yoshua Bengio is changing the un
supervised training procedure to supervised in order to identify the significance
of training paradigm.
     The process of greedy layer wise training is as follows. In the entire ANN,
three layers are considered at one instance with the middle layer being the hidden
layer. In the next instance, the middle layer becomes input layer and the output
layer of the previous instance become hidden layer (the parameters from the
output becomes the training parameters) and the layer next to it will be the
new output layer. This process continues for the entire network. However, the
results were not eﬃcient since the network becomes too greedy [5]. It can be
concluded that, the performance of stacked auto-encoders with unsupervised
training was almost similar to that of RBNs with similar type of training whereas
stacked auto-encoders with supervised pre-training is less eﬃcient. Stacked auto-
encoders were not successful in ignoring random noise in its training data due to
which its performance is slightly less (almost equal performance but not same)
than RBM based deep architectures. However, this gap is narrowed by stacked
de-noising auto-encoder algorithm proposed in 2010 [17].


                              Page 168 of 171
3     Applying Evolutionary Algorithms on Deep
      Architectures
3.1   Generative Neuroevolution for Deep Learning
In 2013 Phillip Verbancsics and Josh Harguess proposed Generative Neuroevo-
lution for Deep Learning by implementing HyperNEAT as a feature learner
on a ANN similar to ConvNets [18]. Compositional pattern producing network
(CPPN) is an indirect encoding procedure of HyperNEAT that encodes weight
patterns of ANN using composite functions. The topology and weights required
for CPNN is evolved by HyperNEAT. In HyperNEAT process, CPPN defines an
ANN as a solution for required problem. CPNNs fitness score is determined by
evaluating the ANNs performance for the task for which it is evolved. Diverg-
ing from traditional methods, this approach trains ANN to learn features by
transforming input into features. Then these features are evaluated by Machine
Learning (ML) approach thus defining the fitness of CPNN. Therefore, this pro-
cess will maximize the performance of the learned solution since HyperNEAT
determines the best features out of other ML approach. ConvNets can be rep-
resented in a graph like structure with coordinates of the nodes associated with
each other which are similar to HyperNEAT structure. This similarity enables
to apply HyperNEAT on ConvNets based architectures.
    For the experiment, an eight dimensional Hypercube representation of CPNN
is used with f-axis as feature axis, x-axis as neuron constellation of each feature
and y-axis being pixel locations. HyperNEAT topology is a multilayer neural
network with layers traveling along z-axis with CPPN representing the points in
an eight-dimensional Hyper-cube that corresponds to connections in the four di-
mensional substrate. The location of each neuron can be identified using (x,y,f,z)
coordinate and each layer can be represented with a trait constituting number of
features(F) with X and Y dimensions. HyperNEAT is applied to the LeNet-5 [8].
The experiment is conducted on MNIST database with a population size of 250
with 30 runs for 2500 generations. With this comparative results its been con-
cluded that HyperNEAT with ANN architectures is overthrown by HyperNEAT
with CNN architecture.

3.2   Deep Learning using Genetic Algorithm
In 2012, Joshua proposed a learning method for deep architectures using genetic
algorithm [19]. A DNN for image classification is implemented using a genetic
algorithm and training each layer using generic algorithm. Further this study
tries to justify the possibility of using genetic algorithms to train non trivial
DNNs for feature extraction. Initially a matrix representing the DNN is gener-
ated with Sparse Network Design with most of the values being close to zero,
whereas the ideal solution in this case is an identity matrix. The genetic sequence
of individuals with non-zero elements (which is considered as a gene) is kept and
computed instead of re-generating the complete matrix which will reduce the
amount of data required to store in the matrix and the process complexity. The


                               Page 169 of 171
position of the gene in the matrix can be determined by row and column and
every gene has a magnitude.
    The proposed algorithms are tested on image data normalized in the range
of 0.0 and 1.0. Apart from applying to image data, the algorithm has been ap-
plied to handwriting, face image (small and large) and cat image identification.
The experimental results section shows the reconstruction (of input) error rate
for each experiment. Another experiment for reconstruction of faces with noisy
data claim to prove that this algorithm is not just copying blocks of data, but
generating the connections in the data and reconstructing the initial image. The
theoretical limitations of the algorithm is not addressed. The cost of reconstruc-
tion becomes 0 for a single training image as it will be eﬃcient only with a large
set of data.


4   Conclusion
This paper provides a theoretical review of standard deep architectures and
study the possibilities of implementing evolutionary computation principles on
deep architectures. Apart from introducing various types of deep architecture,
this paper provides a detailed explanation of their training procedure and imple-
mentations. Further, this paper analyses the implications of applying evolution-
ary algorithms on deep architectures with details of two such implementations
and a critical review on their achievement. The Neuroevolution approach for
deep architectures that is discussed in previous section is with respect to the
application of HyperNEAT on deep architectures. The success of this proposed
method cannot be determined since CNN holds the best classification for MNIST
database. But, this drives a way of implementing Neuroevolution algorithms on
deep architectures. Similarly, the second work of using genetic algorithms for
training DNNs, justifies the possibility of using genetic algorithms for training
deep architectures but does not show any signs of comparative studies of its
eﬃciency with respect to speed or quality.
    It is noteworthy that evolutionary algorithms may not be a complete re-
placement for deep learning algorithms at least not at this stage. However, the
successful application of evolutionary techniques on deep architectures will lead
to an improved learning mechanism for deep architectures. This might result in
reducing the training time which is the main drawback for deep architectures.
Future direction in this research could be evolving an optimized deep architecture
based neural networks using Neuroevolutonary principles. This could provide a
warm start to the deep learning process and could improve the performance of
the deep learning algorithms.


References
 1. Y. LeCun, K. Kavukcuoglu, and C. Farabet, “Convolutional networks and appli-
    cations in vision,” in Circuits and Systems (ISCAS), Proceedings of 2010 IEEE
    International Symposium on, pp. 253–256, May 2010.


                              Page 170 of 171
 2. J. Xie, L. Xu, and E. Chen, “Image denoising and inpainting with deep neural
    networks,” in In NIPS, 2012.
 3. Y. Bengio, “Learning deep architectures for AI,” Foundations and Trends in Ma-
    chine Learning, vol. 2, no. 1, pp. 1–127, 2009. Also published as a book. Now
    Publishers, 2009.
 4. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov,
    “Dropout: A simple way to prevent neural networks from overfitting,” Journal
    of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
 5. Y. Bengio and Y. LeCun, “Scaling learning algorithms towards AI,” in Large Scale
    Kernel Machines (L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, eds.), MIT
    Press, 2007.
 6. K. Fukushima, “Neocognitron: A self-organizing neural network model for a mecha-
    nism of pattern recognition unaﬀected by shift in position,” Biological Cybernetics,
    vol. 36, pp. 193–202, 1980.
 7. Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and
    L. D. Jackel, “Handwritten digit recognition with a back-propagation network,” in
    Advances in Neural Information Processing Systems (NIPS 1989) (D. Touretzky,
    ed.), vol. 2, (Denver, CO), Morgan Kaufman, 1990.
 8. Y. Lecun, L. Bottou, Y. Bengio, and P. Haﬀner, “Gradient-based learning applied
    to document recognition,” in Proceedings of the IEEE, pp. 2278–2324, 1998.
 9. M. A. Ranzato, C. S. Poultney, S. Chopra, and Y. LeCun, “Eﬃcient learning of
    sparse representations with an energy-based model,” in NIPS, pp. 1137–1144, 2006.
10. G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep
    belief nets,” Neural Comput., vol. 18, pp. 1527–1554, July 2006.
11. Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle, U. D. Montral, and M. Qubec,
    “Greedy layer-wise training of deep networks,” in In NIPS, MIT Press, 2007.
12. G. Tesauro, “Practical issues in temporal diﬀerence learning,” in Machine Learning,
    pp. 257–277, 1992.
13. G. E. Hinton, L. Deng, D. Yu, G. E. Dahl, A. rahman Mohamed, N. Jaitly, A. Se-
    nior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, “Deep neural
    networks for acoustic modeling in speech recognition: The shared views of four
    research groups,” IEEE Signal Process. Mag., vol. 29, no. 6, pp. 82–97, 2012.
14. D. H. Hubel and T. N. Wiesel, “Receptive fields and functional architecture of
    monkey striate cortex,” Journal of Physiology (London), vol. 195, pp. 215–243,
    1968.
15. G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with
    neural networks,” Science, 2006.
16. Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “Deepface: Closing the gap to
    human-level performance in face verification,” in Conference on Computer Vision
    and Pattern Recognition (CVPR), 2007.
17. P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, “Stacked
    denoising autoencoders: Learning useful representations in a deep network with a
    local denoising criterion,” J. Mach. Learn. Res., vol. 11, pp. 3371–3408, Dec. 2010.
18. P. Verbancsics and J. Harguess, “Generative neuroevolution for deep learning,”
    CoRR, vol. abs/1312.5355, 2013.
19. J. Lamos-Sweeney, “Deep learning using genetic algorithms. Master thesis, Insti-
    tute Thomas Golisano College of Computing and Information Sciences,” 2012.
    Advisor: Gaborski, Roger.


                                Page 171 of 171