<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Implementation of Evolutionary Algorithms for Deep Architectures</article-title>
      </title-group>
      <abstract>
        <p>Deep learning is becoming an increasingly interesting and powerful machine learning method with successful applications in many domains, such as natural language processing, image recognition, and hand-written character recognition. Despite its eminent success, limitations of the traditional learning approach may still prevent deep learning from achieving a wide range of realistic learning tasks. Due to the flexibility and proven effectiveness of evolutionary learning techniques, they may therefore play a crucial role towards unleashing the full potential of deep learning in practice. Unfortunately, many researchers with a strong background in evolutionary computation are not fully aware of the state-of-the-art research on deep learning. To close this knowledge gap and to promote research on evolutionary inspired deep learning techniques, this paper presents a comprehensive review of the latest deep architectures and surveys important evolutionary algorithms that can potentially be explored for training these deep architectures.</p>
      </abstract>
      <kwd-group>
        <kwd>Deep Architectures</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Evolutionary Algorithms</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Deep Neural Networks (DNNs) are Artificial Neural Networks (ANNs) with multiple hidden layers. One of the major problems of DNNs is
overfitting, which was unaddressed till 2014 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Further, due to the extensive use of
gradient descent based learning techniques, DNNs may easily be trapped in
local optima, resulting in undesirable learning performance. Moreover, the initial
topology of a DNN is often determined through a seemingly arbitrary trial and
error process. However, the fixed topology thus created may seriously affect the
learning flexibility and practical applicability of DNNs. Deep learning has also been
applied to other machine learning paradigms such as Support Vector Machines and
Reinforcement Learning.
      </p>
      <p>In this paper, we argue that Evolutionary Computation (EC) techniques can,
to a large extent, present satisfactory and effective solutions to the above-mentioned
problems. In fact, several neuroevolutionary systems have been successfully
developed to solve various challenging learning tasks with remarkably better
performance than traditional learning techniques. Unfortunately, many researchers
with a strong background in evolutionary computation are still not fully aware
of the state-of-the-art research on deep learning. To close this knowledge gap
and to promote research on evolutionary inspired deep learning techniques,
this paper presents a review of the latest deep architectures and surveys important
evolutionary algorithms that can potentially be explored for training these deep
architectures. This paper is divided into four sections. Section 1 details the
history of deep architectures. Section 2 provides a detailed study on various deep
architectures. Recent implementations of evolutionary algorithms on deep
architectures are explored in Section 3. Section 4 summarizes the paper with outcomes
and conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>Deep Architectures</title>
      <p>A deep architecture is a hierarchical structure of multiple layers, with each layer
being self-trained to learn from the output of its preceding layer. This learning
process, i.e., 'deep learning', is based on distributed representation learning with
multiple levels of representation across the layers. In simple terms, each layer
learns a new feature from its preceding layer, which makes the learning process
concrete. Thus, the learning process is hierarchical, with low-level features at the
bottom and very high-level features at the top, with intermediate features in the
middle that can also be utilized. From these features, a greedy layer-wise training
mechanism enables extraction of only those features that are useful for learning.
Along with this, unsupervised pre-training with unlabelled data makes deep
learning more effective.</p>
      <p>
        Shallow architectures have only two levels of computation and learning
elements, which makes them inefficient at handling complex training data [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Deep
architectures require fewer computational units and allow non-local generalization,
which results in increased comprehensibility and efficiency, as demonstrated
by their success in Natural Language Processing (NLP) and image processing.
According to the complexity theory of circuits, deep architectures can be
exponentially more efficient than traditional shallow architectures in terms of functional
representation for problem solving [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Traditional Artificial Neural Networks
(ANNs) are considered to be most suitable for implementing deep architectures.
      </p>
      <p>
        In 1980 Fukushima proposed the Neocognitron using Convolutional Neural
Networks (ConvNets) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which served as a successful model for later work on deep
architectures and was later improved by Lecun [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The theoretical concepts
of deep architectures were proposed in 1998 by Lecun [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The breakthrough in
the research on training deep architectures was achieved in 2006, when Lecun,
G.E. Hinton and Yoshua Bengio proposed three different types of deep architectures
with efficient training mechanisms. Lecun implemented an efficient training
mechanism for ConvNets [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], in which he had not been successful earlier. Hinton implemented
Deep Belief Networks (DBNs) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and Yoshua Bengio proposed Stacked
Autoencoders [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        A simple form of deep architecture implementation is the DNN: a feed-forward
ANN with more than one hidden layer, which makes it more powerful than
a normal ANN [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. DNNs are trained with back-propagation (BP) using discriminative probabilistic
models that calculate the difference between target outputs and actual outputs.
The weights in a DNN are updated using stochastic gradient descent, defined
as ∆w_ij(t + 1) = ∆w_ij(t) + η ∂C/∂w_ij, where η represents the learning rate, C
is the associated cost function and w_ij represents a weight. For larger training
sets, DNNs may be trained in multiple batches of small sizes without losing
efficiency [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. However, it is very complex to train DNNs with many layers and
many hidden units since the number of parameters to be optimized is very
high.
      </p>
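      <p>As a minimal illustration of the weight update described above, the following Python/NumPy sketch applies one stochastic gradient descent step to a single linear layer with a squared-error cost; it is not taken from the cited works, and the layer sizes, learning rate and data are hypothetical.</p>
      <preformat>
import numpy as np

# Hypothetical sizes: 4 inputs, 3 outputs, mini-batch of 8 examples.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))             # mini-batch of inputs
T = rng.normal(size=(8, 3))             # target outputs
W = rng.normal(scale=0.1, size=(4, 3))  # weights w_ij
b = np.zeros(3)
eta = 0.01                              # learning rate (eta in the text)

# Forward pass: actual outputs of the layer.
Y = X @ W + b

# Squared-error cost C and its gradient with respect to the weights.
C = 0.5 * np.mean(np.sum((Y - T) ** 2, axis=1))
dC_dY = (Y - T) / X.shape[0]
dC_dW = X.T @ dC_dY                     # partial C / partial w_ij
dC_db = dC_dY.sum(axis=0)

# Stochastic gradient descent step: w_ij = w_ij - eta * dC/dw_ij
W -= eta * dC_dW
b -= eta * dC_db
      </preformat>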
      <sec id="sec-2-1">
        <title>Convolutional Neural Networks (ConvNets)</title>
        <p>
          ConvNets are a special type of feed-forward ANN that performs feature
extraction by applying convolution and sub-sampling. The principal
application of ConvNets is feature identification. ConvNets are biologically
inspired MLPs based on the visual cortex principle [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]; the earliest implementation is by Fukushima in 1980 [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] for pattern recognition, followed by Lecun in 1998 [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. (Fig. 1: ConvNets structure proposed by Lecun [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].) ConvNets diversify by
applying local connections, sub-sampling and weight sharing, which is
similar to the approach of early ANNs in the 1960s. In ConvNets, each unit in
a layer receives input from a set of units in small groups from its neighbouring
layer, which is similar to the earlier MLP model. Using local connections for feature
extraction has proven successful, especially for extracting edges, end points
and corners. The features extracted at the initial layer are combined
subsequently at later layers to achieve higher or better features. The features that
are detected at the initial stages may also be used at subsequent stages. The
training procedure of the ConvNets is shown in Fig. 1. The first layer takes raw
pixels from a 32 x 32 input image. The second layer consists of 6 kernels
with a 5 x 5 local window. From this, sub-sampling is done in the 3rd
(sub-sampling) layer. For the 4th layer, another convolutional layer with 16 kernels is
used with the same 5 x 5 windows. The 5th layer is again constructed
using sub-sampling. This procedure continues till the last layer, and the network
ends with Gaussian connections.
        </p>
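        <p>To make the convolution-and-subsampling idea concrete, the sketch below (an illustrative NumPy fragment, not code from [9]) convolves a 32 x 32 input with six 5 x 5 kernels and then applies 2 x 2 average-pooling subsampling, mirroring the first three layers described above; the random image and kernels are placeholders.</p>
        <preformat>
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((32, 32))            # raw 32 x 32 input pixels
kernels = rng.normal(size=(6, 5, 5))    # six 5 x 5 kernels (placeholder values)

def convolve_valid(img, k):
    """Valid 2-D convolution (cross-correlation) of img with kernel k."""
    kh, kw = k.shape
    out_h, out_w = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * k)
    return out

def subsample(fmap, size=2):
    """2 x 2 average-pooling subsampling of a feature map."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h * size, :w * size].reshape(h, size, w, size).mean(axis=(1, 3))

# Layer 2: six 28 x 28 feature maps; layer 3: six 14 x 14 subsampled maps.
feature_maps = [np.tanh(convolve_valid(image, k)) for k in kernels]
subsampled = [subsample(f) for f in feature_maps]
print(subsampled[0].shape)              # (14, 14)
        </preformat>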
      </sec>
      <sec id="sec-2-2">
        <title>Deep Belief Networks</title>
        <p>
          Deep Belief Network (DBN) is a type of DNN proposed by Hinton in 2006 [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
A DBN is based on the MLP model with greedy layer-wise training. A DBN consists of
multiple interconnected hidden layers, with each layer acting as the input to the
next layer and being visible only to that next layer. Each layer in a DBN has no
lateral connections between its nodes. The nodes of a DBN
are probabilistic logic nodes, thus allowing the use of an activation
function. A Restricted Boltzmann Machine (RBM) is a stochastic ANN with visible
and hidden units, with every connection joining a hidden and a visible
unit. RBMs act as the building blocks of DBNs because of their capability of
learning probabilistic distributions over their inputs. Initially the first layer of the
DBN is trained as an RBM that transforms input into output. The output thus
obtained is used as data for the second layer, which is treated as an RBM for
the next level of training, and the process continues. Similarly, the output of the
second layer will be the input for the third layer, and so on, as
shown in Fig. 2. The transformation of data is done using an activation function
or sampling. In this way the subsequent hidden layer becomes a visible layer for
the current hidden layer so as to train it as an RBM. An RBM with two layers, a
visible layer as layer 1 and a hidden layer as layer 2, is the simplest form of DBN.
The units of the visible layer are used to represent data and the hidden units
(with no connections between them) learn to represent features. If a hidden
layer 3 is added to this, then layer 2 will be visible only to layer 3 (still hidden
to layer 1) and the RBM will now transform the data from layer 2 to layer 3.
This process is illustrated in Fig. 2. After this layer-wise pre-training, the
entire network is fine-tuned with supervised training. The significance of this training
procedure is determined by the generative weights. After learning, the values of the
latent variables in every layer can be inferred by a single, bottom-up pass that
starts with an observed data vector in the bottom layer and uses the generative weights
in the reverse direction. DBNs have proved to be highly effective in image
recognition [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], face recognition [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], character recognition [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] and various other
applications.
        </p>
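        <p>The layer-by-layer RBM training outlined above can be sketched as follows. This is an illustrative one-step contrastive divergence (CD-1) fragment in NumPy, not the implementation from [10]; the layer sizes, data, learning rate and epoch count are hypothetical, and bias terms are omitted for brevity.</p>
        <preformat>
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=10, lr=0.1):
    """Train one RBM with CD-1 and return (weights, hidden activations)."""
    n_visible = data.shape[1]
    W = rng.normal(scale=0.01, size=(n_visible, n_hidden))
    for _ in range(epochs):
        # Positive phase: visible units drive the hidden units.
        h_prob = sigmoid(data @ W)
        h_sample = (h_prob > rng.random(h_prob.shape)).astype(float)
        # Negative phase: reconstruct the visible units, then the hidden units.
        v_recon = sigmoid(h_sample @ W.T)
        h_recon = sigmoid(v_recon @ W)
        # CD-1 weight update.
        W += lr * (data.T @ h_prob - v_recon.T @ h_recon) / len(data)
    return W, sigmoid(data @ W)

# Greedy layer-wise stacking: the hidden activations of one trained RBM
# become the "visible" data of the next, as described in the text.
data = (rng.random((100, 64)) > 0.5).astype(float)   # hypothetical binary data
weights, layer_input = [], data
for n_hidden in [32, 16]:
    W, layer_input = train_rbm(layer_input, n_hidden)
    weights.append(W)
        </preformat>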
      </sec>
      <sec id="sec-2-3">
        <title>Stacked Auto-encoders</title>
        <p>
          The idea of auto-encoders evolved from the process of reducing the dimensionality
of data by identifying an efficient method to transform complex high-dimensional
data into a lower-dimensional code using an encoding multilayer ANN. A decoder
network is then used to recover the data from the code. Initially both the encoder
and decoder networks are assigned random weights and trained by
observing the discrepancy between the original data and the output obtained from encoding
and decoding. The error is then back-propagated, first through the decoder
network and then through the encoder network, and this entire system is named an
auto-encoder [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
        </p>
        <p>
          An auto-encoder with input x ∈ R^d is “encoded” into h ∈ R^d′ using a
deterministic function defined as h = f_θ(x) = σ(Wx + b), with θ = {W, b}. To “decode”, a reverse
mapping y = f_θ′(h) = σ(W′h + b′) is used, with θ′ = {W′, b′} and W′ = W^T,
so that encoding and decoding share the same weights. This process is repeated for
every training pattern: for training example i, x_i is mapped to h_i with a reconstruction
y_i. Parameter optimization is achieved by minimizing the cost function over the
training set. However, optimizing an auto-encoder network with multiple hidden
layers is difficult. Similar to the DBN greedy layer-wise training procedure,
this approach replaces RBMs by auto-encoders that perform learning by
reproducing every data vector from its own feature activations [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. The considerable
change that Yoshua Bengio applied in this model is changing the unsupervised
training procedure to supervised training in order to identify the significance
of the training paradigm.
        </p>
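        <p>A minimal NumPy reading of these equations (illustrative only; the dimensions and data are hypothetical) encodes x with f_θ(x) = σ(Wx + b), decodes with the tied weights W′ = W^T, and measures the reconstruction cost over a small training set.</p>
        <preformat>
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

d, d1 = 8, 3                       # input dimension d, code dimension d'
W = rng.normal(scale=0.1, size=(d1, d))
b = np.zeros(d1)                   # encoder bias
b1 = np.zeros(d)                   # decoder bias

def encode(x):
    # h = f_theta(x) = sigma(W x + b)
    return sigmoid(W @ x + b)

def decode(h):
    # y = f_theta'(h) = sigma(W' h + b') with tied weights W' = W^T
    return sigmoid(W.T @ h + b1)

# Reconstruction cost over a hypothetical training set of patterns x_i.
X = rng.random((20, d))
cost = np.mean([np.sum((decode(encode(x)) - x) ** 2) for x in X])
print(cost)
        </preformat>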
        <p>
          The process of greedy layer-wise training is as follows. In the entire ANN,
three layers are considered at one instance, with the middle layer being the hidden
layer. In the next instance, the middle layer becomes the input layer, the output
layer of the previous instance becomes the hidden layer (the parameters from the
output become the training parameters), and the layer next to it becomes the
new output layer. This process continues through the entire network. However, the
results were not efficient since the network becomes too greedy [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. It can be
concluded that the performance of stacked auto-encoders with unsupervised
training was almost similar to that of RBMs with a similar type of training, whereas
stacked auto-encoders with supervised pre-training are less efficient. Stacked
auto-encoders were not successful in ignoring random noise in their training data, due to
which their performance is slightly lower (almost equal, but not the same)
than that of RBM-based deep architectures. However, this gap was narrowed by the stacked
denoising auto-encoder algorithm proposed in 2010 [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ].
        </p>
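        <p>The sliding three-layer window described above can be expressed as a short loop. The sketch below is a schematic outline, not the procedure from [5] or [11]: it trains one tied-weight auto-encoder per layer with a few plain gradient steps and feeds each layer's codes to the next; all sizes, rates and data are hypothetical.</p>
        <preformat>
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(data, n_hidden, epochs=50, lr=0.1):
    """Train one tied-weight auto-encoder and return (W, b, codes)."""
    n_in = data.shape[1]
    W = rng.normal(scale=0.1, size=(n_hidden, n_in))
    b, b1 = np.zeros(n_hidden), np.zeros(n_in)
    for _ in range(epochs):
        h = sigmoid(data @ W.T + b)        # encode: h = sigma(W x + b)
        y = sigmoid(h @ W + b1)            # decode with tied weights W' = W^T
        dy = (y - data) * y * (1 - y)      # gradient at the decoder pre-activation
        dh = (dy @ W.T) * h * (1 - h)      # gradient at the encoder pre-activation
        W -= lr * (h.T @ dy + dh.T @ data) / len(data)
        b1 -= lr * dy.mean(axis=0)
        b -= lr * dh.mean(axis=0)
    return W, b, sigmoid(data @ W.T + b)

# Greedy layer-wise stacking: each layer is trained on the codes produced
# by the previously trained layer, then the whole stack can be fine-tuned.
data = rng.random((100, 16))               # hypothetical training data
stack, layer_input = [], data
for n_hidden in [8, 4]:
    W, b, layer_input = train_autoencoder(layer_input, n_hidden)
    stack.append((W, b))
        </preformat>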
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Applying Evolutionary Algorithms on Deep</title>
    </sec>
    <sec id="sec-4">
      <title>Architectures</title>
      <p>3.1</p>
      <sec id="sec-4-1">
        <title>Generative Neuroevolution for Deep Learning</title>
        <p>
          In 2013 Phillip Verbancsics and Josh Harguess proposed generative
neuroevolution for deep learning by implementing HyperNEAT as a feature learner
on an ANN similar to ConvNets [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. The Compositional Pattern Producing Network
(CPPN) is an indirect encoding used by HyperNEAT that encodes the weight
patterns of an ANN using composite functions. The topology and weights required
for the CPPN are evolved by HyperNEAT. In the HyperNEAT process, the CPPN defines an
ANN as a solution to the required problem. The CPPN's fitness score is determined by
evaluating the ANN's performance on the task for which it is evolved.
Diverging from traditional methods, this approach trains the ANN to learn features by
transforming inputs into features. These features are then evaluated by a Machine
Learning (ML) approach, thus defining the fitness of the CPPN. Therefore, this
process will maximize the performance of the learned solution, since HyperNEAT
determines the features that work best with the chosen ML approach. ConvNets can be
represented in a graph-like structure with the coordinates of the nodes associated with
each other, which is similar to the HyperNEAT substrate. This similarity makes it possible
to apply HyperNEAT to ConvNet-based architectures.
        </p>
        <p>
          For the experiment, an eight-dimensional hypercube representation of the CPPN
is used, with the f-axis as the feature axis, the x-axis as the neuron constellation of each feature
and the y-axis giving pixel locations. The HyperNEAT topology is a multilayer neural
network with layers traveling along the z-axis, and the CPPN represents points in
an eight-dimensional hypercube that correspond to connections in the four-dimensional
substrate. The location of each neuron can be identified by an (x, y, f, z)
coordinate, and each layer can be represented by a trait constituting the number of
features (F) with X and Y dimensions. HyperNEAT is applied to LeNet-5 [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
The experiment is conducted on the MNIST database with a population size of 250,
with 30 runs of 2500 generations each. From these comparative results it was
concluded that HyperNEAT with ANN architectures is outperformed by HyperNEAT
with the CNN architecture.
        </p>
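        <p>To illustrate the indirect encoding, the following sketch (a schematic illustration, not the HyperNEAT implementation from [18]) queries a toy CPPN-style composite function at pairs of four-dimensional substrate coordinates (x, y, f, z) to produce connection weights; the composite function, layer layout and coordinate values are all assumptions made for this example. In HyperNEAT itself, the CPPN's topology and weights would be evolved, and the resulting ANN's task performance would supply the fitness.</p>
        <preformat>
import numpy as np

def toy_cppn(coords):
    """A fixed composite of simple functions standing in for an evolved CPPN.
    `coords` is the 8-vector (x1, y1, f1, z1, x2, y2, f2, z2): the source and
    target neuron positions in the four-dimensional substrate."""
    x1, y1, f1, z1, x2, y2, f2, z2 = coords
    return (np.sin(x1 - x2) * np.exp(-((y1 - y2) ** 2))
            + 0.5 * np.cos(f1 + f2) * (z2 - z1))

def substrate_weights(layer_a, layer_b, cppn):
    """Query the CPPN once for every connection between two substrate layers."""
    W = np.zeros((len(layer_a), len(layer_b)))
    for i, src in enumerate(layer_a):
        for j, dst in enumerate(layer_b):
            W[i, j] = cppn(np.concatenate([src, dst]))
    return W

# Hypothetical 2 x 2, single-feature layers at depths z = 0 and z = 1.
layer0 = np.array([[x, y, 0.0, 0.0] for x in (0.0, 1.0) for y in (0.0, 1.0)])
layer1 = np.array([[x, y, 0.0, 1.0] for x in (0.0, 1.0) for y in (0.0, 1.0)])
W01 = substrate_weights(layer0, layer1, toy_cppn)
print(W01.shape)    # (4, 4) connection weights generated by the CPPN
        </preformat>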
      </sec>
      <sec id="sec-4-2">
        <title>Deep Learning using Genetic Algorithm</title>
        <p>
          In 2012, Joshua Lamos-Sweeney proposed a learning method for deep architectures using a genetic
algorithm [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. A DNN for image classification is implemented by training each layer with a genetic
algorithm. Further, this study
tries to justify the possibility of using genetic algorithms to train non-trivial
DNNs for feature extraction. Initially a matrix representing the DNN is
generated with a sparse network design, with most of the values being close to zero,
whereas the ideal solution in this case is an identity matrix. Instead of re-generating
the complete matrix, only the genetic sequence of each individual's non-zero elements
(each of which is considered a gene) is kept and computed, which reduces both the
amount of data required to store the matrix and the processing complexity. The
position of a gene in the matrix is determined by its row and column, and
every gene has a magnitude.
        </p>
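        <p>A schematic rendering of this sparse gene representation (illustrative only; the tuple layout, mutation scheme and fitness function are assumptions, not details from [19]) stores each non-zero matrix entry as a (row, column, magnitude) gene and evolves individuals towards the identity-matrix target mentioned above.</p>
        <preformat>
import numpy as np

rng = np.random.default_rng(0)
ROWS, COLS = 8, 8            # hypothetical layer-matrix dimensions

def random_individual(n_genes=10):
    """An individual is a list of genes; each gene is (row, col, magnitude)."""
    return [(rng.integers(ROWS), rng.integers(COLS), rng.normal(scale=0.1))
            for _ in range(n_genes)]

def to_matrix(genes):
    """Expand the sparse gene list into the full matrix only when needed."""
    M = np.zeros((ROWS, COLS))
    for r, c, mag in genes:
        M[r, c] += mag
    return M

def fitness(genes):
    """Assumed fitness: closeness to the identity matrix, the ideal solution
    mentioned in the text (negative squared error, higher is better)."""
    return -np.sum((to_matrix(genes) - np.eye(ROWS)) ** 2)

def mutate(genes, rate=0.2):
    """Perturb the magnitude of each gene with a small probability."""
    return [(r, c, mag + rng.normal(scale=0.05)) if rate > rng.random()
            else (r, c, mag) for r, c, mag in genes]

# Simple truncation-selection loop over the sparse gene lists.
population = [random_individual() for _ in range(20)]
for _ in range(100):
    population.sort(key=fitness, reverse=True)
    population = population[:10] + [mutate(p) for p in population[:10]]
best = max(population, key=fitness)
        </preformat>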
        <p>The proposed algorithm is tested on image data normalized to the range
of 0.0 to 1.0. Apart from being applied to generic image data, the algorithm has been
applied to handwriting, face images (small and large) and cat image identification.
The experimental results section shows the reconstruction (of input) error rate
for each experiment. Another experiment on the reconstruction of faces from noisy
data claims to show that this algorithm is not just copying blocks of data, but is
generating the connections in the data and reconstructing the initial image. The
theoretical limitations of the algorithm are not addressed. The cost of
reconstruction becomes 0 for a single training image, as the method is effective only with a large
set of data.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>This paper provides a theoretical review of standard deep architectures and
studies the possibilities of implementing evolutionary computation principles on
deep architectures. Apart from introducing various types of deep architectures,
this paper provides a detailed explanation of their training procedures and
implementations. Further, this paper analyses the implications of applying
evolutionary algorithms to deep architectures, with details of two such implementations
and a critical review of their achievements. The neuroevolution approach for
deep architectures discussed in the previous section concerns the
application of HyperNEAT to deep architectures. The success of this proposed
method cannot be determined, since CNNs hold the best classification results for the MNIST
database. Nevertheless, it opens a way of implementing neuroevolution algorithms on
deep architectures. Similarly, the second work, using genetic algorithms for
training DNNs, justifies the possibility of using genetic algorithms for training
deep architectures but does not provide any comparative study of its
efficiency with respect to speed or quality.</p>
      <p>It is noteworthy that evolutionary algorithms may not be a complete
replacement for deep learning algorithms, at least not at this stage. However, the
successful application of evolutionary techniques to deep architectures will lead
to an improved learning mechanism for deep architectures. This might result in
reduced training time, which is the main drawback of deep architectures. A
future direction in this research could be evolving optimized deep-architecture-based
neural networks using neuroevolutionary principles. This could provide a
warm start to the deep learning process and could improve the performance of
deep learning algorithms.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          , K. Kavukcuoglu, and
          <string-name>
            <given-names>C.</given-names>
            <surname>Farabet</surname>
          </string-name>
          , “
          <article-title>Convolutional networks and applications in vision</article-title>
          ,” in
          <source>Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on</source>
          , pp.
          <fpage>253</fpage>
          -
          <lpage>256</lpage>
          , May
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>J.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xu</surname>
          </string-name>
          , and E. Chen, “
          <article-title>Image denoising and inpainting with deep neural networks</article-title>
          ,” in NIPS,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          , “
          <article-title>Learning deep architectures for AI</article-title>
          ,”
          <source>Foundations and Trends in Machine Learning</source>
          , vol.
          <volume>2</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>127</lpage>
          ,
          <year>2009</year>
          .
          <article-title>Also published as a book</article-title>
          .
          <source>Now Publishers</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>N.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          , “
          <article-title>Dropout: A simple way to prevent neural networks from overfitting</article-title>
          ,
          <source>” Journal of Machine Learning Research</source>
          , vol.
          <volume>15</volume>
          , pp.
          <fpage>1929</fpage>
          -
          <lpage>1958</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          and Y. LeCun, “
          <article-title>Scaling learning algorithms towards AI</article-title>
          ,” in Large-Scale Kernel Machines (L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, eds.), MIT Press,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>K.</given-names>
            <surname>Fukushima</surname>
          </string-name>
          , “
          <article-title>Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position</article-title>
          ,
          <source>” Biological Cybernetics</source>
          , vol.
          <volume>36</volume>
          , pp.
          <fpage>193</fpage>
          -
          <lpage>202</lpage>
          ,
          <year>1980</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Boser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Denker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Henderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. E.</given-names>
            <surname>Howard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Hubbard</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L. D.</given-names>
            <surname>Jackel</surname>
          </string-name>
          , “
          <article-title>Handwritten digit recognition with a back-propagation network,”</article-title>
          <source>in Advances in Neural Information Processing Systems (NIPS 1989)</source>
          (D. Touretzky, ed.), vol.
          <volume>2</volume>
          , (Denver, CO), Morgan Kaufman,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lecun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bottou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Haffner</surname>
          </string-name>
          , “
          <article-title>Gradient-based learning applied to document recognition,”</article-title>
          <source>in Proceedings of the IEEE</source>
          , pp.
          <fpage>2278</fpage>
          -
          <lpage>2324</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Ranzato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Poultney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chopra</surname>
          </string-name>
          , and Y. LeCun, “
          <article-title>Efficient learning of sparse representations with an energy-based model</article-title>
          ,” in NIPS, pp.
          <fpage>1137</fpage>
          -
          <lpage>1144</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Osindero</surname>
          </string-name>
          , and Y.-W. Teh, “
          <article-title>A fast learning algorithm for deep belief nets,” Neural Comput</article-title>
          ., vol.
          <volume>18</volume>
          , pp.
          <fpage>1527</fpage>
          -
          <lpage>1554</lpage>
          ,
          <year>July 2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lamblin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Popovici</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Larochelle</surname>
          </string-name>
          , U. D.
          <string-name>
            <surname>Montral</surname>
            , and
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Qubec</surname>
          </string-name>
          , “
          <article-title>Greedy layer-wise training of deep networks</article-title>
          ,” in NIPS
          , MIT Press,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. G. Tesauro, “
          <article-title>Practical issues in temporal difference learning</article-title>
          ,
          <source>” in Machine Learning</source>
          , pp.
          <fpage>257</fpage>
          -
          <lpage>277</lpage>
          ,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Dahl</surname>
          </string-name>
          , A.-r.
          <string-name>
            <surname>Mohamed</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Jaitly</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Senior</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Vanhoucke</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>T. N.</given-names>
          </string-name>
          <string-name>
            <surname>Sainath</surname>
            , and
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Kingsbury</surname>
          </string-name>
          , “
          <article-title>Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Process</article-title>
          . Mag., vol.
          <volume>29</volume>
          , no.
          <issue>6</issue>
          , pp.
          <fpage>82</fpage>
          -
          <lpage>97</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>D. H.</given-names>
            <surname>Hubel</surname>
          </string-name>
          and
          <string-name>
            <given-names>T. N.</given-names>
            <surname>Wiesel</surname>
          </string-name>
          , “
          <article-title>Receptive fields and functional architecture of monkey striate cortex</article-title>
          ,
          <source>” Journal of Physiology (London)</source>
          , vol.
          <volume>195</volume>
          , pp.
          <fpage>215</fpage>
          -
          <lpage>243</lpage>
          ,
          <year>1968</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          and
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          , “
          <article-title>Reducing the dimensionality of data with neural networks</article-title>
          ,” Science,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Taigman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ranzato</surname>
          </string-name>
          , and L. Wolf, “
          <article-title>Deepface: Closing the gap to human-level performance in face verification,” in Conference on Computer Vision and Pattern Recognition (CVPR</article-title>
          ),
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>P.</given-names>
            <surname>Vincent</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Larochelle</surname>
          </string-name>
          , I. Lajoie,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          , and P.-A. Manzagol, “
          <article-title>Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,”</article-title>
          <source>J. Mach. Learn. Res.</source>
          , vol.
          <volume>11</volume>
          , pp.
          <fpage>3371</fpage>
          -
          <lpage>3408</lpage>
          , Dec.
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <given-names>P.</given-names>
            <surname>Verbancsics</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Harguess</surname>
          </string-name>
          , “
          <article-title>Generative neuroevolution for deep learning</article-title>
          ,
          <source>” CoRR</source>
          , vol.
          <source>abs/1312.5355</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <given-names>J.</given-names>
            <surname>Lamos-Sweeney</surname>
          </string-name>
          , “
          <article-title>Deep learning using genetic algorithms</article-title>
          ,”
          <source>Master's thesis</source>
          , Thomas Golisano College of Computing and Information Sciences, Rochester Institute of Technology,
          <year>2012</year>
          . Advisor: Gaborski, Roger.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>