<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Proceedings of the AAAI 2019 Spring Symposium on Combining Machine Learning with Knowledge Engineering (AAAI-MAKE 2019)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Modularity as a Means for Complexity Management in Neural Networks Learning</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Copyright held by the author(s). In A. Martin, K. Hinkelmann, A. Gerber</institution>
          ,
          <addr-line>D. Lenat, F. van Harmelen, P. Clark (Eds.)</addr-line>
          ,
          <institution>Proceedings of the AAAI 2019 Spring Symposium on Combining Machine Learning with Knowledge Engineering (AAAI-MAKE 2019). Stanford University</institution>
          ,
          <addr-line>Palo Alto, California</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>SIANI Institute and Department of Computer Science. University of Las Palmas de Gran Canaria</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <abstract>
        <p>Training a Neural Network (NN) with lots of parameters or intricate architectures creates undesired phenomena that complicate the optimization process. To address this issue we propose a first modular approach to NN design, wherein the NN is decomposed into a control module and several functional modules, implementing primitive operations. We illustrate the modular concept by comparing performances between a monolithic and a modular NN on a list sorting problem and show the benefits in terms of training speed, training stability and maintainability. We also discuss some questions that arise in modular NNs.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        There has been a recent boom in the development of Deep
Neural Networks, driven by the increase in computational power and
parallelism and their availability to researchers. This has
triggered a trend towards reaching better model performance through
growth in the number of parameters
        <xref ref-type="bibr" rid="ref18">(He et al. 2015)</xref>
        and, in general, increases in complexity
        <xref ref-type="bibr" rid="ref39">(Szegedy et al. 2014)</xref>
        <xref ref-type="bibr" rid="ref17 ref31 ref36">(Graves, Wayne, and Danihelka
2014)</xref>
        .
      </p>
      <p>
        However, training NNs with many parameters accentuates a
series of undesired phenomena, such as vanishing gradients
(Hochreiter et al. 2001), spatial crosstalk
        <xref ref-type="bibr" rid="ref24">(Jacobs
1990)</xref>
        and the appearance of local minima. In addition, the more
parameters a model has, the more data and computation time are
required for training
        <xref ref-type="bibr" rid="ref7">(Blumer et al. 1989)</xref>
        .
      </p>
      <p>
        The research community has had notable success in coping
with this scenario, often through the inclusion of priors in the
network, in the form of restrictions or conditioning. Priors are
fundamental in machine learning algorithms and have, in fact, been
the source of major breakthroughs within the field. Two well-known
cases are Convolutional Neural Networks
        <xref ref-type="bibr" rid="ref32">(Lecun 1989)</xref>
        and the Long Short-Term Memory
        <xref ref-type="bibr" rid="ref19">(Hochreiter and Schmidhuber 1997)</xref>
        . This trend has reached the extent that recent models,
developed to solve problems of moderate complexity, build upon
elaborate architectures that are specifically designed for the
problem in question
        <xref ref-type="bibr" rid="ref8">(Britz et al. 2017)</xref>
        <xref ref-type="bibr" rid="ref40">(van den Oord et al. 2016)</xref>
        .
      </p>
      <p>
        These approaches are not always sufficient, however, and
while new techniques like attention mechanisms are now enjoying
great success
        <xref ref-type="bibr" rid="ref41">(Vaswani et al. 2017)</xref>
        , they are integrated into monolithic models that tend to
suffer from overspecialization. Deep Neural Networks thus become
more and more unmanageable every time they grow in complexity, and
impossible for modest research teams to deal with, as the state of
the art is often built upon exaggerated computational resources
        <xref ref-type="bibr" rid="ref27">(Jia et al. 2018)</xref>
        .
      </p>
      <p>
        NNs were developed by mimicking biological neural
structures and functions, and have ever since continued to
be inspired by brain-related research
        <xref ref-type="bibr" rid="ref29">(Kandel, Schwartz, and
Jessell 2000)</xref>
        . Such neural structures are inherently modular, and the
human brain itself is modular at different spatial scales, as the
learning process occurs in a very localized manner
        <xref ref-type="bibr" rid="ref21">(Hrycej 1992)</xref>
        . That is to say that the human brain is
organized as functional, sparsely connected subunits. This
is known to have been influenced by the impact of efficiency
in evolution
        <xref ref-type="bibr" rid="ref26">(Jeff Clune 2013)</xref>
        .
      </p>
      <p>In addition, modularity has an indispensable role in
engineering, enabling the building of highly complex, yet manageable
systems. Modular systems can be designed, maintained and enhanced in
a controlled and methodical manner, as they ease tractability,
knowledge embedding and the reuse of preexisting parts. Modules are
divided according to functionality, following the principles of high
cohesion and low coupling, so new functionality comes as new modules
in the system, leaving the others mostly unaltered.</p>
      <p>In this paper we propose a way to integrate the prior of
modularity into the design process of NNs. We make an
initial simplified approach to modularity by working under the
assumption that a problem domain can be solved based on
a set of primitive operations. Using this proposal, we aim
to facilitate the building of complex, yet manageable
systems within the field of NNs, while enabling diverse module
implementations to coexist. Such systems may evolve by
allowing the exchange and addition of modules, regardless of
their implementation, thus avoiding the need to always start
from scratch. Our proposal should also ease the integration
of knowledge in the form of handcrafted modules, or simply
through the identification of primitives.</p>
      <p>Our main contributions are the following:</p>
      <p>We propose an initial approach to a general modular
architecture for the design and training of complex NNs. We discuss
the possibilities regarding the combination of different module
implementations, maintainability and knowledge embedding.</p>
      <p>We show that a NN designed with modularity in mind is able
to train in a shorter time and is a competitive alternative to
monolithic models.</p>
      <p>We give tips and guidelines for transferring our approach
to other problems.</p>
      <p>We give insights into the technical implications of
modular architectures in NNs.</p>
      <p>The code for the experiments in this paper is available at
gitlab.com/dcasbol/nn-modularization.</p>
    </sec>
    <sec id="sec-2">
      <title>The modular concept</title>
      <p>We propose a modular approach to NN design in which
modularity is a key factor, as is the case in engineering projects.
Our approach has many similarities to the blackboard design pattern
(D. Erman et al. 1980) and is based on a perception-action loop
(figure 1), in which the system is an agent that interacts with an
environment via an interface. The environment reflects the current
state of the problem, possibly including auxiliary elements such as
scratchpads, markers or pointers, and the interface provides a
representation R(t) of it to work with. R(t) is thus a sufficient
representation of the environment state at time t. This
representation reflects any relevant change in the environment as
soon as it occurs and, if the agent makes any changes to the
representation, the interface forwards them to the environment. This
feedback through the environment is what closes the
perception-action loop.</p>
      <p>In the middle of this loop there is a control module that
decides, conditioned on the environment’s representation and its own
internal state, which action to take at each time step. These
actions are produced by operators. Operators have a uniform
interface: they take an environment representation as input and
output a representation as well. They can therefore alter the
environment, and the control module will use them to do so until the
environment reaches a target state, which represents the problem’s
solution. As seen in figure 2, each operator is composed of a
selective input submodule, a functional submodule and a selective
update submodule. Both selective submodules act as an attention
mechanism and help to decouple the functionality of the operation
from its interface, also minimizing the number of parameters that a
neural functional submodule needs to consume the input.</p>
      <p>There is no imposed restriction regarding module
implementations, so the architecture allows the building of hybrid
systems. This has important consequences for maintenance and
knowledge embedding, understood here as the reuse of existing
software, manual coding or supervised training of modules. It is
also possible to progressively evolve such a system through the
replacement or addition of modules. In the latter case, the control
module must be updated.</p>
      <p>[Figure 1: the perception-action loop. The control module
perceives the environment through the interface representation R(t),
selects an operator from the library (OP1, OP2, ..., OPm) and
produces R(t+1). Figure 2: an operator module, composed of a
selective input submodule, a functional submodule and a selective
update submodule.]</p>
      <sec id="sec-2-6">
        <sec id="sec-2-6-1">
          <title>Motivations and architecture breakdown</title>
          <p>The architecture we propose is mainly motivated by the
idea that every problem can be decomposed into several subproblems,
and thus that the solution to a problem instance may be decomposed
into a set of primitive operations. The observation that an agent
can decide which action to take based on its perceptions, and by
these means reach a certain goal, inspired us to think about problem
solving in these terms. We also aimed to increase the degree of
maintainability and interchangeability of modules, so reducing
coupling was an important concern.</p>
          <p>In the following, we introduce the main components of
the architecture and describe their role in the system:
The environment. This represents the state of the
problem and contains all information involved in the decision
making process. The environment is rarely fully available
to the agent, so the agent can only perceive a partial
observation of it.
– The environment representation. This is an abstract
representation of the environment, which is sufficient
to estimate the state of the problem and take the
optimal action. In certain cases, where the nature of the
problem is abstract enough, this environment
representation is equivalent to the environment itself.
– The interface. Its role is to keep the environment and its
representation synchronized. Any changes that happen
in the environment will be reflected in its representation
and vice versa.</p>
          <p>The control module. This is the decision maker. It selects
which operation should be executed, according to the
current observation of the environment. This module may be
equated with the agent itself, the operation modules being
the tools that it uses for achieving its purpose.
– The digestor or perception module. This module takes
an environment representation as input, which may
have unbounded size and data type, and generates a
fixed size embedding from it. This module acts
therefore as a feature extractor for the policy.
– The policy. This module decides which operation to
execute, conditioned on the fixed size embedding that
the digestor generates.</p>
          <p>Operation modules. They implement primitive operations
which will be used to solve the problem. Their
architecture focuses on isolating functionality, while at the same
time allowing interfacing with the environment
representation.
– Selective input submodule. This module filters the
environment representation to select only the information
relevant to the operation.
– Functional submodule. This implements the
operation’s functionality.
– Selective update submodule. This uses the output of the
functional submodule to update the environment
representation.</p>
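          <p>The operation-module breakdown above can be sketched as
a small class. This is an illustrative skeleton under our own
naming, not the authors' implementation; in practice the functional
submodule could be a NN:</p>

```python
class Operator:
    """An operation module as in figure 2: a selective input
    submodule, a functional submodule and a selective update
    submodule, composed behind a uniform R(t) -> R(t+1) interface."""

    def __init__(self, select_input, functional, select_update):
        self.select_input = select_input    # filters R(t)
        self.functional = functional        # the operation itself
        self.select_update = select_update  # writes the result back

    def __call__(self, R):
        x = self.select_input(R)        # only the relevant information
        y = self.functional(x)          # compute the operation's output
        return self.select_update(R, y) # produce R(t+1)
```

For instance, a retb-like operator over a representation holding two
pointer positions could read A, compute A + 1 and write it into B,
leaving the rest of the representation untouched.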
          <p>Among the defined components, only the control
module is strongly dependent on the problem, as it implements
the logic to solve it. Other modules, like the functional and
selective submodules, might well be reused to solve other
problems, even from distinct problem families. The
perception module, on the other hand, has a strong relationship
with the environment representation, so it could mostly be
reused for problems from the same family.</p>
          <p>An important observation is that the described
architecture can also be recursive. An operation may be internally
built following the same pattern and, the other way round, a modular
system may be wrapped into an operation of higher complexity. In
such cases, an equivalence between the environment’s API and the
selective modules is implicitly established. We believe that this
feature is a fundamental advantage of our modular approach, as it
could be the starting point of a systematic methodology for building
complex intelligent systems.</p>
        </sec>
        <sec id="sec-2-6-2">
          <title>Implications regarding knowledge embedding</title>
          <p>In this paper, we focus on studying the effects of
following the modular approach in NNs. However, the proposed
architecture is agnostic to the modules' implementations. This
gives rise to a handful of scenarios in which NNs and other
implementations cooperate within the same system. The
low coupling among the different modules allows
embedding knowledge through manual implementation of
modules, adaptation of already existing software or supervised
training of NN modules.</p>
          <p>We present below some points that we believe are strong
advantages:</p>
          <p>Selective submodules restrict the information that
reaches the functional submodules and maintain the low-coupling
criterion. As their function is merely attentional, they are good
candidates to be implemented manually or by reusing existing
software elements.</p>
          <p>Interfaces of functional modules should be, by
definition, compatible among different implementations. That implies
that improvements in a module can come about progressively, in a
transparent way. Some operations would nowadays be implemented by
NNs but, if a more reliable and efficient algorithm is someday
discovered, they could be replaced without any negative side
effects. The isolation of functionality in modules also allows an
easier analysis of neural modules, enabling the design of more
efficient and explainable symbolic implementations, when
applicable.</p>
          <p>
            If high-level policy sketches
            <xref ref-type="bibr" rid="ref4">(Andreas, Klein, and Levine
2016)</xref>
            are available, modules can learn-by-role
            <xref ref-type="bibr" rid="ref2">(Andreas
et al. 2015)</xref>
            and afterwards be analyzed. In contrast, when
a policy is not available, a neural policy can be learned
using RL algorithms.
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Related work</title>
      <p>
        NNs have traditionally been regarded as modular systems
        <xref ref-type="bibr" rid="ref5">(Auda and Kamel 1999)</xref>
        . At the hardware level, the computation of neurons and
layers can be decomposed into a graph of multiplications and
additions. This has been exploited by GPU computation, enabling the
execution of independent operations in parallel, and by the
development of frameworks for this kind of computation. Regarding
computational motivations, the avoidance of coupling among neurons
and the quest for generalization and speed of learning have been the
main arguments in favor of modularity
        <xref ref-type="bibr" rid="ref6">(Bennani 1995)</xref>
        .
      </p>
      <p>
        The main application of modularity, though, has been the
construction of NN ensembles, with a focus on learning algorithms
that automate their formation
        <xref ref-type="bibr" rid="ref5">(Auda and Kamel
1999)</xref>
        <xref ref-type="bibr" rid="ref11">(Chen 2015)</xref>
        . A type of ensemble with some similarities to our proposal
is the Mixture of Experts (Jacobs et al. 1991)
        <xref ref-type="bibr" rid="ref28">(Jordan and Jacobs 1994)</xref>
        , in which a gating network selects the output from multiple
expert networks. Constructive Modularization Learning and boosting
methods pursue the divide-and-conquer idea as well, although they do
it through an automatic partitioning of the space. This automatic
treatment of the modularization process is what makes it difficult
to embed any kind of expertise in the system.
      </p>
      <p>
        In
        <xref ref-type="bibr" rid="ref2">(Andreas et al. 2015)</xref>
        , a visual question answering problem is solved with a
modular NN. Each module is targeted to learn a certain operation,
and a module layout is dynamically generated after parsing the input
question. The modules then converge to the expected functionality
due to their role in such a layout. This is an important step
towards NN modularity, despite the modules being trained jointly.
      </p>
      <p>
        Many of the ideas presented here have already been
discussed in
        <xref ref-type="bibr" rid="ref22">(Hu et al. 2017)</xref>
        . They use Reinforcement
Learning (RL) for training the policy module and
backpropagation for the rest of the modules. However, they predict the
sequence of actions in one shot and do not yet consider the
possibility of implementing the feedback loop. They
implicitly exploit modularity to some extent, as they pretrain
the policy from expert traces and use a pretrained VGG-16
network
        <xref ref-type="bibr" rid="ref31 ref36">(Simonyan and Zisserman 2014)</xref>
        , but the modules
are trained jointly afterwards. In (Hu et al. 2018) they extend this
concept by integrating a feedback loop, but substitute the hard
attention mechanism with a soft one in order to train end to end.
Thus, the modular structure is present, but the independent training
is not exploited.
      </p>
      <p>
        The idea of a NN being an agent interacting with some
environment is not new and is in fact the context in which
RL is defined
        <xref ref-type="bibr" rid="ref38">(Sutton and Barto 1998)</xref>
        . RL problems focus
on learning a policy that the agent should follow in order to
maximize an expected reward. In such cases there is usually
no point in training operation modules, as the agent
interacts simply by selecting an existing operation. RL methods
would therefore be a good option to train the control module.
      </p>
      <p>
        Our architecture proposal for the implementation of the
control module has been greatly inspired by the work in
Neural Programmer-Interpreters
        <xref ref-type="bibr" rid="ref35">(Reed and de Freitas 2015)</xref>
        .
We were also aware of the subsequent work on
generalization via recursion
        <xref ref-type="bibr" rid="ref10">(Cai, Shin, and Song 2017)</xref>
        , but we
thought it would be of greater interest to isolate the
generalization effects of modularity. An important background
for this work is, in fact, everything related to Neural
Program Synthesis and Neural Program Induction, which are
often applied to explore Domain-Specific Language (DSL)
spaces. In this regard, we were inspired by concepts and
techniques used in
        <xref ref-type="bibr" rid="ref9">(Bunel et al. 2018)</xref>
        ,
        <xref ref-type="bibr" rid="ref1">(Abolafia et al. 2018)</xref>
        and
        <xref ref-type="bibr" rid="ref16">(Devlin et al. 2017)</xref>
        .
      </p>
      <p>
        A sort of modularity is explored in
        <xref ref-type="bibr" rid="ref25">(Jaderberg et al. 2016)</xref>
        with the main intention of decoupling the learning of the
distinct layers. Although the learning has to be performed
jointly, the layers can be trained asynchronously thanks to
the synthetic gradient loosening the strong dependencies
among them. There are also other methods that are not
usually acknowledged as such but can be regarded as
modular approaches to NN training. Transfer learning
        <xref ref-type="bibr" rid="ref33">(Lorien
Y. Pratt and Kamm 1991)</xref>
is a common practice among deep learning practitioners, and
pretrained networks are also
used as feature extractors. In the field of Natural Language
Processing, a word embedding module is often applied to the
input one-hot vectors to transform the input space to a more
efficient representation
        <xref ref-type="bibr" rid="ref34">(Mikolov et al. 2013)</xref>
        . This module
is commonly provided as a set of pretrained weights or
embeddings
        <xref ref-type="bibr" rid="ref13">(Conneau et al. 2017)</xref>
        .
      </p>
    </sec>
    <sec id="sec-4">
      <title>List sorting</title>
      <p>In order to test the modular concept, we selected a
candidate problem that complied with the following desiderata:
1. It has to be as simple as possible, to avoid losing the focus
on modularity.
2. It has to be feasible to solve the problem using just a small
set of canonical operations.
3. The problem complexity has to be easily identifiable and
configurable.
4. The experiments should shed light on an actual
complexity-related issue.</p>
      <p>We found that the integer list sorting problem complied
with these points. We did not worry much about the usefulness of the
problem itself, but rather about its simplicity and the availability
of training data and execution traces. We define the list domain as
integer lists containing digits from 0 to 9 and we take the
Selection Sort algorithm as reference, which has O(n<sup>2</sup>)
time complexity but is simple in its definition. This implies that
the environment is comprised of a list of integers and two pointers,
which we name A and B. We associate the complexity level with the
maximal training list length, because it determines the internal
recurrence level of the modules. So, a network trained on complexity
N is expected to sort lists with up to N digits. This setup also
leads us to the following canonical operations:
mova. Moves the pointer A one position to the right.
movb. Moves the pointer B one position to the right.
retb. Returns the pointer B to the position to the right
of the pointer A.
swap. Exchanges the values located at the positions
pointed to by A and B.</p>
      <p>EOP. Leaves the representation unchanged and marks the
end of execution of the perception-action loop.</p>
      <p>Each problem instance can be solved based on this set
of primitive operations. At the beginning, the agent starts
with a zeroed internal state and the environment in an initial
state, where the pointers A and B are pointing to the first and
second digits respectively. We say that the environment state
is final if both pointers are pointing to the last digit of the list.
The execution nevertheless stops when the agent selects the
EOP operator. The goal is to use a NN to solve the proposed
sorting problem, after training it in a supervised fashion.</p>
      <p>Because we deal with sequential data, the most
straightforward choice was to use Recurrent Neural Networks (RNNs)
to implement the different submodules. This brings up the common
issues related to RNNs, most notably vanishing gradients and the
difficulty of parallelization due to sequential dependencies. Our
experiments will offer some clues about how a modular approach can
help to cope with such issues.</p>
    </sec>
    <sec id="sec-5">
      <title>Dynamical data generation</title>
      <p>Our dataset comprises all possible sequences of digits from
0 to 9, with lengths from 2 up to N. We represent the digits with
one-hot vectors and pad with zero vectors the positions coming after
the list has reached its end.</p>
      <p>For training, we randomly generate lists with lengths
within the range [2, N]. For testing, we evaluate the network
on samples of the maximum training length N. We also
generate algorithmic traces. After lists are sampled, we generate
the traces based on the Selection Sort algorithm, producing
the operations applied at each time step as well as the
intermediate results. These traces are intended to emulate the
availability of expert knowledge in the form of execution
examples, just as if they were previously recorded, and allow
us to measure the correctness of execution.</p>
      <p>We draw list lengths from a beta distribution that depends
on the training accuracy, B(α, β), where α = 1 + accuracy,
β = 1 + (1 − accuracy) and accuracy ∈ [0, 1]. In this way, we can
start training on shorter lengths and by the end of the training we
mainly sample long lists. We found this sampling method to be very
advantageous with respect to uniform sampling. In figure 3 we show a
comparison between both sampling methods. There, we can see the
inability of the model to converge under uniformly sampled batches,
seemingly because the complexity resides in the longer sequences,
which appear infrequently under such sampling conditions.</p>
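      <p>As an illustration, the accuracy-dependent sampler can be
written as follows. The mapping of the beta sample onto an integer
length is our own discretization choice, as the exact scheme is not
specified here:</p>

```python
import random

def sample_length(accuracy, n_max, n_min=2):
    """Draw a training-list length from B(alpha, beta), where the
    shape parameters depend on the current training accuracy."""
    alpha = 1.0 + accuracy               # shifts mass to long lists
    beta = 1.0 + (1.0 - accuracy)        # shifts mass to short lists
    u = random.betavariate(alpha, beta)  # u lies in [0, 1]
    # Map u onto the integer range [n_min, n_max] (our choice).
    return n_min + round(u * (n_max - n_min))
```

At accuracy 0 the distribution favors short lists; as accuracy
approaches 1, long lists close to N dominate the batches.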
    </sec>
    <sec id="sec-6">
      <title>Neural Network architecture</title>
      <p>We intend to evaluate the impact of modularization and
modular training in a NN and assess its technical
implications, transferring the proposed abstract architecture to a
particular case. Therefore, we have implemented a modular NN
architecture, in which every module (except selection
submodules) is a NN.</p>
      <p>This layout enables us to train the network’s modules
independently and assemble them afterwards, as well as to
treat the network as a whole and train it end to end, in the
most common fashion. In this way we are able to make a
fair comparison between the modular and the monolithic
approach. In all cases we will train the NN in a supervised
manner, based on (input, output) example pairs.</p>
      <p>The network can interact with a predefined environment
by perceiving it and acting upon it. At each execution step,
the current state representation of the environment is fed to
all existing operations. Then, the control module,
conditioned on the current and past states, selects which operation
will be run, its output becoming the next representation
(figure 4). In our implementation, we omit the interface against
the environment and establish an equivalency between the
environment and its representation.</p>
      <p>As said before, the environment has three elements: the
list and two pointers (A and B). The list is a sequence of one-hot
vectors, encoding integer digits from 0 to 9. Null values are
encoded by a zero vector. The pointers are sequences of values in
the range [0, 1], which indicate the presence of the pointer at each
position. A value over 0.5 means that the pointer is present at that
position.</p>
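      <p>The encoding just described can be sketched as below. This
is our own illustration of the scheme, with hypothetical function
names:</p>

```python
import numpy as np

def encode_environment(lst, a, b):
    """Encode the environment: digits as 10-dim one-hot vectors
    (zero vector for null/padded positions) and pointers as
    per-position presence values."""
    n = len(lst)
    digits = np.zeros((n, 10))
    for i, d in enumerate(lst):
        if d is not None:            # None marks a padded position
            digits[i, d] = 1.0
    ptr_a = np.zeros(n)
    ptr_a[a] = 1.0
    ptr_b = np.zeros(n)
    ptr_b[b] = 1.0
    return digits, ptr_a, ptr_b

def pointer_position(ptr):
    """Decode a pointer sequence: the pointer is present wherever
    the value exceeds 0.5; exactly one such position must exist."""
    hits = [i for i, v in enumerate(ptr) if v > 0.5]
    return hits[0] if len(hits) == 1 else None   # None: invalid
```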
      <p>Each operation module is implemented following the
main modular concept (figure 2). We only allow the
functional submodules to be implemented by a NN and build
the selective submodules programmatically. In ptra and
ptrb, the same pointer is selected as input and output.
retb selects A as input and updates B. swap merges the
list and both pointers into a single tensor to build the input
and updates only the list.</p>
      <p>The architecture of each functional submodule is different,
depending on the nature of the operation, but they all use at least
one LSTM cell, which always has 100 units. Pointer operations use an
LSTM cell, followed by a fully connected layer with a single output
and a sigmoid activation (figure 5). The swap submodule is based on
a bidirectional LSTM. The output sequences from the forwards and
backwards LSTMs are fed to the same fully connected layer with 11
outputs and summed afterwards. The resulting sequence is then passed
through a softmax activation (figure 6). We discard the eleventh
element of the softmax output to enable the generation of zero
vectors. The EOP operation does not follow the general operator
architecture, as it just forwards the input representation.</p>
      <p>[Figure 4: the network selects among the operation modules
ptra, ptrb, retb, swap and EOP.]</p>
      <p>The control module is intended to be capable of: 1)
perceiving the environment’s state representation, regardless of
its length and 2) conditioning itself on previous states and
actions. Therefore, its architecture is based on two LSTM
cells, running at two different recurrence levels (figure 7).
The first LSTM, which we call the digestor, consumes the
state representation one position at a time and produces a
fixed-size embedding at the last position. This fixed-size
embedding is fed to the second LSTM (the controlling
policy) as input. While the digestor’s internal state gets zeroed
before consuming each state, the controller ticks at a lower
rate and gets to keep its internal state during the whole
sorting sequence.</p>
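      <p>The two recurrence levels can be sketched as follows. For
brevity this illustration uses plain tanh RNN cells as stand-ins for
the LSTMs, and all names are our own:</p>

```python
import numpy as np

def rnn_cell(x, h, Wx, Wh):
    """Stand-in for an LSTM cell: a plain tanh RNN step."""
    return np.tanh(x @ Wx + h @ Wh)

def control_step(R, h_ctrl, params):
    """One control-module step. The digestor consumes the
    variable-length representation R position by position, with its
    state zeroed beforehand; the controller, which keeps its state
    across the whole sorting sequence, ticks once per step and maps
    the fixed-size embedding to one score per operation."""
    Wx_d, Wh_d, Wx_c, Wh_c, W_out = params
    h_dig = np.zeros(Wh_d.shape[0])     # digestor state zeroed each step
    for r in R:                         # r: one position of R(t)
        h_dig = rnn_cell(r, h_dig, Wx_d, Wh_d)
    h_ctrl = rnn_cell(h_dig, h_ctrl, Wx_c, Wh_c)  # controller tick
    scores = h_ctrl @ W_out             # one score per operation
    return scores, h_ctrl
```

Note that the digestor runs over every position of the state
representation at each step, while the controller advances only once
per selected operation, which is the two-timescale recurrence the
text describes.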
    </sec>
    <sec id="sec-7">
      <title>Experimental setup</title>
      <p>Each agent configuration is trained until specific stop criteria
are fulfilled. Our intention is that training conditions for all
configurations are as equal as possible. We make use of two
distinct measures for determining when a model has reached
a satisfactory training state and we keep a moving average
for each one of them.</p>
      <p>The quantitative measure focuses on functionality and
depends on the corresponding error rates. An output list is counted
as an error when there is any mismatch with respect to the expected
one. A pointer is considered valid if the only value above 0.5
corresponds to the right position. Otherwise it is taken as an
error.</p>
      <p>[Figure 5: a pointer operation submodule: an LSTM cell
followed by a fully connected (FC) layer with sigmoid activation,
mapping p(t) to p(t+1). Figure 7: the control module: the digestor
LSTM consumes R(t)1 ... R(t)m and produces the embedding e(t), which
feeds the controller LSTM.]</p>
      <p>The qualitative measure depends on the percentage of output
values that do not comply with a certain saturation requirement,
namely that the difference with respect to the one-hot label is not
greater than 0.1.</p>
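      <p>Both measures can be expressed compactly. The following is
our own sketch of the two checks, with hypothetical names:</p>

```python
def pointer_valid(ptr, target_pos):
    """Quantitative check for pointers: the only value above 0.5
    must sit at the expected position."""
    above = [i for i, v in enumerate(ptr) if v > 0.5]
    return above == [target_pos]

def saturation_errors(outputs, labels, tol=0.1):
    """Qualitative measure: fraction of output values whose
    difference from the one-hot label exceeds tol."""
    flat = [(o, l) for out, lab in zip(outputs, labels)
            for o, l in zip(out, lab)]
    bad = sum(1 for o, l in flat if abs(o - l) > tol)
    return bad / len(flat)
```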
      <p>The monolithic configuration is trained until the
quantitative measure reaches values below 1%. The training of the
operation modules also takes the qualitative measure into
account and only stops when both measures are below 1%. To
constrain the training time, the training of the
monolithic configuration is also stopped if the progress of
the loss value becomes stagnant or if the loss falls below
1e-6.</p>
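<p>As a sketch, the stop criterion can be written as follows; the window size is our assumption, since the text only states that a moving average is kept for each measure.</p>

```python
from collections import deque

class StopCriterion:
    """Moving averages of the quantitative and qualitative error rates;
    training stops once both averages fall below the target (1%).
    The window size of 100 is an assumption for illustration."""

    def __init__(self, window=100, target=0.01):
        self.quant = deque(maxlen=window)
        self.qual = deque(maxlen=window)
        self.target = target

    def update(self, quant_err, qual_err):
        self.quant.append(quant_err)
        self.qual.append(qual_err)

    def should_stop(self):
        if not self.quant:
            return False
        mean = lambda d: sum(d) / len(d)
        return mean(self.quant) < self.target and mean(self.qual) < self.target
```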
      <p>It is convenient that the modules work well when several
operations are concatenated; that is why we require the
quality criterion and why we train them under noisy
conditions. In this regard, we apply noise sampled from a uniform
distribution to the inputs, deviating them from pure {0, 1}
values by up to 0.4 (eq. 1). Inputs that represent
a one-hot vector are extended to 11 elements before adding
the uniform noise and passing them through a softmax-like
function (eq. 2) in order to keep the values on the softmax
manifold. The eleventh element is then discarded again to
allow the generation of zero vectors.</p>
      <p>x̂_uniform = |x − U(0, 0.4)|  (1)
x̂_softmax = softmax(x̂_uniform · 100)  (2)</p>
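<p>Under these definitions, the noise injection for a one-hot input can be sketched in plain Python (variable and function names are ours; the ×100 factor and the 11th padding element follow eqs. 1 and 2).</p>

```python
import math
import random

def softmax(z):
    # numerically stable softmax
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def noisy_one_hot(x, max_dev=0.4):
    """Deviate a one-hot input from pure {0, 1} values (eq. 1) and
    re-project it onto the softmax manifold (eq. 2). An 11th element
    is appended before the noise and discarded afterwards, which is
    what allows the output to approximate a zero vector."""
    x = list(x) + [0.0]                                            # extend to 11 elements
    x_uniform = [abs(v - random.uniform(0.0, max_dev)) for v in x] # eq. 1
    x_softmax = softmax([100.0 * v for v in x_uniform])            # eq. 2
    return x_softmax[:-1]                                          # discard the 11th element
```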
      <p>
        Every configuration is trained in a supervised fashion,
making use of the cross-entropy loss and Adam
        <xref ref-type="bibr" rid="ref31">(Kingma
and Ba 2014)</xref>
        as the optimizer. The learning rate is kept the
same (1e-3) across all configurations. The cross-entropy
loss is generally computed with respect to the
corresponding output, with the exception of the monolithic
configuration, which is provided with an additional cross-entropy loss
computed over the selection vectors. This last detail is relevant
because it enables the monolithic configuration to learn the
algorithm and the operations simultaneously.
      </p>
      <p>We have built three different training setups: module-wise
training, monolithic training and staged training. In
the staged training, each module is trained independently,
but after every 100 training iterations the modules are
assembled and tested in the assembled configuration.</p>
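<p>Staged training can be sketched as the following loop; the module, assembly and evaluation interfaces are hypothetical stand-ins, as the text only specifies the 100-iteration cadence.</p>

```python
def staged_training(modules, train_step, assemble, evaluate,
                    total_iters=1000, test_every=100):
    """Train each module independently, but every `test_every`
    iterations assemble the modules and test the full configuration."""
    history = []
    for it in range(1, total_iters + 1):
        for module in modules:           # independent per-module updates
            train_step(module)
        if it % test_every == 0:         # periodic assembled test
            history.append((it, evaluate(assemble(modules))))
    return history
```

With mock modules this runs end to end; in practice `train_step` would perform one gradient update on that module alone.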
      <p>
        The monolithic configuration is the most unstable and the
results vary significantly between runs. We therefore
average the results over 5 runs in order to
alleviate this effect. To ease training under this configuration,
we also make the selection provided by the training trace
override the control output, as it is done in
        <xref ref-type="bibr" rid="ref9">(Bunel et al.
2018)</xref>
        . This mechanism is deactivated during testing, but
during training it allows the operations to converge faster,
regardless of the performance of the control module.
      </p>
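<p>The override mechanism amounts to a form of teacher forcing on the selection vector; a minimal sketch (names are ours) could look like this.</p>

```python
def forward_step(controller, modules, state, trace_selection, training):
    """One step of the monolithic configuration with trace override:
    the controller proposes a selection vector over operation modules,
    but during training the ground-truth selection from the trace
    replaces it. The proposed vector is still returned so the extra
    cross-entropy loss over selections can be computed on it."""
    proposed = controller(state)
    selection = trace_selection if training else proposed
    op_index = max(range(len(selection)), key=lambda i: selection[i])
    return modules[op_index](state), proposed
```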
      <p>We tried to compare results against a baseline model
using a multilayer bidirectional LSTM, but such a configuration
did not converge to any valid solution, so we
discarded it for the sake of clarity.</p>
    </sec>
    <sec id="sec-8">
      <title>Experimental results</title>
      <p>After training both modular and monolithic configurations,
we saw that the training time is orders of magnitude shorter
when training is performed module-wise (figure 8), despite the
requirements being stronger. It is important to stress that in
this case we considered only the worst-case scenario, counting
the time it takes to train the modules sequentially.
The training time can be reduced even further if all modules
are trained in parallel.</p>
      <p>We also see in figure 9 how each training progresses in
a very different manner, the modular configuration
needing much less time than the monolithic one to reach
low error rates. Though it takes more training iterations,
these are faster to compute and the error rate is more stable.
We were curious about this behaviour and conducted additional
measurements of the gradient (figure 10). We then
saw that the gradient is much richer in the monolithic
case, with a higher mean absolute value per parameter and
greater variations. This makes sense, as backpropagation
through time accumulates gradient at every time step
and the monolithic configuration has a recurrence of O(N²),
so its gradient is more informative and can capture complex
relations, even between modules.</p>
      <p>By observing the training data more thoroughly, we can
appreciate the relative complexity of the different modules.
In figure 11 we plot the loss curves for each module in
the network when trained independently. Pointer operations
converge very quickly, as they only learn to delay the
input in one time step. The swap operation needs more time,
but thanks to the bidirectional configuration each LSTM just
needs to remember one digit (listing 1). Surprisingly, the
control module does not need as much time as the swap
module to converge, even though it has to learn how to
digest the list into an embedded representation and to use its
internal memory for remembering past actions. This could
again be a consequence of a richer gradient.</p>
      <p>Listing 1: Example of a bidirectional LSTM performing the
swap operation on a list. An underscore represents a
zero vector.</p>
      <p>L = 3, 9, 5, 4, _
A = 0, 1, 0, 0, 0
B = 0, 0, 0, 1, 0
Forward LSTM Output  = 3, _, 5, 9, _
Backward LSTM Output = _, 4, _, _, _
Merge by addition   &gt; 3, 4, 5, 9, _</p>
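<p>The merge step in listing 1 is plain element-wise addition of the two directional outputs; treating each zero vector as 0, the example can be reproduced as follows.</p>

```python
def merge_by_addition(forward, backward):
    """Element-wise sum of the forward and backward LSTM outputs: each
    direction emits the carried digit at its target position and zero
    vectors (represented here as 0) everywhere else."""
    return [f + b for f, b in zip(forward, backward)]

forward_out  = [3, 0, 5, 9, 0]   # forward pass moves 9 into position B
backward_out = [0, 4, 0, 0, 0]   # backward pass carries 4 back to position A
merged = merge_by_addition(forward_out, backward_out)   # [3, 4, 5, 9, 0]
```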
      <p>Regarding generalization, in figure 12 we show the
behaviour of each configuration when tested on list lengths
not seen during training. The monolithic configuration
generalizes better to longer lists and its performance degrades
more slowly and smoothly than that of the modular one. This seems to
contradict the common belief that modular NNs are able
to achieve better generalization. However, we hypothesize
this could happen due to the additional restrictions applied
to the modular training, such as the random input noise and
the output saturation requirements.</p>
      <p>In the modular configuration, we tried to compensate for the
lack of such gradient quality with ad-hoc loss functions and
training conditions, but adding such priors can also
backfire. This is therefore a phenomenon that should be
considered when designing modular NNs. In this case, we tried a
learning rate 5 times higher to compensate for the weaker
gradient and obtained a training time reduction of more
than a half, with a slight increase in generalization (figure
13). Further study of the special treatment of modular NNs
could be part of future research.</p>
    </sec>
    <sec id="sec-9">
      <title>Conclusions</title>
      <p>We proposed a modular approach to NN design based on a
perception-action loop, in which the whole system is
functionally divided into several modules with standardized
interfaces. These modules can all be trained, either
independently or jointly, or explicitly specified by a human
programmer. We have shown how a list sorting problem can
be solved by a NN following this modular architecture and
how modularity has a very positive impact on training speed
and stability. There seems to be a trade-off with respect
to generalization and the number of training steps though,
which suffer from not having access to a global
gradient and from excessive restrictions during training. We give
insights into this phenomenon and suggestions to address it.</p>
      <p>Designing modular NNs can lead to a better utilization
of the computational resources and data available, as well as an
easier integration of expert knowledge, as discussed in
this document. NN modules under this architecture are easily
upgradeable or interchangeable with alternative
implementations. Future research should explore these kinds of scenarios
and practical implementations of the modular concept. The
effects of modular training on non-recurrent NN modules
should be studied as well. Moreover, what we have
introduced is an initial approach, so further investigations may
reveal a variety of faults and improvements, in particular
regarding the application of this concept to problems of higher
complexity.</p>
    </sec>
    <sec id="sec-10">
      <title>Acknowledgements</title>
      <p>We would like to thank the reviewers for their valuable
opinions and suggestions, which have helped us to substantially
improve the quality of this article.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Abolafia</surname>
            ,
            <given-names>D. A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Norouzi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Shen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Q. V.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Neural Program Synthesis with Priority Queue Training</article-title>
          . ArXiv e-prints.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Andreas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Rohrbach</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Darrell</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Klein</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <article-title>Deep compositional question answering with neural module networks</article-title>
          .
          <source>CoRR abs/1511</source>
          .02799.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Andreas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Klein</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Levine</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Modular Multitask Reinforcement Learning with Policy Sketches</article-title>
          . arXiv e-prints arXiv:
          <volume>1611</volume>
          .
          <fpage>01796</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Auda</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Kamel</surname>
            ,
            <given-names>M. S.</given-names>
          </string-name>
          <year>1999</year>
          .
          <article-title>Modular neural networks: a survey</article-title>
          .
          <source>International journal of neural systems</source>
          <volume>9</volume>
          :
          <fpage>129</fpage>
          -
          <lpage>51</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Bennani</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <year>1995</year>
          .
          <article-title>A modular and hybrid connectionist system for speaker identification</article-title>
          .
          <source>Neural computation</source>
          <volume>7</volume>
          :
          <fpage>791</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Blumer</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ehrenfeucht</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Haussler</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Warmuth</surname>
            ,
            <given-names>M. K.</given-names>
          </string-name>
          <year>1989</year>
          .
          <article-title>Learnability and the Vapnik-Chervonenkis dimension</article-title>
          .
          <source>J. ACM</source>
          <volume>36</volume>
          (
          <issue>4</issue>
          ):
          <fpage>929</fpage>
          -
          <lpage>965</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Britz</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Goldie</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Luong</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Massive Exploration of Neural Machine Translation Architectures</article-title>
          . ArXiv e-prints.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Bunel</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Hausknecht</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Singh</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Kohli</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Leveraging grammar and reinforcement learning for neural program synthesis</article-title>
          . In
          <source>International Conference on Learning Representations</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Cai</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Shin</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Song</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Making Neural Programming Architectures Generalize via Recursion</article-title>
          . ArXiv e-prints.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Deep and Modular Neural Networks</article-title>
          . Springer. chapter
          <volume>28</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Conneau</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lample</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ranzato</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Denoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Jégou</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Word Translation Without Parallel Data</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <year>1980</year>
          .
          <article-title>The hearsay-ii speech-understanding system: Integrating knowledge to resolve uncertainty</article-title>
          .
          <source>ACM Comput. Surv.</source>
          <volume>12</volume>
          :
          <fpage>213</fpage>
          -
          <lpage>253</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Bunel</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Singh</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Hausknecht</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Kohli</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Neural Program Meta-Induction</article-title>
          . ArXiv e-prints.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <surname>Graves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wayne</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Danihelka</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>Neural turing machines</article-title>
          .
          <source>CoRR abs/1410</source>
          .5401.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <surname>He</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Deep residual learning for image recognition</article-title>
          .
          <source>CoRR abs/1512</source>
          .03385.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>Hochreiter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>1997</year>
          .
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural Comput</source>
          .
          <volume>9</volume>
          (
          <issue>8</issue>
          ):
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          2001.
          <article-title>Gradient flow in recurrent nets: the difficulty of learning long-term dependencies</article-title>
          . In Kremer, and Kolen, eds.,
          <source>A Field Guide to Dynamical Recurrent Neural Networks</source>
          . IEEE Press.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>Hrycej</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <year>1992</year>
          .
          <article-title>Modular learning in neural net- works: A modularized approach to classification</article-title>
          . New York: Wiley.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>Hu</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Andreas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Rohrbach</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Darrell</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Saenko</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Learning to Reason: End-to-End Module Networks for Visual Question Answering</article-title>
          . arXiv e-prints arXiv:
          <volume>1704</volume>
          .
          <fpage>05526</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          1991.
          <article-title>Adaptive mixtures of local experts</article-title>
          .
          <source>Neural Computation</source>
          <volume>3</volume>
          (
          <issue>1</issue>
          ):
          <fpage>79</fpage>
          -
          <lpage>87</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <surname>Jacobs</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <year>1990</year>
          .
          <article-title>Task Decomposition Through Competition in a Modular Connectionist Architecture</article-title>
          ,
          <source>PhD Thesis</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <surname>Jaderberg</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Czarnecki</surname>
            ,
            <given-names>W. M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Osindero</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Vinyals</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Graves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Kavukcuoglu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Decoupled neural interfaces using synthetic gradients</article-title>
          .
          <source>CoRR abs/1608</source>
          .05343.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <surname>Clune</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Mouret</surname>
            ,
            <given-names>J.-B.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Lipson</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <year>2013</year>
          .
          <article-title>The evolutionary origins of modularity</article-title>
          .
          <source>Proceedings of the Royal Society B</source>
          <volume>280</volume>
          (
          <issue>1755</issue>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <surname>Jia</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Song</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>He</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Rong</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Xie</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Hu</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Shi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Chu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes</article-title>
          . ArXiv e-prints.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <surname>Jordan</surname>
            ,
            <given-names>M. I.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Jacobs</surname>
            ,
            <given-names>R. A.</given-names>
          </string-name>
          <year>1994</year>
          .
          <article-title>Hierarchical mixture of experts and the em algorithm</article-title>
          .
          <source>Neural Computation</source>
          <volume>6</volume>
          (
          <issue>2</issue>
          ):
          <fpage>181</fpage>
          -
          <lpage>214</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <string-name>
            <surname>Kandel</surname>
            ,
            <given-names>E. R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Schwartz</surname>
            ,
            <given-names>J. H.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Jessell</surname>
            ,
            <given-names>T. M.</given-names>
          </string-name>
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <source>Principles of Neural Science (4th Ed.)</source>
          . New York: McGraw-Hill.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <string-name>
            <surname>Kingma</surname>
            ,
            <given-names>D. P.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Ba</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>Adam: A Method for Stochastic Optimization</article-title>
          . ArXiv e-prints.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <string-name>
            <surname>Lecun</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <year>1989</year>
          .
          <article-title>Generalization and network design strategies</article-title>
          .
          <source>Elsevier</source>
          .
          <fpage>143</fpage>
          -
          <lpage>155</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <string-name>
            <surname>Pratt</surname>
            ,
            <given-names>L. Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Mostow</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Kamm</surname>
            ,
            <given-names>C. A.</given-names>
          </string-name>
          <year>1991</year>
          .
          <article-title>Direct transfer of learned information among neural networks</article-title>
          .
          <source>Proceedings of the Ninth National Conference on Artificial Intelligence (AAAI-91</source>
          )
          <fpage>584</fpage>
          -
          <lpage>589</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2013</year>
          .
          <article-title>Efficient Estimation of Word Representations in Vector Space</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          <string-name>
            <surname>Reed</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>de Freitas</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Neural Programmer-Interpreters</article-title>
          . ArXiv
          e-prints.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          <string-name>
            <surname>Simonyan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>Very Deep Convolutional Networks for Large-Scale Image Recognition</article-title>
          . ArXiv e-prints arXiv:1409.1556.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          <string-name>
            <surname>arXiv</surname>
          </string-name>
          e-prints arXiv:
          <volume>1409</volume>
          .
          <fpage>1556</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          <string-name>
            <surname>Sutton</surname>
            ,
            <given-names>R. S.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Barto</surname>
            ,
            <given-names>A. G.</given-names>
          </string-name>
          <year>1998</year>
          .
          <article-title>Reinforcement Learning: An Introduction</article-title>
          . Cambridge, MA: MIT Press.
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          <string-name>
            <surname>Szegedy</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Jia</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Sermanet</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Reed</surname>
            ,
            <given-names>S. E.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Anguelov</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Erhan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Vanhoucke</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Rabinovich</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>Going deeper with convolutions</article-title>
          .
          <source>CoRR abs/1409.4842</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          <string-name>
            <surname>van den Oord</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kalchbrenner</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Vinyals</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Espeholt</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Graves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Kavukcuoglu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Conditional Image Generation with PixelCNN Decoders</article-title>
          . ArXiv e-prints.
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          <string-name>
            <surname>Vaswani</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Shazeer</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Parmar</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Uszkoreit</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gomez</surname>
            ,
            <given-names>A. N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kaiser</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Polosukhin</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Attention Is All You Need</article-title>
          . ArXiv e-prints.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>