Modularity as a Means for Complexity Management in Neural Networks Learning

         David Castillo-Bolado                        Cayetano Guerra-Artal                Mario Hernandez-Tejera
     david.castillo@siani.es           cayetano.guerra@ulpgc.es                mario.hernandez@ulpgc.es
           SIANI Institute and Department of Computer Science. University of Las Palmas de Gran Canaria


                            Abstract                                 rated architectures that are specifically designed to the prob-
                                                                     lem in question (Britz et al. 2017) (van den Oord et al. 2016).
  Training a Neural Network (NN) with lots of parameters or
                                                                        But these approaches are not always sufficient and, while
  intricate architectures creates undesired phenomena that com-
  plicate the optimization process. To address this issue we pro-    new techniques like attention mechanisms are now enjoy-
  pose a first modular approach to NN design, wherein the NN         ing great success (Vaswani et al. 2017), they are integrated
  is decomposed into a control module and several functional         in monolithic approaches that tend to suffer from overspe-
  modules, implementing primitive operations. We illustrate          cialization. Thus Deep Neural Networks become more and
  the modular concept by comparing performances between a            more unmanageable every time they grow in complexity;
  monolithic and a modular NN on a list sorting problem and          impossible for modest research teams to deal with, as the
  show the benefits in terms of training speed, training stabil-     state of the art is often built upon exaggerated computational
  ity and maintainability. We also discuss some questions that       resources (Jia et al. 2018).
  arise in modular NNs.
                                                                        NNs were developed by mimicking biological neural
                                                                     structures and functions, and have ever since continued to
                        Introduction                                 be inspired by brain-related research (Kandel, Schwartz, and
                                                                     Jessell 2000). Such neural structures are inherently modu-
There has been a recent boom in the development of Deep              lar and the human brain itself is modular in different spa-
Neural Networks, promoted by the increase in compu-                  tial scales, as the learning process occurs in a very localized
tational power and parallelism and its availability to re-           manner (Hrycej 1992). That is to say that the human brain is
searchers. This has triggered a trend towards reaching better        organized as functional, sparsely connected subunits. This
model performances via the growth of the number of param-            is known to have been influenced by the impact of efficiency
eters (He et al. 2015) and, in general, increment in complex-        in evolution (Jeff Clune 2013).
ity (Szegedy et al. 2014) (Graves, Wayne, and Danihelka                 In addition, modularity has an indispensable role in en-
2014).                                                               gineering and enables the building of highly complex, yet
   However, training NNs with lots of parameters empha-              manageable systems. Modular systems can be designed,
sizes a series of undesired phenomena, such as gradient              maintained and enhanced in a very controlled and method-
vanishing (Hochreiter et al. 2001), spatial crosstalk (Jacobs        ological manner, as they ease tractability, knowledge embed-
1990) and the appearance of local minima. In addition, the           ding and the reuse of preexisting parts. Modules are divided
more parameters a model has, the more data and computa-              according on functionality, according to the rules of high co-
tion time are required for training (Blumer et al. 1989).            hesion and low coupling, so new functionality will come as
   The research community has had notable success in cop-            new modules in the system, leaving the others mostly unal-
ing with this scenario, often through the inclusion of priors        tered.
in the network, as restrictions or conditionings. The priors            In this paper we propose a way to integrate the prior of
are fundamental in machine learning algorithms and they              modularity into the design process of NNs. We make an ini-
have been, in fact, the main source of major breakthroughs           tial simplified approach to modularity by working under the
within the field. Two well known cases are the Convolu-              assumption that a problem domain can be solved based on
tional Neural Networks (Lecun 1989) and the Long Short-              a set of primitive operations. Using this proposal, we aim
Term Memory (Hochreiter and Schmidhuber 1997). This                  to facilitate the building of complex, yet manageable sys-
trend has reached the extent that recent models, developed to        tems within the field of NNs, while enabling diverse module
solve problems of moderate complexity, build up on elabo-            implementations to coexist. Such systems may evolve by al-
Copyright held by the author(s). In A. Martin, K. Hinkelmann, A.     lowing the exchange and addition of modules, regardless of
Gerber, D. Lenat, F. van Harmelen, P. Clark (Eds.), Proceedings of   their implementation, thus avoiding the need to always start
the AAAI 2019 Spring Symposium on Combining Machine Learn-           from scratch. Our proposal should also ease the integration
ing with Knowledge Engineering (AAAI-MAKE 2019). Stanford            of knowledge in the form of handcrafted modules, or simply
University, Palo Alto, California, USA, March 25-27, 2019.           through the identification of primitives.
  Our main contributions are:
• We propose an initial approach to a general modular ar-
  chitecture for the design and training of complex NNs.
• We discuss the possibilities regarding the combination
  of different module implementations, maintainability and                                            Environment
  knowledge embedding.
• We prove that a NN designed with modularity in mind is                       Interface                            Interface
  able to train in a shorter time and is also a competitive
  alternative to monolithic models.
                                                                                  R(t)                               R(t+1)
• We give tips and guidelines for transferring our approach
  to other problems.
• We give insights into the technical implications of modu-                     Control                  selection
  lar architectures in NNs.                                                     Module
  The code for the experiments in this paper is available at
gitlab.com/dcasbol/nn-modularization.
                                                                                                              Operators Library
                 The modular concept
                                                                                                       OP1        OP2      ···                     OPm
We are proposing a modular approach to NN design in which
modularity is a key factor, as is the case in engineering
projects. Our approach has many similarities to the black-
board design pattern (D. Erman et al. 1980) and is based on        Figure 1: Perception-action loop. Each module is suscepti-
a perception-action loop (figure 1), in which the system is        ble to being implemented by a NN. At each step, the control
an agent that interacts with an environment via an interface.      module selects an operator to be applied and this will gener-
The environment reflects the current state of the problem,         ate the next environment’s state.
possibly including auxiliary elements such as scratchpads,
markers or pointers, and the interface provides a represen-
tation R(t) of it to work with. R(t) is thus a sufficient rep-
resentation of the environment state at time t. This repre-
sentation will match any relevant change in the environment
as soon as they occur and, if any changes were made by the
agent in the representation, the interface will be able to for-
ward them to the environment. This feedback through the
environment is what closes the perception-action loop.
   In the middle of this loop there is a control module that de-
cides, conditioned on the environment’s representation and
its own internal state, which action to take at each time step.                                          Operator module
These actions are produced by operators. Operators have a
uniform interface: they admit an environment representation
                                                                                                                                selective update
                                                                                    selective input


as input and they output a representation as well. They can
                                                                                                             functional
therefore alter the environment and they will be used by the            R(t)
                                                                                                             submodule
                                                                                                                                                         R(t+1)
control module to do so until the environment reaches a tar-
get state, which represents the problem’s solution. As seen
in figure 2, each operator is composed of a selective input
submodule, a functional submodule and a selective update
submodule. Both selective submodules act as an attention
mechanism and help to decouple the functionality of the op-
eration from its interface, minimizing as well the number of
parameters that a neural functional submodule would need           Figure 2: Detail of an operator module, composed of
to consume the input.                                              an input-selection submodule, a functional kernel and a
   There is no imposed restriction regarding module imple-         selective-update submodule. Dashed lines highlight the se-
mentations and therefore the architecture allows the build-        lected data.
ing of hybrid systems. This has important consequences
concerning maintenance and knowledge embedding, under-
stood as the reutilization of existing software, manual coding
or supervised training of modules. There is also the possi-
bility of progressively evolving such a system through the
replacement or addition of modules. In the latter case, the        – Selective update submodule. This uses the output of the
control module must be updated.                                      functional submodule to update the environment repre-
                                                                     sentation.
Motivations and architecture breakdown
                                                                    Among the defined components, only the control mod-
The architecture we propose is mainly motivated by the idea      ule is strongly dependent on the problem, as it implements
that every problem can be decomposed in several subprob-         the logic to solve it. Other modules, like the functional and
lems and thus a solution to a problem’s instance may be de-      selective submodules, might well be reused to solve other
composed into a set of primitive operations. The observa-        problems, even from distinct problem families. The percep-
tion, that an agent can decide which action to take based on     tion module, on the other hand, has a strong relationship
its perceptions and by these means reach a certain goal, in-     with the environment representation, so it could mostly be
spired us to think about problem solving in these terms. We      reused for problems from the same family.
were also aimed to increase the degree of maintainability           An important appreciation is that the described architec-
and interchangeability of modules, thus reducing the cou-        ture can also be recursive. An operation may be internally
pling was an important matter.                                   built following the same pattern and, the other way round,
   In the following, we introduce the main components of         a modular system may be wrapped into an operation of a
the architecture and describe their role in the system:          higher complexity. In such cases, an equivalency between
                                                                 the environment’s API and the selective modules is implic-
• The environment. This represents the state of the prob-
                                                                 itly established. We believe that this feature is a fundamental
  lem and contains all information involved in the decision
                                                                 advantage of our modular approach, as it could be the start-
  making process. The environment is rarely fully available
                                                                 ing point for a systematic methodology for building complex
  to the agent, so the agent can only perceive a partial ob-
                                                                 intelligent systems.
  servation of it.
  – The environment representation. This is an abstract          Implications regarding knowledge embedding
    representation of the environment, which is sufficient
    to estimate the state of the problem and take the opti-      In this paper, we focus on studying the effects of following
    mal action. In certain cases, where the nature of the        the modular approach in NNs. However, the proposed ar-
    problem is abstract enough, this environment represen-       chitecture is agnostic to the module’s implementations. This
    tation is equivalent to the environment itself.              gives rise to a handful of scenarios in which NNs and other
                                                                 implementations cooperate within the same system. The
  – The interface. Its role is to keep the environment and its
                                                                 low coupling among the different modules allows embed-
    representation synchronized. Any changes that happen
                                                                 ding knowledge through manual implementation of mod-
    in the environment will be reflected in its representation
                                                                 ules, adaptation of already existing software or supervised
    and vice versa.
                                                                 training of NN modules.
• The control module. This is the decision maker. It selects        We present below some points that we believe are strong
  which operation should be executed, according to the cur-      advantages:
  rent observation of the environment. This module may be
  equated with the agent itself, the operation modules being     • Selective submodules restrict the information that reaches
  the tools that it uses for achieving its purpose.                the functional submodules and maintain the low coupling
                                                                   criteria. As their function is merely attentive, they are
  – The digestor or perception module. This module takes           very eligible to be implemented manually or by reusing
    an environment representation as input, which may              existing software elements.
    have unbounded size and data type, and generates a
    fixed size embedding from it. This module acts there-        • Interfaces of functional modules should be, by definition,
    fore as a feature extractor for the policy.                    compatible among different implementations. That im-
                                                                   plies that improvements in the module can come about
  – The policy. This module decides which operation to             progressively in a transparent way. Some operations
    execute, conditioned on the fixed size embedding that          would nowadays be implemented by NNs but, if some-
    the digestor generates.                                        day a more reliable and efficient algorithm is discovered,
• Operation modules. They implement primitive operations           they could be replaced without any negative side effect.
  which will be used to solve the problem. Their architec-       • The isolation of functionality in modules allows an eas-
  ture focuses on isolating functionality, while at the same       ier analysis of neural modules, enabling the design of
  time allowing interfacing with the environment represen-         more efficient and explainable symbolic implementations,
  tation.                                                          when applicable.
  – Selective input submodule. This module filters the en-       • If high-level policy sketches (Andreas, Klein, and Levine
    vironment representation to select only the information        2016) are available, modules can learn-by-role (Andreas
    relevant to the operation.                                     et al. 2015) and afterwards be analyzed. In contrast, when
  – Functional submodule. This implements the opera-               a policy is not available, a neural policy can be learned
    tion’s functionality.                                          using RL algorithms.
                      Related work                                alization effects of modularity. An important background
NNs have traditionally been regarded as a modular system          for this work is, in fact, everything related to Neural Pro-
(Auda and Kamel 1999). At the hardware level, the com-            gram Synthesis and Neural Program Induction, which are
putation of neurons and layers can be decomposed down to          often applied to explore Domain-Specific Language (DSL)
a graph of multiplications and additions. This has been ex-       spaces. In this regard, we were inspired by concepts and
ploited by GPU computation, enabling the execution of non-        techniques used in (Bunel et al. 2018), (Abolafia et al. 2018)
dependant operations in parallel, and by the development of       and (Devlin et al. 2017).
frameworks for this kind of computation. Regarding compu-            A sort of modularity is explored in (Jaderberg et al. 2016)
tational motivations, the avoidance of coupling among neu-        with the main intention of decoupling the learning of the
rons and the quest for generalization and speed of learning       distinct layers. Although the learning has to be performed
have been the main arguments used in favor of modularity          jointly, the layers can be trained asynchronously thanks to
(Bennani 1995).                                                   the synthetic gradient loosening the strong dependencies
   The main application of modularity has been the con-           among them. There are also other methods that are not
struction of NN ensembles though, focusing on learning al-        usually acknowledged as such but can be regarded as mod-
gorithms that automate their formation (Auda and Kamel            ular approaches to NN training. Transfer learning (Lorien
1999) (Chen 2015). A type of ensemble with some sim-              Y. Pratt and Kamm 1991) is a common practise among the
ilarities to our proposal is the Mixture of Experts (Jacobs       deep learning practitioners and pretrained networks are also
et al. 1991) (Jordan and Jacobs 1994), in which a gating          used as feature extractors. In the field of Natural Language
network selects the output from multiple expert networks.         Processing, a word embedding module is often applied to the
Constructive Modularization Learning and boosting meth-           input one-hot vectors to transform the input space to a more
ods pursue the divide-and-conquer idea as well, although          efficient representation (Mikolov et al. 2013). This module
they do it through an automatic partitioning of the space.        is commonly provided as a set of pretrained weights or em-
This automatic treatment of the modularization process what       beddings (Conneau et al. 2017).
makes difficult to embed any kind of expertise in the system.
   In (Andreas et al. 2015), a visual question answering                                  List sorting
problem is solved with a modular NN. Each module is tar-           In order to test the modular concept, we selected a candidate
geted to learn a certain operation and a module layout is          problem that complied with following desiderata:
dynamically generated after parsing the input question. The       1. It has to be as simple as possible, to avoid missing the
modules then converge to the expected functionality due to            focus in modularity.
their role in such layout. This is an important step towards
NN modularity, despite the modules being trained jointly.         2. It has to be feasible to solve the problem using just a small
   Many of the ideas presented here have already been dis-            set of canonical operations.
cussed in (Hu et al. 2017). They use Reinforcement Learn-         3. The problem complexity has to be easily identifiable and
ing (RL) for training the policy module and backpropaga-              configurable.
tion for the rest of the modules. However, they predict the       4. The experiments should shed light into an actual
sequence of actions in one shot and do not yet consider the           complexity-related issue.
possibility of implementing the feedback loop. They im-
plicitly exploit modularity to some extent, as they pretrain          We found that the integer list sorting problem complied
the policy from expert traces and use a pretrained VGG-16          with these points. We did not worry too much about the
network (Simonyan and Zisserman 2014), but the modules             usefulness of the problem solving itself, but rather about its
are trained jointly afterwards. In (Hu et al. 2018) they extend    simplicity and the availability of training data and execution
this concept integrating a feedback loop, but substituting the     traces. We define the list domain as integer lists containing
hard attention mechanism by a soft one in order to do an           digits from 0 to 9 and we take the Selection Sort algorithm
end-to-end training. Thus, the modular structure is present,       as reference, which has O(n2 ) time complexity, but is sim-
but the independent training is not exploited.                     ple in its definition. This implies that the environment will
   The idea of a NN being an agent interacting with some           be comprised by a list of integers and two pointers, which
environment is not new and is in fact the context in which         we name A and B. We associate the complexity level to
RL is defined (Sutton and Barto 1998). RL problems focus           the maximal training lists length, because it will determine
on learning a policy that the agent should follow in order to      the internal recurrence level of the modules. So, a network
maximize an expected reward. In such cases there is usually        trained on complexity N is expected to sort lists with up to
no point in training operation modules, as the agent inter-        N digits. This setup also leads us to the following canonical
acts simply by selecting an existing operation. RL methods         operations:
would therefore be a good option to train the control module.      • mova. Moves the pointer A one position to the right.
   Our architecture proposal for the implementation of the         • movb. Moves the pointer B one position to the right.
control module has been greatly inspired by the work in
Neural Programmer-Interpreters (Reed and de Freitas 2015).         • retb. Returns the pointer B to the position to the right
We were also aware of the subsequent work on general-                 of the pointer A.
ization via recursion (Cai, Shin, and Song 2017), but we           • swap. Exchanges the values located at the positions
thought it would be of greater interest to isolate the gener-         pointed by A and B.
• EOP. Leaves the representation unchanged and marks the                                                       Training progress (max. length 5)
                                                                                              1.0
  end of execution of the perception-action loop.                                                                                           Curriculum Learning
   Each problem instance can be solved based on this set                                                                                    Standard Training
of primitive operations. At the beginning, the agent starts                                   0.8
with a zeroed internal state and the environment in an initial


                                                                        Combined error rate
state, where the pointers A and B are pointing to the first and
second digits respectively. We say that the environment state                                 0.6
is final if both pointers are pointing to the last digit of the list.
The execution nevertheless stops when the agent selects the                                   0.4
EOP operator. The goal is to use a NN to solve the proposed
sorting problem, after training it in a supervised fashion.
   Because we deal with sequential data, the most straight-                                   0.2
forward choice was to use Recurrent Neural Networks
(RNN) to implement the different submodules. This brings
up the common issues related to RNNs, which are for the                                       0.0
most part the gradient vanishing and the parallelization dif-                                       0   1000     2000     3000      4000      5000    6000
ficulties due to sequential dependencies. Our experiments                                                           Wall-clock time (seconds)
will offer some clues about how a modular approach can
help to cope with such issues.                                          Figure 3: Progress of the error rate for standard training and
                                                                        curriculum learning. Standard training was stopped before
              Dynamical data generation                                 quantitative convergence, after 7 hours of execution without
                                                                        improving.
Our dataset comprises all possible sequences with digits
from 0 to 9, with lengths that go from 2 digits long up to
a length of N . We represent those digits with one-hot vec-             treat the network as a whole and train it end to end, in the
tors and we pad with zero-vectors the positions coming after            most common fashion. In this way we are able to make a
the list has reached its end.                                           fair comparison between the modular and the monolithic ap-
   For training, we randomly generate lists with lengths                proach. In all cases we will train the NN in a supervised
within the range [2, N ]. For testing, we evaluate the network          manner, based on (input, output) example pairs.
on samples of the maximum training length N . We also gen-                 The network can interact with a predefined environment
erate algorithmic traces. After lists are sampled, we generate          by perceiving it and acting upon it. At each execution step,
the traces based on the Selection Sort algorithm, producing             the current state representation of the environment is fed to
the operations applied at each time step as well as the in-             all existing operations. Then, the control module, condi-
termediate results. These traces are intended to emulate the            tioned on the current and past states, selects which operation
availability of expert knowledge in the form of execution ex-           will be run, its output becoming the next representation (fig-
amples, just as if they were previously recorded, and allow             ure 4). In our implementation, we omit the interface against
us to measure the correctness of execution.                             the environment and establish an equivalency between the
   We draw list lengths from a beta distribution that depends           environment and its representation.
on the training accuracy B(α, β), where α = 1 + accuracy,                  As said before, the environment has three elements: the
β = 1 + (1 − accuracy) and accuracy ∈ [0, 1]. In this                   list and two pointers (A and B). The list is a sequence of
way, we can start training on shorter lengths and by the end            one-hot vectors, encoding integer digits that go from 0 to
of the training we mainly sample long lists. We found this              9. Null values are encoded by a zero vector. The pointers
sampling method to be very advantageous with respect to                 are sequences of values comprised in the range [0, 1], which
the uniform sampling. In figure 3 we show a comparison                  indicate the presence of the pointer at each position. A value
between both sampling methods. There, we can see the in-                over 0.5 means that the pointer is present at that position.
ability of the model to converge under uniformly sampled                   Each operation module is implemented following the
batches, seemingly because of the complexity residing in the            main modular concept (figure 2). We only allow the func-
longer sequences and their infrequent appearance under such             tional submodules to be implemented by a NN and build
sampling conditions.                                                    the selective submodules programmatically. In ptra and
                                                                        ptrb, the same pointer is selected as input and output.
             Neural Network architecture                                retb selects A as input and updates B. swap merges the
We intend to evaluate the impact of modularization and                  list and both pointers into a single tensor to build the input
modular training in a NN and assess its technical implica-              and updates only the list.
tions, transferring the proposed abstract architecture to a par-           The architecture of each functional submodule is differ-
ticular case. Therefore, we have implemented a modular NN               ent, depending on the nature of the operation, but they all
architecture, in which every module (except selection sub-              use at least one LSTM cell, which always has 100 units.
modules) is a NN.                                                       Pointer operations use an LSTM cell, followed by a fully
   This layout enables us to train the network’s modules in-            connected layer with a single output and a sigmoid activa-
dependently and assemble them afterwards, as well as to                 tion (figure 5). The swap submodule is based on a bidirec-
                                                                                            p(t+1)i

                                                                                            sigmoid

                       ptra                                                                   FC
                                                                   p(t)                          h(t)i                  p(t+1)
                       ptrb
                                                                              c(t)i-1                      c(t)i
                                                                                            LSTM
                       retb                                                   h(t)i-1                      h(t)i

                       swap                                                                   p(t)i

                        EOP
                                                                   Figure 5: Architecture of the pointers’ functional submod-
                                                                   ule, with an LSTM and a fully connected output layer with
                                                                   sigmoid activation. c(t)i and h(t)i are the LSTM’s internal
Figure 4: Implementation of the perception-action loop for
                                                                   state and output at each time step and position i.
the list sorting problem. At each time step, the control mod-
ule selects the output of one operation, which substitutes the
previous representation.


tional LSTM. The output sequences from both forwards and
backwards LSTM are fed to the same fully connected layer
with 11 outputs and summed afterwards. This resulting se-
quence is then passed through a softmax activation (figure
6). We discard the eleventh element of the softmax output
to enable the generation of zero vectors. The EOP operation
does not follow the general operator architecture, as it just
forwards the input representation.
   The control module is intended to be capable of: 1) per-
ceiving the environment’s state representation, regardless of
its length and 2) conditioning itself on previous states and
actions. Therefore, its architecture is based on two LSTM
cells, running at two different recurrence levels (figure 7).
The first LSTM, which we call the digestor, consumes the
state representation one position at a time and produces a
fixed-size embedding at the last position. This fixed-size
embedding is fed to the second LSTM (the controlling pol-
icy) as input. While the digestor’s internal state gets zeroed
before consuming each state, the controller ticks at a lower
rate and gets to keep its internal state during the whole sort-
ing sequence.

                  Experimental setup
Each agent configuration is trained until specific stop criteria
are fulfilled. Our intention is that training conditions for all   Figure 6: Architecture of the swap functional module.
configurations are as equal as possible. We make use of two        The entire representation is merged into a one single tensor
distinct measures for determining when a model has reached         and fed to a bidirectional LSTM. The outputs pass through
a satisfactory training state and we keep a moving average         a fully connected layer and are then merged by addition.
for each one of them.                                              *Fully connected layers share parameters.
• The quantitative measure focuses on functionality and de-
  pends on the corresponding error rates. An output list is
  counted as an error when there is any mismatch with re-
                                     s(t)                        loss is generally computed with respect to the correspond-
                                                                 ing output, with the exception of the monolithic configura-
                                                                 tion, which is provided with an additional cross-entropy loss
                                                                 computed over the selection vectors. This last bit is relevant
                                                                 because it enables the monolithic configuration to learn the
                                  controller
                                                                 algorithm and the operations simultaneously.
                                                                    We have built three different training setups: module-
                                                                 wise training, monolithic training and staged training. In
                                     e(t)                        the staged training, each module is trained independently,
                                                                 but after every 100 training iterations the modules are as-
             digestor                                            sembled and tested in the assembled configuration.
                                                                    The monolithic configuration is the most unstable and the
                                                                 results vary significantly between different runs. Thus we
                                                                 average the results obtained through 5 runs in order to alle-
                                                                 viate this effect. To ease training under this configuration,
             R(t)1      R(t)2       R(t)m                        we also make the selection provided by the training trace
                                                                 override the control output, as it is done in (Bunel et al.
                                                                 2018). This mechanism is deactivated during testing, but
Figure 7: The digestor creates a fixed-size embedding e(t)       during training allows the operations to converge faster, re-
from the state representation and the controller takes it as     gardless of the performance of the control module.
input at every execution step. Conditioned on the embedding         We tried to compare results against a baseline model, us-
and its past state, it outputs the selection vector s(t).        ing a multilayer bidirectional LSTM, but such configuration
                                                                 did not seem to converge to any valid solution, so we decided
                                                                 to discard it for the sake of clarity.
  spect to the expected one. A pointer is considered valid if
  the only value above 0.5 corresponds to the right position.                     Experimental results
  Otherwise it is taken as an error.                             After training both modular and monolithic configurations,
• The qualitative measure depends on the percentage of           we saw that the training time is orders of magnitude shorter
  output values that do not comply with a certain satura-        when it is trained modular-wise (figure 8), despite the re-
  tion requirement, which is that the difference with respect    quirements being stronger. It is important to stress that in
  to the one-hot label is not greater than 0.1.                  this case we considered only the worst case scenario, so we
   The monolithic configuration is trained until the quantita-   count the time that takes to train the modules sequentially.
tive measure reaches values below 1%. The training of the        The training time can be reduced even further if all modules
operation modules takes also the qualitative measure into ac-    are trained in parallel.
count and only stops when both measures are below 1%. As            We also see in figure 9 how each training progresses in
a measure to constrain the training time, the training of the    a very different manner, the modular configuration need-
monolithic configuration is also stopped if the progress of      ing much less time to reach low error rates than the mono-
the loss value becomes stagnant or if the loss falls below       lithic one. Though it takes more training iterations, they are
1e-6.                                                            faster to compute and the error rate is more stable. We were
   It is convenient that the modules work well when several      curious about this behaviour and we conducted additional
operations are concatenated and that is why we require the       measurements regarding the gradient (figure 10). We then
quality criterion and why we train them under noisy condi-       saw then that the gradient is much richer in the monolithic
tions. In this regard, we apply to the inputs noise sampled      case, with a higher mean absolute value per parameter and
from a uniform distribution to deviate them from pure {0, 1}     greater variations. This makes sense, as the back propaga-
values up to a 0.4 difference (eq. 1). Inputs that represent     tion through time accumulates gradient at every time step
a one-hot vector are extended to 11 elements before adding       and the monolithic configuration has a recurrence of O(N 2 ),
the uniform noise and passing them through a softmax-like        so it is more informative and can capture complex relations,
function (eq. 2) in order to keep the values on the softmax      even between modules.
manifold. The eleventh element is then discarded again to           By observing the training data more thoroughly, we can
allow the generation of zero vectors.                            appreciate the relative complexity of the different modules.
                                                                 In figure 11 we plot the loss curves for each module in
                x̂unif orm = |x − U (0, 0.4)|             (1)    the network when trained independently. Pointer operations
                                                                 converge very quickly, as they only learn to delay the in-
                                                                 put in one time step. The swap operation needs more time,
          x̂sof tmax = softmax(x̂unif orm · 100)         (2)     but thanks to the bidirectional configuration each LSTM just
  Every configuration is trained in a supervised fashion,        needs to remember one digit (listing 1). Surprisingly, the
making use of the cross-entropy loss and Adam (Kingma            control module does not need as much time as the swap
and Ba 2014) as the optimizer. The learning rate is kept the     module to converge, even when having to learn how to di-
same (1e-3) across all configurations. The cross-entropy         gest the list into an embedded representation and to use its
                                                Model convergence
                                       Monolithic
                                       Modular
Time to converge (seconds)


                             104


                                                                                                                  Training progress (max. length 7)
                             103
                                                                                               1.0                                                     Monolithic
                                                                                                                                                       Modular
                                                                                               0.8
                                   3     4      5     6      7       8   9   10
                                                 Maximal input length                          0.6


                                                                                  Error rate
Figure 8: Convergence times for modular and monolithic
                                                                                               0.4
configurations. Training times do not only get longer with
longer input sequences, but also become more unstable. We
hypothesize this is because of the high recurrence. The time                                   0.2
scale is logarithmic.
                                                                                               0.0
internal memory for remembering past actions. This could                                                 0        1000       2000         3000         4000
again be a consequence of a richer gradient.                                                                                Wall-clock time
                                                                                                                  Training progress (max. length 7)
Listing 1: Example of a bidirectional LSTM performing the                                      1.0
swap operation onto a list. An underscore represents a zero-                                                                                           Monolithic
vector.                                                                                                                                                Modular
                                                                                               0.8
L = 3 ,9 ,5 ,4 ,
A = 0 ,1 ,0 ,0 ,0
B = 0 ,0 ,0 ,1 ,0                                                                              0.6
                                                                                  Error rate


 F o r w a r d LSTM O u t p u t = 3 , , 5 , 9 ,
 Backward LSTM O u t p u t = , 4 , , ,                                                         0.4
 Merge by a d d i t i o n −−−> 3 , 4 , 5 , 9 ,
   Regarding generalization, in figure 12 we show the be-                                      0.2
haviour of each configuration when tested on list lengths
not seen during training. The monolithic configuration gen-
eralizes better to longer lists and its performance degrades                                   0.0
                                                                                                     0       50       100       150       200    250      300
slowly and smoothly than the modular one. This seems to                                                                     Iterations (x100)
contradict the common belief that modular NNs are able
to achieve better generalization. However, we hypothesized
this could happen due to the additional restrictions applied                      Figure 9: Error rate curves during training for modular and
to the modular training, such as the random input noise and                       monolithic configurations, with respect to time (top) and
the output saturation requirements.                                               training iterations (bottom).
   In the modular configuration, we tried to compensate the
lack of such gradient quality with ad-hoc loss functions and
training conditions, but adding such priors can also back-
lash. This is therefore a phenomenon that should be consid-
ered when designing modular NNs. In this case, we tried a
learning rate 5 times higher to compensate for the lower gra-
dient and we experienced a training time reduction of more
than a half, with a slight increase in generalization (figure
13). Further study of the special treatment for modular NNs
could be part of future research.
                                    Mean gradient value per weight                                                                              Monolithic
                                                                        Monolithic
                 0.0012
                                                                        Modular                                       21
                 0.0010
                                                                                                                      18
Absolute value


                 0.0008
                                                                                                                      15


                                                                                                      Tested length
                 0.0006
                                                                                                                      12
                 0.0004

                                                                                                                       9
                 0.0002

                 0.0000                                                                                                6
                                  200            400           600          800
                                                 Training step                                                         3

Figure 10: Mean absolute value of the gradient at the                                                                  0
weights for each configuration. This data was obtained                                                                                 4 Modular6 training 8                 10
while training with lists of 7 digits maximum. Obtained                                                                                   Maximal trained length
data was slightly smoothed to help visualization. The grey
shadow represents the standard deviation.                                                            60                                          Modular
                                                                                                                           0.0         0.2      0.4      0.6          0.8         1.0
                                                                                                                                                  Accuracy
                                                                                                     50               18
                                                                                                                      16
                                                                                     Tested length


                                                                                                     40               14
                                                                                                      Tested length


                                                                                                                      12
                                   Training progress (max. length 7)                                 30
                                                                             swap                                     10
                 0.30                                                        ctrl
                                                                             ptra                    20                8
                 0.25
                                                                                                                       6
                 0.20                                                                                10                4
Training loss


                 0.15
                                                                                                                       2
                                                                                                      0
                                                                                                          2.5 0                  5.0      7.5       10.0       12.5         15.0
                 0.10                                                                                                                  4         6             8             10
                                                                                                                                       Maximal trained
                                                                                                                                          Maximal       length
                                                                                                                                                   trained length
                 0.05


                 0.00
                        0   100     200     300        400     500    600   700                           0.0 0.0 0.2 0.2                    0.4 0.4 0.60.6        0.80.8     1.0
                                                                                                                                                                                1.0
                                          Wall-clock time (seconds)                                                                                Accuracy
                                                                                                                                               Accuracy
Figure 11: Progress of the training loss for the different op-                                        Figure 12: Generalization tests for monolithic (top) and
erations during training with lists of maximal length 7. We                                           modular (bottom) configurations. Horizontal lines mark the
only show ptra because the complexity is the same as in                                               length where 0 accuracy is achieved. A dashed line points
ptrb and retb.                                                                                        where the accuracy passes 0.9.
                                                                                          References
                               Modular                            Abolafia, D. A.; Norouzi, M.; Shen, J.; Zhao, R.; and Le,
                                                                  Q. V. 2018. Neural Program Synthesis with Priority Queue
                18                                                Training. ArXiv e-prints.
                16                                                Andreas, J.; Rohrbach, M.; Darrell, T.; and Klein, D. 2015.
                                                                  Deep compositional question answering with neural module
                14                                                networks. CoRR abs/1511.02799.
                                                                  Andreas, J.; Klein, D.; and Levine, S. 2016. Modular Mul-
Tested length


                12
                                                                  titask Reinforcement Learning with Policy Sketches. arXiv
                10                                                e-prints arXiv:1611.01796.
                                                                  Auda, G., and Kamel, M. S. 1999. Modular neural networks:
                 8                                                a survey. International journal of neural systems 9:129–51.
                 6                                                Bennani, Y. 1995. A modular and hybrid connectionist sys-
                                                                  tem for speaker identification. Neural computation 7:791–8.
                 4
                                                                  Blumer, A.; Ehrenfeucht, A.; Haussler, D.; and Warmuth,
                 2                                                M. K. 1989. Learnability and the vapnik-chervonenkis di-
                                                                  mension. J. ACM 36(4):929–965.
                 0                                                Britz, D.; Goldie, A.; Luong, T.; and Le, Q. 2017. Mas-
                     4        6              8           10
                                                                  sive Exploration of Neural Machine Translation Architec-
                         Maximal trained length                   tures. ArXiv e-prints.
                                                                  Bunel, R.; Hausknecht, M.; Devlin, J.; Singh, R.; and Kohli,
Figure 13: Generalization tests for the modular configura-        P. 2018. Leveraging grammar and reinforcement learning
       0.0 being trained
tion after        0.2    with0.4        0.6 learning
                               the corrected      0.8 rate. 1.0   for neural program synthesis. In International Conference
                                Accuracy                          on Learning Representations.
                                                                  Cai, J.; Shin, R.; and Song, D. 2017. Making Neural Pro-
                         Conclusions                              gramming Architectures Generalize via Recursion. ArXiv
We proposed a modular approach to NNs design based on a           e-prints.
perception-action loop, in which the whole system is func-        Chen, K. 2015. Deep and Modular Neural Networks.
tionally divided in several modules with standardized inter-      Springer. chapter 28.
faces. These modules are all liable to be trained, either inde-
                                                                  Conneau, A.; Lample, G.; Ranzato, M.; Denoyer, L.; and
pendently or jointly, or also explicitly specified by a human
                                                                  Jégou, H. 2017. Word Translation Without Parallel Data.
programmer. We have shown how a list sorting problem can
                                                                  ArXiv e-prints.
be solved by a NN following this modular architecture and
how modularity has a very positive impact on training speed       D. Erman, L.; Hayes-Roth, F.; Lesser, V.; and Reddy, R.
and stability. There seems to be a trade-off with respect         1980. The hearsay-ii speech-understanding system: Inte-
to the generalization and number of training steps though,        grating knowledge to resolve uncertainty. ACM Comput.
which somehow suffer from not having access to a global           Surv. 12:213–253.
gradient and excessive restrictions during training. We give      Devlin, J.; Bunel, R.; Singh, R.; Hausknecht, M.; and Kohli,
insights in this phenomena and suggestions to address them.       P. 2017. Neural Program Meta-Induction. ArXiv e-prints.
   Designing modular NNs can lead to a better utilization         Graves, A.; Wayne, G.; and Danihelka, I. 2014. Neural
of computational resources and data available, as well as an      turing machines. CoRR abs/1410.5401.
easier integration of expert knowledge, as it is discussed in     He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Deep residual
the document. NN modules under this architecture are easily       learning for image recognition. CoRR abs/1512.03385.
upgradeable or interchangeable by alternative implementa-
tions. Future research should explore this kind of scenarios      Hochreiter, S., and Schmidhuber, J. 1997. Long short-term
and practical implementations of the modular concept. The         memory. Neural Comput. 9(8):1735–1780.
effects of modular training on non-recurrent NN modules           Hochreiter, S.; Bengio, Y.; Frasconi, P.; and Schmidhuber, J.
should be studied as well. Moreover, what we have intro-          2001. Gradient flow in recurrent nets: the difficulty of learn-
duced is an initial approach, so further investigations may       ing long-term dependencies. In Kremer, and Kolen., eds., A
reveal a variety of faults and improvements, in particular re-    Field Guide to Dynamical Recurrent Neural Networks. IEEE
garding the application of this concept to problems of higher     Press.
complexity.                                                       Hrycej, T. 1992. Modular learning in neural net- works: A
                                                                  modularized approach to classification. New York: Wiley.
                     Acknowledgements                             Hu, R.; Andreas, J.; Rohrbach, M.; Darrell, T.; and Saenko,
We would like to thank the reviewers for their valuable opin-     K. 2017. Learning to Reason: End-to-End Module
ions and suggestions, which have helped us to substantially       Networks for Visual Question Answering. arXiv e-prints
improve the quality of this article.                              arXiv:1704.05526.
Hu, R.; Andreas, J.; Darrell, T.; and Saenko, K. 2018.
Explainable Neural Computation via Stack Neural Module
Networks. arXiv e-prints arXiv:1807.08556.
Jacobs, R. A.; Jordan, M. I.; Nowlan, S. J.; and Hinton, G. E.
1991. Adaptive mixtures of local-experts. Neural Computa-
tion 3(1):79–87.
Jacobs, R. 1990. Task Decomposition Through Competi-
tion in a Modular Connectionist Architecture, PhD Thesis.
Amherst, MA, USA: University of Massachusets.
Jaderberg, M.; Czarnecki, W. M.; Osindero, S.; Vinyals, O.;
Graves, A.; and Kavukcuoglu, K. 2016. Decoupled neural
interfaces using synthetic gradients. CoRR abs/1608.05343.
Jeff Clune, Jean-Baptiste Mouret, H. L. 2013. The evo-
lutionary origins of modularity. Proceedings of the Royal
Society B 280(1755).
Jia, X.; Song, S.; He, W.; Wang, Y.; Rong, H.; Zhou, F.;
Xie, L.; Guo, Z.; Yang, Y.; Yu, L.; Chen, T.; Hu, G.; Shi, S.;
and Chu, X. 2018. Highly Scalable Deep Learning Training
System with Mixed-Precision: Training ImageNet in Four
Minutes. ArXiv e-prints.
Jordan, M. I., and Jacobs, R. A. 1994. Hierarchical mix-
ture of experts and the em algorithm. Neural Computation
6(2):181–214.
Kandel, E. R.; Schwartz, J. H.; and Jessell, T. M. 2000.
Principle of Neural Science (4th Ed.). New York: McGraw-
Hill.
Kingma, D. P., and Ba, J. 2014. Adam: A Method for
Stochastic Optimization. ArXiv e-prints.
Lecun, Y. 1989. Generalization and network design strate-
gies. Elsevier. 143–55.
Lorien Y. Pratt, J. M., and Kamm, C. A. 1991. Direct trans-
fer of learned information among neural networks. Proceed-
ings of the Ninth National Conference on Artificial Intelli-
gence (AAAI-91) 584–589.
Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013. Ef-
ficient Estimation of Word Representations in Vector Space.
ArXiv e-prints.
Reed, S., and de Freitas, N. 2015. Neural Programmer-
Interpreters. ArXiv e-prints.
Simonyan, K., and Zisserman, A. 2014. Very Deep Con-
volutional Networks for Large-Scale Image Recognition.
arXiv e-prints arXiv:1409.1556.
Sutton, R. S., and Barto, A. G. 1998. Reinforcement Learn-
ing: An Introduction. Cambridge, MA: MIT Press.
Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S. E.;
Anguelov, D.; Erhan, D.; Vanhoucke, V.; and Rabinovich,
A. 2014. Going deeper with convolutions. CoRR
abs/1409.4842.
van den Oord, A.; Kalchbrenner, N.; Vinyals, O.; Espeholt,
L.; Graves, A.; and Kavukcuoglu, K. 2016. Conditional
Image Generation with PixelCNN Decoders. ArXiv e-prints.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones,
L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. At-
tention Is All You Need. ArXiv e-prints.