Modularity as a Means for Complexity Management in Neural Networks Learning David Castillo-Bolado Cayetano Guerra-Artal Mario Hernandez-Tejera david.castillo@siani.es cayetano.guerra@ulpgc.es mario.hernandez@ulpgc.es SIANI Institute and Department of Computer Science. University of Las Palmas de Gran Canaria Abstract rated architectures that are specifically designed to the prob- lem in question (Britz et al. 2017) (van den Oord et al. 2016). Training a Neural Network (NN) with lots of parameters or But these approaches are not always sufficient and, while intricate architectures creates undesired phenomena that com- plicate the optimization process. To address this issue we pro- new techniques like attention mechanisms are now enjoy- pose a first modular approach to NN design, wherein the NN ing great success (Vaswani et al. 2017), they are integrated is decomposed into a control module and several functional in monolithic approaches that tend to suffer from overspe- modules, implementing primitive operations. We illustrate cialization. Thus Deep Neural Networks become more and the modular concept by comparing performances between a more unmanageable every time they grow in complexity; monolithic and a modular NN on a list sorting problem and impossible for modest research teams to deal with, as the show the benefits in terms of training speed, training stabil- state of the art is often built upon exaggerated computational ity and maintainability. We also discuss some questions that resources (Jia et al. 2018). arise in modular NNs. NNs were developed by mimicking biological neural structures and functions, and have ever since continued to Introduction be inspired by brain-related research (Kandel, Schwartz, and Jessell 2000). Such neural structures are inherently modu- There has been a recent boom in the development of Deep lar and the human brain itself is modular in different spa- Neural Networks, promoted by the increase in compu- tial scales, as the learning process occurs in a very localized tational power and parallelism and its availability to re- manner (Hrycej 1992). That is to say that the human brain is searchers. This has triggered a trend towards reaching better organized as functional, sparsely connected subunits. This model performances via the growth of the number of param- is known to have been influenced by the impact of efficiency eters (He et al. 2015) and, in general, increment in complex- in evolution (Jeff Clune 2013). ity (Szegedy et al. 2014) (Graves, Wayne, and Danihelka In addition, modularity has an indispensable role in en- 2014). gineering and enables the building of highly complex, yet However, training NNs with lots of parameters empha- manageable systems. Modular systems can be designed, sizes a series of undesired phenomena, such as gradient maintained and enhanced in a very controlled and method- vanishing (Hochreiter et al. 2001), spatial crosstalk (Jacobs ological manner, as they ease tractability, knowledge embed- 1990) and the appearance of local minima. In addition, the ding and the reuse of preexisting parts. Modules are divided more parameters a model has, the more data and computa- according on functionality, according to the rules of high co- tion time are required for training (Blumer et al. 1989). hesion and low coupling, so new functionality will come as The research community has had notable success in cop- new modules in the system, leaving the others mostly unal- ing with this scenario, often through the inclusion of priors tered. in the network, as restrictions or conditionings. The priors In this paper we propose a way to integrate the prior of are fundamental in machine learning algorithms and they modularity into the design process of NNs. We make an ini- have been, in fact, the main source of major breakthroughs tial simplified approach to modularity by working under the within the field. Two well known cases are the Convolu- assumption that a problem domain can be solved based on tional Neural Networks (Lecun 1989) and the Long Short- a set of primitive operations. Using this proposal, we aim Term Memory (Hochreiter and Schmidhuber 1997). This to facilitate the building of complex, yet manageable sys- trend has reached the extent that recent models, developed to tems within the field of NNs, while enabling diverse module solve problems of moderate complexity, build up on elabo- implementations to coexist. Such systems may evolve by al- Copyright held by the author(s). In A. Martin, K. Hinkelmann, A. lowing the exchange and addition of modules, regardless of Gerber, D. Lenat, F. van Harmelen, P. Clark (Eds.), Proceedings of their implementation, thus avoiding the need to always start the AAAI 2019 Spring Symposium on Combining Machine Learn- from scratch. Our proposal should also ease the integration ing with Knowledge Engineering (AAAI-MAKE 2019). Stanford of knowledge in the form of handcrafted modules, or simply University, Palo Alto, California, USA, March 25-27, 2019. through the identification of primitives. Our main contributions are: • We propose an initial approach to a general modular ar- chitecture for the design and training of complex NNs. • We discuss the possibilities regarding the combination of different module implementations, maintainability and Environment knowledge embedding. • We prove that a NN designed with modularity in mind is Interface Interface able to train in a shorter time and is also a competitive alternative to monolithic models. R(t) R(t+1) • We give tips and guidelines for transferring our approach to other problems. • We give insights into the technical implications of modu- Control selection lar architectures in NNs. Module The code for the experiments in this paper is available at gitlab.com/dcasbol/nn-modularization. Operators Library The modular concept OP1 OP2 ··· OPm We are proposing a modular approach to NN design in which modularity is a key factor, as is the case in engineering projects. Our approach has many similarities to the black- board design pattern (D. Erman et al. 1980) and is based on Figure 1: Perception-action loop. Each module is suscepti- a perception-action loop (figure 1), in which the system is ble to being implemented by a NN. At each step, the control an agent that interacts with an environment via an interface. module selects an operator to be applied and this will gener- The environment reflects the current state of the problem, ate the next environment’s state. possibly including auxiliary elements such as scratchpads, markers or pointers, and the interface provides a represen- tation R(t) of it to work with. R(t) is thus a sufficient rep- resentation of the environment state at time t. This repre- sentation will match any relevant change in the environment as soon as they occur and, if any changes were made by the agent in the representation, the interface will be able to for- ward them to the environment. This feedback through the environment is what closes the perception-action loop. In the middle of this loop there is a control module that de- cides, conditioned on the environment’s representation and its own internal state, which action to take at each time step. Operator module These actions are produced by operators. Operators have a uniform interface: they admit an environment representation selective update selective input as input and they output a representation as well. They can functional therefore alter the environment and they will be used by the R(t) submodule R(t+1) control module to do so until the environment reaches a tar- get state, which represents the problem’s solution. As seen in figure 2, each operator is composed of a selective input submodule, a functional submodule and a selective update submodule. Both selective submodules act as an attention mechanism and help to decouple the functionality of the op- eration from its interface, minimizing as well the number of parameters that a neural functional submodule would need Figure 2: Detail of an operator module, composed of to consume the input. an input-selection submodule, a functional kernel and a There is no imposed restriction regarding module imple- selective-update submodule. Dashed lines highlight the se- mentations and therefore the architecture allows the build- lected data. ing of hybrid systems. This has important consequences concerning maintenance and knowledge embedding, under- stood as the reutilization of existing software, manual coding or supervised training of modules. There is also the possi- bility of progressively evolving such a system through the replacement or addition of modules. In the latter case, the – Selective update submodule. This uses the output of the control module must be updated. functional submodule to update the environment repre- sentation. Motivations and architecture breakdown Among the defined components, only the control mod- The architecture we propose is mainly motivated by the idea ule is strongly dependent on the problem, as it implements that every problem can be decomposed in several subprob- the logic to solve it. Other modules, like the functional and lems and thus a solution to a problem’s instance may be de- selective submodules, might well be reused to solve other composed into a set of primitive operations. The observa- problems, even from distinct problem families. The percep- tion, that an agent can decide which action to take based on tion module, on the other hand, has a strong relationship its perceptions and by these means reach a certain goal, in- with the environment representation, so it could mostly be spired us to think about problem solving in these terms. We reused for problems from the same family. were also aimed to increase the degree of maintainability An important appreciation is that the described architec- and interchangeability of modules, thus reducing the cou- ture can also be recursive. An operation may be internally pling was an important matter. built following the same pattern and, the other way round, In the following, we introduce the main components of a modular system may be wrapped into an operation of a the architecture and describe their role in the system: higher complexity. In such cases, an equivalency between the environment’s API and the selective modules is implic- • The environment. This represents the state of the prob- itly established. We believe that this feature is a fundamental lem and contains all information involved in the decision advantage of our modular approach, as it could be the start- making process. The environment is rarely fully available ing point for a systematic methodology for building complex to the agent, so the agent can only perceive a partial ob- intelligent systems. servation of it. – The environment representation. This is an abstract Implications regarding knowledge embedding representation of the environment, which is sufficient to estimate the state of the problem and take the opti- In this paper, we focus on studying the effects of following mal action. In certain cases, where the nature of the the modular approach in NNs. However, the proposed ar- problem is abstract enough, this environment represen- chitecture is agnostic to the module’s implementations. This tation is equivalent to the environment itself. gives rise to a handful of scenarios in which NNs and other implementations cooperate within the same system. The – The interface. Its role is to keep the environment and its low coupling among the different modules allows embed- representation synchronized. Any changes that happen ding knowledge through manual implementation of mod- in the environment will be reflected in its representation ules, adaptation of already existing software or supervised and vice versa. training of NN modules. • The control module. This is the decision maker. It selects We present below some points that we believe are strong which operation should be executed, according to the cur- advantages: rent observation of the environment. This module may be equated with the agent itself, the operation modules being • Selective submodules restrict the information that reaches the tools that it uses for achieving its purpose. the functional submodules and maintain the low coupling criteria. As their function is merely attentive, they are – The digestor or perception module. This module takes very eligible to be implemented manually or by reusing an environment representation as input, which may existing software elements. have unbounded size and data type, and generates a fixed size embedding from it. This module acts there- • Interfaces of functional modules should be, by definition, fore as a feature extractor for the policy. compatible among different implementations. That im- plies that improvements in the module can come about – The policy. This module decides which operation to progressively in a transparent way. Some operations execute, conditioned on the fixed size embedding that would nowadays be implemented by NNs but, if some- the digestor generates. day a more reliable and efficient algorithm is discovered, • Operation modules. They implement primitive operations they could be replaced without any negative side effect. which will be used to solve the problem. Their architec- • The isolation of functionality in modules allows an eas- ture focuses on isolating functionality, while at the same ier analysis of neural modules, enabling the design of time allowing interfacing with the environment represen- more efficient and explainable symbolic implementations, tation. when applicable. – Selective input submodule. This module filters the en- • If high-level policy sketches (Andreas, Klein, and Levine vironment representation to select only the information 2016) are available, modules can learn-by-role (Andreas relevant to the operation. et al. 2015) and afterwards be analyzed. In contrast, when – Functional submodule. This implements the opera- a policy is not available, a neural policy can be learned tion’s functionality. using RL algorithms. Related work alization effects of modularity. An important background NNs have traditionally been regarded as a modular system for this work is, in fact, everything related to Neural Pro- (Auda and Kamel 1999). At the hardware level, the com- gram Synthesis and Neural Program Induction, which are putation of neurons and layers can be decomposed down to often applied to explore Domain-Specific Language (DSL) a graph of multiplications and additions. This has been ex- spaces. In this regard, we were inspired by concepts and ploited by GPU computation, enabling the execution of non- techniques used in (Bunel et al. 2018), (Abolafia et al. 2018) dependant operations in parallel, and by the development of and (Devlin et al. 2017). frameworks for this kind of computation. Regarding compu- A sort of modularity is explored in (Jaderberg et al. 2016) tational motivations, the avoidance of coupling among neu- with the main intention of decoupling the learning of the rons and the quest for generalization and speed of learning distinct layers. Although the learning has to be performed have been the main arguments used in favor of modularity jointly, the layers can be trained asynchronously thanks to (Bennani 1995). the synthetic gradient loosening the strong dependencies The main application of modularity has been the con- among them. There are also other methods that are not struction of NN ensembles though, focusing on learning al- usually acknowledged as such but can be regarded as mod- gorithms that automate their formation (Auda and Kamel ular approaches to NN training. Transfer learning (Lorien 1999) (Chen 2015). A type of ensemble with some sim- Y. Pratt and Kamm 1991) is a common practise among the ilarities to our proposal is the Mixture of Experts (Jacobs deep learning practitioners and pretrained networks are also et al. 1991) (Jordan and Jacobs 1994), in which a gating used as feature extractors. In the field of Natural Language network selects the output from multiple expert networks. Processing, a word embedding module is often applied to the Constructive Modularization Learning and boosting meth- input one-hot vectors to transform the input space to a more ods pursue the divide-and-conquer idea as well, although efficient representation (Mikolov et al. 2013). This module they do it through an automatic partitioning of the space. is commonly provided as a set of pretrained weights or em- This automatic treatment of the modularization process what beddings (Conneau et al. 2017). makes difficult to embed any kind of expertise in the system. In (Andreas et al. 2015), a visual question answering List sorting problem is solved with a modular NN. Each module is tar- In order to test the modular concept, we selected a candidate geted to learn a certain operation and a module layout is problem that complied with following desiderata: dynamically generated after parsing the input question. The 1. It has to be as simple as possible, to avoid missing the modules then converge to the expected functionality due to focus in modularity. their role in such layout. This is an important step towards NN modularity, despite the modules being trained jointly. 2. It has to be feasible to solve the problem using just a small Many of the ideas presented here have already been dis- set of canonical operations. cussed in (Hu et al. 2017). They use Reinforcement Learn- 3. The problem complexity has to be easily identifiable and ing (RL) for training the policy module and backpropaga- configurable. tion for the rest of the modules. However, they predict the 4. The experiments should shed light into an actual sequence of actions in one shot and do not yet consider the complexity-related issue. possibility of implementing the feedback loop. They im- plicitly exploit modularity to some extent, as they pretrain We found that the integer list sorting problem complied the policy from expert traces and use a pretrained VGG-16 with these points. We did not worry too much about the network (Simonyan and Zisserman 2014), but the modules usefulness of the problem solving itself, but rather about its are trained jointly afterwards. In (Hu et al. 2018) they extend simplicity and the availability of training data and execution this concept integrating a feedback loop, but substituting the traces. We define the list domain as integer lists containing hard attention mechanism by a soft one in order to do an digits from 0 to 9 and we take the Selection Sort algorithm end-to-end training. Thus, the modular structure is present, as reference, which has O(n2 ) time complexity, but is sim- but the independent training is not exploited. ple in its definition. This implies that the environment will The idea of a NN being an agent interacting with some be comprised by a list of integers and two pointers, which environment is not new and is in fact the context in which we name A and B. We associate the complexity level to RL is defined (Sutton and Barto 1998). RL problems focus the maximal training lists length, because it will determine on learning a policy that the agent should follow in order to the internal recurrence level of the modules. So, a network maximize an expected reward. In such cases there is usually trained on complexity N is expected to sort lists with up to no point in training operation modules, as the agent inter- N digits. This setup also leads us to the following canonical acts simply by selecting an existing operation. RL methods operations: would therefore be a good option to train the control module. • mova. Moves the pointer A one position to the right. Our architecture proposal for the implementation of the • movb. Moves the pointer B one position to the right. control module has been greatly inspired by the work in Neural Programmer-Interpreters (Reed and de Freitas 2015). • retb. Returns the pointer B to the position to the right We were also aware of the subsequent work on general- of the pointer A. ization via recursion (Cai, Shin, and Song 2017), but we • swap. Exchanges the values located at the positions thought it would be of greater interest to isolate the gener- pointed by A and B. • EOP. Leaves the representation unchanged and marks the Training progress (max. length 5) 1.0 end of execution of the perception-action loop. Curriculum Learning Each problem instance can be solved based on this set Standard Training of primitive operations. At the beginning, the agent starts 0.8 with a zeroed internal state and the environment in an initial Combined error rate state, where the pointers A and B are pointing to the first and second digits respectively. We say that the environment state 0.6 is final if both pointers are pointing to the last digit of the list. The execution nevertheless stops when the agent selects the 0.4 EOP operator. The goal is to use a NN to solve the proposed sorting problem, after training it in a supervised fashion. Because we deal with sequential data, the most straight- 0.2 forward choice was to use Recurrent Neural Networks (RNN) to implement the different submodules. This brings up the common issues related to RNNs, which are for the 0.0 most part the gradient vanishing and the parallelization dif- 0 1000 2000 3000 4000 5000 6000 ficulties due to sequential dependencies. Our experiments Wall-clock time (seconds) will offer some clues about how a modular approach can help to cope with such issues. Figure 3: Progress of the error rate for standard training and curriculum learning. Standard training was stopped before Dynamical data generation quantitative convergence, after 7 hours of execution without improving. Our dataset comprises all possible sequences with digits from 0 to 9, with lengths that go from 2 digits long up to a length of N . We represent those digits with one-hot vec- treat the network as a whole and train it end to end, in the tors and we pad with zero-vectors the positions coming after most common fashion. In this way we are able to make a the list has reached its end. fair comparison between the modular and the monolithic ap- For training, we randomly generate lists with lengths proach. In all cases we will train the NN in a supervised within the range [2, N ]. For testing, we evaluate the network manner, based on (input, output) example pairs. on samples of the maximum training length N . We also gen- The network can interact with a predefined environment erate algorithmic traces. After lists are sampled, we generate by perceiving it and acting upon it. At each execution step, the traces based on the Selection Sort algorithm, producing the current state representation of the environment is fed to the operations applied at each time step as well as the in- all existing operations. Then, the control module, condi- termediate results. These traces are intended to emulate the tioned on the current and past states, selects which operation availability of expert knowledge in the form of execution ex- will be run, its output becoming the next representation (fig- amples, just as if they were previously recorded, and allow ure 4). In our implementation, we omit the interface against us to measure the correctness of execution. the environment and establish an equivalency between the We draw list lengths from a beta distribution that depends environment and its representation. on the training accuracy B(α, β), where α = 1 + accuracy, As said before, the environment has three elements: the β = 1 + (1 − accuracy) and accuracy ∈ [0, 1]. In this list and two pointers (A and B). The list is a sequence of way, we can start training on shorter lengths and by the end one-hot vectors, encoding integer digits that go from 0 to of the training we mainly sample long lists. We found this 9. Null values are encoded by a zero vector. The pointers sampling method to be very advantageous with respect to are sequences of values comprised in the range [0, 1], which the uniform sampling. In figure 3 we show a comparison indicate the presence of the pointer at each position. A value between both sampling methods. There, we can see the in- over 0.5 means that the pointer is present at that position. ability of the model to converge under uniformly sampled Each operation module is implemented following the batches, seemingly because of the complexity residing in the main modular concept (figure 2). We only allow the func- longer sequences and their infrequent appearance under such tional submodules to be implemented by a NN and build sampling conditions. the selective submodules programmatically. In ptra and ptrb, the same pointer is selected as input and output. Neural Network architecture retb selects A as input and updates B. swap merges the We intend to evaluate the impact of modularization and list and both pointers into a single tensor to build the input modular training in a NN and assess its technical implica- and updates only the list. tions, transferring the proposed abstract architecture to a par- The architecture of each functional submodule is differ- ticular case. Therefore, we have implemented a modular NN ent, depending on the nature of the operation, but they all architecture, in which every module (except selection sub- use at least one LSTM cell, which always has 100 units. modules) is a NN. Pointer operations use an LSTM cell, followed by a fully This layout enables us to train the network’s modules in- connected layer with a single output and a sigmoid activa- dependently and assemble them afterwards, as well as to tion (figure 5). The swap submodule is based on a bidirec- p(t+1)i sigmoid ptra FC p(t) h(t)i p(t+1) ptrb c(t)i-1 c(t)i LSTM retb h(t)i-1 h(t)i swap p(t)i EOP Figure 5: Architecture of the pointers’ functional submod- ule, with an LSTM and a fully connected output layer with sigmoid activation. c(t)i and h(t)i are the LSTM’s internal Figure 4: Implementation of the perception-action loop for state and output at each time step and position i. the list sorting problem. At each time step, the control mod- ule selects the output of one operation, which substitutes the previous representation. tional LSTM. The output sequences from both forwards and backwards LSTM are fed to the same fully connected layer with 11 outputs and summed afterwards. This resulting se- quence is then passed through a softmax activation (figure 6). We discard the eleventh element of the softmax output to enable the generation of zero vectors. The EOP operation does not follow the general operator architecture, as it just forwards the input representation. The control module is intended to be capable of: 1) per- ceiving the environment’s state representation, regardless of its length and 2) conditioning itself on previous states and actions. Therefore, its architecture is based on two LSTM cells, running at two different recurrence levels (figure 7). The first LSTM, which we call the digestor, consumes the state representation one position at a time and produces a fixed-size embedding at the last position. This fixed-size embedding is fed to the second LSTM (the controlling pol- icy) as input. While the digestor’s internal state gets zeroed before consuming each state, the controller ticks at a lower rate and gets to keep its internal state during the whole sort- ing sequence. Experimental setup Each agent configuration is trained until specific stop criteria are fulfilled. Our intention is that training conditions for all Figure 6: Architecture of the swap functional module. configurations are as equal as possible. We make use of two The entire representation is merged into a one single tensor distinct measures for determining when a model has reached and fed to a bidirectional LSTM. The outputs pass through a satisfactory training state and we keep a moving average a fully connected layer and are then merged by addition. for each one of them. *Fully connected layers share parameters. • The quantitative measure focuses on functionality and de- pends on the corresponding error rates. An output list is counted as an error when there is any mismatch with re- s(t) loss is generally computed with respect to the correspond- ing output, with the exception of the monolithic configura- tion, which is provided with an additional cross-entropy loss computed over the selection vectors. This last bit is relevant because it enables the monolithic configuration to learn the controller algorithm and the operations simultaneously. We have built three different training setups: module- wise training, monolithic training and staged training. In e(t) the staged training, each module is trained independently, but after every 100 training iterations the modules are as- digestor sembled and tested in the assembled configuration. The monolithic configuration is the most unstable and the results vary significantly between different runs. Thus we average the results obtained through 5 runs in order to alle- viate this effect. To ease training under this configuration, R(t)1 R(t)2 R(t)m we also make the selection provided by the training trace override the control output, as it is done in (Bunel et al. 2018). This mechanism is deactivated during testing, but Figure 7: The digestor creates a fixed-size embedding e(t) during training allows the operations to converge faster, re- from the state representation and the controller takes it as gardless of the performance of the control module. input at every execution step. Conditioned on the embedding We tried to compare results against a baseline model, us- and its past state, it outputs the selection vector s(t). ing a multilayer bidirectional LSTM, but such configuration did not seem to converge to any valid solution, so we decided to discard it for the sake of clarity. spect to the expected one. A pointer is considered valid if the only value above 0.5 corresponds to the right position. Experimental results Otherwise it is taken as an error. After training both modular and monolithic configurations, • The qualitative measure depends on the percentage of we saw that the training time is orders of magnitude shorter output values that do not comply with a certain satura- when it is trained modular-wise (figure 8), despite the re- tion requirement, which is that the difference with respect quirements being stronger. It is important to stress that in to the one-hot label is not greater than 0.1. this case we considered only the worst case scenario, so we The monolithic configuration is trained until the quantita- count the time that takes to train the modules sequentially. tive measure reaches values below 1%. The training of the The training time can be reduced even further if all modules operation modules takes also the qualitative measure into ac- are trained in parallel. count and only stops when both measures are below 1%. As We also see in figure 9 how each training progresses in a measure to constrain the training time, the training of the a very different manner, the modular configuration need- monolithic configuration is also stopped if the progress of ing much less time to reach low error rates than the mono- the loss value becomes stagnant or if the loss falls below lithic one. Though it takes more training iterations, they are 1e-6. faster to compute and the error rate is more stable. We were It is convenient that the modules work well when several curious about this behaviour and we conducted additional operations are concatenated and that is why we require the measurements regarding the gradient (figure 10). We then quality criterion and why we train them under noisy condi- saw then that the gradient is much richer in the monolithic tions. In this regard, we apply to the inputs noise sampled case, with a higher mean absolute value per parameter and from a uniform distribution to deviate them from pure {0, 1} greater variations. This makes sense, as the back propaga- values up to a 0.4 difference (eq. 1). Inputs that represent tion through time accumulates gradient at every time step a one-hot vector are extended to 11 elements before adding and the monolithic configuration has a recurrence of O(N 2 ), the uniform noise and passing them through a softmax-like so it is more informative and can capture complex relations, function (eq. 2) in order to keep the values on the softmax even between modules. manifold. The eleventh element is then discarded again to By observing the training data more thoroughly, we can allow the generation of zero vectors. appreciate the relative complexity of the different modules. In figure 11 we plot the loss curves for each module in x̂unif orm = |x − U (0, 0.4)| (1) the network when trained independently. Pointer operations converge very quickly, as they only learn to delay the in- put in one time step. The swap operation needs more time, x̂sof tmax = softmax(x̂unif orm · 100) (2) but thanks to the bidirectional configuration each LSTM just Every configuration is trained in a supervised fashion, needs to remember one digit (listing 1). Surprisingly, the making use of the cross-entropy loss and Adam (Kingma control module does not need as much time as the swap and Ba 2014) as the optimizer. The learning rate is kept the module to converge, even when having to learn how to di- same (1e-3) across all configurations. The cross-entropy gest the list into an embedded representation and to use its Model convergence Monolithic Modular Time to converge (seconds) 104 Training progress (max. length 7) 103 1.0 Monolithic Modular 0.8 3 4 5 6 7 8 9 10 Maximal input length 0.6 Error rate Figure 8: Convergence times for modular and monolithic 0.4 configurations. Training times do not only get longer with longer input sequences, but also become more unstable. We hypothesize this is because of the high recurrence. The time 0.2 scale is logarithmic. 0.0 internal memory for remembering past actions. This could 0 1000 2000 3000 4000 again be a consequence of a richer gradient. Wall-clock time Training progress (max. length 7) Listing 1: Example of a bidirectional LSTM performing the 1.0 swap operation onto a list. An underscore represents a zero- Monolithic vector. Modular 0.8 L = 3 ,9 ,5 ,4 , A = 0 ,1 ,0 ,0 ,0 B = 0 ,0 ,0 ,1 ,0 0.6 Error rate F o r w a r d LSTM O u t p u t = 3 , , 5 , 9 , Backward LSTM O u t p u t = , 4 , , , 0.4 Merge by a d d i t i o n −−−> 3 , 4 , 5 , 9 , Regarding generalization, in figure 12 we show the be- 0.2 haviour of each configuration when tested on list lengths not seen during training. The monolithic configuration gen- eralizes better to longer lists and its performance degrades 0.0 0 50 100 150 200 250 300 slowly and smoothly than the modular one. This seems to Iterations (x100) contradict the common belief that modular NNs are able to achieve better generalization. However, we hypothesized this could happen due to the additional restrictions applied Figure 9: Error rate curves during training for modular and to the modular training, such as the random input noise and monolithic configurations, with respect to time (top) and the output saturation requirements. training iterations (bottom). In the modular configuration, we tried to compensate the lack of such gradient quality with ad-hoc loss functions and training conditions, but adding such priors can also back- lash. This is therefore a phenomenon that should be consid- ered when designing modular NNs. In this case, we tried a learning rate 5 times higher to compensate for the lower gra- dient and we experienced a training time reduction of more than a half, with a slight increase in generalization (figure 13). Further study of the special treatment for modular NNs could be part of future research. Mean gradient value per weight Monolithic Monolithic 0.0012 Modular 21 0.0010 18 Absolute value 0.0008 15 Tested length 0.0006 12 0.0004 9 0.0002 0.0000 6 200 400 600 800 Training step 3 Figure 10: Mean absolute value of the gradient at the 0 weights for each configuration. This data was obtained 4 Modular6 training 8 10 while training with lists of 7 digits maximum. Obtained Maximal trained length data was slightly smoothed to help visualization. The grey shadow represents the standard deviation. 60 Modular 0.0 0.2 0.4 0.6 0.8 1.0 Accuracy 50 18 16 Tested length 40 14 Tested length 12 Training progress (max. length 7) 30 swap 10 0.30 ctrl ptra 20 8 0.25 6 0.20 10 4 Training loss 0.15 2 0 2.5 0 5.0 7.5 10.0 12.5 15.0 0.10 4 6 8 10 Maximal trained Maximal length trained length 0.05 0.00 0 100 200 300 400 500 600 700 0.0 0.0 0.2 0.2 0.4 0.4 0.60.6 0.80.8 1.0 1.0 Wall-clock time (seconds) Accuracy Accuracy Figure 11: Progress of the training loss for the different op- Figure 12: Generalization tests for monolithic (top) and erations during training with lists of maximal length 7. We modular (bottom) configurations. Horizontal lines mark the only show ptra because the complexity is the same as in length where 0 accuracy is achieved. A dashed line points ptrb and retb. where the accuracy passes 0.9. References Modular Abolafia, D. A.; Norouzi, M.; Shen, J.; Zhao, R.; and Le, Q. V. 2018. Neural Program Synthesis with Priority Queue 18 Training. ArXiv e-prints. 16 Andreas, J.; Rohrbach, M.; Darrell, T.; and Klein, D. 2015. Deep compositional question answering with neural module 14 networks. CoRR abs/1511.02799. Andreas, J.; Klein, D.; and Levine, S. 2016. Modular Mul- Tested length 12 titask Reinforcement Learning with Policy Sketches. arXiv 10 e-prints arXiv:1611.01796. Auda, G., and Kamel, M. S. 1999. Modular neural networks: 8 a survey. International journal of neural systems 9:129–51. 6 Bennani, Y. 1995. A modular and hybrid connectionist sys- tem for speaker identification. Neural computation 7:791–8. 4 Blumer, A.; Ehrenfeucht, A.; Haussler, D.; and Warmuth, 2 M. K. 1989. Learnability and the vapnik-chervonenkis di- mension. J. ACM 36(4):929–965. 0 Britz, D.; Goldie, A.; Luong, T.; and Le, Q. 2017. Mas- 4 6 8 10 sive Exploration of Neural Machine Translation Architec- Maximal trained length tures. ArXiv e-prints. Bunel, R.; Hausknecht, M.; Devlin, J.; Singh, R.; and Kohli, Figure 13: Generalization tests for the modular configura- P. 2018. Leveraging grammar and reinforcement learning 0.0 being trained tion after 0.2 with0.4 0.6 learning the corrected 0.8 rate. 1.0 for neural program synthesis. In International Conference Accuracy on Learning Representations. Cai, J.; Shin, R.; and Song, D. 2017. Making Neural Pro- Conclusions gramming Architectures Generalize via Recursion. ArXiv We proposed a modular approach to NNs design based on a e-prints. perception-action loop, in which the whole system is func- Chen, K. 2015. Deep and Modular Neural Networks. tionally divided in several modules with standardized inter- Springer. chapter 28. faces. These modules are all liable to be trained, either inde- Conneau, A.; Lample, G.; Ranzato, M.; Denoyer, L.; and pendently or jointly, or also explicitly specified by a human Jégou, H. 2017. Word Translation Without Parallel Data. programmer. We have shown how a list sorting problem can ArXiv e-prints. be solved by a NN following this modular architecture and how modularity has a very positive impact on training speed D. Erman, L.; Hayes-Roth, F.; Lesser, V.; and Reddy, R. and stability. There seems to be a trade-off with respect 1980. The hearsay-ii speech-understanding system: Inte- to the generalization and number of training steps though, grating knowledge to resolve uncertainty. ACM Comput. which somehow suffer from not having access to a global Surv. 12:213–253. gradient and excessive restrictions during training. We give Devlin, J.; Bunel, R.; Singh, R.; Hausknecht, M.; and Kohli, insights in this phenomena and suggestions to address them. P. 2017. Neural Program Meta-Induction. ArXiv e-prints. Designing modular NNs can lead to a better utilization Graves, A.; Wayne, G.; and Danihelka, I. 2014. Neural of computational resources and data available, as well as an turing machines. CoRR abs/1410.5401. easier integration of expert knowledge, as it is discussed in He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Deep residual the document. NN modules under this architecture are easily learning for image recognition. CoRR abs/1512.03385. upgradeable or interchangeable by alternative implementa- tions. Future research should explore this kind of scenarios Hochreiter, S., and Schmidhuber, J. 1997. Long short-term and practical implementations of the modular concept. The memory. Neural Comput. 9(8):1735–1780. effects of modular training on non-recurrent NN modules Hochreiter, S.; Bengio, Y.; Frasconi, P.; and Schmidhuber, J. should be studied as well. Moreover, what we have intro- 2001. Gradient flow in recurrent nets: the difficulty of learn- duced is an initial approach, so further investigations may ing long-term dependencies. In Kremer, and Kolen., eds., A reveal a variety of faults and improvements, in particular re- Field Guide to Dynamical Recurrent Neural Networks. IEEE garding the application of this concept to problems of higher Press. complexity. Hrycej, T. 1992. Modular learning in neural net- works: A modularized approach to classification. New York: Wiley. Acknowledgements Hu, R.; Andreas, J.; Rohrbach, M.; Darrell, T.; and Saenko, We would like to thank the reviewers for their valuable opin- K. 2017. Learning to Reason: End-to-End Module ions and suggestions, which have helped us to substantially Networks for Visual Question Answering. arXiv e-prints improve the quality of this article. arXiv:1704.05526. Hu, R.; Andreas, J.; Darrell, T.; and Saenko, K. 2018. Explainable Neural Computation via Stack Neural Module Networks. arXiv e-prints arXiv:1807.08556. Jacobs, R. A.; Jordan, M. I.; Nowlan, S. J.; and Hinton, G. E. 1991. Adaptive mixtures of local-experts. Neural Computa- tion 3(1):79–87. Jacobs, R. 1990. Task Decomposition Through Competi- tion in a Modular Connectionist Architecture, PhD Thesis. Amherst, MA, USA: University of Massachusets. Jaderberg, M.; Czarnecki, W. M.; Osindero, S.; Vinyals, O.; Graves, A.; and Kavukcuoglu, K. 2016. Decoupled neural interfaces using synthetic gradients. CoRR abs/1608.05343. Jeff Clune, Jean-Baptiste Mouret, H. L. 2013. The evo- lutionary origins of modularity. Proceedings of the Royal Society B 280(1755). Jia, X.; Song, S.; He, W.; Wang, Y.; Rong, H.; Zhou, F.; Xie, L.; Guo, Z.; Yang, Y.; Yu, L.; Chen, T.; Hu, G.; Shi, S.; and Chu, X. 2018. Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes. ArXiv e-prints. Jordan, M. I., and Jacobs, R. A. 1994. Hierarchical mix- ture of experts and the em algorithm. Neural Computation 6(2):181–214. Kandel, E. R.; Schwartz, J. H.; and Jessell, T. M. 2000. Principle of Neural Science (4th Ed.). New York: McGraw- Hill. Kingma, D. P., and Ba, J. 2014. Adam: A Method for Stochastic Optimization. ArXiv e-prints. Lecun, Y. 1989. Generalization and network design strate- gies. Elsevier. 143–55. Lorien Y. Pratt, J. M., and Kamm, C. A. 1991. Direct trans- fer of learned information among neural networks. Proceed- ings of the Ninth National Conference on Artificial Intelli- gence (AAAI-91) 584–589. Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013. Ef- ficient Estimation of Word Representations in Vector Space. ArXiv e-prints. Reed, S., and de Freitas, N. 2015. Neural Programmer- Interpreters. ArXiv e-prints. Simonyan, K., and Zisserman, A. 2014. Very Deep Con- volutional Networks for Large-Scale Image Recognition. arXiv e-prints arXiv:1409.1556. Sutton, R. S., and Barto, A. G. 1998. Reinforcement Learn- ing: An Introduction. Cambridge, MA: MIT Press. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S. E.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; and Rabinovich, A. 2014. Going deeper with convolutions. CoRR abs/1409.4842. van den Oord, A.; Kalchbrenner, N.; Vinyals, O.; Espeholt, L.; Graves, A.; and Kavukcuoglu, K. 2016. Conditional Image Generation with PixelCNN Decoders. ArXiv e-prints. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. At- tention Is All You Need. ArXiv e-prints.