Incremental Training of Deep Convolutional Neural Networks

R. Istrate (1,2), A. C. I. Malossi (1), C. Bekas (1), and D. Nikolopoulos (2)

(1) IBM Research – Zurich, Switzerland, {roi,acm,bek}@zurich.ibm.com
(2) Queen's University of Belfast, United Kingdom, {ristrate01,d.nikolopoulos}@qub.ac.uk

Abstract. We propose an incremental training method that partitions the original network into sub-networks, which are then gradually incorporated into the running network during the training process. To allow for a smooth dynamic growth of the network, we introduce a look-ahead initialization that outperforms random initialization. We demonstrate that our incremental approach reaches the baseline accuracy of the reference network. Additionally, it allows us to identify smaller partitions of the original state-of-the-art network that deliver the same final accuracy while using only a fraction of the global number of parameters. This enables a potential speedup of the training time by several factors. We report training results on CIFAR-10 for ResNet and VGGNet.

Keywords: Training algorithm, Look-ahead, CNNs

1 Introduction

When dealing with a classification task on a new dataset of images, a widely used strategy is to start by training a few state-of-the-art networks that were developed for similar datasets. The main disadvantage of this approach is that we learn only late in the process whether the network is well suited for the dataset. At that moment, the training is stopped, the network is adapted, and the process is restarted from scratch, discarding all the previously collected information. Thus, in the presence of large, complex datasets, the global time-to-solution tends to reach the order of several weeks or even months.

A great deal of research is now focused on optimizing the depth of a neural network (NN), and most of the proposed methods involve growing the network by gradually adding one or more layers [1–3]. Although this is a logical step towards evolving network architectures, to the best of our knowledge there is no work that compares the accuracy and performance of networks obtained by dynamically adapting their structure during training with those of the original networks trained from scratch.

In this paper we present our contribution towards gradually training deep state-of-the-art convolutional neural networks (CNNs) with no loss in accuracy. During incremental training we start the learning process with a shallow network. When the network performance stops improving because of its limited capacity, we transfer the knowledge of the shallow network to a deeper one and continue the training. In this way the time and resources spent in training the initial shallow network are not wasted, and the deeper network has a better initialization point. Additionally, since the incremental training method quantifies (by construction) the value added by each network extension, it is simple to define custom trade-off functions (e.g., time vs. accuracy) to stop the network expansion when desired criteria are met. The long-term goal is to develop a fully automated framework that optimizes neural network structures for specific tasks. However, this requires knowing what type of changes need to be performed on the network, which is outside the scope of the current work.

The rest of the paper is organized as follows.
Section 2 briefly summarizes related work, Section 3 provides an overview of our methodology, and Section 4 presents the results of our experiments. Finally, concluding remarks and future work are summarized in Section 5.

2 Related work

There is a clear tendency in the literature towards automating the design of NNs. In AdaNet [1] the authors aim to optimize the network architecture and the internal weights by balancing the trade-off between model complexity and empirical risk minimization. Although the claim is that networks discovered through this approach perform better than those found with a grid search, only limited results for binary classifiers trained on subsets of the CIFAR-10 dataset [4] are presented.

Auto-Net [5] is a framework that automatically tunes feed-forward neural networks without human intervention. The authors focus only on fully-connected neural networks in order to keep the number of hyper-parameters at a manageable level. They tune 63 network- and layer-dependent hyper-parameters for networks that are at most 6 layers deep. With this method they generated an ensemble of 39 models that outperformed all human experts and was the first automatically generated model to win an image competition (http://automl.chalearn.org/).

On the same track, a framework for large-scale image classifiers [6] makes use of intuitive genetic mutations to explore unprecedentedly large search spaces. The fully automatic evolution culminates in a trained network that reaches 94.6% accuracy on CIFAR-10, 2% lower than manually engineered state-of-the-art networks. However, the prohibitive computational cost of the framework, which required 4 · 10^20 FLOP for a 12-layer network, makes this approach unfeasible in practical scenarios with bigger datasets. Other works [7–9] that discover good combinations of hyper-parameters do not employ neuro-evolution, so their time and resource requirements are lower, but they have other drawbacks, such as a limited search space, a considerable loss in accuracy, limited usability, and the need to retrain from scratch for a considerable amount of time after each network alteration.

Other steps towards incremental training are presented in [10, 11], where the goal is to transfer knowledge from a small network to a significantly larger network under some architectural constraints. These approaches have a twofold benefit. First, they explore the design space of current state-of-the-art networks in order to find better performers. Second, they avoid information loss when the network structure has to be slightly modified. Although they improve state-of-the-art results for the ImageNet classification task, it would be interesting to compare, in terms of time and accuracy, the networks obtained through transfer learning with the same networks trained from scratch.

3 Methodology

Let us consider a generic CNN N composed of n layers. The incremental training method proposed in this work begins by partitioning the original network N into K sub-networks S_k, with k = 1, ..., K and K ≤ n. In our current implementation, the partitioning is done a priori and does not change throughout the training process, although this will be the subject of future studies. A sub-network can be composed of one or several layers, with the only constraint that at least one of the layers must contain trainable parameters; in other words, pooling and dropout layers cannot constitute a sub-network by themselves.
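To make the partitioning constraint concrete, the following minimal Python/PyTorch sketch (our own illustration, not the implementation used in this work; the cut points are assumed to be given) splits an ordered list of layers into contiguous sub-networks and rejects any sub-network that contains no trainable parameters:

```python
import torch.nn as nn

def partition_network(layers, cut_points):
    """Split an ordered layer list into contiguous sub-networks S_1..S_K.

    `cut_points` are indices at which a new sub-network starts. Every
    sub-network must contain at least one layer with trainable parameters
    (pooling or dropout alone is not a valid sub-network).
    """
    bounds = [0] + sorted(cut_points) + [len(layers)]
    subnets = [nn.Sequential(*layers[a:b]) for a, b in zip(bounds[:-1], bounds[1:])]
    for k, s in enumerate(subnets, start=1):
        if not any(p.requires_grad for p in s.parameters()):
            raise ValueError(f"sub-network S_{k} has no trainable parameters")
    return subnets

# Example: a small VGG-like stack split into K = 2 sub-networks.
layers = [
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
]
S1, S2 = partition_network(layers, cut_points=[3])
```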
The classifier block at the end of the network, which might include fully-connected layers or global pooling layers, is not considered part of the K partitions.

Following the scheme illustrated in Fig. 1, the training process starts with sub-network S_1 attached to the classifier block. To determine the optimal time to add the second sub-network S_2 between S_1 and the classifier, we compute the improvement in the validation accuracy every window-size (ws) epochs. When the observed improvement is below a threshold, we stop the training and increase the network depth by adding the next sub-network. This process, which is illustrated in Fig. 2, repeats until all K sub-networks are incorporated or until another custom criterion is met.

Each time a new sub-network S_{k+1} is inserted in the current architecture, its weights need to be initialized. This step is delicate, as our experiments show empirically that a non-optimal initialization (e.g., random) might prevent the network from reaching the same accuracy as the original one, or might significantly slow down the incremental training process (Section 4). To overcome this issue, we propose a more efficient initialization technique that we refer to as look-ahead. Given a current network constituted by k sub-networks, our look-ahead method consists of training S_{k+1} for a few epochs on the input generated by the already trained uppermost part of the network. This strategy provides a more informed starting point for the new sub-network, since it looks ahead towards more complex features that can be obtained from the previously learned ones. We remark that the weights of the original sub-networks S_1 + S_2 + ... + S_k are frozen during the look-ahead process that initializes S_{k+1}; this implies that no time is spent back-propagating through that region of the network. Moreover, the depth of the look-ahead tends to be considerably smaller than the depth of the final network; therefore, the training of the look-ahead is not considered expensive.

Fig. 1. A schematic description of incremental training for a CNN partitioned into K sub-networks. Step 1 depicts the original network, already split into K sub-networks followed by the classifier layer. The incremental training process starts with the sub-network S_1, as shown in Step 2. The learning curve is monitored and, when the termination criterion described in Section 3 is met, the training is stopped. S_2 is inserted in the current network in Step 3 and is trained only for a few epochs based on the frozen weights of S_1. The obtained network S_1 + S_2 in Step 4 follows the same process as Steps 2 and 3. The process finishes either when all the sub-networks are incorporated or when a custom criterion is met.

Fig. 2. Visualization of the termination criterion. Every window-size (ws) epochs we compute the angle between the linear approximation of the last ws accuracy points and the x-axis. This angle shows the improvement of the validation accuracy. The training is stopped when α_i ≤ γ α_{i-1}, where γ is a predefined threshold and α_i is the angle characterizing the accuracy for the i-th window.
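As a concrete sketch of the procedure above, the Python code below (an illustrative sketch, not the authors' implementation) implements the window-based termination criterion of Fig. 2 and the look-ahead warm-up with frozen sub-networks. The helper `train_fn(model, epochs)`, the sub-network list, and the shape compatibility between each sub-network and the classifier block are assumptions of the sketch.

```python
import math
import numpy as np
import torch.nn as nn

def window_angle(val_acc, ws):
    """Angle between the x-axis and the least-squares line fitted to the
    last `ws` validation-accuracy points (cf. Fig. 2)."""
    y = np.asarray(val_acc[-ws:], dtype=float)
    slope = np.polyfit(np.arange(len(y)), y, 1)[0]
    return math.atan(slope)

def plateaued(angles, gamma):
    """Termination criterion: alpha_i <= gamma * alpha_{i-1}."""
    return len(angles) >= 2 and angles[-1] <= gamma * angles[-2]

def set_trainable(module, flag):
    """Freeze/unfreeze a sub-network during the look-ahead initialization."""
    for p in module.parameters():
        p.requires_grad = flag

def incremental_training(subnets, classifier, train_fn,
                         ws=10, gamma=0.5, lookahead_epochs=3):
    """High-level driver. `train_fn(model, epochs)` is a hypothetical helper
    that trains `model` for `epochs` epochs and returns the list of per-epoch
    validation accuracies; any real data pipeline/optimizer can be plugged in.
    Each sub-network is assumed to be shape-compatible with the classifier."""
    active = [subnets[0]]
    for nxt in subnets[1:] + [None]:
        model = nn.Sequential(*active, classifier)
        angles = []
        while True:
            acc = train_fn(model, ws)              # train one window of ws epochs
            angles.append(window_angle(acc, ws))
            if plateaued(angles, gamma):
                break
        if nxt is None:
            break
        # Look-ahead: freeze S_1..S_k, warm up S_{k+1} (and the classifier) briefly.
        for s in active:
            set_trainable(s, False)
        warm = nn.Sequential(*active, nxt, classifier)
        train_fn(warm, lookahead_epochs)
        for s in active:
            set_trainable(s, True)
        active.append(nxt)
    return nn.Sequential(*active, classifier)
```

Freezing S_1, ..., S_k during the warm-up means gradients are only computed for the new sub-network and the classifier, which is why the look-ahead adds little overhead in this sketch.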
Table 1. Configuration details adopted during regular and incremental training. Panel A: Data pre-processing. Panel B: Hyper-parameters.

Panel A: Data pre-processing
  Train/Validation split   10%
  Width shift              4 px
  Height shift             4 px
  Horizontal flip          Yes

Panel B: Hyper-parameters
  Optimizer                RMSProp [14]
  Momentum                 0.9
  Learning rate            10^-4 (no decay)
  Batch size               128 (ResNet-X), 32 (VGG-16)
  Weight decay             10^-4
  Weight initialization    He [15]

4 Experiments

In this section, we train several state-of-the-art CNNs to compare the performance of our incremental training method with that of the classical algorithm. All runs involve single-precision arithmetic and are performed on IBM POWER8 Minsky compute nodes, equipped with four Nvidia P100 GPUs. (IBM, the IBM logo, ibm.com, and OpenPOWER are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. Other product and service names might be trademarks of IBM or other companies.)

In all experiments we use the CIFAR-10 dataset. CIFAR-10 is a collection of 60,000 RGB images of 32x32x3 pixels each, evenly distributed across 10 mutually exclusive classes. On top of it, we also use the simple data augmentation described in [13] and summarized in Table 1, Panel A. Concerning the baseline networks, we chose ResNet [17] and VGGNet [16]: the former is tailored for CIFAR-10, while the latter is sized for more complex datasets.

The aim of our experiments is to demonstrate the advantages of the incremental training method compared to the regular one. Therefore, in our runs we do not employ all the advanced tricks (e.g., learning-rate schedules) needed to reach the highest state-of-the-art accuracy. The values of the most relevant hyper-parameters are summarized in Table 1, Panel B.

The partitioning of the networks for the incremental training is mainly based on simplicity. We group layers with the same number of filters (in the case of convolutions) and the same number of output nodes (in the case of dense layers). We tested different types of network partitioning and observed no major impact on the overall performance.

In Fig. 3 we show a comparison between the regular and incremental training methods; for the latter, we plot results using both random and look-ahead initialization. The validation accuracy (first row) shows that the look-ahead outperforms the random initialization of the sub-networks. This behavior is better illustrated in the case of the ResNet-56 network than for VGGNet, where the over-parametrized network mitigates the advantage. In the second row we estimate the cost per inference of each sub-network in terms of FLOPs. The plots highlight the benefit of our incremental training, which performs a much lower amount of computation for the majority of the training time.
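The FLOP-per-inference estimates of Fig. 3 can be approximated with a simple counter such as the sketch below (our own rough PyTorch approximation, not the instrumentation used in the paper); it counts only convolutional and dense layers and reads output shapes from a dummy forward pass:

```python
import torch
import torch.nn as nn

def flops_per_inference(model, input_shape=(1, 3, 32, 32)):
    """Rough FLOP count (2 x multiply-adds) for one forward pass, counting
    only Conv2d and Linear layers; activations, batch-norm, pooling, and
    grouped convolutions are ignored. Output shapes are captured with
    forward hooks on a zero-filled dummy input."""
    counts = []

    def hook(layer, inputs, output):
        if isinstance(layer, nn.Conv2d):
            kh, kw = layer.kernel_size
            _, c_out, h, w = output.shape
            counts.append(2 * kh * kw * layer.in_channels * c_out * h * w)
        elif isinstance(layer, nn.Linear):
            counts.append(2 * layer.in_features * layer.out_features)

    handles = [m.register_forward_hook(hook) for m in model.modules()]
    with torch.no_grad():
        model(torch.zeros(input_shape))
    for handle in handles:
        handle.remove()
    return sum(counts)
```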
Fig. 3. Comparison in terms of accuracy (first row), inference cost (second row), and cumulative FLOPs (third row) between the training of the original network from scratch and its incremental counterpart on the CIFAR-10 dataset. Left column: ResNet-56; right column: VGGNet.

In the third and last row we observe that, while the regular version of the VGGNet reached 90% accuracy using 1.4 PFLOP, the incremental version needed 0.8 PFLOP and only 40% of the total number of parameters. In both networks the look-ahead training converged faster.

On top of these advantages, the incremental training allows us to trace the importance of each sub-network during the training. For instance, in the case of VGGNet, the last two sub-networks bring a marginal improvement (less than 1% accuracy), while increasing the inference cost by 16% and the number of parameters by 80%. In Fig. 4 we show that only 5% of the VGGNet parameters are enough to reach 85% accuracy, while with 42% we reach the baseline accuracy. Therefore, the incremental training can be of great benefit for applications in which the last 1-2% of accuracy are not crucial and a slightly lower performance is within accepted limits. Indeed, it can dynamically determine to stop the network expansion early, saving substantial computational resources without compromising accuracy.

Table 2. Classification accuracy on the CIFAR-10 test set.

  Network     Test accuracy (%)
              Baseline   Look-ahead   Random
  VGGNet      90.06      90.50        88.91
  ResNet-20   87.81      87.52        79.20
  ResNet-32   88.93      88.90        79.89
  ResNet-44   89.40      89.32        80.20
  ResNet-56   89.51      89.56        80.96

Fig. 4. Validation accuracy as a function of the number of parameters in the network. The final accuracy obtained by the VGGNet (which consists of 35M parameters) is reached with only 15M parameters with the incremental training.

Fig. 5. Training of VGGNet on CIFAR-10 images resized to 224x224x3 pixels. For each sub-network we used the maximum possible batch size that fitted in the GPU memory. This allowed a larger learning rate for the first sub-networks, which led to a faster convergence than the baseline.

5 Conclusion

We proposed an incremental training method for CNNs. We demonstrated that our method reaches the same or slightly better accuracy than regular training methods on state-of-the-art networks. This was achieved by using a more informed initialization of the network extension, which we call the look-ahead. Clear benefits of the incremental training are the faster convergence, the intuitive understanding of the importance of each sub-network in the overall performance, and the smooth synergy between training and optimal network-depth discovery. In future work, we plan to generalize the present approach to other types of networks, such as recurrent neural networks and long short-term memory networks.

References

1. Corinna Cortes, Xavi Gonzalvo, Vitaly Kuznetsov, Mehryar Mohri, and Scott Yang. AdaNet: Adaptive structural learning of artificial neural networks. arXiv preprint arXiv:1607.01097, 2016.
2. Jae-Yoon Jung and James A. Reggia. The automated design of artificial neural networks using evolutionary computation. In Success in Evolutionary Computation, pages 19-41. Springer, 2008.
3. Dong Yu, Li Deng, Frank Torsten Bernd Seide, and Gang Li. Discriminative pretraining of deep neural networks, May 30 2013. US Patent App. 13/304,643.
4. Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.
5. Hector Mendoza, Aaron Klein, Matthias Feurer, Jost Tobias Springenberg, and Frank Hutter. Towards automatically-tuned neural networks. In Workshop on Automatic Machine Learning, pages 58-65, 2016.
6. Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Quoc Le, and Alex Kurakin. Large-scale evolution of image classifiers. arXiv preprint arXiv:1703.01041, 2017.
7. Junqi Jin, Ziang Yan, Kun Fu, Nan Jiang, and Changshui Zhang. Neural network architecture optimization through submodularity and supermodularity. arXiv preprint arXiv:1609.00074, 2016.
8. Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.
9. Tobias Domhan, Jost Tobias Springenberg, and Frank Hutter. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In IJCAI, pages 3460-3468, 2015.
10. Tianqi Chen, Ian Goodfellow, and Jonathon Shlens. Net2Net: Accelerating learning via knowledge transfer. arXiv preprint arXiv:1511.05641, 2015.
11. Tao Wei, Changhu Wang, Yong Rui, and Chang Wen Chen. Network morphism. In International Conference on Machine Learning, pages 564-572, 2016.
12. Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818-833. Springer, 2014.
13. Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeply-supervised nets. In Artificial Intelligence and Statistics, pages 562-570, 2015.
14. Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2), 2012.
15. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026-1034, 2015.
16. Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
17. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
18. Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.