Residual Network with Delayed Max Pooling
      for Very Large Scale Plant Identification

                       Siang Thye Hang and Masaki Aono

         Knowledge Data Engineering and Information Retrieval Laboratory,
                Department of Computer Science and Engineering,
                    Toyohashi University of Technology, Japan
                      hang@kde.cs.tut.ac.jp, aono@tut.jp


        Abstract. In our approach, we applied a few modifications to the 50-
        layered Residual Network. Our preliminary experiments with the Plant-
        CLEF 2016 dataset showed that the modifications improved classifica-
        tion performance. We have trained three models based on the modified
        Residual Network configuration with different combinations of trusted
        and noisy PlantCLEF 2017 datasets. Using confidence scores extracted
        from the three models, we have submitted four runs and our methods
        showed competitive classification performance.

        Keywords: Plant Identification, Deep Learning, Down-sampling


1     Residual Network with Delayed Max Pooling
We applied a few modifications to the 50-layered Residual Network by He et
al. [3], which is also known as ResNet-50.

1.1    Max Pooling Based Down-sampling
In ResNet-50, a total of three convolution operations1 with stride size 2 but filter
size 1 × 1 is used for down-sampling. As the filter size is smaller than stride size,
part of the activations may be ignored in the filtering i.e. convolution processes,
as demonstrated in Figure 1.
     Therefore, we reduced stride size of the three convolution operations in
ResNet-50 from 2 to 1. In addition to this, max pooling of stride size 2 and
filter size 2 × 2 are inserted before these convolution operations. We label this
modified configuration as ResNet-50-MP.

1.2    Delayed Down-sampling
Down-sampling is an essential element in Convolutional Neural Network, which
reduces number of activations (as well as computational complexity). However,
1
    Namely res{3,4,5}a branch2a based on the ResNet-50 model definition in
    github.com/KaimingHe/deep-residual-networks
    (a) 3 × 3 filter, stride 2   (b) 2 × 2 filter, stride 2     (c) 1 × 1 filter, stride 2


Fig. 1. Coverage of stride 2 based filtering with filter size 3 × 3, 2 × 2, and 1 × 1. When
filter size is larger than 2 × 2 e.g. (a), some of the regions are overlapped, as shown in
the darker regions. When filter size is smaller than 2 × 2 e.g. (c), some of the regions
(in white) are not covered at all.


applying down-sampling too early may leave too little activations for subsequent
convolution operations, thus impacting classification performance.
     In a paper by He et al. [2], delaying down-sampling shows improved clas-
sification performance. In their method, stride size of pooling operations are
reduced from 2 to 1, and stride size of their subsequent convolution operations
are increased from 1 to 2. In our method, we simply switched the position of the
newly introduced max pooling operations (in Subsection 1.1) with their subse-
quent convolution operations. We label this configuration as ResNet-50-MPD.


2     Network Training and Testing

This section illustrates our implementation for the plant identification task i.e.
PlantCLEF. We use the Caffe framework by Jia et al. [5] to implement all of the
configurations. Note that all configurations are trained from scratch i.e.
no pretrained weights are used.


2.1     Input Data

Data Augmentation. Training images are randomly scaled such that shorter
sides are in the range of [224, 336]. Scaled images are randomly rotated for ±45◦ .
Rotated images are randomly cropped into 224 × 224, and finally the cropped
images are randomly horizontal flipped.
    As for test images, they are scaled such that the shorter sides become 224.
The scaled images are then horizontally flipped. Both flipped and non-flipped
images are applied into a trained network for prediction. Class-wise outputs
(before softmax normalization) of both instances are averaged and then softmax
normalized.
Input Normalization. Mean centering of input is already a common proce-
dure to train or test a network. However, this alone may not be sufficient as
variance is not considered in this process.
    Therefore, we attempted to normalize at higher order by applying Batch Nor-
malization [4] directly onto the (augmented) input. With Batch Normalization,
an input is normalized into zero mean and unit variance (and then scaled and
shifted accordingly). As the number of filters are quite limited i.e. 64 in the very
first convolution layer of ResNet i.e. conv1, such normalization should facilitate
this convolution layer to learn filters of more varying features.

2.2   Preliminary Experiments with PlantCLEF 2016 Dataset
Dataset Preparation. ResNet-50, ResNet-50-MP and ResNet-50-MPD con-
figurations in Section 1 are trained for 100 epochs with PlantCLEF 2016 training
dataset. After each training epoch, each configuration is validated with Plant-
CLEF 2016 test dataset. Omitting unseen images i.e. of ClassId 9999, a total
of 113204 training images and 4510 test images are utilized. Note that data
augmentation and normalization as detailed in Section 2.1 are applied to all
configurations.

Batch Size. The hardware we use for preliminary experiments is a single
NVIDIA’s Tesla K40. We use largest possible batch sizes i.e. based on the hard-
ware’s memory limitation of 12 GiB.

Learning Schedule. Initial learning rate is 0.1 and is multiplied by 0.1 twice
throughout the training process. Training iterations for each learning rate is
divided with ratio 4:2:1 across 100 training epochs.
   In summary, batch sizes for both original ResNet-50 and ResNet-50-MP are
maximized at 31 and they are trained with learning rates 0.1, 0.01, 0.001 for
208671, 104336, 52168 iterations respectively. As for ResNet-50-MPD, the largest
possible batch size is 21 and this configuration is trained with the same learning
rates for 308038, 154019, 77010 iterations respectively. Validation accuracy of
the whole training process for all three configurations is shown in Figure 2.

2.3   Experiments with PlantCLEF 2017 Dataset
Based on the results as shown in Figure 2, we selected the ResNet-50-MPD
configuration for this year’s plant identification task.

Dataset Preparation. Out of the 256287 trusted training images provided by
                                                                               1
the task organizers of PlantCLEF 2017 [6][1], we randomly selected around 10
of the images for validation purpose. Specifically, after separating 25063 images
for validation, 231224 images remain for ‘trusted’ training. As for noisy images,
out of the provided 1442642 metadata, we managed to obtain 99.0% of them,
specifically 1428395 images.
               1.0
                     original ResNet-50
               0.8   ResNet-50-MP
                     ResNet-50-MPD
    Accuracy


               0.6


               0.4


               0.2


                 0      20           40            60            80           100
                                          Epoch

Fig. 2. Validation accuracy of ResNet-50, ResNet-50-MP and ResNet-50-MPD config-
urations using PlantCLEF 2016 dataset. Base learning rate 0.1 is dropped to 0.01 at
around 57th epoch, and is further dropped to 0.001 at around 86th epoch, as indicated
by the vertical dotted lines.


   Different combinations of trusted and noisy images are used to train the
ResNet-50-MPD configuration. Our first model is trained with trusted training
images only2 , and our second model is trained with noisy training images only3 ,
while our third model is trained with both mixed together. We label the first as
model T, the second as model N, and the third as model X. All three models
are validated with the 25063 validation images.

Batch Size. We use a single NVIDIA’s Quadro P6000 to train the three
models. With 24 GiB memory, we were able to use batch size up to 47.

Learning Schedule. As the trusted and noisy datasets are a lot larger com-
pared to last year’s, we were not able to train for 100 epochs but a fixed amount
of training iterations. All three models are trained for 350000 iterations: learn-
ing rate 0.1 for 200000 iterations, 0.01 for 100000 iterations and 0.001 for 50000
iterations. In other words, model T was trained for around 71 epochs, model
N was trained for around 12 epochs, while model X was trained for around 10
epochs. Validation accuracy of the training processes is summarized in Figure 3.

Run Submission. A total of 25170 test images are provided by the task
organizers. As detailed Subsection 2.1, each test image is scaled and then hor-
izontally flipped. Both flipped and non-flipped test images are applied into all
2
    Corresponds to training set E in imageclef.org/lifeclef/2017/plant
3
    Corresponds to training set W
                1.0
                         model T (trusted only)
                0.8      model N (noisy only)
                         model X (trusted and noisy)
    Accuracy


                0.6


                0.4


                0.2


                  0      5       10        15          20       25   30        35
                                                            4
                                         Iteration (×10 )

Fig. 3. Validation accuracy of ResNet-50-MPD trained with different combinations of
trusted and noisy PlantCLEF 2017 datasets, namely model T, N and X. All of the
models are trained for 350000 iterations (but different number of training epochs due
to different number of image in each dataset).


three models i.e. model T, N and M. For each model, class-wise average of con-
fidence scores (before softmax normalization) extracted from both flipped and
non-flipped images is computed. Additionally, class-wise average of confidence
scores with the same ObservationId is also computed. The averaged confidence
scores are then softmax normalized. We have submitted four runs with team
name KDETUT for this year’s plant identification task, as summarized below.

 – KDETUT Run 1: Based on model T (trained with trusted images only)
 – KDETUT Run 2: Based on model N (trained with noisy images only)
 – KDETUT Run 3: Based on model X (trained with trusted and noisy img.)
 – KDETUT Run 4: Average of confidence scores based on model T and N


3              Evaluation Result

Mean Reciprocal Rank (MRR) is used as evaluation metric. Evaluation results
released by the task organizers are shown in Table 1 and Figure 4. Among the
submitted four runs, Run 4, which is average of confidence scores extracted from
model T and N, shows the best classification performance.
Table 1. Evaluation results of runs submitted by PlantCLEF 2017 participants, sorted
in descending order. Our four submitted runs are highlighted in color.

             Run Name                   MRR                           Run Name    MRR
   MarioTsaBerlin Run 4                 0.920                   KDETUT Run 1      0.772
   MarioTsaBerlin Run 2                 0.915                       CMP Run 2     0.765
   MarioTsaBerlin Run 3                 0.894                       CMP Run 4     0.733
        KDETUT Run 4                    0.853                         UM Run 1    0.700
   MarioTsaBerlin Run 1                 0.847           SabanciUGebzeTU Run 4     0.638
           CMP Run 1                    0.843           SabanciUGebzeTU Run 1     0.636
        KDETUT Run 3                    0.837           SabanciUGebzeTU Run 3     0.622
        KDETUT Run 2                    0.824                    PlantNet Run 1   0.613
           CMP Run 3                    0.807           SabanciUGebzeTU Run 2     0.581
    FHDO BCSG Run 2                     0.806               UPB HES SO Run 3      0.361
    FHDO BCSG Run 3                     0.804               UPB HES SO Run 4      0.361
             UM Run 2                   0.799               UPB HES SO Run 1      0.326
             UM Run 3                   0.798               UPB HES SO Run 2      0.305
    FHDO BCSG Run 1                     0.792               FHDO BCSG Run 4       0.000
             UM Run 4                   0.789


                          1.0
   Mean Reciprocal Rank


                          0.8


                          0.6


                          0.4


                          0.2


                           0
                                   MarioTsaBerlin Run 4
                                   MarioTsaBerlin Run 2
                                   MarioTsaBerlin Run 3
                                        KDETUT Run 4
                                   MarioTsaBerlin Run 1
                                            CMP Run 1
                                        KDETUT Run 3
                                        KDETUT Run 2
                                            CMP Run 3
                                    FHDO BCSG Run 2
                                    FHDO BCSG Run 3
                                              UM Run 2
                                              UM Run 3
                                    FHDO BCSG Run 1
                                              UM Run 4
                                        KDETUT Run 1
                                            CMP Run 2
                                            CMP Run 4
                                              UM Run 1
                                SabanciUGebzeTU Run 4
                                SabanciUGebzeTU Run 1
                                SabanciUGebzeTU Run 3
                                         PlantNet Run 1
                                SabanciUGebzeTU Run 2
                                    UPB HES SO Run 3
                                    UPB HES SO Run 4
                                    UPB HES SO Run 1
                                    UPB HES SO Run 2
                                    FHDO BCSG Run 4


                                    Fig. 4. Visualization of Table 1.
4    Conclusion
In this paper, we described our approach to PlantCLEF 2017, focusing on some
modifications to the 50-layered Residual Network. Nevertheless, there are still
rooms for improvements in our approaches, as itemized below.

 – Especially models trained with very large datasets i.e. model N and X, the
   fixed amount of 350000 training iterations may not be sufficient. For example,
   as detailed in Subsection 2.3, in fact model X is only trained for around 10
   epochs. However, this setup already requires 4 days to train each model.
   More training iterations i.e. longer training time may be required to achieve
   superior classification performance.
 – Among the four runs we have submitted, Run 4 yields the highest classi-
   fication performance. It is based on average of confidence scores computed
   from model T and N (one each). We believe classification performance can
   be further improved if the confidence scores are averaged from even more
   models, for example five model T and five model N. This is however at the
   cost of multiplied computation time.


References
1. Goëau, H., Bonnet, P., Joly, A.: Plant identification based on noisy web data: The
   amazing performance of deep learning (lifeclef 2017). In: CLEF working notes 2017
   (2017)
2. He, K., Sun, J.: Convolutional neural networks at constrained time cost. In: Com-
   puter Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on (2015)
3. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition.
   In: Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on
   (2016)
4. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by
   reducing internal covariate shift. In: Proceedings of the 32nd International Confer-
   ence on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015. pp. 448–456
   (2015)
5. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadar-
   rama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding.
   In: Proceedings of the 22nd ACM international conference on Multimedia. pp. 675–
   678. ACM (2014)
6. Joly, A., Goëau, H., Glotin, H., Spampinato, C., Bonnet, P., Vellinga, W.P., Lom-
   bardo, J.C., Planqué, R., Palazzo, S., Müller, H.: Lifeclef 2017 lab overview: Multi-
   media species identification challenges