LifeClef 2017 Plant Identification Challenge: Classifying Plants Using Generic-Organ Correlation Features

Sue Han Lee, Yang Loong Chang, and Chee Seng Chan
Centre of Image & Signal Processing, Fac. Comp. Sci. & Info. Tech., University of Malaya, Malaysia
{leesuehan,yangloong}@siswa.um.edu.my, cs.chan@um.edu.my

Abstract. This paper describes our approach to the multi-organ plant identification task of the LifeClef 2017 challenge [8]. The objective of the challenge is to evaluate to what extent machine learning and computer vision can learn from noisy data compared to trusted data. To address the challenge, we employ our recently proposed hybrid generic-organ convolutional neural network, abbreviated HGO-CNN [11], and train it on different compositions of the plant datasets. Overall, all of our submitted runs obtained comparable results in the LifeClef 2017 plant classification task.

Keywords: Plant classification, deep learning, convolutional neural network

1 Introduction

Plant classification has received particular attention in the computer vision field [10] due to its important implications for agricultural automation and environmental conservation. Along with recent advances in science and technology, automatic plant species recognition has become feasible and can assist botanists in plant identification tasks [10]. For example, an efficient plant recognition system based on the Local Binary Pattern [13] enables the classification of medicinal plants [12], and robotic weed control systems have driven studies on automatic plant identification in agronomic research, aiming at crop improvement through the recognition of crop plants and the elimination of weeds [5]. Despite these advances, automatic plant recognition, a foundational capability in this context, is still in its early stages.

In 2013, the LifeClef challenge [4] provided the first multi-organ plant dataset, which became the first multi-organ plant classification benchmark for the computer vision community. This year, LifeClef 2017 [8] offers a larger amount of plant biodiversity data [3]. The objective is to identify 10000 species from images of plants collected through two different channels: a "trusted" training set and a "noisy" training set. The trusted training set is collected from the online collaborative Encyclopedia Of Life (EoL), which aggregates content from sources such as Wikipedia, iNaturalist and Flickr, while the noisy training set is collected from Google and Bing image search results. In this challenge, we employ our recently proposed convolutional neural network (CNN) architecture, namely the HGO-CNN [11], with small refinements. Specifically, it integrates both generic and organ-specific information for the multi-organ plant classification task.

Fig. 1: Our proposed HGO-CNN architecture

The rest of this working note is organized as follows. In Section 2, we present the methodology of our proposed architecture. Section 3 describes its training scheme. Section 4 reports the experiments and results on both the validation and testing sets. Lastly, Section 5 presents the conclusion and future work.

2 Method Description

Unlike previous approaches [1, 2] that trained a CNN to capture solely a generic representation from plant images, the HGO-CNN [11] encapsulates both organ and generic information prior to plant classification. We consider features from the organs because plant organs are, in general, known prior to the examination of their characteristics.
For instance, when botanists study a leaf, they focus on leaf characters such as the margin or venation, and when they study a flower, they focus on the characteristics of the petals, sepals and stamens to identify the plant species. We therefore believe that a better recognition method for plant species requires prior information about their respective organs.

The proposed HGO-CNN comprises four layers or components: (i) a shared layer, (ii) an organ layer, (iii) a generic layer, and (iv) a species layer. We introduce a shared layer for both the generic and organ components. The reasons are threefold. First, [17, 16] demonstrated that the preceding layers in deep networks respond to low-level features such as corners and edges. Since both the higher-level generic and organ components require low-level features to build higher-level features, we introduce shared preceding layers for these components. Second, according to [16], the shared layer may reduce the floating point operations and memory footprint of the network execution, which are important for real-world applications. Lastly, using a shared layer helps to reduce the number of training parameters, which is beneficial to the architecture's computational efficiency.

Fig. 1 depicts the configuration of our proposed model. The input to our model is a colour image of 224 × 224 pixels. For the convolutional layers, we use 3 × 3 convolution filters with the spatial resolution preserved using stride 1. Max pooling is performed over a 2 × 2 pixel window with stride 2. Three fully connected layers, with 4096, 4096 and 10000 channels respectively, follow the stacks of convolutional layers. Finally, the HGO-CNN output is fed into a softmax layer to produce the softmax output. For the q-th class, the softmax output is defined as

$P_n^{(q)} = \frac{e^{s_q}}{\sum_{m=1}^{M} e^{s_m}}$

where $M$ is the total number of classes and $s_q$ is the prediction score for class $q$. After the softmax operation, the softmax loss $L$ is computed as follows:

$L = \frac{1}{B} \sum_{n=1}^{B} -\log\left(P_n^{(T)}\right)$   (1)

where $B$ is the batch size and $T$ is the ground-truth class label of the $n$-th input image.

In this challenge, we refine some of the configurations of the original HGO-CNN architecture: (1) Batch Normalization (BN) [6] is included; we add BN from the last convolutional layer of the generic and organ components through to the fully connected layers. This enhances the correlation of representation learning between the two components, making it more robust to non-linearities. (2) During feature fusion, feature summation is performed instead of concatenation to further amplify the correspondences between these features.

3 Training

Pre-Training Two-Path CNN. We design a two-path CNN, shown in Fig. 2, for the purpose of training two different components: the generic and the organ-specific features. This two-path CNN is initially pre-trained using the ImageNet challenge dataset [14].

Fig. 2: A two-path CNN architecture

Organ layer. After obtaining the pre-trained two-path CNN, one of the CNN paths is repurposed for the organ task. This organ layer is trained together with the shared layer, using seven kinds of predefined organ labels. We obtain organ-based feature maps, $x^{org} \in \mathbb{R}^{H \times W \times Z}$, where $H$, $W$ and $Z$ are the height, width and number of channels of the respective feature maps. Since the PlantClef2017 dataset does not provide organ information for every plant image, we train the organ layer on the previous PlantClef2015 training set.
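To make the two-path structure concrete, the following is a minimal PyTorch sketch of a shared trunk feeding an organ path and a generic path (our model was implemented in Caffe, not PyTorch); the number of blocks, the channel widths and the pooled classification heads are illustrative assumptions, while the 3 × 3 convolutions, 2 × 2 max pooling, the seven organ classes and the 10000 species classes follow the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(c_in, c_out):
    # Two 3x3 convolutions (stride 1, padding 1) followed by 2x2 max pooling,
    # mirroring the VGG-style configuration described in Section 2.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=2, stride=2),
    )

class TwoPathCNN(nn.Module):
    """Shared low-level trunk with an organ path and a generic path (cf. Fig. 2)."""

    def __init__(self, num_organs=7, num_species=10000):
        super().__init__()
        # Shared layer: low-level features (edges, corners) reused by both paths.
        self.shared = nn.Sequential(conv_block(3, 64), conv_block(64, 128))
        # Organ and generic paths; the depths and widths here are assumptions.
        self.organ_path = nn.Sequential(conv_block(128, 256), conv_block(256, 512))
        self.generic_path = nn.Sequential(conv_block(128, 256), conv_block(256, 512))
        # Simple pooled heads used only for the separate pre-training stages.
        self.organ_head = nn.Linear(512, num_organs)      # trained with organ labels
        self.generic_head = nn.Linear(512, num_species)   # trained with species labels

    def forward(self, img):
        h = self.shared(img)
        x_org = self.organ_path(h)    # organ-based feature maps
        x_gen = self.generic_path(h)  # generic feature maps
        return (self.organ_head(F.adaptive_avg_pool2d(x_org, 1).flatten(1)),
                self.generic_head(F.adaptive_avg_pool2d(x_gen, 1).flatten(1)))
```

When the generic path is trained in the next step, the shared trunk can be kept fixed by disabling its gradients, e.g. `for p in model.shared.parameters(): p.requires_grad = False`, which plays the role of the zero learning rate described below.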
Generic layer. After training the organ layer, the other CNN path is repurposed for the generic task. This generic layer is trained using the species labels, regardless of organ information. We obtain generic feature maps, $x^{gen} \in \mathbb{R}^{H \times W \times Z}$. To allow both the organ and generic layers to share the common preceding layer, we keep the shared layer's weights fixed. This is achieved by setting its learning rate to zero.

Species layer. To introduce correlation between the organ and generic components, a fusion function $y = g(x^{org}, x^{gen})$ is employed at stage L (after the last convolutional layer of both components, as shown in Fig. 1) to produce the organ and generic correlation feature maps, $y \in \mathbb{R}^{H \times W \times Z}$. In our model, $g$ performs a summation of the two sets of features:

$y_{i,j,k} = x^{org}_{i,j,k} + x^{gen}_{i,j,k}$   (2)

where $1 \leq i \leq H$, $1 \leq j \leq W$, $1 \leq k \leq Z$. The feature maps $y$ then go through two convolutional layers to learn the combined representation of generic and organ features. Since these two convolutional layers are newly initialised with random weights, we set their learning rate to be 10 times higher than that of the other layers during training.

4 Experiments and Results

Our architecture is trained using the Caffe framework [7]. The networks are trained with back-propagation, using stochastic gradient descent [9]. For the training parameters, we employ a fixed learning rate policy: the learning rate is set to 0.01 and decreased by a factor of 10 when the validation accuracy stops improving. The momentum is set to 0.9 and the weight decay to 0.0001. We run the experiments on an NVIDIA K40 graphics card.

4.1 Data Preparation

For the trusted training set, we first downloaded all 256287 images. We then randomly selected 208878 images for training and 47409 images for validation. To increase the robustness of the system in recognising multi-organ plant images, multi-scale training was adopted. We isotropically rescaled the training images to three different sizes, 256, 385 and 512, then randomly cropped 224 × 224 pixel regions from the rescaled images to feed into the network for training. In this way, crops from the larger rescaled images correspond to a small part of the image, typically a sub-part of an organ, while crops from the smaller rescaled images correspond to the global structure of the plant. In addition, we increased the data size by mirroring the input images during training. After this data augmentation, we obtained 626634 training images and a validation set of 142227 images. For the noisy dataset, due to resource limitations we only managed to crawl 918216 images, about 60% of the total number of images available on the web. We then separated these into 738716 images for training and 179500 images for validation. We performed the same data augmentation to produce another training set containing 2216148 images and a validation set of 538500 images. For the testing set, all 25170 images were downloaded and augmented in the same way.
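As a rough illustration of this multi-scale augmentation, the sketch below uses Pillow; the helper names are ours, and the assumption that the rescaling targets the shorter image side (as in common VGG-style pipelines) is ours as well, since the text only specifies the three sizes, the random 224 × 224 crops and the mirroring.

```python
import random
from PIL import Image

SCALES = (256, 385, 512)  # isotropic rescaling targets (Section 4.1)
CROP = 224                # network input size

def isotropic_rescale(img, target):
    # Rescale so the shorter side equals `target`, preserving the aspect ratio.
    w, h = img.size
    s = target / min(w, h)
    return img.resize((round(w * s), round(h * s)), Image.BILINEAR)

def multiscale_crops(img, train=True):
    """Return one 224x224 crop per scale, plus its horizontally mirrored copy."""
    out = []
    for scale in SCALES:
        rescaled = isotropic_rescale(img, scale)
        w, h = rescaled.size
        if train:  # random crop during training
            x, y = random.randint(0, w - CROP), random.randint(0, h - CROP)
        else:      # centre crop otherwise
            x, y = (w - CROP) // 2, (h - CROP) // 2
        crop = rescaled.crop((x, y, x + CROP, y + CROP))
        out.append(crop)
        out.append(crop.transpose(Image.FLIP_LEFT_RIGHT))  # mirrored copy
    return out
```

Crops taken from the 512-pixel rescaling thus cover a small sub-part of an organ, while crops from the 256-pixel rescaling capture the global structure of the plant.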
4.2 Experimental results on validation set

For the evaluation on the validation set, the softmax output of our CNN model was first collected for each image. An averaging fusion was then used to combine the softmax scores of the augmented validation images. In this experiment, we computed the top-1 classification accuracy (Acc) to assess the robustness of the system. We compared our method to a generic network, the VGG-16 net [15].

Table 1: Performance comparison
Method                      | Trusted (Acc) | Noisy (Acc)
Finetuned VGG-16 top layer  | 0.44          | 0.44
HGO-CNN                     | 0.45          | 0.42

Table 1 shows the comparison of the performance results. We can observe that the VGG-16 net performed better on the noisy dataset, while our proposed HGO-CNN performed better on the trusted dataset. There are two possible reasons: (1) the organ layer in the HGO-CNN, which was trained on the previous PlantClef2015 dataset, might not be robust enough to model such a large and diverse dataset; (2) the noisy dataset in this case is better modelled using generic features, regardless of organ information, as the generic features might include many independent plant features that help the classification performance.

4.3 Experimental Results on Test set

We submitted four runs to the LifeClef 2017 challenge. We finetuned all models using both the training and validation sets to increase their robustness. To obtain observation-level predictions, an averaging fusion method was employed to combine the results of the test images that share the same observation id. Performance was evaluated using the Mean Reciprocal Rank (MRR). The characteristics of each run are stated below:

– UM Run 1: the proposed HGO-CNN trained on the trusted set only.
– UM Run 2: the VGG-16 net trained on the noisy set only.
– UM Run 3: combined results of UM Run 1 and UM Run 2 based on averaging fusion at the image level.
– UM Run 4: combined results of UM Run 1 and UM Run 2 based on max voting at the image level.

Fig. 3: Results of the LifeClef2017 multi-organ plant classification task

Fig. 3 shows the overall results of the LifeClef2017 multi-organ plant classification task. We observe that Run 2, ranked 12th out of a total of 28 runs, is the best among our submitted runs, while Run 1, ranked 19th, shows the lowest result. We therefore deduce that the HGO-CNN fusion model, as currently trained, is not generalized enough to predict unseen testing images. Run 3 (ranked 13th) and Run 4 (ranked 15th), which are the combined results of Run 1 and Run 2, are ranked lower than Run 2. This is clearly due to the poor performance of the Run 1 model. Furthermore, Run 3 is ranked higher than Run 4, an indication that the averaging fusion method performs better than the max voting method.

Next, we compared the results of our submitted runs for the different compositions of the dataset. Table 2 shows the results for the trusted training set (EOL). We observe that our UM Run 1 provides a comparable result, ranked 5th out of a total of 11 runs. Table 3 and Table 4 show the results for the noisy training set (WEB) and for the combined training sets (WEB+EOL), respectively.

Table 2: Performance comparison for the trusted training set (EOL)
Method                 | MRR
CMP Run 3              | 0.807
FHDO BCSG Run 1        | 0.792
KDETUT Run 1           | 0.772
CMP Run 4              | 0.733
UM Run 1               | 0.700
PlantNet Run 1         | 0.613
SabanciUGebzeTU Run 2  | 0.581
UPB HES SO Run 3       | 0.361
UPB HES SO Run 4       | 0.361
UPB HES SO Run 1       | 0.326
UPB HES SO Run 2       | 0.305

Table 3: Performance comparison for the noisy training set (WEB)
Method                 | MRR
KDETUT Run 2           | 0.824
UM Run 2               | 0.799

Table 4: Performance comparison for the noisy training set (WEB+EOL)
Method                 | MRR
MarioTsaBerlin Run 4   | 0.920
KDETUT Run 4           | 0.853
KDETUT Run 3           | 0.837
UM Run 3               | 0.798
UM Run 4               | 0.789
SabanciUGebzeTU Run 4  | 0.638
SabanciUGebzeTU Run 1  | 0.636
SabanciUGebzeTU Run 3  | 0.622
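For illustration, the observation-level fusion and the MRR evaluation used for the test runs (Section 4.3) can be sketched as follows; the variable names are hypothetical, and max voting is interpreted here as an element-wise maximum over the image-level scores, which may differ from the exact voting scheme used in Run 4.

```python
from collections import defaultdict
import numpy as np

def observation_predictions(image_scores, obs_ids, combine="average"):
    """Fuse per-image softmax scores into one score vector per observation id.

    image_scores: (N, C) array of softmax scores, one row per test image.
    obs_ids: length-N list; images of the same plant share an observation id.
    combine: "average" (as in Run 3) or "max" (a max-voting variant, as in Run 4).
    """
    grouped = defaultdict(list)
    for scores, oid in zip(image_scores, obs_ids):
        grouped[oid].append(scores)
    fused = {}
    for oid, rows in grouped.items():
        stack = np.stack(rows)
        fused[oid] = stack.mean(axis=0) if combine == "average" else stack.max(axis=0)
    return fused

def mean_reciprocal_rank(fused, truth):
    """MRR over observations; `truth` maps an observation id to its class index."""
    reciprocal_ranks = []
    for oid, scores in fused.items():
        order = np.argsort(-scores)                           # classes sorted by score
        rank = int(np.where(order == truth[oid])[0][0]) + 1   # 1 = best rank
        reciprocal_ranks.append(1.0 / rank)
    return float(np.mean(reciprocal_ranks))
```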
We believe that the performance could have been better had the organ layer in the HGO-CNN been trained on the latest PlantClef2017 dataset. However, this was not possible because organ information is not provided for most of the images. In Table 3, our run has the lowest rank, but we used only 60% of the noisy WEB dataset; moreover, with only two participants in this category, the comparison is hardly thorough. Lastly, in Table 4, we observe comparable results for our submitted runs. Nevertheless, we believe that our performance can still be improved. In the current experiments, we trained the CNN models separately on two different datasets and inferred the results using averaging fusion. The performance could have been better if both datasets were used to train a single end-to-end CNN model, without the need for external fusion to infer the species. Moreover, training on 100% of the noisy images might further boost the classification performance.

5 Conclusions and Future work

This working note describes the implementation of the HGO-CNN for the PlantClef2017 challenge. We presented the methodology of our proposed architecture and analysed the results on both the validation and testing sets. We observed that our current HGO-CNN model is not generalized enough to predict unseen testing images. This might be due to the lack of robustness of the organ layer, which was trained on the previous PlantClef2015 dataset. In the future, we will revise our proposed model to increase its robustness by re-training all layers on the latest datasets, incorporating both trusted and noisy images.

Acknowledgement

This research is supported by the Postgraduate (PPP) Grant PG007-2016A from the University of Malaya; the K40 GPU used in this work was donated by NVIDIA Corporation.

References

1. Champ, J., Lorieul, T., Servajean, M., Joly, A.: A comparative study of fine-grained classification methods in the context of the lifeclef plant identification challenge 2015. In: CLEF 2015. vol. 1391 (2015)
2. Choi, S.: Plant identification with deep convolutional neural network: Snumedinfo at lifeclef plant identification task 2015. In: Working notes of CLEF 2015 conference (2015)
3. Goëau, H., Bonnet, P., Joly, A.: Plant identification based on noisy web data: the amazing performance of deep learning (lifeclef 2017). In: CLEF working notes 2017 (2017)
4. Goëau, H., Bonnet, P., Joly, A., Bakic, V., Barthélémy, D., Boujemaa, N., Molino, J.F.: The imageclef 2013 plant identification task. In: CLEF (2013)
5. Haug, S., Michaels, A., Biber, P., Ostermann, J.: Plant classification system for crop/weed discrimination without segmentation. In: Applications of Computer Vision (WACV), 2014 IEEE Winter Conference on. pp. 1142–1149. IEEE (2014)
6. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Proceedings of the 32nd International Conference on Machine Learning (ICML-15). pp. 448–456. JMLR Workshop and Conference Proceedings (2015)
7. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. In: Proc. of the ACM International Conference on Multimedia. pp. 675–678 (2014)
8. Joly, A., Goëau, H., Glotin, H., Spampinato, C., Bonnet, P., Vellinga, W.P., Lombardo, J.C., Planqué, R., Palazzo, S., Müller, H.: Lifeclef 2017 lab overview: multimedia species identification challenges. In: CLEF 2017 Proceedings, Springer Lecture Notes in Computer Science (LNCS) (2017)
9. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems. pp. 1097–1105 (2012)
10. Lee, S.H., Chan, C.S., Mayo, S.J., Remagnino, P.: How deep learning extracts and learns leaf features for plant classification. Pattern Recognition 71, 1–13 (2017)
11. Lee, S.H., Chan, C.S., Remagnino, P.: Hgo-cnn: Hybrid generic-organ convolutional neural network for multi-organ plant classification. In: ICIP (2017)
12. Naresh, Y., Nagendraswamy, H.: Classification of medicinal plants: an approach using modified lbp with symbolic representation. Neurocomputing 173, 1789–1797 (2016)
13. Ojala, T., Pietikainen, M., Maenpaa, T.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. TPAMI 24(7), 971–987 (2002)
14. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115(3), 211–252 (2015)
15. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556 (2014)
16. Yan, Z., Zhang, H., Piramuthu, R., Jagadeesh, V., DeCoste, D., Di, W., Yu, Y.: Hd-cnn: Hierarchical deep convolutional neural networks for large scale visual recognition. In: Proc. of the IEEE International Conference on Computer Vision. pp. 2740–2748 (2015)
17. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: Computer vision–ECCV 2014, pp. 818–833. Springer (2014)