=Paper=
{{Paper
|id=Vol-1609/16090459
|storemode=property
|title=Bluefield (KDE TUT) at LifeCLEF 2016 Plant Identification Task
|pdfUrl=https://ceur-ws.org/Vol-1609/16090459.pdf
|volume=Vol-1609
|authors=Siang Thye Hang,Atsushi Tatsuma,Masaki Aono
|dblpUrl=https://dblp.org/rec/conf/clef/HangTA16
}}
==Bluefield (KDE TUT) at LifeCLEF 2016 Plant Identification Task==
Siang Thye Hang, Atsushi Tatsuma, Masaki Aono

Knowledge Data Engineering & Information Retrieval Laboratory (KDE Lab), Department of Computer Science and Engineering, Toyohashi University of Technology, Japan

hang@kde.cs.tut.ac.jp, tatsuma@cs.tut.ac.jp, aono@tut.jp

Abstract. In this paper, we propose an automatic approach for plant image identification. We enhance the well-known VGG 16-layer Convolutional Neural Network model [1] by replacing the last pooling layer with a Spatial Pyramid Pooling layer [2]. Rectified Linear Units (ReLU) are also replaced with Parametric ReLUs [3]. The enhanced model is trained without any external dataset. A post-processing method is also proposed to reject irrelevant samples. We further improve identification performance using the observation identity (ObservationId) provided in the dataset. Our methods showed outstanding performance in the official evaluation results of the LifeCLEF 2016 Plant Identification Task.

Keywords: LifeCLEF, plant identification, deep learning, sample rejection.

1 Introduction

Nowadays, conservation of biodiversity is becoming an important duty. To achieve this, accurate knowledge is essential. However, even for professionals, identifying a species can be a very difficult task.

Convolutional Neural Networks (CNNs) achieve the best performance in various image recognition tasks such as the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [4][5][1][6]. CNNs learn filters of filters through multiple convolution layers, enabling a higher level of abstraction than hand-crafted features such as the Scale Invariant Feature Transform (SIFT).

In recent years, CNNs have gained popularity in various image identification tasks including the LifeCLEF Plant Identification Task (PlantCLEF). Over the past years, participants of PlantCLEF have increasingly adopted CNNs in their work [7]. In PlantCLEF 2016 [8][9], in addition to the 1000-class identification task, a new problem is introduced: the evaluation set consists of not only native plant images but also images of potentially invasive plants or irrelevant objects such as a table or a computer keyboard. Considering this new setting, we propose a post-processing method to reject such samples.

In the next section we propose to enhance the VGG 16-layer model with Spatial Pyramid Pooling and Parametric Rectified Linear Units. We detail our model training strategy in Section 3, followed by the irrelevant sample rejection algorithm in Section 4. Section 5 shows evaluation results and Section 6 concludes this paper.

2 Model Enhancement

In the following sections, a few enhancements to the VGG 16-layer model are described.

2.1 Spatial Pyramid Pooling

The original VGG 16-layer model requires input images of spatial size 224 × 224. On the other hand, the images in the PlantCLEF 2016 dataset are arbitrarily sized, as shown in Figure 1. Under the spatial size restriction of the VGG 16-layer model, input images have to be either cropped or warped; both of these methods may lead to information loss.

Fig. 1. Aspect ratios of images in the PlantCLEF 2016 dataset (percentage of vertical, square, and horizontal images per aspect ratio bin)

To circumvent this restriction, we replace the last pooling layer (pool5) with a Spatial Pyramid Pooling (SPP) layer [2]. The last convolution layer (conv5_3) produces a feature map of 512 channels. With SPP, the feature map is spatially divided into 1 × 1, 2 × 2, and 4 × 4 grids, a total of 21 regions. Each region is then average pooled, producing a vector of fixed size 21 × 512 = 10752. Such conversion of an arbitrarily sized feature map into a fixed-size vector allows the model to accept input images of any size. Meanwhile, with this layer replacement, the number of parameters in the model is reduced from 138 million to 80 million.
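To make the pooling operation concrete, the following is a minimal NumPy sketch of spatial pyramid average pooling as described above, assuming a channels-first (C × H × W) feature map; the function name spatial_pyramid_avg_pool and the grid partitioning via np.linspace are our illustrative choices, not the authors' actual Caffe layer.

```python
import numpy as np

def spatial_pyramid_avg_pool(feature_map, levels=(1, 2, 4)):
    """Average-pool an arbitrarily sized feature map (C x H x W) over
    1x1, 2x2 and 4x4 grids and concatenate into a fixed-size vector."""
    c, h, w = feature_map.shape
    pooled = []
    for n in levels:
        # Grid cell boundaries; cells may differ slightly in size when
        # H or W is not divisible by n.
        ys = np.linspace(0, h, n + 1, dtype=int)
        xs = np.linspace(0, w, n + 1, dtype=int)
        for i in range(n):
            for j in range(n):
                region = feature_map[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
                pooled.append(region.mean(axis=(1, 2)))  # one C-dim vector per region
    return np.concatenate(pooled)  # (1 + 4 + 16) * C = 21 * 512 = 10752 for conv5_3

# Example: a conv5_3-like feature map from a non-square input
fmap = np.random.rand(512, 14, 10).astype(np.float32)
print(spatial_pyramid_avg_pool(fmap).shape)  # (10752,)
```

Because the output length depends only on the number of channels and pyramid levels, the same fully connected layers can be applied to inputs of any spatial size.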
2.2 Parametric Rectified Linear Unit

We also replace all of the Rectified Linear Unit (ReLU) activations with a learnable version known as Parametric ReLU [3], which consistently outperforms ReLU in the empirical experiments of Xu et al. [10]. Before training, we initialize the learnable parameters to 0.05. Weight decay on these learnable parameters is disabled throughout the training process.

3 Model Training

The following strategies are used to train the enhanced model.

3.1 Data preparation

The PlantCLEF 2016 dataset consists of 113204 images. 2048 images are used for validation, while the remaining 111156 images are used to train the model. We augment the training images as follows: while preserving the aspect ratio, each image is resized such that its shorter side becomes 224 and 336 pixels. Random cropping to 224 × 224 and random horizontal flipping (about the y-axis) are then applied to the resized images. For evaluation, images are resized such that the shorter side becomes 224, and no cropping or flipping is applied. Cropping is not required during evaluation because the enhanced model accepts images of any size.

3.2 Image Mean

As the enhanced model is trained from scratch, i.e. without external resources, the image mean is computed from the training set. Excluding LeafScan images (which have high RGB values due to their bright background), the computed mean values of the red, green and blue channels are 105, 111, and 79 respectively. These mean values are subtracted from the images augmented in Section 3.1 before they are used for training.
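As a rough sketch of the preparation steps in Sections 3.1 and 3.2, the code below implements the two-scale resize, random 224 × 224 crop, horizontal flip, and mean subtraction using Pillow and NumPy; the helper names and the choice of Pillow are assumptions for illustration, not the authors' actual pipeline.

```python
import random
import numpy as np
from PIL import Image

MEAN_RGB = np.array([105.0, 111.0, 79.0])  # training-set mean, LeafScan images excluded

def resize_shorter_side(img, target):
    """Resize while preserving aspect ratio so that the shorter side equals `target`."""
    w, h = img.size
    scale = target / min(w, h)
    return img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)

def augment_for_training(img, scales=(224, 336), crop=224):
    """One augmented training sample: one of two scales, random crop, random flip."""
    img = resize_shorter_side(img, random.choice(scales))
    w, h = img.size
    x = random.randint(0, w - crop)
    y = random.randint(0, h - crop)
    img = img.crop((x, y, x + crop, y + crop))
    if random.random() < 0.5:                            # random flip about the y-axis
        img = img.transpose(Image.FLIP_LEFT_RIGHT)
    return np.asarray(img, dtype=np.float32) - MEAN_RGB  # mean subtraction

def prepare_for_evaluation(img):
    """Evaluation-time preprocessing: resize only, no cropping or flipping."""
    img = resize_shorter_side(img, 224)
    return np.asarray(img, dtype=np.float32) - MEAN_RGB
```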
3.3 Training

We use the Caffe framework to train the enhanced model. To train such a deep model, we use Xavier's method [11] to initialize the weights of the convolution and fully connected layers. We train the model using Stochastic Gradient Descent with momentum 0.9, learning rates from 0.01 down to 0.0001, and batch size 50. The learning rate is multiplied by 0.1 whenever the validation accuracy stops improving. As a result, we trained with learning rate 0.01 for 30 epochs, 0.001 for 15 epochs, and finally 0.0001 for 8 epochs. Figure 2 shows the normalized softmax loss on both the training and validation sets. Figure 3 shows the top-1 validation accuracy.

Fig. 2. Training and validation loss

Fig. 3. Validation accuracy

4 Irrelevant Sample Rejection

Section 4.1 discusses the limitation of the One-Versus-All method in rejecting irrelevant samples in multiclass classification. To overcome this problem, an algorithm is proposed in Section 4.2.

4.1 Multiclass Classification with the One-Versus-All Method

Assume a binary classifier F of class C mapping an input x to a score F(x). F can be optimized such that F(x) > 0 when x ∈ C, and F(x) < 0 when x ∉ C. In the case of multiclass classification, the One-Versus-All method can be used to classify N classes C = [C1, …, CN] using N binary classifiers F = [F1, …, FN]. As shown in (1), x ∈ Ci when Fi(x) is the highest among all of the binary classifiers.

$x \in C_i \quad \text{where} \quad i = \arg\max_{1 \le k \le N} F_k(x)$    (1)

With (1), a class is guaranteed to be assigned to x, regardless of how high or low the scores produced by each classifier are. The first example (from the left) in Figure 4 is ideal, as there is only one strong positive score, so we can confidently predict that input α belongs to the first class. Meanwhile, the second and third examples have either weak or no positive scores. With (1), input β is classified to the third class, while input γ is classified to the first class, even though all of the scores are rather low or even negative.

Fig. 4. An example of class scores (N = 4) for three inputs α, β, γ

Based on the scenario above, we realize that there are cases in which a sample should be rejected by all of the binary classifiers. An algorithm to identify such irrelevant samples is elaborated in the next section.

4.2 Irrelevant Sample Rejection Algorithm

After training the enhanced model as described in Section 3, a matrix of raw (i.e. before softmax normalization) scores S of N classes and M training samples is extracted from the classifier layer (fc8). Incorrect predictions are omitted. With only the correctly predicted class scores (M′ ≤ M samples), the rejection threshold t = [t1, …, tN] of each class is computed from the class-wise minima, as shown in (2).

$t_i = \min_{k \in M'} S_{k,i}$    (2)

Figure 5 shows the threshold t of each class based on the PlantCLEF 2016 training set (N = 1000), sorted in ascending order.

Fig. 5. Rejection threshold t of the training set, sorted in ascending order

During evaluation, any sample whose score is lower than t for all of the classes is rejected as irrelevant. On the evaluation set, 195 out of 8000 images are rejected. Some of the rejected images are shown in Figure 6.

Fig. 6. Subset of rejected samples with threshold t

4.3 Taking the Validation Set into Account

The threshold t obtained in Section 4.2 is based solely on the training set. Due to factors such as overfitting, there is a possibility that lower thresholds can be acquired from the validation set. Figure 7 shows the thresholds obtained from the validation set, corresponding to the sorted classes shown in Figure 5.

Fig. 7. Rejection thresholds of the training and validation sets (classes sorted by training-set threshold)

Although we expected lower thresholds from the validation set, as shown in Figure 7, the majority of its thresholds are higher than those of the training set. This is because there are too few samples (2048 ≪ M) in the validation set. Thus, only the lower thresholds are considered. The ratio of each of these (seven) thresholds to its corresponding training-set threshold is computed, and the ratios are then averaged into Q. As detailed in (3), the denominators are the thresholds of the training set, while the numerators are those of the validation set. The values in (3) are based on Figure 7.

$Q = \frac{1}{7}\left(\frac{9.1}{10.8} + \frac{11.1}{11.3} + \frac{9.0}{12.4} + \frac{12.5}{12.6} + \frac{12.4}{13.3} + \frac{13.3}{14.3} + \frac{16.7}{18.0}\right) \approx 0.91$    (3)

Q is then multiplied with the threshold t of the training set, as shown in (4), before it is used to reject samples during evaluation.

$\boldsymbol{t}' = Q\boldsymbol{t}$    (4)

Applying t′ to the evaluation set, only 69 samples are rejected, since t′ is lower than the original t. Some of the rejected images are shown in Figure 8.

Fig. 8. Subset of rejected samples with threshold t′
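The rejection procedure of Sections 4.2 and 4.3, i.e. equations (2) to (4), can be sketched in NumPy as follows. We read the class-wise minimum in (2) as the minimum score each class assigns to its own correctly classified training samples; the function names and the toy random scores standing in for real fc8 outputs are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def rejection_thresholds(scores, labels, predictions):
    """Per-class rejection thresholds (eq. 2): the minimum raw score of each
    class, computed over its correctly classified training samples only."""
    correct = scores[labels == predictions]           # correctly predicted rows (M' x N)
    true_classes = labels[labels == predictions]
    n_classes = scores.shape[1]
    t = np.full(n_classes, np.inf)
    for i in range(n_classes):
        class_scores = correct[true_classes == i, i]  # scores of class i on its own samples
        if class_scores.size:
            t[i] = class_scores.min()
    return t

def reject_irrelevant(eval_scores, t, q=1.0):
    """Reject samples whose score is below the (scaled) threshold for every
    class (eqs. 3-4): True means the sample is rejected as irrelevant."""
    t_scaled = q * t                                   # t' = Q * t
    return np.all(eval_scores < t_scaled, axis=1)

# Toy usage with random numbers in place of real fc8 scores
train_scores = np.random.randn(100, 5) * 5
train_labels = np.random.randint(0, 5, size=100)
train_preds = train_scores.argmax(axis=1)
t = rejection_thresholds(train_scores, train_labels, train_preds)
rejected = reject_irrelevant(np.random.randn(10, 5) * 5, t, q=0.91)
print(rejected)
```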
4.4 Observation Based Identification

Identification based on a single image may be insufficient. In the PlantCLEF 2016 dataset, images taken from the same observation share a unique ObservationId, and one ObservationId may be assigned multiple images of different organs. In particular, images of flowers or fruits carry highly characteristic features, so their presence in an observation often improves identification performance.

To further improve identification performance, after rejecting samples as explained in Sections 4.2 and 4.3, we sum the raw (i.e. before softmax normalization) class scores of images with the same ObservationId. The summed scores are then softmax normalized. As a result, images with the same ObservationId share the same normalized scores.
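A minimal NumPy sketch of this observation-based aggregation is given below: raw class scores of the non-rejected images sharing an ObservationId are summed and then softmax normalized, so all images of an observation receive the same normalized scores. The grouping helper and names are our own, not the authors' implementation.

```python
import numpy as np
from collections import defaultdict

def softmax(scores):
    """Numerically stable softmax over the last axis."""
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def observation_scores(raw_scores, observation_ids):
    """Sum raw class scores per ObservationId, then softmax normalize.
    Images of the same observation share the same normalized score vector."""
    groups = defaultdict(list)
    for idx, obs_id in enumerate(observation_ids):
        groups[obs_id].append(idx)
    result = np.empty_like(raw_scores, dtype=np.float64)
    for obs_id, indices in groups.items():
        summed = raw_scores[indices].sum(axis=0)   # sum over images of the observation
        result[indices] = softmax(summed)          # shared normalized scores
    return result

# Toy usage: 4 images belonging to 2 observations, 3 classes
scores = np.random.randn(4, 3)
obs = ["obs1", "obs1", "obs2", "obs2"]
print(observation_scores(scores, obs))
```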
5 Evaluation

In our evaluation experiments, the enhanced CNN model is used to extract class scores. Four different post-processing methods are then applied to the extracted class scores, detailed as follows. As described in Sections 4.2 and 4.3, t is the rejection threshold obtained from the training set, while t′ is the rejection threshold obtained by considering both the training and validation sets.

─ Run 1: Sample rejection with t, identification based on single samples
─ Run 2: Sample rejection with t′, identification based on single samples
─ Run 3: Sample rejection with t, observation based identification
─ Run 4: Sample rejection with t′, observation based identification

In our run files (Bluefield), we provide scores for up to the top 30 classes, and rejected samples are entirely excluded from the run files. Evaluation results compared with the other participants are summarized in Table 1 and Figure 9.

Table 1. Evaluation results sorted by official score MAP (MAP: Mean Average Precision)

| Run | Official score MAP | MAP restricted to potentially invasive species | MAP ignoring unknown classes and queries |
|---|---|---|---|
| Bluefield Run 4 | 0.742 | 0.717 | 0.827 |
| SabanciUGebzeTU Run 1 | 0.738 | 0.704 | 0.806 |
| SabanciUGebzeTU Run 3 | 0.737 | 0.703 | 0.807 |
| Bluefield Run 3 | 0.736 | 0.718 | 0.820 |
| SabanciUGebzeTU Run 2 | 0.736 | 0.683 | 0.807 |
| SabanciUGebzeTU Run 4 | 0.735 | 0.695 | 0.802 |
| CMP Run 1 | 0.710 | 0.653 | 0.790 |
| LIIR KUL Run 3 | 0.703 | 0.674 | 0.761 |
| LIIR KUL Run 2 | 0.692 | 0.667 | 0.744 |
| LIIR KUL Run 1 | 0.669 | 0.652 | 0.708 |
| UM Run 4 | 0.669 | 0.598 | 0.742 |
| CMP Run 2 | 0.644 | 0.564 | 0.729 |
| CMP Run 3 | 0.639 | 0.590 | 0.723 |
| QUT Run 3 | 0.629 | 0.610 | 0.696 |
| Floristic Run 3 | 0.627 | 0.533 | 0.693 |
| UM Run 1 | 0.627 | 0.537 | 0.700 |
| Floristic Run 1 | 0.619 | 0.541 | 0.694 |
| Bluefield Run 1 | 0.611 | 0.600 | 0.692 |
| Bluefield Run 2 | 0.611 | 0.600 | 0.693 |
| Floristic Run 2 | 0.611 | 0.538 | 0.681 |
| QUT Run 1 | 0.601 | 0.563 | 0.672 |
| UM Run 3 | 0.589 | 0.509 | 0.652 |
| QUT Run 2 | 0.564 | 0.562 | 0.641 |
| UM Run 2 | 0.481 | 0.446 | 0.552 |
| QUT Run 4 | 0.367 | 0.359 | 0.378 |
| BME TMIT Run 4 | 0.174 | 0.144 | 0.213 |
| BME TMIT Run 3 | 0.170 | 0.125 | 0.197 |
| BME TMIT Run 1 | 0.169 | 0.125 | 0.196 |
| BME TMIT Run 2 | 0.066 | 0.128 | 0.101 |

Fig. 9. Evaluation results sorted by run name in alphabetical order

Bluefield Run 4 yields the highest official score among all of the participants. The use of ObservationId shows a significant improvement. On the other hand, the rejection threshold t′ that takes the validation set into account shows only a slight improvement in official MAP.

6 Conclusion

In this paper, we described our approach to PlantCLEF 2016, focusing on model enhancements, data augmentation, and an irrelevant sample rejection strategy. There is still room for improvement, itemized as follows:

─ As mentioned in Section 3.1, images for training are resized to two scales. We should consider applying random scaling instead of a constant number of (two) scales. Other than scaling, random rotation should be applied as well.
─ In Sections 4.2 and 4.3, instead of taking minima as rejection thresholds, we should consider using the mean and standard deviation to obtain a more stable threshold.
─ The PlantCLEF 2016 dataset includes rich metadata such as Genus, Family, Date, Longitude, Latitude, Location and Content (organ type). However, in our work, only ClassId (class label) and ObservationId are used. More metadata should be utilized to obtain more accurate identification performance.

Acknowledgement

We would like to thank MEXT KAKENHI, Grant-in-Aid for Challenging Exploratory Research, Grant Number 15K12027, for partial support of our work.

References

1. K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," in CVPR, 2014.
2. K. He, X. Zhang, S. Ren and J. Sun, "Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition," in TPAMI, 2015.
3. K. He, X. Zhang, S. Ren and J. Sun, "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification," in CVPR, 2015.
4. A. Krizhevsky, I. Sutskever and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," in NIPS, 2012.
5. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke and A. Rabinovich, "Going Deeper with Convolutions," in CVPR, 2014.
6. K. He, X. Zhang, S. Ren and J. Sun, "Deep Residual Learning for Image Recognition," in CVPR, 2015.
7. H. Goëau, P. Bonnet and A. Joly, "LifeCLEF Plant Identification Task," in CLEF, 2015.
8. A. Joly, H. Goëau, H. Glotin, C. Spampinato, P. Bonnet, W.-P. Vellinga, J. Champ, R. Planqué, S. Palazzo and H. Müller, "LifeCLEF 2016: Multimedia Life Species Identification Challenges," in LifeCLEF, 2016.
9. H. Goëau, P. Bonnet and A. Joly, "Plant Identification in an Open-World (LifeCLEF 2016)," in PlantCLEF, 2016.
10. B. Xu, N. Wang, T. Chen and M. Li, "Empirical Evaluation of Rectified Activations in Convolutional Network," in CVPR, 2015.
11. X. Glorot and Y. Bengio, "Understanding the Difficulty of Training Deep Feedforward Neural Networks," in AISTATS, 2010.