=Paper=
{{Paper
|id=Vol-1609/16090459
|storemode=property
|title=Bluefield (KDE TUT) at LifeCLEF 2016 Plant Identification Task
|pdfUrl=https://ceur-ws.org/Vol-1609/16090459.pdf
|volume=Vol-1609
|authors=Siang Thye Hang,Atsushi Tatsuma,Masaki Aono
|dblpUrl=https://dblp.org/rec/conf/clef/HangTA16
}}
==Bluefield (KDE TUT) at LifeCLEF 2016 Plant Identification Task==
Siang Thye Hang, Atsushi Tatsuma, Masaki Aono

Knowledge Data Engineering & Information Retrieval Laboratory (KDE Lab), Department of Computer Science and Engineering, Toyohashi University of Technology, Japan

hang@kde.cs.tut.ac.jp, tatsuma@cs.tut.ac.jp, aono@tut.jp

Abstract. In this paper, we propose an automatic approach for plant image identification. We enhance the well-known VGG 16-layer Convolutional Neural Network model [1] by replacing the last pooling layer with a Spatial Pyramid Pooling layer [2]. Rectified Linear Units (ReLU) are also replaced with Parametric ReLUs [3]. The enhanced model is trained without any external dataset. A post-processing method is also proposed to reject irrelevant samples. We further improve identification performance using the observation identity (ObservationId) provided in the dataset. Our methods showed outstanding performance in the official evaluation results of the LifeCLEF 2016 Plant Identification Task.

Keywords: LifeCLEF, plant identification, deep learning, sample rejection.

1 Introduction

Nowadays, conservation of biodiversity is becoming an important duty. To achieve this, accurate knowledge is essential. However, even for professionals, identifying a species can be a very difficult task.

Convolutional Neural Networks (CNNs) achieve the best performance in various image recognition tasks such as the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [4][5][1][6]. CNNs learn filters of filters through multiple convolution layers, enabling a higher level of abstraction than hand-crafted features such as the Scale Invariant Feature Transform (SIFT).

In recent years, CNNs have gained popularity in various image identification tasks including the LifeCLEF Plant Identification Task (PlantCLEF). Over the past years, participants of PlantCLEF have increasingly adopted CNNs in their work [7]. In PlantCLEF 2016 [8][9], in addition to the 1000-class identification task, a new problem is introduced: the evaluation set consists of not only native plant images but also images of potentially invasive plants or irrelevant objects such as a table or a computer keyboard. Considering this new setting, we propose a post-processing method to reject such samples.

In the next section we propose to enhance the VGG 16-layer model with Spatial Pyramid Pooling and Parametric Rectified Linear Units. We detail our model training strategy in Section 3, followed by the irrelevant sample rejection algorithm in Section 4. Section 5 shows evaluation results and Section 6 concludes this paper.

2 Model Enhancement

In the following sections, a few enhancements to the VGG 16-layer model are described.

2.1 Spatial Pyramid Pooling

The original VGG 16-layer model requires input images of spatial size 224 × 224. On the other hand, the images in the PlantCLEF 2016 dataset are arbitrarily sized, as shown in Figure 1. Under the spatial size restriction of the VGG 16-layer model, input images have to be either cropped or warped; both of these methods may lead to information loss.

Fig. 1. Aspect ratios of images in the PlantCLEF 2016 dataset (percentage of vertical, square, and horizontal images per aspect ratio bin)

To circumvent this restriction, we replace the last pooling layer (pool5) with a Spatial Pyramid Pooling (SPP) layer [2]. The last convolution layer (conv5_3) produces a feature map of 512 channels. With SPP, the feature map is spatially divided into 1 × 1, 2 × 2, and 4 × 4 grids, a total of 21 regions. Each region is then average pooled, producing a vector of fixed size 21 × 512 = 10752. Such conversion of an arbitrarily sized feature map into a fixed-size vector allows the model to accept input images of any size. Meanwhile, with this layer replacement, the number of parameters in the model is reduced from 138 million to 80 million.
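To make the pooling operation concrete, the following is a minimal NumPy sketch of spatial pyramid average pooling as described above, assuming a channels-first (C × H × W) feature map; the function name spatial_pyramid_avg_pool and the grid partitioning via np.linspace are our illustrative choices, not the authors' actual Caffe layer.

```python
import numpy as np

def spatial_pyramid_avg_pool(feature_map, levels=(1, 2, 4)):
    """Average-pool an arbitrarily sized feature map (C x H x W) over
    1x1, 2x2 and 4x4 grids and concatenate into a fixed-size vector."""
    c, h, w = feature_map.shape
    pooled = []
    for n in levels:
        # Grid cell boundaries; cells may differ slightly in size when
        # H or W is not divisible by n.
        ys = np.linspace(0, h, n + 1, dtype=int)
        xs = np.linspace(0, w, n + 1, dtype=int)
        for i in range(n):
            for j in range(n):
                region = feature_map[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
                pooled.append(region.mean(axis=(1, 2)))  # one C-dim vector per region
    return np.concatenate(pooled)  # (1 + 4 + 16) * C = 21 * 512 = 10752 for conv5_3

# Example: a conv5_3-like feature map from a non-square input
fmap = np.random.rand(512, 14, 10).astype(np.float32)
print(spatial_pyramid_avg_pool(fmap).shape)  # (10752,)
```

Because the output length depends only on the number of channels and pyramid levels, the same fully connected layers can be applied to inputs of any spatial size.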
2.2 Parametric Rectified Linear Unit

We also replace all of the Rectified Linear Unit (ReLU) activations with a learnable version known as Parametric ReLU [3], which consistently outperforms ReLU in the empirical experiments of Xu et al. [10]. Before training, we initialize the learnable parameters to 0.05. Weight decay on these learnable parameters is disabled throughout the training process.

3 Model Training

The following strategies are used to train the enhanced model.

3.1 Data preparation

The PlantCLEF 2016 dataset consists of 113204 images. 2048 images are used for validation, while the remaining 111156 images are used to train the model. We augment the training images as follows: while preserving the aspect ratio, each image is resized such that its shorter side becomes 224 and 336 pixels. Random cropping to 224 × 224 and random horizontal flipping (about the y-axis) are then applied to the resized images. For evaluation, images are resized such that the shorter side becomes 224, and no cropping or flipping is applied. Cropping is not required during evaluation because the enhanced model accepts images of any size.

3.2 Image Mean

As the enhanced model is trained from scratch, i.e. without external resources, the image mean is computed from the training set. Excluding LeafScan images (which have high RGB values due to their bright background), the computed mean values of the red, green and blue channels are 105, 111, and 79 respectively. These mean values are subtracted from the images augmented in Section 3.1 before they are used for training.
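As a rough sketch of the preparation steps in Sections 3.1 and 3.2, the code below implements the two-scale resize, random 224 × 224 crop, horizontal flip, and mean subtraction using Pillow and NumPy; the helper names and the choice of Pillow are assumptions for illustration, not the authors' actual pipeline.

```python
import random
import numpy as np
from PIL import Image

MEAN_RGB = np.array([105.0, 111.0, 79.0])  # training-set mean, LeafScan images excluded

def resize_shorter_side(img, target):
    """Resize while preserving aspect ratio so that the shorter side equals `target`."""
    w, h = img.size
    scale = target / min(w, h)
    return img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)

def augment_for_training(img, scales=(224, 336), crop=224):
    """One augmented training sample: one of two scales, random crop, random flip."""
    img = resize_shorter_side(img, random.choice(scales))
    w, h = img.size
    x = random.randint(0, w - crop)
    y = random.randint(0, h - crop)
    img = img.crop((x, y, x + crop, y + crop))
    if random.random() < 0.5:                            # random flip about the y-axis
        img = img.transpose(Image.FLIP_LEFT_RIGHT)
    return np.asarray(img, dtype=np.float32) - MEAN_RGB  # mean subtraction

def prepare_for_evaluation(img):
    """Evaluation-time preprocessing: resize only, no cropping or flipping."""
    img = resize_shorter_side(img, 224)
    return np.asarray(img, dtype=np.float32) - MEAN_RGB
```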
3.3 Training

We use the Caffe framework to train the enhanced model. To train such a deep model, we use Xavier's method [11] to initialize the weights of the convolution and fully connected layers. We train the model using Stochastic Gradient Descent with momentum 0.9, learning rates from 0.01 down to 0.0001, and batch size 50. The learning rate is multiplied by 0.1 whenever the validation accuracy stops improving. As a result, we trained with learning rate 0.01 for 30 epochs, 0.001 for 15 epochs, and finally 0.0001 for 8 epochs. Figure 2 shows the normalized softmax loss on both the training and validation sets. Figure 3 shows the top-1 validation accuracy.

Fig. 2. Training and validation loss

Fig. 3. Validation accuracy

4 Irrelevant Sample Rejection

Section 4.1 discusses the limitation of the One-Versus-All method in rejecting irrelevant samples in multiclass classification. To overcome this problem, an algorithm is proposed in Section 4.2.

4.1 Multiclass Classification with the One-Versus-All Method

Assume a binary classifier F of class C mapping an input x to a score F(x). F can be optimized such that F(x) > 0 when x ∈ C, and F(x) < 0 when x ∉ C. In the case of multiclass classification, the One-Versus-All method can be used to classify N classes C = [C1, …, CN] using N binary classifiers F = [F1, …, FN]. As shown in (1), x ∈ Ci when Fi(x) is the highest among all of the binary classifiers.

$x \in C_i \quad \text{where} \quad i = \arg\max_{1 \le k \le N} F_k(x)$    (1)

With (1), a class is guaranteed to be assigned to x, regardless of how high or low the scores produced by each classifier are. The first example (from the left) in Figure 4 is ideal, as there is only one strong positive score, so we can confidently predict that input α belongs to the first class. Meanwhile, the second and third examples have either weak or no positive scores. With (1), input β is classified to the third class, while input γ is classified to the first class, even though all of the scores are rather low or even negative.

Fig. 4. An example of class scores (N = 4) for three inputs α, β, γ

Based on the scenario above, we realize that there are cases in which a sample should be rejected by all of the binary classifiers. An algorithm to identify such irrelevant samples is elaborated in the next section.

4.2 Irrelevant Sample Rejection Algorithm

After training the enhanced model as described in Section 3, a matrix of raw (i.e. before softmax normalization) scores S of N classes and M training samples is extracted from the classifier layer (fc8). Incorrect predictions are omitted. With only the correctly predicted class scores (M′ ≤ M samples), the rejection threshold t = [t1, …, tN] of each class is computed from the class-wise minima, as shown in (2).

$t_i = \min_{k \in M'} S_{k,i}$    (2)

Figure 5 shows the threshold t of each class based on the PlantCLEF 2016 training set (N = 1000), sorted in ascending order.

Fig. 5. Rejection threshold t of the training set, sorted in ascending order

During evaluation, any sample whose score is lower than t for all of the classes is rejected as irrelevant. On the evaluation set, 195 out of 8000 images are rejected. Some of the rejected images are shown in Figure 6.

Fig. 6. Subset of rejected samples with threshold t

4.3 Taking the Validation Set into Account

The threshold t obtained in Section 4.2 is based solely on the training set. Due to factors such as overfitting, there is a possibility that lower thresholds can be acquired from the validation set. Figure 7 shows the thresholds obtained from the validation set, corresponding to the sorted classes shown in Figure 5.

Fig. 7. Rejection thresholds of the training and validation sets (classes sorted by training-set threshold)

Although we expected lower thresholds from the validation set, as shown in Figure 7, the majority of its thresholds are higher than those of the training set. This is because there are too few samples (2048 ≪ M) in the validation set. Thus, only the lower thresholds are considered. The ratio of each of these (seven) thresholds to its corresponding training-set threshold is computed, and the ratios are then averaged into Q. As detailed in (3), the denominators are the thresholds of the training set, while the numerators are those of the validation set. The values in (3) are based on Figure 7.

$Q = \frac{1}{7}\left(\frac{9.1}{10.8} + \frac{11.1}{11.3} + \frac{9.0}{12.4} + \frac{12.5}{12.6} + \frac{12.4}{13.3} + \frac{13.3}{14.3} + \frac{16.7}{18.0}\right) \approx 0.91$    (3)

Q is then multiplied with the threshold t of the training set, as shown in (4), before it is used to reject samples during evaluation.

$\boldsymbol{t}' = Q\boldsymbol{t}$    (4)

Applying t′ to the evaluation set, only 69 samples are rejected, since t′ is lower than the original t. Some of the rejected images are shown in Figure 8.

Fig. 8. Subset of rejected samples with threshold t′
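The rejection procedure of Sections 4.2 and 4.3, i.e. equations (2) to (4), can be sketched in NumPy as follows. We read the class-wise minimum in (2) as the minimum score each class assigns to its own correctly classified training samples; the function names and the toy random scores standing in for real fc8 outputs are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def rejection_thresholds(scores, labels, predictions):
    """Per-class rejection thresholds (eq. 2): the minimum raw score of each
    class, computed over its correctly classified training samples only."""
    correct = scores[labels == predictions]           # correctly predicted rows (M' x N)
    true_classes = labels[labels == predictions]
    n_classes = scores.shape[1]
    t = np.full(n_classes, np.inf)
    for i in range(n_classes):
        class_scores = correct[true_classes == i, i]  # scores of class i on its own samples
        if class_scores.size:
            t[i] = class_scores.min()
    return t

def reject_irrelevant(eval_scores, t, q=1.0):
    """Reject samples whose score is below the (scaled) threshold for every
    class (eqs. 3-4): True means the sample is rejected as irrelevant."""
    t_scaled = q * t                                   # t' = Q * t
    return np.all(eval_scores < t_scaled, axis=1)

# Toy usage with random numbers in place of real fc8 scores
train_scores = np.random.randn(100, 5) * 5
train_labels = np.random.randint(0, 5, size=100)
train_preds = train_scores.argmax(axis=1)
t = rejection_thresholds(train_scores, train_labels, train_preds)
rejected = reject_irrelevant(np.random.randn(10, 5) * 5, t, q=0.91)
print(rejected)
```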
4.4 Observation Based Identification

Identification based on a single image may be insufficient. In the PlantCLEF 2016 dataset, images taken from the same observation share a unique ObservationId, and one ObservationId may be assigned multiple images of different organs. In particular, images of flowers or fruits carry highly characteristic features, so their presence in an observation often improves identification performance.

To further improve identification performance, after rejecting samples as explained in Sections 4.2 and 4.3, we sum the raw (i.e. before softmax normalization) class scores of images with the same ObservationId. The summed scores are then softmax normalized. As a result, images with the same ObservationId share the same normalized scores.
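A minimal NumPy sketch of this observation-based aggregation is given below: raw class scores of the non-rejected images sharing an ObservationId are summed and then softmax normalized, so all images of an observation receive the same normalized scores. The grouping helper and names are our own, not the authors' implementation.

```python
import numpy as np
from collections import defaultdict

def softmax(scores):
    """Numerically stable softmax over the last axis."""
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def observation_scores(raw_scores, observation_ids):
    """Sum raw class scores per ObservationId, then softmax normalize.
    Images of the same observation share the same normalized score vector."""
    groups = defaultdict(list)
    for idx, obs_id in enumerate(observation_ids):
        groups[obs_id].append(idx)
    result = np.empty_like(raw_scores, dtype=np.float64)
    for obs_id, indices in groups.items():
        summed = raw_scores[indices].sum(axis=0)   # sum over images of the observation
        result[indices] = softmax(summed)          # shared normalized scores
    return result

# Toy usage: 4 images belonging to 2 observations, 3 classes
scores = np.random.randn(4, 3)
obs = ["obs1", "obs1", "obs2", "obs2"]
print(observation_scores(scores, obs))
```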
5 Evaluation

In our evaluation experiments, the enhanced CNN model is used to extract class scores. Four different post-processing methods are then applied to the extracted class scores, detailed as follows. As described in Sections 4.2 and 4.3, t is the rejection threshold obtained from the training set, while t′ is the rejection threshold obtained by considering both the training and validation sets.

─ Run 1: Sample rejection with t, identification based on single samples
─ Run 2: Sample rejection with t′, identification based on single samples
─ Run 3: Sample rejection with t, observation based identification
─ Run 4: Sample rejection with t′, observation based identification

In our run files (Bluefield), we provide scores for up to the top 30 classes, and rejected samples are entirely excluded from the run files. Evaluation results compared with the other participants are summarized in Table 1 and Figure 9.

Table 1. Evaluation results sorted by official score MAP (MAP: Mean Average Precision)

| Run | Official score MAP | MAP restricted to potentially invasive species | MAP ignoring unknown classes and queries |
|---|---|---|---|
| Bluefield Run 4 | 0.742 | 0.717 | 0.827 |
| SabanciUGebzeTU Run 1 | 0.738 | 0.704 | 0.806 |
| SabanciUGebzeTU Run 3 | 0.737 | 0.703 | 0.807 |
| Bluefield Run 3 | 0.736 | 0.718 | 0.820 |
| SabanciUGebzeTU Run 2 | 0.736 | 0.683 | 0.807 |
| SabanciUGebzeTU Run 4 | 0.735 | 0.695 | 0.802 |
| CMP Run 1 | 0.710 | 0.653 | 0.790 |
| LIIR KUL Run 3 | 0.703 | 0.674 | 0.761 |
| LIIR KUL Run 2 | 0.692 | 0.667 | 0.744 |
| LIIR KUL Run 1 | 0.669 | 0.652 | 0.708 |
| UM Run 4 | 0.669 | 0.598 | 0.742 |
| CMP Run 2 | 0.644 | 0.564 | 0.729 |
| CMP Run 3 | 0.639 | 0.590 | 0.723 |
| QUT Run 3 | 0.629 | 0.610 | 0.696 |
| Floristic Run 3 | 0.627 | 0.533 | 0.693 |
| UM Run 1 | 0.627 | 0.537 | 0.700 |
| Floristic Run 1 | 0.619 | 0.541 | 0.694 |
| Bluefield Run 1 | 0.611 | 0.600 | 0.692 |
| Bluefield Run 2 | 0.611 | 0.600 | 0.693 |
| Floristic Run 2 | 0.611 | 0.538 | 0.681 |
| QUT Run 1 | 0.601 | 0.563 | 0.672 |
| UM Run 3 | 0.589 | 0.509 | 0.652 |
| QUT Run 2 | 0.564 | 0.562 | 0.641 |
| UM Run 2 | 0.481 | 0.446 | 0.552 |
| QUT Run 4 | 0.367 | 0.359 | 0.378 |
| BME TMIT Run 4 | 0.174 | 0.144 | 0.213 |
| BME TMIT Run 3 | 0.170 | 0.125 | 0.197 |
| BME TMIT Run 1 | 0.169 | 0.125 | 0.196 |
| BME TMIT Run 2 | 0.066 | 0.128 | 0.101 |

Fig. 9. Evaluation results sorted by run name in alphabetical order

Bluefield Run 4 yields the highest official score among all of the participants. The use of ObservationId shows a significant improvement. On the other hand, the rejection threshold t′ that takes the validation set into account shows only a slight improvement in official MAP.

6 Conclusion

In this paper, we described our approach to PlantCLEF 2016, focusing on model enhancements, data augmentation, and an irrelevant sample rejection strategy. There is still room for improvement, itemized as follows:

─ As mentioned in Section 3.1, images for training are resized to two scales. We should consider applying random scaling instead of a constant number of (two) scales. Other than scaling, random rotation should be applied as well.
─ In Sections 4.2 and 4.3, instead of taking minima as rejection thresholds, we should consider using the mean and standard deviation to obtain a more stable threshold.
─ The PlantCLEF 2016 dataset includes rich metadata such as Genus, Family, Date, Longitude, Latitude, Location and Content (organ type). However, in our work, only ClassId (class label) and ObservationId are used. More metadata should be utilized to obtain more accurate identification performance.

Acknowledgement

We would like to thank MEXT KAKENHI, Grant-in-Aid for Challenging Exploratory Research, Grant Number 15K12027, for partial support of our work.

References

1. K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," in CVPR, 2014.
2. K. He, X. Zhang, S. Ren and J. Sun, "Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition," in TPAMI, 2015.
3. K. He, X. Zhang, S. Ren and J. Sun, "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification," in CVPR, 2015.
4. A. Krizhevsky, I. Sutskever and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," in NIPS, 2012.
5. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke and A. Rabinovich, "Going Deeper with Convolutions," in CVPR, 2014.
6. K. He, X. Zhang, S. Ren and J. Sun, "Deep Residual Learning for Image Recognition," in CVPR, 2015.
7. H. Goëau, P. Bonnet and A. Joly, "LifeCLEF Plant Identification Task," in CLEF, 2015.
8. A. Joly, H. Goëau, H. Glotin, C. Spampinato, P. Bonnet, W.-P. Vellinga, J. Champ, R. Planqué, S. Palazzo and H. Müller, "LifeCLEF 2016: Multimedia Life Species Identification Challenges," in LifeCLEF, 2016.
9. H. Goëau, P. Bonnet and A. Joly, "Plant Identification in an Open-World (LifeCLEF 2016)," in PlantCLEF, 2016.
10. B. Xu, N. Wang, T. Chen and M. Li, "Empirical Evaluation of Rectified Activations in Convolutional Network," in CVPR, 2015.
11. X. Glorot and Y. Bengio, "Understanding the Difficulty of Training Deep Feedforward Neural Networks," in AISTATS, 2010.