IBM Research Australia at LifeCLEF 2014: Plant Identification Task

Qiang Chen, Mani Abedini, Rahil Garnavi, Xi Liang
IBM Research Australia

Abstract. In this paper, we present the system and learning strategies that were applied by the IBM Research team to the plant identification task of LifeCLEF 2014. Plant identification is one of the most popular fine-grained categorization tasks. To ensure high classification accuracy, we have utilised strong visual features together with a fusion of robust machine learning techniques. Our proposed system involves automatic delineation of the region of interest (e.g. the plant's leaf, flower, etc.) in the given image, followed by extraction of multiple complementary low-level features. The features are then encoded into the sophisticated Fisher Vector representation, which enables accurate classification with linear classifiers. We have also applied recent developments in deep learning. More importantly, our system combines multiple sources of information, i.e. it integrates organ annotations with image data, and adopts classifier fusion, which has led to strong results. Extensive experiments demonstrate the effectiveness of the proposed system: three out of four of our submissions outperform all submissions by other teams, and the team therefore achieved first place in the LifeCLEF 2014 Plant task.

Keywords: Fine-grained object recognition, plant recognition, deep learning, feature coding

1 Introduction

Fine-grained classification, i.e. classification among categories which are both visually and semantically very similar, is a very difficult task. It is challenging even for humans without careful training, and is critical for establishing a more detailed understanding of the visual world [11][18][10]. Identifying plant species is essential for successful agricultural development and conservation of biodiversity. However, it is a very difficult task even for professionals (e.g. farmers, wood exploiters, botanists). To evaluate recent advances of information retrieval and computer vision on this challenging task, the CLEF Cross Language Evaluation Forum has been organizing a yearly competition on plant identification since 2011. Following the success of the three previous ImageCLEF plant identification tasks, the LifeCLEF 2014 plant task covers 500 plant species of botanical data. The task focuses on tree, herb and fern species identification based on different types of images. The main novelties compared to previous years are the following:

– Multi-image query: The motivation of the task is to fit better with a real scenario where a user tries to identify a plant by observing its different organs, as demonstrated in [MAED2012]. Indeed, botanists usually observe several organs simultaneously, e.g. the leaves and the fruits or the flowers, in order to identify species which could be confused if only one organ were observed.
– Observation-based evaluation: Unlike previous years, the species identification task is not image-centered but observation-centered. The aim of the task is to produce a list of relevant species for each observation of a plant in the test dataset, i.e. one or several pictures related to the same event, where the same person photographs several detailed views of various organs of the same plant, on the same day, with the same device and under the same lighting conditions.
– Large number of species: The number of species is about 500, which is an important step towards covering the entire flora of a given region.

This paper presents the system and learning strategies that were applied by the IBM Research team to the plant identification task of LifeCLEF 2014. The rest of the paper is organised as follows: Section 2 describes the main components of the system, involving segmentation, feature extraction and learning, and the methodologies applied in each module. Section 3 discusses the details of the experiments and the results obtained on the validation and test sets. Section 4 concludes the paper.

2 Approach

In this section we first briefly describe our submissions to the LifeCLEF 2014 competition. Then the various components of the system, including segmentation (delineation of the region of interest), feature coding and the deep convolutional neural network, are explained.

2.1 IBM Research Australia Runs

We have submitted four different runs to the LifeCLEF [22] plant identification task [23].

1. Run1 - Deep convolutional neural network: In this run, we utilize the deep convolutional neural network model explained in Sec. 2.4, with five convolutional layers, three fully-connected layers and a logistic-regression cost layer. The CNN is trained on the plant training data provided by LifeCLEF 2014, which is relatively small-scale data for such a deep model.
2. Run2 - Advanced feature encoding: In this run, we apply the advanced feature encoding methods explained in Sec. 2.3, based on Fisher Kernel encoding [20][5][17]. We first extract dense local features, namely SIFT [2] and Color Moments, from the raw images. Each feature type is modeled with a Gaussian mixture model (GMM) to form the Fisher Vector representation. We then learn a linear SVM [14] for each feature. The final submission is the average score of the two features.
3. Run3 - Fusion of Run1 and Run2: We have applied an empirical fusion method, where each component is fed into the fusion module and a weight is tuned for each component.
4. Run4 - Segmentation and fusion of Run2 and Run3: We apply the feature encoding method to images with the ROI extracted. We then fuse this result with Run3. We also made a few further improvements, e.g. SVM averaging.

The details of the methods used above are explained in the following sections.

2.2 Segmentation

In most images, a region of interest (ROI) which encloses the main object is sufficient to determine the class label. In fact, segmenting the main object and extracting the ROI often removes irrelevant background which could otherwise introduce noise to a supervised classifier. For the flower, fruit, leaf, leafScan and stem categories, we apply different segmentation methods to remove the background. For the two categories of wholeTree and branch, however, we have observed that the whole image contains useful information. Therefore, we choose not to extract any ROI from these views; instead, the original image is used in the next modules of the system, i.e. feature extraction and learning.

For the flower and fruit categories, we apply a similar segmentation method to extract the ROI that contains the flower or fruit, as follows: assuming that a flower or fruit is usually "more red" than the surrounding leaves, we locate the regions in which the red channel is larger than the green channel. Each color image is converted to a gray-value image, and the localized "red" regions are then employed as initial masks for segmenting the flower or fruit in the gray-value image using an active contour method [21]. Finally, we compute the minimum bounding box of the flower or fruit mask as the ROI. Figure 1 shows ROI extraction results on sample flower pictures and Figure 2 shows those on fruit pictures.

Fig. 1. Original flower image (left) and automatically extracted ROI (right)

Fig. 2. Original fruit image (left) and automatically extracted ROI (right)
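As an illustration of the flower/fruit ROI initialization described above, the following Python sketch locates the "more red" regions and returns the bounding box of the largest such region. For brevity it replaces the active-contour refinement [21] used in our actual pipeline with a simple largest-connected-component step, so it is a simplified approximation; all function and variable names are illustrative.

```python
import numpy as np
from scipy import ndimage

def flower_fruit_roi(rgb):
    """Rough ROI for flower/fruit images: bounding box of the largest region
    where the red channel dominates the green channel.
    Simplified sketch; the full pipeline refines the mask with an active contour [21]."""
    rgb = rgb.astype(np.float32)
    red_mask = rgb[..., 0] > rgb[..., 1]                        # "more red" heuristic
    red_mask = ndimage.binary_opening(red_mask, iterations=2)   # remove small speckle
    labels, n = ndimage.label(red_mask)
    if n == 0:
        return 0, 0, rgb.shape[0], rgb.shape[1]                 # fall back to whole image
    sizes = ndimage.sum(red_mask, labels, range(1, n + 1))
    largest = labels == (np.argmax(sizes) + 1)
    rows = np.any(largest, axis=1)
    cols = np.any(largest, axis=0)
    r0, r1 = np.where(rows)[0][[0, -1]]
    c0, c1 = np.where(cols)[0][[0, -1]]
    return r0, c0, r1 + 1, c1 + 1                               # (top, left, bottom, right)
```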
For the leafScan category, every image typically contains a leaf on a light background. However, there are variations among the background colors. We therefore normalize the background to a consistent white color, which can potentially improve the accuracy of the classification. We convert the color image into a gray-value image and then apply Otsu's method [1] to compute a threshold. The threshold separates the leaf from the background, and all background pixels in the corresponding color image are assigned a white color. Some examples are shown in Figure 3.

Fig. 3. Original LeafScan image (left) and the ROI (right)

In images of the leaf category, we observed that a leaf is usually located in the center area of the picture and there is typically some margin between the leaf boundary and the picture border. We therefore extract the object that is located in the center of the picture. In pre-processing, we convert the color picture to a gray-value image and apply a Gaussian filter with σ = 3. We then extract a rectangular region as an initial ROI that contains the leaf, defined by its top-left and bottom-right coordinates in the image. Let r and l be the number of rows and columns in the image; the top-left corner is at (r/6, l/6) and the bottom-right corner at (5r/6, 5l/6). We then apply an active contour on the gray-value image using this rectangular region as the initial mask. The bounding box of the segmented leaf is the final ROI, and some examples are shown in Figure 4.

Fig. 4. Original leaf image (left) and automatically extracted ROI (right)

In the stem category, the stem is usually located in the center of the image. For segmenting the stem, we convert the color image to a gray-value image and create a central mask by cropping 25% from the left and 25% from the right. An active contour is then applied on the gray-value image using this mask. The bounding box of the resulting area is the ROI for the stem image. This method can effectively remove most of the background for vertical stems without branches. Some examples are shown in Figure 5.

Fig. 5. Original Stem image (left) and automatically extracted ROI (right)
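The leafScan background normalization above can be sketched in a few lines. The snippet below assumes an 8-bit RGB scan whose background falls on the bright side of the Otsu threshold (consistent with the light scan backgrounds described above); it is a simplified illustration, and the exact implementation used in our experiments may differ in details.

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.filters import threshold_otsu

def whiten_leafscan_background(rgb):
    """Normalize a leaf-scan background to pure white (illustrative sketch).
    Assumes an 8-bit RGB image with a light background and a darker leaf;
    pixels on the bright side of the Otsu threshold are treated as background."""
    gray = rgb2gray(rgb)                  # gray values in [0, 1]
    t = threshold_otsu(gray)
    background = gray >= t                # bright side = scan background (assumption)
    out = rgb.copy()
    out[background] = [255, 255, 255]     # paint background white
    return out
```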
2.3 Advanced Feature Coding

Feature coding is a generalization of the popular bag-of-words (BoW) model [8][3]. A recent evaluation [4] shows that Fisher Kernel [20][5] feature coding achieves the best results in most cases. We introduce this coding method in this section.

Suppose we have a probability density function $u_\lambda(x)$ which models a generative process in feature space. Let $X = \{x_1, \dots, x_N\}$ be the set of $N$ local features extracted from an image. The image can then be described by the gradient vector of the log-likelihood with respect to the model parameters $\lambda$:

$$G_\lambda^X = \frac{1}{N} \nabla_\lambda \log u_\lambda(X). \qquad (1)$$

A natural kernel on these gradients is $K(X, Y) = {G_\lambda^X}^\top F_\lambda^{-1} G_\lambda^Y$, where $F_\lambda$ is the Fisher information matrix of $u_\lambda$: $F_\lambda = E_{x \sim u_\lambda}\!\left[\nabla_\lambda \log u_\lambda(x)\, \nabla_\lambda \log u_\lambda(x)^\top\right]$. As $F_\lambda$ is symmetric and positive definite, it can be decomposed as $F_\lambda = L_\lambda^\top L_\lambda$, and the kernel $K(X, Y)$ can be expressed as a dot product between the normalized vectors $\mathcal{G}_\lambda^X = L_\lambda G_\lambda^X$, called Fisher vectors. We stress that learning a kernel classifier with the Fisher kernel is equivalent to learning a linear classifier on the Fisher vectors $\mathcal{G}_\lambda^X$. As is widely recognized, linear classifiers offer significant efficiency advantages for both training and testing.

Fisher Vector encoding uses a Gaussian mixture model (GMM), $u_\lambda(x) = \sum_{k=1}^{K} \pi_k u_k(x)$, trained on the local features of a large image set using Maximum Likelihood (ML) estimation. The parameters of the trained GMM are denoted $\lambda = \{\pi_k, \mu_k, \Sigma_k,\ k = 1, \dots, K\}$, where $\pi_k$, $\mu_k$ and $\Sigma_k$ are the prior probability, mean vector and diagonal covariance matrix of the $k$-th Gaussian component, respectively. This GMM is used to describe the low-level features. For a set of low-level features $X = \{x_1, \dots, x_N\}$ extracted from an image, the soft assignment $\gamma_{ik}$ of descriptor $x_i$ to the $k$-th Gaussian component is computed as:

$$\gamma_{ik} = \frac{\pi_k u_k(x_i)}{\sum_{j=1}^{K} \pi_j u_j(x_i)} \qquad (2)$$

The Fisher vector (FV) for $X$ is then $\phi(X) = \{G_{\mu_1}^X, G_{\sigma_1}^X, \dots, G_{\mu_K}^X, G_{\sigma_K}^X\}$, where $G_{\mu_k}^X$ and $G_{\sigma_k}^X$ are defined as:

$$G_{\mu_k}^X = \frac{1}{N\sqrt{\pi_k}} \sum_{i=1}^{N} \gamma_{ik}\, \frac{x_i - \mu_k}{\sigma_k}, \qquad (3)$$

$$G_{\sigma_k}^X = \frac{1}{N\sqrt{2\pi_k}} \sum_{i=1}^{N} \gamma_{ik} \left[ \frac{(x_i - \mu_k)^2}{\sigma_k^2} - 1 \right], \qquad (4)$$

where $\sigma_k$ is the square root of the diagonal of $\Sigma_k$. The FV has several good properties: (a) Fisher Vector encoding is not limited to counting visual word occurrences; it also encodes the distribution of the feature points, which is more stable when encoding a single feature point. (b) It can naturally separate the image-specific information from the noisy local features. (c) A linear model can be used on this representation.

Power Normalization and L2 Normalization: It is easy to observe that as the number of Gaussians increases, Fisher vectors become sparser, and the distribution of values in a given dimension becomes more peaked around zero. This issue is addressed by a combination of power normalization and $\ell_2$ normalization of each Fisher vector descriptor. Let $z$ be one dimension of the Fisher vector $\phi$; the power normalization is defined as $f(z) = \mathrm{sign}(z)\,|z|^\alpha$, where $0 \le \alpha \le 1$ is a parameter of the normalization, and we choose $\alpha = 0.5$ in all experiments. Subsequently, the Fisher vectors are $\ell_2$ normalized.
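For concreteness, the following NumPy/scikit-learn sketch computes the Fisher Vector of one image from equations (2)-(4), including the power and ℓ2 normalization described above. The use of scikit-learn's GaussianMixture and all names are illustrative choices, not necessarily the implementation used in our experiments.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm(descriptors, n_components=512, seed=0):
    """Fit a diagonal-covariance GMM on a large sample of local descriptors (M, D)."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          random_state=seed)
    gmm.fit(descriptors)
    return gmm

def fisher_vector(X, gmm, alpha=0.5):
    """Fisher Vector of one image's local descriptors X (N, D), Eqs. (2)-(4)."""
    N, D = X.shape
    pi = gmm.weights_                          # (K,) mixture priors
    mu = gmm.means_                            # (K, D) means
    sigma = np.sqrt(gmm.covariances_)          # (K, D) diagonal std deviations
    gamma = gmm.predict_proba(X)               # (N, K) soft assignments, Eq. (2)

    fv = []
    for k in range(len(pi)):
        diff = (X - mu[k]) / sigma[k]          # (N, D)
        g_mu = (gamma[:, k, None] * diff).sum(axis=0) / (N * np.sqrt(pi[k]))
        g_sigma = (gamma[:, k, None] * (diff ** 2 - 1)).sum(axis=0) \
                  / (N * np.sqrt(2 * pi[k]))
        fv.extend([g_mu, g_sigma])             # Eqs. (3) and (4)
    fv = np.concatenate(fv)

    fv = np.sign(fv) * np.abs(fv) ** alpha     # power normalization, alpha = 0.5
    fv /= (np.linalg.norm(fv) + 1e-12)         # l2 normalization
    return fv                                  # length 2 * K * D
```

With the settings reported later (64-dimensional descriptors after PCA and K = 512 components), this yields a 2 × 512 × 64 = 65536-dimensional vector per feature type, which is then classified with a linear SVM.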
2.4 Deep Convolutional Neural Network

Fig. 6. A deep convolutional neural network with a configuration similar to [6].

We also applied a deep convolutional neural network (CNN) to the plant identification task. We use the training data from the plant task only, which is relatively small-scale data for such a deep model. We also tried a CNN pretrained on the ImageNet dataset [7][13]; as the use of such extra data is not allowed in the LifeCLEF challenge, we did not submit this result and only evaluated it on our internal validation set.

Our CNN has around 60 million parameters. It consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final softmax layer. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers, we employed a recently developed regularization method called dropout, which proved to be very effective. We also use data augmentation as in [6].

In order to incorporate the "Organ" annotation, we propose two methods: (1) we train the CNN with two objective functions that target species accuracy and view (organ) accuracy at the same time; (2) we train the CNN with a single objective function, but define the class label as the enumeration of species and view annotations. These two implementations turn out to provide similar performance on our validation set.

2.5 Fusing of Multiple Systems

So far we have presented the sub-models of the system; we then fuse their results at a late stage. Each sub-model $k$ provides a confidence score $s^k_{i,j}$ for each observation id (or image) $i$ and each category $j$. The final confidence score is obtained as the weighted sum $S_{i,j} = \sum_k w_k\, s^k_{i,j}$. The weights are optimized on the validation set: each time, we select the sub-model with the best accuracy and tune its weight to obtain the best fused accuracy. The same set of weight parameters is then used to compute the confidence scores on the test set.
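The sketch below shows one plausible greedy realization of the weight tuning described above: sub-models are added in order of their individual validation accuracy and each new weight is scanned over a coarse grid. The grid, the accuracy measure and all names are illustrative assumptions; the exact search procedure used in our experiments is not reproduced here.

```python
import numpy as np

def fuse(scores, weights):
    """Weighted sum S = sum_k w_k * s_k; scores: list of (n_samples, n_classes) arrays."""
    return sum(w * s for w, s in zip(weights, scores))

def accuracy(score, labels):
    """Top-1 accuracy of a score matrix against integer class labels."""
    return float(np.mean(np.argmax(score, axis=1) == labels))

def greedy_fusion_weights(val_scores, val_labels, grid=np.linspace(0.0, 1.0, 11)):
    """Illustrative greedy tuning on the validation set: start from the best
    single sub-model, then add the others one at a time, scanning a coarse
    weight grid for each (an assumed search strategy, shown for illustration)."""
    order = np.argsort([-accuracy(s, val_labels) for s in val_scores])
    weights = np.zeros(len(val_scores))
    weights[order[0]] = 1.0                      # best sub-model anchors the fusion
    for k in order[1:]:
        best_w = 0.0
        best_acc = accuracy(fuse(val_scores, weights), val_labels)
        for w in grid:
            weights[k] = w
            acc = accuracy(fuse(val_scores, weights), val_labels)
            if acc > best_acc:
                best_w, best_acc = w, acc
        weights[k] = best_w
    return weights
```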
3 Experiments

In this section we discuss the details of the experiments and the results obtained on the validation and test sets.

3.1 Experimental setting

Dataset. The PlantCLEF dataset focuses on 500 herb, tree and fern species centered on France (some plant observations come from neighboring countries). It contains more than 60000 pictures, each belonging to one of the seven types of view reported in the metadata, provided in an XML file (one per image) with explicit tags. Part of the dataset was provided as training data and the remaining part is used as test data. Test observations were chosen by randomly sampling 1/3 of the observations of each species. The training data finally comprises 47815 images, including 1987 of "Branch", 6356 photographs of "Entire", 13164 of "Flower", 3753 of "Fruit", 7754 of "Leaf", 3466 of "Stem" and 11335 scans and scan-like pictures of leaves. The test data contains 8163 plant-observation-queries, based on 13146 images: 731 of "Branch", 2983 photographs of "Entire", 4559 of "Flower", 1184 of "Fruit", 2058 of "Leaf", 935 of "Stem" and 696 scans and scan-like pictures of leaves.

Validation Set. In order to verify the effectiveness of each component of our system, we split the "training data" provided by PlantCLEF into two parts: a train set and a validation set. The validation set is roughly 1/5 of the total training data. We split the training data according to observation id, which is critical since the final evaluation is based on observation id. In the following sections, we report results on both the validation set and the test set.

Implementation details. As previously mentioned, in the feature coding approach we extract two types of local features: dense SIFT and Color Moment features. The dimension of each local feature is reduced to 64 using PCA [16]. Then, for each type of feature, we generate a GMM with 512 components. For the deep convolutional neural network, we follow the pipeline of Krizhevsky et al. [6]; the filter sizes and filter numbers are the same as in [6]. We restrict the number of nodes in the fully-connected layers to 2048, as this is more than sufficient to model the plant images. We use open source libraries: SIFT from VLFeat [9] and the SVM solver from LIBSVM [15].

3.2 Results on Validation Set

For a proper assessment of each component of our system, we perform a diagnostic evaluation on the validation set. The evaluation results are shown in Table 1, based on the two metrics proposed by the PlantCLEF organisers, i.e. the accuracy w.r.t. image (Acc image) and the accuracy w.r.t. observation id (Acc observ).

Table 1. Results for each component on the validation set.

               FeatureCoding   FeatureCoding+ROI   DNN    Combine
  Acc image    0.445           0.458               0.32   0.483
  Acc observ   0.624           0.63                0.44   0.656

Based on the results shown in Table 1, we can observe the following:

1. The feature coding pipeline (combining the two feature types, i.e. SIFT and Color Moments) achieves stable results. As reported in the literature, Fisher Kernel encoding performs best among many encoding methods in visual recognition.
2. Segmentation (ROI extraction) improves the feature coding pipeline and increases the classification accuracy.
3. The CNN method results in lower classification accuracy compared to feature coding. We believe this is due to the limited amount of training data. Deep learning has demonstrated great success on large-scale datasets; PlantCLEF is a medium-scale dataset and the deep model can easily overfit.

3.3 Results on Test Set

In this subsection, we present our final results submitted to the LifeCLEF 2014 plant identification task. The results of each run are shown in Table 2, based on the two metrics: the accuracy w.r.t. image (Acc image) and the accuracy w.r.t. observation id (Acc observ).

Table 2. Results for each run on the test set.

               RUN1    RUN2    RUN3    RUN4
  Acc image    0.263   0.438   0.446   0.456
  Acc observ   0.271   0.454   0.459   0.471

From Table 2 we observe:

– RUN1 is the result of the deep convolutional neural network. It obtains a reasonable result but is inferior to the other runs. The main problem is that, with so many parameters, the deep model very easily overfits the training set.
– RUN2 is the raw result of the feature coding pipeline, which gives quite a good result. It demonstrates that the feature coding pipeline is very mature and easily adaptable to many tasks in visual recognition.
– RUN3 is the result of combining RUN2 and RUN1. It shows that the CNN framework is complementary to the traditional feature coding framework.
– RUN4 is the result of combining RUN3 with the feature coding pipeline applied to segmented images. It shows that correct delineation of the ROI improves the classification results.

Further evaluation results:

1. Overall performance: Our submissions achieve the top three results among all teams, as shown in Figure 7, in terms of both metrics, i.e. the accuracy w.r.t. image (Acc image) and the accuracy w.r.t. observation id (Acc observ). Our classification results obtain a 15-20% improvement over other teams.
2. Results for each organ (view): Figure 8 shows the result for each organ (view). Our classification results again lead other teams in most cases. It is worth noting that our system's performance on LeafScan and Flower is better than 50%. Considering such a fine-grained categorization task, we believe this result has practical value for a real system.
3. Observation-based evaluation: We observe a notable improvement from image-based evaluation to observation-based evaluation on the validation set, which shows the ability of our system to use multiple sources of image data and perform a more accurate classification. Unfortunately, the trend is not as pronounced on the test set. This is caused by a dissimilarity between the training and test sets in terms of the average number of images per observation id: there are around 4.5 images per observation id in the training set, while there are only around 1.6 images per observation id in the test set.

Fig. 7. Official test results for the submitted plant identification runs, based on observation id (left) and image id (right). IBM runs achieved the top three performances.

Fig. 8. The final results evaluated per image for each organ.
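Since the official ranking is observation-centered while several of our components score individual images, image-level scores must be turned into an observation-level prediction. The snippet below shows one simple, assumed choice (averaging the scores of all images sharing an observation id), purely for illustration; it is not necessarily the exact aggregation rule used in our submissions.

```python
import numpy as np
from collections import defaultdict

def observation_scores(image_scores, observation_ids):
    """Average per-image class scores within each observation id.
    image_scores: (n_images, n_classes) array; observation_ids: length n_images.
    Returns {observation_id: averaged score vector}. Illustrative assumption only."""
    groups = defaultdict(list)
    for idx, obs in enumerate(observation_ids):
        groups[obs].append(image_scores[idx])
    return {obs: np.mean(rows, axis=0) for obs, rows in groups.items()}

# Usage: rank species per observation by the averaged score, e.g. top-5 classes
# obs_scores = observation_scores(scores, obs_ids)
# top5 = {o: np.argsort(-s)[:5] for o, s in obs_scores.items()}
```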
4 Conclusion

In this paper, we described the system and learning methodology applied by the IBM Research Australia team to the plant identification task of LifeCLEF 2014. We utilized an advanced feature coding method with automatic ROI extraction. We also applied recent developments in deep learning and achieved good results on the validation set. The most important contribution of this work is the effective fusion of various learning schemes and the proper use of multiple sources of information (annotation data and image data). Extensive experiments demonstrated the effectiveness of the proposed system, and the final submitted run achieved first place in the LifeCLEF 2014 Plant task.

References

1. N. Otsu. A threshold selection method from gray-level histograms. Automatica, 11(285-296):23-27, 1975.
2. D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91-110, 2004.
3. L. Fei-Fei and P. Perona. A Bayesian hierarchical model for learning natural scene categories. In Computer Vision and Pattern Recognition, 2005.
4. K. Chatfield, V. Lempitsky, and A. Vedaldi. The devil is in the details: an evaluation of recent feature encoding methods. In British Machine Vision Conference, 2011.
5. F. Perronnin, J. Sanchez, and T. Mensink. Improving the Fisher kernel for large-scale image classification. In European Conference on Computer Vision, 2010.
6. A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
7. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: a large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009.
8. J. Sivic and A. Zisserman. Video Google: a text retrieval approach to object matching in videos. In International Conference on Computer Vision, 2003.
9. A. Vedaldi and B. Fulkerson. VLFeat: an open and portable library of computer vision algorithms. http://www.vlfeat.org/, 2008.
10. P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical report, California Institute of Technology, 2010.
11. M.-E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In ICVGIP, pages 722-729, 2008.
12. L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. Computer Vision and Image Understanding, 106(1):59-70, 2007.
13. J. Deng, A. C. Berg, and L. Fei-Fei. Hierarchical semantic indexing for large scale image retrieval. In Computer Vision and Pattern Recognition, 2011.
14. C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 1998.
15. C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.
16. I. T. Jolliffe. Principal Component Analysis. Springer Verlag, 2002.
17. F. Perronnin, Z. Akata, and Z. Harchaoui. Towards good practice in large-scale learning for image classification. In Computer Vision and Pattern Recognition, 2012.
18. P. Belhumeur, D. Chen, S. Feiner, D. Jacobs, W. Kress, H. Ling, I. Lopez, R. Ramamoorthi, S. Sheorey, S. White, and L. Zhang. Searching the world's herbaria: a system for visual identification of plant species. In European Conference on Computer Vision, 2008.
19. Y. Su and F. Jurie. Improving image classification using semantic attributes. International Journal of Computer Vision, 100(1):59-77, 2012.
20. F. Perronnin and C. Dance. Fisher kernels on visual vocabularies for image categorization. In Computer Vision and Pattern Recognition, 2007.
21. T. F. Chan and L. A. Vese. Active contours without edges. IEEE Transactions on Image Processing, 10(2):266-277, 2001.
22. A. Joly, H. Müller, H. Goëau, H. Glotin, C. Spampinato, A. Rauber, P. Bonnet, W.-P. Vellinga, and B. Fisher. LifeCLEF 2014: multimedia life species identification challenges. In Proceedings of CLEF 2014.
23. H. Goëau, A. Joly, P. Bonnet, J.-F. Molino, D. Barthélémy, and N. Boujemaa. LifeCLEF Plant Identification Task 2014. In Proceedings of CLEF 2014.