Viewpoints combined classification method in image-based plant identification task

Gábor Szűcs1, Dávid Papp2, Dániel Lovas2

1 Inter-University Centre for Telecommunications and Informatics, Kassai str. 26., H-4028, Debrecen, Hungary
2 Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, Magyar Tudósok krt. 2., H-1117, Budapest, Hungary
szucs@tmit.bme.hu, pappdavid27@gmail.com, lovas.daniel@simonyi.bme.hu

Abstract. The image-based plant identification challenge focused on identifying tree, herb and fern species based on different types of images. The aim of the task was to produce a list of relevant species for each observation of a plant in the test dataset. We have elaborated a viewpoints combined classification method for this challenge. We applied dense SIFT for feature detection and description, and a Gaussian Mixture Model based Fisher vector was calculated to represent each image with a high-level descriptor. The chosen classifier was the C-support vector classification algorithm with RBF (Radial Basis Function) kernel, and we optimized its two hyperparameters (C from C-SVC and γ from the RBF kernel) by a grid search on a two-dimensional grid. We constructed a combined classifier using the weighted average of the reliability values of the classifier at each viewpoint. The results show that our combined method exceeds the best of the individual classifiers constructed for the different viewpoints.

Keywords: GMM based Fisher vector, C-support vector classification, viewpoint combination

1 Introduction

Accurate knowledge of the identity, statistics and uses of plants is essential in agricultural development. Identifying plant species is usually a very difficult task, even for professionals (such as farmers or wood exploiters) or for botanists themselves. Image retrieval technologies are nowadays considered by botanists as a promising direction to address this problem, and a corresponding challenge was announced in the LifeCLEF campaign [3].

The image-based plant identification task [7] focused on identifying tree, herb and fern species based on different types of images. The images belong to 7 viewpoints: branch, leaf, scan (scan or scan-like pictures of a leaf, briefly "LeafScan"), flower, fruit, stem, and entire views. The number of species was about 500, which is an important step towards covering the entire flora of a given region. The aim of the task was to produce a list of relevant species for each observation of a plant in the test dataset, i.e. one or several pictures related to the same event: the same person photographing several detailed views of various organs of the same plant, on the same day, with the same device and under the same lighting conditions. So the task was observation-centered (not image-centered).

The task was based on the Pl@ntView dataset, which focuses on the plants of France (some plant observations came from neighbouring countries). It contains more than 60000 pictures, each belonging to one of the 7 view types reported in the meta-data, which is given in an XML file (one per image) with explicit tags such as ObservationId, species name, date, etc. The task was evaluated as a plant species retrieval task based on multi-image plant observation queries. The goal was to retrieve the correct plant species among the top results of a ranked list of species returned by the evaluated system. An observation may contain 1 to 5 images depicting the same individual plant observed by the same person on the same day. Each image of a query observation is associated with a single view type (entire plant, branch, leaf, fruit, flower, stem or leaf scan) and with contextual meta-data (date, location, and author). Each participating group was allowed to submit up to 4 runs built from different methods. User rating information (the average of the user ratings on image quality for each picture) was also available, but we have not used this additional information.
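As an illustration of this observation-centered structure, the per-image XML meta-data can be used to group the images of the test set into observations before classification. The following sketch is our own illustrative Python example, not part of the challenge toolkit; only the ObservationId tag name is taken from the task description, while the directory layout and the helper name are assumptions.

```python
import os
import xml.etree.ElementTree as ET
from collections import defaultdict

def group_images_by_observation(metadata_dir):
    """Group image identifiers by the ObservationId read from the
    per-image XML meta-data files (one XML file per image)."""
    observations = defaultdict(list)
    for filename in os.listdir(metadata_dir):
        if not filename.endswith(".xml"):
            continue
        root = ET.parse(os.path.join(metadata_dir, filename)).getroot()
        obs_id = root.findtext("ObservationId")
        image_id = os.path.splitext(filename)[0]  # image id = file name stem
        observations[obs_id].append(image_id)
    return observations
```

All image-level predictions are later aggregated over these groups, since the task is scored per observation.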
2 Image-based plant classification

2.1 Elaboration of image descriptors

The first part of the classification is the construction of a representation of each image based on its visual content. This consists of three steps, the usual phases in computer vision: (i) feature detection, (ii) feature description, (iii) image description.

Feature detection: Many different feature types can be detected in an image as its "interesting" parts, e.g. corners, edges, ridges. Furthermore, many feature extraction methods are available for images, but we have chosen the SIFT (Scale-Invariant Feature Transform) algorithm [11][12], because it is widely used both in practice and in theoretical work, with several possible further developments.

Feature description: In our solution we used the dense sampling method with SIFT (briefly, dense SIFT). This sampling can be considered as a two-dimensional grid laid over the image, where a SIFT descriptor is calculated at each grid point. After that we used PCA (Principal Component Analysis) [1][9] to reduce the dimension of the descriptor vectors from 128 to 80. Each descriptor vector belongs to only one "interesting" point of an image, but an image possesses many feature descriptor vectors, which should be aggregated into a single image descriptor.

Image description: The final step of creating the representation is the completion of a high-level representation of each image. We applied the BoW (bag-of-words) model [6][10] for this purpose, where images are treated as documents. Accordingly, "visual words" (so-called "codewords") need to be defined in images from the feature descriptors. The whole set of codewords gives the codebook (similar to the dictionary in text tasks). To determine the codebook we used a GMM (Gaussian Mixture Model) [15][17]. This is a parametric probability density function represented as a weighted sum of (in our case 256) Gaussian component densities. The GMM parameters were estimated on the training set using the iterative EM (Expectation Maximization) algorithm [5], but an initial model was needed for EM. In our training procedure k-means clustering [13] was performed over all the vectors with 256 clusters, and the result served as the initial model for EM. As a result of the algorithms described above, a codebook with 256 codewords was available for further calculations, which can be considered a concise representation of the image set. Based on the codebook, the next step is to create a descriptor, called the high-level descriptor, that specifies the distribution of the visual codewords in any image. To represent an image with this high-level descriptor, the GMM based Fisher vector [14][15] was calculated. These vectors form the final representation (image descriptor) of the images. The code used to train the GMM vocabularies and compute the Fisher vectors is a standalone C++ library, developed by Jorge Sánchez to support the research of the Visual Geometry Group of Oxford University [8].
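Since the actual implementation was the C++ library cited above, the following is only a minimal Python sketch of the same pipeline (dense SIFT, PCA to 80 dimensions, 256-component GMM, Fisher vector encoding), assuming OpenCV and scikit-learn; the encoding follows the standard Fisher vector gradients with respect to the GMM means and variances [14], without the later normalization improvements of [16].

```python
import numpy as np
import cv2
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def dense_sift(image_gray, step=8, size=16):
    """Compute SIFT descriptors on a regular two-dimensional grid."""
    sift = cv2.SIFT_create()
    keypoints = [cv2.KeyPoint(float(x), float(y), size)
                 for y in range(0, image_gray.shape[0], step)
                 for x in range(0, image_gray.shape[1], step)]
    _, descriptors = sift.compute(image_gray, keypoints)
    return descriptors  # shape: (number of grid points, 128)

# Fitting on descriptors pooled from the training images, e.g.:
#   pca = PCA(n_components=80).fit(train_descriptors)
#   gmm = GaussianMixture(n_components=256, covariance_type='diag').fit(
#       pca.transform(train_descriptors))
# GaussianMixture initializes EM with k-means by default, matching
# the initialization described above.

def fisher_vector(descriptors, gmm):
    """Encode the local descriptors of one image as a Fisher vector
    (gradients w.r.t. the means and variances of a diagonal GMM)."""
    X = np.atleast_2d(descriptors)
    N = X.shape[0]
    q = gmm.predict_proba(X)                   # (N, K) soft assignments
    mu, var = gmm.means_, gmm.covariances_     # (K, D) each
    w = gmm.weights_                           # (K,)
    diff = (X[:, None, :] - mu[None, :, :]) / np.sqrt(var)[None, :, :]
    g_mu = (q[:, :, None] * diff).sum(axis=0) / (N * np.sqrt(w)[:, None])
    g_var = (q[:, :, None] * (diff ** 2 - 1)).sum(axis=0) \
        / (N * np.sqrt(2 * w)[:, None])
    return np.hstack([g_mu.ravel(), g_var.ravel()])
```

With K = 256 Gaussian components and D = 80 dimensions after PCA, this yields a 2·K·D = 40960-dimensional image descriptor.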
2.2 Training the classifier

For the classification task we divided the labelled image set into three subsets: a training, a validation and a test set (the last one used for preliminary testing). The validation image set was used for calibration of the trained model during the validation phase of the training procedure. To train the classifier (classification model) on the training image set, a variant of the SVM (Support Vector Machine) was used: the C-SVC (C-support vector classification) [2][4] with RBF (Radial Basis Function) kernel. The SVM is basically a binary linear classifier, so in order to extend it to a number of categories, the one-against-all technique was used. In this method a binary classifier was created for each category in the training set. The two hyperparameters (C from C-SVC and γ from the RBF kernel) were optimized by a grid search on a two-dimensional grid. The algorithm was trained on the training image set and then validated on the validation set, with different hyperparameter values in each iteration. The parameter pair that gave the best result was selected to train the final classification model (for each category) on the whole image set.
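A minimal sketch of this hyperparameter search using scikit-learn's LIBSVM-based C-SVC is given below; the paper does not report the actual grid, so the logarithmic ranges are assumptions (note also that scikit-learn's SVC internally uses a one-vs-one scheme for multiple classes, whereas the method above used one-against-all).

```python
import numpy as np
from sklearn.svm import SVC

def grid_search_c_gamma(fv_train, y_train, fv_val, y_val):
    """Two-dimensional grid search for C (C-SVC) and gamma (RBF kernel),
    validated on a held-out set; the grid values are illustrative."""
    best_score, best_params = -np.inf, None
    for C in 2.0 ** np.arange(-5, 16, 2):
        for gamma in 2.0 ** np.arange(-15, 4, 2):
            clf = SVC(C=C, gamma=gamma, kernel='rbf')
            clf.fit(fv_train, y_train)
            score = clf.score(fv_val, y_val)
            if score > best_score:
                best_score, best_params = score, (C, gamma)
    return best_params  # then retrain on the whole labelled set with these
```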
2.3 Preliminary testing

After the training the codebook was already available, and only the Fisher vector of each image had to be computed. For the preliminary testing we selected only 50 species (classes), for both training and testing. An RBF based kernel matrix was built from the Fisher vectors of the test and training images. Each C-SVC classifier was parameterized with this matrix, and the hyperparameters were the same as in the final classification models. Since the classifiers are assigned to species, the model generated for a classifier is responsible for separating the designated class from the other ones. Thus a classifier is able to provide a confidence value showing the certainty of the class in a given image. We trained 7 classifiers, one for each viewpoint, and evaluated them in preliminary testing based on precision and computation time. The results of the preliminary testing can be seen in Table 1.

Table 1. Results of the preliminary testing

viewpoint   precision   testing time (per image) [sec]
Branch      0.341       1.82
Leaf        0.583       1.59
LeafScan    0.965       0.95
Stem        0.492       1.39
Flower      0.512       1.61
Entire      0.314       1.44
Fruit       0.482       1.56

2.4 Viewpoints combination for observation classification

The decision about an observation could be based on majority voting of the image-level decisions, but we used continuous information instead of discrete votes. The C-SVC classifier calculates a continuous reliability value for each class in each image, and we constructed a combined classifier using the weighted average of these reliability values. Our combined classifier applies the formula in Equation 1 for the aggregated reliability value that an observation belongs to class c (species c):

$$R(c) = \frac{1}{\sum_{i=1}^{N_{VP}} w_i} \sum_{i=1}^{N_{VP}} w_i R_i(c) = \frac{1}{\sum_{i=1}^{7} w_i} \sum_{i=1}^{7} w_i \frac{1}{N_{i,p}} \sum_{n=1}^{N_{i,p}} r_n(c) \qquad (1)$$

- $N_{VP}$ is the number of viewpoints, which equals seven in this challenge
- $w_i$ is the weight parameter of viewpoint i
- $r_n(c)$ is the reliability value for class c coming from the C-SVC classifier for the n-th image
- $N_{i,p}$ is the number of images in viewpoint i taken from the p-th plant observed
- $R_i(c)$ is thus the average reliability value for class c over the images of viewpoint i

Based on the R(c) values, the final decision is always the species that possesses the largest R(c) value. In the challenge a ranked order of predicted species had to be submitted, and we constructed this order based on the R(c) values as well.

In estimating the weight parameters we took the goodness of the different viewpoint classifiers into consideration. As can be seen in the results of the preliminary testing (Table 1), LeafScan has the best precision. Therefore LeafScan received the largest weight parameter, and we chose the following weight parameters empirically: LeafScan: 7.5, Leaf: 2.5, Flower: 1.5, Fruit: 1.5, Stem: 1.5, Branch: 1.5, Entire: 1.5.
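The sketch below is our own minimal Python illustration of Equation 1 (the function and variable names are hypothetical). Viewpoints with no image in the given observation are simply omitted from both sums; this is our interpretation, since Equation 1 leaves the case N_{i,p} = 0 undefined.

```python
import numpy as np

# Empirical viewpoint weights from Section 2.4
WEIGHTS = {'LeafScan': 7.5, 'Leaf': 2.5, 'Flower': 1.5, 'Fruit': 1.5,
           'Stem': 1.5, 'Branch': 1.5, 'Entire': 1.5}

def combine_viewpoints(reliabilities):
    """Aggregate per-image reliabilities into R(c) as in Equation 1.

    reliabilities: dict mapping a viewpoint name to an array of shape
        (number of images in that viewpoint, number of classes) holding
        the per-image C-SVC reliability values r_n(c).
    """
    weighted_sum, weight_total = 0.0, 0.0
    for viewpoint, r in reliabilities.items():
        R_i = np.asarray(r).mean(axis=0)  # average over the N_{i,p} images
        weighted_sum = weighted_sum + WEIGHTS[viewpoint] * R_i
        weight_total += WEIGHTS[viewpoint]
    return weighted_sum / weight_total

# The submitted ranking is the classes sorted by decreasing R(c):
# ranking = np.argsort(-combine_viewpoints(observation_reliabilities))
```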
3 Evaluation

3.1 Evaluation metrics

In the official evaluation, instead of precision (as used in our preliminary testing), a new evaluation metric was defined to measure the goodness of the observation classification. This metric (the S score) is defined as follows:

$$S = \frac{1}{U} \sum_{u=1}^{U} \frac{1}{P_u} \sum_{p=1}^{P_u} S_{u,p} \qquad (2)$$

- U: number of users (who have at least one image in the test data)
- $P_u$: number of individual plants observed by the u-th user
- $N_{u,p}$: number of pictures taken from the p-th plant observed by the u-th user
- $S_{u,p}$: score between 0 and 1, equal to the inverse of the rank of the correct species (for the p-th plant observed by the u-th user)

Although the goal was to classify the observations containing several images, an additional metric was defined for image classification, as can be seen in Equation 3:

$$S_{image} = \frac{1}{U} \sum_{u=1}^{U} \frac{1}{P_u} \sum_{p=1}^{P_u} \frac{1}{N_{u,p}} \sum_{n=1}^{N_{u,p}} S_{u,p,n} \qquad (3)$$

- U: number of users (who have at least one image in the test data)
- $P_u$: number of individual plants observed by the u-th user
- $N_{u,p}$: number of pictures taken from the p-th plant observed by the u-th user
- $S_{u,p,n}$: score between 0 and 1, equal to the inverse of the rank of the correct species (for the n-th picture taken from the p-th plant observed by the u-th user)
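As a worked illustration of the observation-level metric (Equation 2), the following is our own sketch, where the hypothetical ranks structure holds, per user, the rank of the correct species for each observed plant:

```python
def s_score(ranks):
    """Compute the S score of Equation 2.

    ranks: list with one entry per user; each entry is a list of the
        integer ranks (>= 1) of the correct species for that user's
        plants. Rank 1 (correct species ranked first) gives S_{u,p} = 1.
    """
    user_scores = []
    for user_ranks in ranks:
        per_plant = [1.0 / r for r in user_ranks]  # S_{u,p} = 1 / rank
        user_scores.append(sum(per_plant) / len(per_plant))
    return sum(user_scores) / len(user_scores)

# Example: user 1 hit ranks 1 and 4, user 2 hit rank 2:
# s_score([[1, 4], [2]]) == ((1 + 0.25) / 2 + 0.5) / 2 == 0.5625
```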
3.2 Final official results

The S_image score can be calculated for each viewpoint, and these scores can be compared. Our final official results for each viewpoint and for the observations can be seen in Table 2; the S score of the observations exceeds the best S score among all the viewpoints.

Table 2. Our final official results

viewpoint / observation   S score
Branch                    0.052
Leaf                      0.019
LeafScan                  0.119
Stem                      0.072
Flower                    0.115
Entire                    0.06
Fruit                     0.07
Observation               0.255

Our final official observation results (BME TMIT) compared with those of the other participants can be seen in Fig. 1.

Fig. 1. Final official observation results of participants

4 Conclusion

We have elaborated a viewpoints combined classification method for the image-based plant identification task. We applied dense SIFT for feature detection and description, and a Gaussian Mixture Model based Fisher vector was calculated to represent each image with a high-level descriptor. The chosen classifier was the C-support vector classification algorithm with RBF (Radial Basis Function) kernel, and we optimized its two hyperparameters (C from C-SVC and γ from the RBF kernel) by a grid search on a two-dimensional grid. We constructed a combined classifier using the weighted average of the reliability values of the classifiers at each viewpoint. The weight parameters of the combined classifier were based on our preliminary testing results. The observation result of our combined method exceeds our best score among all the viewpoints. In the official evaluation our solution reached a score of 0.255.

Acknowledgement

The publication was supported by the TÁMOP-4.2.2.C-11/1/KONV-2012-0001 project. The project has been supported by the European Union, co-financed by the European Social Fund.

References

1. Abdi, H., Williams, L. J.: Principal Component Analysis, Wiley Interdisciplinary Reviews: Computational Statistics, Vol. 2, No. 4, pp. 433-459 (2010)
2. Boser, B., Guyon, I., Vapnik, V.: A Training Algorithm for Optimal Margin Classifiers, Proc. of the 5th Annual ACM Workshop on Computational Learning Theory, pp. 144-152 (1992)
3. Joly, A., Müller, H., Goëau, H., Glotin, H., Spampinato, C., Rauber, A., Bonnet, P., Vellinga, W. P., Fisher, B.: LifeCLEF 2014: Multimedia Life Species Identification Challenges, In: Proceedings of CLEF 2014 (2014)
4. Cortes, C., Vapnik, V.: Support-Vector Networks, Machine Learning, Vol. 20, No. 3, pp. 273-297 (1995)
5. Dempster, A., Laird, N., Rubin, D.: Maximum Likelihood from Incomplete Data via the EM Algorithm, Journal of the Royal Statistical Society, Vol. 39, No. 1, pp. 1-38 (1977)
6. Fei-Fei, L., Fergus, R., Torralba, A.: Recognizing and Learning Object Categories, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) (2007)
7. Goëau, H., Joly, A., Bonnet, P., Selmi, S., Molino, J. F., Barthélémy, D., Boujemaa, N.: LifeCLEF Plant Identification Task 2014, In: CLEF Working Notes 2014 (2014)
8. Chatfield, K., Lempitsky, V., Vedaldi, A., Zisserman, A.: The Devil is in the Details: An Evaluation of Recent Feature Encoding Methods, British Machine Vision Conference, pp. 76.1-76.12 (2011)
9. Ke, Y., Sukthankar, R.: PCA-SIFT: A More Distinctive Representation for Local Image Descriptors, In: Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), Vol. 2, pp. II-506 (2004)
10. Lazebnik, S., Schmid, C., Ponce, J.: Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, New York, Vol. 2, pp. 2169-2178 (2006)
11. Lowe, D. G.: Distinctive Image Features from Scale-Invariant Keypoints, International Journal of Computer Vision, Vol. 60, No. 2, pp. 91-110 (2004)
12. Lowe, D. G.: Object Recognition from Local Scale-Invariant Features, In: International Conference on Computer Vision, Corfu, Greece, pp. 1150-1157 (1999)
13. MacQueen, J.: Some Methods for Classification and Analysis of Multivariate Observations, Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, pp. 281-297 (1967)
14. Perronnin, F., Dance, C.: Fisher Kernels on Visual Vocabularies for Image Categorization, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) (2007)
15. Reynolds, D. A.: Gaussian Mixture Models, Encyclopedia of Biometric Recognition, Springer, pp. 659-663 (2009)
16. Sánchez, J., Perronnin, F., Mensink, T.: Improved Fisher Vector for Large Scale Image Classification, In: Proc. of the 11th European Conference on Computer Vision (ECCV): Part IV, pp. 143-156 (2010)
17. Tomasi, C.: Estimating Gaussian Mixture Densities with EM: A Tutorial, Technical Report, Duke University (2004)