Participation of LSIS/DYNI to ImageCLEF
       2012 plant images classification task

           Sébastien Paris1 ? , Xanadu Halkias2 , and Hervé Glotin23
                      1
                       LSIS/DYNI, Aix-Marseille University
                          sebastien.paris@lsis.org
                  2
                    LSIS/DYNI, University of South Toulon-Var
                             halkias@univ-tln.fr
                         3
                           Institut National de France
                             glotin@univ-tln.fr


      Abstract. This paper presents the participation of the LSIS/DYNI
      team for the ImageCLEF 2012 plant identification challenge. Image-
      CLEF’s plant identification task provides a testbed for the system-oriented
      evaluation of tree species identification based on leaf images. The goal is
      to investigate image retrieval approaches in the context of crowd sourced
      images of leaves collected in a collaborative manner. The LSIS/DYNI
      team submitted three runs to this task and obtained the best evalua-
      tion scores (S = 0.32) for the ”photograph” image category with an
      automatic method. Our approach is based on a modern computer vi-
      sion framework involving local, highly discriminative visual descriptors,
      sophisticated visual-patches encoder and large-scale supervised classifi-
      cation. The paper presents the three procedures employed, and provides
      an analysis of the obtained evaluation results.

      Keywords: LSIS, DYNI, ImageCLEF, plant, leaves, images, collection,
      identification, classification, evaluation, benchmark


1   Introduction
This paper presents the contribution of the LSIS/DYNI group for the plant
identification task that was organized within ImageCLEF 20124 for the system-
oriented evaluation of visual based plant identification. Similar to the Image-
CLEF 2011 challenge, this second year pilot task was also precisely focused on
tree species identification based on leaf images. This year, the challenge was or-
ganized as a classification task over 126 tree species with visual content being
the main available information. Three types of image content were considered:
leaf ”scans”, leaf photographs with a white uniform background (referred to as
”scan-like” pictures) and unconstrained leaf ”photographs” acquired on trees
with natural background (see Fig. 1). The LSIS/DYNI team submitted three
?
  Granded by COGNILEGO ANR 2010-CORD-013 and PEPS RUPTURE Scale
  Swarm Vision
4
  http://www.imageclef.org/2012/plant
runs, all of them based on local feature extraction and large-scale supervised
classification. We obtained the best score for the ”photographs” category with
an automatic method (S = 0.32).


      Fig. 1. From left to right: ”scans”, ”scan-like” and ”photographs” category.


2     Task description

The task has been evaluated as a plant species retrieval task.


2.1    Training and Test data

A part of Pl@ntLeaves II dataset was provided as training data whereas the
remaining part was used later as test data. The training subset was built by
including the training AND test subsets of last year’s Pl@ntLeaves I dataset,
and by randomly selecting 2/3 of the individual plants for each NEW species
(several pictures might belong to the same individual plant but cannot be split
across training and test data).

 – The training data is comprised of 8422 images (4870 ”scans”, 1819 ”scan-
   like” photos, 1733 natural photos) with full xml files associated to them (see
   previous section for few examples). A ground-truth file listing all images of
   each species was provided complementary.
 – The test data is comprised of 3150 images (1760 ”scans”, 907 ”scan-like”
   photos, 483 natural photos) with purged xml files (i.e. without the taxon
   information that has to be predicted).
2.2     Task objective and evaluation metric

The goal of the task was to retrieve the correct species among the top k species
of a ranked list of retrieved species for each test image. Each participant was
allowed to submit up to 4 runs built from different methods. As many species as
possible can be associated to each test image, sorted by decreasing confidence
score. However, only the most confident species were used in the primary evalu-
ation metric described below. Providing an extended ranked list of species was
encouraged in order to derive complementary statistics (e.g. recognition rate at
other taxonomic levels, suggestion rate on top k species, etc.).
     The primary metric used to evaluate the submitted runs was a normalized
classification rate evaluated on the 1st species returned for each test image. Each
test image is attributed with a score of 1 if the 1st returned species is correct and
0 if it is wrong. An average normalized score is then computed on all test images.
A simple mean on all test images would indeed introduce some bias with regard
to a real world identification system. Indeed, we recollect that the Pl@ntLeaves
II dataset was built in a collaborative manner; So that few contributors might
have provided much more pictures than many other contributors who provided
few. Since we want to evaluate the ability of a system to provide correct answers
for all users, we would rather measure the mean of the average classification
rate per author. Furthermore, some authors sometimes provided many pictures
of the same individual plant (to enrich training data with less efforts). Since we
want to evaluate the ability of a system to provide the correct answer based on
a single plant observation, we also decided to average the classification rate on
each individual plant. Finally, our primary metric was defined as the following
average classification score S:

                                U      Pu      Nu,p
                             1 X 1 X       1 X
                        S=                          su,p,n ,                     (1)
                             U u=1 Pu p=1 Nu,p n=1

where

 – U : number of users (who have at least one image in the test data)
 – P u : number of individual plants observed by the uth user
 – N u, p : number of pictures taken from the pth plant observed by the u-th
   user
 – su,p,n : classification score (1 or 0) for the nth picture taken from the pth
   plant observed by the uth h user

Finally, to isolate and evaluate the impact of the image acquisition type (”scans”,
”scan-like”, ”photograph”), a normalized classification score S was computed
for each type separately. Participants were therefore allowed to train distinct
classifiers, use different training subsets or use distinct methods for each data
type.
3     Description of used methods
For all submitted runs, whatever the particular image type, we followed the same
pipeline: i) feature extraction coupled with spatial pyramid (SP) for local analysis
and a linear large-scale supervised classification. For our first participation, we
didn’t performe any (supervised) segmentation leading to the extraction of more
elaborate and specific descriptors for leaf classification.

3.1   Common procedures
Spatial pyramid local analysis
We define our SP matrix Λ with L levels such as Λ , [r y , r x , dy , dx , λ]. Λ
is a matrix of size (L × 5). For a level l ∈ {0, . . . , L − 1}, the image I, with
size (ny × nx ), is divided into potentially overlapping sub-windows Rl,v of size
(hl × wl ). All these windows are sharing the same associated weight λl . In our
implementation, hl , bny .ry,l c and wl , bnx .rx,l c where ry,l , rx,l and λl are
the lth element of vectors r y , r x and λ respectively. Sub-window shifts in x − y
axis are defined by integers δy,l , bny .dy,l c and δx,l , bnx .dx,l c where dy,l and
dx,l are elements of dy and dx respectively. Overlapping can be performed if
dy,l ≤ ry,l and/or dx,l ≤ rx,l . The total number of sub-windows is equal to
                     L−1          L−1
                     X            X         (1 − ry,l )        (1 − rx,l )
               V =         Vl =         b               + 1c.b             + 1c.       (2)
                                               dy,l               dx,l
                     l=0          l=0
                                                                                  
                                                                       1 1 1 11
Fig. 2 shows an example of SP with our particular choice Λ = 1 1 1 1              .
                                                                       2 4 4 8 1
For this particular Λ matrix, we divided twice more the vertical axis than the
horizontal one according to the aspect ratio distribution of images in the dataset.


Linear support vector machines for large-scale classification
Let’s assume available a training data set {xi , yi }N                 d
                                                     i=1 , where xi ∈ R is a descrip-
tor extracted from image I i and yi ∈ {1, . . . , M }, where M = 126 is the number
of classes and N = 8422 is the number of training samples. As in [13, 1], we will
use a simple large-scale linear SVM such as LIBLINEAR [6] with the 1-vs-all
multi-class strategy. The associated binary unconstrained convex optimization
problem to solve is:
                         (            N
                                                                         )
                             1 T     X                       2
                   min         w w+C     max 1 − yi wT xi , 0                ,         (3)
                   w         2       i=1

 where the parameter C controls the generalization error and is tuned on a
specific validation set. LIBLINEAR converges to a solution linearly in O(dN )
                    2
compared to O(dNsv    ).
    Moreover, in order to obtain an estimate of p(y = l|x), we performed an SVM
regression given the output of the previous classification stage for each binary
classifier.
Fig. 2. Example of SPM Λ with L = 2 and V = 1 + 21. Upper-left corner of each
window Rl,v is indicated with a red cross. Left: R0,0 = I for l = 0 (first level). Right:
{R1,v }, v = 0, . . . , 20 for l = 1 (second level).


3.2     Multiscale Color Local Phase Quantization (MSCLPQ) →
        LSIS DYNI run 1
Following [4, 5], we extend the basic decorrelated Local Phase Quantization
(LPQ) descriptor for a multi-scale and color channel analysis over a spatial
pyramid.
    In LPQ, Short Fourier Transforms (SFT) are computed over M ×M windows
                                             T             T             T
centered on z at four frequencies u1 = [a, 0] , u2 = [0, a] , u3 = [a, a] and
            T            1
u4 = [a, −a] with a = M such that

                                     f (z − y)e−j2πu y ,
                                 X                  T
                      F (u, z) =                                           (4)
                                        y∈Nz

where z ∈ R ⊂ I. For each pixel, we compute the LPQ code as5
                       3
                       X                                3
                                                        X
           LP Q(z) =          22i 1{Re(F (u,z ))≥0} +         22i+1 1{Im(F (u,z ))≥0} ,   (5)
                        i=0                             i=0

where LP Q(z) ∈ {0, . . . , 255}. Local histograms of LPQ codes are retrieved by
counting occurrences of each individual LPQ code j such as:
                                 X
               xLP Q (j, R) =        1{LP Q(z )=j} , j = 0, . . . , 255.     (6)
                                 z∈R

The local histogram vector is defined by
                    xLP Q (R) , [xLP Q (0, R), . . . , xLP Q (255, R)] ,                  (7)
5
    1{x} = 1 if event x is true, 0 otherwise.
where xLP Q (R) is furthermore `2 normalized. The full vector x is obtained
by concatenating previous normalized
                                     histograms for 4 different scales M ∈
                  1 1 1 1 1
{3, 5, 7, 9}, Λ = 1 1 1 1 1 (V = 1 + 21) and the 3 (R, G, B) color channels.
                   2 4 4 8 8
The total dimension of this vector is equal to d = 256.(1 + 21).4.3 = 67584.
     Finally, we normalize each element of xi such that xi,l ∈ [−1, 1] , l =
1, . . . , d, i = 1, . . . , N followed by `2 normalization on xi . The a posteriori
probabilities associated with the MSCLPQ approach are denoted p1 (y = l|x).


3.3   Late fusion of MSCLPQ, MSCILBP and MSILBP+ScSPM →
      LSIS DYNI run 2

Multiscale Color Local Phase Quantization
See sec. 3.2


Multiscale Color Improved Local Binary Pattern (MSCILBP)
Basically, the operator ILBP encodes the relationship between a central block
                                             T
of (s × s) pixels located in z c = [yx , xc ] with its 8 neighboring blocks [8] and
also adds a ninth bit encoding a term homogeneous to the differential excitation.
This operator can be considered as a non-parametric local texture encoder for
scale s. In order to capture information at different scales, the range analysis
s ∈ S, is typically set at S = [1, 2, 3, 4] for this task, where S = Card(S). This
micro-codes are defined as follows:
                                  i=7
                                  X
               ILBP (z c , s) =         2i 1{Ai ≥Ac } + 28 1 P
                                                              7
                                                                                    ,    (8)
                                                                          Ai ≥8Ac
                                  i=0                               i=0


where ∀z c ∈ R ⊂ I, ILBP (z c ) ∈ N29 .
   The different areas {Ai } and Ac in eq.(8) can be computed efficiently using
the image integral technique [12]. Let’s define II the image integral of I by:
                                           0     0
                                          yX <y xX <x
                          II(y, x) ,                     I(y 0 , x0 ).                    (9)
                                          y 0 =0 x0 =0


Any square area A(y, x, s) ∈ R (see right Fig. 3) with upper-left corner located
in (y, x) and side length s is the addition of only 4 values:

  A(y, x, s) = II(y + s, x + s) + II(y, x) − (II(y, x + s) + II(y + s, x)).              (10)

 As for MSCLPQ, efficient features are obtained by counting occurrences of the
j th visual ILBP at scale s in a ROI R ⊆ I:
                                        X
                     xILBP (R, j, s) =        1{ILBP (z c ,s)=j} ,
                                       z c ∈R
Fig. 3. Left: I and ILBP (yx , xc ) overlaid. Right: corresponding image integral II and
the central block Ac . Ac can be efficiently computed with the 4 corner points.


where j = 0, . . . , b − 1 is the j th bin of the histogram and b = 512. Full histogram
of ILBP, denoted z ILBP is computed by:

             xILBP (R, s) , [xILBP (R, 0, s), . . . , xILBP (R, b − 1, s)] .       (11)

Finally, the full vector x is obtained by concatenating
                                                      previous normalized his-
                                                      1 1 1 1 1
tograms for 4 different scales s ∈ {1, 2, 3, 4}, Λ = 1 1 1 1 1 (V = 1 + 21)
                                                            2 4 4 8 8
and the 3 (R, G, B) color channels. The total dimension of this vector is equal to
d = 512.(1 + 21).4.3 = 135168. We also normalize each element of xi such that
xi,l ∈ [−1, 1] , l = 1, . . . , d, i = 1, . . . , N followed by `2 normalization on xi .
The a posteriori probabilities associated with MSCILBP approach are denoted
as p2 (y = l|x).


Sparse coding of dense MSILBP patches
Following the same framework as in [7, 13, 1, 3, 10], we will show here that the
traditional Bag of Features (BoF) approach can be advantageously replaced by
i) Sparse coding (Sc), ii) max-pooling technique.
    Specifically, F ILBP patches z ILBP (O k ) of size (m × m) centered on ROI’s
{O k } (possibly overlapping) are extracted (cf. eq. 7) for k = 0, . . . , F − 1 and
∀s ∈ S (see Fig. 4). For a faster computation for each scale s, the integral image
II is first computed from I.
    For a complete dataset containing N images and ∀s ∈ S, we obtain a collec-
tion of P = T S patches Z , {z i }, i = 1, . . . , P , where T = N F . We define, the
subset of patches z i at scale s by Z(s) ⊆ Z with T elements. In order to obtain
highly discriminative visual features, a common procedure consists of encoding
each patch z i ∈ Z(s) at scale s through an unsupervised trained dictionary
D , [d1 , . . . , dK ] ∈ Rb×K , where K denotes the number of dictionary elements,
and its corresponding weight vector ci ∈ RK . In the BoF framework, the vector
Fig. 4. ExampleLeft: ROI’s {O k }, k = 0, . . . , F − 1 of extracted patches used to
compute each ILBP where F = 10 · 10. Right: associated normalized histograms
{z ILBP (O k )}, one per column.


ci is assumed to have only one non-zero element:

                           XT
                   arg min     kz i − Dci k22   s.t. kci k`0 = 1,              (12)
                      D ,C i=1

where C , [c1 , . . . , cK ] and k • k`0 defines the pseudo zero-norm, where here
only one element of ci is non-zero. In eq. (12), under these constraints, (D, C)
can be optimized jointly by a Kmeans algorithm for example.
    In the Sc approach, in order to i) reduce the quantization error and ii) to have
a more accurate representation of the patches, each vector xi is now expressed
as a linear combination of a few vectors of the dictionary D and not only by a
single one. Imposing the exact number of non-zero elements in ci (sparsity level)
involves a non-convex optimization [9]. In general, it is preferred to relax this
constraint and to use instead an `1 penalty which also involves sparsity. The
problem is then reformulated using the following equation:

                     XT
             arg min     kz i − Dci k22 + βkci k`1   s.t. kci k`1 = 1,         (13)
                D ,C i=1

where the sparsity in controlled by the parameter β. The last equation is not
jointly convex in (D, C) and a common procedure consists of optimizing alter-
natively D given C by a block coordinate descent and then C given D by a
LASSO procedure [11]. At the end of the process, for each scale s ∈ S, a trained
dictionary D(s)
             b     is obtained.
    For an image I and given a trained dictionary D(s)    b    for a type of code at
scale s, F sparse vectors {ck (s)} are
                                  0   computed        by a LASSO algorithm. The
final efficient descriptor x(s) , x (s), . . . , xK−1 (s) ∈ RK is obtained by the
following max-pooling procedure [13, 2]:

                     xj (s) ,      max (|cjk (s)|), j = 0, . . . , K − 1,               (14)
                                 k|O k ∈R


where each element of x(s) represents the max-response of the absolute value
of sparse codes belonging to the ROI R. In order to improve accuracy, a spatial
pyramidal matching procedure helps to perform a more robust local analysis.
                                   L−1
                                   P
The spatial pyramid Λ has V =          Vl ROIs {Rl,v } with l = 0, . . . , L − 1,
                                            l=0
v = 0, . . . , Vl − 1 (see Fig. 5 for an example). The quantity z jl,v (s) for each ROI
Rl,v is computed by:

                   xjl,v (s) ,      max (|cjk (s)|), j = 0, . . . , K − 1.              (15)
                                 k|O k ∈Rl,v


We reinforce our model by an important normalization step that improves con-


Fig. 5. Example of SPM Λ with L = 2, F = 10 · 10 and V = 1 + 21. The F ROIs
{O k }, k = 0, . . . , F − 1 associated with each patch z k are represented by blue squares.
Sparse codes ck are computed for each ROI O k . Upper-left corner of each max-pooling
window Rl,v taking {100, 10} ck is indicated with a green cross. Left: R0,0 = I for
l = 0. Right: {R1,v }, v = 0, . . . , 20 for l = 1


siderably accuracy and consists of the `2 normalization of all vectors {xl,v (s)},
v = 0, . . . , Vl − 1, s ∈ S, i.e. belonging to the same pyramidal layer l. This step
is also very important and often hidden in the existing literature.
    The final descriptor x(Λ) will be defined by the weighted concatenation of all
the xl,v (s) vectors, i.e. x(Λ) , {λl xl,v (s)}, l = 0, . . . , L−1, v = 0, . . . , Vl −1 and
∀s ∈ S. The total size of the feature vector x(Λ) is d = K.V.S, where typically
in our simulations, we fixed K = 2048, V = 22 and S = 4. A final `2 clamped
normalization step is performed on the full vector x(Λ). In our experiment, we
extracted F = 35 · 35 patches per scale and per image with m = 26. 2000 patches
per class for each scale have been randomly selected to train dictionary (β = 0.2).
The a posteriori probabilities associated with MSILBP+ScSPM approach are
denoted p3 (y = l|x).

Late fusion
To obtain a final decision, we simple performed an average of all pf (y = l|x) a
posteriori probabilities, i.e.
                                          3
                                        1X
                         p(y = l|x) =      pf (y = l|x).                      (16)
                                        3
                                         f =1


3.4   Late fusion of MSCLPQ, MSCILBP, MSILBP+ScSPM and
      SIFT+ScSPM→ LSIS DYNI run 3
The three first stages are identical as in LSIS DYNI run 2.

Sparse coding of dense SIFT patches
As for MSILBP parches, we extracted F = 35 · 35 SIFT patches (m = 16)
per image and for each of the 4 scales (σ = {0.5, 0.65, 0.8, 1.0}). 2000 patches
per class for each scale have been randomly selected to train dictionary (β =
0.2, K = 2048). The a posteriori probabilities associated with SIFT+ScSPM
approach are denoted p4 (y = l|x).

Late fusion
We also performed an average of all pf (y = l|x) a posteriori probabilities
                                          4
                                        1X
                         p(y = l|x) =      pf (y = l|x).                      (17)
                                        4
                                         f =1


4     Results
Fig. 6 presents the summarized results for the ”scans” category. Without any
segmentation and/or specific pre-processing, we obtained a score S = 0.41 with
LSIS DYNI run 3, i.e. the 6th best score for all submitted runs (29 in total),
relatively close to the top-4 (S = 0.43). Higher scores can probably be obtained
with the use of color MBILBP+ScSPM and color SIFT+ScSPM features.
    In Fig. 7 we summarize the results for the ”scan-like” category. We obtained
a score S = 0.42 with LSIS DYNI run 3, i.e. the 7th best score for all submitted
runs (29 in total). In this case, with an unsupervised detector to ”home-in”
leafs more precisely, we could also improve results. Ideally, as for all runs above
S = 0.42, a prior segmentation is known to help considerably results for ”scans”
and ”scan-like” categories. Finally, Fig. 8 provides the summarized results for the
”photographs” category. We obtained a score S = 0.32 with LSIS DYNI run 3,
i.e. the 1th best score for automatic method and for all submitted runs (29 in
total). Our 3 runs obtained the best top-3 of all submitted runs.
                     Fig. 6. Results for the ”scans” category.


                   Fig. 7. Results for the ”scan-like” category.


5   Conclusions

For our first participation to ImageCLEF plants identification 2012 challenge, we
demonstrated that for ”photographs” category, our framework offers best perfor-
mances for automatic method. This category is considered the most challenging
due to ”real” in-situ conditions and shows that computer vision approaches for
image catagorization/fine-grained visual categorization are well adapted for this
challenge. Several improvements can be obtained, for example with some better
encoding schemes (Fisher vectors) and/or pooling technics.
                    Fig. 8. Results for the ”photographs” category.


References
 1. Boureau, Y., Bach, F., LeCun, Y., Ponce, J.: Learning mid-level features for recog-
    nition. In: CVPR’ 10 (2010)
 2. Boureau, Y., Ponce, J., LeCun, Y.: A theoretical analysis of feature pooling in
    vision algorithms. In: ICML’ 10 (2010)
 3. Chatfield, K., Lempitsky, V., Vedaldi, A., Zisserman, A.: The devil is in the details:
    an evaluation of recent feature encoding methods. In: BMVC (2011)
 4. Heikkilä, J., Ojansivu, V.: Methods for local phase quantization in blur-insensitive
    image analysis. In: LNLA’ 09 (2009)
 5. Heikkilä, J., Ojansivu, V., Rahtu, E.: Improved blur insensitivity for decorrelated
    local phase quantization. In: ICPR’ 10 (2010)
 6. Hsieh, C., Chang, K., Lin, C., Keerthi, S.: A dual coordinate descent method for
    large-scale linear svm (2008)
 7. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid
    matching for recognizing natural scene categories. In: CVPR’ 06 (2006)
 8. Liao, S., Zhu, X., Lei, Z., Zhang, L., Li, S.Z.: Learning multi-scale block local
    binary patterns for face recognition. In: ICB (2007)
 9. Mairal, J., Bach, F., Ponce, J., Sapiro, G.: Online dictionary learning for sparse
    coding. In: ICML ’09 (2009)
10. Paris, S., Halkias, X., Glotin, H.: Sparse coding for histograms of local binary
    patterns applied for image categorization: Toward a bag-of-scenes analysis. In:
    ICPR’ 12 (2012)
11. Tibshirani, R.: Regression shrinkage and selection via the lasso. Journal of the
    Royal Statistical Society (Series B) 58 (1996)
12. Viola, P., Jones, M.: Robust real-time face detection. International Journal of Com-
    puter Vision 57 (2004)
13. Yang, J., Yu, K., Gong, Y., Huang, T.S.: Linear spatial pyramid matching using
    sparse coding for image classification. In: CVPR’ 09 (2009)