<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Inria IMEDIA2's participation at ImageCLEF 2012 plant identification task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vera Bakic</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Itheri Yahiaoui</string-name>
          <email>itheri.yahiaoui@univ-reims.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sofiene Mouine</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Saloua Ouertani-Litayem</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wajih Ouertani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anne Verroust-Blondet</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Herve Goeau</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexis Joly</string-name>
          <email>alexis.joly@inria.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Inria IMEDIA2 Team</institution>
          ,
          <addr-line>Rocquencourt</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Inria ZENITH Team</institution>
          ,
          <addr-line>Montpellier</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Laboratoire CReSTIC, Universite de Reims</institution>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes the participation of the Inria IMEDIA2 Team, within the Pl@ntNet project, in the ImageCLEF 2012 plant identification task. The runs used very distinct approaches, sometimes relying on similar extracted features. For the Scan and Scan-like categories, the first two runs combine distinct local and contour approaches in two ways (late and early fusion), while the third run explores the learning capacity of a multi-class SVM technique on a contour based descriptor. For Photograph, our runs used local features positioned towards the center of the image to reduce the impact of background features. In the second run, an automatic segmentation with a rejection criterion was attempted. In the third run, points were associated with interesting zones. In general, even though the methods were distinct, they performed very well.</p>
      </abstract>
      <kwd-group>
        <kwd>Pl@ntNet</kwd>
        <kwd>IMEDIA</kwd>
        <kwd>Inria</kwd>
        <kwd>ImageCLEF</kwd>
        <kwd>plant</kwd>
        <kwd>leaves</kwd>
        <kwd>images</kwd>
        <kwd>collection</kwd>
        <kwd>identification</kwd>
        <kwd>classification</kwd>
        <kwd>evaluation</kwd>
        <kwd>benchmark</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The plant identification task of ImageCLEF 2012 is a tree species identification task based on leaf images. It was organized as a plant species retrieval task over 126 species, with the visual content being the main available information. Three types of image content were considered: Scans - scans with a white background; Scan-like - photographs with a white, uniform background; and Photographs - unconstrained leaf images acquired on trees with a natural background.</p>
      <p>A part of the image dataset was provided as training data with full class labels at the beginning of the task, while the test dataset was provided several weeks later without labels. The training and test subsets were built so that images from the same individual plant are not present in both sets, which makes the task more similar to real external queries. The table below shows the composition of the data sets. (Pl@ntNet project web site: http://www.plantnet-project.org/)</p>
      <p>The identification score S was related to the rank of the correct species in the list of retrieved species as follows:</p>
      <p>S = (1/U) * sum_{u=1..U} (1/P_u) * sum_{p=1..P_u} (1/N_{u,p}) * sum_{n=1..N_{u,p}} S_{u,p,n}</p>
      <p>where
U = number of users (who have at least one image in the test data),
P_u = number of individual plants observed by the u-th user,
N_{u,p} = number of pictures taken from the p-th plant observed by the u-th user,
S_{u,p,n} = score between 0 and 1, equal to the inverse of the rank of the correct species (for the n-th picture taken from the p-th plant observed by the u-th user).</p>
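A minimal sketch of how this averaged score can be computed from the per-picture ranks of the correct species (the nested-list layout and the function name are our own, illustrative choices, not the official evaluation code):

```python
def identification_score(ranks):
    """S = (1/U) * sum_u (1/P_u) * sum_p (1/N_up) * sum_n 1/rank.

    ranks[u][p] is the list of ranks (1 = best) of the correct species
    for the pictures of the p-th plant observed by the u-th user.
    """
    user_scores = []
    for plants in ranks:                      # one entry per user
        plant_scores = []
        for pictures in plants:               # one entry per plant
            # inverse-rank score, averaged over the plant's pictures
            plant_scores.append(sum(1.0 / r for r in pictures) / len(pictures))
        user_scores.append(sum(plant_scores) / len(plant_scores))
    return sum(user_scores) / len(user_scores)
```

The nested averaging means a prolific user (or a heavily photographed plant) does not dominate the score: for example, one user whose two plants were ranked 1 and 2 scores (1 + 0.5) / 2 = 0.75.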
      <p>The Inria IMEDIA2 team, within the Pl@ntNet project, submitted three runs covering all three image categories. For Scan and Scan-like we tested early and late fusion of local and shape boundary features, using KNN and SVM classifiers. For Photograph, local features were selected using different geometric constraints and leaf detection and segmentation. In the following, we describe the Scan and Scan-like approaches in Section 2, while the Photograph ones are presented in Section 3. The results are discussed in Section 4. Concluding remarks are given in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>Methods used for Scan and Scan-like</title>
      <p>Both contour and local interest point descriptions are useful for leaf species identification: contour based descriptors capture the global shape of the leaves, while local descriptors associated with the extracted interest points retain their micro-texture. Our three runs test different approaches based on these two kinds of descriptors: the first two runs combine distinct local and contour approaches in two ways (late and early fusion, in Sections 2.1 and 2.2), while the third run (Section 2.3) explores the learning capacity of a multi-class SVM technique on a contour based descriptor.</p>
      <sec id="sec-2-1">
        <title>Late fusion of a large-scale local features matching method and a shape boundary feature based method → RUN1</title>
        <p>
          This method is inspired by two runs submitted by the Inria IMEDIA Team at ImageCLEF 2011 [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]: the large-scale local features matching method was the best in the Scan category, while the contour based method was the best in the Scan-like category. The two approaches are complementary: they retain both local and boundary shape information, and we observed that last year they performed well on distinct test images. We therefore decided to use a basic combination of the two methods, with a fusion of their responses at the image level. This allows us to keep the good results of each while minimizing the impact of an erroneous response from one of the approaches. In addition, several major changes were made in both approaches to improve their performance or reduce their use of computing resources.
        </p>
        <p>
          Large-scale local features matching. The basic algorithm applied for this method is: (i) interest points detection, (ii) local description of each interest point, (iii) local features matching.
(i) Interest points detection - 200 Harris points were used at four distinct resolutions, with a scale factor of 0.8 between resolutions [
          <xref ref-type="bibr" rid="ref11 ref9">9, 11</xref>
          ]. Where applicable, we output up to 4 significant orientations for the patch around each point. This enabled us to boost the number of training samples and to compensate for a non-ideal single-orientation detection, while keeping the number of points rather small. Figure 1 illustrates the process for sample patches (original patch, major orientation, additional orientations). Finally, the number of points rose to an average of 343 per image (the same results were achieved on last year's task with 500 points per image).
(ii) Local features are extracted around each Harris point, from an image patch oriented and scaled according to the detected orientation and scale: SURF [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]
is based on sums of 2D Haar wavelet responses; we used the OpenSURF implementation [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]; a 16-dim. histogram based on the Hough transform [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]; a
20-dim. Fourier histogram [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] and an 8-dim. Edge Orientation Histogram
were concatenated, resulting in a 108-dimensional vector.
(iii) Matching - Signatures in the training dataset are compressed and indexed
using RMMH method [
          <xref ref-type="bibr" rid="ref10 ref8">10, 8</xref>
          ]. Each local feature of a query image is compressed into a 256-bit hash code and its approximate 30 nearest neighbors are searched to obtain a set of candidate matches. Each database image is assigned a score equal to the number of its local features that were matched, and the images are then re-ranked according to this score.
        </p>
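The match-count scoring of step (iii) can be sketched as follows. This is a deliberately simplified stand-in: the real system uses RMMH 256-bit codes with an approximate 30-NN search, while here we use exact Hamming distance on small integer codes and a single nearest neighbor per query feature, purely for illustration:

```python
def hamming(a, b):
    """Hamming distance between two integer hash codes."""
    return bin(a ^ b).count("1")

def rank_images(query_codes, db):
    """db: list of (image_id, code) pairs, one per indexed local feature.

    Each query feature votes for the image owning its nearest code;
    images are then ranked by their number of matched features.
    """
    scores = {}
    for q in query_codes:
        img, _ = min(db, key=lambda e: hamming(q, e[1]))
        scores[img] = scores.get(img, 0) + 1
    return sorted(scores, key=scores.get, reverse=True)
```

With `db = [("A", 0b0000), ("A", 0b1111), ("B", 0b1000)]`, the query codes `[0b0000, 0b0001, 0b1100]` give two matches to image A and one to B, so A is ranked first.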
        <p>
          Contour based descriptor. It is a leaf boundary based descriptor that combines two complementary kinds of information: (i) the Directional Fragment Histogram (DFH) descriptor, with arbitrary parameters, introduced in [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. This descriptor has the advantage of outlining local orientation variations of the leaf margin, which are a key discriminant indicator of leaf species. It encodes the relative distribution density of groups of contour points with uniform orientation, and thus succeeds in detailing local properties of the leaf boundary. (ii) For the global geometric properties of the leaf shape, we used six shape parameters as in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]: circularity, convexity, solidity, rectangularity, sphericity and ellipse variance. Segmentation was done automatically using the Otsu algorithm [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], with the addition of an automatic selection of the channel giving the best separation.
Late fusion. The two methods described above each return a list of images from the same training database, ordered differently: the same image may appear in both lists, but at different ranks. The two lists are merged by setting the rank of an image to the minimum of its ranks in the two lists. In this way, we preserve a good position returned by one method and ignore the other, presumably incorrect, rank. After the fusion, the unified image list is re-ranked according to the new scores.
        </p>
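The min-rank late fusion above can be sketched as follows (function name and tie-breaking by image id are our own choices):

```python
def fuse_rankings(list_a, list_b):
    """Merge two rankings of the same database by minimum rank (0 = best)."""
    rank_a = {img: i for i, img in enumerate(list_a)}
    rank_b = {img: i for i, img in enumerate(list_b)}
    worst = max(len(list_a), len(list_b))  # rank for images missing from a list
    images = set(list_a) | set(list_b)
    fused = {img: min(rank_a.get(img, worst), rank_b.get(img, worst))
             for img in images}
    # re-rank the unified list by the fused score, ties broken by image id
    return sorted(images, key=lambda img: (fused[img], img))
```

For instance, `fuse_rankings(["x", "y", "z"], ["z", "y", "x"])` keeps both "x" and "z" at fused rank 0, so a good position found by either method survives the merge.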
      </sec>
      <sec id="sec-2-2">
        <title>Classification with a top-knn decision rule</title>
        <p>To obtain the species list from the image list, we counted the number of occurrences of each species among the top 15 images. The species were then re-ranked according to this score to obtain the final ordering of the proposed species.</p>
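The top-knn decision rule can be sketched as follows (`image_species`, mapping image id to species, is our own illustrative interface):

```python
from collections import Counter

def species_from_images(ranked_images, image_species, k=15):
    """Rank species by their number of occurrences in the top-k images."""
    votes = Counter(image_species[img] for img in ranked_images[:k])
    return [species for species, _ in votes.most_common()]
```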
        <p>Training data and descriptor choices. In order to determine the efficiency of each descriptor and which image category to use for training, we performed a series of "leave-individual-out" tests on the training data itself. That is, when computing the score for an image from the training database, we excluded from the returned response all the images belonging to the same individual plant as the query image. In addition, we averaged the score as for the official one. We concluded that the combination of the presented descriptors gives the best results on the training data sets, and that Scan images should be searched in the Scan dataset, while Scan-like images should be searched in the union of the Scan and Scan-like datasets. Assuming that the test images would behave similarly to the training images, we kept these parameters for the submitted run.</p>
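The "leave-individual-out" filtering step can be sketched as follows (`image_individual`, mapping image id to individual plant id, is a hypothetical interface of our own):

```python
def filter_response(response, query_individual, image_individual):
    """Drop from the retrieved list every image of the query's own plant."""
    return [img for img in response
            if image_individual[img] != query_individual]
```

The rank of the correct species is then computed on the filtered list, mimicking a real external query that cannot retrieve pictures of its own plant.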
        <sec id="sec-2-2-1">
          <title>Advanced shape context → RUN2</title>
          <p>
            The methods used in this run are based on an advanced shape context approach
[
            <xref ref-type="bibr" rid="ref14">14</xref>
            ], which extends the standard shape context [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ]. Here, two different sets of points are distinguished when computing the shape contexts: the voting set, i.e. the points used to describe the coarse arrangement of the shape, and the computing set, containing the points where the shape contexts are computed. Two scenarios are proposed by varying the computing set C and the voting set V of points in the image.
          </p>
        </sec>
        <sec id="sec-2-2-2">
          <title>SC0: Spatial relations between margin points</title>
          <p>
            Here the computing set C and the voting set V are identical. They involve only margin points, i.e. n points extracted from the margin by a uniform quantization: C = V = {margin points}
(Figure 2 (a)). This description corresponds to the shape context proposed by
Belongie et al. [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ] with a different matching method. Note that the venation network is not introduced here. The segmentation algorithm was the same as in Section 2.1.
          </p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>SC2: Spatial relations between salient and margin points</title>
        <p>Here we want to measure the spatial relationships between the salient points described in the context defined by the leaf margin (Figure 2 (b)).
Fig. 2. Points used in scenarios SC0 (a) and SC2 (b); the small circles represent the sample points on the leaf margin, the crosses the Harris salient points.
The voting set V is composed of all the margin points. The Harris points form the computing set C: C ≠ V, with C = {salient points} and V = {margin points}. As mentioned above, the salient points may lie inside the leaf or belong to the leaf margin. Our aim is to study the correlation between the venation network and the margin of leaves belonging to the same species.</p>
        <p>
          Local features. The advanced shape context captures a spatial configuration of points without taking into account local properties of the image around the set C of computing points. Thus, to enrich the description, a set of local features computed on the neighborhood of each point of C is introduced. As color is not a discriminant feature for leaves, we focus on texture and shape. Three local features are extracted from the gray levels of an image patch located around each point: a 16-dim. Hough histogram; a 40-dim. Fourier histogram [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] and an
8-dim. classical Edge Orientation Histogram, which is known to be suitable for
non-uniform textures. These three features have given promising results when
associated with Harris points on scans of leaves in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
        <p>
          Matching method. The feature matching is done with a Multi-Probe Locality Sensitive Hashing technique [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] and the L2 distance is used to compute the similarity between two feature vectors. The principle of this algorithm is to project all the features into an L-dimensional space and to use hash functions to reduce the search and time costs.
        </p>
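As a rough illustration of hashing L2 feature vectors, a minimal random-projection sketch is shown below. This is not Multi-Probe LSH itself (MPLSH additionally probes neighboring hash buckets to reduce the number of hash tables); the function name and parameters are our own:

```python
import random

def make_hash(dim, bits, seed=0):
    """Return a random-projection hash: R^dim -> integer code of `bits` bits."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(bits)]
    def h(v):
        code = 0
        for p in planes:
            # one bit per random hyperplane: the sign of the dot product
            code = (code << 1) | (sum(a * b for a, b in zip(p, v)) >= 0)
        return code
    return h
```

Vectors pointing in the same direction fall in the same bucket: a feature and a positively scaled copy of it always hash to the same code.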
      </sec>
      <sec id="sec-2-4">
        <title>Classification with a top-knn decision rule is as in Section 2.1.</title>
        <p>Descriptor choices. After an evaluation similar to the one described in Section 2.1 on the training data, RUN2 was constructed as follows:
- a combination of SC2 and local features with 200 Harris points for Scan;
- SC0 with 200 sample points on the leaf margin for Scan-like.</p>
        <sec id="sec-2-4-1">
          <title>SVM multi-class classification → RUN3</title>
          <p>
            This run is performed in order to explore the enhancement of shape feature behaviour through a learning scheme. In our experiments we tested 6 classification strategies (see Table 2). To do so, we split the training dataset
into train and validation sets with different proportions, while respecting the non-splitting of images coming from individual plants (the images of a given individual plant are used either in train or in validation, never in both). We performed a multi-class SVM technique on a contour based descriptor [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ].
Indeed, we adopted a one-vs-one schema [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ], which offers more balanced elementary binary classifications than one-vs-all. It decomposes the classification problem into several binary classification tasks, each built to discriminate between one pair of classes while discarding the rest. If K is the number of classes, one binary classifier is trained for each possible pair of classes, yielding K(K-1)/2 binary classifiers. When applied to a test sample, a vote is performed among the classifiers and the class is predicted according to the maximum voting strategy [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ]. As the kernel, we conventionally adopted a linear kernel on the 630-dimensional shape features.
          </p>
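The one-vs-one vote described above can be sketched as follows (the `classifiers` mapping is a hypothetical interface; in practice each entry would be a trained binary SVM returning one of its two classes):

```python
from itertools import combinations

def ovo_predict(x, classes, classifiers):
    """One-vs-one decision over K(K-1)/2 pairwise classifiers.

    classifiers[(i, j)] is a callable returning i or j for a sample x.
    Returns the full list of classes ranked by their vote counts.
    """
    votes = {c: 0 for c in classes}
    for i, j in combinations(classes, 2):
        votes[classifiers[(i, j)](x)] += 1
    return sorted(classes, key=lambda c: votes[c], reverse=True)
```

Returning the whole ranked class list, rather than only the top vote, matches the run's use of the full ranking for the score.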
          <p>
            We notice a surprising performance increase between the 70%-30% split and the 50%-50% one, which might come from less overfitting on the training subset. For the run, we chose to learn on Scan and Scan-like images together, without prior distinction, as this appeared more stable in the preliminary tests (see rows 5 and 6 in Table 2). To improve class recall, and since the final run evaluation takes it into consideration, we also kept all the classes, ranked according to the voting strategy [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ].
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Methods used for Photograph</title>
      <p>For the Photograph images, the background around the leaf is not uniform (sand, stones, other leaves), and the leaves may be deformed or mutually occluded. The shape boundary features used for Scan and Scan-like in Section 2.1 are unsuitable here: the automatic segmentation of leaf and background is far from perfect, and the detected shape is unreliable. In addition, the Harris points detector may mainly detect points in the background. In our runs we explored the following directions: exploiting the fact that the leaf is, in general, centered (Section 3.1); testing whether automatic segmentation improves performance (Section 3.2); and finally, using a multi-class SVM on embedded local features (Section 3.3).</p>
      <sec id="sec-3-1">
        <title>Rhomboid masking and local features matching → RUN2</title>
        <p>The basic algorithm applied for this method is the same as described in Section 2.1 for local features. The major difference is in the selection of the Harris points, where masking and point weighting were applied.</p>
        <p>Rhomboid filtering. In order to minimize the effect of the cluttered background, we modified the input image of the Harris points detector. The assumption is that the leaf is centered, so we masked the corners of the image with an adaptable rhomboid shape; the transition from foreground to masked-out region was smooth, to avoid points being detected on the mask boundary. Figure 3 illustrates (a) the points detected in the original image, (b) the masked image used as the new input and (c) the points detected in the masked image.
Grid-based point weighting. We noticed that even when only a small amount of cluttered background appears, most of the points are still located in the background. Thus, before selecting the best 200 points, we applied a weighting scheme based on a 7x7 grid: the number of points allocated for the current scale is distributed among the grid cells using a Gaussian-like distribution - the closer a cell is to the center, the more points are allocated and selected there. Figure 3 (d) illustrates the final selection of the points using grid- and center-based weighting.</p>
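A Gaussian-like allocation of the per-scale point budget over the 7x7 grid can be sketched as follows (the sigma value, rounding and function name are our own illustrative choices):

```python
import math

def cell_budget(total, grid=7, sigma=1.5):
    """Split `total` points over a grid x grid layout, favoring the center."""
    c = (grid - 1) / 2.0  # center cell index
    # Gaussian-like weight per cell, from its distance to the grid center
    w = [[math.exp(-((i - c) ** 2 + (j - c) ** 2) / (2 * sigma ** 2))
          for j in range(grid)] for i in range(grid)]
    s = sum(map(sum, w))
    return [[round(total * w[i][j] / s) for j in range(grid)] for i in range(grid)]
```

With a budget of 200 points, the central cell receives the largest allocation while the corner cells, where cluttered background is most likely, receive almost none.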
        <sec id="sec-3-1-1">
          <title>Training data and descriptor choices</title>
          <p>For the selection of descriptors and of the training dataset we used the same procedure as in Section 2.1. Only the Photograph data were used as the training database.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>Segmentation, filtering and local features matching → RUN1</title>
        <p>
          Another approach for the Photograph images was to attempt a segmentation,
where it was possible, and to reject the points that do not belong to the leaf region. In this case, we use the complete training database, with Scan, Scan-like and Photograph images. The advantage of this method is that it can be used for all types of test images. We used the provided "content" annotations - Leaf, Picked leaf and Leafage - and applied different processing for each type. The algorithm and the decision rule were the same as in Section 3.1.
Scan and Scan-like processing. All training images of these types were processed using the same algorithm as in the local features part of Section 2.1.
Leafage processing. All images of this type were processed using the same algorithm as in Section 3.1, because Leafage images have multiple leaves and are extremely cluttered and hard to segment.
Leaf and Picked leaf processing. For each image we attempted a segmentation using the Otsu algorithm [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], with the addition of an automatic selection of the channel giving the best separation; the LUV channels were used, as they give a better separation on cluttered backgrounds. Then, we automatically verified whether the region was well formed or whether the foreground and background classes were too mixed: under the assumption that we have two classes (Figure 4 (b)), we calculated the average distance of each point from its region center. If the regions were mainly centered and the difference between the distances was more than 20%, the segmentation was accepted as good; otherwise, we rejected it and the image was processed as if it were labeled Leafage. For a correct segmentation, the biggest found region that does not touch the image boundary was considered to be the leaf. Figure 4 illustrates the process of a correct (top) and a failed (bottom) segmentation. We output 400 points from the Harris detector, but for the final points list we keep up to 200 points that belong to the leaf region (Figure 4 (c)). With the addition of the multi-orientation points, the number of points per image rose to an average of 395 for the Photograph category. Figures 4 (a) and (d) show the points distribution for the original detection and after filtering or rejection.
Fig. 4 panels (top and bottom rows): (a) original points, (b) segmentation mask, (c) masked image, (d) filtered points (top) / points with rhomboid (bottom).
        </p>
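Our reading of the rejection criterion can be sketched as follows; this is a hedged interpretation of the text, not the authors' code, and the point-list interface and threshold handling are our own assumptions:

```python
def mean_dist(points, center):
    """Average Euclidean distance of a point set to a center."""
    return sum(((x - center[0]) ** 2 + (y - center[1]) ** 2) ** 0.5
               for x, y in points) / len(points)

def accept_segmentation(fg, bg, threshold=0.20):
    """Accept a two-class segmentation if the classes spread differently.

    fg, bg: lists of (x, y) pixel coordinates of the two classes.
    A compact foreground against a widely spread background gives a large
    relative difference of mean distances and is accepted; heavily mixed
    classes give similar spreads and are rejected.
    """
    cf = (sum(x for x, _ in fg) / len(fg), sum(y for _, y in fg) / len(fg))
    cb = (sum(x for x, _ in bg) / len(bg), sum(y for _, y in bg) / len(bg))
    df, db = mean_dist(fg, cf), mean_dist(bg, cb)
    return abs(df - db) / max(df, db) > threshold
```

A rejected image would then fall back to the Leafage processing, as described above.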
      </sec>
      <sec id="sec-3-3">
        <title>Multi-Class SVM on embedded local features → RUN3</title>
        <p>
          In this run, we explore a data-independent bag-of-words scheme for the description, with an automatic definition of partial image zones. On this representation, we perform a multi-class SVM for learning and predicting plant leaf classes. The run is motivated by the joint retrieval and learning scheme presented in [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]; however, we do not have annotations of the images' interesting zones provided by a user. Instead, we rely on bounding box sampling approaches to define the zones.
        </p>
        <p>
          Bounding box de nition. We used the objectness measure introduced in [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ],
which is a generic object detector quantifying how likely a part of an image (e.g. a window) is to contain an object of any class. When learning the parameters of the objectness cues, we did not use plant images with ground-truth relevant zones; instead, we learned on the generic Pascal VOC classes, expecting a broad transfer from generic objects to the world of leaves. We deliberately consider a small number of windows per image (10), since images mostly have a single or few zones of interest. Figure 5 shows an automatic selection of windows expected to delimit the object.
Representations. We used the local features from Section 3.1, projected through an efficient approach for feature-set representation based on random histograms [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Two representations were generated for each image: (i) the features falling within the bounding boxes, and (ii) the features in the whole image. The embedding parameters are chosen such that the final representation's histograms have a considerably high dimension of 20480 bins. To evaluate which representation is more suitable, we tested the following combinations using train/validation dataset splits and a linear multi-class SVM (as in Section 2.3): bounding box or whole image features used for the train or validation sets (Table 2, left 3 columns). For both the 1st and 4th strategies, we adopted a ranking based on class voting: each window votes once for its predicted class, and the final vote for a class is the number of times it is voted for by the bounding boxes of a given image. Table 2 shows the decision results for the three splits. It is clear that predicting on sampled bounding boxes while learning on the whole image noticeably outperforms the other strategies; thus, we used the 4th strategy. Since the final score exploits a ranking of classes, to avoid losing the relevant class we keep the best 4 classes voted within each image.
        </p>
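The per-window class voting with a best-4 cutoff can be sketched as follows (the function name is ours; `window_predictions` is the list of per-window SVM predictions for one image):

```python
from collections import Counter

def vote_windows(window_predictions, keep=4):
    """Each window votes once for its predicted class; keep the top classes."""
    votes = Counter(window_predictions)
    return [cls for cls, _ in votes.most_common(keep)]
```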
        <p>Table 2. Decision scores for the three train/validation splits:
Strategy | Train set | Test set | Decision (3 splits) | Avg.
1st | bounding box | bounding box | 0.001 0.001 0.000 | 0.001
2nd | bounding box | whole image | 0.250 0.254 0.269 | 0.257
3rd | whole image | whole image | 0.251 0.265 0.255 | 0.257
4th | whole image | bounding box | 0.325 0.291 0.345 | 0.320</p>
        <p>RUN2 gave the best result of the task, while RUN1 is 3rd and RUN3 5th. For the more difficult Photograph category, RUN1, RUN2 and RUN3 were respectively 4th, 5th and 10th among the 21 runs that used a fully automatic approach.</p>
        <p>Official scores:
Run | Run type | Scan | Scan-like | Photo | Avg
SABANCI OKAN run 2 | Humanly Assisted | 0.58 | 0.55 | 0.22 | 0.45
THEWHO run 3 | Humanly Assisted | 0.43 | 0.40 | 0.49 | 0.44
SABANCI OKAN run 1 | Automatic | 0.58 | 0.55 | 0.16 | 0.43
IFSC USP run 1 | Humanly Assisted | 0.35 | 0.41 | 0.51 | 0.42
LIRIS reves run 1 | Humanly Assisted | 0.42 | 0.51 | 0.33 | 0.42
RUN1 : INRIA Imedia PlantNet run 1 | Automatic | 0.49 | 0.54 | 0.22 | 0.42
RUN2 : INRIA Imedia PlantNet run 2 | Automatic | 0.39 | 0.59 | 0.21 | 0.40
THEWHO run 4 | Humanly Assisted | 0.37 | 0.53 | 0.43 | 0.38
LSIS DYNI run 3 | Automatic | 0.41 | 0.42 | 0.32 | 0.38
THEWHO run 1 | Humanly Assisted | 0.37 | 0.34 | 0.43 | 0.38
RUN3 : INRIA Imedia PlantNet run 3 | Automatic | 0.47 | 0.46 | 0.15 | 0.36
IFSC USP run 2 | Humanly Assisted | 0.34 | 0.43 | 0.30 | 0.36</p>
        <p>Scan and Scan-like. With respect to last year's absolute score values, we expected this year's scores to reach at least the same values, if not higher. Indeed, this year the score reflects the rank of the correct species (which always gives a non-zero value), while last year the score was a pure classification score (0 or 1). However, this was not the case, even for last year's participants. The reasons are probably that this year's task was more challenging: the number of species almost doubled and the number of contributors was higher, which increased the overall visual diversity of images and species. For instance, we observed that the train and test Scan images are rather different: most of the leaves in the train dataset are mature and green, while a good number of those in the test dataset are young or dead. For Scan-like, the image dataset has a varying quality, with illumination changes, more or less pronounced shadows and variations in the background.</p>
        <p>In spite of these difficulties, RUN1 performs well and confirms the good results obtained in last year's task. The late fusion of two complementary approaches keeps the good performance for both the Scan and Scan-like categories, whereas last year each method had the best performance in only one category.</p>
        <p>For Scan-like, RUN2 achieved the best performance in this year's task. We suppose that the delayed use of the boundary points in the matching phase brought out the similarity of partial shape boundary information. This may have compensated for variations in the leaf boundary (leaf poses, missing leaflets) and for imperfect segmentations (shadows), as they were not taken as global information. For Scan, the results are lower than expected. As we noted significant intra-class variations in the micro-texture, we suppose that the early fusion of the shape and local descriptors was less robust to them. However, this problem could be compensated by local features more suitable to this visual content.</p>
        <p>RUN3 also gave very good results on both the Scan and Scan-like categories, with more or less the same score for each, which seems coherent with the fact that the two types of images were considered as one in order to compensate for a lack of examples for some species. Moreover, this method approaches the score of RUN1, notably for Scan, while using only a subset of contour shape descriptors.
Photographs. RUN1 and RUN2 gave good results if we consider only the fully automatic runs. Our approaches even gave more or less the same results as some methods with human interaction.</p>
        <p>RUN1 had a slightly higher score than RUN2, which shows that the automatic segmentation and the use of the complete database for training proved useful. It is important to note that the segmentation was performed on 35% of all Photograph images, following the "leaf" and "picked leaf" tags and the rejection criterion. Within the segmented images, only 2% were complete misses (i.e. no leaf part included), and 5% contained partial leaf information. This supports the idea that it is better to focus on well framed (and segmentable) pictures and to reject the ones that are too cluttered.</p>
        <p>For its part, RUN3 gave intermediate performance within a group of
runs based on automatic approaches, all with more or less similar scores around
0.15. These results are lower than expected, which could be due to the fact
that the bounding boxes did not quite correspond to the central image part
containing the object of interest (the leaf). We are convinced that this method
could be improved, for instance by detecting interest points for each box
separately, or by using object detection models learned on plant images rather
than on general objects.</p>
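The suggested improvement (detecting points per box rather than filtering a global detection) can be contrasted with the post-hoc filtering sketched below, where the `(x, y)` interest points and box coordinates are hypothetical:

```python
def points_in_box(points, box):
    """Keep only interest points falling inside a detected bounding box.

    points: iterable of (x, y) tuples; box: (x_min, y_min, x_max, y_max).
    Detecting points inside each box separately, as suggested above, avoids
    spending detector budget on background clutter only to discard it here.
    """
    x0, y0, x1, y1 = box
    return [(x, y) for (x, y) in points if x0 <= x <= x1 and y0 <= y <= y1]
```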
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>The Inria IMEDIA2 team submitted runs that used distinct approaches,
sometimes relying on similar extracted features. Despite these differences, the
methods performed well and the three runs are placed in the top 10 runs for
each category.</p>
      <p>For the second year in a row we obtained very good results on Scan and
Scan-like. However, with respect to last year's absolute score we achieved
lower values, as did other returning participants, in spite of the improvement
of our methods, the proposal of new approaches and a less strict metric. This
highlights the fact that plant identification from leaf Scan and Scan-like
images is far from being solved, especially when considering more than 500 tree
species, as can be observed in France for instance. We will try to improve our
methods by studying new features even better suited to the visual diversity
introduced by new image contributors, and by exploring new classification and
combination approaches using the metadata (GPS, dates, hierarchical taxonomy
information).</p>
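One simple way to fold such metadata into the visual ranking is a weighted late fusion of probability-like scores per species; the scheme and the `alpha` weight below are illustrative assumptions, not the combination actually used in the runs:

```python
def combine_with_metadata(visual_scores, metadata_prior, alpha=0.7):
    """Late fusion: weighted mix of visual scores and a metadata-based prior.

    Both inputs map species -> probability-like score; alpha weights the
    visual evidence. A GPS- or date-derived prior could down-weight species
    unlikely at the observation site or season.
    """
    species = set(visual_scores) | set(metadata_prior)
    return {s: alpha * visual_scores.get(s, 0.0)
               + (1 - alpha) * metadata_prior.get(s, 0.0)
            for s in species}
```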
      <p>For Photograph, it is difficult to compare results obtained with fully
automatic approaches to those of semi-automatic approaches with human
assistance. If we consider only fully automatic approaches, we obtained
promising results that we hope to reproduce on plant organs like flowers or
fruits, for which it is far more difficult to obtain Scan or Scan-like images.
Considering fully automatic and humanly assisted approaches together, on one
hand we notice that runs from other teams with human assistance tend to reach
absolute scores quite similar to the best Scan and Scan-like runs. On the other
hand, we also notice that several automatic approaches outperform assisted
ones. We may therefore have to consider, alongside improving automatic
approaches, some human interactions, such as semi-supervised segmentation of
test images only.</p>
      <p>For all three categories, we still have to ensure that the correct
species is returned within the top 5 proposed species. This would make our
methods suitable for a mobile-based recognition application.</p>
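The top-5 requirement reduces to a simple membership test on the ranked candidate list; `scores` below is a hypothetical species-to-confidence map:

```python
def top_k_hit(scores, true_species, k=5):
    """True if the correct species is among the k best-scored candidates.

    scores: dict mapping species name -> confidence (higher is better).
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return true_species in ranked[:k]
```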
      <p>Acknowledgments. Part of this work was funded by the Agropolis
foundation through the project Pl@ntNet (http://www.plantnet-project.org/).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Alexe</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deselaers</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferrari</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>What is an object?</article-title>
          .
          <source>In: CVPR</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bay</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ess</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tuytelaars</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Gool</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Surf: Speeded up robust features</article-title>
          .
          <source>In: Computer Vision and Image Understanding (CVIU)</source>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Belongie</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malik</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Puzicha</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Shape matching and object recognition using shape contexts</article-title>
          .
          <source>IEEE Trans. Pattern Anal. Mach. Intell.</source>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Dong</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Charikar</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Efficiently matching sets of features with random histograms</article-title>
          .
          <source>In: ACM Multimedia Conference</source>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Evans</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Notes on the opensurf library</article-title>
          .
          <source>Tech. rep.</source>
          , University of Bristol (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ferecatu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Image retrieval with active relevance feedback using both visual and keyword-based descriptors</article-title>
          .
          <source>Ph.D. thesis</source>
          , University of Versailles St-Quentin-en-Yvelines (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Friedman</surname>
            ,
            <given-names>J.H.</given-names>
          </string-name>
          :
          <article-title>Another approach to polychotomous classification</article-title>
          .
          <source>Tech. rep. (96)</source>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8. Goeau, H.,
          <string-name>
            <surname>Joly</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yahiaoui</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bonnet</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mouysset</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Participation of INRIA &amp; Pl@ntNet to ImageCLEF 2011 plant images classification task</article-title>
          . In: CLEF (Notebook Papers/Labs/Workshop) (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Gouet</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boujemaa</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Object-based queries using color points of interest</article-title>
          .
          <source>In: IEEE Workshop on Content-Based Access of Image and Video Libraries</source>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Joly</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Buisson</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Random maximum margin hashing</article-title>
          .
          <source>In: CVPR</source>
          <year>2011</year>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Joly</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>New local descriptors based on dissociated dipoles</article-title>
          .
          <source>In: Proceedings of the 6th ACM international conference on Image and video retrieval</source>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Joly</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Buisson</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>A posteriori multi-probe locality sensitive hashing</article-title>
          .
          <source>In: 16th ACM international conference on Multimedia</source>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Knerr</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Personnaz</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dreyfus</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Single-layer learning revisited: A stepwise procedure for building and training a neural network</article-title>
          . In: Neurocomputing: Algorithms, Architectures and Applications (
          <year>1990</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Mouine</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yahiaoui</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verroust-Blondet</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Advanced shape context for plant species identification using leaf image retrieval</article-title>
          .
          <source>In: Proceedings of the 2nd ACM International Conference on Multimedia Retrieval</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Otsu</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>A threshold selection method from gray-level histograms</article-title>
          .
          <source>IEEE Trans. Syst., Man, Cybern.</source>
          (
          <year>1979</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Ouertani</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Crucianu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boujemaa</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Interactive learning of heterogeneous visual concepts with local features</article-title>
          .
          <source>In: MM '10: Proceedings of the 18th ACM international conference on Multimedia</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Yahiaoui</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Herve</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boujemaa</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Shape-based image retrieval in botanical collections</article-title>
          .
          <source>In: Advances in Multimedia Information Processing - PCM</source>
          <year>2006</year>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>