<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MRIM-LIG at ImageCLEF 2010 Visual Concept Detection and Annotation task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rami Al Batal</string-name>
          <email>Rami.Albatal@imag.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Philippe Mulhem</string-name>
          <email>Philippe.Mulhem@imag.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Laboratoire Informatique de Grenoble (LIG), Grenoble University</institution>
          ,
          <addr-line>CNRS, LIG</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes the participation of the MRIM research group of the LIG in one of the ImageCLEF Photo tasks: Visual Concept Detection and Annotation. For this task, we applied a simple state-of-the-art technique based on bags of visual words. We extracted SIFT-like features that integrate color (rgSIFT), proposed by van de Sande et al. [10]. We then used K-means clustering to group these features into 4000 clusters. For each image of the training set, we generated a 4000-dimensional histogram by counting the occurrences of each cluster, assigning each extracted feature to its nearest centroid. For recognition, we extracted the rgSIFT features from the test set and generated the corresponding 4000-dimensional histograms. We then applied SVMs with RBF kernels, using a probabilistic estimation of recognition. The results obtained by our run are presented.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>Image representation</title>
      <p>
        This year, we worked only on applying a simple state-of-the-art technique based
on bags of visual words for the annotation of images. This approach is inspired
by the work on text categorization in [4]. In the context of visual media, this
approach was originally proposed by Sivic and Zisserman in [9] for the
retrieval of video documents, before being applied to still images, initially by Csurka
et al. in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] for image classification and then in numerous works
([10], [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], [5], [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]) for image annotation.
      </p>
      <p>Visual features extracted</p>
      <p>We now focus on the features extracted from the images. Scale-Invariant
Feature Transform (SIFT) features [6] have been successful for the classification and
annotation of images. Because the images considered are consumer photographs, in
which color may play an important role, we considered SIFT-like features that
integrate color. Among such existing features, after experiments, we used for the
CLEF task the rgSIFT feature proposed by van de Sande et al. in [10]. The rgSIFT
features include color information around the salient points in the images. The
set of features extracted from the training set S_train is named S_feat_train.</p>
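      <p>As a minimal illustration of the color information involved, the sketch below converts one RGB pixel to normalized r and g chromaticities. It assumes, following the description of rgSIFT in [10] rather than anything stated in this paper, that the descriptor is computed on these two channels. The function name rg_chromaticity and the handling of black pixels are hypothetical.</p>
      <preformat><![CDATA[
```python
def rg_chromaticity(red, green, blue):
    """Return the (r, g) chromaticities of one RGB pixel.

    r = R / (R + G + B) and g = G / (R + G + B). A black pixel is mapped
    to (0.0, 0.0) to avoid dividing by zero (an illustrative convention).
    """
    total = red + green + blue
    if total == 0:
        return 0.0, 0.0
    return red / total, green / total


# Scaling the intensity of a pixel leaves its chromaticities unchanged.
dark = rg_chromaticity(50, 100, 50)
bright = rg_chromaticity(100, 200, 100)
print(dark, bright)
```
]]></preformat>
      <p>Both pixels map to the same pair of values, which is why such channels are less sensitive to changes in light intensity than raw RGB.</p>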
      <p>As usual in bag-of-words approaches, we need to group features into
clusters in order to identify visually similar features. To do so, we applied
K-means clustering to a subset of S_feat_train, grouping these features
into N_c clusters. Recent studies demonstrated that large values of N_c
(the vocabulary size), namely several thousands, perform better for image
classification and retrieval ([7], [8], [10]). That is why, after some tests, we chose
N_c = 4000.</p>
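      <p>The clustering step can be sketched with plain Lloyd iterations, as below. This is a toy illustration on two-dimensional points and two clusters, not the implementation used for the run: the initialization, the number of iterations, and the names kmeans and nearest_centroid are assumptions, and real descriptors have many more dimensions.</p>
      <preformat><![CDATA[
```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid(feature, centroids):
    """Index of the centroid closest to the feature."""
    return min(range(len(centroids)),
               key=lambda k: squared_distance(feature, centroids[k]))


def kmeans(features, n_clusters, n_iterations=20, seed=0):
    """Plain Lloyd iterations; returns the list of centroids."""
    rng = random.Random(seed)
    centroids = [list(f) for f in rng.sample(features, n_clusters)]
    for _ in range(n_iterations):
        # Assignment step: group features by their nearest centroid.
        groups = [[] for _ in range(n_clusters)]
        for feature in features:
            groups[nearest_centroid(feature, centroids)].append(feature)
        # Update step: move each centroid to the mean of its group.
        for k, group in enumerate(groups):
            if group:  # an empty cluster keeps its previous centroid
                centroids[k] = [sum(col) / len(group) for col in zip(*group)]
    return centroids


# Toy data: two well-separated groups of 2-D "descriptors".
toy_features = [(0.0, 0.1), (0.1, 0.0), (0.0, 0.0),
                (5.0, 5.1), (5.1, 5.0), (5.0, 5.0)]
vocabulary = kmeans(toy_features, n_clusters=2)
print(sorted(vocabulary))
```
]]></preformat>
      <p>Each returned centroid plays the role of one visual word. With several thousand clusters and a large number of descriptors, clustering only a subset of the features keeps this step tractable.</p>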
      <p>Learning of concepts</p>
      <p>For the learning step of our approach, we generated for each image of S_train a
4000-dimensional histogram by counting the occurrences of each cluster, assigning
each extracted feature to its nearest centroid.</p>
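      <p>The histogram construction described above can be sketched as follows, here with a hypothetical three-word vocabulary instead of 4000 clusters. The function name bag_of_words_histogram and the toy values are illustrative assumptions.</p>
      <preformat><![CDATA[
```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def bag_of_words_histogram(image_features, centroids):
    """Count, for each centroid, how many features of one image are nearest to it."""
    histogram = [0] * len(centroids)
    for feature in image_features:
        nearest = min(range(len(centroids)),
                      key=lambda k: squared_distance(feature, centroids[k]))
        histogram[nearest] += 1
    return histogram


# Hypothetical 3-word vocabulary and the descriptors of one image.
centroids = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.0)]
image_features = [(0.1, 0.0), (0.9, 1.2), (1.1, 0.8), (3.9, 4.2)]
print(bag_of_words_histogram(image_features, centroids))  # [1, 2, 1]
```
]]></preformat>
      <p>The histogram has one bin per cluster whatever the number of descriptors in the image, so every image is represented by a vector of the same length.</p>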
      <p>Then, the model of each concept is learned using Support Vector
Machines (SVMs). We experimented with the one-against-all (OAA) approach: all
the positive and negative samples are used to learn each concept. In the
SVMs, we use the common Radial Basis Function (RBF) kernel defined by equation
(1).</p>
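      <p>
        <disp-formula id="eq-1">
          <label>(1)</label>
          <tex-math>K(x, y) = e^{-\frac{\|x - y\|^{2}}{2\sigma^{2}}}</tex-math>
        </disp-formula>
      </p>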
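      <p>As a small worked example of equation (1), the sketch below evaluates the kernel on two toy histograms. It assumes the usual Gaussian form exp(-||x - y||^2 / (2 sigma^2)); the function name rbf_kernel and the values are illustrative.</p>
      <preformat><![CDATA[
```python
import math


def rbf_kernel(x, y, sigma):
    """K(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    squared_norm = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_norm / (2.0 * sigma ** 2))


h1 = [3, 0, 1]
h2 = [1, 2, 1]
print(rbf_kernel(h1, h1, sigma=2.0))  # identical histograms give 1.0
print(rbf_kernel(h1, h2, sigma=2.0))  # exp(-1), about 0.368
```
]]></preformat>
      <p>Here the squared distance between the two histograms is 8 and 2 sigma^2 is also 8, so the kernel value is exp(-1). A larger sigma makes distant histograms look more similar, which is why this parameter has to be tuned.</p>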
      <p>To set the value of the parameter σ, we learned the models
for each label using half of S_train for training, namely S_train_train, and the other half of
S_train for validation, namely S_train_valid. These two subsets form a partition
of S_train and were selected randomly. The same subsets
S_train_train and S_train_valid were used for every concept.</p>
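      <p>The selection procedure can be sketched as a random split into two halves followed by a search over candidate values of sigma. To stay self-contained, the sketch replaces the SVM with a much simpler kernel scorer, so it illustrates only the split and the selection loop. The candidate values, the accuracy criterion, and all function names are assumptions.</p>
      <preformat><![CDATA[
```python
import math
import random


def rbf_kernel(x, y, sigma):
    squared_norm = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_norm / (2.0 * sigma ** 2))


def split_in_halves(samples, seed=0):
    """Randomly partition the samples into two halves (train, valid)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]


def kernel_score(x, train, sigma):
    """Stand-in scorer (NOT an SVM): mean similarity to positives minus negatives."""
    pos = [rbf_kernel(x, h, sigma) for h, label in train if label == 1]
    neg = [rbf_kernel(x, h, sigma) for h, label in train if label == 0]
    return sum(pos) / max(len(pos), 1) - sum(neg) / max(len(neg), 1)


def validation_accuracy(train, valid, sigma):
    """Fraction of validation samples whose predicted label is correct."""
    correct = sum((kernel_score(h, train, sigma) > 0) == (label == 1)
                  for h, label in valid)
    return correct / len(valid)


def select_sigma(train, valid, candidates):
    """Keep the candidate sigma with the best validation accuracy."""
    return max(candidates, key=lambda s: validation_accuracy(train, valid, s))


# Toy labelled histograms: (histogram, label), with label 1 for the concept.
samples = ([([4 + i % 2, 0, 1], 1) for i in range(8)] +
           [([0, 4 + i % 2, 1], 0) for i in range(8)])
# The split is made once and would be reused for every concept.
train_half, valid_half = split_in_halves(samples)
best_sigma = select_sigma(train_half, valid_half, candidates=[0.5, 1.0, 2.0, 4.0])
print(best_sigma)
```
]]></preformat>
      <p>Making the split once, outside the per-concept loop, mirrors the use of the same two subsets for every concept.</p>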
      <p>Annotation of images</p>
      <p>To generate the results, we extracted the same rgSIFT features from the
test set S_test and then generated the 4000-dimensional histograms (one per image
of S_test).</p>
      <p>We then applied recognition based on the SVM models learned during
the learning step, using a probabilistic estimation of recognition. We submitted
only one result.</p>
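      <p>The text does not say how the probabilistic estimate is computed. One common choice is Platt scaling, which maps the SVM decision value f to a probability with a sigmoid, P = 1 / (1 + exp(A f + B)). The sketch below assumes that choice. The parameters a and b are hypothetical values that would normally be fitted on held-out data, and the image names and decision values are invented for illustration.</p>
      <preformat><![CDATA[
```python
import math


def platt_probability(decision_value, a, b):
    """Sigmoid mapping P = 1 / (1 + exp(a * f + b)); a is negative in practice."""
    return 1.0 / (1.0 + math.exp(a * decision_value + b))


# Hypothetical decision values of one concept model for three test images.
decision_values = {"img_1": 1.5, "img_2": 0.0, "img_3": -2.0}
probabilities = {name: platt_probability(f, a=-2.0, b=0.0)
                 for name, f in decision_values.items()}
# Rank the test images by decreasing probability of showing the concept.
ranking = sorted(probabilities, key=probabilities.get, reverse=True)
print(ranking)
```
]]></preformat>
      <p>Probabilities of this kind give a per-concept confidence for each test image, which can be thresholded for a binary decision or used directly to rank images.</p>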
    </sec>
    <sec id="sec-3">
      <title>Submitted run and results</title>
      <p>We submitted one run based on the approach described above. The run has
the following identifier: LIG 1277153756343 clefResults.txt binary.txt.</p>
      <p>We focus first on the Mean Average Precision (MAP) obtained by our
approach. We ranked 30th out of 45 submissions, with a MAP value of
0.225. This value is 0.012 lower than the median value for these runs, 0.237.
We can mention, however, that for one concept, Visual Art, the MAP that we
obtained is the second-best result, 0.374, after the IJS result at 0.385. In this case,
we think that the color information integrated in the rgSIFT features is the reason
for this result.</p>
      <p>For the hierarchical recognition measure based on the example-based F-measure,
we ranked 27th out of 45. The value obtained is 0.477, and the median value
is 0.530; our results are therefore quite poor according to this measure. For the
hierarchical recognition measure based on the Ontology Score incorporating the
Flickr Context Similarity, we achieved the 20th rank, with a value of 0.530. Our
result is above the median value of 0.515 for the 45 visual-only runs considered.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>To summarize our work for the ImageCLEF 2010 Visual Concept Detection
and Annotation task, we applied a simple state-of-the-art method. Our work here
shows that such state-of-the-art techniques provide a basis for further
extensions.</p>
      <p>In the future, we will integrate the grouping of regions of interest to improve our
results.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgment</title>
      <p>This work was partly supported by: a) the French National Agency of Research
(ANR-06-MDCA-002), b) the Région Rhône-Alpes (LIMA project).</p>
    </sec>
    <sec id="sec-6">
      <title>References</title>
      <p>4. Thorsten Joachims. Text categorization with support vector machines: learning
with many relevant features. In Claire Nédellec and Céline Rouveirol, editors,
Proceedings of ECML-98, 10th European Conference on Machine Learning, number
1398, pages 137–142, Chemnitz, DE, 1998. Springer Verlag, Heidelberg, DE.</p>
      <p>5. Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond bags of features:
Spatial pyramid matching for recognizing natural scene categories. In CVPR ’06:
Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision
and Pattern Recognition, volume 2, pages 2169–2178. IEEE Computer Society,
October 2006.</p>
      <p>6. David G. Lowe. Object recognition from local scale-invariant features. In
International Conference on Computer Vision, 1999.</p>
      <p>7. David Nister and Henrik Stewenius. Scalable recognition with a vocabulary tree.
In CVPR, volume 2, pages 2161–2168, 2006.</p>
      <p>8. James Philbin, Ondrej Chum, Michael Isard, Josef Sivic, and Andrew Zisserman.
Object retrieval with large vocabularies and fast spatial matching. In Computer
Vision and Pattern Recognition, 2007. CVPR ’07. IEEE Conference on, pages 1–8,
2007.</p>
      <p>9. J. Sivic and A. Zisserman. Video Google: a text retrieval approach to object
matching in videos. In Computer Vision, 2003. Proceedings. Ninth IEEE International
Conference on, pages 1470–1477. IEEE, April 2003.</p>
      <p>10. Koen E. A. van de Sande, Theo Gevers, and Cees G. M. Snoek. Evaluation of color
descriptors for object and scene recognition. In IEEE Conference on Computer
Vision and Pattern Recognition, Anchorage, Alaska, USA, June 2008.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Gabriella</given-names>
            <surname>Csurka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Christopher R.</given-names>
            <surname>Dance</surname>
          </string-name>
          , Lixin Fan, Jutta Willamowski, and Cédric Bray.
          <article-title>Visual categorization with bags of keypoints</article-title>
          . In
          <source>Workshop on Statistical Learning in Computer Vision, ECCV</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>22</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>L.</given-names>
            <surname>Fei-Fei</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Perona</surname>
          </string-name>
          .
          <article-title>A Bayesian hierarchical model for learning natural scene categories</article-title>
          . In
          <source>CVPR '05: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05)</source>
          , volume
          <volume>2</volume>
          , pages
          <fpage>524</fpage>
          -
          <lpage>531</lpage>
          , Washington, DC, USA, June
          <year>2005</year>
          . IEEE Computer Society.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Yu G.</given-names>
            <surname>Jiang</surname>
          </string-name>
          , Chong W. Ngo, and
          <string-name>
            <given-names>Jun</given-names>
            <surname>Yang</surname>
          </string-name>
          .
          <article-title>Towards optimal bag-of-features for object categorization and semantic video retrieval</article-title>
          . In
          <source>CIVR '07: Proceedings of the 6th ACM international conference on Image and video retrieval</source>
          , pages
          <fpage>494</fpage>
          -
          <lpage>501</lpage>
          . ACM,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>