<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CEA LIST's participation to the Concept Annotation Task of ImageCLEF 2012</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Amel Znaidia</string-name>
          <email>amel.znaidia@cea.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aymen Shabou</string-name>
          <email>aymen.shabou@cea.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adrian Popescu</string-name>
          <email>adrian.popescu@cea.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hervé Le Borgne</string-name>
          <email>herve.le-borgne@cea.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CEA, LIST, Laboratory of Vision and Content Engineering</institution>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes our participation in the ImageCLEF 2012 Photo Annotation Task. We focus on how to use the tags associated with the images to improve annotation performance. We submitted one textual-only and three multimodal runs. Our first textual model [14] is based on the local soft coding of image tags over a dictionary of the most frequent tags. A second tag model is an adaptation of the TF-IDF model to the social space, used to compute the social relatedness of two tags [9]. For the fusion we used a trainable combiner, called stacked generalization [12], which uses the predictions of base classifiers to learn a new model. Results show that the combination of textual and visual features can improve annotation performance significantly. Our best run achieves 41.59% in terms of MAP, ranking us 3rd among the participating teams.</p>
      </abstract>
      <kwd-group>
        <kwd>Multimedia fusion</kwd>
        <kwd>Bag-of-Visual-Words</kwd>
        <kwd>Bag-of-Multimedia-Words</kwd>
        <kwd>image annotation</kwd>
        <kwd>classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The ImageCLEF 2012 Photo Annotation Task [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] is a multi-label classification
problem, with 15,000 images for training, 10,000 for testing and 94 concepts to detect.
Images are extracted from the MIR Flickr dataset [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and the Flickr user tags and/or EXIF
information are available for most photos.
      </p>
      <p>In our participation in the ImageCLEF Photo Annotation Task, we focus on how
to use the tags associated with the images to enhance annotation performance. We
propose three different models: a textual-only model and two multimodal models.</p>
      <p>This paper is organized as follows. In Section 2 we describe our visual features.
In Section 3 we give an overview of our textual model, which uses user tags. Then in
Section 4 we present in more detail the experiments we performed, the submitted runs and the
obtained results.</p>
      <p>Given an image, its visual features are built in three steps: (i) codebook learning, (ii) local feature
coding and (iii) pooling.</p>
    </sec>
    <sec id="sec-2">
      <title>1. Codebook Learning</title>
      <p>
        The codebook, whose entries are termed codewords, is a collection of basic patterns
used to reconstruct the input local features. A simple way to generate the codebook
is to use clustering-based methods such as K-means [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In the rest of this paper, we
denote by B = {b_k, b_k ∈ ℝ^d, k = 1, …, K} a codebook of K codeword vectors,
which is learned on a training subset of local features {x_i, x_i ∈ ℝ^d, i = 1, …, N}
extracted from the learning dataset.
      </p>
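The codebook-learning step can be sketched with a minimal K-means loop; the toy descriptor matrix, sizes and iteration count are made up for illustration, and a production system would rely on an optimized implementation over far more descriptors:

```python
import numpy as np

# Illustrative K-means codebook learning on toy "descriptors".
# Sizes are kept tiny on purpose; real pipelines cluster ~10^5 SIFTs.
def learn_codebook(X, K, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    B = X[rng.choice(len(X), size=K, replace=False)].copy()  # init from data
    for _ in range(n_iter):
        # assignment step: nearest codeword for every descriptor
        d2 = ((X[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        a = d2.argmin(1)
        # update step: move each codeword to the mean of its cluster
        for k in range(K):
            if np.any(a == k):
                B[k] = X[a == k].mean(0)
    return B

X = np.random.default_rng(1).normal(size=(500, 16))  # toy local features
B = learn_codebook(X, K=8)                           # toy codebook, K = 8
```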
    </sec>
    <sec id="sec-3">
      <title>2. Coding</title>
      <p>
        For each image, dense local descriptors (such as SIFTs [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]) are extracted and mapped
to codes. Following recent observations in scene classification, we chose to
implement locality-constrained coding based on local soft coding [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], because of its
effectiveness and robustness toward quantization errors. In [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], the authors propose
an efficient implementation of locality-constrained coding by restricting the
probabilistic soft coding approach [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] to only the M nearest codewords of a local
feature, i.e.,
z_{i,j} = exp(−β‖x_i − b_j‖²) / Σ_{k: b_k ∈ N_M(x_i)} exp(−β‖x_i − b_k‖²) if b_j ∈ N_M(x_i), and z_{i,j} = 0 otherwise,
where N_M(x_i) denotes the M-nearest neighborhood of x_i, under the Euclidean
distance for instance, and β is a softness parameter.
      </p>
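The locality-constrained soft coding step can be sketched as follows, assuming a toy codebook and descriptor; `beta` plays the role of the softness parameter (the visual runs later set it to 10, a milder value is used here purely for the example):

```python
import numpy as np

# Local soft coding restricted to the M nearest codewords: codes are
# nonzero only on the M-neighborhood and are normalized to sum to one.
def local_soft_coding(x, B, M=5, beta=1.0):
    d2 = ((B - x) ** 2).sum(1)          # squared distances to all codewords
    nn = np.argsort(d2)[:M]             # the M nearest codewords of x
    w = np.exp(-beta * d2[nn])
    z = np.zeros(len(B))
    z[nn] = w / w.sum()                 # normalized soft assignment
    return z

rng = np.random.default_rng(0)
B = rng.normal(size=(32, 8))            # toy codebook of K = 32 codewords
x = rng.normal(size=8)                  # one toy local descriptor
z = local_soft_coding(x, B, M=5)
```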
    </sec>
    <sec id="sec-4">
      <title>3. Pooling</title>
      <p>Given the coding coefficients of all local features within one image, a pooling
operation has to be performed to obtain a compact signature h, while preserving
important information and discarding irrelevant details. This operation can be formulated
as follows:</p>
      <p>
        h_j = g({z_{i,j}, i ∈ {1, …, N}}), ∀ j ∈ {1, …, K},
with g a pooling function such as the average, the sum or the maximum.
Sum-pooling is the sum of the coding coefficients obtained on local features,
while average-pooling is its normalized form. Both have usually been
considered in the original BoW model. Recent works [
        <xref ref-type="bibr" rid="ref2 ref6">2, 6</xref>
        ] show, both theoretically and
empirically, that max-pooling is best suited to the recognition task. Max-pooling is
obtained by selecting the maximum coding coefficient (or codeword response) over
local features for each codeword.
      </p>
      <p>
        Furthermore, since the classic BoVW is an orderless signature that disregards the
location of the visual words in the image, spatial pyramid matching (SPM) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
is an interesting way to incorporate some global spatial contextual information into
the signature. The image is divided into P different regions and a pooling is
conducted in each of them. The final signature is then obtained by concatenating
all the region-relative signatures h_{R_i}, i.e.,
h = [h_{R_1}^T, h_{R_2}^T, …, h_{R_P}^T]^T.
      </p>
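Max-pooling combined with a spatial pyramid can be sketched as follows; the toy data, cell layout and two-level pyramid are illustrative only (the submitted runs use a three-level pyramid):

```python
import numpy as np

# Max-pooling over a toy spatial pyramid: pooling is done per region and
# the region signatures are concatenated into the final signature h.
def spm_max_pool(codes, xy, levels=(1, 2)):
    sigs = []
    for L in levels:
        for i in range(L):
            for j in range(L):
                # keep codes whose normalized (x, y) falls in cell (i, j)
                inside = ((xy[:, 0] * L).astype(int) == i) & \
                         ((xy[:, 1] * L).astype(int) == j)
                cell = codes[inside]
                sigs.append(cell.max(0) if len(cell) else np.zeros(codes.shape[1]))
    return np.concatenate(sigs)

rng = np.random.default_rng(0)
codes = rng.random((50, 16))   # toy codes: 50 local features, K = 16
xy = rng.random((50, 2))       # normalized feature positions in [0, 1)
h = spm_max_pool(codes, xy)    # (1 + 4) regions, 16 codewords each
```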
    </sec>
    <sec id="sec-5">
      <title>Bag-of-Multimedia-Words model</title>
      <p>The Bag-of-Multimedia-Words model is an early fusion method that combines textual and
visual features. Since the late fusion method presented in Section 2.1 gives better results,
we do not present this method in this paper and refer to [15] for further details.</p>
      <sec id="sec-5-1">
        <title>Textual Features</title>
        <p>
          It is commonly accepted [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] that visual features alone do not convey a high-level
semantic description of image content. In order to build BoW-based tag-signatures that are robust
toward quantization errors, we rely on the locality-constrained coding method that has
proved effective for visual features when paired with max-pooling. This model is
detailed in [14]. The coding step of a given tag over a codebook requires a tag-similarity
measure.
        </p>
        <p>The two similarity measures that we detail below capture complementary facets of
tags, and their combination improves the quality of predicted tags.</p>
        <p>– Hierarchical similarity:</p>
        <p>WordNet concepts are structured as synsets (sets of synonyms) that are arranged in
a hierarchy whose main structural axis is defined by conceptual inheritance. The
Wu-Palmer measure [13] derives the similarity between two tags from their distance in the
WordNet hierarchy.</p>
        <p>Since a tag can belong to more than one synset in WordNet (i.e., can have more
than one meaning), we opt to determine the semantic relationship between two tags
t1 and t2 as the maximum Wu-Palmer similarity over the synsets syns(t1) and syns(t2)
that include them:
sim_hierarchical(t1, t2) = max{ sim_wup(s1, s2), (s1, s2) ∈ syns(t1) × syns(t2) }, (4)
where sim_wup is the Wu-Palmer similarity.</p>
        <p>– Contextual similarity:</p>
        <p>
          In [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], an adaptation of the TF-IDF model to the social space is proposed in order
to compute the social relatedness of two tags.
        </p>
        <p>Let S be the matrix of size N × K defined by:</p>
        <p>S(i, j) = users(t_i, t_j) · log(users_collection / users_collection(t_j)), (5)
where t_i is the target tag, t_j is an element of the codebook, users(t_i, t_j) is the
number of distinct users who associate the tag t_i with the tag t_j among the top results
returned by the Flickr API for t_i, users_collection(t_j) is the number of distinct users
from a pre-fetched subset of Flickr users that have tagged photos with tag t_j,
N is the number of unique tags associated to photos of the dataset and K is the size
of the codebook.</p>
        <p>
          Relying on this matrix, a Flickr model for a given tag t_i is proposed in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] as the
following vector of weights:
w_i = [w_{i,1}, w_{i,2}, …, w_{i,K}]^T, (6)
with w_{i,j} the normalized social weight defined by:
w_{i,j} = S(i, j) / max{S(i, k), k = 1, …, K}. (7)
        </p>
        <p>Thereby, given two tag-Flickr models w_i and w_j, we compute the contextual
similarity between their related tags t_i and t_j using the cosine similarity:
sim_contextual(t_i, t_j) = (w_i^T w_j) / (‖w_i‖ ‖w_j‖). (8)</p>
        <p>Coding/pooling: once the similarity measures are calculated, we perform local soft
coding for each t_i in order to achieve the assignment step. Consequently, a tag is
mapped to only its M nearest tags under a similarity measure:
z_{i,j} = sim(t_i, b_j) if b_j ∈ N_M(t_i), and z_{i,j} = 0 otherwise, (9)
where N_M(t_i) denotes the M nearest neighbors of t_i under the hierarchical or the
contextual similarity, denoted by sim(t_i, b_j). The locality assumption in the tag space
induces sparse codes while reducing the reconstruction errors, mainly in terms of
semantic reconstruction.</p>
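The contextual-similarity pipeline above can be condensed into a short sketch; the co-tagging counts, user totals and matrix sizes below are invented for illustration:

```python
import numpy as np

# Compact sketch of the social-relatedness computation: S mixes user
# co-occurrence counts with an IDF-like weight (Eq. (5)), rows are
# normalized into tag models (Eqs. (6)-(7)), and tags are compared by
# cosine similarity (Eq. (8)). All counts here are toy values.
users = np.array([[0.0, 12.0, 3.0],
                  [12.0, 0.0, 7.0],
                  [3.0, 7.0, 0.0]])                  # users(t_i, t_j), toy
idf = np.log(1000.0 / np.array([40.0, 25.0, 60.0]))  # collection terms, toy
S = users * idf                                      # Eq. (5)
W = S / S.max(axis=1, keepdims=True)                 # Eq. (7): row normalization

def sim_contextual(i, j):
    # Eq. (8): cosine similarity between the tag models w_i and w_j
    return W[i] @ W[j] / (np.linalg.norm(W[i]) * np.linalg.norm(W[j]))
```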
        <p>Given the tag-related codes within one image, a max-pooling is performed in order
to obtain the final tag-signature vector. In our case, separate signatures are generated
for each similarity measure.</p>
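The hierarchical tag similarity can be sketched on a toy taxonomy; the `parent` map, tag names and depths below are all invented for illustration and merely stand in for WordNet synsets:

```python
# Toy Wu-Palmer similarity: 2 * depth(lcs) / (depth(s1) + depth(s2)),
# computed over a hand-made parent map standing in for WordNet.
parent = {"entity": None, "animal": "entity", "dog": "animal",
          "cat": "animal", "artifact": "entity", "car": "artifact"}

def ancestors(s):
    out = []
    while s is not None:
        out.append(s)
        s = parent[s]
    return out

def depth(s):
    return len(ancestors(s))

def sim_wup(s1, s2):
    # lowest common subsumer: first ancestor of s1 that also subsumes s2
    lcs = next(a for a in ancestors(s1) if a in ancestors(s2))
    return 2.0 * depth(lcs) / (depth(s1) + depth(s2))

def sim_hierarchical(syns1, syns2):
    # maximum Wu-Palmer similarity over all synset pairs
    return max(sim_wup(a, b) for a in syns1 for b in syns2)

s = sim_hierarchical(["dog"], ["cat", "car"])
```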
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Classifier Fusion</title>
      <p>A linear SVM classifier is used for the features obtained from each modality. To
combine classifiers learned on different modalities and/or features, we use a trainable
combiner, called stacked generalization, originally introduced in [12]. It is an ensemble
learning technique, which aims to increase the performance of individual classifiers by
combining them in a hierarchical architecture. The key idea is to learn a meta-level
(level-1) classifier based on the outputs of base-level (level-0) classifiers, estimated via
cross-validation. An example of combination of one visual and two textual classifiers is
presented in Figure 1.</p>
      <p>Given a training dataset D = {(x_i^F, y_i), i = 1, …, n}, where x_i^F is the F-feature
vector among the visual (V-feature), the contextual tag (C-feature) and the
hierarchical tag (S-feature) features, and y_i is the associated vector of labels, the algorithm operates as
follows:
1. A K-fold cross-validation process randomly splits D into disjoint parts of almost
equal size D_1, …, D_K;</p>
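A minimal sketch of this stacked generalization scheme follows; the centroid-based level-0 learner and the gradient-descent logistic level-1 learner are simple stand-ins for the linear SVMs actually used, and the two-modality toy data is invented:

```python
import numpy as np

# Stacked generalization (Wolpert, 1992): level-0 classifiers are trained
# per modality; their out-of-fold predictions, estimated via K-fold
# cross-validation, become the features of a level-1 (meta) classifier.

def fit_centroids(X, y):
    """Level-0 learner: one centroid per class (illustrative stand-in)."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def score(model, X):
    """Signed score: positive means closer to the class-1 centroid."""
    d0 = np.linalg.norm(X - model[0], axis=1)
    d1 = np.linalg.norm(X - model[1], axis=1)
    return d0 - d1

def level1_features(X_list, y, n_folds=5, seed=0):
    """Out-of-fold level-0 scores, one column per modality."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    meta = np.zeros((len(y), len(X_list)))
    for m, X in enumerate(X_list):
        for k, test in enumerate(folds):
            train = np.concatenate([f for j, f in enumerate(folds) if j != k])
            meta[test, m] = score(fit_centroids(X[train], y[train]), X[test])
    return meta

def fit_meta(Z, y, lr=0.1, steps=200):
    """Level-1 learner: logistic regression trained by gradient descent."""
    Zb = np.hstack([Z, np.ones((len(Z), 1))])
    w = np.zeros(Zb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Zb @ w))
        w -= lr * Zb.T @ (p - y) / len(y)
    return w

# Toy two-modality data: each modality is weakly informative on its own.
rng = np.random.default_rng(1)
n = 400
y = rng.integers(0, 2, n)
Xv = rng.normal(0, 1, (n, 8)); Xv[:, 0] += 1.5 * y   # "visual" features
Xt = rng.normal(0, 1, (n, 8)); Xt[:, 0] += 1.5 * y   # "textual" features

Z = level1_features([Xv, Xt], y)
w = fit_meta(Z, y.astype(float))
Zb = np.hstack([Z, np.ones((n, 1))])
acc = ((Zb @ w > 0).astype(int) == y).mean()
```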
      <sec id="sec-6-1">
        <title>Experiments</title>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Submitted runs</title>
      <p>We submitted four runs to the campaign, allowing relevant comparisons between
methods:
– textual tagflickr tagwordnet uses only the textual features described in Section 3.</p>
      <p>The codebook size is fixed to 2,500 (resp. 5,134) for the hierarchical (resp.
contextual) similarity. The optimal size of the codeword neighborhood for the soft
assignment has been estimated by cross-validation on the training dataset,
leading to 5 (resp. 50) neighboring codewords for the hierarchical (resp.
contextual) tag-distance measures. A one-versus-all linear-kernel
Support Vector Machine (SVM) classifier is used for each measure. The classifiers are combined
by stacked generalization using 10-fold cross-validation.
– multimedia visualrootsift tagflickr tagwordnet is a combination of the
previous textual run and the Bag-of-Visual-Words model detailed in Section 2.1.
The pipeline is as follows:</p>
      <p>Local visual descriptors: dense SIFTs of size 128 are extracted on a
regular spatial grid at a single scale. The patch size is fixed to 16 × 16 pixels
and the step size for dense sampling to 6 pixels;
Codebook: a visual codebook of size 4,000 is created using the K-means
clustering method on a randomly selected subset of SIFTs from the training dataset
(about 10^5 SIFTs).</p>
      <p>Coding/pooling: for coding the local visual SIFT descriptors, we also fix the
patch size to 16 × 16 pixels and the step size for dense sampling to 6 pixels.
Then, for the extracted visual descriptors associated with one image, we consider
a neighborhood of size 5 in the visual feature space for local soft coding, and
the softness parameter is set to 10. The max-pooling operation is performed
to aggregate the obtained codes, and a spatial pyramid decomposition into 3
levels (1 × 1, 2 × 2, 3 × 3) is adopted for the visual signature.</p>
      <p>
        A one-versus-all linear-kernel Support Vector Machine (SVM) classifier
is used, since it has shown good performance in the scene categorization task
when paired with the max-pooling operation on local features [
        <xref ref-type="bibr" rid="ref6">11, 6</xref>
        ].
      </p>
      <p>
        Classifier fusion: base classifiers are trained on the considered modalities
(visual, contextual and hierarchical) and combined by the stacked generalization
approach using 10-fold cross-validation on the training set, as shown in Figure 1.
– multimedia visualcsift tagflickr tagwordnet is the same run as the previous one
except for the SIFT variant: in this run, we use Colored SIFT (CSIFT) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
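As a small sanity check of the sampling parameters above (16 × 16 patches, step 6), the dense grid positions for a toy image size can be computed as:

```python
# Toy computation of the dense sampling grid implied by the stated
# parameters: 16x16-pixel patches extracted every 6 pixels. Descriptor
# extraction itself is omitted; only patch positions are produced.
def dense_grid(width, height, patch=16, step=6):
    return [(x, y) for y in range(0, height - patch + 1, step)
                   for x in range(0, width - patch + 1, step)]

g = dense_grid(64, 48)   # toy image size, 9 x 6 patch positions
```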
    </sec>
    <sec id="sec-8">
      <title>Results and Discussion of Submission Outcomes</title>
      <p>The official results of our runs are given in Table 1. Among the multimodal runs (2
and 3), we notice that using Colored SIFT works better than conventional SIFT.
The score of the multimodal run (run 4) shows the competitive performance of the
Bag-of-Multimedia-Words, ensuring a trade-off between classification accuracy and
computation cost. This model is easier to scale to large datasets since it achieves
performance comparable to the other multimodal runs while using only a
feature vector of size 512.</p>
      <p>The first, purely textual submission is the combination of the semantic (hierarchical) and the
contextual classifiers detailed in Section 3. Its performance was almost identical to the best
textual submission of the LIRIS ECL group (the best MAP in the textual modality), as
shown in Table 2.</p>
      <sec id="sec-8-1">
        <title>Acknowledgment</title>
        <p>This work is supported by grants from DIGITEO and Région Île-de-France, and has
been partially funded by I2S in the context of the Polinum project. We acknowledge
support from the French ANR (Agence Nationale de la Recherche) via the YOJI
(ANR-09-CORD-104) and PERIPLUS (ANR-10-CORD-026) projects.</p>
        <p>11. Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., Gong, Y.: Locality-constrained linear coding
for image classification. In: CVPR (2010)
12. Wolpert, D.H.: Stacked generalization. Neural Networks 5, 241-259 (1992)
13. Wu, Z., Palmer, M.: Verb semantics and lexical selection. In: Annual Meeting of the
Association for Computational Linguistics. pp. 133-138 (1994)
14. Znaidia, A., Shabou, A., Popescu, A., Le Borgne, H., Hudelot, C.: Multimodal Feature Generation
Framework for Semantic Image Classification. In: ACM International Conference on
Multimedia Retrieval (ICMR), Hong Kong (Jun 2012)
15. Znaidia, A., Shabou, A., Le Borgne, H., Hudelot, C., Paragios, N.: Bag-of-Multimedia-Words
for Image Classification. In: International Conference on Pattern Recognition
(ICPR), Tsukuba, Japan (Nov 2012)</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Abdel-Hakim</surname>
            ,
            <given-names>A.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Farag</surname>
            ,
            <given-names>A.A.</given-names>
          </string-name>
          :
          <article-title>CSIFT: A SIFT descriptor with color invariant characteristics</article-title>
          .
          <source>In: Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision</source>
          and Pattern Recognition - Volume
          <volume>2</volume>
          . pp.
          <fpage>1978</fpage>
          -
          <lpage>1983</lpage>
          . CVPR '06, IEEE Computer Society, Washington, DC, USA (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Boureau</surname>
            ,
            <given-names>Y.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bach</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>LeCun</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ponce</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Learning mid-level features for recognition</article-title>
          .
          <source>In: CVPR</source>
          . pp.
          <fpage>2559</fpage>
          -
          <lpage>2566</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>van Gemert</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Veenman</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smeulders</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Geusebroek</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Visual word ambiguity</article-title>
          . PAMI pp.
          <fpage>1271</fpage>
          -
          <lpage>1283</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Huiskes</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lew</surname>
            ,
            <given-names>M.S.:</given-names>
          </string-name>
          <article-title>The MIR Flickr retrieval evaluation</article-title>
          .
          <source>In: MIR '08: Proceedings of the 2008 ACM International Conference on Multimedia Information Retrieval. ACM</source>
          , New York, NY, USA (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Lazebnik</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmid</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ponce</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories</article-title>
          .
          <source>In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          . pp.
          <fpage>2169</fpage>
          -
          <lpage>2178</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>In Defense of Soft-assignment Coding</article-title>
          . In: ICCV (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Lowe</surname>
            ,
            <given-names>D.G.</given-names>
          </string-name>
          :
          <article-title>Distinctive image features from scale-invariant keypoints</article-title>
          .
          <source>International journal of computer vision</source>
          (IJCV)
          <volume>60</volume>
          (
          <issue>2</issue>
          ),
          <fpage>91</fpage>
          -
          <lpage>110</lpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Nowak</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nagel</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liebetrau</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>The CLEF 2011 photo annotation and concept-based retrieval tasks</article-title>
          .
          <source>In: CLEF 2011 working notes</source>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Popescu</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grefenstette</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Social media driven image retrieval</article-title>
          .
          <source>In: ACM International Conference on Multimedia Retrieval (ICMR)</source>
          . pp.
          <volume>33</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>33</lpage>
          :
          <issue>8</issue>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Smeulders</surname>
            ,
            <given-names>A.W.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Worring</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Santini</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jain</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Content-based image retrieval at the end of the early years</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 22</source>
          ,
          <fpage>1349</fpage>
          -
          <lpage>1380</lpage>
          (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>