<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>DBRIS at ImageCLEF 2012 Photo Annotation Task</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Computer Science Heinrich-Heine-University of Duesseldorf D-40204 Duesseldorf</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2012</year>
      </pub-date>
      <abstract>
        <p>For our participation in the ImageCLEF 2012 Photo Annotation Task we develop an image annotation system and test several combinations of SIFT-based descriptors with bow-based image representations. Our focus is on the comparison of two image representation types which include spatial layout: spatial pyramids and visual phrases. The experiments on the training and test set show that image representations based on visual phrases significantly outperform spatial pyramids.</p>
      </abstract>
      <kwd-group>
        <kwd>SIFT</kwd>
        <kwd>bow</kwd>
        <kwd>spatial pyramids</kwd>
        <kwd>visual phrases</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        This paper presents our participation in the ImageCLEF 2012 Photo
Annotation Task. The ImageCLEF 2012 Photo Annotation Task is a multi-label image
classification challenge: given a training set of images with underlying concepts,
the aim is to detect the presence of these concepts for each image of a test set
using an annotation system based on visual or textual features or a combination
of both. Detailed information on the task, the training and test sets of images,
the concepts and the evaluation measures can be found in the overview paper [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
Our automatic image annotation system is based solely on visual features. We focus
on the comparison of two image representations which incorporate spatial layout: the
spatial pyramid [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and visual phrases [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The spatial pyramid is very popular
and often used, especially in the context of scene categorization, whereas visual
phrases have received comparatively little attention in the literature.
      </p>
      <p>The remainder of this paper is organized as follows: in section 2 we describe
the architecture and the technical details of our image annotation system, in
section 3 we present the evaluation on the training and the test set and discuss
the results, and we end with a conclusion in section 4.</p>
    </sec>
    <sec id="sec-2">
      <title>Architecture of the DBRIS image annotation system</title>
      <p>
        The architecture of our automatic image annotation system, together with the
methods used in each step, is illustrated in figure 1. To obtain the image
representation of the training and test images, local features are extracted by applying
the Harris-Laplace detector and the SIFT [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] descriptor in different color variants.
The extracted local features are then summarized into the bag-of-words (bow)
image representation as well as the image representations spatial pyramid [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and
visual phrases [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. For the classifier training and classification steps we use a
KNN-like classifier with one representative per concept. In the following
subsections we describe each step in detail.
      </p>
      <p>Fig. 1. Architecture of the DBRIS image annotation system: feature extraction
(Harris-Laplace detector with the descriptors C-SIFT, rgSIFT, OpponentSIFT,
RGB-SIFT and SIFT), image representation (BOW, spatial pyramids, visual phrases),
and classifier training and classification (KNN-like with one representative per
concept).</p>
      <p>
        For the choice of local features we refer to the evaluation of color descriptors
presented in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. We adopt the descriptors C-SIFT, rgSIFT, OpponentSIFT,
RGB-SIFT and SIFT, as they were shown to perform best on the evaluation's underlying
image benchmark, the PASCAL VOC Challenge 2007. To extract these features,
with the Harris-Laplace point sampling strategy as the basis, we use the color
descriptor software [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        For each of the descriptors, we quantize its descriptor space (225,000 descriptors)
into 500 and 5,000 visual words using K-Means. The visual words serve as the basis
for the bow, spatial pyramid and visual phrases representations. The
representations are created in the common way using hard assignment of image features
to visual words. We use the spatial pyramid constructions 1x3, 1x1+1x3 and
1x1+2x2+4x4, each in a weighted and an unweighted version. To construct visual phrases
we follow [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and define a visual phrase as a pair of adjacent visual words. Assume
an image contains the keypoints kp_a = {(x_a, y_a), scale_a, orient_a, descr_a} and
kp_b = {(x_b, y_b), scale_b, orient_b, descr_b} with their assigned visual words vw_i and
vw_j, respectively. Then the image contains the visual phrase vp_ij = {vw_i, vw_j}
if the Euclidean distance of the keypoints' locations in the image satisfies
EuclideanDistance((x_a, y_a), (x_b, y_b)) &lt; alpha * max(scale_a, scale_b)
(1)
We set alpha = 3. Analogously to the bow representation, an image is represented by
a histogram of visual phrases. Furthermore, we create a representation
combining bow with visual phrases, weighting the bow histogram with 0.25 and the visual
phrases histogram with 0.75. Table 1 summarizes all image representations we used
in combination with each feature, together with their number of dimensions.
      </p>
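      <p>As an illustration, the adjacency criterion of equation (1) and the resulting
phrase histogram can be sketched in a few lines of Python (our own minimal sketch,
not the authors' code; the Keypoint container and the function name
build_visual_phrases are hypothetical):</p>

```python
import math
from collections import Counter
from dataclasses import dataclass

@dataclass
class Keypoint:
    x: float
    y: float
    scale: float
    word: int  # index of the visual word assigned to this keypoint

def build_visual_phrases(keypoints, alpha=3.0):
    """Histogram of visual phrases: unordered pairs of adjacent visual words.

    Two keypoints are adjacent when the Euclidean distance of their locations
    is below alpha * max(scale_a, scale_b); the paper sets alpha = 3.
    """
    hist = Counter()
    for i, a in enumerate(keypoints):
        for b in keypoints[i + 1:]:
            if math.hypot(a.x - b.x, a.y - b.y) < alpha * max(a.scale, b.scale):
                # store the unordered word pair in canonical (min, max) order
                hist[(min(a.word, b.word), max(a.word, b.word))] += 1
    return hist
```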
      <p>Table 1. Image representations and their number of dimensions:
bow: 5,000; sp 1x3: 15,000; sp 1x1+1x3: 20,000; sp 1x1+2x2+4x4: 105,000;
sp 1x1+2x2+4x4 w: 105,000; vp: 125,250; bow &amp; vp: 130,250.</p>
      <p>We use a KNN-like classifier in which a concept is represented not by the set of
its corresponding images but by a single representative. The representative
of a concept is obtained by averaging the image representations of all images
belonging to the concept. To classify a test image, the similarities between the
test image and the representatives of all concepts are determined, using
histogram intersection as the similarity function. To obtain binary decisions on
concept membership, we set an image-dependent threshold: a concept is
present in the test image if the similarity between the test image and the concept
is equal to or greater than 0.75 times the maximum of the similarities of the test
image to all concepts.</p>
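      <p>This classifier can be sketched as follows (a minimal numpy sketch under our
assumptions; the function names are ours, and the histograms are assumed to be
stored row-wise with one binary label column per concept):</p>

```python
import numpy as np

def concept_representatives(histograms, labels):
    """One representative per concept: the mean of the image histograms
    of all training images annotated with that concept.

    histograms: (n_images, n_dims) array; labels: (n_images, n_concepts) 0/1.
    """
    n_concepts = labels.shape[1]
    reps = np.zeros((n_concepts, histograms.shape[1]))
    for c in range(n_concepts):
        reps[c] = histograms[labels[:, c] == 1].mean(axis=0)
    return reps

def classify(test_hist, reps, ratio=0.75):
    """Histogram intersection against every representative, then an
    image-dependent threshold of ratio * (maximum similarity)."""
    sims = np.minimum(test_hist, reps).sum(axis=1)
    return sims >= ratio * sims.max()
```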
    </sec>
    <sec id="sec-3">
      <title>Evaluation</title>
      <p>In the following we describe two evaluations: first we present the results of our
experiments on the training set; second we discuss the evaluation of the
five runs submitted to ImageCLEF.</p>
      <p>Training and classification on the training set
To train and evaluate the DBRIS image annotation system, we split the training
set of images into two disjoint parts (of size 7500 each), such that both parts contain
an almost equal number of images for each concept. For each training and test pair we
train the classifier on the one part and then use this classifier to classify the other
part of the images. The evaluation results are then averaged over the two training
and test pairs.</p>
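      <p>The paper does not specify how the two balanced halves are built; one greedy
way to approximate "an almost equal number of images per concept" is sketched
below (our own illustration, with hypothetical names; labels maps each image id
to its set of concept ids):</p>

```python
import random
from collections import defaultdict

def two_part_split(image_ids, labels, seed=0):
    """Greedily assign each image to the part where its concepts
    are currently less represented, breaking ties by part size."""
    rng = random.Random(seed)
    ids = list(image_ids)
    rng.shuffle(ids)
    parts = ([], [])
    counts = (defaultdict(int), defaultdict(int))
    for img in ids:
        # concept load of this image's labels in each part so far
        load = [sum(counts[p][c] for c in labels[img]) for p in (0, 1)]
        p = 0 if (load[0], len(parts[0])) <= (load[1], len(parts[1])) else 1
        parts[p].append(img)
        for c in labels[img]:
            counts[p][c] += 1
    return parts
```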
      <p>
        In the first experiment we train one image annotation system for each of
the 35 combinations of descriptors and image representations. Figure 2 shows
the results in terms of MiAP values (averaged over all concepts). Comparing
the systems with regard to the descriptors, we observe almost the same
performance behaviour as shown in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Except for rgSIFT combined with the
visual phrases based image representations, C-SIFT outperforms all the other
descriptors in every image representation. The worst results are obtained by the
SIFT descriptor. Considering the image representations, we can see that
the image representations based on visual phrases perform significantly better
than the other ones for all descriptors. In the case of the descriptors C-SIFT,
rgSIFT and OpponentSIFT, the representations vp and bow &amp; vp achieve similar
values. When using RGB-SIFT and SIFT, the bow &amp; vp representation is the
better choice of the two. Bow and the representations based on the spatial pyramid
differ only slightly from each other; which one to choose depends on the descriptor
used.
      </p>
      <p>Fig. 2. MiAP results on the training set for each combination of descriptor
(C-SIFT, rgSIFT, RGB-SIFT, OpponentSIFT, SIFT) and image representation.</p>
      <p>In the next experiment we join all descriptors into one annotation system, i.e.
for each of the seven image representations we train an image annotation system
whose classifier consists of five classifiers corresponding to the five descriptors. At
the classification step, the similarities between the test image and the concept
representatives obtained in each of the five classifiers are averaged over these
five classifiers. The binary decisions on concept membership are
calculated in the same way as described in section 2.3. Furthermore, we create a
configuration which combines the five descriptors with the image representations
sp 1x1+1x3, sp 1x1+2x2+4x4 w, bow &amp; vp and vp. The annotation system with
this configuration consists of 20 classifiers (5 descriptors x 4 representations) and
is called combined. The MiAP values for all configurations are shown in figure 3.
The performance of the image representations behaves similarly to that of
the C-SIFT descriptor in figure 2, but the MiAP values are lower, comparable
to the rgSIFT results. Combining several representations improves the
performance of bow and the spatial pyramids, but the image representations
based on visual phrases still achieve the best results.</p>
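      <p>The averaging over the per-descriptor classifiers can be sketched as follows
(our own minimal illustration; sims_per_descriptor is assumed to hold one
concept-similarity vector per descriptor, as produced by the classifier of
section 2.3):</p>

```python
import numpy as np

def ensemble_classify(sims_per_descriptor, ratio=0.75):
    """Average the similarity vectors of the individual classifiers,
    then apply the image-dependent threshold of section 2.3:
    a concept is predicted present when its averaged similarity is at
    least ratio * (maximum averaged similarity)."""
    sims = np.mean(sims_per_descriptor, axis=0)
    return sims >= ratio * sims.max()
```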
      <p>For the classification of the test set, we train the classifier on the whole training
set. For the five submission runs we choose five of the image representations
from the second experiment presented in section 3.1: sp 1x1+2x2+4x4 w as run
DBRIS 1, combined as DBRIS 2, sp 1x1+1x3 as DBRIS 3, vp as DBRIS 4 and
bow &amp; vp as DBRIS 5. Figure 4 and figure 5 present the results of the
configurations for each concept (MiAP values) and as averages (MiAP, MnAP, GMiAP,
GMnAP) over all concepts. Best values, or values which are significantly
better than others within a certain concept, are highlighted in green. To evaluate
the image representations as a whole, we first consider the averages MiAP,
MnAP, GMiAP and GMnAP in figure 5. The image representations vp and bow &amp;
vp yield the best values again, followed by combined, sp 1x1+2x2+4x4 w and sp
1x1+1x3. These results reflect the evaluation in figure 3. When we consider the
concepts with their concept categories, we can see that there are some concept
categories where the image representations based on visual phrases dominate:
quantity, age, gender and view. These
observations were also made in the experiments on the training set. Other concept
categories which yield the best results with the visual phrases on the training set are
relation and setting. A possible reason for the success of the visual phrases in
these concept categories is that these concepts contain many pictures of
persons: visual phrases can capture human features like eyes and mouth better
than the spatial pyramids because they work on a finer level. The success of
the visual phrases in the concept category water cannot be confirmed by the
experiments on the training set. As visual phrases are popular for object
detection tasks, it is surprising that these image representations fail in the concept
category fauna, where the best results are achieved with
the image representations based on spatial pyramids. Spatial pyramids are also
successful in sentiment and transport.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>Finally, we summarize the experience gathered in the
experiments. The best performing descriptor, C-SIFT in our experiments,
yields better performance than joining all descriptors together. Regarding the choice
of image representation, the representations based on visual phrases
significantly outperform the spatial pyramids and the bow representation. The
evaluation shows that visual phrases are especially appropriate for concepts dealing
with persons. Although visual phrases are mostly used in object detection tasks,
they are also successful in scene categorization.</p>
      <p>Fig. 4. Results of the submitted runs (per-concept MiAP values)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Thomee</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Popescu</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Overview of the ImageCLEF 2012 Flickr Photo Annotation and Retrieval Task</article-title>
          .
          <source>CLEF 2012 working notes</source>
          , Rome, Italy (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>van de Sande</surname>
            ,
            <given-names>K. E. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gevers</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Snoek</surname>
            ,
            <given-names>C. G. M.</given-names>
          </string-name>
          :
          <article-title>Evaluating Color Descriptors for Object and Scene Recognition</article-title>
          .
          <source>In: IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          , vol.
          <volume>32</volume>
          (
          <issue>9</issue>
          ), pp.
          <fpage>1582</fpage>
          -
          <lpage>1596</lpage>
          , (
          <year>2010</year>
          ) http://www.colordescriptors.com
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>Qing-Fang</given-names>
          </string-name>
          and
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>Wen</given-names>
          </string-name>
          :
          <article-title>Constructing visual phrases for effective and efficient object-based image retrieval</article-title>
          .
          <source>In: ACM Trans. Multimedia Comput. Commun. Appl.</source>
          , vol.
          <volume>5</volume>
          , no.
          <issue>1</issue>
          , art.
          <elocation-id>7</elocation-id>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>S.</given-names>
            <surname>Lazebnik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schmid</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Ponce</surname>
          </string-name>
          :
          <article-title>Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories</article-title>
          . In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, New York (
          <year>2006</year>
          ), vol.
          <volume>2</volume>
          , pp.
          <fpage>2169</fpage>
          -
          <lpage>2178</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Lowe</surname>
            ,
            <given-names>David G.</given-names>
          </string-name>
          :
          <article-title>Distinctive Image Features from Scale-Invariant Keypoints</article-title>
          .
          <source>In: International Journal of Computer Vision</source>
          , vol.
          <volume>60</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>91</fpage>
          -
          <lpage>110</lpage>
          ,
          <year>2004</year>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>