DBRIS at ImageCLEF 2012 Photo Annotation Task

Magdalena Rischka and Stefan Conrad
Institute of Computer Science
Heinrich-Heine-University of Duesseldorf
D-40204 Duesseldorf, Germany
{rischka,conrad}@cs.uni-duesseldorf.de

Abstract. For our participation in the ImageCLEF 2012 Photo Annotation Task we develop an image annotation system and test several combinations of SIFT-based descriptors with bow-based image representations. Our focus is on the comparison of two image representation types which include spatial layout: spatial pyramids and visual phrases. The experiments on the training and test set show that image representations based on visual phrases significantly outperform spatial pyramids.

Keywords: SIFT, bow, spatial pyramids, visual phrases

1 Introduction

This paper presents our participation in the ImageCLEF 2012 Photo Annotation Task. The ImageCLEF 2012 Photo Annotation Task is a multi-label image classification challenge: given a training set of images with underlying concepts, the aim is to detect the presence of these concepts for each image of a test set using an annotation system based on visual or textual features or a combination of both. Detailed information on the task, the training and test sets of images, the concepts and the evaluation measures can be found in the overview paper [1]. Our automatic image annotation system is based on visual features only. We focus on the comparison of two image representations which take spatial layout into account: the spatial pyramid [4] and visual phrases [3]. The spatial pyramid is very popular and widely used, especially in the context of scene categorization, whereas visual phrases seem to have received far less attention in the literature. The remainder of this paper is organized as follows: in section 2 we describe the architecture and the technical details of our image annotation system, in section 3 we present the evaluation on the training and the test set and discuss the results, and we end with a conclusion in section 4.

2 Architecture of the DBRIS image annotation system

The architecture of our automatic image annotation system, together with the methods used in each step, is illustrated in figure 1. To obtain the image representation of the training and test images, local features are extracted by applying the Harris-Laplace detector and the SIFT [5] descriptor in different color variants. The extracted local features are then summarized into the bag-of-words (bow) image representation as well as the spatial pyramid [4] and visual phrases [3] representations. For the classifier training and classification steps we use a KNN-like classifier with one representative per concept. In the following subsections we describe each step in detail.

[Figure 1: block diagram of the system. Training phase: feature extraction (Harris-Laplace with C-SIFT, rgSIFT, OpponentSIFT, RGB-SIFT and SIFT), image representation (bow, spatial pyramids, visual phrases) and classifier training (KNN-like classifier with one representative per concept). Annotation phase: classification of the test images.]
Fig. 1. Architecture of the DBRIS image annotation system

2.1 Features

For the choice of local features we refer to the evaluation of color descriptors presented in [2]. We adopt the features C-SIFT, rgSIFT, OpponentSIFT, RGB-SIFT and SIFT, as they are shown to perform best on the evaluation's underlying image benchmark, the PASCAL VOC Challenge 2007. To extract these features, with the Harris-Laplace point sampling strategy as the base, we use the color descriptor software [2].
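The actual extraction is performed with the color descriptor software of [2]. Purely as an illustration of the kind of local features the later steps operate on (keypoint location, scale, orientation and a descriptor vector), the following sketch uses plain grayscale SIFT via OpenCV as a stand-in for the Harris-Laplace sampling and the color SIFT variants; the function name extract_local_features is ours and not part of any tool used here.

```python
# Illustrative stand-in only: grayscale SIFT with OpenCV instead of the
# Harris-Laplace + color SIFT pipeline of the color descriptor software [2].
import cv2

def extract_local_features(image_path):
    """Return a list of (x, y, scale, orientation, descriptor) tuples."""
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()  # 128-dimensional SIFT descriptors
    keypoints, descriptors = sift.detectAndCompute(image, None)
    if descriptors is None:   # no keypoints detected in this image
        return []
    return [(kp.pt[0], kp.pt[1], kp.size, kp.angle, descr)
            for kp, descr in zip(keypoints, descriptors)]
```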
2.2 Image representations

For each of the features, we quantize its descriptor space (225,000 descriptors) into 500 and 5,000 visual words using K-Means. The visual words serve as the basis for the bow, spatial pyramid and visual phrases representations. The representations are created in the common way, using hard assignment of image features to visual words. We use the spatial pyramid constructions 1x3, 1x1+1x3 and 1x1+2x2+4x4, in weighted and unweighted versions. To construct visual phrases we follow [3] and define a visual phrase as a pair of adjacent visual words. Assume an image contains the keypoints kp_a = {(x_a, y_a), scale_a, orient_a, descr_a} and kp_b = {(x_b, y_b), scale_b, orient_b, descr_b} with their assigned visual words vw_i and vw_j, respectively. Then the image contains the visual phrase vp_ij = {vw_i, vw_j} if the Euclidean distance of the keypoints' locations in the image satisfies

EuclideanDistance((x_a, y_a), (x_b, y_b)) < max(scale_a, scale_b) · λ    (1)

We set λ = 3. Analogously to the bow representation, an image is represented by a histogram of visual phrases (a code sketch of this construction is given at the end of this section). Furthermore, we create a representation combining bow with visual phrases, weighting the bow histogram with 0.25 and the visual phrases histogram with 0.75. Table 1 summarizes all image representations, with the number of dimensions used in combination with each feature.

image representation   number of dimensions
bow                    5,000
sp 1x3                 15,000
sp 1x1+1x3             20,000
sp 1x1+2x2+4x4         105,000
sp 1x1+2x2+4x4 w       105,000
vp                     125,250
bow & vp               130,250

Table 1. Image representations

2.3 Classifier

We use a KNN-like classifier in which a concept is not represented by the set of its corresponding images but by a single representative. The representative of a concept is obtained by averaging the image representations of all images belonging to the concept. To classify a test image, the similarities between the test image and the representatives of all concepts are determined. As similarity function we use the histogram intersection. To obtain binary decisions on the membership to the concepts, we apply an image-dependent threshold: a concept is considered present in the test image if the similarity between the test image and the concept representative is equal to or greater than 0.75 times the maximum of the similarities of the test image to all concepts.
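To make the phrase construction of section 2.2 concrete, the following sketch builds a visual phrase histogram from keypoints that have already been hard-assigned to visual words, using the adjacency criterion of equation (1) with λ = 3. It is an illustration only, not the original implementation; the function names and the L1 normalisation at the end are our own choices.

```python
import numpy as np

K = 500       # visual vocabulary size used for visual phrases
LAMBDA = 3.0  # scale factor lambda from equation (1)

def pair_index(i, j, k=K):
    """Index of the unordered visual-word pair {i, j} in a vector of size k*(k+1)/2."""
    i, j = min(i, j), max(i, j)
    return i * k - i * (i - 1) // 2 + (j - i)

def visual_phrase_histogram(keypoints, k=K, lam=LAMBDA):
    """keypoints: list of (x, y, scale, visual_word) tuples for one image.

    Two keypoints form a visual phrase if the Euclidean distance of their
    locations is smaller than lam times the larger of their two scales,
    see equation (1). For k = 500 the histogram has 125,250 bins (Table 1).
    """
    hist = np.zeros(k * (k + 1) // 2)
    for a in range(len(keypoints)):
        xa, ya, sa, va = keypoints[a]
        for b in range(a + 1, len(keypoints)):
            xb, yb, sb, vb = keypoints[b]
            if np.hypot(xa - xb, ya - yb) < max(sa, sb) * lam:
                hist[pair_index(va, vb)] += 1
    total = hist.sum()
    return hist / total if total > 0 else hist  # L1 normalisation (our assumption)
```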
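The KNN-like classifier of section 2.3 can be sketched in the same spirit. Again this is a minimal illustration under our own naming, assuming that every concept occurs in at least one training image; the histograms may be bow, spatial pyramid or visual phrase vectors of equal dimension.

```python
import numpy as np

def histogram_intersection(h1, h2):
    """Similarity of two histograms: the sum of the element-wise minima."""
    return np.minimum(h1, h2).sum()

def train_representatives(histograms, labels, concepts):
    """histograms: array of shape (n_images, dim); labels: list of concept sets.

    The representative of a concept is the mean of the histograms of all
    training images annotated with that concept.
    """
    return {c: np.mean([h for h, l in zip(histograms, labels) if c in l], axis=0)
            for c in concepts}

def annotate(test_histogram, representatives, threshold=0.75):
    """Return the set of concepts predicted for one test image."""
    sims = {c: histogram_intersection(test_histogram, rep)
            for c, rep in representatives.items()}
    max_sim = max(sims.values())
    # image-dependent threshold: 0.75 times the maximum similarity
    return {c for c, s in sims.items() if s >= threshold * max_sim}
```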
3 Evaluation

In the following we describe two evaluations: first we present the results of our experiments on the training set, and second we discuss the evaluation of the five runs submitted to ImageCLEF.

3.1 Training and classification on the training set

To train and evaluate the DBRIS image annotation system, we split the training set of images into two disjoint parts (of 7,500 images each), whereby both parts contain an almost equal number of images for each concept. For each training and test pair we train the classifier on one part and then use this classifier to classify the other part. The evaluation results are then averaged over the two training and test pairs. In the first experiment we train one image annotation system for each of the 35 combinations of descriptors and image representations. Figure 2 shows the results in terms of MiAP values (averaged over all concepts). Comparing the systems with regard to the descriptors, we observe almost the same performance behaviour as reported in [2]. Except for rgSIFT combined with the visual phrases based image representations, C-SIFT outperforms all other descriptors for every image representation. The worst results are obtained with the SIFT descriptor. Considering the image representations, we can see that the representations based on visual phrases perform significantly better than the other ones for all descriptors. For the descriptors C-SIFT, rgSIFT and OpponentSIFT the representations vp and bow & vp achieve similar values; with RGB-SIFT and SIFT, the bow & vp representation is the better choice of the two. Bow and the representations based on spatial pyramids differ only slightly from each other; which one to choose depends on the descriptor used.

[Figure 2: bar chart of MiAP values (range approx. 0.082 to 0.102) per descriptor (C-SIFT, rgSIFT, RGB-SIFT, OpponentSIFT, SIFT), with one bar per image representation (bow, sp 1x3, sp 1x1+1x3, sp 1x1+2x2+4x4, sp 1x1+2x2+4x4 w, vp, bow & vp).]
Fig. 2. MiAP values for each combination of descriptor and image representation

In the next experiment we join all descriptors into one annotation system, i.e. for each of the seven image representations we train an image annotation system whose classifier consists of five classifiers corresponding to the five descriptors. In the classification step, the similarities between the test image and the concept representatives obtained from each of the five classifiers are averaged over these five classifiers (a sketch of this late fusion follows figure 3). The binary decisions on the membership to the concepts are calculated in the same way as described in section 2.3. Furthermore, we create a configuration which combines the five descriptors with the image representations sp 1x1+1x3, sp 1x1+2x2+4x4 w, bow & vp and vp. The annotation system with this configuration consists of 20 classifiers (5 descriptors x 4 representations) and is called combined. The MiAP values for all configurations are shown in figure 3. The performance of the image representations behaves similarly to that of the C-SIFT descriptor in figure 2, but the MiAP values are lower and comparable with the rgSIFT results. Combining several representations improves on the performance of bow and the spatial pyramids, but the image representations based on visual phrases still achieve better results.

[Figure 3: bar chart of MiAP values (range approx. 0.090 to 0.099) for the systems using all five descriptors, one bar per configuration (bow, sp 1x3, sp 1x1+1x3, sp 1x1+2x2+4x4, sp 1x1+2x2+4x4 w, vp, bow & vp, combined).]
Fig. 3. MiAP values for each image representation
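The combination of several classifiers described above amounts to a simple late fusion of similarity scores: each per-descriptor (or per-representation) classifier yields one similarity per concept, the similarities are averaged, and the image-dependent threshold of section 2.3 is applied to the averaged scores. A minimal sketch, reusing the hypothetical interface of the classifier sketch in section 2:

```python
import numpy as np

def fuse_and_annotate(per_classifier_sims, threshold=0.75):
    """per_classifier_sims: list of {concept: similarity} dicts, one per
    (descriptor, representation) classifier.

    The similarities are averaged over all classifiers before applying
    the image-dependent threshold of section 2.3 to the fused scores.
    """
    concepts = per_classifier_sims[0].keys()
    fused = {c: float(np.mean([sims[c] for sims in per_classifier_sims]))
             for c in concepts}
    max_sim = max(fused.values())
    return {c for c, s in fused.items() if s >= threshold * max_sim}
```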
3.2 Classification of the test set

For the classification of the test set, we train the classifier on the whole training set. For the five submitted runs we choose five of the image representations from the second experiment presented in section 3.1: sp 1x1+2x2+4x4 w as run DBRIS 1, combined as DBRIS 2, sp 1x1+1x3 as DBRIS 3, vp as DBRIS 4 and bow & vp as DBRIS 5. Figures 4 and 5 present the results of these configurations for each concept (MiAP values) and as averages (MiAP, MnAP, GMiAP, GMnAP) over all concepts. Best values, or values which are significantly better than the others within a certain concept, are highlighted in green. To evaluate the image representations as a whole, we first consider the averages MiAP, MnAP, GMiAP and GMnAP in figure 5. The image representations vp and bow & vp again yield the best values, followed by combined, sp 1x1+2x2+4x4 w and sp 1x1+1x3. These results reflect the evaluation in figure 3. When we consider the concepts and their concept categories, we can see that there are some concept categories in which the image representations based on visual phrases dominate. These concept categories are quantity, age, (gender) and view. The same observations have been made in the experiments on the training set. Other concept categories which yield the best results with the visual phrases on the training set are relation and setting. A possible reason for the success of the visual phrases in these concept categories is that these concepts contain a lot of pictures of persons: visual phrases can capture human features like eyes, mouth, etc. better than the spatial pyramids because they work on a finer level. The success of the visual phrases in the concept category water cannot be confirmed by the experiments on the training set. As visual phrases are popular for object detection tasks, it is surprising that these image representations fail in the concept category fauna. The best results in the concept category fauna are achieved with the image representations based on spatial pyramids. Spatial pyramids are also successful in sentiment and transport.

4 Conclusion

Finally, we summarize the experience gathered in our experiments. The best performing descriptor, which is C-SIFT in our experiments, yields better performance than joining all descriptors together. Regarding the choice of image representation, the representations based on visual phrases significantly outperform the spatial pyramids and the bow representation. The evaluation shows that visual phrases are especially appropriate for concepts dealing with persons. Although visual phrases are often used in object detection tasks, they are also successful in scene categorization.

References

1. Thomee, B., Popescu, A.: Overview of the ImageCLEF 2012 Flickr Photo Annotation and Retrieval Task. CLEF 2012 working notes, Rome, Italy (2012)
2. van de Sande, K. E. A., Gevers, T., Snoek, C. G. M.: Evaluating Color Descriptors for Object and Scene Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1582-1596 (2010), http://www.colordescriptors.com
3. Zheng, Q.-F., Gao, W.: Constructing Visual Phrases for Effective and Efficient Object-Based Image Retrieval. ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 5, no. 1, art. 7 (2008)
4. Lazebnik, S., Schmid, C., Ponce, J.: Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, New York (2006), vol. 2, pp. 2169-2178
5. Lowe, D. G.: Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110 (2004)
Run configurations: DBRIS 1 = sp 1x1+2x2+4x4 w, DBRIS 2 = combined, DBRIS 3 = sp 1x1+1x3, DBRIS 4 = vp, DBRIS 5 = bow & vp.

no.  concept                    DBRIS 1  DBRIS 2  DBRIS 3  DBRIS 4  DBRIS 5
0    timeofday_day              0.4714   0.4412   0.4399   0.4687   0.4699
1    timeofday_night            0.0498   0.0506   0.0463   0.0453   0.0458
2    timeofday_sunrisesunset    0.0384   0.0484   0.0386   0.056    0.0563
3    celestial_sun              0.0238   0.0274   0.0247   0.0271   0.0276
4    celestial_moon             0.0068   0.0068   0.0068   0.0068   0.0068
5    celestial_stars            0.0025   0.0025   0.0025   0.0025   0.0025
6    weather_clearsky           0.0715   0.0723   0.0718   0.0715   0.0713
7    weather_overcastsky        0.0517   0.0482   0.0497   0.0472   0.0479
8    weather_cloudysky          0.1355   0.2272   0.1647   0.2312   0.2312
9    weather_rainbow            0.0026   0.0092   0.0035   0.0029   0.0028
10   weather_lightning          0.0132   0.0133   0.0132   0.0131   0.013
11   weather_fogmist            0.0175   0.0128   0.0163   0.1016   0.0172
12   weather_snowice            0.0203   0.0185   0.0182   0.0178   0.0153
13   combustion_flames          0.0039   0.0044   0.004    0.0046   0.0046
14   combustion_smoke           0.0049   0.0048   0.0048   0.0048   0.0048
15   combustion_fireworks       0.0052   0.0039   0.0048   0.0035   0.0043
16   lighting_shadow            0.0824   0.0762   0.0795   0.0756   0.0767
17   lighting_reflection        0.0332   0.0332   0.0333   0.035    0.0347
18   lighting_silhouette        0.0341   0.0341   0.0344   0.0422   0.0467
19   lighting_lenseffect        0.0429   0.0517   0.0445   0.0542   0.0497
20   scape_mountainhill         0.0602   0.1355   0.0692   0.0918   0.0931
21   scape_desert               0.0189   0.053    0.0346   0.097    0.0958
22   scape_forestpark           0.2221   0.257    0.2415   0.2138   0.2113
23   scape_coast                0.1782   0.1725   0.1785   0.1744   0.1773
24   scape_rural                0.0505   0.0615   0.0541   0.0675   0.07
25   scape_city                 0.1104   0.118    0.1123   0.1114   0.1112
26   scape_graffiti             0.0444   0.0493   0.0456   0.0446   0.0444
27   water_underwater           0.0161   0.0092   0.0127   0.0125   0.019
28   water_seaocean             0.0513   0.0486   0.0516   0.0552   0.0559
29   water_lake                 0.0127   0.013    0.0136   0.0156   0.0156
30   water_riverstream          0.0507   0.0492   0.0603   0.1249   0.1263
31   water_other                0.0328   0.0363   0.0331   0.0353   0.0346
32   flora_tree                 0.3332   0.321    0.3395   0.3376   0.3379
33   flora_plant                0.0585   0.0715   0.0649   0.0717   0.0707
34   flora_flower               0.0778   0.0832   0.0792   0.0984   0.0985
35   flora_grass                0.2075   0.2603   0.2248   0.1724   0.1707
36   fauna_cat                  0.0226   0.0179   0.0154   0.0163   0.0145
37   fauna_dog                  0.0506   0.0451   0.0476   0.0498   0.0485
38   fauna_horse                0.0166   0.0183   0.0164   0.0146   0.0146
39   fauna_fish                 0.0089   0.0051   0.0055   0.0059   0.006
40   fauna_bird                 0.0345   0.0339   0.038    0.0326   0.0331
41   fauna_insect               0.0156   0.0213   0.0178   0.018    0.0167
42   fauna_spider               0.0047   0.0045   0.0051   0.0046   0.005
43   fauna_amphibianreptile     0.0067   0.0072   0.0075   0.0069   0.0071
44   fauna_rodent               0.015    0.0136   0.0156   0.0147   0.0147
45   quantity_none              0.7234   0.7399   0.738    0.7234   0.7233
46   quantity_one               0.2774   0.2324   0.247    0.2798   0.2798
47   quantity_two               0.05     0.0498   0.0499   0.052    0.0518
48   quantity_three             0.019    0.0188   0.0189   0.0205   0.0207
49   quantity_smallgroup        0.0535   0.0551   0.0531   0.0595   0.0592
50   quantity_biggroup          0.0579   0.0622   0.0589   0.0664   0.0661

Fig. 4. Results of the submitted runs 1 (in MiAP)
Run configurations: DBRIS 1 = sp 1x1+2x2+4x4 w, DBRIS 2 = combined, DBRIS 3 = sp 1x1+1x3, DBRIS 4 = vp, DBRIS 5 = bow & vp.

no.  concept                    DBRIS 1  DBRIS 2  DBRIS 3  DBRIS 4  DBRIS 5
51   age_baby                   0.0084   0.0087   0.0085   0.009    0.0091
52   age_child                  0.0362   0.0346   0.0342   0.0371   0.0367
53   age_teenager               0.0484   0.0399   0.0481   0.1173   0.1177
54   age_adult                  0.3153   0.2604   0.2854   0.2996   0.2986
55   age_elderly                0.024    0.0435   0.0264   0.0267   0.0273
56   gender_male                0.1927   0.1923   0.1922   0.2043   0.203
57   gender_female              0.269    0.2158   0.2395   0.2353   0.2347
58   relation_familyfriends     0.0775   0.0763   0.0774   0.0832   0.0828
59   relation_coworkers         0.0257   0.032    0.0275   0.0338   0.0321
60   relation_strangers         0.1295   0.0516   0.0835   0.0663   0.072
61   quality_noblur             0.735    0.7375   0.7359   0.723    0.7232
62   quality_partialblur        0.3039   0.2484   0.3037   0.3073   0.3076
63   quality_completeblur       0.0083   0.0083   0.0083   0.0094   0.0085
64   quality_motionblur         0.0208   0.0232   0.0228   0.0204   0.0205
65   quality_artifacts          0.0235   0.0199   0.0206   0.0208   0.0205
66   style_pictureinpicture     0.0207   0.0231   0.0228   0.0212   0.0195
67   style_circularwarp         0.0157   0.0154   0.0155   0.0161   0.0159
68   style_graycolor            0.0371   0.114    0.0859   0.0904   0.0903
69   style_overlay              0.0398   0.0399   0.0399   0.0392   0.0392
70   view_portrait              0.1448   0.163    0.1447   0.2032   0.2031
71   view_closeupmacro          0.1684   0.1722   0.1703   0.1722   0.1736
72   view_indoor                0.1434   0.1469   0.1436   0.1624   0.1624
73   view_outdoor               0.4595   0.4303   0.4208   0.4527   0.4532
74   setting_citylife           0.1762   0.1831   0.1759   0.1814   0.1809
75   setting_partylife          0.0375   0.0421   0.0398   0.0422   0.0429
76   setting_homelife           0.07     0.0698   0.0704   0.0763   0.0763
77   setting_sportsrecreation   0.0362   0.037    0.0361   0.0368   0.0368
78   setting_fooddrink          0.0821   0.0974   0.0766   0.1064   0.1006
79   sentiment_happy            0.1198   0.1859   0.1819   0.1247   0.123
80   sentiment_calm             0.1601   0.171    0.1624   0.1703   0.1719
81   sentiment_inactive         0.0949   0.0951   0.0954   0.0944   0.0943
82   sentiment_melancholic      0.062    0.0614   0.0612   0.0666   0.0642
83   sentiment_unpleasant       0.0535   0.0504   0.0515   0.0486   0.0489
84   sentiment_scary            0.0389   0.0309   0.0358   0.0333   0.0328
85   sentiment_active           0.1657   0.0899   0.1202   0.1207   0.166
86   sentiment_euphoric         0.017    0.0182   0.0176   0.0186   0.0182
87   sentiment_funny            0.1521   0.1543   0.1527   0.1121   0.1121
88   transport_cycle            0.0336   0.0309   0.0301   0.031    0.0298
89   transport_car              0.0719   0.0677   0.0718   0.0651   0.0674
90   transport_truckbus         0.013    0.0103   0.012    0.0107   0.0112
91   transport_rail             0.0147   0.0147   0.0154   0.0143   0.0148
92   transport_water            0.0693   0.0508   0.0687   0.0669   0.068
93   transport_air              0.0057   0.0059   0.0058   0.0056   0.0057

     map_n (MnAP)               0.0774   0.081    0.0788   0.0818   0.0818
     map_i (MiAP)               0.0927   0.0938   0.0925   0.0976   0.0972
     gmap_n (GMnAP)             0.0355   0.0374   0.0363   0.0385   0.0385
     gmap_i (GMiAP)             0.0441   0.0454   0.0445   0.0476   0.047

Fig. 5. Results of the submitted runs 2 (in MiAP)