<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>DBRIS at ImageCLEF 2012 Photo Annotation Task</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Computer Science Heinrich-Heine-University of Duesseldorf D-40204 Duesseldorf</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2012</year>
      </pub-date>
      <abstract>
        <p>For our participation in the ImageCLEF 2012 Photo Annotation Task we develop an image annotation system and test several combinations of SIFT-based descriptors with bow-based image representations. Our focus is on the comparison of two image representation types which include spatial layout: spatial pyramids and visual phrases. The experiments on the training and test set show that image representations based on visual phrases significantly outperform spatial pyramids.</p>
      </abstract>
      <kwd-group>
        <kwd>SIFT</kwd>
        <kwd>bow</kwd>
        <kwd>spatial pyramids</kwd>
        <kwd>visual phrases</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        This paper presents our participation in the ImageCLEF 2012 Photo
Annotation Task. The ImageCLEF 2012 Photo Annotation Task is a multi-label image
classification challenge: given a training set of images with underlying concepts,
the aim is to detect the presence of these concepts for each image of a test set
using an annotation system based on visual or textual features or a combination
of both. Detailed information on the task, the training and test sets of images,
the concepts and the evaluation measures can be found in the overview paper [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
Our automatic image annotation system is based solely on visual features. We focus
on the comparison of two image representations which incorporate spatial layout: the
spatial pyramid [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and visual phrases [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The spatial pyramid is very popular
and often used, especially in the context of scene categorization, whereas visual
phrases have received comparatively little attention in the literature.
      </p>
      <p>The remainder of this paper is organized as follows: in section 2 we describe
the architecture and the technical details of our image annotation system, in
section 3 we present the evaluation on the training and the test set and discuss
the results, and we end with a conclusion in section 4.</p>
    </sec>
    <sec id="sec-2">
      <title>Architecture of the DBRIS image annotation system</title>
      <p>
        The architecture of our automatic image annotation system, together with the
methods used in each step, is illustrated in figure 1. To obtain the image
representation of the training and test images, local features are extracted by applying
the Harris-Laplace detector and the SIFT [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] descriptor in different color variants.
The extracted local features are then summarized into the bag-of-words (bow)
image representation as well as the image representations spatial pyramid [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and
visual phrases [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. For the classifier training and classification steps we use a
KNN-like classifier with one representative per concept. In the following
subsections we describe each step in detail.
      </p>
      <p>Fig. 1. Architecture of the DBRIS image annotation system: feature extraction
(Harris-Laplace detector with the descriptors C-SIFT, rgSIFT, OpponentSIFT,
RGB-SIFT and SIFT), image representation (BOW, spatial pyramids, visual phrases),
and classifier training and classification (KNN-like with one representative per
concept).</p>
      <p>
        For the choice of local features we refer to the evaluation of color descriptors
presented in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. We adopt the descriptors C-SIFT, rgSIFT, OpponentSIFT,
RGB-SIFT and SIFT, as they were shown to perform best on the evaluation's underlying
image benchmark, the PASCAL VOC Challenge 2007. To extract these features,
with the Harris-Laplace point sampling strategy as the basis, we use the color
descriptor software [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        For each of the descriptors, we quantize its descriptor space (225,000 descriptors)
into 500 and 5,000 visual words using K-Means. The visual words serve as the basis
for the bow, spatial pyramid and visual phrases representations. The
representations are created in the common way using hard assignment of image features
to visual words. We use the spatial pyramid constructions 1x3, 1x1+1x3 and
1x1+2x2+4x4, each in a weighted and an unweighted version. To construct visual phrases
we follow [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and define a visual phrase as a pair of adjacent visual words. Assume
an image contains the keypoints kp_a = {(x_a, y_a), scale_a, orient_a, descr_a} and
kp_b = {(x_b, y_b), scale_b, orient_b, descr_b} with their assigned visual words vw_i and
vw_j, respectively. Then the image contains the visual phrase vp_ij = {vw_i, vw_j}
if the Euclidean distance of the keypoints' locations in the image satisfies
EuclideanDistance((x_a, y_a), (x_b, y_b)) &lt; alpha * max(scale_a, scale_b)
(1)
We set alpha = 3. Analogously to the bow representation, an image is represented by
a histogram of visual phrases. Furthermore, we create a representation
combining bow with visual phrases, weighting the bow histogram with 0.25 and the visual
phrases histogram with 0.75. Table 1 summarizes all image representations we used
in combination with each feature, together with their number of dimensions.
      </p>
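      <p>As an illustration, the adjacency criterion of equation (1) and the resulting
phrase histogram can be sketched in a few lines of Python (our own minimal sketch,
not the authors' code; the Keypoint container and the function name
build_visual_phrases are hypothetical):</p>

```python
import math
from collections import Counter
from dataclasses import dataclass

@dataclass
class Keypoint:
    x: float
    y: float
    scale: float
    word: int  # index of the visual word assigned to this keypoint

def build_visual_phrases(keypoints, alpha=3.0):
    """Histogram of visual phrases: unordered pairs of adjacent visual words.

    Two keypoints are adjacent when the Euclidean distance of their locations
    is below alpha * max(scale_a, scale_b); the paper sets alpha = 3.
    """
    hist = Counter()
    for i, a in enumerate(keypoints):
        for b in keypoints[i + 1:]:
            if math.hypot(a.x - b.x, a.y - b.y) < alpha * max(a.scale, b.scale):
                # store the unordered word pair in canonical (min, max) order
                hist[(min(a.word, b.word), max(a.word, b.word))] += 1
    return hist
```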
      <p>Table 1. Image representations and their number of dimensions:
bow: 5,000; sp 1x3: 15,000; sp 1x1+1x3: 20,000; sp 1x1+2x2+4x4: 105,000;
sp 1x1+2x2+4x4 w: 105,000; vp: 125,250; bow &amp; vp: 130,250.</p>
      <p>We use a KNN-like classifier in which a concept is represented not by the set of
its corresponding images but by a single representative. The representative
of a concept is obtained by averaging the image representations of all images
belonging to the concept. To classify a test image, the similarities between the
test image and the representatives of all concepts are determined, using
histogram intersection as the similarity function. To obtain binary decisions on
concept membership, we set an image-dependent threshold: a concept is
present in the test image if the similarity between the test image and the concept
is equal to or greater than 0.75 times the maximum of the similarities of the test
image to all concepts.</p>
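      <p>This classifier can be sketched as follows (a minimal numpy sketch under our
assumptions; the function names are ours, and the histograms are assumed to be
stored row-wise with one binary label column per concept):</p>

```python
import numpy as np

def concept_representatives(histograms, labels):
    """One representative per concept: the mean of the image histograms
    of all training images annotated with that concept.

    histograms: (n_images, n_dims) array; labels: (n_images, n_concepts) 0/1.
    """
    n_concepts = labels.shape[1]
    reps = np.zeros((n_concepts, histograms.shape[1]))
    for c in range(n_concepts):
        reps[c] = histograms[labels[:, c] == 1].mean(axis=0)
    return reps

def classify(test_hist, reps, ratio=0.75):
    """Histogram intersection against every representative, then an
    image-dependent threshold of ratio * (maximum similarity)."""
    sims = np.minimum(test_hist, reps).sum(axis=1)
    return sims >= ratio * sims.max()
```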
    </sec>
    <sec id="sec-3">
      <title>Evaluation</title>
      <p>In the following we describe two evaluations: first we present the results of our
experiments on the training set; second we discuss the evaluation of the
five runs submitted to ImageCLEF.</p>
      <p>Training and classification on the training set
To train and evaluate the DBRIS image annotation system, we split the training
set of images into two disjoint parts (of size 7500 each), such that both parts contain
an almost equal number of images for each concept. For each training and test pair we
train the classifier on the one part and then use this classifier to classify the other
part of the images. The evaluation results are then averaged over the two training
and test pairs.</p>
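      <p>The paper does not specify how the two balanced halves are built; one greedy
way to approximate "an almost equal number of images per concept" is sketched
below (our own illustration, with hypothetical names; labels maps each image id
to its set of concept ids):</p>

```python
import random
from collections import defaultdict

def two_part_split(image_ids, labels, seed=0):
    """Greedily assign each image to the part where its concepts
    are currently less represented, breaking ties by part size."""
    rng = random.Random(seed)
    ids = list(image_ids)
    rng.shuffle(ids)
    parts = ([], [])
    counts = (defaultdict(int), defaultdict(int))
    for img in ids:
        # concept load of this image's labels in each part so far
        load = [sum(counts[p][c] for c in labels[img]) for p in (0, 1)]
        p = 0 if (load[0], len(parts[0])) <= (load[1], len(parts[1])) else 1
        parts[p].append(img)
        for c in labels[img]:
            counts[p][c] += 1
    return parts
```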
      <p>
        In the first experiment we train one image annotation system for each of
the 35 combinations of descriptors and image representations. Figure 2 shows
the results in terms of MiAP values (averaged over all concepts). Comparing
the systems with regard to the descriptors, we observe almost the same
performance behaviour as shown in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Except for rgSIFT combined with the
visual phrases based image representations, C-SIFT outperforms all the other
descriptors in every image representation. The worst results are obtained by the
SIFT descriptor. Considering the image representations, we can see that
the image representations based on visual phrases perform significantly better
than the other ones for all descriptors. In the case of the descriptors C-SIFT,
rgSIFT and OpponentSIFT, the representations vp and bow &amp; vp achieve similar
values. When using RGB-SIFT and SIFT, the bow &amp; vp representation is the
better choice of the two. Bow and the representations based on the spatial pyramid
differ only slightly from each other; which one to choose depends on the descriptor
used.
      </p>
      <p>Fig. 2. MiAP results on the training set for each combination of descriptor
(C-SIFT, rgSIFT, RGB-SIFT, OpponentSIFT, SIFT) and image representation.</p>
      <p>In the next experiment we join all descriptors into one annotation system, i.e.
for each of the seven image representations we train an image annotation system
whose classifier consists of five classifiers corresponding to the five descriptors. At
the classification step, the similarities between the test image and the concept
representatives obtained in each of the five classifiers are averaged over these
five classifiers. The binary decisions on concept membership are
calculated in the same way as described in section 2.3. Furthermore, we create a
configuration which combines the five descriptors with the image representations
sp 1x1+1x3, sp 1x1+2x2+4x4 w, bow &amp; vp and vp. The annotation system with
this configuration consists of 20 classifiers (5 descriptors x 4 representations) and
is called combined. The MiAP values for all configurations are shown in figure 3.
The performance of the image representations behaves similarly to that of
the C-SIFT descriptor in figure 2, but the MiAP values are lower, comparable
to the rgSIFT results. Combining several representations improves the
performance of bow and the spatial pyramids, but the image representations
based on visual phrases still achieve the best results.</p>
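      <p>The averaging over the per-descriptor classifiers can be sketched as follows
(our own minimal illustration; sims_per_descriptor is assumed to hold one
concept-similarity vector per descriptor, as produced by the classifier of
section 2.3):</p>

```python
import numpy as np

def ensemble_classify(sims_per_descriptor, ratio=0.75):
    """Average the similarity vectors of the individual classifiers,
    then apply the image-dependent threshold of section 2.3:
    a concept is predicted present when its averaged similarity is at
    least ratio * (maximum averaged similarity)."""
    sims = np.mean(sims_per_descriptor, axis=0)
    return sims >= ratio * sims.max()
```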
      <p>For the classification of the test set, we train the classifier on the whole training
set. For the five submission runs we choose five of the image representations
from the second experiment presented in section 3.1: sp 1x1+2x2+4x4 w as run
DBRIS 1, combined as DBRIS 2, sp 1x1+1x3 as DBRIS 3, vp as DBRIS 4 and
bow &amp; vp as DBRIS 5. Figure 4 and figure 5 present the results of the
configurations for each concept (MiAP values) and as averages (MiAP, MnAP, GMiAP,
GMnAP) over all concepts. Best values, or values which are significantly
better than others within a certain concept, are highlighted in green. To evaluate
the image representations as a whole, we first consider the averages MiAP,
MnAP, GMiAP and GMnAP in figure 5. The image representations vp and bow &amp;
vp yield the best values again, followed by combined, sp 1x1+2x2+4x4 w and sp
1x1+1x3. These results reflect the evaluation in figure 3. When we consider the
concepts with their concept categories, we can see that there are some concept
categories where the image representations based on visual phrases dominate:
quantity, age, gender and view. These
observations were also made in the experiments on the training set. Other concept
categories which yield the best results with the visual phrases on the training set are
relation and setting. A possible reason for the success of the visual phrases in
these concept categories is that these concepts contain many pictures of
persons: visual phrases can capture human features like eyes and mouth better
than the spatial pyramids because they work on a finer level. The success of
the visual phrases in the concept category water cannot be confirmed by the
experiments on the training set. As visual phrases are popular for object
detection tasks, it is surprising that these image representations fail in the concept
category fauna, where the best results are achieved with
the image representations based on spatial pyramids. Spatial pyramids are also
successful in sentiment and transport.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>Finally, we summarize the experience gathered in the
experiments. The best performing descriptor, C-SIFT in our experiments,
yields better performance than joining all descriptors together. Regarding the choice
of image representation, the representations based on visual phrases
significantly outperform the spatial pyramids and the bow representation. The
evaluation shows that visual phrases are especially appropriate for concepts dealing
with persons. Although visual phrases are mostly used in object detection tasks,
they are also successful in scene categorization.</p>
      <p>Fig. 4. Results of the submitted runs (per-concept MiAP values)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Thomee</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Popescu</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Overview of the ImageCLEF 2012 Flickr Photo Annotation and Retrieval Task</article-title>
          .
          <source>CLEF 2012 working notes</source>
          , Rome, Italy (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>van de Sande</surname>
            ,
            <given-names>K. E. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gevers</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Snoek</surname>
            ,
            <given-names>C. G. M.</given-names>
          </string-name>
          :
          <article-title>Evaluating Color Descriptors for Object and Scene Recognition</article-title>
          .
          <source>In: IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          , vol.
          <volume>32</volume>
          (
          <issue>9</issue>
          ), pp.
          <fpage>1582</fpage>
          -
          <lpage>1596</lpage>
          , (
          <year>2010</year>
          ) http://www.colordescriptors.com
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>Qing-Fang</given-names>
          </string-name>
          and
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>Wen</given-names>
          </string-name>
          :
          <article-title>Constructing visual phrases for effective and efficient object-based image retrieval</article-title>
          .
          <source>In: ACM Trans. Multimedia Comput. Commun. Appl.</source>
          , vol.
          <volume>5</volume>
          , no.
          <issue>1</issue>
          , art.
          <elocation-id>7</elocation-id>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>S.</given-names>
            <surname>Lazebnik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schmid</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Ponce</surname>
          </string-name>
          :
          <article-title>Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories</article-title>
          . In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, New York (
          <year>2006</year>
          ), vol.
          <volume>2</volume>
          , pp.
          <fpage>2169</fpage>
          -
          <lpage>2178</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Lowe</surname>
            ,
            <given-names>David G.</given-names>
          </string-name>
          :
          <article-title>Distinctive Image Features from Scale-Invariant Keypoints</article-title>
          .
          <source>In: International Journal of Computer Vision</source>
          , vol.
          <volume>60</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>91</fpage>
          -
          <lpage>110</lpage>
          ,
          <year>2004</year>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>