<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The PRA and AmILAB at ImageCLEF 2012 Photo Flickr Annotation Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luca Piras</string-name>
          <email>luca.piras@diee.unica.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Tronci</string-name>
          <email>roberto.tronci@diee.unica.it</email>
          <email>roberto.tronci@sardegnaricerche.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gabriele Murgia</string-name>
          <email>gabriele.murgia@sardegnaricerche.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giorgio Giacinto</string-name>
          <email>giacinto@diee.unica.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>AmILAB - Laboratorio Intelligenza d'Ambiente</institution>
          ,
          <addr-line>Sardegna Ricerche</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>DIEE - Department of Electric and Electronic Engineering University of Cagliari</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2012</year>
      </pub-date>
      <abstract>
<p>This paper presents the first participation of the Pattern Recognition and Application Group (PRA Group) and the Ambient Intelligence Lab (AmILAB) in the ImageCLEF 2012 Photo Flickr Concept Annotation Task. In this task, the teams' goal is to detect the presence of 94 concepts in the images, and to provide a confidence score for the decision of each concept detector. We faced the challenge by relying on visual information only, combining different image descriptors by means of different score combination techniques. Experimental results show that merely combining concept detectors not specifically designed to handle the large variety of concepts does not allow reaching satisfactory results.</p>
      </abstract>
      <kwd-group>
        <kwd>image annotation</kwd>
        <kwd>dynamic score combination</kwd>
        <kwd>SVM</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>
        The visual concept annotation task is a multi-label classification challenge whose goal is to analyse a collection of photos in order to detect the presence of one or more concepts. The number of selected concepts is 94, and their semantics cover a wide range: they include categories related to persons (e.g. baby, child, teenager, adult), animals (e.g. cat, dog, horse), and sentiments (e.g. unpleasant, euphoric). In addition to the images and the associated concepts, participants are provided with textual and visual features. Our approach to this task is to combine the outputs of visual concept detectors based on different visual descriptors. A detailed overview of the data set and the related task can be found in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
</sec>
    <sec id="sec-2">
      <title>Visual features</title>
      <p>
        For this task, a subset of the MIRFLICKR collection (http://press.liacs.nl/mirflickr/) has been used. This subset comprises 25 thousand images that have been manually annotated with a limited number of concepts. With respect to the previous editions of the competition, this year the annotation process has been carried out by resorting to crowdsourcing mechanisms. Several concepts have been reused from last year's task and, for most of these concepts, the remaining photos of the MIRFLICKR-25K collection that had not yet been used in the previous task have been annotated. In order to improve the annotation quality, all 25,000 images have also been re-annotated for several concepts, and all the images have been annotated for the new concepts. All images are accompanied by different kinds of features: textual and visual. Detailed information about the feature sets can be found in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
<p>
        In our approach, we focused on visual descriptors only. The visual descriptors proposed for this task are the following: SIFT, C-SIFT, RGB-SIFT, and opponent-SIFT. For each descriptor, the histogram of the occurrence frequencies has been extracted by using the Color Descriptors toolkit [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. As expected, the K-means clustering used to produce a "bag of visual words" representation is quite slow for the data sizes at hand, as clustering 250,000 points takes at least 12 hours per iteration. The solution usually proposed is to reduce the number of points to cluster. By default, the toolkit extracts 250,000 points regardless of the number of training images, thus automatically reducing the number of points per image as the number of images increases. This means that the toolkit extracts fewer than 17 points per image, thus losing dozens of descriptors per image. At the same time, if up to 200 points are extracted from each of the 15,000 training images, the K-means algorithm would have to cluster 3,000,000 points!
      </p>
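      <p>
        As a rough sketch of the point-budget trade-off just described (the helper names and the 200-point cap are our illustrative assumptions, not the toolkit's actual interface), capping the number of descriptors sampled per image bounds the clustering input, but with 15,000 images it still yields the 3,000,000-point pool mentioned above:
      </p>
      <preformat><![CDATA[
# Illustrative sketch: sample at most a fixed number of descriptors per
# image before clustering. `extract_descriptors` is a stand-in for whatever
# produces the per-image SIFT-like descriptors.
import numpy as np

MAX_POINTS_PER_IMAGE = 200  # assumed cap, taken from the discussion above

def sample_points(images, extract_descriptors, seed=0):
    rng = np.random.default_rng(seed)
    pools = []
    for img in images:
        d = extract_descriptors(img)  # (n_points, dim) array per image
        if len(d) > MAX_POINTS_PER_IMAGE:
            keep = rng.choice(len(d), MAX_POINTS_PER_IMAGE, replace=False)
            d = d[keep]
        pools.append(d)
    # e.g. 15,000 images x 200 points -> up to 3,000,000 rows to cluster
    return np.vstack(pools)
]]></preformat>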
<p>For these reasons, we decided to divide the 15,000 training images into four groups, retaining the same proportion of images per concept as in the whole training set. Then, for each group, we clustered around 750,000 points in order to obtain four different codebooks (one codebook for each descriptor) with 2,048 visual words. Each codebook has then been used to produce the Bag of Visual Words descriptors. This procedure allowed us to obtain a large vocabulary of "visual words" while reducing the number of points to cluster.</p>
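      <p>
        A minimal sketch of this grouping-and-clustering procedure follows; the round-robin split is only a cheap approximation of the concept-proportion-preserving split, and MiniBatchKMeans is our assumption, as the paper does not state which K-means implementation was used:
      </p>
      <preformat><![CDATA[
# Sketch: split the training images into 4 groups, build one 2,048-word
# codebook per group, and quantize descriptors into BoW histograms.
import numpy as np
from sklearn.cluster import MiniBatchKMeans  # assumed scalable K-means

def make_groups(n_images, n_groups=4, seed=0):
    # Approximation: shuffle and deal images round-robin into the groups,
    # which keeps group sizes (and, roughly, concept proportions) balanced.
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_images)
    return [order[g::n_groups] for g in range(n_groups)]

def build_codebook(points, n_words=2048, seed=0):
    # `points` is the ~750,000 x dim descriptor pool of one group.
    return MiniBatchKMeans(n_clusters=n_words, random_state=seed).fit(points)

def bow_histogram(descriptors, codebook):
    words = codebook.predict(descriptors)  # nearest visual word per point
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)  # L1-normalized BoW vector
]]></preformat>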
</sec>
    <sec id="sec-3">
      <title>Concept detection by dynamic combination of visual classifiers</title>
      <p>We submitted three runs in total. All of these runs are based only on the bag of visual words descriptors illustrated in the previous section.</p>
<p>
        For all the runs, we used the Multiple Classifier System paradigm, and the Support Vector Machine has been used as the base classifier for its good performance on various image classification tasks [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. We trained a single SVM for each global image descriptor and each visual concept. Thus, for each concept i, a set of four SVMs <inline-formula><tex-math><![CDATA[\{S_i^{sift}, S_i^{rgb}, S_i^{color}, S_i^{opponent}\}]]></tex-math></inline-formula> is available.
      </p>
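      <p>
        A hedged sketch of this SVM bank is shown below; the linear kernel, the regularization constant, and all variable names are our assumptions, as the paper does not report the SVM configuration:
      </p>
      <preformat><![CDATA[
# Sketch: one SVM per (concept, descriptor) pair. X[desc] is the
# (n_images, 2048) BoW matrix for a descriptor, Y the (n_images, 94)
# binary concept-label matrix; all names are illustrative.
from sklearn.svm import LinearSVC

DESCRIPTORS = ["sift", "rgb", "color", "opponent"]

def train_svm_bank(X, Y):
    bank = {}
    for desc in DESCRIPTORS:
        for i in range(Y.shape[1]):  # one binary problem per concept
            clf = LinearSVC(C=1.0)   # kernel and C are assumptions
            clf.fit(X[desc], Y[:, i])
            bank[(i, desc)] = clf
    return bank
]]></preformat>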
<p>
        We classified all the patterns of the test set by means of these sets of SVMs. Thus, for each test pattern <inline-formula><tex-math><![CDATA[x_j]]></tex-math></inline-formula>, we obtained as output the class decision <inline-formula><tex-math><![CDATA[d_{ij}^{k}]]></tex-math></inline-formula> taken by the classifier k (i.e., 1 if the pattern belongs to the concept i, 0 otherwise), and the distance from the decision border is transformed through a min-max normalization into a classification score <inline-formula><tex-math><![CDATA[s_{ij}^{k}]]></tex-math></inline-formula>, in the range [0, 1], of the test pattern with respect to the concept.
      </p>
      <p>We used the following three combination rules:</p>
      <list list-type="bullet">
        <list-item><p>the Mean rule;</p></list-item>
        <list-item><p>the Dynamic Score Selection by Majority Vote;</p></list-item>
        <list-item><p>the Dynamic Score Selection by Mean rule.</p></list-item>
      </list>
      <p>
        For the Mean rule, we computed the average of the classification scores obtained from the K = 4 classifiers [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]:
      </p>
      <disp-formula id="eq1">
        <label>(1)</label>
        <tex-math><![CDATA[s_{ij}^{mean} = \frac{1}{K} \sum_{k=1}^{K} s_{ij}^{k}]]></tex-math>
      </disp-formula>
      <p>
        In the case of the other two combination rules, we used the Dynamic Score Selection (DSC) approach [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which mixes the minimum and maximum of the individual scores through a weight <inline-formula><tex-math><![CDATA[w_{ij}]]></tex-math></inline-formula>:
      </p>
      <disp-formula id="eq2">
        <label>(2)</label>
        <tex-math><![CDATA[s_{ij}^{dsc} = (1 - w_{ij}) \min_k \{ s_{ij}^{k} \} + w_{ij} \max_k \{ s_{ij}^{k} \}]]></tex-math>
      </disp-formula>
      <p>
        This combination rule performs a dynamic combination at the score level, as it allows the best scores and weights to be chosen dynamically. In [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] different methods to compute the weights dynamically are proposed. In these runs, we used one of those methods, and one that has been specifically designed for the task at hand.
      </p>
      <p>The rule for computing <inline-formula><tex-math><![CDATA[w_{ij}]]></tex-math></inline-formula> for the Dynamic Score Selection by Majority Vote is the following:</p>
      <disp-formula id="eq3">
        <label>(3)</label>
        <tex-math><![CDATA[w_{ij} = \begin{cases} 1, & \text{if at least half of the } d_{ij}^{k} \text{ are equal to } 1 \\ 0, & \text{otherwise} \end{cases}]]></tex-math>
      </disp-formula>
      <p>The rule for computing <inline-formula><tex-math><![CDATA[w_{ij}]]></tex-math></inline-formula> for the Dynamic Score Selection by Mean rule is the following:</p>
      <disp-formula id="eq4">
        <label>(4)</label>
        <tex-math><![CDATA[w_{ij} = \frac{1}{K} \sum_{k=1}^{K} s_{ij}^{k}]]></tex-math>
      </disp-formula>
</sec>
    <sec id="sec-4">
      <title>Results and Discussion</title>
      <p>
        The performances (Interpolated Mean Average Precision (MiAP), Interpolated Geometric Mean Average Precision (GMiAP), and F1-measure over all concepts) attained by our runs are listed in Table 1, where they are compared to the performances obtained by the other teams that used visual features only. Detailed information about the evaluation process can be found in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
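      <p>
        For reference, a sketch of the per-concept evaluation under the common interpolated-AP convention is given below; the exact definitions used by the benchmark are those of [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], so this is only an approximation:
      </p>
      <preformat><![CDATA[
# Sketch: interpolated average precision per concept, then MiAP / GMiAP
# as the arithmetic and geometric means over all concepts.
import numpy as np

def average_precision(y_true, scores):
    order = np.argsort(-scores)            # rank images by decreasing score
    rel = np.asarray(y_true)[order]
    prec = np.cumsum(rel) / (np.arange(len(rel)) + 1.0)
    prec = np.maximum.accumulate(prec[::-1])[::-1]  # interpolation step
    return prec[rel == 1].mean()

def miap_gmiap(Y_true, S):                 # both arrays: (n_images, n_concepts)
    aps = np.array([average_precision(Y_true[:, i], S[:, i])
                    for i in range(Y_true.shape[1])])
    return aps.mean(), np.exp(np.log(aps + 1e-12).mean())
]]></preformat>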
<p>A first conclusion that can be drawn from Table 1 is that using a combination of general-purpose classifiers does not allow obtaining very satisfactory results, as we ranked just tenth out of thirteen participants.</p>
<p>The results also show that the Dynamic Score Selection by Majority Vote does not work as expected, as it is outperformed by the Dynamic Score Selection by Mean rule for the MiAP and GMiAP measures, and by the Mean rule for the F1-measure.</p>
      <p>[Fig. 1: Average Precision per concept for the three submitted runs (Dynamic Mean Rule, Mean rule, Dynamic Majority Vote) compared to the median of all submissions. The x-axis lists the 94 concepts, grouped by category (weather, combustion, lighting, scape, water, flora, fauna, quantity, age, relation, quality, style, view, setting, sentiment, and transport); the y-axis reports Average Precision from 0 to 1.]</p>
    </sec>
    <sec id="sec-2">
      <title>Dynamic"Mean"Rule"</title>
    </sec>
    <sec id="sec-3">
      <title>Mean"rule"</title>
    </sec>
    <sec id="sec-4">
      <title>Dynamic"Majority"Vote"</title>
    </sec>
    <sec id="sec-5">
      <title>MEDIAN"</title>
      <p>Conclusions
In our participation to the ImageCLEF photo annotation task, multiple visual
features has been used for representing the images. We combine the di erent
information using the Bag-of-Words model taking care that a number of image
descriptor big enough was used for each image. After the BoW extraction, we
combined the four feature spaces in three di erent ways. The evaluation results
showed that a simple combination of di erent feature spaces using classi ers not
speci cally designed for taking into account the big variety of concepts is not
able to reach satisfactory results.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Cristianini</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shawe-Taylor</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>An Introduction to Support Vector Machines and Other Kernel-based Learning Methods</article-title>
          . Cambridge University Press (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Kittler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hatef</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duin</surname>
            ,
            <given-names>R.P.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Matas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>On combining classifiers</article-title>
          .
          <source>IEEE Trans. Pattern Anal. Mach. Intell</source>
          .
          <volume>20</volume>
          (
          <issue>3</issue>
          ),
          <fpage>226</fpage>
          -
          <lpage>239</lpage>
          (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>van de Sande</surname>
            ,
            <given-names>K.E.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gevers</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Snoek</surname>
            ,
            <given-names>C.G.M.</given-names>
          </string-name>
          :
          <article-title>Evaluating color descriptors for object and scene recognition</article-title>
          .
          <source>IEEE Trans. Pattern Anal. Mach. Intell</source>
          .
          <volume>32</volume>
          (
          <issue>9</issue>
          ),
          <fpage>1582</fpage>
          -
          <lpage>1596</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Thomee</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Popescu</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Overview of the ImageCLEF 2012 Flickr photo annotation and retrieval task</article-title>
          .
          <source>CLEF 2012 working notes</source>
          , Rome, Italy (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Tronci</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giacinto</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roli</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Dynamic score combination: A supervised and unsupervised score combination method</article-title>
          .
          <source>Machine Learning and Data Mining in Pattern Recognition</source>
          <volume>5632</volume>
          ,
          <fpage>163</fpage>
          -
          <lpage>177</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>