<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SocialSensor: Finding Diverse Images at MediaEval 2014</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Eleftherios Spyromitros-Xioufis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Symeon Papadopoulos</string-name>
          <email>papadop@iti.gr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yiannis Kompatsiaris</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ioannis Vlahavas</string-name>
          <email>vlahavas@csd.auth.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Aristotle University of Thessaloniki</institution>
          ,
          <addr-line>Thessaloniki</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Information Technologies Institute, CERTH</institution>
          ,
          <addr-line>Thessaloniki</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>16</fpage>
      <lpage>17</lpage>
      <abstract>
<p>This paper describes the participation of the SocialSensor team in the Retrieving Diverse Social Images Task of MediaEval 2014. All our entries are produced by a different instantiation (set of features, parameter configuration) of the same diversification algorithm, which optimizes a joint relevance-diversity criterion. All our runs are automated and use only resources provided by the task organizers. Our best results in terms of the official ranking metric (F1@20 of 0.59) came from the runs that combine visual and textual information, followed by the visual-only run.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The Retrieving Diverse Social Images task of MediaEval
2014 deals with the problem of result diversification in
social photo retrieval. Participants are given a list of images
retrieved from Flickr in response to a query for a specific
location, e.g., "Eiffel Tower", and are asked to return a refined
short-list that contains images which are at the same time
relevant and diverse (see [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] for more details).
      </p>
      <p>
        To deal with this problem, we build upon the approach
that we developed for the visual-only run of the previous year's
task [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], termed Relevance and Diversity (ReDiv) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. For
this year's task, the ReDiv approach was refined and used
to produce all our runs. Section 2 describes the ReDiv
approach and Section 3 details the different instantiations of
the approach used to produce each of the submitted runs.
Finally, in Section 4 we briefly summarize and discuss our
experimental results.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. OVERVIEW OF OUR APPROACH</title>
      <p>
        Let I = {im_1, …, im_N} be a set of images that have been
retrieved from Flickr in response to a query q for a specific
location. The goal of the diversification algorithm is to select
a K-sized subset of images from I that are as relevant (to
the query location) and as diverse (among each other) as
possible. ReDiv formalizes this verbal description as the
following optimization problem: argmax_{S ⊆ I, |S| = K} U(S) =
w·R(S|q) + (1 − w)·D(S), where we want to identify the set
S that has maximum utility U(S), defined as a weighted
combination of the relevance R(S|q) and the diversity D(S)
of S. A similar formulation of the problem was used in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
In ReDiv, however, we use different definitions for R(S|q)
and D(S) that we found more suitable for this task. These
changes are described below.
      </p>
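      <p>As a concrete illustration of the objective, the following sketch scores a candidate set under the joint criterion (the function names and toy inputs are ours, not from the paper; D(S) uses the paper's minimum-pairwise definition given below):</p>

```python
def utility(S, relevance, dissim, w):
    """U(S) = w * R(S|q) + (1 - w) * D(S).

    relevance: per-image relevance scores R(im_i|q), indexed by image id.
    dissim: dict mapping unordered image pairs to dissimilarities d(im_i, im_j).
    """
    R = sum(relevance[i] for i in S)           # R(S|q): summed relevance
    D = min(dissim[frozenset((i, j))]          # D(S): most similar pair in S
            for i in S for j in S if i < j)
    return w * R + (1 - w) * D


# Toy example with three images; a higher dissimilarity means a more diverse pair.
relevance = [0.9, 0.5, 0.4]
dissim = {frozenset((0, 1)): 0.1, frozenset((0, 2)): 0.8, frozenset((1, 2)): 0.7}
print(utility([0, 2], relevance, dissim, w=0.5))  # 0.5*1.3 + 0.5*0.8 = 1.05
```

      <p>With w = 0.5, the set {0, 2} beats {0, 1} even though image 1 is more relevant than image 2, because images 0 and 1 are near-duplicates.</p>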
      <p>
        Relevance: In [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], the authors define relevance as R(S|q)
= Σ_{im_i ∈ S} R(im_i|q), where R(im_i|q) = 1 − d(im_i, im_q) and
d(im_i, im_q) denotes the dissimilarity between image im_i and
the image that depicts the query location, im_q. We observed
that, in the context of this task, this definition can be
problematic (especially when using only visual information), as
there are several images that are visually dissimilar to the
reference Wikipedia images of the location but are still
considered relevant to the location, e.g., inside views. Also, in
many cases, images that are similar to the reference images
are considered irrelevant to the location because people
appear in the image. Motivated by these shortcomings,
we developed a more principled way of computing the
relevance of each image to the query location. This is achieved
by building a supervised classification model, distinct for
each location, that is trained to distinguish relevant from
irrelevant images. More specifically, we use the
probabilistic output of this model in place of R(im_i|q). To train
this model, we use the relevance ground truth provided by
the task organizers for the development set locations and
use relevant/irrelevant images of other locations as
positive/negative examples. Additionally, the Wikipedia images
of each location are used as positive (relevant) examples and
are assigned a large weight.
      </p>
      <p>
        Diversity: Assuming a ranking im_{r_1}, …, im_{r_K} of the
images in S, the authors in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] define diversity as D(S) =
Σ_{i=1}^{K} (1/i) Σ_{j=1}^{i} d(im_{r_i}, im_{r_j}), where d(im_{r_i}, im_{r_j}) is the
dissimilarity between the images ranked at positions i and j.
Thus, high diversity scores are given to image sets with a
high average dissimilarity. We notice that this definition of
diversity can assign relatively high diversity scores to
image sets containing highly similar image pairs
(probably belonging to the same cluster), which has
a direct negative impact on the CR@20 measure and,
consequently, on F1@20. Therefore, we adopt a stricter
definition of diversity, where the diversity of a set S is defined as
the dissimilarity between the most similar pair of images in
S: D(S) = min_{im_i, im_j ∈ S, i ≠ j} d(im_i, im_j).
      </p>
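      <p>To see why the stricter definition matters, the following sketch (a toy example of ours, not from the paper) contrasts the rank-weighted average diversity of [2] with the minimum-pairwise definition on a set containing a near-duplicate pair:</p>

```python
def diversity_avg(ranking, dissim):
    """D(S) = sum_{i=1}^{K} (1/i) * sum_{j=1}^{i} d(im_ri, im_rj), as in [2]."""
    K = len(ranking)
    return sum((1.0 / i) * sum(dissim[ranking[i - 1]][ranking[j - 1]]
                               for j in range(1, i + 1))
               for i in range(1, K + 1))

def diversity_min(S, dissim):
    """ReDiv's definition: dissimilarity of the most similar pair in S."""
    return min(dissim[i][j] for i in S for j in S if i != j)

# Images 0 and 1 are near-duplicates (d = 0.05); image 2 is distinct.
dissim = [[0.0, 0.05, 0.9],
          [0.05, 0.0, 0.9],
          [0.9, 0.9, 0.0]]
print(diversity_avg([0, 1, 2], dissim))  # high average despite the duplicates
print(diversity_min([0, 1, 2], dissim))  # 0.05: the duplicate pair is exposed
```

      <p>The average-based score stays high because the two distant pairs mask the duplicate pair; the minimum-based score does not.</p>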
      <p>
        Optimization: Exact optimization of U has a
high complexity, as it would require computing the utility of
all N!/(K!(N − K)!) K-subsets of I. With N ≈ 300 and K = 20 (in
order to maximize F1@20), the computational cost of exact
optimization becomes prohibitive. We therefore adopt the
greedy, approximate optimization approach that was used
in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], with appropriate changes to reflect our new
definitions of relevance and diversity. This algorithm starts with
an empty set S and sequentially expands it by adding, at
each step J = 1, …, K, the image im* that scores highest
(among the unselected images) according to the following criterion:
U(im*) = w·R(im*) + (1 − w)·min_{im_j ∈ S_{J−1}} d(im*, im_j), where
S_{J−1} represents S at step J − 1. We also developed a less
greedy version of this algorithm that at each step J keeps the M
highest scoring image subsets. Since the two algorithms
coincide for M = 1, we used the less greedy version and tuned
the M parameter.
      </p>
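      <p>The greedy expansion described above can be sketched as follows (a minimal sketch under our own assumptions, not the authors' code: the first image is chosen purely by relevance, since S is empty at step J = 1, and ties are broken by image index):</p>

```python
def rediv_greedy(relevance, dissim, K, w):
    """Greedily build S, adding at each step the unselected image im* that
    maximizes U(im*) = w * R(im*) + (1 - w) * min_{imj in S} d(im*, imj)."""
    N = len(relevance)
    S = [max(range(N), key=lambda i: relevance[i])]  # step J = 1: S is empty
    while len(S) < K:
        def score(im):
            return w * relevance[im] + (1 - w) * min(dissim[im][j] for j in S)
        best = max((im for im in range(N) if im not in S), key=score)
        S.append(best)
    return S

relevance = [1.0, 0.9, 0.2]
dissim = [[0.0, 0.05, 0.8],
          [0.05, 0.0, 0.8],
          [0.8, 0.8, 0.0]]
print(rediv_greedy(relevance, dissim, K=2, w=0.5))  # [0, 2]
```

      <p>The less greedy variant with M &gt; 1 would instead keep the M best partial sets at each step, in the manner of a beam search; for M = 1 it reduces to the code above.</p>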
      <p>Experimental Protocol: Depending on the type of the
run (visual/textual/both), a variety of different (vector)
representations of the images could be utilized for building the
relevance detection models and computing pairwise image
similarities in ReDiv (note that the algorithm allows using
different representations for relevance and diversity). To
reduce the complexity of the experiments, we first evaluated
each representation in terms of its relevance detection ability
and then evaluated combinations of only the top-performing
representations in the ReDiv algorithm. To judge the
effectiveness of each representation in terms of relevance
detection and to perform model selection, we used a variant of
leave-one(-location)-out cross-validation and measured
performance via the area under the ROC curve (AUC). As the
classification algorithm we used L2-regularized logistic regression, as it led
to near-optimal results for a variety of representations in
preliminary experiments.</p>
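      <p>For reference, the AUC used for model selection can be computed from scored examples as follows (a minimal rank-based sketch of the standard metric, not the authors' code):</p>

```python
def auc(scores, labels):
    """Area under the ROC curve: the probability that a randomly chosen
    positive example is scored above a randomly chosen negative one
    (ties count as half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # 1.0: perfect separation
```
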
      <p>Given an instantiation of the ReDiv approach (a specific
combination of relevance detection model and diversity
features), we performed leave-one(-location)-out
cross-validation and evaluated the performance of each instantiation
in terms of F1@20. The process was repeated for different
values of w in the [0, 1] range. We also noticed that using
only the n &lt; N most relevant images (according to the
relevance detection model) leads to improved performance. We
therefore also performed a coarse search over the domain
{1, 2, …, 300} in order to find an optimal value of n.
Finally, we tested the values {1, 2, 3, 4, 5} for the M parameter.</p>
    </sec>
    <sec id="sec-3">
      <title>3.1 Visual (Run 1)</title>
      <p>
        For this run we experimented with all the precomputed
visual features made available by the task organizers and
also extracted our own visual features. The best results were
obtained using VLAD+CSURF [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] vectors (computed from
a 128-dimensional visual vocabulary and projected to 128
dimensions with PCA and whitening) for both the relevance
and the diversity component. Cosine distance was used as the
dissimilarity measure. The parameters used to produce the
1st run are: w = 0.4, n = 75, M = 3.
      </p>
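      <p>The cosine dissimilarity used as the dissimilarity measure can be sketched as follows (our own minimal implementation of the standard definition):</p>

```python
import math

def cosine_dissim(a, b):
    """d(a, b) = 1 - (a . b) / (||a|| * ||b||); 0 for identical directions."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

print(cosine_dissim([1.0, 0.0], [0.0, 1.0]))  # 1.0: orthogonal vectors
print(cosine_dissim([1.0, 0.0], [2.0, 0.0]))  # 0.0: same direction
```
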
    </sec>
    <sec id="sec-4">
      <title>3.2 Textual (Run 2)</title>
      <p>A bag-of-words representation with the 20K/7.5K most
frequent words was used for the relevance/diversity
component. Wikipedia images were represented using a parsed
version of the corresponding Wikipedia page, and Flickr
images by a concatenation of the words in their titles (×3),
descriptions (×2) and tags (×1). Again, cosine distance was
used as the dissimilarity measure. The parameters used to
produce the 2nd run are: w = 0.95, n = 110, M = 1.</p>
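      <p>The field-weighted concatenation can be sketched as follows (a minimal sketch of ours, assuming the weights indicated above mean that title tokens are counted three times, description tokens twice, and tag tokens once; the function name is hypothetical):</p>

```python
from collections import Counter

def flickr_bow(title, description, tags, weights=(3, 2, 1)):
    """Weighted bag-of-words over the three Flickr text fields."""
    counts = Counter()
    for text, w in zip((title, description, tags), weights):
        for token in text.lower().split():
            counts[token] += w  # each occurrence contributes the field weight
    return counts

bow = flickr_bow("Eiffel Tower", "tower at night", "paris tower")
print(bow["tower"])  # 3 (title) + 2 (description) + 1 (tags) = 6
```
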
    </sec>
    <sec id="sec-5">
      <title>3.3 Visual+Textual (Runs 3 &amp; 5)</title>
      <p>An early fusion of the visual and textual features described
above was used for the relevance component, and the visual
features described above were used for the diversity
component. The parameters used to produce the 3rd run are:
w = 0.75, n = 90, M = 5. The 5th run differs from the 3rd
run only in the value used for n (= 95).</p>
    </sec>
    <sec id="sec-6">
      <title>4. RESULTS AND DISCUSSION</title>
    </sec>
    <sec id="sec-7">
      <title>5. ACKNOWLEDGEMENTS</title>
      <p>This work is supported by the SocialSensor FP7 project,
partially funded by the EC under contract number 287975.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] D. Corney, C. Martin, A. Goker, E. Spyromitros-Xioufis, S. Papadopoulos, Y. Kompatsiaris, L. Aiello, and B. Thomee. SocialSensor: Finding diverse images at MediaEval 2013. In MediaEval, <year>2013</year>.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] T. Deselaers, T. Gass, P. Dreuw, and H. Ney. Jointly optimising relevance and diversity in image retrieval. In <source>ACM CIVR '09</source>, New York, USA, <year>2009</year>.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] B. Ionescu, M. Menendez, H. Müller, and A. Popescu. Retrieving diverse social images at MediaEval 2013: Objectives, dataset and evaluation. In MediaEval, <year>2013</year>.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] B. Ionescu, A. Popescu, M. Lupu, A. Gînscă, and H. Müller. Retrieving diverse social images at MediaEval 2014: Challenge, dataset and evaluation. In MediaEval, <year>2014</year>.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] E. Spyromitros-Xioufis, S. Papadopoulos, I. Kompatsiaris, G. Tsoumakas, and I. Vlahavas. A comprehensive study over VLAD and product quantization in large-scale image retrieval. <source>IEEE Transactions on Multimedia</source>, <year>2014</year>.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>