=Paper=
{{Paper
|id=Vol-1263/paper36
|storemode=property
|title=SocialSensor: Finding Diverse Images at MediaEval 2014
|pdfUrl=https://ceur-ws.org/Vol-1263/mediaeval2014_submission_36.pdf
|volume=Vol-1263
|dblpUrl=https://dblp.org/rec/conf/mediaeval/XioufisPKV14
}}
==SocialSensor: Finding Diverse Images at MediaEval 2014==
Eleftherios Spyromitros-Xioufis¹﹐², Symeon Papadopoulos¹, Yiannis Kompatsiaris¹, Ioannis Vlahavas²
¹ Information Technologies Institute, CERTH, Thessaloniki, Greece
² Aristotle University of Thessaloniki, Thessaloniki, Greece
{espyromi,papadop,ikom}@iti.gr, vlahavas@csd.auth.gr

ABSTRACT

This paper describes the participation of the SocialSensor team in the Retrieving Diverse Social Images Task of MediaEval 2014. All our entries are produced by a different instantiation (set of features, parameter configuration) of the same diversification algorithm, which optimizes a joint relevance-diversity criterion. All our runs are automated and use only resources provided by the task organizers. Our best results in terms of the official ranking metric (F1@20 ≈ 0.59) came from the runs that combine visual and textual information, followed by the visual-only run.

1. INTRODUCTION

The Retrieving Diverse Social Images task of MediaEval 2014 deals with the problem of result diversification in social photo retrieval. Participants are given a list of images retrieved from Flickr in response to a query for a specific location, e.g., "Eiffel Tower", and are asked to return a refined short-list of images that are at the same time relevant and diverse (see [4] for more details).

To deal with this problem we build upon the approach that we developed for the visual-only run of the previous year's task [3], termed Relevance and Diversity (ReDiv) [1]. For this year's task, the ReDiv approach was refined and used to produce all our runs. Section 2 describes the ReDiv approach and Section 3 details the different instantiations of the approach used to produce each of the submitted runs. Finally, in Section 4 we briefly summarize and discuss our experimental results.

2. OVERVIEW OF OUR APPROACH

Let I = {im_1, ..., im_N} be a set of images that have been retrieved from Flickr in response to a query q for a specific location. The goal of the diversification algorithm is to select a K-sized subset of images from I that are as relevant (to the query location) and as diverse (among each other) as possible. ReDiv formalizes this verbal description as the optimization problem \arg\max_{S \subset I, |S| = K} U(S) = w R(S|q) + (1 - w) D(S), where we want to identify the set S with maximum utility U(S), defined as a weighted combination of the relevance R(S|q) and the diversity D(S) of S. A similar formulation of the problem was used in [2]. In ReDiv, however, we use different definitions for R(S|q) and D(S) that we found more suitable for this task. These changes are described below.

Relevance: In [2], the authors define relevance as R(S|q) = \sum_{im_i \in S} R(im_i|q), where R(im_i|q) = 1 - d(im_i, im_q) and d(im_i, im_q) denotes the dissimilarity between image im_i and the image im_q that depicts the query location. We observed that, in the context of this task, this definition can be problematic (especially when using only visual information): several images that are visually dissimilar to the reference Wikipedia images of the location (e.g., inside views) are still considered relevant to it, while in many cases images that are similar to the reference images are considered irrelevant because people appear in them. Motivated by these shortcomings, we developed a more principled way of computing the relevance of each image to the query location. This is achieved by building a supervised classification model (distinct for each location) that is trained to distinguish relevant from irrelevant images; the probabilistic output of this model is used in place of R(im_i|q). To train the model, we use the relevance ground truth provided by the task organizers for the development set locations, taking relevant/irrelevant images of other locations as positive/negative examples. Additionally, the Wikipedia images of each location are used as positive (relevant) examples and are assigned a large weight.
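For illustration, the following Python sketch shows one way the per-location relevance model described above could be set up: an L2-regularized logistic regression classifier (the algorithm named in the experimental protocol below) trained on ground-truth images of other locations plus the location's Wikipedia images, with the latter up-weighted. Variable names, the weight value, and the regularization settings are assumptions for the sketch, not the paper's exact configuration.

```python
# Illustrative sketch of the per-location relevance model (not the authors' exact code).
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_relevance_model(X_other, y_other, X_wiki, wiki_weight=10.0):
    """Train a relevance classifier for one query location.

    X_other, y_other: features / relevant(1)-irrelevant(0) labels of images of OTHER locations.
    X_wiki: features of the location's Wikipedia reference images (treated as relevant).
    wiki_weight: hypothetical large weight assigned to the Wikipedia positives.
    """
    X = np.vstack([X_other, X_wiki])
    y = np.concatenate([y_other, np.ones(len(X_wiki))])
    weights = np.concatenate([np.ones(len(X_other)), np.full(len(X_wiki), wiki_weight)])
    clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
    clf.fit(X, y, sample_weight=weights)
    return clf

def relevance_scores(clf, X_query_images):
    # Probabilistic output used in place of R(im_i | q).
    return clf.predict_proba(X_query_images)[:, 1]
```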
Diversity: Assuming a ranking im_{r_1}, ..., im_{r_K} of the images in S, the authors in [2] define diversity as D(S) = \sum_{i=1}^{K} \frac{1}{i} \sum_{j=1}^{i} d(im_{r_i}, im_{r_j}), where d(im_{r_i}, im_{r_j}) is the dissimilarity between the images ranked at positions i and j. Thus, high diversity scores are given to image sets with a high average dissimilarity. We notice that this definition can assign relatively high diversity scores to image sets that contain highly similar image pairs (probably belonging to the same cluster), which has a direct negative impact on the CR@20 measure and consequently on F1@20. Therefore, we adopt a stricter definition of diversity, where the diversity of a set S is the dissimilarity between its most similar pair of images: D(S) = \min_{im_i, im_j \in S, i \neq j} d(im_i, im_j).

Optimization: Exact optimization of U has a high complexity, as it would require computing the utility of all \frac{N!}{K!(N-K)!} K-subsets of I. With N ≈ 300 and K = 20 (chosen to maximize F1@20), the computational cost of exact optimization becomes prohibitive. We therefore adopt the greedy, approximate optimization approach that was used in [2], with appropriate changes to reflect our new definitions of relevance and diversity. The algorithm starts with an empty set S and sequentially expands it by adding, at each step J = 1, ..., K, the image im^* that scores highest (among the unselected images) according to the criterion U(im^*) = w R(im^*) + (1 - w) \min_{im_j \in S^{J-1}} d(im^*, im_j), where S^{J-1} denotes S at step J - 1. We also developed a less greedy version of this algorithm that keeps the M highest scoring image subsets at each step J. Since the two algorithms coincide for M = 1, we used the less greedy version and tuned the M parameter.
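A minimal Python sketch of the greedy selection criterion defined above (the M = 1 case) is given below; the beam-like variant that keeps the M best subsets per step, and the restriction to the n most relevant images described in the experimental protocol, are omitted. Treating the diversity term as maximal when S is still empty is an assumption of the sketch.

```python
# Sketch of the greedy ReDiv selection (M = 1 case); illustrative, not the authors' exact code.
import numpy as np

def rediv_greedy(relevance, dissim, K=20, w=0.5):
    """Select K images maximizing w*R(im) + (1-w) * min_{im_j in S} d(im, im_j).

    relevance: per-image relevance scores R(im_i) in [0, 1].
    dissim: NxN matrix of pairwise dissimilarities d(im_i, im_j).
    """
    n = len(relevance)
    selected = []
    remaining = set(range(n))
    for _ in range(min(K, n)):
        best, best_utility = None, -np.inf
        for i in remaining:
            # Diversity term: distance to the closest already-selected image
            # (assumed maximal, here 1.0, while S is empty).
            div = min((dissim[i, j] for j in selected), default=1.0)
            utility = w * relevance[i] + (1 - w) * div
            if utility > best_utility:
                best, best_utility = i, utility
        selected.append(best)
        remaining.remove(best)
    return selected
```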
Experimental Protocol: Depending on the type of the run (visual/textual/both), a variety of different (vector) representations of the images could be utilized for building the relevance detection models and for computing pairwise image similarities in ReDiv (note that the algorithm allows using different representations for relevance and for diversity). To reduce the complexity of the experiments, we first evaluated each representation in terms of its relevance detection ability and then evaluated, within the ReDiv algorithm, combinations of only the top performing representations. To judge the effectiveness of each representation for relevance detection and to perform model selection, we used a variant of leave-one(-location)-out cross-validation and measured performance via the area under the ROC curve (AUC). As classification algorithm we used L2-regularized logistic regression, as it led to near-optimal results for a variety of representations in preliminary experiments.

Given an instantiation of the ReDiv approach (a specific combination of relevance detection model and diversity features), we performed leave-one(-location)-out cross-validation and evaluated the performance of each instantiation in terms of F1@20. The process was repeated for different values of w in the [0, 1] range. We also noticed that using only the n < N most relevant images (according to the relevance detection model) leads to improved performance; we therefore also performed a coarse search over {1, 2, ..., N} to find an optimal value for n. Finally, we tested the values {1, 2, 3, 4, 5} for the M parameter.

3. INSTANTIATIONS

3.1 Visual (Run 1)

For this run we experimented with all the precomputed visual features made available by the task organizers and also extracted our own visual features. The best results were obtained using VLAD+CSURF [5] vectors (computed from a 128-dimensional visual vocabulary and projected to 128 dimensions with PCA and whitening) for both the relevance and the diversity component. Cosine distance was used as the dissimilarity measure. The parameters used to produce the 1st run are: w = 0.4, n = 75, M = 3.

3.2 Textual (Run 2)

A bag-of-words representation with the 20K/7.5K most frequent words was used for the relevance/diversity component. Wikipedia images were represented using a parsed version of the corresponding Wikipedia page, and Flickr images by a concatenation of the words in their titles (×3), description (×2) and tags (×1). Again, cosine distance was used as the dissimilarity measure. The parameters used to produce the 2nd run are: w = 0.95, n = 110, M = 1.
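The short sketch below shows one possible reading of the textual representation of Flickr images used in Run 2: field repetition to realize the ×3/×2/×1 weighting of title/description/tags, a bag-of-words vocabulary, and cosine distances fed to ReDiv. The example metadata, the preprocessing, and the vocabulary construction are simplifications assumed for illustration.

```python
# Rough sketch of the Run 2 textual representation (weights from the paper; the rest is assumed).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_distances

def flickr_text(title, description, tags):
    # Repeat fields to realize the x3 / x2 / x1 weighting of title / description / tags.
    return " ".join([title] * 3 + [description] * 2 + [" ".join(tags)])

# Placeholder metadata: (title, description, tags) per Flickr image.
flickr_metadata = [
    ("eiffel tower at night", "long exposure from trocadero", ["paris", "night"]),
    ("tour eiffel", "view from the seine", ["paris", "river"]),
]

docs = [flickr_text(t, d, g) for t, d, g in flickr_metadata]
vectorizer = CountVectorizer(max_features=20000)  # 20K vocabulary for the relevance component
X = vectorizer.fit_transform(docs)
D = cosine_distances(X)  # pairwise dissimilarities used by ReDiv
```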
3.3 Visual+Textual (Runs 3 & 5)

An early fusion of the visual and textual features described above was used for the relevance component, and the visual features described above were used for the diversity component. The parameters used to produce the 3rd run are: w = 0.75, n = 90, M = 5. The 5th run differs from the 3rd run only in the value used for n (= 95).

4. RESULTS AND DISCUSSION

Table 1 shows the performance of the submitted runs on the test locations, as well as their estimated performance using our internal evaluation procedure. We observe that in all cases we overestimated the final performance, nevertheless by a small and approximately constant (across runs) margin. Most importantly, the relative ranking of the runs is the same, suggesting that our model selection procedure was appropriate. The best performance was obtained by the two variations of our visual+textual run, followed by the visual-only run. Despite the comparatively lower performance obtained using only textual features, a significant performance boost was possible by combining them with visual features for relevance detection.

Table 1: Performance of the submitted runs (Development Set and official Test Set).

| Run | Dev P@20 | Dev CR@20 | Dev F1@20 | Test P@20 | Test CR@20 | Test F1@20 |
|-----|----------|-----------|-----------|-----------|------------|------------|
| 1   | 0.815    | 0.497     | 0.609     | 0.775     | 0.460      | 0.569      |
| 2   | 0.863    | 0.468     | 0.599     | 0.832     | 0.407      | 0.538      |
| 3   | 0.855    | 0.521     | 0.642     | 0.817     | 0.473      | 0.593      |
| 5   | 0.857    | 0.527     | 0.647     | 0.815     | 0.475      | 0.594      |

In the future we plan to develop a more principled approach for combining different types of features in the ReDiv algorithm, especially for the diversity component.

5. ACKNOWLEDGEMENTS

This work is supported by the SocialSensor FP7 project, partially funded by the EC under contract number 287975.

Copyright is held by the author/owner(s). MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain.

6. REFERENCES

[1] D. Corney, C. Martin, A. Göker, E. Spyromitros-Xioufis, S. Papadopoulos, Y. Kompatsiaris, L. Aiello, and B. Thomee. SocialSensor: Finding diverse images at MediaEval 2013. In MediaEval, 2013.
[2] T. Deselaers, T. Gass, P. Dreuw, and H. Ney. Jointly optimising relevance and diversity in image retrieval. In ACM CIVR '09, New York, USA, 2009.
[3] B. Ionescu, M. Menéndez, H. Müller, and A. Popescu. Retrieving diverse social images at MediaEval 2013: Objectives, dataset and evaluation. In MediaEval, 2013.
[4] B. Ionescu, A. Popescu, M. Lupu, A. Gînscă, and H. Müller. Retrieving diverse social images at MediaEval 2014: Challenge, dataset and evaluation. In MediaEval, 2014.
[5] E. Spyromitros-Xioufis, S. Papadopoulos, I. Kompatsiaris, G. Tsoumakas, and I. Vlahavas. A comprehensive study over VLAD and product quantization in large-scale image retrieval. IEEE Transactions on Multimedia, 2014.