  SocialSensor: Finding Diverse Images at MediaEval 2014

                 Eleftherios Spyromitros-Xioufis¹,², Symeon Papadopoulos¹,
                          Yiannis Kompatsiaris¹, Ioannis Vlahavas²
              ¹ Information Technologies Institute, CERTH, Thessaloniki, Greece
                  ² Aristotle University of Thessaloniki, Thessaloniki, Greece
                    {espyromi,papadop,ikom}@iti.gr, vlahavas@csd.auth.gr



ABSTRACT
This paper describes the participation of the SocialSensor team in the Retrieving Diverse Social Images Task of MediaEval 2014. All our entries are produced by a different instantiation (set of features, parameter configuration) of the same diversification algorithm, which optimizes a joint relevance-diversity criterion. All our runs are automated and use only resources provided by the task organizers. Our best results in terms of the official ranking metric (F1@20 ≈ 0.59) came from the runs that combine visual and textual information, followed by the visual-only run.

Copyright is held by the author/owner(s).
MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain

1. INTRODUCTION
The Retrieving Diverse Social Images task of MediaEval 2014 deals with the problem of result diversification in social photo retrieval. Participants are given a list of images retrieved from Flickr in response to a query for a specific location, e.g., "Eiffel Tower", and are asked to return a refined short-list that contains images which are at the same time relevant and diverse (see [4] for more details).

To deal with this problem we build upon the approach that we developed for the visual-only run of the previous year's task [3], termed Relevance and Diversity (ReDiv) [1]. For this year's task, the ReDiv approach was refined and used to produce all our runs. Section 2 describes the ReDiv approach and Section 3 details the different instantiations of the approach used to produce each of the submitted runs. Finally, in Section 4 we briefly summarize and discuss our experimental results.

2. OVERVIEW OF OUR APPROACH
Let I = {im_1, ..., im_N} be a set of images that have been retrieved from Flickr in response to a query q for a specific location. The goal of the diversification algorithm is to select a K-sized subset of images from I that are as relevant (to the query location) and as diverse (among each other) as possible. ReDiv formalizes this verbal description into the following optimization problem: arg max_{S ⊂ I, |S| = K} U(S) = wR(S|q) + (1 − w)D(S), where we want to identify the set S that has maximum utility U(S), defined as a weighted combination of the relevance R(S|q) and the diversity D(S) of S. A similar formulation of the problem was used in [2]. In ReDiv, however, we use different definitions for R(S|q) and D(S) that we found more suitable for this task. These changes are described below.

Relevance: In [2], the authors define relevance as R(S|q) = Σ_{im_i ∈ S} R(im_i|q), where R(im_i|q) = 1 − d(im_i, im_q) and d(im_i, im_q) denotes the dissimilarity between image im_i and the image im_q that depicts the query location. We observed that, in the context of this task, this definition can be problematic (especially when using only visual information), as there are several images that are visually dissimilar to the reference Wikipedia images of the location but are still considered relevant to it, e.g., inside views. Conversely, in many cases, images that are similar to the reference images are considered irrelevant to the location because people occupy part of the image. Motivated by these shortcomings, we developed a more principled way of computing the relevance of each image to the query location. This is achieved by building a (distinct for each location) supervised classification model that is trained to distinguish relevant from irrelevant images. More specifically, we use the probabilistic output of this model in place of R(im_i|q). To train this model, we use the relevance ground truth provided by the task organizers for the development set locations, using relevant/irrelevant images of other locations as positive/negative examples. Additionally, the Wikipedia images of each location are used as positive (relevant) examples and are assigned a large weight.

Diversity: Assuming a ranking im_{r_1}, ..., im_{r_K} of the images in S, the authors in [2] define diversity as D(S) = Σ_{i=1}^{K} (1/i) Σ_{j=1}^{i} d(im_{r_i}, im_{r_j}), where d(im_{r_i}, im_{r_j}) is the dissimilarity between the images ranked at positions i and j. Thus, high diversity scores are given to image sets with a high average dissimilarity. We notice that this definition of diversity can assign relatively high diversity scores to image sets containing highly similar image pairs (probably belonging to the same cluster), which has a direct negative impact on the CR@20 measure and consequently on F1@20. Therefore, we adopt a stricter definition of diversity, where the diversity of a set S is defined as the dissimilarity between the most similar pair of images in S: D(S) = min_{im_i, im_j ∈ S, i ≠ j} d(im_i, im_j).

Optimization: An exact optimization of U comes with a high complexity, as it would require computing the utility of all N! / (K!(N − K)!) K-subsets of I. With N ≈ 300 and K = 20 (in order to maximize F1@20), the computational cost of exact optimization becomes prohibitive. We therefore adopt the greedy, approximate optimization approach that was used in [2], with appropriate changes to reflect our new definitions for relevance and diversity. This algorithm starts with an empty set S and sequentially expands it by adding at each step J = 1, ..., K the image im* that scores highest (among the unselected images) according to the following criterion: U(im*) = wR(im*) + (1 − w) min_{im_j ∈ S^{J−1}} d(im*, im_j), where S^{J−1} represents S at step J − 1. We also developed a less greedy version of this algorithm that at each step J keeps the M highest scoring image subsets. Since the two algorithms coincide for M = 1, we used the less greedy version and tuned the M parameter.

Table 1: Performance of the submitted runs.
           Development Set              Test Set (official)
Run    P@20   CR@20   F1@20        P@20   CR@20   F1@20
1      0.815  0.497   0.609        0.775  0.460   0.569
2      0.863  0.468   0.599        0.832  0.407   0.538
3      0.855  0.521   0.642        0.817  0.473   0.593
5      0.857  0.527   0.647        0.815  0.475   0.594
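The greedy variant (M = 1) of the selection procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the relevance scores stand in for the per-location classifier's probabilistic output, the dist matrix holds pairwise dissimilarities (e.g., cosine distances), and seeding the selection with the single most relevant image is an assumption of this sketch, since the paper does not state how the first image is chosen.

```python
def rediv_greedy(relevance, dist, K, w):
    """Greedy ReDiv selection (the M = 1 case).

    relevance: list of per-image relevance scores in [0, 1]
               (standing in for the classifier's probabilistic output).
    dist:      N x N symmetric pairwise dissimilarity matrix.
    K:         number of images to select.
    w:         relevance/diversity trade-off weight.
    """
    N = len(relevance)
    # Assumption of this sketch: seed with the most relevant image,
    # since the min-distance diversity term is undefined for an empty set.
    selected = [max(range(N), key=lambda i: relevance[i])]
    while len(selected) < min(K, N):
        remaining = [i for i in range(N) if i not in selected]
        # U(im*) = w * R(im*) + (1 - w) * min_{im_j in S} d(im*, im_j)
        best = max(
            remaining,
            key=lambda i: w * relevance[i]
            + (1 - w) * min(dist[i][j] for j in selected),
        )
        selected.append(best)
    return selected
```

The less greedy version described in the text would instead keep the M best partial subsets at every step (a beam search over subsets); with M = 1 it reduces to the single loop above.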
Experimental Protocol: Depending on the type of the run (visual/textual/both), a variety of different (vector) representations of the images could be utilized for building the relevance detection models and computing pairwise image similarities in ReDiv (note that the algorithm allows using different representations for relevance and diversity). To reduce the complexity of the experiments, we first evaluated each representation in terms of its relevance detection ability and then evaluated combinations of only the top performing representations in the ReDiv algorithm. To judge the effectiveness of each representation in terms of relevance detection and to perform model selection, we used a variant of leave-one(-location)-out cross-validation and measured performance via the area under the ROC curve (AUC). As classification algorithm we used L2-regularized logistic regression, as it led to near optimal results for a variety of representations in preliminary experiments.

Given an instantiation of the ReDiv approach (a specific combination of relevance detection model and diversity features), we performed leave-one(-location)-out cross-validation and evaluated the performance of each instantiation in terms of F1@20. The process was repeated for different values of w in the [0, 1] range. We also noticed that using only the n < N most relevant images (according to the relevance detection model) leads to improved performance. We therefore also performed a coarse search for n over the domain {1, 2, ..., N = 300} in order to find an optimal value. Finally, we tested the values {1, 2, 3, 4, 5} for the M parameter.

3. INSTANTIATIONS

3.1 Visual (Run 1)
For this run we experimented with all the precomputed visual features made available by the task organizers and also extracted our own visual features. The best results were obtained using VLAD+CSURF [5] vectors (computed from a 128-dimensional visual vocabulary and projected to 128 dimensions with PCA and whitening) for both the relevance and the diversity component. Cosine distance was used as the dissimilarity measure. The parameters used to produce the 1st run are: w = 0.4, n = 75, M = 3.

3.2 Textual (Run 2)
A bag-of-words representation with the 20K/7.5K most frequent words was used for the relevance/diversity component. Wikipedia images were represented using a parsed version of the corresponding Wikipedia page, and Flickr images by a concatenation of the words in their titles (×3), descriptions (×2) and tags (×1). Again, cosine distance was used as the dissimilarity measure. The parameters used to produce the 2nd run are: w = 0.95, n = 110, M = 1.

3.3 Visual+Textual (Runs 3 & 5)
An early fusion of the visual and textual features described above was used for the relevance component, and the visual features described above were used for the diversity component. The parameters used to produce the 3rd run are: w = 0.75, n = 90, M = 5. The 5th run differs from the 3rd run only in the value used for n (= 95).

4. RESULTS AND DISCUSSION
Table 1 shows the performance of the submitted runs on the test locations as well as their estimated performance using our internal evaluation procedure. We observe that in all cases we overestimated the final performance, nevertheless by a small and approximately constant (among different runs) margin. Most importantly, the relative ranking of the runs is the same, suggesting that our model selection procedure was appropriate. The best performance was obtained by the two variations of our visual+textual run, followed by the visual-only run. Despite the comparatively lower performance obtained using only textual features, we see that a significant performance boost was possible by combining them with visual features for relevance detection.

In the future we plan to develop a more principled approach for combining different types of features in the ReDiv algorithm, especially for the diversity component.

5. ACKNOWLEDGEMENTS
This work is supported by the SocialSensor FP7 project, partially funded by the EC under contract number 287975.

6. REFERENCES
[1] D. Corney, C. Martin, A. Göker, E. Spyromitros-Xioufis, S. Papadopoulos, Y. Kompatsiaris, L. Aiello, and B. Thomee. SocialSensor: Finding diverse images at MediaEval 2013. In MediaEval, 2013.
[2] T. Deselaers, T. Gass, P. Dreuw, and H. Ney. Jointly optimising relevance and diversity in image retrieval. In ACM CIVR '09, New York, USA, 2009.
[3] B. Ionescu, M. Menéndez, H. Müller, and A. Popescu. Retrieving diverse social images at MediaEval 2013: Objectives, dataset and evaluation. In MediaEval, 2013.
[4] B. Ionescu, A. Popescu, M. Lupu, A. Gînscă, and H. Müller. Retrieving diverse social images at MediaEval 2014: Challenge, dataset and evaluation. In MediaEval, 2014.
[5] E. Spyromitros-Xioufis, S. Papadopoulos, I. Kompatsiaris, G. Tsoumakas, and I. Vlahavas. A comprehensive study over VLAD and product quantization in large-scale image retrieval. IEEE Transactions on Multimedia, 2014.