<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SocialSensor: Finding Diverse Images at MediaEval 2014</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Eleftherios Spyromitros-Xioufis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Symeon Papadopoulos</string-name>
          <email>papadop@iti.gr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yiannis Kompatsiaris</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ioannis Vlahavas</string-name>
          <email>vlahavas@csd.auth.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Aristotle University of Thessaloniki</institution>
          ,
          <addr-line>Thessaloniki</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Information Technologies Institute, CERTH</institution>
          ,
          <addr-line>Thessaloniki</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>16</fpage>
      <lpage>17</lpage>
      <abstract>
<p>This paper describes the participation of the SocialSensor team in the Retrieving Diverse Social Images Task of MediaEval 2014. All our entries are produced by a different instantiation (set of features, parameter configuration) of the same diversification algorithm, which optimizes a joint relevance-diversity criterion. All our runs are automated and use only resources provided by the task organizers. Our best results in terms of the official ranking metric (F1@20 of 0.59) came from the runs that combine visual and textual information, followed by the visual-only run.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The Retrieving Diverse Social Images task of MediaEval
2014 deals with the problem of result diversification in
social photo retrieval. Participants are given a list of images
retrieved from Flickr in response to a query for a specific
location, e.g., "Eiffel Tower", and are asked to return a refined
short-list that contains images which are at the same time
relevant and diverse (see [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] for more details).
      </p>
      <p>
        To deal with this problem, we build upon the approach
that we developed for the visual-only run of the previous year's
task [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], termed Relevance and Diversity (ReDiv) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. For
this year's task, the ReDiv approach was refined and used
to produce all our runs. Section 2 describes the ReDiv
approach and Section 3 details the different instantiations of
the approach used to produce each of the submitted runs.
Finally, in Section 4 we briefly summarize and discuss our
experimental results.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. OVERVIEW OF OUR APPROACH</title>
      <p>
        Let I = {im_1, …, im_N} be a set of images that have been
retrieved from Flickr in response to a query q for a specific
location. The goal of the diversification algorithm is to select
a K-sized subset of images from I that are as relevant (to
the query location) and as diverse (among each other) as
possible. ReDiv formalizes this verbal description as the
following optimization problem: argmax_{S ⊆ I, |S| = K} U(S) =
w·R(S|q) + (1 − w)·D(S), where we want to identify the set
S that has maximum utility U(S), defined as a weighted
combination of the relevance R(S|q) and the diversity D(S)
of S. A similar formulation of the problem was used in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
In ReDiv, however, we use different definitions for R(S|q)
and D(S) that we found more suitable for this task. These
changes are described below.
      </p>
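      <p>As a concrete illustration of the objective, the following sketch scores a candidate set under the joint criterion (the function names and toy inputs are ours, not from the paper; D(S) uses the paper's minimum-pairwise definition given below):</p>

```python
def utility(S, relevance, dissim, w):
    """U(S) = w * R(S|q) + (1 - w) * D(S).

    relevance: per-image relevance scores R(im_i|q), indexed by image id.
    dissim: dict mapping unordered image pairs to dissimilarities d(im_i, im_j).
    """
    R = sum(relevance[i] for i in S)           # R(S|q): summed relevance
    D = min(dissim[frozenset((i, j))]          # D(S): most similar pair in S
            for i in S for j in S if i < j)
    return w * R + (1 - w) * D


# Toy example with three images; a higher dissimilarity means a more diverse pair.
relevance = [0.9, 0.5, 0.4]
dissim = {frozenset((0, 1)): 0.1, frozenset((0, 2)): 0.8, frozenset((1, 2)): 0.7}
print(utility([0, 2], relevance, dissim, w=0.5))  # 0.5*1.3 + 0.5*0.8 = 1.05
```

      <p>With w = 0.5, the set {0, 2} beats {0, 1} even though image 1 is more relevant than image 2, because images 0 and 1 are near-duplicates.</p>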
      <p>
        Relevance: In [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], the authors define relevance as R(S|q)
= Σ_{im_i ∈ S} R(im_i|q), where R(im_i|q) = 1 − d(im_i, im_q) and
d(im_i, im_q) denotes the dissimilarity between image im_i and
the image that depicts the query location, im_q. We observed
that, in the context of this task, this definition can be
problematic (especially when using only visual information), as
there are several images that are visually dissimilar to the
reference Wikipedia images of the location but are still
considered relevant to the location, e.g., inside views. Also, in
many cases, images that are similar to the reference images
are considered irrelevant to the location because people
appear in the image. Motivated by these shortcomings,
we developed a more principled way of computing the
relevance of each image to the query location. This is achieved
by building a supervised classification model, distinct for
each location, that is trained to distinguish relevant from
irrelevant images. More specifically, we use the
probabilistic output of this model in place of R(im_i|q). To train
this model, we use the relevance ground truth provided by
the task organizers for the development set locations and
use relevant/irrelevant images of other locations as
positive/negative examples. Additionally, the Wikipedia images
of each location are used as positive (relevant) examples and
are assigned a large weight.
      </p>
      <p>
        Diversity: Assuming a ranking im_{r_1}, …, im_{r_K} of the
images in S, the authors in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] define diversity as D(S) =
Σ_{i=1}^{K} (1/i) Σ_{j=1}^{i} d(im_{r_i}, im_{r_j}), where d(im_{r_i}, im_{r_j}) is the
dissimilarity between the images ranked at positions i and j.
Thus, high diversity scores are given to image sets with a
high average dissimilarity. We notice that this definition of
diversity can assign relatively high diversity scores to
image sets containing highly similar image pairs
(probably belonging to the same cluster), which has
a direct negative impact on the CR@20 measure and,
consequently, on F1@20. Therefore, we adopt a stricter
definition of diversity, where the diversity of a set S is defined as
the dissimilarity between the most similar pair of images in
S: D(S) = min_{im_i, im_j ∈ S, i ≠ j} d(im_i, im_j).
      </p>
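      <p>To see why the stricter definition matters, the following sketch (a toy example of ours, not from the paper) contrasts the rank-weighted average diversity of [2] with the minimum-pairwise definition on a set containing a near-duplicate pair:</p>

```python
def diversity_avg(ranking, dissim):
    """D(S) = sum_{i=1}^{K} (1/i) * sum_{j=1}^{i} d(im_ri, im_rj), as in [2]."""
    K = len(ranking)
    return sum((1.0 / i) * sum(dissim[ranking[i - 1]][ranking[j - 1]]
                               for j in range(1, i + 1))
               for i in range(1, K + 1))

def diversity_min(S, dissim):
    """ReDiv's definition: dissimilarity of the most similar pair in S."""
    return min(dissim[i][j] for i in S for j in S if i != j)

# Images 0 and 1 are near-duplicates (d = 0.05); image 2 is distinct.
dissim = [[0.0, 0.05, 0.9],
          [0.05, 0.0, 0.9],
          [0.9, 0.9, 0.0]]
print(diversity_avg([0, 1, 2], dissim))  # high average despite the duplicates
print(diversity_min([0, 1, 2], dissim))  # 0.05: the duplicate pair is exposed
```

      <p>The average-based score stays high because the two distant pairs mask the duplicate pair; the minimum-based score does not.</p>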
      <p>
        Optimization: Exact optimization of U has a
high complexity, as it would require computing the utility of
all N!/(K!(N − K)!) K-subsets of I. With N ≈ 300 and K = 20 (in
order to maximize F1@20), the computational cost of exact
optimization becomes prohibitive. We therefore adopt the
greedy, approximate optimization approach that was used
in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], with appropriate changes to reflect our new
definitions of relevance and diversity. This algorithm starts with
an empty set S and sequentially expands it by adding, at
each step J = 1, …, K, the image im* that scores highest
(among the unselected images) according to the following criterion:
U(im*) = w·R(im*) + (1 − w)·min_{im_j ∈ S_{J−1}} d(im*, im_j), where
S_{J−1} represents S at step J − 1. We also developed a less
greedy version of this algorithm that at each step J keeps the M
highest scoring image subsets. Since the two algorithms
coincide for M = 1, we used the less greedy version and tuned
the M parameter.
      </p>
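      <p>The greedy expansion described above can be sketched as follows (a minimal sketch under our own assumptions, not the authors' code: the first image is chosen purely by relevance, since S is empty at step J = 1, and ties are broken by image index):</p>

```python
def rediv_greedy(relevance, dissim, K, w):
    """Greedily build S, adding at each step the unselected image im* that
    maximizes U(im*) = w * R(im*) + (1 - w) * min_{imj in S} d(im*, imj)."""
    N = len(relevance)
    S = [max(range(N), key=lambda i: relevance[i])]  # step J = 1: S is empty
    while len(S) < K:
        def score(im):
            return w * relevance[im] + (1 - w) * min(dissim[im][j] for j in S)
        best = max((im for im in range(N) if im not in S), key=score)
        S.append(best)
    return S

relevance = [1.0, 0.9, 0.2]
dissim = [[0.0, 0.05, 0.8],
          [0.05, 0.0, 0.8],
          [0.8, 0.8, 0.0]]
print(rediv_greedy(relevance, dissim, K=2, w=0.5))  # [0, 2]
```

      <p>The less greedy variant with M &gt; 1 would instead keep the M best partial sets at each step, in the manner of a beam search; for M = 1 it reduces to the code above.</p>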
      <p>Experimental Protocol: Depending on the type of the
run (visual/textual/both), a variety of different (vector)
representations of the images could be utilized for building the
relevance detection models and computing pairwise image
similarities in ReDiv (note that the algorithm allows using
different representations for relevance and diversity). To
reduce the complexity of the experiments, we first evaluated
each representation in terms of its relevance detection ability
and then evaluated combinations of only the top-performing
representations in the ReDiv algorithm. To judge the
effectiveness of each representation in terms of relevance
detection and to perform model selection, we used a variant of
leave-one(-location)-out cross-validation and measured
performance via the area under the ROC curve (AUC). As the
classification algorithm we used L2-regularized logistic regression, as it led
to near-optimal results for a variety of representations in
preliminary experiments.</p>
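      <p>For reference, the AUC used for model selection can be computed from scored examples as follows (a minimal rank-based sketch of the standard metric, not the authors' code):</p>

```python
def auc(scores, labels):
    """Area under the ROC curve: the probability that a randomly chosen
    positive example is scored above a randomly chosen negative one
    (ties count as half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # 1.0: perfect separation
```
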
      <p>Given an instantiation of the ReDiv approach (a specific
combination of relevance detection model and diversity
features), we performed leave-one(-location)-out
cross-validation and evaluated the performance of each instantiation
in terms of F1@20. The process was repeated for different
values of w in the [0, 1] range. We also noticed that using
only the n &lt; N most relevant images (according to the
relevance detection model) leads to improved performance. We
therefore also performed a coarse search over the domain
{1, 2, …, 300} in order to find an optimal value of n.
Finally, we tested the values {1, 2, 3, 4, 5} for the M parameter.</p>
    </sec>
    <sec id="sec-3">
      <title>3.1 Visual (Run 1)</title>
      <p>
        For this run we experimented with all the precomputed
visual features made available by the task organizers and
also extracted our own visual features. The best results were
obtained using VLAD+CSURF [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] vectors (computed from
a 128-dimensional visual vocabulary and projected to 128
dimensions with PCA and whitening) for both the relevance
and the diversity component. Cosine distance was used as the
dissimilarity measure. The parameters used to produce the
1st run are: w = 0.4, n = 75, M = 3.
      </p>
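      <p>The cosine dissimilarity used as the dissimilarity measure can be sketched as follows (our own minimal implementation of the standard definition):</p>

```python
import math

def cosine_dissim(a, b):
    """d(a, b) = 1 - (a . b) / (||a|| * ||b||); 0 for identical directions."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

print(cosine_dissim([1.0, 0.0], [0.0, 1.0]))  # 1.0: orthogonal vectors
print(cosine_dissim([1.0, 0.0], [2.0, 0.0]))  # 0.0: same direction
```
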
    </sec>
    <sec id="sec-4">
      <title>3.2 Textual (Run 2)</title>
      <p>A bag-of-words representation with the 20K/7.5K most
frequent words was used for the relevance/diversity
component. Wikipedia images were represented using a parsed
version of the corresponding Wikipedia page, and Flickr
images by a concatenation of the words in their titles (×3),
descriptions (×2) and tags (×1). Again, cosine distance was
used as the dissimilarity measure. The parameters used to
produce the 2nd run are: w = 0.95, n = 110, M = 1.</p>
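      <p>The field-weighted concatenation can be sketched as follows (a minimal sketch of ours, assuming the weights indicated above mean that title tokens are counted three times, description tokens twice, and tag tokens once; the function name is hypothetical):</p>

```python
from collections import Counter

def flickr_bow(title, description, tags, weights=(3, 2, 1)):
    """Weighted bag-of-words over the three Flickr text fields."""
    counts = Counter()
    for text, w in zip((title, description, tags), weights):
        for token in text.lower().split():
            counts[token] += w  # each occurrence contributes the field weight
    return counts

bow = flickr_bow("Eiffel Tower", "tower at night", "paris tower")
print(bow["tower"])  # 3 (title) + 2 (description) + 1 (tags) = 6
```
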
    </sec>
    <sec id="sec-5">
      <title>3.3 Visual+Textual (Runs 3 &amp; 5)</title>
      <p>An early fusion of the visual and textual features described
above was used for the relevance component, and the visual
features described above were used for the diversity
component. The parameters used to produce the 3rd run are:
w = 0.75, n = 90, M = 5. The 5th run differs from the 3rd
run only in the value used for n (= 95).</p>
    </sec>
    <sec id="sec-6">
      <title>4. RESULTS AND DISCUSSION</title>
    </sec>
    <sec id="sec-7">
      <title>5. ACKNOWLEDGEMENTS</title>
      <p>This work is supported by the SocialSensor FP7 project,
partially funded by the EC under contract number 287975.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] D. Corney, C. Martin, A. Goker, E. Spyromitros-Xioufis, S. Papadopoulos, Y. Kompatsiaris, L. Aiello, and B. Thomee. SocialSensor: Finding diverse images at MediaEval 2013. In MediaEval, <year>2013</year>.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] T. Deselaers, T. Gass, P. Dreuw, and H. Ney. Jointly optimising relevance and diversity in image retrieval. In <source>ACM CIVR '09</source>, New York, USA, <year>2009</year>.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] B. Ionescu, M. Menendez, H. Müller, and A. Popescu. Retrieving diverse social images at MediaEval 2013: Objectives, dataset and evaluation. In MediaEval, <year>2013</year>.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] B. Ionescu, A. Popescu, M. Lupu, A. Gînscă, and H. Müller. Retrieving diverse social images at MediaEval 2014: Challenge, dataset and evaluation. In MediaEval, <year>2014</year>.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] E. Spyromitros-Xioufis, S. Papadopoulos, I. Kompatsiaris, G. Tsoumakas, and I. Vlahavas. A comprehensive study over VLAD and product quantization in large-scale image retrieval. <source>IEEE Transactions on Multimedia</source>, <year>2014</year>.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>