<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SocialSensor: Finding Diverse Images at MediaEval 2013</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luca Aiello</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bart Thomee</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Corney</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carlos Martin</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ayse Göker</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eleftherios Spyromitros-Xioufis</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Symeon Papadopoulos</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yiannis Kompatsiaris</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Yahoo! Research Barcelona</institution>
          ,
          <addr-line>Barcelona</addr-line>
          ,
          <country country="ES">Spain</country>
          [alucca, bthomee]@yahoo-inc.com
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>IDEAS Research Institute, Robert Gordon University</institution>
          ,
          <addr-line>Aberdeen [d.p.a.corney</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Information Technologies Institute, CERTH</institution>
          ,
          <addr-line>Thessaloniki</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <fpage>18</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>We describe the participation of the SocialSensor team in the Retrieving Diverse Social Images Task of MediaEval 2013. We submitted entries for all five runs after developing independent algorithms for visual features, text features and internet features (including local weather data). Our best CR@10 results came in the visual-only run, while the vision-text fusion run produced a slightly higher precision.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The goal here is to produce a ranked list of images that
are both relevant and diverse in response to a location-based
query [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Throughout our work, we aimed to maximise
the CR@10 score based on leave-one(-location)-out
cross-validation results from the 50 devset locations. Below, we
describe our methods for the five runs in turn before briefly
summarising and discussing the results.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. APPROACHES</title>
    </sec>
    <sec id="sec-3">
      <title>2.1 Run 1: Visual-only features</title>
      <p>
        For the visual-only run, each image is represented using
optimized VLAD+SURF vectors. Compared to standard
VLAD+SURF vectors [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], these vectors include multiple
vocabulary aggregation (four visual vocabularies with k = 128
centroids each) and joint dimensionality reduction (to 1024
dimensions) with PCA and whitening [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
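      <p>As a concrete illustration (ours, not the authors' code), the following minimal Python sketch shows VLAD encoding with multiple vocabulary aggregation followed by joint PCA and whitening; all names are illustrative and the dimensions are shrunk so the toy example runs as-is:</p>
      <preformat>
import numpy as np
from sklearn.decomposition import PCA

def vlad(descriptors, centroids):
    # assign each local descriptor to its nearest centroid and
    # accumulate the residuals (descriptor minus centroid)
    k, d = centroids.shape
    nearest = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1).argmin(1)
    v = np.zeros((k, d))
    for i, c in enumerate(nearest):
        v[c] += descriptors[i] - centroids[c]
    v = v.ravel()
    return v / (np.linalg.norm(v) + 1e-12)  # L2 normalisation

def encode(descriptors, vocabularies):
    # multiple vocabulary aggregation: concatenate the VLAD vectors
    # computed against each vocabulary
    return np.concatenate([vlad(descriptors, c) for c in vocabularies])

rng = np.random.default_rng(0)
# toy stand-ins: four vocabularies of k=128 centroids over 64-d SURF descriptors
vocabs = [rng.normal(size=(128, 64)) for _ in range(4)]
images = [rng.normal(size=(100, 64)) for _ in range(40)]
X = np.vstack([encode(d, vocabs) for d in images])

# joint dimensionality reduction with PCA and whitening; the system
# reduces to 1024 dimensions (32 here so the toy data suffices)
pca = PCA(n_components=32, whiten=True)
X_reduced = pca.fit_transform(X)
      </preformat>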
      <p>
        Relevance &amp; Diversity Method: Given a set of
images I = {im_1, …, im_N}, we developed an algorithm that
selects a fixed-size set S ⊆ I that is (approximately)
optimal with respect to both relevance to the query location
and diversity within S. We define the utility U of a set
of images S with respect to a query location l as: U(S|l) =
Σ_{im_i ∈ S} [w · R(im_i|l) + (1 − w) · D(im_i|S)], where R(im|l) is
the relevance score for im given the location and D(im|S) is
the diversity score within S. The same joint criterion, which
we call Relevance &amp; Diversity (RD), was used in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
However, we use different definitions for R(im|l) and D(im|S)
that are more suitable for this task. While relevance in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
is defined using a similarity measure between each image
and a given query image, we use the ground truth data to
train a classifier whose prediction for an image is used as the
relevance score. We use all relevant images as positive and
all irrelevant images as negative examples. Diversity in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
is defined as: D(im_i|S, l) = (1/|S|) Σ_{im_j ∈ S, j ≠ i} d(im_i, im_j),
where d(im_i, im_j) is a dissimilarity measure between im_i
and im_j. We found that this definition is not ideal: a single
image im_j in S that is highly similar to im_i should reduce
the diversity of the set, yet under averaging its effect is diluted.
Instead, we define diversity
as: D(im_i|S, l) = min_{j ≠ i} d(im_i, im_j), i.e. as
the dissimilarity of im_i to the most similar image in S. As a
dissimilarity measure we use the Euclidean distance between
the VLAD vectors representing each image.
      </p>
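      <p>To see why the min-based definition behaves differently, the following toy Python sketch (ours; names illustrative) compares the two definitions on a set that contains one near-duplicate:</p>
      <preformat>
import numpy as np

def mean_diversity(vecs, i):
    # averaged dissimilarity, as in [2]: a near-duplicate is diluted
    d = np.linalg.norm(vecs - vecs[i], axis=1)
    return d[np.arange(len(vecs)) != i].mean()

def min_diversity(vecs, i):
    # distance to the most similar other image: a single
    # near-duplicate drives the score towards zero
    d = np.linalg.norm(vecs - vecs[i], axis=1)
    return d[np.arange(len(vecs)) != i].min()

rng = np.random.default_rng(1)
S = rng.normal(size=(9, 1024))      # a candidate set of VLAD vectors
S = np.vstack([S, S[0] + 1e-3])     # plus a near-duplicate of image 0
print(mean_diversity(S, 0))         # stays high despite the duplicate
print(min_diversity(S, 0))          # close to zero: the duplicate is exposed
      </preformat>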
      <p>
        Optimization &amp; Experiments: To find a set S that
approximately optimizes U, we use the greedy optimization
algorithm of [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. This algorithm first adds to S the image
with the highest relevance score and then sequentially adds
the remaining image with the highest RD score. We
experimented with several types of relevance classifiers used
in the RD method. Area Under ROC (AUC) was used for
model selection by applying cross-validation. We applied
the greedy optimization algorithm with the best performing
classifier for several values of the weight w and chose the
parameters that gave the best results for CR@10 (≈ 0.56)
on the devset for producing the test set predictions.
      </p>
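      <p>A minimal Python sketch of this greedy selection follows, assuming precomputed VLAD vectors and classifier relevance scores (toy stand-ins below; in practice the two terms would be rescaled to comparable ranges):</p>
      <preformat>
import numpy as np

def greedy_rd(vectors, relevance, size=10, w=0.5):
    # pairwise Euclidean distances between the VLAD vectors
    dist = np.linalg.norm(vectors[:, None, :] - vectors[None, :, :], axis=-1)
    # start with the single most relevant image
    S = [int(np.argmax(relevance))]
    for _ in range(size - 1):
        remaining = [i for i in range(len(relevance)) if i not in S]
        # min-distance diversity of each candidate w.r.t. the current S
        div = dist[np.ix_(remaining, S)].min(axis=1)
        scores = w * relevance[remaining] + (1 - w) * div
        S.append(remaining[int(np.argmax(scores))])
    return S

rng = np.random.default_rng(2)
vecs = rng.normal(size=(60, 128))   # reduced VLAD vectors for one location
rel = rng.uniform(size=60)          # stand-in classifier relevance scores
print(greedy_rd(vecs, rel, size=10, w=0.6))
      </preformat>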
    </sec>
    <sec id="sec-4">
      <title>2.2 Run 2: Text-only features</title>
      <p>
        To predict the relevance of an image, we built a forest
of 100 random decision trees [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] using most of the textual
descriptors available in the datasets. The textual descriptors
used for classification were: the number of comments and views;
the Flickr ranking; and the author name. We also derived features
from the description, tags and title fields separately: the number
of words in the field; the normalised sums of the tf-idf, social
tf-idf and probabilistic values of each word (as provided by
the organisers); the normalised sum of tf-idf values of each
keyword, where each value is the tf-idf value of the word in
the Wikipedia page of the corresponding location, using
the remaining locations as the full corpus; and the
average of the previous four values. We also discretized the
continuous variables; the Flickr ranking and author were
already discrete.
      </p>
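      <p>As a hedged sketch of this step, here is scikit-learn's RandomForestClassifier standing in for the forest of 100 random decision trees, applied to a hypothetical feature matrix (the feature values below are random placeholders):</p>
      <preformat>
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
# hypothetical per-image feature matrix: comments, views, Flickr rank,
# plus the derived per-field tf-idf statistics described above
X = rng.normal(size=(500, 20))
y = rng.integers(0, 2, size=500)    # ground-truth relevance labels

# a forest of 100 trees; the positive-class probability serves
# as the relevance score for ranking
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
relevance = forest.predict_proba(X)[:, 1]
      </preformat>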
      <p>Independently, we used hierarchical clustering to find 15
clusters for each location. Within each cluster, we then
ranked the images by the relevance predicted by the
random forest. We then stepped through the clusters, iteratively
selecting the most relevant remaining image until (up to) 50
had been selected.</p>
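      <p>A small Python sketch of this cluster-then-round-robin selection, using SciPy's Ward-linkage hierarchical clustering as a stand-in (the text does not specify the linkage criterion):</p>
      <preformat>
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(4)
X = rng.normal(size=(120, 20))      # text feature vectors for one location
relevance = rng.uniform(size=120)   # scores from the relevance classifier

# hierarchical (Ward) clustering cut at 15 clusters
labels = fcluster(linkage(X, method="ward"), t=15, criterion="maxclust")

# per cluster, queue the images by descending predicted relevance
queues = {c: sorted(np.where(labels == c)[0], key=lambda i: -relevance[i])
          for c in np.unique(labels)}

# round-robin over the clusters until (up to) 50 images are chosen
target = min(50, len(X))
selected = []
while len(selected) != target:
    for c in list(queues):
        if queues[c] and len(selected) != target:
            selected.append(int(queues[c].pop(0)))
      </preformat>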
      <p>We found some cases where groups of images had
identical text features but different ground truth labels. These
include casual holiday pictures where the Flickr user
provided the same tags, descriptions etc. for a whole set of
images, despite their diversity. Any deterministic text-only
approach will fail to label these images correctly.</p>
    </sec>
    <sec id="sec-5">
      <title>2.3 Run 3: Visual-text fusion</title>
      <p>In order to leverage both visual and textual information,
we developed a simple late fusion scheme that combines the
outputs of the visual and textual approaches described in the
previous subsections. This is done by taking the union of
the images returned for each location by the two approaches
and ordering them by ascending average rank, i.e. the
average of the ranks that they receive from each approach.
Preliminary experiments indicated that early fusion (i.e. taking
the individual features derived from each aspect of the data
and combining them before making any decisions about
relevance or diversity) was less effective.</p>
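      <p>A minimal sketch of the average-rank fusion; the handling of images missing from one of the two rankings is our assumption, since the text only specifies the average of the ranks:</p>
      <preformat>
def late_fusion(visual_ranking, textual_ranking):
    # union of the two per-location rankings, ordered by average rank;
    # images missing from one list are assumed to rank just past its end
    pool = set(visual_ranking) | set(textual_ranking)

    def rank(image, ranking):
        return ranking.index(image) if image in ranking else len(ranking)

    return sorted(pool, key=lambda im: (rank(im, visual_ranking)
                                        + rank(im, textual_ranking)) / 2)

# e.g. late_fusion(["a", "b", "c"], ["b", "d", "a"]) yields ["b", "a", "d", "c"]
      </preformat>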
    </sec>
    <sec id="sec-6">
      <title>2.4 Run 4: Human-machine hybrid approach</title>
      <p>We developed a very simple approach to combining human
and computer responses, in an attempt to make use of
people's natural visual processing abilities and their ability to
make rapid judgements from incomplete data. The test set
comprised a total of 38,300 images from 346 locations.
Obtaining any form of human response requires either a large
number of people (e.g. through crowd-sourcing) or a
substantial reduction in the number of images. We chose the
latter and presented the participants with computer-generated
short-lists of images and asked them to improve them.
Specifically, we used the text-only methods (Section 2.2) to list
the top 15 relevant and diverse images. The human
participant then had to select five of these 15 as being either
poor-quality images or images that (nearly) duplicate any
of the remaining set. Participants were not expected to be
familiar with any of the locations, nor did they consult other
sources. The final submission for each location consisted of
the 10 remaining images, followed by the 5 "rejected" images.
Two participants carried out the annotation on a total of 46
locations, around 12% of the total test set.</p>
    </sec>
    <sec id="sec-7">
      <title>2.5 Run 5: Device and local weather data</title>
      <p>
        Multimedia objects captured with modern cameras and
smartphones are labeled with Exif metadata generated
directly by the device at the time the photo or video
is taken. For this task, among all the data available we
consider: i) the date and time the photo was taken, generally reliable
at the granularity of one day; ii) the f-stop (aperture size) and
the exposure time (shutter speed), which can be
combined as EV = f-stop² / exposure, used previously to
differentiate indoor from outdoor pictures [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]; iii) the geo-location
of the device when the photo was taken, from which we
compute the angle and distance to the photographed landmark.
We also query a public database of historical weather data
(www.ncdc.noaa.gov) to get the weather on the day the
picture was taken, which indicates the main weather conditions
(e.g. sun, fog, rain, snow, haze, thunderstorm, tornado).
      </p>
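      <p>A small sketch of the EV feature as written above (the conventional photographic EV takes the base-2 logarithm of this ratio); the example values are illustrative:</p>
      <preformat>
def exposure_value(f_stop, exposure_time):
    # EV as used in the text: f-stop squared over exposure time;
    # larger values correspond to brighter, typically outdoor scenes [5]
    return f_stop ** 2 / exposure_time

# a bright outdoor exposure vs a dim indoor one
print(exposure_value(11.0, 1 / 250))   # 30250.0
print(exposure_value(2.8, 1 / 30))     # roughly 235.2
      </preformat>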
      <p>We combine all these data sources to select pictures that are
diverse in terms of distance from the landmark, angle of the
shot, weather conditions and time of day. We input these
features to the k-means algorithm (k = 10). Inside each
cluster, when multiple candidate photos are available, we select
the photo with the highest number of Flickr favourites. We
verified that including the number of favourites as an
additional feature for k-means is beneficial for the selection
of diverse images.</p>
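      <p>A hedged sketch of this step with scikit-learn's KMeans on toy stand-in features; the exact feature encoding is not specified in the text:</p>
      <preformat>
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
# per-photo features: distance and bearing to the landmark, EV,
# time of day and an encoded weather condition (toy stand-ins here)
X = rng.normal(size=(200, 5))
favourites = rng.integers(0, 50, size=200)

# k-means with k = 10 clusters of distinct shooting conditions
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X)

# one photo per cluster: prefer the most-favourited candidate
selected = []
for c in range(10):
    members = np.where(labels == c)[0]
    if len(members):
        selected.append(int(members[np.argmax(favourites[members])]))
      </preformat>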
    </sec>
    <sec id="sec-8">
      <title>3. RESULTS AND DISCUSSION</title>
      <p>Table 1 summarises the results when returning the top
10 images per location, compared to the expert and
crowdsourced ground truth. Our strongest results came from the
visual features (run 1); a slight improvement in precision
came when these were combined with text features (run 3).
Our results are close for all five runs, despite the variety of
features and algorithms used. This could indicate that the
inherent signal-to-noise ratio of the data is a limiting factor,
although further algorithmic development and optimisation
could also improve matters. Future work includes the use
of concept detection algorithms to improve diversity by
explicitly including images matching different concepts (e.g.
exterior; detail; night-time etc.).</p>
    </sec>
    <sec id="sec-9">
      <title>4. ACKNOWLEDGEMENTS</title>
      <p>This work is supported by the SocialSensor FP7 project,
partially funded by the EC under contract number 287975.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Breiman</surname>
          </string-name>
          .
          <article-title>Random forests</article-title>
          .
          <source>Machine Learning</source>
          ,
          <volume>45</volume>
          (
          <issue>1</issue>
          ):
          <fpage>5</fpage>
          -
          <lpage>32</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Deselaers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gass</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dreuw</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Ney</surname>
          </string-name>
          .
          <article-title>Jointly optimising relevance and diversity in image retrieval</article-title>
          .
          <source>In ACM CIVR '09</source>
          , New York, USA,
          <year>2009</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Menendez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Popescu</surname>
          </string-name>
          .
          <article-title>Retrieving diverse social images at MediaEval 2013: Objectives, dataset and evaluation</article-title>
          . In MediaEval 2013 Workshop, Barcelona, Spain, October 18-19,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Jegou</surname>
          </string-name>
          and
          <string-name>
            <given-names>O.</given-names>
            <surname>Chum</surname>
          </string-name>
          .
          <article-title>Negative evidences and co-occurences in image retrieval: The benefit of PCA and whitening</article-title>
          .
          <source>In ECCV</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B. N.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E. Y.</given-names>
            <surname>Chang</surname>
          </string-name>
          .
          <article-title>A scalable service for photo annotation, sharing, and search</article-title>
          .
          <source>In ACM MULTIMEDIA '06</source>
          , pages
          <fpage>699</fpage>
          -
          <lpage>702</lpage>
          , Santa Barbara, CA, USA,
          <year>2006</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>E.</given-names>
            <surname>Spyromitros-Xioufis</surname>
          </string-name>
          , S. Papadopoulos, I. Kompatsiaris, G. Tsoumakas, and I. Vlahavas.
          <article-title>An empirical study on the combination of SURF features with VLAD vectors for image search</article-title>
          .
          <source>In WIAMIS</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>