SocialSensor: Finding Diverse Images at MediaEval 2013

David Corney, Carlos Martin, Ayse Göker
IDEAS Research Institute
Robert Gordon University, Aberdeen
[d.p.a.corney|c.j.martin-dancausa|a.s.goker]@rgu.ac.uk

Eleftherios Spyromitros-Xioufis, Symeon Papadopoulos, Yiannis Kompatsiaris
Information Technologies Institute
CERTH, Thessaloniki, Greece
[espyromi|papadop|ikom]@iti.gr

Luca Aiello, Bart Thomee
Yahoo! Research Barcelona
08018 Barcelona, Spain
[alucca|bthomee]@yahoo-inc.com

Copyright is held by the author/owner(s).
MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain

ABSTRACT

We describe the participation of the SocialSensor team in the Retrieving Diverse Social Images Task of MediaEval 2013. We submitted entries for all five runs after developing independent algorithms for visual features, text features and internet features (including local weather data). Our best CR@10 results came in the visual-only run, while the visual-text fusion run produced a slightly higher precision.

1. INTRODUCTION

The goal here is to produce a ranked list of images that are both relevant and diverse in response to a location-based query [3]. Throughout our work, we aimed to maximise the CR@10 score based on leave-one(-location)-out cross-validation results from the 50 devset locations. Below, we describe our methods for the five runs in turn before briefly summarising and discussing the results.

2. APPROACHES

2.1 Run 1: Visual-only features

For the visual-only run, each image is represented using optimized VLAD+SURF vectors. Compared to standard VLAD+SURF vectors [6], these vectors include multiple vocabulary aggregation (four visual vocabularies with k = 128 centroids each) and joint dimensionality reduction (to 1024 dimensions) with PCA and whitening [4].

Relevance & Diversity Method: Given a set of images $I = \{im_1, \ldots, im_N\}$, we developed an algorithm that selects a fixed-size set $S \subset I$ that is (approximately) optimal with respect to both relevance to the query location and diversity within $S$. We define the utility $U$ of a set of images $S$ with respect to a query location $l$ as

$U(S|l) = \sum_{im_{s_i} \in S} w \cdot R(im_{s_i}|l) + (1 - w) \cdot D(im_{s_i}|S)$

where $R(im|l)$ is the relevance score of $im$ given the location and $D(im|S)$ is the diversity score within $S$. The same joint criterion, which we call Relevance & Diversity (RD), was used in [2]. However, we use different definitions of $R(im|l)$ and $D(im|S)$ that are more suitable for this task. While relevance in [2] is defined using a similarity measure between each image and a given query image, we use the ground truth data to train a classifier whose prediction for an image is used as the relevance score, with all relevant images as positive and all irrelevant images as negative examples. Diversity in [2] is defined as

$D(im_{s_i}|S, l) = \frac{1}{|S|} \sum_{im_{s_j} \in S,\, j \neq i} d(im_{s_i}, im_{s_j})$

where $d(im_{s_i}, im_{s_j})$ is a dissimilarity measure between $im_{s_i}$ and $im_{s_j}$. We found that this definition is not ideal, because a single image $im_{s_j}$ in $S$ with a high similarity to $im_{s_i}$ reduces the diversity of the whole set. Instead, we define diversity as

$D(im_{s_i}|S, l) = \min_{j \neq i} d(im_{s_i}, im_{s_j})$

i.e. the dissimilarity of $im_{s_i}$ to the most similar image in $S$. As the dissimilarity measure we use the Euclidean distance between the VLAD vectors representing each image.

Optimization & Experiments: To find a set $S$ that approximately optimizes $U$, we use the greedy optimization algorithm of [2]. This algorithm first adds to $S$ the image with the highest relevance score and then sequentially adds the remaining image with the highest RD score. We experimented with several types of relevance classifiers in the RD method, using the Area Under the ROC Curve (AUC) for model selection via cross-validation. We then ran the greedy optimization algorithm with the best-performing classifier for several values of the weight $w$, chose the parameters that gave the best CR@10 ($\approx$ 0.56) on the devset, and used them to produce the test set predictions.
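To make the procedure concrete, here is a minimal Python sketch of the greedy RD selection described above. It assumes `vlad` is a precomputed array of (PCA-reduced) VLAD vectors and `relevance` holds the classifier scores; the function and argument names are illustrative, not taken from our implementation.

```python
import numpy as np

def greedy_rd_select(vlad, relevance, w, size=10):
    """Greedy RD selection: seed S with the most relevant image, then
    repeatedly add the image maximising w*R + (1-w)*D, where D is the
    Euclidean distance to the most similar image already in S."""
    n = len(relevance)
    selected = [int(np.argmax(relevance))]
    while len(selected) < min(size, n):
        best_i, best_score = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            # Diversity = distance to the nearest already-selected image.
            diversity = min(np.linalg.norm(vlad[i] - vlad[j]) for j in selected)
            score = w * relevance[i] + (1 - w) * diversity
            if score > best_score:
                best_i, best_score = i, score
        selected.append(best_i)
    return selected
```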
2.2 Run 2: Text-only features

To predict the relevance of an image, we built a forest of 100 random decision trees [1] using most of the textual descriptors available in the datasets. The descriptors used for classification were: the number of comments and views; the Flickr ranking; and the author name. We also derived features from the description, tags and title fields separately: the number of words in the field; the normalised sums of the tf-idf, social tf-idf and probabilistic values of each word (as provided by the organisers); the normalised sum of the tf-idf values of each keyword, where each value is the tf-idf value of that word in the Wikipedia page of the corresponding location, using the remaining locations as the full corpus; and the average of the previous four values. We also discretized the continuous variables; the Flickr ranking and author were already discrete.

Independently, we used hierarchical clustering to find 15 clusters for each location. Within each cluster, we then ranked the images by the relevance predicted by the random forest. We then stepped through the clusters iteratively, selecting the most relevant remaining image from each, until (up to) 50 images had been selected, as sketched below.

We found some cases where groups of images had identical text features but different ground truth labels. These include casual holiday pictures where the Flickr user provided the same tags, descriptions etc. for a whole set of images, despite their diversity. Any deterministic text-only approach will fail to label such images correctly.
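The following Python sketch illustrates this round-robin selection over clusters. It assumes the relevance scores come from the trained random forest; the Ward linkage is our assumption (the linkage criterion is not specified above), and all names are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def rank_by_clusters(features, relevance, n_clusters=15, limit=50):
    """Cluster one location's images, then cycle over the clusters,
    each time taking the most relevant image not yet selected."""
    labels = fcluster(linkage(features, method='ward'),
                      n_clusters, criterion='maxclust')
    # Within each cluster, queue images by descending predicted relevance.
    queues = {c: sorted(np.where(labels == c)[0], key=lambda i: -relevance[i])
              for c in np.unique(labels)}
    ranked = []
    while len(ranked) < limit and any(queues.values()):
        for c in list(queues):
            if queues[c]:
                ranked.append(int(queues[c].pop(0)))
                if len(ranked) == limit:
                    break
    return ranked
```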

2.3 Run 3: Visual-text fusion

In order to leverage both visual and textual information, we developed a simple late fusion scheme that combines the outputs of the visual and textual approaches described in the previous subsections. This is done by taking the union of the images returned for each location by the two approaches and ordering them by ascending average rank, i.e. the average of the ranks they receive from each approach. Preliminary experiments indicated that early fusion (i.e. taking the individual features derived from each aspect of the data and combining them before making any decisions about relevance or diversity) was less effective.
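A minimal sketch of this average-rank fusion is given below, assuming each input is a ranked list of image ids. The text above does not say how images absent from one list are scored, so penalising them with rank len(list) + 1 is our assumption.

```python
def fuse_by_average_rank(visual_ranking, text_ranking):
    """Late fusion: take the union of two ranked lists and reorder
    the images by their mean rank across the two approaches."""
    def rank_of(img, ranking):
        # Assumed penalty: images missing from a list get rank len+1.
        return ranking.index(img) + 1 if img in ranking else len(ranking) + 1
    candidates = set(visual_ranking) | set(text_ranking)
    return sorted(candidates,
                  key=lambda img: (rank_of(img, visual_ranking) +
                                   rank_of(img, text_ranking)) / 2.0)
```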
2.4 Run 4: Human-machine hybrid approach

We developed a very simple approach to combining human and computer responses, in an attempt to make use of people's natural visual processing abilities and their ability to make rapid judgements from incomplete data. The test set comprised a total of 38,300 images from 346 locations. Obtaining any form of human response requires either a large number of people (e.g. through crowd-sourcing) or a substantial reduction in the number of images. We chose the latter, presenting participants with computer-generated short-lists of images and asking them to improve them. Specifically, we used the text-only methods (Section 2.2) to list the top 15 relevant and diverse images. The human participant then had to select five of these 15 as being either poor-quality images or images that (nearly) duplicate any of the remaining set. Participants were not expected to be familiar with any of the locations, nor did they consult other sources. The final submission for each location consisted of the 10 remaining images, followed by the 5 "rejected" images. Two participants carried out the annotation on a total of 46 locations, around 12% of the total test set.
2.5 Run 5: Device and local weather data

Multimedia objects captured with modern cameras and smartphones are labelled with Exif metadata generated directly by the device at the time the photo or video is taken. For this task, among all the data available, we consider: i) the date and time the photo was taken, generally reliable at the granularity of one day; ii) the f-stop (aperture size) and the exposure time (shutter speed), which can be combined as $EV = \text{f-stop}^2 \cdot \text{exposure}$ and have previously been used to differentiate indoor from outdoor pictures [5]; and iii) the geo-location of the device when the photo was taken, from which we compute the angle and distance to the photographed landmark. We also query a public database of historical weather data (www.ncdc.noaa.gov) to get the weather on the day the picture was taken, which indicates the main weather conditions (e.g. sun, fog, rain, snow, haze, thunderstorm, tornado).

We combine all these data sources to get pictures that are diverse in terms of distance from the landmark, angle of the shot, weather conditions and time of day. We input these features to the k-means algorithm (k = 10). Inside each cluster, when multiple candidate photos are available, we select the photo with the highest number of Flickr favourites. We verified that including the number of favourites as an additional feature in the k-means is beneficial for the selection of diverse images.
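As an illustration, here is a minimal Python sketch of this clustering step, assuming one feature vector per photo built from the Exif and weather data above. The dictionary field names (hour, f_stop, exposure, angle, distance, weather_code, favourites) are hypothetical placeholders for however these values are stored.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_diverse_photos(photos, k=10):
    """Cluster photos on Exif/context features with k-means (k = 10) and
    keep, per cluster, the photo with the most Flickr favourites."""
    feats = np.array([[p['hour'],
                       p['f_stop'] ** 2 * p['exposure'],  # EV: indoor/outdoor cue [5]
                       p['angle'], p['distance'],
                       p['weather_code'],
                       p['favourites']]                   # extra feature aids diversity
                      for p in photos])
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(feats)
    picks = []
    for c in range(k):
        members = [p for p, lab in zip(photos, labels) if lab == c]
        if members:
            picks.append(max(members, key=lambda p: p['favourites']))
    return picks
```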
3. RESULTS AND DISCUSSION

Table 1 summarises the results when returning the top 10 images per location, evaluated against the expert and crowd-sourced ground truth. Our strongest results came from the visual features (run 1); a slight improvement in precision came when these were combined with text features (run 3). Our results are close for all five runs, despite the variety of features and algorithms used. This could indicate that the inherent signal/noise ratio of the data is a limiting factor, although further algorithmic development and optimisation could also improve matters. Future work includes the use of concept detection algorithms to improve diversity by explicitly including images matching different concepts (e.g. exterior; detail; night-time etc.).

                     Expert                  Crowd-sourced
  Method     P@10    CR@10   F1@10    P@10    CR@10   F1@10
  Run 1      0.733   0.429   0.521    0.729   0.764   0.723
  Run 2      0.732   0.390   0.491    0.702   0.760   0.691
  Run 3      0.785   0.405   0.510    0.800   0.763   0.753
  Run 4      0.750   0.408   0.508    0.725   0.738   0.698
  Run 5      0.733   0.406   0.504    0.702   0.696   0.672

Table 1: Results on the test set for the top 10 results, using expert and crowd-sourced ground truth sets.

4. ACKNOWLEDGEMENTS

This work is supported by the SocialSensor FP7 project, partially funded by the EC under contract number 287975.

5. REFERENCES

[1] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

[2] T. Deselaers, T. Gass, P. Dreuw, and H. Ney. Jointly optimising relevance and diversity in image retrieval. In ACM CIVR '09, New York, USA, 2009. ACM.

[3] B. Ionescu, M. Menéndez, H. Müller, and A. Popescu. Retrieving diverse social images at MediaEval 2013: Objectives, dataset and evaluation. In MediaEval 2013 Workshop, Barcelona, Spain, October 18-19, 2013.

[4] H. Jégou and O. Chum. Negative evidences and co-occurences in image retrieval: The benefit of PCA and whitening. In ECCV, 2012.

[5] B. N. Lee, W.-Y. Chen, and E. Y. Chang. A scalable service for photo annotation, sharing, and search. In ACM MULTIMEDIA '06, pages 699–702, Santa Barbara, CA, USA, 2006. ACM.

[6] E. Spyromitros-Xioufis, S. Papadopoulos, I. Kompatsiaris, G. Tsoumakas, and I. Vlahavas. An empirical study on the combination of SURF features with VLAD vectors for image search. In WIAMIS, 2012.