Retrieving Diverse Social Images at MediaEval 2013:
                  Objectives, Dataset and Evaluation

                                   Bogdan Ionescu                       María Menéndez
                             LAPI, University Politehnica of     DISI, University of Trento, Italy
                                 Bucharest, Romania                       menendez@unitn.it
                                bionescu@alpha.imag.pub.ro

                                    Henning Müller                       Adrian Popescu
                              HES-SO, Sierre, Switzerland                CEA-LIST, France
                                  henning.mueller@hevs.ch                adrian.popescu@cea.fr


ABSTRACT                                                          photo, the geographical position of the place and basic de-
This paper provides an overview of the Retrieving Diverse         scriptions. Before deciding whether this location suits her
Social Images task that is organized as part of the Medi-         needs, the person is interested in getting a more complete
aEval 2013 Benchmarking Initiative for Multimedia Evalua-         visual description of the place.
tion. The task addresses the problem of result diversification       In this task, participants receive a list of photos for a cer-
in the context of social photo retrieval. We present the task     tain location retrieved from Flickr2 and ranked with Flickr’s
challenges, the proposed data set and ground truth, the re-       default “relevance” algorithm. These results are typically
quired participant runs and the evaluation metrics.               noisy and redundant. The requirements of the task are to
                                                                  refine these results by providing a ranked list of up to 50
1.     INTRODUCTION                                               photos that are considered to be both relevant and diverse
                                                                  representations of the query according to the definitions:
  The MediaEval 2013 Retrieving Diverse Social Images Task
addresses the problem of result diversification in the context    Relevance: a photo is relevant for the location if it is a
of social photo retrieval. Existing retrieval technology fo-      common visual representation of the location, e.g., different
cuses almost exclusively on the accuracy of the results that      views at different times of the day/year and under varying
often provides the user with near replicas of the query. How-     weather conditions, inside views, close-ups on architectural
ever, users would expect to retrieve not only representative      details, drawings, sketches, creative views, etc, which con-
photos but also diverse results depicting the query in a com-     tain partially or entirely the target location. Photos of poor
prehensive and complete manner. Another equally impor-            quality (e.g., severely blurred, out of focus, etc) as well as
tant aspect is that retrieval should focus on summarizing         photos showing people in focus (e.g., a big picture of me in
the query with a small set of images, since most of the users     front of the monument) are not considered relevant.
commonly browse only the top retrieval results.                   Diversity: a set of photos is considered to be diverse if it
  The task aims to foster new research in this area [1, 2]        depicts different visual characteristics of the target location,
by creating a multi-modal evaluation framework specifically       e.g., different views at different times of the day/year and
designed to encourage the creation of new solutions from var-     under varying weather conditions, inside views, close-ups
ious research areas, such as: machine analysis, human-based       on architectural details, creative views, etc, with a certain
approaches (e.g., crowd-sourcing) and hybrid machine-human        degree of complementarity, i.e., most of the perceived visual
approaches (e.g., relevance feedback). Compared to other          information is different from one photo to another.
existing tasks addressing diversity, e.g., ImageCLEF 2009
Photo Retrieval [3], the main novelty of this task is in ad-      3. DATASET
dressing the social dimension that is reflected both in its          The 2013 data set consists of 396 locations, spread over
nature (variable quality of photos and of metadata) and in        34 countries around the world, ranging from very famous
the methods devised to retrieve it.                               ones (e.g., “Eiffel Tower”) to lesser known monuments (e.g.,
                                                                  “Palazzo delle Albere”). They are divided into a develop-
2.     TASK DESCRIPTION                                           ment set containing 50 locations (devset - to be used for
   The task is build around a tourist use case where a person     designing and validating the proposed approaches) and a
tries to find more information about a place she is poten-        test set containing 346 locations (testset - to be used for the
tially visiting. The person has only a vague idea about the       official evaluation). Each of the two data sets contains data
location, knowing the name of the place. She uses the name        that was retrieved from Flickr using the name of the loca-
to learn additional facts about the place from the Internet,      tion as query (keywords), as well as using the name of the
for instance by visiting a Wikipedia1 page, e.g., getting a       location together with its GPS coordinates (keywordsGPS ).
1
                                                                     For each location, the following information is provided:
    http://en.wikipedia.org/                                      the name of the location, its GPS coordinates, a link to a
                                                                  Wikipedia description webpage, a representative photo from
                                                                  Wikipedia, a ranked list of photos retrieved from Flickr (up
Copyright is held by the author/owner(s).                         2
MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain        http://www.flickr.com/
to 150 photos per location; devset contains 5,118 images            sual information only; run2 - automated approaches using
while testset 38,300 images)3 , an xml file containing meta-        textual information only; and run3 - automated approaches
data from Flickr for all the retrieved photos (i.e., photo title,   using textual-visual information fused without other resources
photo description, photo id, tags, Creative Common license          than provided by the organizers. The last 2 runs are general
type, number of posted comments, the url link of the photo          runs: run4 - human-based or hybrid human-machine ap-
location from Flickr, the photo owner’s name and the num-           proaches and run5 - everything allowed including using data
ber of times the photo has been displayed), a set of global         from external sources (e.g., Internet). For generating run1
visual descriptors automatically extracted from the photos          to run4 participants are allowed to use only information that
(i.e., color histograms, histogram of oriented gradients, color     can be extracted from the provided data (e.g., provided con-
moments, local binary patterns, MPEG-7 color structure              tent descriptors, content descriptors of their own, etc). This
descriptor, run-length matrix statistics and spatial pyramid        includes also the Wikipedia webpage of the locations pro-
representation of these descriptors) and several textual mod-       vided via their links. For run5 everything is allowed, from
els (i.e., probabilistic model, term frequency-inverse docu-        the method point of view and information sources.
ment frequency — TF-IDF; weighting and social TF-IDF
weighting — an adaptation to the social space).                     6. EVALUATION
                                                                       Performance is assessed for both diversity and relevance.
4.   GROUND TRUTH                                                   The main evaluation metrics is cluster recall at X (CR@X)
   For each location, photos were manually annotated for            [3] — a measure that assesses how many different clusters
relevance and diversity. Ground truth was generated by a            from the ground truth are represented among the top X
small group of expert annotators with advanced knowledge            results provided by the retrieval system. Precision at X
of location characteristics. Software tools were specifically       (P@X) and the harmonic mean of CR@X and P@X (i.e.,
designed to facilitate the annotation process. The annota-          F1-measure@X) are used as secondary metrics. P@X mea-
tion process was not time restricted.                               sures the number of relevant photos among the top X re-
   For relevance annotation, annotators were asked to label         sults. F1-measure@X combines CR@X and P@X and gives
each photo (one at a time) as being relevant (value 1), non-        and overall assessment of both diversity and relevance. Par-
relevant (0) or with “don’t know” (-1). To help with their          ticipants were provided with these metrics computed at dif-
decisions, annotators were recommended to consult any ad-           ferent cutoff points, namely X∈ {5, 10, 20, 30, 40, 50}. The
ditional information source during the evaluation (e.g., from       official ranking was computed for X=10 (CR@10, P@10, F1-
the Internet). Final ground truth was determined after a            measure@10).
majority voting scheme. The devset was annotated by 6
persons. The average inter-annotator agreement (Weighted
                                                                    7. CONCLUSIONS
Kappa) for the annotations of the keywords data was 0.68               The Retrieving Diverse Social Images Task provides par-
(σ = 0.07) and for keywordsGPS data was 0.61 (σ = 0.08).            ticipants with a comparative and collaborative evaluation
The testset was annotated by 7 persons, each expert an-             framework for social image retrieval techniques with explicit
notated a different part of the data set leading in the end         focus on result diversification, relevance and summarization.
to 3 annotations per image. The average inter-annotator             Details on the methods and results of each individual par-
agreement (Free-Marginal Multirater Fleiss’ Kappa) for the          ticipant team can be found in the working note papers of
annotation of the keywords data was 0.86 and for keywords-          the MediaEval 2013 workshop proceedings.
GPS data was 0.75.
   Diversity annotation was carried out only for the photos         Acknowledgments
that were judged as relevant in the previous step. For each         This task is supported by the following projects: EXCEL
location, annotators were provided with a thumbnail list of         POSDRU, CUbRIK5 , PROMISE6 and MUCKE7 . Many thanks
all relevant photos. After getting familiar with their content,     to the task supporters for their precious help: Anca-Livia
they were asked to re-group the photos into similar visual          Radu, Bogdan Boteanu, Ivan Eggel, Sajan Raj Ojha, Oana
appearance clusters (up to 20) and then tag these clusters          Pleş, Ionuţ Mironică, Ionuţ Duţă, Andrei Purica, Macovei
with appropriate keywords. The devset was annotated by              Corina and Irina Nicolae.
3 persons and the testset by 4. In this case, each person
annotated distinct parts of the data leading to only one an-        8. REFERENCES
notation in the end.                                                    [1] S. Rudinac, A. Hanjalic, M.A. Larson, “Generating
   To explore differences between expert and non-expert an-                 Visual Summaries of Geographic Areas Using
notations, an additional crowd-sourcing annotated relevance                 Community-Contributed Images”, IEEE Trans. on
and diversity ground truth was generated for a selection of                 Multimedia, 15(4), pp. 921-932, 2013.
50 locations via CrowdFlower platform4 .                                [2] R.H. van Leuken, L. Garcia, X. Olivares, R. van Zwol,
                                                                            “Visual Diversification of Image Search Results”, ACM
5.   RUN DESCRIPTION                                                        Int. Conf. on World Wide Web, pp. 341-350, 2009.
  Participants were allowed to submit up to 5 runs. The first           [3] M.L. Paramita, M. Sanderson, P. Clough, “Diversity
3 are required runs: run1 - automated approaches using vi-                  in Photo Retrieval: Overview of the ImageCLEF
3
                                                                            Photo Task 2009”, ImageCLEF 2009.
  all the provided photos are under Creative Com-
mons licenses of type 1 to 7 that allow redis-
tribution    (see    http://www.flickr.com/services/                5
api/flickr.photos.licenses.getInfo.html/         and                  http://www.cubrikproject.eu/
                                                                    6
http://creativecommons.org/).                                         http://www.promise-noe.eu/
4                                                                   7
  http://crowdflower.com/                                             http://www.chistera.eu/projects/mucke/