Retrieving Diverse Social Images at MediaEval 2015: Challenge, Dataset and Evaluation

Bogdan Ionescu*, LAPI, University Politehnica of Bucharest, Romania, bionescu@alpha.imag.pub.ro
Alexandru Lucian Gînscă†, CEA, LIST, France, alexandru.ginsca@cea.fr
Bogdan Boteanu‡, LAPI, University Politehnica of Bucharest, Romania, bboteanu@alpha.imag.pub.ro
Adrian Popescu†, CEA, LIST, France, adrian.popescu@cea.fr
Mihai Lupu†, Vienna University of Technology, Austria, lupu@ifs.tuwien.ac.at
Henning Müller, HES-SO, Sierre, Switzerland, henning.mueller@hevs.ch

* This work is supported by the European Science Foundation, activity on "Evaluating Information Access Systems".
† This work is supported by the CHIST-ERA FP7 MUCKE - Multimodal User Credibility and Knowledge Extraction project (http://ifs.tuwien.ac.at/~mucke/).
‡ This work has been funded by the Ministry of European Funds through the Financial Agreement POSDRU 187/1.5/S/155420.

Copyright is held by the author/owner(s).
MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany

ABSTRACT
This paper provides an overview of the Retrieving Diverse Social Images task that is organized as part of the MediaEval 2015 Benchmarking Initiative for Multimedia Evaluation. The task addresses the problem of result diversification and user annotation credibility estimation in the context of social photo retrieval. We present the task challenges, the proposed data set and ground truth, the required participant runs and the evaluation metrics.

1. INTRODUCTION
An efficient image retrieval system should present results that are both relevant and that cover different aspects, i.e., diverse facets, of the query. Relevance has been studied more thoroughly in the literature than diversification [1, 2, 3], and even though a considerable amount of diversification literature exists, the topic remains important, especially in social multimedia [4, 5, 6, 7].

The 2015 Retrieving Diverse Social Images task is a follow-up of last years' editions [9, 8, 10] and aims to foster new technology for improving both the relevance and the diversification of search results, with explicit emphasis on the actual social media context. The task was designed to be interesting for researchers working in either machine-based or human-based media analysis, including areas such as image retrieval (text, vision, multimedia communities), re-ranking, machine learning, relevance feedback, natural language processing, crowdsourcing and automatic geo-tagging.

2. TASK DESCRIPTION
The task is built around a tourist use case where a person tries to find more information about a place she is potentially visiting. Before deciding whether this location suits her needs, the person is interested in getting a more complete and diversified visual description of the place.

Participants are required to develop algorithms that automatically refine a list of images returned by Flickr in response to a query. Compared to the previous editions, this year's task includes not only single-topic queries (i.e., formulations such as the name of a location), but also multi-concept queries related to events and states associated with locations. The requirement of the task is to refine these results by providing a ranked list of up to 50 photos that are both relevant and diverse representations of the query, according to the following definitions:

Relevance: a photo is considered relevant if it is a common photo representation of all query concepts at once. This includes sub-locations (e.g., indoor/outdoor views, close-ups), temporal information (e.g., historical shots, times of day), typical actors/objects (e.g., people who frequent the location, vehicles), genesis information (e.g., images showing how something got the way it is) and image style information (e.g., creative views). Low-quality photos (e.g., severely blurred or out of focus) as well as photos with people as the main subject (e.g., a big picture of me in front of the monument) are not considered relevant in this scenario.

Diversity: a set of photos is considered diverse if it depicts different visual characteristics of the target concepts (e.g., sub-locations, temporal information, typical actors/objects, genesis and style information) with a certain degree of complementarity, i.e., most of the perceived visual information is different from one photo to another.

To carry out the refinement and diversification tasks, participants may use the social metadata associated with the images, the visual characteristics of the images, information related to user tagging credibility (an estimation of the global quality of tag-image content relationships for a user's contributions) or external resources (e.g., the Internet).
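For illustration, one common baseline for this kind of relevance-diversity trade-off (not prescribed by the task) is a greedy Maximal Marginal Relevance style re-ranking: repeatedly pick the photo that best balances its relevance score against its similarity to the photos already selected. The sketch below assumes precomputed relevance scores and L2-normalized visual descriptors (e.g., CNN features); all names (greedy_diversify, lam) are illustrative.

import numpy as np

def greedy_diversify(candidates, relevance, features, k=50, lam=0.7):
    """Greedy MMR-style re-ranking: trade relevance against redundancy.

    candidates: list of photo ids; relevance: dict id -> score in [0, 1];
    features: dict id -> L2-normalized descriptor (numpy array)."""
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < k:
        def mmr_score(pid):
            # Cosine similarity to the closest already-selected photo
            # (0.0 while nothing has been selected yet).
            redundancy = max((float(np.dot(features[pid], features[s]))
                              for s in selected), default=0.0)
            return lam * relevance[pid] - (1.0 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

With lam = 1.0 this reduces to a pure relevance ranking; lowering lam sacrifices some precision for broader cluster coverage, which is exactly the trade-off the official metric (Section 6) rewards.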
3. DATASET
The 2015 data consist of a development set (devset) containing 153 location queries (45,375 Flickr photos — the 2014 dataset [9]), a user annotation credibility set (credibilityset) containing information for ca. 300 locations and 685 users (different from the ones in devset and testset), and a test set (testset) containing 139 queries: 69 one-concept location queries (20,700 Flickr photos) and 70 multi-concept queries related to events and states associated with locations (20,694 Flickr photos).

Each query is provided with the following information: the query text formulation (used to retrieve the data), GPS coordinates (latitude and longitude in degrees — only for single-topic location queries), a link to a Wikipedia webpage (only when available), up to 5 representative photos from Wikipedia (only for single-topic location queries), a ranked list of up to 300 photos retrieved from Flickr using Flickr's default "relevance" algorithm (all photos are Creative Commons licensed, allowing redistribution, see http://creativecommons.org/), and an xml file containing the Flickr metadata of all retrieved photos (e.g., photo title, photo description, photo id, tags, Creative Commons license type, number of posted comments, the url of the photo page on Flickr, the photo owner's name, user id, the number of times the photo has been displayed, etc.).

Apart from the metadata, to facilitate participation from various communities, we also provide content descriptors:
- general-purpose visual descriptors (e.g., color, texture and feature information), identical to the ones provided in 2014 [10];
- convolutional neural network (CNN) based descriptors: generic descriptors computed with the reference CNN model distributed with the Caffe framework (this model is learned with the 1,000 ImageNet classes used during the ImageNet challenge), and adapted descriptors computed with a CNN model of identical architecture that is learned with 1,000 tourist points of interest classes whose images were automatically collected from the Web [11];
- text information, which consists, as in the previous edition, of term frequency information, document frequency information and their ratio, i.e., TF-IDF (used as in [12]);
- user annotation credibility descriptors, which give an automatic estimation of the quality of users' tag-image content relationships. These descriptors are extracted by visual or textual content mining: visualScore (a measure of user image relevance), faceProportion (the percentage of a user's images that show faces), tagSpecificity (the average specificity of a user's tags, where tag specificity is the percentage of users having annotated with that tag in a large Flickr corpus), locationSimilarity (the average similarity between a user's geotagged photos and a probabilistic model of the surrounding cell), photoCount (the total number of images a user shared), uniqueTags (the proportion of unique tags), uploadFrequency (the average time between two consecutive uploads), bulkProportion (the proportion of bulk taggings in a user's stream, i.e., of tag sets which appear identical for at least two distinct photos), meanPhotoViews (the mean number of times a user's images have been seen by other members of the community), meanTitleWordCounts (the mean number of words in the titles of a user's photos), meanTagsPerPhoto (the mean number of tags a user assigns per image), meanTagRank (the mean rank of a user's tags in a list in which tags are sorted in descending order of their number of appearances in a large subsample of Flickr images) and meanImageTagClarity (an adaptation of the Image Tag Clarity from [13], using a tf/idf language model as the individual tag language model).
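Several of these credibility descriptors are simple statistics over a user's photo stream. As a concrete illustration, the sketch below re-implements three of the textual ones (uniqueTags, uploadFrequency, bulkProportion); the Photo record is a hypothetical minimal input assumed for the example, not the released data format.

from collections import Counter
from dataclasses import dataclass

@dataclass
class Photo:
    # Hypothetical minimal record for one photo in a user's stream;
    # the released xml metadata contains many more fields.
    upload_time: float  # upload moment as a Unix timestamp
    tags: tuple         # tags the user assigned to this photo

def unique_tags(stream):
    # uniqueTags: proportion of distinct tags among all tag assignments.
    all_tags = [t for p in stream for t in p.tags]
    return len(set(all_tags)) / len(all_tags) if all_tags else 0.0

def upload_frequency(stream):
    # uploadFrequency: average time between two consecutive uploads.
    times = sorted(p.upload_time for p in stream)
    gaps = [b - a for a, b in zip(times, times[1:])]
    return sum(gaps) / len(gaps) if gaps else 0.0

def bulk_proportion(stream):
    # bulkProportion: share of photos whose exact tag set also appears
    # on at least one other photo of the same user (bulk tagging).
    if not stream:
        return 0.0
    counts = Counter(tuple(sorted(p.tags)) for p in stream)
    bulk = sum(1 for p in stream if counts[tuple(sorted(p.tags))] >= 2)
    return bulk / len(stream)

Descriptors such as visualScore or locationSimilarity additionally require visual and geographic models, so they are not reproduced here.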
4. GROUND TRUTH
Both relevance and diversity annotations were carried out by expert annotators with advanced knowledge of the location characteristics (mainly acquired during last years' tasks and from Internet sources). For relevance, annotators were asked to label each photo (one at a time) as relevant (value 1), non-relevant (0) or "don't know" (-1). For devset, 11 annotators were involved; for credibilityset, 9; and for testset, 7 for the single-topic queries and 5 for the multi-topic queries. Each annotator annotated a different part of the data, leading in the end to 3 different annotations for each photo. The final relevance ground truth was determined with a lenient majority voting scheme.

For diversity, only the photos judged relevant in the previous step were considered. For each location, annotators were provided with a thumbnail list of all relevant photos. After getting familiar with their contents, they were asked to re-group the photos into clusters of similar visual appearance (up to 25). Devset and testset were annotated by 3 persons, each of them annotating distinct parts of the data (leading to only one annotation). An additional annotator acted as a master annotator and reviewed the final annotations once more.
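For concreteness, a minimal sketch of the relevance aggregation is given below. The overview does not spell out the tie-breaking rule, so the sketch assumes one lenient reading: "don't know" votes are discarded and ties are resolved in favor of relevance. This interpretation is an assumption, not a documented part of the annotation protocol.

def lenient_majority_vote(labels):
    """Aggregate the 3 per-photo annotations (1 = relevant,
    0 = non-relevant, -1 = don't know) into one ground-truth label.

    Assumption: "don't know" votes are dropped and exact ties count
    as relevant (one lenient reading of majority voting)."""
    votes = [l for l in labels if l != -1]  # drop "don't know"
    if not votes:
        return -1  # no usable judgment for this photo
    return 1 if 2 * sum(votes) >= len(votes) else 0

# Example: two relevant votes and one "don't know" -> relevant.
assert lenient_majority_vote([1, 1, -1]) == 1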
5. RUN DESCRIPTION
Participants were allowed to submit up to 5 runs. The first 3 are required runs: run1 — automated, using visual information only; run2 — automated, using text information only; and run3 — automated, using fused text-visual information, without resources other than those provided by the organizers. The last 2 runs are general runs: run4 — automated, using user annotation credibility descriptors (either the ones provided by the organizers or ones computed by the participants); and run5 — everything allowed, e.g., human-based or hybrid human-machine approaches, including the use of data from external sources (e.g., the Internet). For generating run1 to run4, participants are allowed to use only information that can be extracted from the provided data (e.g., the provided descriptors, descriptors of their own, etc.). This also includes the Wikipedia webpages of the locations (via their links).

6. EVALUATION
Performance is assessed for both diversity and relevance. The following metrics are computed: Cluster Recall at X (CR@X) — a measure that assesses how many of the different clusters from the ground truth are represented among the top X results (only relevant images are considered); Precision at X (P@X) — the proportion of relevant photos among the top X results; and F1-measure at X (F1@X) — the harmonic mean of the previous two. Various cutoff points are considered, i.e., X = 5, 10, 20, 30, 40, 50. The official ranking metric is F1@20, which gives equal importance to diversity (via CR@20) and relevance (via P@20). This metric simulates the content of a single page of a typical Web image search engine and reflects user behavior, i.e., inspecting the first page of results with priority.
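These definitions translate directly into code. The sketch below is an illustrative single-query implementation, not the official evaluation tool; the input structures (a ranked list of photo ids, the set of relevant ids and a mapping from relevant ids to their ground-truth cluster) are assumptions made for the example.

def precision_at(ranked, relevant, x):
    """P@X: proportion of relevant photos among the top X results."""
    return sum(1 for pid in ranked[:x] if pid in relevant) / x

def cluster_recall_at(ranked, cluster_of, x):
    """CR@X: fraction of ground-truth clusters represented in the top X.
    cluster_of maps relevant photo ids to cluster ids; non-relevant
    photos are absent from it, so they cannot contribute a cluster."""
    seen = {cluster_of[pid] for pid in ranked[:x] if pid in cluster_of}
    total = len(set(cluster_of.values()))
    return len(seen) / total if total else 0.0

def f1_at(ranked, relevant, cluster_of, x=20):
    """F1@X: harmonic mean of P@X and CR@X; F1@20 is the official metric."""
    p = precision_at(ranked, relevant, x)
    cr = cluster_recall_at(ranked, cluster_of, x)
    return 2 * p * cr / (p + cr) if (p + cr) else 0.0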
7. CONCLUSIONS
The 2015 Retrieving Diverse Social Images task provides participants with a comparative and collaborative evaluation framework for social image retrieval techniques, with an explicit focus on result diversification. This year in particular, the task also explores the diversification of multi-concept queries. Details on the methods and results of each individual participant team can be found in the working note papers of the MediaEval 2015 workshop proceedings.

8. REFERENCES
[1] A.W.M. Smeulders, M. Worring, S. Santini, A. Gupta, R. Jain, "Content-based Image Retrieval at the End of the Early Years", IEEE Trans. on Pattern Analysis and Machine Intelligence, 22(12), pp. 1349-1380, 2000.
[2] R. Datta, D. Joshi, J. Li, J.Z. Wang, "Image Retrieval: Ideas, Influences, and Trends of the New Age", ACM Computing Surveys, 40(2), pp. 1-60, 2008.
[3] R. Priyatharshini, S. Chitrakala, "Association Based Image Retrieval: A Survey", Mobile Communication and Power Engineering, Springer Communications in Computer and Information Science, 296, pp. 17-26, 2013.
[4] R.H. van Leuken, L. Garcia, X. Olivares, R. van Zwol, "Visual Diversification of Image Search Results", ACM World Wide Web, pp. 341-350, 2009.
[5] M.L. Paramita, M. Sanderson, P. Clough, "Diversity in Photo Retrieval: Overview of the ImageCLEF Photo Task 2009", ImageCLEF 2009.
[6] B. Taneva, M. Kacimi, G. Weikum, "Gathering and Ranking Photos of Named Entities with High Precision, High Recall, and Diversity", ACM Web Search and Data Mining, pp. 431-440, 2010.
[7] S. Rudinac, A. Hanjalic, M.A. Larson, "Generating Visual Summaries of Geographic Areas Using Community-Contributed Images", IEEE Trans. on Multimedia, 15(4), pp. 921-932, 2013.
[8] B. Ionescu, A.-L. Radu, M. Menéndez, H. Müller, A. Popescu, B. Loni, "Div400: A Social Image Retrieval Result Diversification Dataset", ACM MMSys, Singapore, 2014.
[9] B. Ionescu, A. Popescu, M. Lupu, A.L. Gînscă, B. Boteanu, H. Müller, "Div150Cred: A Social Image Retrieval Result Diversification with User Tagging Credibility Dataset", ACM MMSys, Portland, Oregon, USA, 2015.
[10] B. Ionescu, A. Popescu, A.-L. Radu, H. Müller, "Result Diversification in Social Image Retrieval: A Benchmarking Framework", Multimedia Tools and Applications, 2014.
[11] E. Spyromitros-Xioufis, S. Papadopoulos, A. Gînscă, A. Popescu, I. Kompatsiaris, I. Vlahavas, "Improving Diversity in Image Search via Supervised Relevance Scoring", ACM Int. Conf. on Multimedia Retrieval, Shanghai, China, 2015.
[12] B. Ionescu, A. Popescu, M. Lupu, A.L. Gînscă, H. Müller, "Retrieving Diverse Social Images at MediaEval 2014: Challenge, Dataset and Evaluation", CEUR-WS, Vol. 1263, http://ceur-ws.org/Vol-1263/mediaeval2014_submission_1.pdf, Spain, 2014.
[13] A. Sun, S.S. Bhowmick, "Image Tag Clarity: in Search of Visual-Representative Tags for Social Images", SIGMM Workshop on Social Media, 2009.