Retrieving Diverse Social Images at MediaEval 2014: Challenge, Dataset and Evaluation

Bogdan Ionescu, LAPI, University Politehnica of Bucharest, Romania, bionescu@alpha.imag.pub.ro
Adrian Popescu, CEA, LIST, France, adrian.popescu@cea.fr
Mihai Lupu, Vienna University of Technology, Austria, lupu@ifs.tuwien.ac.at
Alexandru Lucian Gînscă, CEA, LIST, France, alexandru.ginsca@cea.fr
Henning Müller, HES-SO, Sierre, Switzerland, henning.mueller@hevs.ch

ABSTRACT
This paper provides an overview of the Retrieving Diverse Social Images task that is organized as part of the MediaEval 2014 Benchmarking Initiative for Multimedia Evaluation. The task addresses the problem of result diversification in the context of social photo retrieval. We present the task challenges, the proposed data set and ground truth, the required participant runs and the evaluation metrics.

1. INTRODUCTION
An efficient image retrieval system should be able to present results that are both relevant and cover diverse aspects of a query (e.g., sub-topics). Relevance has been studied more thoroughly in the existing literature than diversification, and even though a considerable amount of diversification literature exists, the topic remains an important one, especially in social media. The 2014 Retrieving Diverse Social Images task is a follow-up of last year's edition [1][2][3] and aims to foster new technology for improving both the relevance and the diversification of search results, with explicit emphasis on the actual social media context. It creates an evaluation framework specifically designed to encourage the emergence of new diversification solutions from areas such as information retrieval (text, vision and multimedia), re-ranking, relevance feedback and crowdsourcing.

2. TASK DESCRIPTION
The task is built around a tourist use case in which a person tries to find more information about a place she is potentially visiting. The person has only a vague idea about the location, knowing just the name of the place. She uses the name to learn additional facts about the place from the Internet, for instance by visiting a Wikipedia page, e.g., getting a photo, the geographical position of the place and basic descriptions. Before deciding whether this location suits her needs, the person is interested in getting a more complete and diversified visual description of the place.
In this task, participants receive a list of photos for a certain location retrieved from Flickr and ranked with Flickr's default "relevance" algorithm. These results are typically noisy and redundant. The task requires participants to refine these results by providing a set of images (up to 50) that are at the same time relevant and provide a diversified summary, according to the following definitions:
Relevance: a photo is considered relevant if it is a common photo representation of the location, e.g., different views at different times of the day/year and under different weather conditions, inside views, close-ups on architectural details, drawings, sketches, creative views, etc., which contain the target location partially or entirely. Bad quality photos (e.g., severely blurred, out of focus, etc.) as well as photos with people as the main subject (e.g., a big picture of me in front of the monument) are not considered relevant;
Diversity: a set of photos is considered diverse if it depicts different visual characteristics of the target location, as stated by the relevance definition above, with a certain degree of complementarity, i.e., most of the perceived visual information differs from one photo to another.
The refinement and diversification process is to be based on the social metadata associated with the images and/or on the visual characteristics of the images. New for this year, we provide information about user annotation credibility. Credibility is determined as an automatic estimation of the quality (correctness) of a particular user's tags. Participants are allowed to exploit this credibility estimation or to compute their own, in addition to classical retrieval techniques.
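To make the expected system output more concrete, the following minimal sketch shows one generic way a participant system could refine a Flickr result list: photos are ordered by an estimated relevance score and then greedily selected so that each new photo is sufficiently dissimilar from those already kept. This is only an illustrative baseline under assumed inputs (the `relevance_score` and `visual_distance` functions and the 0.3 threshold are hypothetical), not a method prescribed by the task.

```python
# Illustrative greedy refinement: keep up to 50 photos that are both
# relevant and mutually diverse. All names and thresholds are hypothetical.

def refine_results(ranked_photos, relevance_score, visual_distance,
                   max_results=50, min_distance=0.3):
    """ranked_photos: photo ids in the initial Flickr order.
    relevance_score(p): estimated relevance of photo p (higher is better).
    visual_distance(p, q): dissimilarity between two photos, in [0, 1]."""
    # Re-rank candidates by estimated relevance.
    candidates = sorted(ranked_photos, key=relevance_score, reverse=True)
    selected = []
    for photo in candidates:
        if len(selected) >= max_results:
            break
        # Keep a photo only if it adds visual information not already covered.
        if all(visual_distance(photo, kept) >= min_distance for kept in selected):
            selected.append(photo)
    return selected
```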
3. DATASET
The 2014 data set is constructed around the 2013 data [1][2] and consists of ca. 300 locations (e.g., monuments, cathedrals, bridges, sites, etc.) spread over 35 countries around the world. The data is divided into a development set, devset, containing 30 locations, intended for designing the approaches; a test set, testset, containing 123 locations, to be used for the official evaluation; and an additional credibilityset, with ca. 300 locations and 685 users (chosen to be different from the ones in devset and testset), used to train the credibility descriptors. All the data was retrieved from Flickr using the name of the location as query.
Each location comes with: the name of the location, its GPS coordinates, a link to a Wikipedia webpage, up to 5 representative photos from Wikipedia, a ranked list of up to 300 photos retrieved from Flickr using Flickr's default "relevance" algorithm (devset provides 8,923 images and testset 36,452; all the photos are under Creative Commons licenses that allow redistribution, see http://creativecommons.org/), and an xml file containing Flickr metadata for all the retrieved photos (e.g., photo title, photo description, photo id, tags, Creative Commons license type, number of posted comments, the url of the photo on Flickr, the photo owner's name, user id, the number of times the photo has been displayed, etc.).
Apart from the metadata, the dataset also contains content descriptors (visual, text and credibility based). Visual descriptors include the same general purpose descriptors (e.g., color, texture and feature information) as in 2013 [3]. Text information consists this year of term frequency information (the number of occurrences of the term in the entity's text fields), document frequency information (the number of entities which have this term in their text fields) and their ratio, i.e., TF-IDF. Text descriptors are computed on a per dataset basis as well as on a per image, per location and per user basis. User annotation credibility descriptors provide an automatic estimation of the quality of tag-image content relationships. This information gives an indication about which users are most likely to share relevant images on Flickr according to the underlying task scenario. The following descriptors are provided: visualScore (a measure of user image relevance), faceProportion (the percentage of a user's images with faces), tagSpecificity (the average specificity of a user's tags, where tag specificity is the percentage of users having annotated with that tag in a large Flickr corpus), locationSimilarity (the average similarity between a user's geotagged photos and a probabilistic model of the surrounding cell), photoCount (the total number of images a user shared), uniqueTags (the proportion of unique tags), uploadFrequency (the average time between two consecutive uploads) and bulkProportion (the proportion of bulk taggings in a user's stream, i.e., of tag sets which appear identical for at least two distinct photos).
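As a concrete reading of the text descriptors described above, the sketch below computes term frequency, document frequency and their ratio for a small collection of entities (e.g., images, locations or users), each represented by the terms found in its text fields. It is only a minimal illustration of the definitions given in this section; the input format and the plain tf/df ratio without further normalization are assumptions, and the officially released descriptors may involve additional preprocessing.

```python
from collections import Counter

def text_descriptors(entities):
    """entities: dict mapping an entity id (e.g., a photo or a location)
    to the list of terms found in its text fields (title, description, tags).
    Returns per-entity TF-IDF values and the global document frequencies."""
    # Document frequency: number of entities whose text fields contain the term.
    df = Counter()
    for terms in entities.values():
        df.update(set(terms))

    descriptors = {}
    for entity_id, terms in entities.items():
        tf = Counter(terms)  # term frequency within this entity's text fields
        # TF-IDF taken here literally as the ratio between TF and DF.
        descriptors[entity_id] = {term: tf[term] / df[term] for term in tf}
    return descriptors, df

# Toy usage:
# descs, df = text_descriptors({"photo_1": ["tower", "paris", "night"],
#                               "photo_2": ["tower", "eiffel"]})
```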
4. GROUND TRUTH
Both relevance and diversity annotations were carried out by expert annotators with advanced knowledge of the location characteristics (mainly gained from last year's task and from Internet sources). Specifically designed visual tools were employed to facilitate the annotation process. Annotation was not time restricted.
For relevance, annotators were asked to label each photo (one at a time) as relevant (value 1), non-relevant (0) or "don't know" (-1). To help with their decisions, annotators were able to consult any additional information source during the evaluation (e.g., representative photos, the Internet, etc.). For devset, 3 annotators were involved, while testset and credibilityset used 11 and 9 annotators, respectively, who annotated different parts of the data, leading in the end to 3 different annotations. The final ground truth was determined with a lenient majority voting scheme.
For diversity, only the photos judged as relevant in the previous step were considered. For each location, annotators were provided with a thumbnail list of all relevant photos. After getting familiar with their contents, they were asked to re-group the photos into clusters (up to 25) with similar visual appearance and to tag these clusters with appropriate keywords that justify their choices. Devset was annotated by 2 persons and testset by 3. Each person annotated distinct parts of the data, leading to only one annotation. An additional annotator acted as a master annotator and reviewed the final annotations once more.

5. RUN DESCRIPTION
Participants are allowed to submit up to 5 runs. The first 3 are required runs: run1 - automated, using visual information only; run2 - automated, using text information only; and run3 - automated, using fused text-visual information without resources other than those provided by the organizers. The last 2 runs are general runs: run4 - automated, using user annotation credibility descriptors (either the ones provided by the organizers or ones computed by the participants); and run5 - everything allowed, e.g., human-based or hybrid human-machine approaches, including the use of data from external sources (e.g., the Internet). For generating run1 to run4, participants are allowed to use only information that can be extracted from the provided data (e.g., the provided descriptors, descriptors of their own, etc.). This also includes the Wikipedia webpages of the locations (provided via their links).

6. EVALUATION
Performance is assessed for both diversity and relevance. The following metrics are computed: Cluster Recall at X (CR@X), a measure that assesses how many different clusters from the ground truth are represented among the top X results (only relevant images are considered); Precision at X (P@X), which measures the number of relevant photos among the top X results; and F1-measure at X (F1@X), the harmonic mean of the previous two. Various cut-off points are considered, i.e., X = 5, 10, 20, 30, 40, 50.
The official ranking metric is F1@20, which gives equal importance to diversity (via CR@20) and relevance (via P@20). This metric simulates the content of a single page of a typical Web image search engine and reflects user behavior, i.e., inspecting the first page of results with priority.
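The sketch below illustrates one straightforward way to compute P@X, CR@X and F1@X as defined above, given a system's ranked list, the set of relevant photos and the ground-truth cluster of each relevant photo. The normalization of cluster recall by the total number of ground-truth clusters is an assumption of this example; the official evaluation tool may differ in such details.

```python
def precision_cluster_recall_f1(ranked, relevant, cluster_of, x=20):
    """ranked: photo ids in the order returned by a system.
    relevant: set of photo ids judged relevant in the ground truth.
    cluster_of: dict mapping each relevant photo id to its ground-truth cluster.
    Returns (P@X, CR@X, F1@X)."""
    top = ranked[:x]
    relevant_in_top = [p for p in top if p in relevant]

    # Precision: share of relevant photos among the top X results.
    p_at_x = len(relevant_in_top) / x

    # Cluster recall: distinct ground-truth clusters covered by the relevant
    # photos in the top X, out of all clusters annotated for the location.
    all_clusters = set(cluster_of.values())
    found_clusters = {cluster_of[p] for p in relevant_in_top}
    cr_at_x = len(found_clusters) / len(all_clusters) if all_clusters else 0.0

    # F1: harmonic mean of precision and cluster recall.
    if p_at_x + cr_at_x == 0:
        return 0.0, 0.0, 0.0
    f1_at_x = 2 * p_at_x * cr_at_x / (p_at_x + cr_at_x)
    return p_at_x, cr_at_x, f1_at_x
```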
7. CONCLUSIONS
The Retrieving Diverse Social Images task provides participants with a comparative and collaborative evaluation framework for social image retrieval techniques with an explicit focus on result diversification. This year in particular, the task also explores the benefits of employing automatically estimated user annotation credibility information for the diversification task. Details on the methods and results of each individual participant team can be found in the working note papers of the MediaEval 2014 workshop proceedings.

Acknowledgments
This task is supported by the following projects: MUCKE (http://ifs.tuwien.ac.at/~mucke/), CUbRIK (http://www.cubrikproject.eu/) and PROMISE (http://www.promise-noe.eu/).

8. REFERENCES
[1] B. Ionescu, A.-L. Radu, M. Menéndez, H. Müller, A. Popescu, B. Loni, "Div400: A Social Image Retrieval Result Diversification Dataset", ACM MMSys, Singapore, 2014.
[2] B. Ionescu, A. Popescu, H. Müller, M. Menéndez, A.-L. Radu, "Benchmarking Result Diversification in Social Image Retrieval", IEEE ICIP, France, 2014.
[3] B. Ionescu, M. Menéndez, H. Müller, A. Popescu, "Retrieving Diverse Social Images at MediaEval 2013: Objectives, Dataset and Evaluation", CEUR-WS, Vol. 1043, http://ceur-ws.org/Vol-1043/mediaeval2013_submission_3.pdf, Spain, 2013.