Retrieving Diverse Social Images at MediaEval 2017: Challenges, Dataset and Evaluation

Maia Zaharieva (TU Wien, Austria), Bogdan Ionescu (University Politehnica of Bucharest, Romania), Alexandru Lucian Gînscă (CEA LIST, France), Rodrygo L.T. Santos (Universidade Federal de Minas Gerais, Brazil), Henning Müller (University of Applied Sciences Western Switzerland, Switzerland)
maia.zaharieva@tuwien.ac.at, bionescu@imag.pub.ro, alexandru.ginsca@cea.fr, rodrygo@dcc.ufmg.br, henning.mueller@hevs.ch

Copyright held by the owner/author(s). MediaEval'17, 13-15 September 2017, Dublin, Ireland

ABSTRACT
This paper provides an overview of the Retrieving Diverse Social Images task that is organized as part of the MediaEval 2017 Benchmarking Initiative for Multimedia Evaluation. The task addresses the challenge of visual diversification of image retrieval results, where images, metadata, user tagging profiles, and content and text models are available for processing. We present the task challenges, the employed dataset and ground truth information, the required runs, and the considered evaluation metrics.

1 INTRODUCTION
An efficient image retrieval system should be able to present results that are both relevant to the provided query and that cover different visual aspects of it. The diversification of image search results can considerably increase the probability that a system addresses a broad range of user information needs. In general, diversification is an actively researched problem in various domains, ranging from web search and query result diversification [9, 15, 19] to recommender systems [16, 17] and summarization [13, 14]. With the growing availability of publicly shared images, the importance of diversifying image data is steadily increasing. The task is especially challenging for real-world queries, which are often complex and consist of multiple concepts.

The 2017 Retrieving Diverse Social Images task is a follow-up of the 2016 edition [8] and fosters the development of new techniques for improving both the relevance and the visual diversification of image search results. The task is designed to support the evaluation and comparison of approaches emerging from a wide range of research fields, such as information retrieval (text, vision, and multimedia communities), machine learning, relevance feedback, and natural language processing.

2 TASK DESCRIPTION
The task is built around the use case of a general ad-hoc image retrieval system, which provides the user with visually diversified representations of query results (see, for instance, Google Image Search, https://images.google.com/). Given a ranked list of up to 300 query-related images retrieved from Flickr (https://www.flickr.com) using text-based queries, participants are required to refine the results by providing a set of images that are relevant to the query and, at the same time, represent a visually diversified summary of it. The queries include complex and general-purpose, multi-concept queries (e.g., "dancing on the street", "trees reflected in water", "sailing boat"). The queries in the development set result from a broad user study and are constructed around the data of the MediaEval 2016 Retrieving Diverse Social Images task [8]. The queries in the test set were collected using Google Trends (http://trends.google.com/) for image search (worldwide, last 5 years: 2012-2017).

The goal of the task is to refine the image set retrieved for a given text-based query by providing a ranked list of up to 50 photos that are both relevant and visually diversified representations of the query, according to the following definitions:

Relevance: an image is considered relevant for the query if it is a common visual representation of the query topics (all at once). Bad quality photos (e.g., severely blurred, out of focus) are not considered relevant in this scenario;

Diversity: a set of images is considered diverse if it depicts different visual characteristics of the query topics and subtopics with a certain degree of complementarity, i.e., most of the perceived visual information differs from one image to another.

3 DATA DESCRIPTION
The data consists of a development set (devset) with 110 queries (32,487 images) and a test set (testset) with 84 queries (24,986 images). An additional dataset (credibilityset) provides credibility estimation for ca. 685 users and metadata for more than 3.5M images. We also provide semantic vectors for general English terms computed on top of the English Wikipedia (https://en.wikipedia.org/) (wikiset), which could help participants to develop advanced text models.

Each query is accompanied by the following information: the query text formulation (the actual query formulation used on Flickr to retrieve the data), a ranked list of up to 300 images in JPEG format retrieved from Flickr using Flickr's default "relevance" algorithm (all images are redistributable under Creative Commons licenses, http://creativecommons.org/), an XML file containing Flickr metadata for the retrieved images, and ground truth for both relevance and diversity.

To facilitate participation from various communities, we also provide the following content-based descriptors:
- general purpose, visual-based descriptors extracted using the LIRE library (http://www.lire-project.net/) [10]: auto color correlogram (ACC) [12], color and edge directivity descriptor (CEDD) [5], fuzzy color and texture histogram (FCTH) [6], Gabor texture, joint composite descriptor (JCD) [4], several MPEG-7 features including color layout, edge histogram, and scalable color [11], pyramid of histograms of orientation gradients (PHOG) [2], and speeded up robust features (SURF) [1];
- convolutional neural network (CNN)-based descriptors based on the reference model provided with the Caffe framework (http://caffe.berkeleyvision.org/). The descriptors are extracted from the last fully connected layer (fc7);
- text-based features, including term frequency and document frequency information and their ratio (TF-IDF). The text-based features are computed on a per-image, per-query, and per-user basis;
- user annotation credibility descriptors, which provide an estimation of the quality of the users' tag-image content relationships. The following descriptors are provided: visualScore (measure of user image relevance), faceProportion (the percentage of images with faces), tagSpecificity (average specificity of a user's tags, where tag specificity is the percentage of users having annotated with that tag in a large Flickr corpus), locationSimilarity (average similarity between a user's geotagged photos and a probabilistic model of a surrounding cell), photoCount (total number of images a user shared), uniqueTags (proportion of unique tags), uploadFrequency (average time between two consecutive uploads), bulkProportion (the proportion of bulk taggings in a user's stream, i.e., of tag sets that appear identical for at least two distinct photos), meanPhotoViews (mean number of times a user's image has been seen by other members of the community), meanTitleWordCounts (mean number of words found in the titles associated with a user's photos), meanTagsPerPhoto (mean number of tags users assign to their images), meanTagRank (mean rank of a user's tags in a list in which the tags are sorted in descending order according to the number of appearances in a large subsample of Flickr images), and meanImageTagClarity (an adaptation of the Image Tag Clarity from [18] using a TF-IDF language model as individual tag language model).

4 GROUND TRUTH
Both relevance and diversity annotations were carried out by 17 human annotators. The data were distributed among the annotators such that each query was labeled by three different annotators. For relevance, annotators were asked to label each image (one at a time) as being relevant to the underlying query (value 1), non-relevant (0), or with "don't know" (-1). The final relevance ground truth score was determined using a majority voting scheme. For diversity, only the images that were judged as relevant in the previous step were considered. For each query, annotators were provided with a thumbnail list of all relevant images. After getting familiar with their contents, they were asked to re-group the images into clusters with similar visual appearance (up to 25 clusters in total). In contrast to the single relevance score for each query, in terms of diversity we consider all three annotations as correct (ground truth), as they typically depict different possibilities to group the images, representing different points of view.

5 RUN DESCRIPTION
Participants were allowed to submit up to five runs. The first three are required (dedicated) runs: run1, an automated run using visual information only; run2, an automated run using text information only; and run3, an automated run using both visual and text information. For the generation of run1 to run3, only information that can be extracted from the provided data (e.g., the provided descriptors, descriptors of their own, etc.) is allowed to be used. The last two runs, run4 and run5, are general ones, i.e., any approach is allowed, e.g., human-based or hybrid human-machine approaches, including the use of data from external sources, such as the Internet or pre-trained models obtained from external datasets related to this task.

6 EVALUATION
Performance is assessed for both diversity and relevance using cluster recall at X (CR@X), precision at X (P@X), and their harmonic mean F1@X. CR@X is the ratio of the number of clusters from the ground truth that are represented in the top X results; it thus reflects the diversification quality of a given image result set. We compute CR@X for each of the available ground truth diversity annotations and, for each query, select the one that maximizes CR@X. Since the clusters in the ground truth consider relevant images only, the relevance of the top X results is implicitly measured by CR@X. Nevertheless, P@X provides a more precise view on the relevance of a particular image set, since it directly measures the relevance among the top X images. We consider various cut-off points, i.e., X = {5, 10, 20, 30, 40, 50}. Additionally, we consider two further evaluation metrics that are well established in the information retrieval community: the intent-aware expected reciprocal rank (ERR-IA@X) [3] and the α-normalized discounted cumulative gain (α-nDCG@X) [7].

The official ranking metric is F1@20, which gives equal importance to diversity (via CR@20) and relevance (via P@20). This metric simulates the content of a single page of a typical Web image search engine and reflects user behavior, i.e., inspecting the first page of results with priority.

7 CONCLUSION
The 2017 Retrieving Diverse Social Images task provides participants with a comparative and collaborative evaluation benchmark for social image retrieval approaches focusing on visual-based diversification. The task explores diversification in the context of a challenging, ad-hoc image retrieval system, which should be able to tackle complex and general-purpose multi-concept queries. This year, we explicitly accounted for the possibility of having multiple different views on a given retrieval result, which might all be subjectively correct. This allows for an investigation of the aspect of subjectivity in the perception of diversification in a next step. Details on the methods and results of the participating teams can be found in the working note papers of the MediaEval 2017 workshop proceedings.

REFERENCES
[1] Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool. 2008. Speeded-Up Robust Features (SURF). Computer Vision and Image Understanding 110, 3 (2008), 346–359. https://doi.org/10.1016/j.cviu.2007.09.014
[2] Anna Bosch, Andrew Zisserman, and Xavier Munoz. 2007. Representing Shape with a Spatial Pyramid Kernel. In ACM International Conference on Image and Video Retrieval (CIVR). ACM, New York, NY, USA, 401–408. https://doi.org/10.1145/1282280.1282340
[3] Olivier Chapelle, Donald Metlzer, Ya Zhang, and Pierre Grinspan. 2009. Expected Reciprocal Rank for Graded Relevance. In ACM Conference on Information and Knowledge Management (CIKM). ACM, New York, NY, USA, 621–630. https://doi.org/10.1145/1645953.1646033
[4] Savvas A. Chatzichristofis, Yiannis S. Boutalis, and Mathias Lux. 2009. Selection of the proper compact composite descriptor for improving content based image retrieval. In Signal Processing, Pattern Recognition and Applications (SPPRA). ACTA Press, 134–140.
[5] Savvas A. Chatzichristofis and Yiannis S. Boutalis. 2008.
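The relevance/diversity trade-off defined above is commonly attacked with greedy re-ranking. As an illustration only (this is not a method prescribed by the task), a minimal maximal-marginal-relevance-style sketch, where the `relevance` scores and the pairwise visual `distance` function are hypothetical inputs assumed to come from a participant's own models:

```python
def mmr_rerank(candidates, relevance, distance, k=50, lam=0.7):
    """Greedy MMR-style re-ranking: trade off query relevance
    against dissimilarity to the images already selected.

    candidates: list of image ids (the initial Flickr ranking)
    relevance:  dict id -> relevance score (assumed precomputed)
    distance:   callable (id, id) -> visual dissimilarity in [0, 1]
    k:          size of the refined list (the task asks for up to 50)
    lam:        relevance/diversity trade-off weight
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(img):
            # Diversity term: distance to the closest already-picked image.
            div = min((distance(img, s) for s in selected), default=1.0)
            return lam * relevance[img] + (1 - lam) * div
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `lam=1.0` this degenerates to plain relevance ranking; lowering `lam` pushes near-duplicate images down the list.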
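The provided text-based features are TF-IDF weights computed per image, per query, and per user. A toy sketch of how such weights are typically derived (the actual release format of the feature files is not reproduced here; names are illustrative):

```python
import math
from collections import Counter

def tf_idf(doc_tokens, corpus):
    """Toy TF-IDF weighting for one document's tokens.

    doc_tokens: list of terms for one document (e.g., one image's tags)
    corpus:     list of token lists, one per document
    """
    n_docs = len(corpus)
    df = Counter()                      # document frequency per term
    for doc in corpus:
        df.update(set(doc))
    tf = Counter(doc_tokens)            # term frequency in this document
    return {t: (tf[t] / len(doc_tokens)) * math.log(n_docs / df[t])
            for t in tf}
```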
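The evaluation metrics can be sketched directly. A minimal illustration, assuming a ranked result list, the set of relevant image ids, and one diversity annotation mapping relevant images to cluster ids (under the official protocol, CR@X would be computed for each of the three annotations and the maximum kept per query):

```python
def precision_at(ranked, relevant, x):
    """P@X: fraction of the top X results that are relevant."""
    return sum(1 for img in ranked[:x] if img in relevant) / x

def cluster_recall_at(ranked, clusters, x):
    """CR@X: fraction of ground-truth clusters represented in the top X
    results; clusters maps relevant image id -> cluster id."""
    total = len(set(clusters.values()))
    seen = {clusters[img] for img in ranked[:x] if img in clusters}
    return len(seen) / total

def f1_at(ranked, relevant, clusters, x):
    """Harmonic mean of P@X and CR@X (official metric uses X = 20)."""
    p = precision_at(ranked, relevant, x)
    cr = cluster_recall_at(ranked, clusters, x)
    return 0.0 if p + cr == 0 else 2 * p * cr / (p + cr)
```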
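The majority-voting step for the relevance ground truth can be sketched as follows; the paper does not state how ties involving "don't know" votes are broken, so this sketch simply leaves them unresolved:

```python
from collections import Counter

def majority_relevance(labels):
    """Resolve per-image annotations (1 relevant, 0 non-relevant,
    -1 "don't know") into a single ground-truth label by majority vote.
    Returns None when no strict majority exists (tie handling is an
    assumption, not specified by the task organizers)."""
    counts = Counter(labels)
    top = counts.most_common(2)
    if len(top) > 1 and top[0][1] == top[1][1]:
        return None  # no majority among the annotators
    return top[0][0]
```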
CEDD: Color and Edge Directivity Descriptor: A Compact Descriptor for Image Indexing and Retrieval. In International Conference on Computer Vision Systems (ICVS). Springer-Verlag, Berlin, Heidelberg, 312–322. https://doi.org/10.1007/978-3-540-79547-6_30
[6] Savvas A. Chatzichristofis and Yiannis S. Boutalis. 2008. FCTH: Fuzzy Color and Texture Histogram - A Low Level Feature for Accurate Image Retrieval. In International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS). IEEE Computer Society, Washington, DC, USA, 191–196. https://doi.org/10.1109/WIAMIS.2008.24
[7] Charles L.A. Clarke, Maheedhar Kolla, Gordon V. Cormack, Olga Vechtomova, Azin Ashkan, Stefan Büttcher, and Ian MacKinnon. 2008. Novelty and Diversity in Information Retrieval Evaluation. In International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, NY, USA, 659–666. https://doi.org/10.1145/1390334.1390446
[8] Bogdan Ionescu, Alexandru Lucian Ginsca, Maia Zaharieva, Bogdan Boteanu, Mihai Lupu, and Henning Müller. 2016. Retrieving Diverse Social Images at MediaEval 2016: Challenge, Dataset and Evaluation. In MediaEval 2016 Multimedia Benchmark Workshop, Vol. 1739. CEUR-WS.org.
[9] Bogdan Ionescu, Adrian Popescu, Anca-Livia Radu, and Henning Müller. 2016. Result diversification in social image retrieval: a benchmarking framework. Multimedia Tools and Applications 75, 2 (2016), 1301–1331. https://doi.org/10.1007/s11042-014-2369-4
[10] Mathias Lux. 2011. Content Based Image Retrieval with LIRe. In ACM International Conference on Multimedia. ACM, New York, NY, USA, 735–738. https://doi.org/10.1145/2072298.2072432
[11] B.S. Manjunath, J.-R. Ohm, V.V. Vasudevan, and A. Yamada. 2001. Color and texture descriptors. IEEE Transactions on Circuits and Systems for Video Technology 11, 6 (2001), 703–715. https://doi.org/10.1109/76.927424
[12] Mandar Mitra, Ramin Zabih, Jing Huang, Wei-Jing Zhu, and S. Ravi Kumar. 1997. Image Indexing Using Color Correlograms.
In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, Washington, DC, USA, 762–768. https://doi.org/10.1109/CVPR.1997.609412
[13] Yanwei Pang, Qiang Hao, Yuan Yuan, Tanji Hu, Rui Cai, and Lei Zhang. 2011. Summarizing Tourist Destinations by Mining User-generated Travelogues and Photos. Computer Vision and Image Understanding 115, 3 (2011), 352–363. https://doi.org/10.1016/j.cviu.2010.10.010
[14] S. Rudinac, A. Hanjalic, and M. Larson. 2013. Generating Visual Summaries of Geographic Areas Using Community-Contributed Images. IEEE Transactions on Multimedia 15, 4 (2013), 921–932. https://doi.org/10.1109/TMM.2013.2237896
[15] Rodrygo L. T. Santos, Craig Macdonald, and Iadh Ounis. 2015. Search result diversification. Foundations and Trends in Information Retrieval 9, 1 (2015), 1–90. https://doi.org/10.1561/1500000040
[16] Markus Schedl and David Hauger. 2015. Tailoring Music Recommendations to Users by Considering Diversity, Mainstreaminess, and Novelty. In International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, NY, USA, 947–950. https://doi.org/10.1145/2766462.2767763
[17] Yue Shi, Xiaoxue Zhao, Jun Wang, Martha Larson, and Alan Hanjalic. 2012. Adaptive Diversification of Recommendation Results via Latent Factor Portfolio. In International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, NY, USA, 175–184. https://doi.org/10.1145/2348283.2348310
[18] Aixin Sun and Sourav S. Bhowmick. 2009. Image Tag Clarity: In Search of Visual-representative Tags for Social Images. In SIGMM Workshop on Social Media. ACM, New York, NY, USA, 19–26. https://doi.org/10.1145/1631144.1631150
[19] Kaiping Zheng, Hongzhi Wang, Zhixin Qi, Jianzhong Li, and Hong Gao. 2016. A survey of query result diversification. Knowledge and Information Systems 51, 1 (2016), 1–36. https://doi.org/10.1007/s10115-016-0990-4