Retrieving Diverse Social Images at MediaEval 2015: Challenge, Dataset and Evaluation

Bogdan Ionescu*, LAPI, University Politehnica of Bucharest, Romania, bionescu@alpha.imag.pub.ro
Alexandru Lucian Gînscă†, CEA, LIST, France, alexandru.ginsca@cea.fr
Bogdan Boteanu‡, LAPI, University Politehnica of Bucharest, Romania, bboteanu@alpha.imag.pub.ro
Adrian Popescu†, CEA, LIST, France, adrian.popescu@cea.fr
Mihai Lupu†, Vienna University of Technology, Austria, lupu@ifs.tuwien.ac.at
Henning Müller, HES-SO, Sierre, Switzerland, henning.mueller@hevs.ch

* This work is supported by the European Science Foundation, activity on "Evaluating Information Access Systems".
† This work is supported by the CHIST-ERA FP7 MUCKE - Multimodal User Credibility and Knowledge Extraction project (http://ifs.tuwien.ac.at/~mucke/).
‡ This work has been funded by the Ministry of European Funds through the Financial Agreement POSDRU 187/1.5/S/155420.

Copyright is held by the author/owner(s).
MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany

ABSTRACT
This paper provides an overview of the Retrieving Diverse Social Images task that is organized as part of the MediaEval 2015 Benchmarking Initiative for Multimedia Evaluation. The task addresses the problem of result diversification and user annotation credibility estimation in the context of social photo retrieval. We present the task challenges, the proposed data set and ground truth, the required participant runs and the evaluation metrics.

1. INTRODUCTION
An efficient image retrieval system should present results that are both relevant and that cover different aspects, i.e., diverse facets, of the query. Relevance has been studied more thoroughly in the literature than diversification [1, 2, 3], and even though a considerable amount of diversification literature exists, the topic remains important, especially in social multimedia [4, 5, 6, 7].

The 2015 Retrieving Diverse Social Images task is a follow-up of last years' editions [9, 8, 10] and aims to foster new technology for improving both the relevance and the diversification of search results, with explicit emphasis on the actual social media context. The task was designed to be interesting for researchers working in either machine-based or human-based media analysis, including areas such as image retrieval (text, vision, multimedia communities), re-ranking, machine learning, relevance feedback, natural language processing, crowdsourcing and automatic geo-tagging.

2. TASK DESCRIPTION
The task is built around a tourist use case where a person tries to find more information about a place she is potentially visiting. Before deciding whether this location suits her needs, the person is interested in getting a more complete and diversified visual description of the place.

Participants are required to develop algorithms that automatically refine a list of images returned by Flickr in response to a query. Compared to the previous editions, this year's task includes not only single-topic queries (i.e., formulations such as the name of a location), but also multi-concept queries related to events and states associated with locations. The requirement of the task is to refine these results by providing a ranked list of up to 50 photos that are both relevant and diverse representations of the query, according to the following definitions:

Relevance: a photo is considered relevant if it is a common photo representation of all query concepts at once. This includes sub-locations (e.g., indoor/outdoor views, close-ups), temporal information (e.g., historical shots, times of day), typical actors/objects (e.g., people who frequent the location, vehicles), genesis information (e.g., images showing how something got the way it is) and image style information (e.g., creative views). Low-quality photos (e.g., severely blurred or out of focus) as well as photos with people as the main subject (e.g., a big picture of me in front of the monument) are not considered relevant in this scenario.

Diversity: a set of photos is considered diverse if it depicts different visual characteristics of the target concepts (e.g., sub-locations, temporal information, typical actors/objects, genesis and style information) with a certain degree of complementarity, i.e., most of the perceived visual information is different from one photo to another.

To carry out the refinement and diversification tasks, participants may use the social metadata associated with the images, the visual characteristics of the images, information related to user tagging credibility (an estimation of the global quality of tag-image content relationships for a user's contributions) or external resources (e.g., the Internet).
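For illustration, one common baseline for this kind of relevance-diversity trade-off (not prescribed by the task) is a greedy Maximal Marginal Relevance style re-ranking: repeatedly pick the photo that best balances its relevance score against its similarity to the photos already selected. The sketch below assumes precomputed relevance scores and L2-normalized visual descriptors (e.g., CNN features); all names (greedy_diversify, lam) are illustrative.

import numpy as np

def greedy_diversify(candidates, relevance, features, k=50, lam=0.7):
    """Greedy MMR-style re-ranking: trade relevance against redundancy.

    candidates: list of photo ids; relevance: dict id -> score in [0, 1];
    features: dict id -> L2-normalized descriptor (numpy array)."""
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < k:
        def mmr_score(pid):
            # Cosine similarity to the closest already-selected photo
            # (0.0 while nothing has been selected yet).
            redundancy = max((float(np.dot(features[pid], features[s]))
                              for s in selected), default=0.0)
            return lam * relevance[pid] - (1.0 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

With lam = 1.0 this reduces to a pure relevance ranking; lowering lam sacrifices some precision for broader cluster coverage, which is exactly the trade-off the official metric (Section 6) rewards.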
3. DATASET
The 2015 data consist of a development set (devset) containing 153 location queries (45,375 Flickr photos — the 2014 dataset [9]), a user annotation credibility set (credibilityset) containing information for ca. 300 locations and 685 users (different from the ones in devset and testset), and a test set (testset) containing 139 queries: 69 one-concept location queries (20,700 Flickr photos) and 70 multi-concept queries related to events and states associated with locations (20,694 Flickr photos).

Each query is provided with the following information: the query text formulation (used to retrieve the data), GPS coordinates (latitude and longitude in degrees — only for single-topic location queries), a link to a Wikipedia webpage (only when available), up to 5 representative photos from Wikipedia (only for single-topic location queries), a ranked list of up to 300 photos retrieved from Flickr using Flickr's default "relevance" algorithm (all photos are Creative Commons licensed, allowing redistribution, see http://creativecommons.org/), and an xml file containing the Flickr metadata of all retrieved photos (e.g., photo title, photo description, photo id, tags, Creative Commons license type, number of posted comments, the url of the photo page on Flickr, the photo owner's name, user id, the number of times the photo has been displayed, etc.).

Apart from the metadata, to facilitate participation from various communities, we also provide content descriptors:
- general-purpose visual descriptors (e.g., color, texture and feature information), identical to the ones provided in 2014 [10];
- convolutional neural network (CNN) based descriptors: generic descriptors computed with the reference CNN model distributed with the Caffe framework (this model is learned with the 1,000 ImageNet classes used during the ImageNet challenge), and adapted descriptors computed with a CNN model of identical architecture that is learned with 1,000 tourist points of interest classes whose images were automatically collected from the Web [11];
- text information, which consists, as in the previous edition, of term frequency information, document frequency information and their ratio, i.e., TF-IDF (used as in [12]);
- user annotation credibility descriptors, which give an automatic estimation of the quality of users' tag-image content relationships. These descriptors are extracted by visual or textual content mining: visualScore (a measure of user image relevance), faceProportion (the percentage of a user's images that show faces), tagSpecificity (the average specificity of a user's tags, where tag specificity is the percentage of users having annotated with that tag in a large Flickr corpus), locationSimilarity (the average similarity between a user's geotagged photos and a probabilistic model of the surrounding cell), photoCount (the total number of images a user shared), uniqueTags (the proportion of unique tags), uploadFrequency (the average time between two consecutive uploads), bulkProportion (the proportion of bulk taggings in a user's stream, i.e., of tag sets which appear identical for at least two distinct photos), meanPhotoViews (the mean number of times a user's images have been seen by other members of the community), meanTitleWordCounts (the mean number of words in the titles of a user's photos), meanTagsPerPhoto (the mean number of tags a user assigns per image), meanTagRank (the mean rank of a user's tags in a list in which tags are sorted in descending order of their number of appearances in a large subsample of Flickr images) and meanImageTagClarity (an adaptation of the Image Tag Clarity from [13], using a tf/idf language model as the individual tag language model).
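Several of these credibility descriptors are simple statistics over a user's photo stream. As a concrete illustration, the sketch below re-implements three of the textual ones (uniqueTags, uploadFrequency, bulkProportion); the Photo record is a hypothetical minimal input assumed for the example, not the released data format.

from collections import Counter
from dataclasses import dataclass

@dataclass
class Photo:
    # Hypothetical minimal record for one photo in a user's stream;
    # the released xml metadata contains many more fields.
    upload_time: float  # upload moment as a Unix timestamp
    tags: tuple         # tags the user assigned to this photo

def unique_tags(stream):
    # uniqueTags: proportion of distinct tags among all tag assignments.
    all_tags = [t for p in stream for t in p.tags]
    return len(set(all_tags)) / len(all_tags) if all_tags else 0.0

def upload_frequency(stream):
    # uploadFrequency: average time between two consecutive uploads.
    times = sorted(p.upload_time for p in stream)
    gaps = [b - a for a, b in zip(times, times[1:])]
    return sum(gaps) / len(gaps) if gaps else 0.0

def bulk_proportion(stream):
    # bulkProportion: share of photos whose exact tag set also appears
    # on at least one other photo of the same user (bulk tagging).
    if not stream:
        return 0.0
    counts = Counter(tuple(sorted(p.tags)) for p in stream)
    bulk = sum(1 for p in stream if counts[tuple(sorted(p.tags))] >= 2)
    return bulk / len(stream)

Descriptors such as visualScore or locationSimilarity additionally require visual and geographic models, so they are not reproduced here.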
4. GROUND TRUTH
Both relevance and diversity annotations were carried out by expert annotators with advanced knowledge of the location characteristics (mainly acquired during last years' tasks and from Internet sources). For relevance, annotators were asked to label each photo (one at a time) as relevant (value 1), non-relevant (0) or "don't know" (-1). For devset, 11 annotators were involved; for credibilityset, 9; and for testset, 7 for the single-topic queries and 5 for the multi-topic queries. Each annotator annotated a different part of the data, leading in the end to 3 different annotations for each photo. The final relevance ground truth was determined with a lenient majority voting scheme.

For diversity, only the photos judged relevant in the previous step were considered. For each location, annotators were provided with a thumbnail list of all relevant photos. After getting familiar with their contents, they were asked to re-group the photos into clusters of similar visual appearance (up to 25). Devset and testset were annotated by 3 persons, each of them annotating distinct parts of the data (leading to only one annotation). An additional annotator acted as a master annotator and reviewed the final annotations once more.
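For concreteness, a minimal sketch of the relevance aggregation is given below. The overview does not spell out the tie-breaking rule, so the sketch assumes one lenient reading: "don't know" votes are discarded and ties are resolved in favor of relevance. This interpretation is an assumption, not a documented part of the annotation protocol.

def lenient_majority_vote(labels):
    """Aggregate the 3 per-photo annotations (1 = relevant,
    0 = non-relevant, -1 = don't know) into one ground-truth label.

    Assumption: "don't know" votes are dropped and exact ties count
    as relevant (one lenient reading of majority voting)."""
    votes = [l for l in labels if l != -1]  # drop "don't know"
    if not votes:
        return -1  # no usable judgment for this photo
    return 1 if 2 * sum(votes) >= len(votes) else 0

# Example: two relevant votes and one "don't know" -> relevant.
assert lenient_majority_vote([1, 1, -1]) == 1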
5. RUN DESCRIPTION
Participants were allowed to submit up to 5 runs. The first 3 are required runs: run1 — automated, using visual information only; run2 — automated, using text information only; and run3 — automated, using fused text-visual information, without resources other than those provided by the organizers. The last 2 runs are general runs: run4 — automated, using user annotation credibility descriptors (either the ones provided by the organizers or ones computed by the participants); and run5 — everything allowed, e.g., human-based or hybrid human-machine approaches, including the use of data from external sources (e.g., the Internet). For generating run1 to run4, participants are allowed to use only information that can be extracted from the provided data (e.g., the provided descriptors, descriptors of their own, etc.). This also includes the Wikipedia webpages of the locations (via their links).

6. EVALUATION
Performance is assessed for both diversity and relevance. The following metrics are computed: Cluster Recall at X (CR@X) — a measure that assesses how many of the different clusters from the ground truth are represented among the top X results (only relevant images are considered); Precision at X (P@X) — the proportion of relevant photos among the top X results; and F1-measure at X (F1@X) — the harmonic mean of the previous two. Various cutoff points are considered, i.e., X = 5, 10, 20, 30, 40, 50. The official ranking metric is F1@20, which gives equal importance to diversity (via CR@20) and relevance (via P@20). This metric simulates the content of a single page of a typical Web image search engine and reflects user behavior, i.e., inspecting the first page of results with priority.
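These definitions translate directly into code. The sketch below is an illustrative single-query implementation, not the official evaluation tool; the input structures (a ranked list of photo ids, the set of relevant ids and a mapping from relevant ids to their ground-truth cluster) are assumptions made for the example.

def precision_at(ranked, relevant, x):
    """P@X: proportion of relevant photos among the top X results."""
    return sum(1 for pid in ranked[:x] if pid in relevant) / x

def cluster_recall_at(ranked, cluster_of, x):
    """CR@X: fraction of ground-truth clusters represented in the top X.
    cluster_of maps relevant photo ids to cluster ids; non-relevant
    photos are absent from it, so they cannot contribute a cluster."""
    seen = {cluster_of[pid] for pid in ranked[:x] if pid in cluster_of}
    total = len(set(cluster_of.values()))
    return len(seen) / total if total else 0.0

def f1_at(ranked, relevant, cluster_of, x=20):
    """F1@X: harmonic mean of P@X and CR@X; F1@20 is the official metric."""
    p = precision_at(ranked, relevant, x)
    cr = cluster_recall_at(ranked, cluster_of, x)
    return 2 * p * cr / (p + cr) if (p + cr) else 0.0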
7. CONCLUSIONS
The 2015 Retrieving Diverse Social Images task provides participants with a comparative and collaborative evaluation framework for social image retrieval techniques, with an explicit focus on result diversification. This year in particular, the task also explores the diversification of multi-concept queries. Details on the methods and results of each individual participant team can be found in the working note papers of the MediaEval 2015 workshop proceedings.

8. REFERENCES
[1] A.W.M. Smeulders, M. Worring, S. Santini, A. Gupta, R. Jain, "Content-based Image Retrieval at the End of the Early Years", IEEE Trans. on Pattern Analysis and Machine Intelligence, 22(12), pp. 1349-1380, 2000.
[2] R. Datta, D. Joshi, J. Li, J.Z. Wang, "Image Retrieval: Ideas, Influences, and Trends of the New Age", ACM Computing Surveys, 40(2), pp. 1-60, 2008.
[3] R. Priyatharshini, S. Chitrakala, "Association Based Image Retrieval: A Survey", Mobile Communication and Power Engineering, Springer Communications in Computer and Information Science, 296, pp. 17-26, 2013.
[4] R.H. van Leuken, L. Garcia, X. Olivares, R. van Zwol, "Visual Diversification of Image Search Results", ACM World Wide Web, pp. 341-350, 2009.
[5] M.L. Paramita, M. Sanderson, P. Clough, "Diversity in Photo Retrieval: Overview of the ImageCLEF Photo Task 2009", ImageCLEF 2009.
[6] B. Taneva, M. Kacimi, G. Weikum, "Gathering and Ranking Photos of Named Entities with High Precision, High Recall, and Diversity", ACM Web Search and Data Mining, pp. 431-440, 2010.
[7] S. Rudinac, A. Hanjalic, M.A. Larson, "Generating Visual Summaries of Geographic Areas Using Community-Contributed Images", IEEE Trans. on Multimedia, 15(4), pp. 921-932, 2013.
[8] B. Ionescu, A.-L. Radu, M. Menéndez, H. Müller, A. Popescu, B. Loni, "Div400: A Social Image Retrieval Result Diversification Dataset", ACM MMSys, Singapore, 2014.
[9] B. Ionescu, A. Popescu, M. Lupu, A.L. Gînscă, B. Boteanu, H. Müller, "Div150Cred: A Social Image Retrieval Result Diversification with User Tagging Credibility Dataset", ACM MMSys, Portland, Oregon, USA, 2015.
[10] B. Ionescu, A. Popescu, A.-L. Radu, H. Müller, "Result Diversification in Social Image Retrieval: A Benchmarking Framework", Multimedia Tools and Applications, 2014.
[11] E. Spyromitros-Xioufis, S. Papadopoulos, A. Gînscă, A. Popescu, I. Kompatsiaris, I. Vlahavas, "Improving Diversity in Image Search via Supervised Relevance Scoring", ACM Int. Conf. on Multimedia Retrieval, Shanghai, China, 2015.
[12] B. Ionescu, A. Popescu, M. Lupu, A.L. Gînscă, H. Müller, "Retrieving Diverse Social Images at MediaEval 2014: Challenge, Dataset and Evaluation", CEUR-WS, Vol. 1263, http://ceur-ws.org/Vol-1263/mediaeval2014_submission_1.pdf, Spain, 2014.
[13] A. Sun, S.S. Bhowmick, "Image Tag Clarity: in Search of Visual-Representative Tags for Social Images", SIGMM Workshop on Social Media, 2009.