Retrieving Diverse Social Images at MediaEval 2013: Objectives, Dataset and Evaluation

Bogdan Ionescu, LAPI, University Politehnica of Bucharest, Romania, bionescu@alpha.imag.pub.ro
María Menéndez, DISI, University of Trento, Italy, menendez@unitn.it
Henning Müller, HES-SO, Sierre, Switzerland, henning.mueller@hevs.ch
Adrian Popescu, CEA-LIST, France, adrian.popescu@cea.fr

Copyright is held by the author/owner(s). MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain.

ABSTRACT
This paper provides an overview of the Retrieving Diverse Social Images task that is organized as part of the MediaEval 2013 Benchmarking Initiative for Multimedia Evaluation. The task addresses the problem of result diversification in the context of social photo retrieval. We present the task challenges, the proposed data set and ground truth, the required participant runs and the evaluation metrics.

1. INTRODUCTION
The MediaEval 2013 Retrieving Diverse Social Images Task addresses the problem of result diversification in the context of social photo retrieval. Existing retrieval technology focuses almost exclusively on the accuracy of the results, which often provides the user with near replicas of the query. However, users expect to retrieve not only representative photos but also diverse results that depict the query in a comprehensive and complete manner. Another equally important aspect is that retrieval should focus on summarizing the query with a small set of images, since most users browse only the top retrieval results.

The task aims to foster new research in this area [1, 2] by creating a multi-modal evaluation framework specifically designed to encourage new solutions from various research areas, such as machine analysis, human-based approaches (e.g., crowd-sourcing) and hybrid machine-human approaches (e.g., relevance feedback). Compared to other existing tasks addressing diversity, e.g., ImageCLEF 2009 Photo Retrieval [3], the main novelty of this task is in addressing the social dimension, which is reflected both in the nature of the data (variable quality of photos and of metadata) and in the methods devised to retrieve it.

2. TASK DESCRIPTION
The task is built around a tourist use case in which a person tries to find more information about a place she is potentially visiting. The person has only a vague idea about the location, knowing just the name of the place. She uses the name to learn additional facts about the place from the Internet, for instance by visiting a Wikipedia page (http://en.wikipedia.org/), e.g., getting a photo, the geographical position of the place and a basic description. Before deciding whether this location suits her needs, the person is interested in getting a more complete visual description of the place.

In this task, participants receive a list of photos for a certain location retrieved from Flickr (http://www.flickr.com/) and ranked with Flickr's default "relevance" algorithm. These results are typically noisy and redundant. The task requires participants to refine these results by providing a ranked list of up to 50 photos that are both relevant and diverse representations of the query, according to the following definitions:

Relevance: a photo is relevant for the location if it is a common visual representation of the location, e.g., different views at different times of the day/year and under varying weather conditions, inside views, close-ups on architectural details, drawings, sketches, creative views, etc., which show the target location partially or entirely. Photos of poor quality (e.g., severely blurred or out of focus) as well as photos showing people in focus (e.g., a large picture of a person in front of the monument) are not considered relevant.

Diversity: a set of photos is considered diverse if it depicts different visual characteristics of the target location, e.g., different views at different times of the day/year and under varying weather conditions, inside views, close-ups on architectural details, creative views, etc., with a certain degree of complementarity, i.e., most of the perceived visual information differs from one photo to another.
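To make the expected participant output more concrete, the following is a minimal illustrative sketch of a naive diversification baseline; it is not part of the task definition and not any participant's system. Photos are traversed in the original Flickr rank order and a photo is kept only if its visual descriptor is sufficiently different from those of the photos already kept, until at most 50 photos are selected. All names and the distance threshold are hypothetical, and a real run would additionally filter non-relevant photos.

```python
import numpy as np

def diversify(photo_ids, descriptors, max_results=50, min_dist=0.5):
    """Greedy diversification sketch (illustrative only).

    photo_ids   -- photo ids in the original Flickr rank order
    descriptors -- dict: photo id -> 1-D numpy array (e.g., a provided
                   color histogram for that photo)
    Returns up to max_results ids whose unit-normalized descriptors are
    pairwise at least min_dist apart (Euclidean distance).
    """
    selected = []  # list of (photo_id, normalized descriptor)
    for pid in photo_ids:
        d = np.asarray(descriptors[pid], dtype=float)
        norm = np.linalg.norm(d)
        if norm == 0.0:
            continue  # skip degenerate descriptors
        d = d / norm
        if all(np.linalg.norm(d - kept) >= min_dist for _, kept in selected):
            selected.append((pid, d))
        if len(selected) == max_results:
            break
    return [pid for pid, _ in selected]
```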
DATASET dressing the social dimension that is reflected both in its The 2013 data set consists of 396 locations, spread over nature (variable quality of photos and of metadata) and in 34 countries around the world, ranging from very famous the methods devised to retrieve it. ones (e.g., “Eiffel Tower”) to lesser known monuments (e.g., “Palazzo delle Albere”). They are divided into a develop- 2. TASK DESCRIPTION ment set containing 50 locations (devset - to be used for The task is build around a tourist use case where a person designing and validating the proposed approaches) and a tries to find more information about a place she is poten- test set containing 346 locations (testset - to be used for the tially visiting. The person has only a vague idea about the official evaluation). Each of the two data sets contains data location, knowing the name of the place. She uses the name that was retrieved from Flickr using the name of the loca- to learn additional facts about the place from the Internet, tion as query (keywords), as well as using the name of the for instance by visiting a Wikipedia1 page, e.g., getting a location together with its GPS coordinates (keywordsGPS ). 1 For each location, the following information is provided: http://en.wikipedia.org/ the name of the location, its GPS coordinates, a link to a Wikipedia description webpage, a representative photo from Wikipedia, a ranked list of photos retrieved from Flickr (up Copyright is held by the author/owner(s). 2 MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain http://www.flickr.com/ to 150 photos per location; devset contains 5,118 images sual information only; run2 - automated approaches using while testset 38,300 images)3 , an xml file containing meta- textual information only; and run3 - automated approaches data from Flickr for all the retrieved photos (i.e., photo title, using textual-visual information fused without other resources photo description, photo id, tags, Creative Common license than provided by the organizers. The last 2 runs are general type, number of posted comments, the url link of the photo runs: run4 - human-based or hybrid human-machine ap- location from Flickr, the photo owner’s name and the num- proaches and run5 - everything allowed including using data ber of times the photo has been displayed), a set of global from external sources (e.g., Internet). For generating run1 visual descriptors automatically extracted from the photos to run4 participants are allowed to use only information that (i.e., color histograms, histogram of oriented gradients, color can be extracted from the provided data (e.g., provided con- moments, local binary patterns, MPEG-7 color structure tent descriptors, content descriptors of their own, etc). This descriptor, run-length matrix statistics and spatial pyramid includes also the Wikipedia webpage of the locations pro- representation of these descriptors) and several textual mod- vided via their links. For run5 everything is allowed, from els (i.e., probabilistic model, term frequency-inverse docu- the method point of view and information sources. ment frequency — TF-IDF; weighting and social TF-IDF weighting — an adaptation to the social space). 6. EVALUATION Performance is assessed for both diversity and relevance. 4. GROUND TRUTH The main evaluation metrics is cluster recall at X (CR@X) For each location, photos were manually annotated for [3] — a measure that assesses how many different clusters relevance and diversity. 
5. RUN DESCRIPTION
Participants were allowed to submit up to 5 runs. The first 3 are required runs: run1 - automated approaches using visual information only; run2 - automated approaches using textual information only; and run3 - automated approaches using fused textual-visual information, without resources other than those provided by the organizers. The last 2 runs are general runs: run4 - human-based or hybrid human-machine approaches, and run5 - everything allowed, including the use of data from external sources (e.g., the Internet). For generating run1 to run4, participants are allowed to use only information that can be extracted from the provided data (e.g., the provided content descriptors, content descriptors of their own, etc.); this also includes the Wikipedia webpages of the locations, which are provided via their links. For run5, everything is allowed, both in terms of methods and of information sources.

6. EVALUATION
Performance is assessed for both diversity and relevance. The main evaluation metric is cluster recall at X (CR@X) [3], a measure that assesses how many of the different clusters from the ground truth are represented among the top X results provided by the retrieval system. Precision at X (P@X) and the harmonic mean of CR@X and P@X (F1-measure@X) are used as secondary metrics. P@X measures the fraction of relevant photos among the top X results, and F1-measure@X combines CR@X and P@X to give an overall assessment of both diversity and relevance. Participants were provided with these metrics computed at different cutoff points, namely X ∈ {5, 10, 20, 30, 40, 50}. The official ranking was computed for X = 10 (CR@10, P@10, F1-measure@10).
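As a concrete reading of these definitions, the following is a minimal sketch of how the three metrics can be computed for a single location from a submitted ranking and the diversity ground truth. It illustrates the stated definitions and is not the organizers' official scoring script; all names are hypothetical, and it assumes the ground truth is available as a mapping from each relevant photo id to its cluster label.

```python
def evaluate_at_x(ranked_ids, cluster_of, x=10):
    """Compute P@X, CR@X and F1-measure@X for one location (sketch).

    ranked_ids -- submitted photo ids for the location, best first
    cluster_of -- dict mapping every ground-truth relevant photo id to its
                  ground-truth cluster label (non-relevant photos are absent)
    """
    top = ranked_ids[:x]
    relevant_in_top = [pid for pid in top if pid in cluster_of]
    p_at_x = len(relevant_in_top) / float(x)

    clusters_total = len(set(cluster_of.values()))
    clusters_found = len({cluster_of[pid] for pid in relevant_in_top})
    cr_at_x = clusters_found / float(clusters_total) if clusters_total else 0.0

    denom = p_at_x + cr_at_x
    f1_at_x = 2.0 * p_at_x * cr_at_x / denom if denom > 0 else 0.0
    return p_at_x, cr_at_x, f1_at_x
```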
7. CONCLUSIONS
The Retrieving Diverse Social Images Task provides participants with a comparative and collaborative evaluation framework for social image retrieval techniques with explicit focus on result diversification, relevance and summarization. Details on the methods and results of each individual participant team can be found in the working note papers of the MediaEval 2013 workshop proceedings.

Acknowledgments
This task is supported by the following projects: EXCEL POSDRU, CUbRIK (http://www.cubrikproject.eu/), PROMISE (http://www.promise-noe.eu/) and MUCKE (http://www.chistera.eu/projects/mucke/). Many thanks to the task supporters for their precious help: Anca-Livia Radu, Bogdan Boteanu, Ivan Eggel, Sajan Raj Ojha, Oana Pleş, Ionuţ Mironică, Ionuţ Duţă, Andrei Purica, Macovei Corina and Irina Nicolae.

8. REFERENCES
[1] S. Rudinac, A. Hanjalic, M.A. Larson, "Generating Visual Summaries of Geographic Areas Using Community-Contributed Images", IEEE Trans. on Multimedia, 15(4), pp. 921-932, 2013.
[2] R.H. van Leuken, L. Garcia, X. Olivares, R. van Zwol, "Visual Diversification of Image Search Results", ACM Int. Conf. on World Wide Web, pp. 341-350, 2009.
[3] M.L. Paramita, M. Sanderson, P. Clough, "Diversity in Photo Retrieval: Overview of the ImageCLEF Photo Task 2009", ImageCLEF 2009.