Retrieving Diverse Social Images at MediaEval 2014: Challenge, Dataset and Evaluation

Bogdan Ionescu, LAPI, University Politehnica of Bucharest, Romania, bionescu@alpha.imag.pub.ro
Adrian Popescu, CEA, LIST, France, adrian.popescu@cea.fr
Mihai Lupu, Vienna University of Technology, Austria, lupu@ifs.tuwien.ac.at
Alexandru Lucian Gînscă, CEA, LIST, France, alexandru.ginsca@cea.fr
Henning Müller, HES-SO, Sierre, Switzerland, henning.mueller@hevs.ch

ABSTRACT
This paper provides an overview of the Retrieving Diverse Social Images task that is organized as part of the MediaEval 2014 Benchmarking Initiative for Multimedia Evaluation. The task addresses the problem of result diversification in the context of social photo retrieval. We present the task challenges, the proposed data set and ground truth, the required participant runs and the evaluation metrics.

1. INTRODUCTION
An efficient image retrieval system should be able to present results that are both relevant and cover diverse aspects of a query (e.g., sub-topics). Relevance has been studied more thoroughly in the existing literature than diversification, and even though a considerable amount of diversification literature exists, the topic remains an important one, especially in social media. The 2014 Retrieving Diverse Social Images task is a follow-up of last year's edition [1][2][3] and aims to foster new technology for improving both the relevance and the diversification of search results, with explicit emphasis on the actual social media context. It creates an evaluation framework specifically designed to encourage the emergence of new diversification solutions from areas such as information retrieval (text, vision and multimedia), re-ranking, relevance feedback and crowdsourcing.

2. TASK DESCRIPTION
The task is built around a tourist use case in which a person tries to find more information about a place she is potentially visiting. The person has only a vague idea about the location, knowing just the name of the place. She uses the name to learn additional facts about the place from the Internet, for instance by visiting a Wikipedia page, e.g., getting a photo, the geographical position of the place and basic descriptions. Before deciding whether this location suits her needs, the person is interested in getting a more complete and diversified visual description of the place.
In this task, participants receive a list of photos for a certain location retrieved from Flickr and ranked with Flickr's default "relevance" algorithm. These results are typically noisy and redundant. The task requires participants to refine these results by providing a set of images (up to 50) that are at the same time relevant and provide a diversified summary, according to the following definitions:
Relevance: a photo is considered relevant if it is a common photo representation of the location, e.g., different views at different times of the day/year and under different weather conditions, inside views, close-ups on architectural details, drawings, sketches, creative views, etc., which contain the target location partially or entirely. Bad quality photos (e.g., severely blurred, out of focus, etc.) as well as photos with people as the main subject (e.g., a big picture of me in front of the monument) are not considered relevant;
Diversity: a set of photos is considered diverse if it depicts different visual characteristics of the target location, as stated by the relevance definition above, with a certain degree of complementarity, i.e., most of the perceived visual information differs from one photo to another.
The refinement and diversification process is to be based on the social metadata associated with the images and/or on the visual characteristics of the images. New for this year, we provide information about user annotation credibility. Credibility is determined as an automatic estimation of the quality (correctness) of a particular user's tags. Participants are allowed to exploit this credibility estimation or to compute their own, in addition to classical retrieval techniques.
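To make the expected system output more concrete, the following minimal sketch shows one generic way a participant system could refine a Flickr result list: photos are ordered by an estimated relevance score and then greedily selected so that each new photo is sufficiently dissimilar from those already kept. This is only an illustrative baseline under assumed inputs (the `relevance_score` and `visual_distance` functions and the 0.3 threshold are hypothetical), not a method prescribed by the task.

```python
# Illustrative greedy refinement: keep up to 50 photos that are both
# relevant and mutually diverse. All names and thresholds are hypothetical.

def refine_results(ranked_photos, relevance_score, visual_distance,
                   max_results=50, min_distance=0.3):
    """ranked_photos: photo ids in the initial Flickr order.
    relevance_score(p): estimated relevance of photo p (higher is better).
    visual_distance(p, q): dissimilarity between two photos, in [0, 1]."""
    # Re-rank candidates by estimated relevance.
    candidates = sorted(ranked_photos, key=relevance_score, reverse=True)
    selected = []
    for photo in candidates:
        if len(selected) >= max_results:
            break
        # Keep a photo only if it adds visual information not already covered.
        if all(visual_distance(photo, kept) >= min_distance for kept in selected):
            selected.append(photo)
    return selected
```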
3. DATASET
The 2014 data set is constructed around the 2013 data [1][2] and consists of ca. 300 locations (e.g., monuments, cathedrals, bridges, sites, etc.) spread over 35 countries around the world. The data is divided into a development set, devset, containing 30 locations, intended for designing the approaches; a test set, testset, containing 123 locations, to be used for the official evaluation; and an additional credibilityset, with ca. 300 locations and 685 users (chosen to be different from the ones in devset and testset), used to train the credibility descriptors. All the data was retrieved from Flickr using the name of the location as query.
Each location comes with: the name of the location, its GPS coordinates, a link to a Wikipedia webpage, up to 5 representative photos from Wikipedia, a ranked list of up to 300 photos retrieved from Flickr using Flickr's default "relevance" algorithm (devset provides 8,923 images and testset 36,452; all the photos are under Creative Commons licenses that allow redistribution, see http://creativecommons.org/), and an xml file containing Flickr metadata for all the retrieved photos (e.g., photo title, photo description, photo id, tags, Creative Commons license type, number of posted comments, the url of the photo on Flickr, the photo owner's name, user id, the number of times the photo has been displayed, etc.).
Apart from the metadata, the dataset also contains content descriptors (visual, text and credibility based). Visual descriptors include the same general purpose descriptors (e.g., color, texture and feature information) as in 2013 [3]. Text information consists this year of term frequency information (the number of occurrences of the term in the entity's text fields), document frequency information (the number of entities which have this term in their text fields) and their ratio, i.e., TF-IDF. Text descriptors are computed on a per dataset basis as well as on a per image, per location and per user basis. User annotation credibility descriptors provide an automatic estimation of the quality of tag-image content relationships. This information gives an indication about which users are most likely to share relevant images on Flickr according to the underlying task scenario. The following descriptors are provided: visualScore (a measure of user image relevance), faceProportion (the percentage of a user's images with faces), tagSpecificity (the average specificity of a user's tags, where tag specificity is the percentage of users having annotated with that tag in a large Flickr corpus), locationSimilarity (the average similarity between a user's geotagged photos and a probabilistic model of the surrounding cell), photoCount (the total number of images a user shared), uniqueTags (the proportion of unique tags), uploadFrequency (the average time between two consecutive uploads) and bulkProportion (the proportion of bulk taggings in a user's stream, i.e., of tag sets which appear identical for at least two distinct photos).
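As a concrete reading of the text descriptors described above, the sketch below computes term frequency, document frequency and their ratio for a small collection of entities (e.g., images, locations or users), each represented by the terms found in its text fields. It is only a minimal illustration of the definitions given in this section; the input format and the plain tf/df ratio without further normalization are assumptions, and the officially released descriptors may involve additional preprocessing.

```python
from collections import Counter

def text_descriptors(entities):
    """entities: dict mapping an entity id (e.g., a photo or a location)
    to the list of terms found in its text fields (title, description, tags).
    Returns per-entity TF-IDF values and the global document frequencies."""
    # Document frequency: number of entities whose text fields contain the term.
    df = Counter()
    for terms in entities.values():
        df.update(set(terms))

    descriptors = {}
    for entity_id, terms in entities.items():
        tf = Counter(terms)  # term frequency within this entity's text fields
        # TF-IDF taken here literally as the ratio between TF and DF.
        descriptors[entity_id] = {term: tf[term] / df[term] for term in tf}
    return descriptors, df

# Toy usage:
# descs, df = text_descriptors({"photo_1": ["tower", "paris", "night"],
#                               "photo_2": ["tower", "eiffel"]})
```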
4. GROUND TRUTH
Both relevance and diversity annotations were carried out by expert annotators with advanced knowledge of the location characteristics (mainly gained from last year's task and from Internet sources). Specifically designed visual tools were employed to facilitate the annotation process. Annotation was not time restricted.
For relevance, annotators were asked to label each photo (one at a time) as relevant (value 1), non-relevant (0) or "don't know" (-1). To help with their decisions, annotators were able to consult any additional information source during the evaluation (e.g., representative photos, the Internet, etc.). For devset, 3 annotators were involved, while testset and credibilityset used 11 and 9 annotators, respectively, who annotated different parts of the data, leading in the end to 3 different annotations. The final ground truth was determined with a lenient majority voting scheme.
For diversity, only the photos judged as relevant in the previous step were considered. For each location, annotators were provided with a thumbnail list of all relevant photos. After getting familiar with their contents, they were asked to re-group the photos into clusters (up to 25) with similar visual appearance and to tag these clusters with appropriate keywords that justify their choices. Devset was annotated by 2 persons and testset by 3. Each person annotated distinct parts of the data, leading to only one annotation. An additional annotator acted as a master annotator and reviewed the final annotations once more.

5. RUN DESCRIPTION
Participants are allowed to submit up to 5 runs. The first 3 are required runs: run1 - automated, using visual information only; run2 - automated, using text information only; and run3 - automated, using fused text-visual information without resources other than those provided by the organizers. The last 2 runs are general runs: run4 - automated, using user annotation credibility descriptors (either the ones provided by the organizers or ones computed by the participants); and run5 - everything allowed, e.g., human-based or hybrid human-machine approaches, including the use of data from external sources (e.g., the Internet). For generating run1 to run4, participants are allowed to use only information that can be extracted from the provided data (e.g., the provided descriptors, descriptors of their own, etc.). This also includes the Wikipedia webpages of the locations (provided via their links).

6. EVALUATION
Performance is assessed for both diversity and relevance. The following metrics are computed: Cluster Recall at X (CR@X), a measure that assesses how many different clusters from the ground truth are represented among the top X results (only relevant images are considered); Precision at X (P@X), which measures the number of relevant photos among the top X results; and F1-measure at X (F1@X), the harmonic mean of the previous two. Various cut-off points are considered, i.e., X = 5, 10, 20, 30, 40, 50.
The official ranking metric is F1@20, which gives equal importance to diversity (via CR@20) and relevance (via P@20). This metric simulates the content of a single page of a typical Web image search engine and reflects user behavior, i.e., inspecting the first page of results with priority.
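The sketch below illustrates one straightforward way to compute P@X, CR@X and F1@X as defined above, given a system's ranked list, the set of relevant photos and the ground-truth cluster of each relevant photo. The normalization of cluster recall by the total number of ground-truth clusters is an assumption of this example; the official evaluation tool may differ in such details.

```python
def precision_cluster_recall_f1(ranked, relevant, cluster_of, x=20):
    """ranked: photo ids in the order returned by a system.
    relevant: set of photo ids judged relevant in the ground truth.
    cluster_of: dict mapping each relevant photo id to its ground-truth cluster.
    Returns (P@X, CR@X, F1@X)."""
    top = ranked[:x]
    relevant_in_top = [p for p in top if p in relevant]

    # Precision: share of relevant photos among the top X results.
    p_at_x = len(relevant_in_top) / x

    # Cluster recall: distinct ground-truth clusters covered by the relevant
    # photos in the top X, out of all clusters annotated for the location.
    all_clusters = set(cluster_of.values())
    found_clusters = {cluster_of[p] for p in relevant_in_top}
    cr_at_x = len(found_clusters) / len(all_clusters) if all_clusters else 0.0

    # F1: harmonic mean of precision and cluster recall.
    if p_at_x + cr_at_x == 0:
        return 0.0, 0.0, 0.0
    f1_at_x = 2 * p_at_x * cr_at_x / (p_at_x + cr_at_x)
    return p_at_x, cr_at_x, f1_at_x
```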
7. CONCLUSIONS
The Retrieving Diverse Social Images task provides participants with a comparative and collaborative evaluation framework for social image retrieval techniques with an explicit focus on result diversification. This year in particular, the task also explores the benefits of employing automatically estimated user annotation credibility information for the diversification task. Details on the methods and results of each individual participant team can be found in the working note papers of the MediaEval 2014 workshop proceedings.

Acknowledgments
This task is supported by the following projects: MUCKE (http://ifs.tuwien.ac.at/~mucke/), CUbRIK (http://www.cubrikproject.eu/) and PROMISE (http://www.promise-noe.eu/).

8. REFERENCES
[1] B. Ionescu, A.-L. Radu, M. Menéndez, H. Müller, A. Popescu, B. Loni, "Div400: A Social Image Retrieval Result Diversification Dataset", ACM MMSys, Singapore, 2014.
[2] B. Ionescu, A. Popescu, H. Müller, M. Menéndez, A.-L. Radu, "Benchmarking Result Diversification in Social Image Retrieval", IEEE ICIP, France, 2014.
[3] B. Ionescu, M. Menéndez, H. Müller, A. Popescu, "Retrieving Diverse Social Images at MediaEval 2013: Objectives, Dataset and Evaluation", CEUR-WS, Vol. 1043, http://ceur-ws.org/Vol-1043/mediaeval2013_submission_3.pdf, Spain, 2013.