Retrieving Diverse Social Images at MediaEval 2017: Challenges, Dataset and Evaluation

Maia Zaharieva (TU Wien, Austria), Bogdan Ionescu (University Politehnica of Bucharest, Romania), Alexandru Lucian Gînscă (CEA LIST, France), Rodrygo L.T. Santos (Universidade Federal de Minas Gerais, Brazil), Henning Müller (University of Applied Sciences Western Switzerland, Switzerland)
maia.zaharieva@tuwien.ac.at, bionescu@imag.pub.ro, alexandru.ginsca@cea.fr, rodrygo@dcc.ufmg.br, henning.mueller@hevs.ch

Copyright held by the owner/author(s). MediaEval'17, 13-15 September 2017, Dublin, Ireland

ABSTRACT
This paper provides an overview of the Retrieving Diverse Social Images task that is organized as part of the MediaEval 2017 Benchmarking Initiative for Multimedia Evaluation. The task addresses the challenge of visual diversification of image retrieval results, where images, metadata, user tagging profiles, and content and text models are available for processing. We present the task challenges, the employed dataset and ground truth information, the required runs, and the considered evaluation metrics.

1 INTRODUCTION
An efficient image retrieval system should be able to present results that are both relevant to the provided query and that cover different visual aspects of it. The diversification of image search results can considerably increase the probability that a system addresses a broad range of user information needs. In general, diversification is an actively researched problem in various domains, ranging from web search and query result diversification [9, 15, 19] to recommender systems [16, 17] and summarization [13, 14]. With the growing availability of publicly shared images, the importance of diversifying image data is steadily increasing. The task is especially challenging for real-world queries, which are often complex and consist of multiple concepts.

The 2017 Retrieving Diverse Social Images task is a follow-up of the 2016 edition [8] and fosters the development of new techniques for improving both the relevance and the visual diversification of image search results. The task is designed to support the evaluation and comparison of approaches emerging from a wide range of research fields, such as information retrieval (text, vision, and multimedia communities), machine learning, relevance feedback, and natural language processing.

2 TASK DESCRIPTION
The task is built around the use case of a general ad-hoc image retrieval system, which provides the user with visually diversified representations of query results (see, for instance, Google Image Search, https://images.google.com/). Given a ranked list of up to 300 query-related images retrieved from Flickr (https://www.flickr.com) using text-based queries, participants are required to refine the results by providing a set of images that are relevant to the query and, at the same time, represent a visually diversified summary of it. The queries include complex and general-purpose, multi-concept queries (e.g., "dancing on the street", "trees reflected in water", "sailing boat"). The queries in the development set result from a broad user study and are constructed around the data of the MediaEval 2016 Retrieving Diverse Social Images task [8]. The queries in the test set were collected using Google Trends (http://trends.google.com/) for image search (worldwide, last 5 years: 2012-2017).

The goal of the task is to refine the image set retrieved for a given text-based query by providing a ranked list of up to 50 photos that are both relevant and visually diversified representations of the query, according to the following definitions:

Relevance: an image is considered relevant for the query if it is a common visual representation of the query topics (all at once). Bad quality photos (e.g., severely blurred, out of focus) are not considered relevant in this scenario;

Diversity: a set of images is considered diverse if it depicts different visual characteristics of the query topics and subtopics with a certain degree of complementarity, i.e., most of the perceived visual information differs from one image to another.

3 DATA DESCRIPTION
The data consists of a development set (devset) with 110 queries (32,487 images) and a test set (testset) with 84 queries (24,986 images). An additional dataset (credibilityset) provides credibility estimation for ca. 685 users and metadata for more than 3.5M images. We also provide semantic vectors for general English terms computed on top of the English Wikipedia (https://en.wikipedia.org/) (wikiset), which could help participants to develop advanced text models.

Each query is accompanied by the following information: the query text formulation (the actual query formulation used on Flickr to retrieve the data), a ranked list of up to 300 images in JPEG format retrieved from Flickr using Flickr's default "relevance" algorithm (all images are redistributable under Creative Commons licenses, http://creativecommons.org/), an XML file containing Flickr metadata for the retrieved images, and ground truth for both relevance and diversity.

To facilitate participation from various communities, we also provide the following content-based descriptors:
- general purpose, visual-based descriptors extracted using the LIRE library (http://www.lire-project.net/) [10]: auto color correlogram (ACC) [12], color and edge directivity descriptor (CEDD) [5], fuzzy color and texture histogram (FCTH) [6], Gabor texture, joint composite descriptor (JCD) [4], several MPEG-7 features including color layout, edge histogram, and scalable color [11], pyramid of histograms of orientation gradients (PHOG) [2], and speeded up robust features (SURF) [1];
- convolutional neural network (CNN)-based descriptors based on the reference model provided with the Caffe framework (http://caffe.berkeleyvision.org/). The descriptors are extracted from the last fully connected layer (fc7);
- text-based features, including term frequency and document frequency information and their ratio (TF-IDF). The text-based features are computed on a per-image, per-query, and per-user basis;
- user annotation credibility descriptors, which provide an estimation of the quality of the users' tag-image content relationships. The following descriptors are provided: visualScore (measure of user image relevance), faceProportion (the percentage of images with faces), tagSpecificity (average specificity of a user's tags, where tag specificity is the percentage of users having annotated with that tag in a large Flickr corpus), locationSimilarity (average similarity between a user's geotagged photos and a probabilistic model of a surrounding cell), photoCount (total number of images a user shared), uniqueTags (proportion of unique tags), uploadFrequency (average time between two consecutive uploads), bulkProportion (the proportion of bulk taggings in a user's stream, i.e., of tag sets that appear identical for at least two distinct photos), meanPhotoViews (mean number of times a user's image has been seen by other members of the community), meanTitleWordCounts (mean number of words found in the titles associated with a user's photos), meanTagsPerPhoto (mean number of tags users assign to their images), meanTagRank (mean rank of a user's tags in a list in which the tags are sorted in descending order according to the number of appearances in a large subsample of Flickr images), and meanImageTagClarity (an adaptation of the Image Tag Clarity from [18] using a TF-IDF language model as individual tag language model).

4 GROUND TRUTH
Both relevance and diversity annotations were carried out by 17 human annotators. The data were distributed among the annotators such that each query was labeled by three different annotators. For relevance, annotators were asked to label each image (one at a time) as being relevant to the underlying query (value 1), non-relevant (0), or with "don't know" (-1). The final relevance ground truth score was determined using a majority voting scheme. For diversity, only the images that were judged as relevant in the previous step were considered. For each query, annotators were provided with a thumbnail list of all relevant images. After getting familiar with their contents, they were asked to re-group the images into clusters with similar visual appearance (up to 25 clusters in total). In contrast to the single relevance score for each query, in terms of diversity we consider all three annotations as correct (ground truth), as they typically depict different possibilities to group the images, representing different points of view.

5 RUN DESCRIPTION
Participants were allowed to submit up to five runs. The first three are required (dedicated) runs: run1, an automated run using visual information only; run2, an automated run using text information only; and run3, an automated run using both visual and text information. For the generation of run1 to run3, only information that can be extracted from the provided data (e.g., the provided descriptors, descriptors of their own, etc.) is allowed to be used. The last two runs, run4 and run5, are general ones, i.e., any approach is allowed, e.g., human-based or hybrid human-machine approaches, including the use of data from external sources, such as the Internet or pre-trained models obtained from external datasets related to this task.

6 EVALUATION
Performance is assessed for both diversity and relevance using cluster recall at X (CR@X), precision at X (P@X), and their harmonic mean F1@X. CR@X is the ratio of the number of clusters from the ground truth that are represented in the top X results; it thus reflects the diversification quality of a given image result set. We compute CR@X for each of the available ground truth diversity annotations and, for each query, select the one that maximizes CR@X. Since the clusters in the ground truth consider relevant images only, the relevance of the top X results is implicitly measured by CR@X. Nevertheless, P@X provides a more precise view on the relevance of a particular image set, since it directly measures the relevance among the top X images. We consider various cut-off points, i.e., X = {5, 10, 20, 30, 40, 50}. Additionally, we consider two further evaluation metrics that are well established in the information retrieval community: the intent-aware expected reciprocal rank (ERR-IA@X) [3] and the α-normalized discounted cumulative gain (α-nDCG@X) [7].

The official ranking metric is F1@20, which gives equal importance to diversity (via CR@20) and relevance (via P@20). This metric simulates the content of a single page of a typical Web image search engine and reflects user behavior, i.e., inspecting the first page of results with priority.

7 CONCLUSION
The 2017 Retrieving Diverse Social Images task provides participants with a comparative and collaborative evaluation benchmark for social image retrieval approaches focusing on visual-based diversification. The task explores diversification in the context of a challenging, ad-hoc image retrieval system, which should be able to tackle complex and general-purpose multi-concept queries. This year, we explicitly accounted for the possibility of having multiple different views on a given retrieval result, which might all be subjectively correct. This allows for an investigation of the aspect of subjectivity in the perception of diversification in a next step. Details on the methods and results of the participating teams can be found in the working note papers of the MediaEval 2017 workshop proceedings.

REFERENCES
[1] Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool. 2008. Speeded-Up Robust Features (SURF). Computer Vision and Image Understanding 110, 3 (2008), 346–359. https://doi.org/10.1016/j.cviu.2007.09.014
[2] Anna Bosch, Andrew Zisserman, and Xavier Munoz. 2007. Representing Shape with a Spatial Pyramid Kernel. In ACM International Conference on Image and Video Retrieval (CIVR). ACM, New York, NY, USA, 401–408. https://doi.org/10.1145/1282280.1282340
[3] Olivier Chapelle, Donald Metlzer, Ya Zhang, and Pierre Grinspan. 2009. Expected Reciprocal Rank for Graded Relevance. In ACM Conference on Information and Knowledge Management (CIKM). ACM, New York, NY, USA, 621–630. https://doi.org/10.1145/1645953.1646033
[4] Savvas A. Chatzichristofis, Yiannis S. Boutalis, and Mathias Lux. 2009. Selection of the proper compact composite descriptor for improving content based image retrieval. In Signal Processing, Pattern Recognition and Applications (SPPRA). ACTA Press, 134–140.
[5] Savvas A. Chatzichristofis and Yiannis S. Boutalis. 2008.
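The relevance/diversity trade-off defined above is commonly attacked with greedy re-ranking. As an illustration only (this is not a method prescribed by the task), a minimal maximal-marginal-relevance-style sketch, where the `relevance` scores and the pairwise visual `distance` function are hypothetical inputs assumed to come from a participant's own models:

```python
def mmr_rerank(candidates, relevance, distance, k=50, lam=0.7):
    """Greedy MMR-style re-ranking: trade off query relevance
    against dissimilarity to the images already selected.

    candidates: list of image ids (the initial Flickr ranking)
    relevance:  dict id -> relevance score (assumed precomputed)
    distance:   callable (id, id) -> visual dissimilarity in [0, 1]
    k:          size of the refined list (the task asks for up to 50)
    lam:        relevance/diversity trade-off weight
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(img):
            # Diversity term: distance to the closest already-picked image.
            div = min((distance(img, s) for s in selected), default=1.0)
            return lam * relevance[img] + (1 - lam) * div
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `lam=1.0` this degenerates to plain relevance ranking; lowering `lam` pushes near-duplicate images down the list.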
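The provided text-based features are TF-IDF weights computed per image, per query, and per user. A toy sketch of how such weights are typically derived (the actual release format of the feature files is not reproduced here; names are illustrative):

```python
import math
from collections import Counter

def tf_idf(doc_tokens, corpus):
    """Toy TF-IDF weighting for one document's tokens.

    doc_tokens: list of terms for one document (e.g., one image's tags)
    corpus:     list of token lists, one per document
    """
    n_docs = len(corpus)
    df = Counter()                      # document frequency per term
    for doc in corpus:
        df.update(set(doc))
    tf = Counter(doc_tokens)            # term frequency in this document
    return {t: (tf[t] / len(doc_tokens)) * math.log(n_docs / df[t])
            for t in tf}
```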
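The evaluation metrics can be sketched directly. A minimal illustration, assuming a ranked result list, the set of relevant image ids, and one diversity annotation mapping relevant images to cluster ids (under the official protocol, CR@X would be computed for each of the three annotations and the maximum kept per query):

```python
def precision_at(ranked, relevant, x):
    """P@X: fraction of the top X results that are relevant."""
    return sum(1 for img in ranked[:x] if img in relevant) / x

def cluster_recall_at(ranked, clusters, x):
    """CR@X: fraction of ground-truth clusters represented in the top X
    results; clusters maps relevant image id -> cluster id."""
    total = len(set(clusters.values()))
    seen = {clusters[img] for img in ranked[:x] if img in clusters}
    return len(seen) / total

def f1_at(ranked, relevant, clusters, x):
    """Harmonic mean of P@X and CR@X (official metric uses X = 20)."""
    p = precision_at(ranked, relevant, x)
    cr = cluster_recall_at(ranked, clusters, x)
    return 0.0 if p + cr == 0 else 2 * p * cr / (p + cr)
```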
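The majority-voting step for the relevance ground truth can be sketched as follows; the paper does not state how ties involving "don't know" votes are broken, so this sketch simply leaves them unresolved:

```python
from collections import Counter

def majority_relevance(labels):
    """Resolve per-image annotations (1 relevant, 0 non-relevant,
    -1 "don't know") into a single ground-truth label by majority vote.
    Returns None when no strict majority exists (tie handling is an
    assumption, not specified by the task organizers)."""
    counts = Counter(labels)
    top = counts.most_common(2)
    if len(top) > 1 and top[0][1] == top[1][1]:
        return None  # no majority among the annotators
    return top[0][0]
```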
CEDD: Color and Edge Directivity Descriptor: A Compact Descriptor for Image Indexing and Retrieval. In International Conference on Computer Vision Systems (ICVS). Springer-Verlag, Berlin, Heidelberg, 312–322. https://doi.org/10.1007/978-3-540-79547-6_30
[6] Savvas A. Chatzichristofis and Yiannis S. Boutalis. 2008. FCTH: Fuzzy Color and Texture Histogram - A Low Level Feature for Accurate Image Retrieval. In International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS). IEEE Computer Society, Washington, DC, USA, 191–196. https://doi.org/10.1109/WIAMIS.2008.24
[7] Charles L.A. Clarke, Maheedhar Kolla, Gordon V. Cormack, Olga Vechtomova, Azin Ashkan, Stefan Büttcher, and Ian MacKinnon. 2008. Novelty and Diversity in Information Retrieval Evaluation. In International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, NY, USA, 659–666. https://doi.org/10.1145/1390334.1390446
[8] Bogdan Ionescu, Alexandru Lucian Ginsca, Maia Zaharieva, Bogdan Boteanu, Mihai Lupu, and Henning Müller. 2016. Retrieving Diverse Social Images at MediaEval 2016: Challenge, Dataset and Evaluation. In MediaEval 2016 Multimedia Benchmark Workshop, Vol. 1739. CEUR-WS.org.
[9] Bogdan Ionescu, Adrian Popescu, Anca-Livia Radu, and Henning Müller. 2016. Result diversification in social image retrieval: a benchmarking framework. Multimedia Tools and Applications 75, 2 (2016), 1301–1331. https://doi.org/10.1007/s11042-014-2369-4
[10] Mathias Lux. 2011. Content Based Image Retrieval with LIRe. In ACM International Conference on Multimedia. ACM, New York, NY, USA, 735–738. https://doi.org/10.1145/2072298.2072432
[11] B.S. Manjunath, J.-R. Ohm, V.V. Vasudevan, and A. Yamada. 2001. Color and texture descriptors. IEEE Transactions on Circuits and Systems for Video Technology 11, 6 (2001), 703–715. https://doi.org/10.1109/76.927424
[12] Mandar Mitra, Ramin Zabih, Jing Huang, Wei-Jing Zhu, and S. Ravi Kumar. 1997. Image Indexing Using Color Correlograms.
In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, Washington, DC, USA, 762–768. https://doi.org/10.1109/CVPR.1997.609412
[13] Yanwei Pang, Qiang Hao, Yuan Yuan, Tanji Hu, Rui Cai, and Lei Zhang. 2011. Summarizing Tourist Destinations by Mining User-generated Travelogues and Photos. Computer Vision and Image Understanding 115, 3 (2011), 352–363. https://doi.org/10.1016/j.cviu.2010.10.010
[14] S. Rudinac, A. Hanjalic, and M. Larson. 2013. Generating Visual Summaries of Geographic Areas Using Community-Contributed Images. IEEE Transactions on Multimedia 15, 4 (2013), 921–932. https://doi.org/10.1109/TMM.2013.2237896
[15] Rodrygo L. T. Santos, Craig Macdonald, and Iadh Ounis. 2015. Search result diversification. Foundations and Trends in Information Retrieval 9, 1 (2015), 1–90. https://doi.org/10.1561/1500000040
[16] Markus Schedl and David Hauger. 2015. Tailoring Music Recommendations to Users by Considering Diversity, Mainstreaminess, and Novelty. In International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, NY, USA, 947–950. https://doi.org/10.1145/2766462.2767763
[17] Yue Shi, Xiaoxue Zhao, Jun Wang, Martha Larson, and Alan Hanjalic. 2012. Adaptive Diversification of Recommendation Results via Latent Factor Portfolio. In International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, NY, USA, 175–184. https://doi.org/10.1145/2348283.2348310
[18] Aixin Sun and Sourav S. Bhowmick. 2009. Image Tag Clarity: In Search of Visual-representative Tags for Social Images. In SIGMM Workshop on Social Media. ACM, New York, NY, USA, 19–26. https://doi.org/10.1145/1631144.1631150
[19] Kaiping Zheng, Hongzhi Wang, Zhixin Qi, Jianzhong Li, and Hong Gao. 2016. A survey of query result diversification. Knowledge and Information Systems 51, 1 (2016), 1–36. https://doi.org/10.1007/s10115-016-0990-4