Imcube @ MediaEval 2015 Retrieving Diverse Social Images Task: Multimodal Filtering and Re-ranking

Sebastian Schmiedeke, Pascal Kelm, and Lutz Goldmann
imcube labs GmbH, Berlin, Germany
{schmiedeke, kelm, goldmann}@imcube.de

Copyright is held by the author/owner(s).
MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany

ABSTRACT
This paper summarizes the participation of Imcube in the Retrieving Diverse Social Images Task of MediaEval 2015. The task addresses the problem of result diversification in the context of social photo retrieval, where the results of a query should contain relevant but diverse items. We therefore propose a multi-modal approach for filtering and re-ranking in order to improve the relevancy and diversity of the returned list of ranked images.

1. INTRODUCTION
The Retrieving Diverse Social Images Task of MediaEval 2015 [5] requires participants to develop a system that automatically refines a list of images returned by a Flickr query in such a way that the most relevant and diverse images are returned in a ranked list of up to 50 images.

A photo is considered relevant if it is a common representation of the overall query concept, is of good visual quality (sharpness, contrast, colours) and does not show people as main subjects, except for queries dealing with people as part of the topic. The results are considered diverse if they depict different visual aspects (time, location, view, style, etc.) of the target concept with a certain degree of complementarity.

The refinement and diversification process can be based on the social metadata associated with the collected photos in the data set and/or on the visual characteristics of the images. Furthermore, the task provides information about user annotation credibility as an automatic estimate of the quality of a particular user's tags.

2. SYSTEM DESCRIPTION
In this section, we present our approach, which combines textual, visual and credibility information to filter and re-rank the initial results. It consists of two steps, relevancy improvement and diversification, as depicted in Figure 1.

[Figure 1: Proposed approach – a relevancy improvement stage (fusion of text, visual and credibility modules) followed by a diversification stage (fusion of text and visual modules).]

The goal of the first step is to improve the relevancy of the ranked image list by re-ranking the images based on more reliable textual and visual criteria and by filtering out images which are irrelevant for the given application scenario. The goal of the second step is to improve the diversity of the ranked image list through textual filtering as well as visual clustering and re-ranking. The individual modules are described in the following sections.

2.1 Textual relevancy improvement
This step exploits additional information extracted from the corresponding Wikipedia article that is provided together with the query. To improve the ranking of the images, the query is expanded with the most frequent words from the Wikipedia article and the images are re-ranked using a bag-of-words representation. The relevancy is further improved by removing images that do not match the original query and the location information from the Wikipedia article. The location information is extracted by analysing the original query or the Wikipedia title (e.g. "Great Sphinx of Giza"), taking into account typical prepositions for locations (e.g. "in", "at", "on", "of", "de"). If no location information can be extracted (e.g. "Niagara Falls"), toponyms are not considered for relevancy filtering.
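As an illustration of the location heuristic above, the following Python sketch extracts a toponym from a query or Wikipedia title by searching for the listed prepositions and applies a rough textual match. This is a minimal sketch under our own assumptions; the paper does not publish its implementation, and the exact way query and toponym matches are combined is not specified, so all names and the matching logic are hypothetical.

```python
# Sketch of the toponym-extraction heuristic of Section 2.1.
# Function and variable names are illustrative, not the original code.

LOCATION_PREPOSITIONS = {"in", "at", "on", "of", "de"}

def extract_toponym(title):
    """Return the presumed location part of a query or Wikipedia title.

    "Great Sphinx of Giza"        -> "Giza"
    "chinese new year in Beijing" -> "Beijing"
    "Niagara Falls"               -> None (no preposition, filtering skipped)
    """
    words = title.split()
    toponym = None
    for i, word in enumerate(words[:-1]):
        if word.lower() in LOCATION_PREPOSITIONS:
            # Keep the text after the last matching preposition.
            toponym = " ".join(words[i + 1:])
    return toponym

def matches_query_or_toponym(image_text, query, toponym):
    """Rough relevancy test on an image's concatenated tags/description."""
    text = image_text.lower()
    if query.lower() in text:
        return True
    return toponym is not None and toponym.lower() in text
```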
2.2 Visual relevancy improvement
Visual information is also used to improve the relevancy by re-ranking the images according to different criteria. For each visual feature, a ranked image list is derived based on the computed relevancy scores.

Since images with persons as main subjects are considered irrelevant, we employ a face detector [7] trained for frontal and profile faces to determine the size of facial regions. The inverse relative size of the detected faces determines the relevancy: the smaller the area covered by faces, the more relevant the image.

Additionally, photos taken from the target location but not displaying it are considered irrelevant. We model this aspect of relevancy by computing the visual similarity between the retrieved images and the images available from the associated Wikipedia article. We use histogram of oriented gradients (HOG) features and a clusterless BoW approach [6] based on speeded up robust features (SURF) to generate histograms for each of the images. The similarity between the retrieved images and the Wikipedia images is computed through histogram intersection, and the retrieved images are re-ranked according to the maximum score across the set of Wikipedia images.

We further incorporate aesthetic aspects to emphasize more visually appealing images, since less blurry and more salient images are usually considered more relevant. The sharpness is calculated as the ratio of gradient magnitudes between differently blurred versions of the original image; the larger that ratio, the more relevant the image. Saliency is measured using a spectral residual approach [4]. Considering the different criteria described above, we obtain five ranked image lists (Face, HOG, BoW, Sharpness, Saliency), which are fused using weighted rank fusion.
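The paper does not give the exact fusion formula, so the following sketch shows one common form of weighted rank fusion that is consistent with the description: each of the five criterion-specific rankings contributes a weighted reciprocal-rank score and the images are re-ordered by the combined score. The weights and names are illustrative assumptions.

```python
# Sketch of weighted rank fusion over the per-criterion rankings of
# Section 2.2 (Face, HOG, BoW, Sharpness, Saliency). The actual weights
# and fusion formula are not published; this reciprocal-rank variant is
# only one plausible instantiation.

def weighted_rank_fusion(ranked_lists, weights):
    """ranked_lists: dict criterion -> list of image ids (best first).
    weights: dict criterion -> float. Returns the fused ranking."""
    scores = {}
    for criterion, ranking in ranked_lists.items():
        w = weights.get(criterion, 1.0)
        for rank, image_id in enumerate(ranking, start=1):
            # Higher-ranked images (small rank) contribute more.
            scores[image_id] = scores.get(image_id, 0.0) + w / rank
    return sorted(scores, key=scores.get, reverse=True)

# Example usage with equal weights for the five criteria:
# fused = weighted_rank_fusion(
#     {"face": face_list, "hog": hog_list, "bow": bow_list,
#      "sharpness": sharp_list, "saliency": sal_list},
#     weights={c: 1.0 for c in ("face", "hog", "bow", "sharpness", "saliency")})
```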
2.3 Credibility-based relevance improvement
This step is intended as the baseline approach for improving the relevance. It re-ranks the image list according to the credibility of the owner of an image. The re-ranking is based on three scores which describe the user credibility [5]: the use of correct tags (visualScore), the use of specific tags (tagSpecificity) and the user's preference for photographing faces (faceProportion). Following the application scenario, the combined credibility score for an image is high if the user has a high visualScore, a high tagSpecificity and a low faceProportion.

2.4 Textual diversification
The final image list should not only contain relevant images but also diverse ones, i.e. images depicting different aspects of the topic. Based on the assumption that images with an identical textual description often depict very similar content, the images are clustered according to their textual similarity. The ranked image list is then obtained by ordering the clusters in descending order of their relevancy and iteratively selecting the most relevant image from each cluster.

2.5 Visual diversification
The visual diversification considers multiple visual characteristics, including colour (ColorMoment), structure (HOG and the clusterless BoW approach [6]) and texture (local binary patterns, LBP). For each feature, the normalized distances between the retrieved images are combined using a weighted sum and then projected into a lower-dimensional space with the FastMap algorithm [2]. On the resulting 5-dimensional feature space, kMeans++ clustering [1] is applied.

The number of clusters is estimated with Hartigan's leader clustering algorithm [3], but restricted to lie between 5 and 21. Clusters with a low mean relevancy or clusters containing only a few images are discarded, since such small clusters are very likely to contain outliers. The remaining clusters are sorted in descending order of their maximum relevancy, and the ranked image list is obtained by iteratively selecting the best image from each of them.
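The cluster-based selection used in Sections 2.4 and 2.5 can be sketched as follows: clusters are ranked by the relevancy of their best image and the final list is filled in a round-robin fashion, repeatedly taking the most relevant remaining image of each cluster. This is a minimal sketch under our own data-structure assumptions, not the original implementation.

```python
# Sketch of the diversification step: rank clusters by their best image
# and interleave their members, most relevant first. Cluster contents and
# relevancy scores are assumed to come from the preceding modules.

def diversify(clusters, relevancy, max_images=50):
    """clusters: iterable of lists of image ids; relevancy: dict id -> score."""
    # Order the images inside each cluster by decreasing relevancy; drop empties.
    ordered = [sorted(c, key=relevancy.get, reverse=True) for c in clusters if c]
    # Rank the clusters by the relevancy of their best image.
    ordered.sort(key=lambda c: relevancy[c[0]], reverse=True)

    result = []
    depth = 0  # 0 = best image of each cluster, 1 = second best, ...
    while len(result) < max_images:
        picked_any = False
        for cluster in ordered:
            if depth < len(cluster):
                result.append(cluster[depth])
                picked_any = True
                if len(result) == max_images:
                    break
        if not picked_any:  # all clusters exhausted
            break
        depth += 1
    return result
```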
3. EXPERIMENTS & RESULTS
The following experiments were performed with the system and the individual modules described above, following the guidelines of the task. Run1 uses visual information only (Sec. 2.2 and 2.5). Run2 is based on purely textual information (Sec. 2.1 and 2.4). Run3 combines textual and visual information (Sec. 2.1, 2.2 and 2.5). Run5 uses credibility-based relevancy and visual diversity (Sec. 2.3 and 2.5).

The experiments are performed on the provided test set, which contains 69 one-concept location queries and 70 multi-concept queries related to events. Table 1 shows the results on the test set for all runs defined above. Since we want to evaluate our filters under different conditions, scores for the one-concept and multi-concept queries are also reported separately.

Table 1: Evaluation of the different runs.

               Average P@20                          Average CR@20                         Average F1@20
       one-concept  multi-concept  overall   one-concept  multi-concept  overall   one-concept  multi-concept  overall
run1      0.7014       0.6743      0.6878       0.3963       0.4209      0.4087       0.4885       0.5027      0.4957
run2      0.7819       0.7143      0.7478       0.4380       0.3986      0.4182       0.5478       0.4917      0.5195
run3      0.7674       0.6993      0.7331       0.4285       0.4064      0.4174       0.5340       0.4926      0.5132
run5      0.5928       0.6671      0.6302       0.3410       0.3508      0.3460       0.4251       0.4448      0.4350

In general, the textual run (run2) achieves the best results. It reaches a higher precision (P@20 = 0.748) and also a slightly better recall (CR@20 = 0.418) than the visual run (run1). The advantage is more pronounced for one-concept queries than for multi-concept queries. The textual run fails for queries whose main topic is not correlated to a location, e.g. "chinese new year in Beijing" (its main topic being the fireworks), "paragliding in the mountains" or "tropical rain". For these cases, the visual run reaches considerably higher F1 scores. Overall, the purely visual run achieves a better recall (CR@20 = 0.4209) and thus a slightly better F1 score (F1@20 = 0.5027) for multi-concept queries.

Since the combination of visual and textual features (run3) generally achieves lower scores than the individual modalities, we analyse the cases where improvements were made. For example, the previously mentioned query "chinese new year in Beijing" benefits from the visual information with a considerable increase of the F1 measure (∆F1@20 = 0.18). Compared to run2, run3 achieves a lower precision (∆P@20 = −0.015) and a similar recall (∆CR@20 = −0.001), leading to a slightly lower F1 score (∆F1@20 = −0.006). However, it is interesting to note that the results differ for one-concept and multi-concept queries: the recall of run3 is higher than that of run2 for multi-concept queries (∆CR@20 = 0.008), while it is lower for one-concept queries (∆CR@20 = −0.009).

4. CONCLUSION
The results of the different runs show that, overall, the best results are achieved with textual information only and that the fusion of visual and textual information leads to slightly worse results. A more detailed analysis shows that visual information provides better results for multi-concept queries and for queries whose main topic is not correlated to a location, while textual information performs better for one-concept queries. This suggests that a more advanced fusion approach for combining textual and visual information may improve the results further.

5. REFERENCES
[1] D. Arthur and S. Vassilvitskii. K-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '07), pages 1027–1035, Philadelphia, PA, USA, 2007. Society for Industrial and Applied Mathematics.
[2] C. Faloutsos and K. Lin. FastMap: A fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, pages 163–174, New York, NY, USA, 1995. ACM.
[3] J. A. Hartigan. Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA, 1975.
[4] X. Hou and L. Zhang. Saliency detection: A spectral residual approach. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR '07), pages 1–8, June 2007.
[5] B. Ionescu, A. L. Gînscă, B. Boteanu, A. Popescu, M. Lupu, and H. Müller. Retrieving Diverse Social Images at MediaEval 2015: Challenge, dataset and evaluation. In MediaEval 2015 Workshop, Wurzen, Germany, 2015.
[6] S. Schmiedeke, P. Kelm, and T. Sikora. DCT-based features for categorisation of social media in compressed domain. In IEEE 15th International Workshop on Multimedia Signal Processing (MMSP), pages 295–300, 2013.
[7] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), volume 1, pages I-511–I-518, 2001.