MUCKE Participation at the Retrieving Diverse Social Images Task of MediaEval 2013

Anil Armagan (Bilkent University, Department of Computer Engineering, 06800 Ankara, Turkey) anil.armagan@bilkent.edu.tr
Adrian Popescu (CEA, LIST, Vision & Content Engineering Laboratory, 91190 Gif-sur-Yvette, France) adrian.popescu@cea.fr
Pinar Duygulu (Bilkent University, Department of Computer Engineering, 06800 Ankara, Turkey) duygulu@cs.bilkent.edu.tr

ABSTRACT
The MediaEval 2013 Retrieving Diverse Social Images Task addresses the challenge of improving both the relevance and the diversity of photos in a retrieval task on Flickr. We propose a clustering-based technique that exploits both textual and visual information. We introduce a k-Nearest Neighbor (k-NN) inspired re-ranking algorithm that is applied before clustering to clean the dataset. After the clustering step, we exploit social cues to rank clusters by social relevance. From the ranked clusters, images are retrieved according to their distance to the cluster centroids.

1. INTRODUCTION
Photo sharing is a frequent activity on social media platforms such as Flickr and Facebook. Since there is no quality control over the photos or their annotations, the images retrieved for a given query are usually not sufficiently relevant. Even when the relevance criterion is met, it is still difficult to obtain diversified results [4].
Considering these challenges, the MediaEval 2013 Retrieving Diverse Social Images Task addresses the problem of diversifying social image search results (see [2] for a detailed description of the task and the dataset).
The dataset used in the task is collected from Flickr. Annotations are usually limited to a small number of tags and are therefore incomplete [3]. Flickr images also come with social cues, such as user IDs.
In the following, we summarize our efforts to solve the challenges mentioned above by exploiting both textual and visual features, as well as social cues.

2. FEATURES
2.1 Visual Features
We exploit visual features in order to overcome the sparsity of the textual annotations, since there are only a few tags per image. We use Histograms of Oriented Gradients (HOG) in our experiments. In addition, we extracted GIST and bag-of-visual-words (BOVW) features based on dense SIFT, which have proved efficient in large-scale image retrieval. Dense SIFT descriptors are quantized using a codebook of size 1024. A spatial pyramid model with 2 levels is used, and the resulting feature size is 8192. HOG, GIST and BOVW features capture different low-level characteristics of images and can be combined into more comprehensive visual representations. Since these features are combined in different kinds of runs, all features were L1-normalized so that each feature contributes equally, regardless of its size.

2.2 Textual Features
We exploit a classical TF-IDF weighting scheme to model the textual information associated with points of interest (POIs). Different re-weighting schemes were tested, and the best results were obtained when we took the square root of the TF-IDF scores. The dimension of the model is equal to the number of unique tags associated with each POI. Given the limited number of images per POI (up to 150), the dimension is usually in the range of hundreds. As with the visual features, L1-normalization is applied to the textual features.

3. RESULT RERANKING
The initial Flickr result set is noisy, and we introduce a k-NN inspired approach that exploits visual and social cues to rerank the results and reduce this noise. We consider all the images of a POI as a positive set and construct a negative set of the same size by sampling images of other POIs from the collection. We then compare the GIST features of each image to the features of all other images in the positive and negative sets. The top five most visually similar images are retained for the next steps.
The top five neighbors, drawn from both the positive and the negative sets, of each image depicting a POI are used to produce a reranked result list according to social cues. For each image, we count the number of different users that contributed to the top five neighbors and the number of top five neighbors that belong to the positive set. Finally, the average distance of the target image to its first five positive neighbors is considered. Images are then ranked by cascading these three scores, each score breaking ties in the previous one, and we experimented with keeping the top 70%, 80% and 90% of the reranked images as input for the clustering process. The best results are obtained on the devset when 70% of the initial list is retained, and this threshold is kept for clustering.

4. CLUSTERING
The k-means++ algorithm [1] is used to cluster the images of a topic, using the previously mentioned feature types for the different kinds of runs. Different values of k were used in the experiments: 10, 15 and 20.

Table 1: Run performances with the three official metrics, using the expert ground truth
Run name                   CR@10   P@10    F1@10
Run1 - visual              0.3892  0.7243  0.4905
Run2 - textual             0.3869  0.7333  0.4890
Run3 - textual-visual      0.3848  0.7272  0.4868
Run5 - everything allowed  0.3742  0.7161  0.4753

Table 2: Run performances with the three official metrics, using the crowd-sourcing ground truth
Run name                   CR@10   P@10    F1@10
Run1 - visual              0.7446  0.7449  0.7135
Run2 - textual             0.7406  0.7204  0.6997
Run3 - textual-visual      0.7503  0.7245  0.7050
Run5 - everything allowed  0.7332  0.7184  0.6951

Copyright is held by the author/owner(s).
MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain
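To make the textual model of Section 2.2 concrete, here is a minimal Python sketch. It is an illustrative reconstruction, not the authors' code: the exact TF and IDF formulas, the `build_tag_model` name and the data layout are our assumptions, since the paper only specifies the square-root re-weighting and the L1 normalization.

```python
import math
from collections import Counter

def build_tag_model(poi_image_tags, collection_df, n_docs):
    """Square-rooted TF-IDF over the unique tags of one POI, L1-normalized.

    poi_image_tags: list of tag lists, one per image of the POI.
    collection_df:  document frequency of each tag in the whole collection.
    n_docs:         total number of documents in the collection.
    """
    tf = Counter(tag for tags in poi_image_tags for tag in tags)
    weights = {}
    for tag, freq in tf.items():
        idf = math.log(n_docs / (1 + collection_df.get(tag, 0)))
        # best-performing re-weighting in the paper: sqrt of TF-IDF
        weights[tag] = math.sqrt(max(freq * idf, 0.0))
    total = sum(weights.values()) or 1.0
    return {tag: w / total for tag, w in weights.items()}  # L1 normalization
```

The model dimension is the number of unique tags of the POI, matching the "range of hundreds" noted in the text.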
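The k-NN inspired re-ranking of Section 3 can be sketched as follows, again as an assumption-laden reconstruction rather than the authors' implementation: Euclidean distance on GIST features, the function signature and the exact cascading order (distinct users, then positive neighbors, then lower average positive distance) are our reading of the text.

```python
import numpy as np

def rerank(pos_feats, neg_feats, pos_users, k=5, keep=0.7):
    """Rerank the positive images of a POI and keep the top fraction.

    pos_feats / neg_feats: (n, d) GIST feature arrays, same n for both sets.
    pos_users: user ID of each positive image.
    """
    all_feats = np.vstack([pos_feats, neg_feats])
    n_pos = len(pos_feats)
    scores = []
    for i in range(n_pos):
        d = np.linalg.norm(all_feats - pos_feats[i], axis=1)
        d[i] = np.inf                       # exclude the image itself
        nn = np.argsort(d)[:k]              # k nearest neighbors, both sets
        n_from_pos = int(np.sum(nn < n_pos))
        users = len({pos_users[j] for j in nn if j < n_pos})
        # average distance to the first k positive neighbors (self excluded)
        pos_d = sorted(np.linalg.norm(pos_feats - pos_feats[i], axis=1))[1:k + 1]
        scores.append((users, n_from_pos, -float(np.mean(pos_d)), i))
    # cascaded criteria: later scores only break ties in earlier ones
    ranked = [i for *_, i in sorted(scores, reverse=True)]
    return ranked[: int(round(keep * n_pos))]
```

With `keep=0.7` this reproduces the 70% threshold retained for clustering.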
The value k = 15 is selected, since it gives the best results on the devset in terms of precision, recall and F1-measure at cutoff 10; k = 15 is also close to the cutoff of the official metrics (CR@10, P@10, F1@10) used in the evaluation.

5. CLUSTER AND RESULT RANKING
Clusters are not all born equal, and we need to rank them by the probability that the images they contain are relevant. Inspired by [4], we exploit social cues for cluster ranking and propose a simple scheme based on user and date information. For each cluster, we count the number of different users that contributed to the cluster and the number of different dates on which the photos of the cluster were taken. The first count prioritizes clusters that are socially diverse, while the second surfaces clusters that are temporally stable. We then take the product of these two counts as a social ranking score. To break ties, we also use the number of images present in each cluster.
For each POI, we retain only the top 10 clusters obtained with this ranking procedure and then diversify the results by choosing one image from each cluster, in descending order of similarity to the cluster centroid.

6. RESULTS AND DISCUSSION
To address the diversified social image retrieval problem, participants are asked to submit different types of runs. We submitted four runs, produced by using different types of features and their combinations on the same dataset. The four submitted runs are described below. Run1 - visual is based on visual features: we concatenate the HOG and GIST features described in Section 2.1. Run2 - textual uses only the textual features described in Section 2.2. Run3 - textual-visual combines the textual features of Section 2.2 with visual features from Section 2.1: visual and textual features are concatenated to produce the feature vectors, using linear weights of 0.7 and 0.3 for the visual and textual parts respectively. These weights were chosen empirically by testing different combinations on the devset. Run5 - everything allowed is similar to Run3 - textual-visual, the only modification being the replacement of the HOG features by BOVW features.
The results in Table 1 are based on expert evaluation of the 346 test-set locations. Table 2 shows the average results of a crowd-sourcing evaluation carried out, via the CrowdFlower platform, on a subset of only 50 locations from the test set. The relevance ground truth is based on majority voting over the annotations, and the diversity ground truth is determined by the same three annotators. Both tables show that there are only small differences between the four runs. We were surprised to see how well the textual run (Run2) performed compared to the visual and multimodal runs, since we had expected visual diversification to work better for POIs, which usually have a limited number of visual aspects. The performance drop from Run3 to Run5, due to the replacement of HOG features with BOVW features, also came as a surprise, since the latter usually work well for retrieval over visually diversified datasets.

7. CONCLUSION
In this paper we proposed a clustering-based technique that uses textual and visual features. To re-rank the images before clustering, we introduced a k-NN inspired technique that helped us spot the strongly connected images of a topic. We further applied a cluster ranking method based on social cues to increase both diversity and relevance. Tables 1 and 2 show that there are no big differences between the runs. The performance of Run2 - textual and Run5 - everything allowed surprised us: since BOVW representations generally work well for retrieval, we would have expected the visual representations to improve the results more than the textual features did.

8. ACKNOWLEDGMENT
This research was supported by the MUCKE project, funded within the FP7 CHIST-ERA scheme, and by the Scientific and Technological Research Council of Turkey (TUBITAK) under grant number 112E174.

9. REFERENCES
[1] D. Arthur and S. Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, SODA '07, pages 1027-1035, Philadelphia, PA, USA, 2007. Society for Industrial and Applied Mathematics.
[2] B. Ionescu, M. Menendez, H. Muller, and A. Popescu. Retrieving diverse social images at MediaEval 2013: objectives, dataset and evaluation. In MediaEval 2013 Workshop, CEUR-WS.org, ISSN 1613-0073, Barcelona, Spain, October 18-19, 2013.
[3] L. S. Kennedy, S.-F. Chang, and I. V. Kozintsev. To search or to label?: predicting the performance of search-based automatic image classifiers. In Proceedings of the 8th ACM international workshop on Multimedia information retrieval, MIR '06, pages 249-258, New York, NY, USA, 2006. ACM.
[4] L. S. Kennedy and M. Naaman. Generating diverse and representative image search results for landmarks. In Proceedings of the 17th international conference on World Wide Web, WWW '08, pages 297-306, New York, NY, USA, 2008. ACM.
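As an illustration of the cluster ranking and diversification scheme of Section 5, here is a minimal sketch. The data layout, the function name and the round-robin emission over the ranked clusters are our assumptions (the paper says one image is chosen from each cluster by descending centroid similarity, but does not spell out the interleaving); the clustering itself, e.g. k-means++, is assumed to have been run already.

```python
def rank_and_diversify(clusters, centroid_sim, top_clusters=10):
    """Rank clusters by social cues, then emit a diversified image list.

    clusters:     cluster id -> list of (image_id, user_id, date) tuples.
    centroid_sim: image_id -> similarity to its cluster centroid.
    """
    def social_score(item):
        _, imgs = item
        users = len({u for _, u, _ in imgs})
        dates = len({d for _, _, d in imgs})
        # product of user and date counts; cluster size breaks ties
        return (users * dates, len(imgs))

    ranked = sorted(clusters.items(), key=social_score, reverse=True)
    ranked = ranked[:top_clusters]                  # top 10 clusters per POI
    queues = [sorted(imgs, key=lambda x: centroid_sim[x[0]], reverse=True)
              for _, imgs in ranked]
    result = []
    while any(queues):                              # one image per cluster per pass
        for q in queues:
            if q:
                result.append(q.pop(0)[0])
    return result
```

Each pass over the queues takes the next-best image from each surviving cluster, so the head of the output list mixes the top 10 clusters.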
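The feature fusion used for Run3 in Section 6 (L1 normalization per modality followed by weighted concatenation) can be sketched as below; the `combine` helper and the application of the weight after normalization are our assumptions, with only the 0.7/0.3 weights taken from the paper.

```python
import numpy as np

def combine(visual, textual, w_vis=0.7, w_txt=0.3):
    """Concatenate L1-normalized visual and textual vectors with linear weights."""
    def l1(v):
        v = np.asarray(v, dtype=float)
        s = np.abs(v).sum()
        return v / s if s else v
    # normalize first so each modality contributes regardless of its size
    return np.concatenate([w_vis * l1(visual), w_txt * l1(textual)])
```

Because each modality sums to 1 after normalization, the combined (non-negative) vector sums to w_vis + w_txt.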