Bilkent-RETINA at Retrieving Diverse Social Images Task of MediaEval 2014

Mustafa Ilker Sarac, Pinar Duygulu
Bilkent University, Department of Computer Engineering, 06800, Ankara, Turkey
mustafa.sarac@cs.bilkent.edu.tr, duygulu@cs.bilkent.edu.tr

Copyright is held by the author/owner(s).
MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain

ABSTRACT
This paper explains the approach proposed by the Bilkent-RETINA team for the Retrieving Diverse Social Images task of MediaEval 2014 [1]. We develop a framework which first removes outliers using one-class support vector machines (SVM) to improve relevance. Second, it clusters the images that remain after elimination and retrieves the centroids to diversify the results. We exploit only visual features in our experiments. For the first run we used the provided visual features and for the second run we used well-known visual features such as SIFT [2] and GIST [4].

1. INTRODUCTION
Image sharing applications are now used extremely widely. Users of Facebook upload 350 million photos^1 each day, which is said to equal the total number of photos taken during the 19th century^2. Given this large number of images, search engines are more important than ever for producing good quality search results. In this task the quality factors are relevance and diversity.

Participants were provided with a development dataset (devset) of 30 locations and a testing dataset (testset) of 123 locations [1]. Each location consists of up to 300 photos which are retrieved from Flickr using text information. In the following, we present a framework which first removes the outlier images and then applies k-means clustering to obtain diversified results.

^1 http://www.businessinsider.com/facebook-350-million-photos-each-day-2013-9
^2 http://blog.1000memories.com/94-number-of-photos-ever-taken-digital-and-analog-in-shoebox

2. PROPOSED APPROACH
Our method can be summarized in 4 steps, as shown in Figure 1:

Step 1: Feature extraction
In this step we compute visual features for each image of each location. Some of the features are provided by the task and 2 of them are extracted by our team.

Step 2: Outlier removal
In order to increase the proportion of relevant images for each location in the dataset, we apply an outlier removal procedure. This procedure removes some of the irrelevant images from the dataset and increases the P and F1 scores.

Step 3: Clustering
After the outlier removal step, in order to increase the diversity score we apply k-means clustering to the remaining images at each location.

Step 4: Retrieval
In the retrieval step we select the cluster centroids obtained in the previous step. Each centroid should represent a different aspect of a given location, so that the retrieved results are well diversified.

Figure 1: Overall framework structure. When the images related to a specific location are given as input, our framework produces diversified results for that location.

3. VISUAL FEATURES
The task organizers provide 6 visual descriptors (CM, CN, CSD, GLRLM, HOG, LBP), 4 of which also have a spatial pyramid representation (CM, CN, GLRLM and LBP). We searched for the best combination of these features using the provided devset images and found that the best results are obtained when all of them are combined. We therefore concatenate all 10 visual descriptors into a feature vector of 945 dimensions for each image (i.e., descvis). Then we normalize each feature vector to zero mean and unit variance.

We also extracted other visual features, namely GIST and bag of visual words (BOVW) representations using dense SIFT features [2, 4]. We use these extra features when constructing the fifth run of the challenge. GIST features are 512-dimensional global features that are useful for capturing the scene information in images. It is important to capture and differentiate scene information in order to boost the diversity of the results.

To compute dense SIFT descriptors we use VLFeat's standard feature extractor tool [5]. First we resize each image to a fixed size of 200 by 200 pixels, which yields a 128 by 5776 matrix of SIFT features per image (5776 descriptors of 128 dimensions each). To create a pool of descriptors we randomly sample 100 descriptors from each image and then apply the k-means algorithm with the 'plusplus' option. We try 3 different k values (600, 800 and 1000). According to the performance on devset, we choose k = 1000, which becomes the size of our visual word dictionary. Using this dictionary, we quantize each image into a 1000-dimensional feature vector.

4. OUTLIER REMOVAL
We use SVM to find the outliers and construct a subset of images per location which is more relevant than the initial set. Our method is similar to [3], but we use a fixed set of negative examples for each of devset and testset, selected as follows. For devset images we pick 2 random images from each of the 30 locations; for testset images we select 60 random images from the 123 locations, taking at most 1 image from each testset location. Then for each location, similar to cross validation, we select 60 random positive images, train a one-class SVM and then classify with it, and repeat this procedure 10 times consecutively. Finally we select the model which scored the highest accuracy, assuming that this model provides the best separation. We use this process for each location, using the same negative examples at each step but with different positive examples. We use a quadratic kernel with the SVM because our features are dense vectors that are not easily separable by linear kernel functions. We observed on the devset that, as a result of the outlier removal process, we get rid of some of the irrelevant images and obtain a higher relevance score for each location.

5. CLUSTERING AND RETRIEVAL
After outliers are removed we cluster the images of each location using the k-means algorithm. On the devset we try 2 different K values. First we select K as 25, because we observed that each location has at most 25 subclasses in its diversity subgroups. Second we select K as 50, because that is the maximum number of images required to be retrieved. The latter choice, which over-clusters, worked better on devset, so we report our testset results using K as 50.

After we compute the cluster centroids, we simply retrieve the images which are closest to the centroids. We apply the k nearest neighbor method with Euclidean distance and search for the nearest neighbor of each centroid. While computing nearest neighbors we take care to retrieve a unique neighbor for each cluster centroid.

Results on devset are shown in Table 1. One may observe that SIFT-BOVW [2] features work better than the default features. The reason is that local descriptors generally capture similarities among images better, so that each cluster becomes more coherent. GIST [4] features also perform better than the default features and perform similarly to SIFT-BOVW features. Results from our 2 submissions, namely Run#1 and Run#5, can be found in Table 2. Similar to the devset results, using SIFT-BOVW we obtain better results from Run#5 than from Run#1.

Table 1: Results on devset using provided features, GIST and SIFT-BOVW.
Feat. name    P@20      CR@20     F1@20
descvis       0.7139    0.3813    0.4863
GIST          0.7209    0.3798    0.5037
SIFT-BOVW     0.7167    0.3933    0.5013

Table 2: Official results on testset.
Run#    P@20      CR@20    F1@20
1       0.6809    0.375    0.4758
5       0.7228    0.387    0.4966

6. CONCLUSIONS
We showed that it is possible to obtain competitive results using only visual features. Our framework first eliminates the outliers and then uses clustering to add diversity to the retrieval results. However, the scores could be improved by incorporating more information into our framework, such as textual features and credibility scores.

7. ACKNOWLEDGMENTS
This research was supported by the MUCKE project funded within the FP7 CHIST-ERA scheme and also by the Scientific and Technical Research Council of Turkey (TUBITAK) under grant number 112E174.

8. REFERENCES
[1] B. Ionescu, A. Popescu, M. Lupu, A. L. Gînscă, and H. Müller. Retrieving diverse social images at MediaEval 2014: Challenge, dataset and evaluation. In MediaEval 2014 Workshop, October 16-17, Barcelona, Spain, 2014.
[2] D. G. Lowe. Object recognition from local scale-invariant features. In Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, volume 2, pages 1150–1157. IEEE, 1999.
[3] H. Lukashevich, S. Nowak, and P. Dunker. Using one-class SVM outliers detection for verification of collaboratively tagged image training sets. In Multimedia and Expo, 2009. ICME 2009. IEEE International Conference on, pages 682–685. IEEE, 2009.
[4] A. Oliva and A. Torralba. Building the gist of a scene: The role of global image features in recognition. Progress in Brain Research, 155:23–36, 2006.
[5] A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms.
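
As an illustration of the clustering and retrieval step of Section 5, the following is a minimal sketch and not the implementation used for the submitted runs. It assumes plain Lloyd's k-means with random initialisation (the initialisation for this step is not stated in the paper), a greedy rule for enforcing unique neighbors (each centroid takes its closest image that no earlier centroid has taken), and random toy vectors in place of real image features. The names sq_dist, kmeans and retrieve_unique are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance; it ranks neighbors like the true distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(features, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns a list of k centroid vectors."""
    rng = random.Random(seed)
    # Assumption: initialise with k distinct images chosen at random.
    centroids = [list(v) for v in rng.sample(features, k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for v in features:
            nearest = min(range(k), key=lambda j: sq_dist(v, centroids[j]))
            clusters[nearest].append(v)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def retrieve_unique(features, centroids):
    """For each centroid, return the index of the closest image not yet chosen."""
    chosen = []
    for c in centroids:
        ranked = sorted(range(len(features)), key=lambda i: sq_dist(features[i], c))
        # Assumption: greedy uniqueness, earlier centroids get first pick.
        chosen.append(next(i for i in ranked if i not in chosen))
    return chosen


# Toy stand-in for one location after outlier removal:
# 120 images with 8-dimensional features (real features would be e.g. 1000-d BOVW).
data_rng = random.Random(42)
features = [[data_rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(120)]

centroids = kmeans(features, k=50)           # K = 50 as in the reported runs
selected = retrieve_unique(features, centroids)
print(len(selected), len(set(selected)))     # number of retrieved and of distinct images
```

The greedy rule requires at least as many images as clusters; a location with fewer than K images left after outlier removal would need a smaller K.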