Bilkent-RETINA at Retrieving Diverse Social Images Task of MediaEval 2014

Mustafa Ilker Sarac
Bilkent University, Department of Computer Engineering,
06800, Ankara, Turkey.
mustafa.sarac@cs.bilkent.edu.tr

Pinar Duygulu
Bilkent University, Department of Computer Engineering,
06800, Ankara, Turkey.
duygulu@cs.bilkent.edu.tr

Copyright is held by the author/owner(s).
MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain

ABSTRACT
This paper explains the approach proposed by the Bilkent-RETINA team for
the Retrieving Diverse Social Images task of MediaEval 2014 [1]. We develop
a framework which first removes outliers using one-class support vector
machines (SVM) to improve relevance. Second, it clusters the remaining
images and retrieves the images closest to the cluster centroids to
diversify the results. We exploit only visual features in our experiments.
For the first run (Run#1) we used the provided visual features, and for the
second run (Run#5) we used well-known visual features such as SIFT [2] and
GIST [4].

1.   INTRODUCTION
   Image sharing applications are used extensively today. Facebook users
upload 350 million photos¹ each day, which is said to equal the total number
of photos taken during the 19th century². Given such a large number of
images, search engines that produce good-quality results become more
important than ever. In this task, result quality is measured in terms of
relevance and diversity.
   Participants were provided with a development dataset (devset) of 30
locations and a testing dataset (testset) of 123 locations [1]. Each
location consists of up to 300 photos retrieved from Flickr using text
information. In the following, we present a framework which first removes
the outlier images and then applies k-means clustering to obtain
diversified results.

¹ http://www.businessinsider.com/facebook-350-million-photos-each-day-2013-9
² http://blog.1000memories.com/94-number-of-photos-ever-taken-digital-and-analog-in-shoebox

2.   PROPOSED APPROACH
   Our method can be summarized in 4 steps, as shown in Figure 1:
   Step 1: Feature extraction
In this step we compute visual features for each image of each location.
Some of the features are provided by the task and 2 of them are extracted
by our team.
   Step 2: Outlier removal
In order to increase the proportion of relevant images for each location in
the dataset, we apply an outlier removal procedure. This procedure removes
some of the irrelevant images from the dataset and increases the P and F1
scores.
   Step 3: Clustering
After the outlier removal step, in order to increase the diversity score,
we apply k-means clustering to the remaining images of each location.
   Step 4: Retrieval
In the retrieval step we use the cluster centroids obtained in the previous
step to select the images to return. Each centroid should represent a
different aspect of a given location, so that the results are well
diversified.

Figure 1: Overall framework structure. When the images related to a
specific location are given as input, our framework produces diversified
results for that location.

3.   VISUAL FEATURES
   The task organizers provide 6 visual descriptors (CM, CN, CSD, GLRLM,
HOG, LBP), 4 of which also have a spatial pyramid representation (CM, CN,
GLRLM and LBP). We searched for the best combination of these features
using the provided devset images and found that the best results are
obtained when all of them are combined. We therefore concatenate all 10
visual descriptors into a feature vector of 945 dimensions for each image
(i.e., descvis). Then we normalize each feature vector to zero mean and
unit variance.
   We also extracted other visual features, namely GIST [4] and bag of
visual words (BOVW) representations built from dense SIFT features [2]. We
use these extra features to construct the fifth run of the challenge. GIST
features are 512-dimensional global features that are useful for capturing
the scene information in images. Capturing and differentiating scenery
information is important for boosting the diversity of the results.
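The concatenation and normalization of the provided descriptors described in this section can be sketched as follows. This is only an illustration in plain Python, not the authors' code: the function names and the toy descriptor values are assumptions, and the sketch reads "zero mean and unit variance" as standardizing each image's concatenated vector. Standardizing each dimension across all images would be an equally plausible reading of the text.

```python
import math

def concatenate(descriptors):
    """Join several per-descriptor vectors into one flat feature vector."""
    return [value for vec in descriptors for value in vec]

def standardize(vector):
    """Scale a vector to zero mean and unit (population) variance."""
    n = len(vector)
    mean = sum(vector) / n
    var = sum((v - mean) ** 2 for v in vector) / n
    std = math.sqrt(var)
    if std == 0.0:
        return [0.0] * n  # constant vector: nothing to scale
    return [(v - mean) / std for v in vector]

# Toy stand-ins for three descriptors of one image (hypothetical values;
# the real concatenated descvis vector has 945 dimensions).
cm = [0.2, 0.4, 0.1]
hog = [3.0, 1.0]
lbp = [10.0, 12.0, 9.0, 11.0]

feature = standardize(concatenate([cm, hog, lbp]))
```

Because the descriptors live on very different scales, some form of standardization is needed before a Euclidean method such as k-means can treat them comparably.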
   To compute dense SIFT descriptors we use VLFeat's standard feature
extractor [5]. First we resize each image to a fixed size of 200 by 200
pixels, and then we obtain 5776 SIFT descriptors of 128 dimensions per
image (a 128 by 5776 matrix). To create a pool of descriptors we randomly
sample 100 descriptors from each image and then apply the k-means algorithm
with the 'plusplus' option. We try 3 different k values (600, 800 and
1000). Based on the performance on the devset, we choose k = 1000, which
becomes the size of our visual word dictionary. Using this dictionary, we
quantize each image into a 1000-dimensional feature vector.
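The bag-of-visual-words pipeline described above (sample descriptors, build a dictionary with k-means, quantize each image) can be sketched as below. This is a dependency-free illustration under stated assumptions, not the VLFeat-based implementation used in the paper: the helper names are hypothetical, the descriptors are random toy data, and the sizes are scaled down (4-dimensional descriptors, 10 samples per image, K = 8 instead of 128 dimensions, 100 samples and K = 1000).

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))

def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: favor points far from the centers chosen so far."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        r = rng.uniform(0.0, sum(d2))
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if w > 0.0 and acc >= r:
                centers.append(p)
                break
        else:
            centers.append(points[-1])  # fallback for degenerate input
    return [list(c) for c in centers]

def kmeans(points, k, rng, iters=20):
    """Plain Lloyd iterations starting from k-means++ seeds."""
    centers = kmeans_pp_init(points, k, rng)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        for j, g in enumerate(groups):
            if g:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(g) for col in zip(*g)]
    return centers

def bovw_histogram(descriptors, centers):
    """Count how many descriptors fall into each visual word."""
    hist = [0] * len(centers)
    for d in descriptors:
        hist[nearest(d, centers)] += 1
    return hist

rng = random.Random(0)
# Toy data: 5 images with 50 random 4-d descriptors each.
images = [[[rng.random() for _ in range(4)] for _ in range(50)]
          for _ in range(5)]
SAMPLES_PER_IMAGE = 10  # the paper samples 100 per image
K = 8                   # the paper settles on 1000

pool = [d for img in images for d in rng.sample(img, SAMPLES_PER_IMAGE)]
vocab = kmeans(pool, K, rng)
features = [bovw_histogram(img, vocab) for img in images]
```

Each entry of `features` is a K-dimensional histogram, which corresponds to the 1000-dimensional image representation in the text once K is set to 1000.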
4.   OUTLIER REMOVAL
   We use an SVM to detect outliers and to construct, for each location, a
subset of images that is more relevant than the initial set. Our method is
similar to [3], but we use a fixed set of negative examples for each of the
devset and testset, selected as follows. For the devset we pick 2 random
images from each of the 30 locations (60 negatives in total); for the
testset we select 60 random images from the 123 locations, taking at most 1
image from each testset location. Then, for each location, in a manner
similar to cross-validation, we select 60 random positive images, train a
one-class SVM and classify with it, and repeat this procedure 10 times.
Finally, we select the model with the highest accuracy, assuming that this
model provides the best separation. We apply this process to each location,
using the same negative examples each time but different positive examples.
We use a quadratic kernel in our SVM experiments because our features are
dense vectors that are not easily separable with linear kernel functions.
We observed on the devset that the outlier removal process discards some of
the irrelevant images and yields a higher relevance score for each
location.
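The repeated sampling and model selection loop of this section can be sketched as below. To keep the example dependency-free, a simple distance-to-centroid detector stands in for the one-class SVM with a quadratic kernel, so this is not the classifier used in the paper. The class and function names, the 90% acceptance radius, and the 2-dimensional toy data are all assumptions made for illustration.

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class CentroidDetector:
    """Stand-in for the one-class SVM: inlier if close to the training mean."""

    def fit(self, train):
        n = len(train)
        self.center = [sum(col) / n for col in zip(*train)]
        dists = sorted(dist(x, self.center) for x in train)
        # Assumed threshold: accept roughly 90% of the training points.
        self.radius = dists[int(0.9 * (n - 1))]
        return self

    def is_inlier(self, x):
        return dist(x, self.center) <= self.radius

def accuracy(model, positives, negatives):
    """Fraction of positives accepted plus negatives rejected."""
    correct = sum(model.is_inlier(x) for x in positives)
    correct += sum(not model.is_inlier(x) for x in negatives)
    return correct / (len(positives) + len(negatives))

def select_model(location_images, negatives, rng, n_pos=60, n_rounds=10):
    """Train on a fresh random positive sample each round; keep the best."""
    best_model, best_acc = None, -1.0
    for _ in range(n_rounds):
        positives = rng.sample(location_images,
                               min(n_pos, len(location_images)))
        model = CentroidDetector().fit(positives)
        acc = accuracy(model, positives, negatives)
        if acc > best_acc:
            best_model, best_acc = model, acc
    return best_model, best_acc

rng = random.Random(1)
# Toy location: 100 relevant images near the origin plus 5 irrelevant ones.
location = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(100)]
location += [[rng.gauss(8, 0.5), rng.gauss(-8, 0.5)] for _ in range(5)]
# Fixed negative set, shared by every round (60 images, as in the text).
negatives = [[rng.gauss(5, 1), rng.gauss(5, 1)] for _ in range(60)]

model, acc = select_model(location, negatives, rng)
kept = [x for x in location if model.is_inlier(x)]  # images passed on
```

The images in `kept` are the ones that would be handed to the clustering step. Swapping `CentroidDetector` for a real one-class SVM would leave the selection loop itself unchanged.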
5.   CLUSTERING AND RETRIEVAL
   After the outliers are removed, we cluster the images of each location
using the k-means algorithm. On the devset we try 2 different values of K.
First we set K to 25, because we observed that each location has at most 25
subclasses in its diversity subgroups. Second we set K to 50, because that
is the maximum number of images required to be retrieved. The latter
setting, which over-clusters the data, worked better on the devset, so we
report our testset results using K = 50.
   After computing the cluster centroids, we simply retrieve the images
closest to them. We apply the k-nearest-neighbor method with Euclidean
distance and search for the nearest neighbor of each centroid. While
computing the nearest neighbors, we make sure that a unique image is
retrieved for each cluster centroid.
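The retrieval of one unique image per centroid can be sketched as follows. The text does not say how uniqueness is enforced, so the greedy strategy here (visit centroids in order and skip images that are already taken) is an assumption, as are the function name and the toy coordinates.

```python
import math

def retrieve_unique(centroids, images):
    """For each centroid, pick the nearest image that is not yet selected.

    Returns image indices in centroid order. If there are more centroids
    than images, the surplus centroids simply get no image.
    """
    selected = []
    taken = set()
    for c in centroids:
        ranked = sorted(range(len(images)),
                        key=lambda i: math.dist(c, images[i]))
        for i in ranked:
            if i not in taken:
                taken.add(i)
                selected.append(i)
                break
    return selected

# Toy 2-d feature vectors for four images and three cluster centroids.
images = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [9.0, 1.0]]
centroids = [[0.0, 0.1], [0.02, 0.0], [6.0, 4.0]]

result = retrieve_unique(centroids, images)
```

In this toy case the first two centroids both have image 0 as their nearest neighbor, so the second centroid falls back to image 1, which keeps the returned list free of duplicates.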
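   Results on the devset are shown in Table 1. One may observe that
SIFT-BOVW [2] features work better than the default features. The reason is
that local descriptors are generally better at capturing similarities among
images, so each cluster becomes more coherent. GIST [4] features also
perform better than the default features and perform similarly to SIFT-BOVW
features. Results from our 2 submissions, namely Run#1 and Run#5, can be
found in Table 2. Similar to the devset results, using SIFT-BOVW we obtain
better results from Run#5 than from Run#1.

Table 1: Results on devset using provided features, GIST and SIFT-BOVW.

Feat. name   P@20     CR@20    F1@20
descvis      0.7139   0.3813   0.4863
GIST         0.7209   0.3798   0.5037
SIFT-BOVW    0.7167   0.3933   0.5013

Table 2: Official results on testset.

Run#   P@20     CR@20   F1@20
1      0.6809   0.375   0.4758
5      0.7228   0.387   0.4966

6.   CONCLUSIONS
   We showed that it is possible to obtain competitive results using only
visual features. Our framework first eliminates the outliers and then uses
clustering to add diversity to the retrieval results. However, the scores
could be improved by incorporating more information into our framework,
such as textual features and credibility scores.

7.   ACKNOWLEDGMENTS
   This research was supported by the MUCKE project funded within the FP7
CHIST-ERA scheme and by the Scientific and Technical Research Council of
Turkey (TUBITAK) under grant number 112E174.

8.   REFERENCES
[1] B. Ionescu, A. Popescu, M. Lupu, A. L. Gînscă, and H. Müller.
    Retrieving diverse social images at MediaEval 2014: Challenge, dataset
    and evaluation. In MediaEval 2014 Workshop, October 16-17, Barcelona,
    Spain, 2014.
[2] D. G. Lowe. Object recognition from local scale-invariant features. In
    Computer Vision, 1999. The Proceedings of the Seventh IEEE
    International Conference on, volume 2, pages 1150–1157. IEEE, 1999.
[3] H. Lukashevich, S. Nowak, and P. Dunker. Using one-class SVM outliers
    detection for verification of collaboratively tagged image training
    sets. In Multimedia and Expo, 2009. ICME 2009. IEEE International
    Conference on, pages 682–685. IEEE, 2009.
[4] A. Oliva and A. Torralba. Building the gist of a scene: The role of
    global image features in recognition. Progress in Brain Research,
    155:23–36, 2006.
[5] A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of
    computer vision algorithms.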