<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Bilkent-RETINA at Retrieving Diverse Social Images Task of MediaEval 2014</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mustafa Ilker Sarac</string-name>
          <email>mustafa.sarac@cs.bilkent.edu.tr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pinar Duygulu</string-name>
          <email>duygulu@cs.bilkent.edu.tr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Bilkent University, Department of Computer Engineering</institution>
          ,
          <addr-line>06800, Ankara</addr-line>
          ,
          <country country="TR">Turkey.</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>16</fpage>
      <lpage>17</lpage>
      <abstract>
        <p>This paper describes the approach proposed by the Bilkent RETINA team for the Retrieving Diverse Social Images task of MediaEval 2014 [1]. We develop a framework which first removes outliers using a one-class support vector machine (SVM) to improve relevance, and then clusters the remaining set and retrieves the cluster centroids to diversify the results. We exploit only visual features in our experiments: for the first run we use the provided visual features, and for the second run we use well-known visual features such as SIFT [2] and GIST [4].</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>In today's world, image-sharing applications are used extensively. Facebook users upload 350 million photos each day, a number said to equal the total number of photos taken during the entire 19th century. Given such a large number of images, search engines are more important than ever for producing good-quality search results. In this task the quality factors are relevance and diversity.</p>
      <p>
        Participants were provided with a development dataset (devset) of 30 locations and a testing dataset (testset) of 123 locations [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Each location consists of up to 300 photos retrieved from Flickr using text information. In the following, we describe a framework which first removes the outlier images and then applies k-means clustering to obtain diversified results.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. PROPOSED APPROACH</title>
      <p>Step 3: Clustering. After the outlier removal step, in order to increase the diversity score we apply k-means clustering to the remaining images at each location.</p>
      <p>Step 4: Retrieval. In the retrieval step we select the cluster centroids obtained in the previous step. Each centroid should represent a different aspect of the given location, which is intended to yield well-diversified results.</p>
    </sec>
    <sec id="sec-3">
      <title>3. VISUAL FEATURES</title>
      <p>The task organizers provide us with 6 visual descriptors (CM, CN, CSD, GLRLM, HOG, LBP), 4 of which also have a spatial pyramid representation (CM, CN, GLRLM and LBP). We searched for the best combination of these features using the provided devset images and found that the best results are obtained when all of them are combined. We therefore concatenate all 10 visual descriptors, obtaining a 945-dimensional feature vector for each image (i.e., descvis). We then normalize each feature vector to zero mean and unit variance.</p>
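      <p>As a rough illustration of this step (not our exact code), the Python sketch below concatenates the per-image descriptors and normalizes each resulting feature vector to zero mean and unit variance; the function name, input format and array shapes are illustrative assumptions.</p>
      <preformat>
import numpy as np

def build_descvis(per_image_descriptors):
    """Concatenate the provided descriptors (CM, CN, CSD, GLRLM, HOG, LBP and
    their spatial-pyramid variants) into one 945-dimensional vector per image.

    per_image_descriptors: list of arrays, each of shape (n_images, d_i),
    one array per descriptor type, with the d_i summing to 945 (assumed layout).
    """
    X = np.hstack(per_image_descriptors).astype(float)   # (n_images, 945)
    # normalize each feature vector to zero mean and unit variance
    X -= X.mean(axis=1, keepdims=True)
    X /= X.std(axis=1, keepdims=True) + 1e-12             # avoid division by zero
    return X
      </preformat>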
      <p>
        We also extracted other visual features, namely GIST and bag-of-visual-words (BOVW) representations computed from dense SIFT features [
        <xref ref-type="bibr" rid="ref2 ref4">2, 4</xref>
        ]. We use these extra features when constructing the fifth run of the challenge. GIST features are 512-dimensional global features that are useful for capturing the scene information in images. Capturing and differentiating scene information is important for boosting the diversity of the results.
      </p>
      <p>
        In order to compute dense-SIFT descriptors we use VLFeat's standard feature extractor [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. First we resize each image to a fixed size of 200 by 200 pixels and then obtain 128 by 5776 dimensional SIFT features per image. To create a pool of descriptors we randomly sample 100 descriptors from each image and then apply the k-means algorithm with k-means++ initialization (the 'plusplus' option). We try 3 different k values (600, 800 and 1000). Based on the performance on the devset, we choose k as 1000, which becomes the size of our visual-word dictionary. Using this dictionary, we quantize each image into a 1000-dimensional feature vector.
      </p>
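      <p>We used VLFeat's MATLAB tools for this step; purely as an illustration, a comparable pipeline might look as follows in Python, with scikit-learn's KMeans standing in for VLFeat's k-means. The per-image sampling of 100 descriptors, the k-means++ initialization and k = 1000 come from the text; the specific library calls and the histogram normalization are assumptions.</p>
      <preformat>
import numpy as np
from sklearn.cluster import KMeans

def build_dictionary(sift_per_image, n_words=1000, samples_per_image=100, seed=0):
    """Pool 100 random dense-SIFT descriptors per image and learn a
    visual-word dictionary with k-means++ initialization."""
    rng = np.random.default_rng(seed)
    pool = np.vstack([
        d[rng.choice(len(d), samples_per_image, replace=False)]
        for d in sift_per_image                      # each d: (n_descriptors, 128)
    ])
    return KMeans(n_clusters=n_words, init="k-means++", n_init=1,
                  random_state=seed).fit(pool)

def quantize(sift_descriptors, kmeans):
    """Map one image's dense-SIFT descriptors to a 1000-bin BOVW histogram."""
    words = kmeans.predict(sift_descriptors)
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)               # L1-normalized (assumption)
      </preformat>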
    </sec>
    <sec id="sec-4">
      <title>4. OUTLIER REMOVAL</title>
      <p>
        We use a one-class SVM to find the outliers and to construct, for each location, a subset of images that is more relevant than the initial set. Our method is similar to [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], but we use a fixed set of negative examples for each of the devset and the testset, selected as follows. For the devset we pick 2 random images from each of the 30 locations; for the testset we select 60 random images from the 123 locations, taking at most 1 image from each location. Then for each location, similarly to cross-validation, we select 60 random positive images, train a one-class SVM and classify with it, and repeat this procedure 10 times. Finally we select the model with the highest accuracy, assuming that this model provides the best separation. We apply this process to each location, using the same negative examples at every step but different positive examples. We use a quadratic kernel in our SVM experiments because our features are dense vectors and are not easily separable with linear kernel functions. We observed on the devset that, as a result of the outlier removal process, we get rid of some of the irrelevant images and obtain a higher relevance score for each location.
      </p>
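      <p>A minimal sketch of this selection loop, assuming scikit-learn's OneClassSVM with a degree-2 polynomial kernel as a stand-in for our quadratic-kernel setup; the accuracy computation over the sampled positives and the fixed negatives, and all variable names, are illustrative assumptions.</p>
      <preformat>
import numpy as np
from sklearn.svm import OneClassSVM

def remove_outliers(location_images, negatives, n_rounds=10, n_pos=60, seed=0):
    """Train a one-class SVM on 60 random images of the location, score it on
    the sampled positives plus the fixed negative pool, repeat 10 times and
    keep the model with the highest accuracy; predictions of -1 are outliers."""
    rng = np.random.default_rng(seed)
    best_model, best_acc = None, -1.0
    for _ in range(n_rounds):
        idx = rng.choice(len(location_images),
                         min(n_pos, len(location_images)), replace=False)
        model = OneClassSVM(kernel="poly", degree=2).fit(location_images[idx])
        preds = np.concatenate([model.predict(location_images[idx]),
                                model.predict(negatives)])
        labels = np.concatenate([np.ones(len(idx)), -np.ones(len(negatives))])
        acc = float(np.mean(preds == labels))
        if acc > best_acc:
            best_model, best_acc = model, acc
    keep = best_model.predict(location_images) == 1
    return location_images[keep]
      </preformat>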
    </sec>
    <sec id="sec-5">
      <title>5. CLUSTERING AND RETRIEVAL</title>
      <p>After the outliers are removed, we cluster the images of each location using the k-means algorithm. On the devset we try 2 different K values. First we set K to 25, because we observed that each location has at most 25 subclasses in its diversity subgroups. Second we set K to 50, because that is the maximum number of images required to be retrieved. The latter setting, i.e. over-clustering, worked better on the devset, so we report our testset results using K = 50.</p>
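      <p>A short sketch of the clustering step, assuming scikit-learn's KMeans rather than our original code; the guard for locations with fewer than 50 remaining images is an assumption.</p>
      <preformat>
from sklearn.cluster import KMeans

def cluster_location(features, n_clusters=50, seed=0):
    """Over-cluster the outlier-free images of one location with k-means; the
    fitted cluster centers drive the retrieval step."""
    n_clusters = min(n_clusters, len(features))      # small-location guard (assumption)
    return KMeans(n_clusters=n_clusters, random_state=seed).fit(features)
      </preformat>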
      <p>After we compute the cluster centroids, we simply retrieve the images that are closest to the centroids. We apply the k-nearest-neighbor method with Euclidean distance and search for the nearest neighbor of each centroid. While computing nearest neighbors we take care to retrieve a unique neighbor for each cluster centroid.</p>
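      <p>The retrieval step can be sketched as follows (an illustrative reading, not our exact implementation): for each centroid we pick the closest image in Euclidean distance that has not already been retrieved for another centroid.</p>
      <preformat>
import numpy as np
from scipy.spatial.distance import cdist

def retrieve_from_centroids(features, centroids):
    """Return one unique image index per cluster centroid, choosing for each
    centroid the nearest not-yet-used image in Euclidean distance."""
    dists = cdist(centroids, features)               # (n_centroids, n_images)
    used, ranked = set(), []
    for row in dists:
        for idx in np.argsort(row):                  # candidates ordered by distance
            if idx not in used:
                used.add(int(idx))
                ranked.append(int(idx))
                break
    return ranked
      </preformat>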
      <p>
        Results on the devset are shown in Table 1. One may observe that the SIFT-BOVW [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] features work better than the default features. The reason is that local descriptors generally capture similarities among images better, so each cluster becomes more coherent. GIST [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] features also perform better than the default features and perform similarly to the SIFT-BOVW features. Results for our 2 submissions, namely Run#1 and Run#5, can be found in Table 2. Consistent with the devset results, using SIFT-BOVW we obtain better results with Run#5 than with Run#1.
      </p>
    </sec>
    <sec id="sec-6">
      <title>6. CONCLUSIONS</title>
      <p>We showed that it is possible to obtain competitive results using only visual features. Our framework first eliminates the outliers and then uses clustering to bring diversity to the retrieval results. However, the scores could clearly be improved by incorporating additional information, such as textual features and credibility scores, into our framework.</p>
    </sec>
    <sec id="sec-7">
      <title>7. ACKNOWLEDGMENTS</title>
      <p>This research was supported by the MUCKE project, funded within the FP7 CHIST-ERA scheme, and also by the Scientific and Technical Research Council of Turkey (TUBITAK) under grant number 112E174.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] B. Ionescu, A. Popescu, M. Lupu, A. L. Gînscă, and H. Müller. Retrieving diverse social images at MediaEval 2014: Challenge, dataset and evaluation. In MediaEval 2014 Workshop, October 16-17, Barcelona, Spain, 2014.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] D. G. Lowe. Object recognition from local scale-invariant features. In Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, volume 2, pages 1150-1157. IEEE, 1999.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] H. Lukashevich, S. Nowak, and P. Dunker. Using one-class SVM outliers detection for verification of collaboratively tagged image training sets. In Multimedia and Expo, 2009. ICME 2009. IEEE International Conference on, pages 682-685. IEEE, 2009.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] A. Oliva and A. Torralba. Building the gist of a scene: The role of global image features in recognition. Progress in Brain Research, 155:23-36, 2006.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>