<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>PeRCeiVe Lab@UNICT at MediaEval 2014 Diverse Images: Random Forests for Diversity-based Clustering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Concetto Spampinato</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simone Palazzo</string-name>
          <email>simone.palazzo@dieei.unict.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Electrical, Electronics and Computer Engineering University of Catania Viale Andrea Doria</institution>
          ,
          <addr-line>6 - 95125, Catania</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>16</fpage>
      <lpage>17</lpage>
      <abstract>
<p>In this paper we describe the work done by the Pattern Recognition and Computer Vision Laboratory (PeRCeiVe Lab) of the University of Catania (Italy) for the MediaEval 2014 Retrieving Diverse Social Images Task. The main challenge consists of retrieving, for a given topic, a set of photos which are relevant to the topic but also show diverse views of it. We submitted four runs exploiting a common, feature-independent clustering strategy based on random forests for diversifying Flickr result images while preserving relevance.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The goal of the MediaEval 2014 Retrieving Diverse Social
Images Task [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is to refine a list of location photos retrieved
from Flickr through textual queries. Although photos
retrieved from Flickr are often relevant, e.g., they depict partially or
entirely the target location, a significant number is either
noisy or redundant. The objective is, therefore, to filter out
such photos in order to obtain an exhaustive and compact
summary for the considered location.
      </p>
      <p>Unlike most of the existing methods, our approach first looks for diversity among the Flickr photos by performing a diversity-based clustering, and then removes the irrelevant clusters by computing the similarities between the clustered photos and the information available on Wikipedia. Final re-ranking is carried out by re-clustering the relevant images and computing a diversity score for each location photo.</p>
    </sec>
    <sec id="sec-2">
      <title>2. METHOD</title>
      <p>The strategy employed for dealing with the Retrieving
Diverse Social Images Task of MediaEval 2014 relies on a
common, feature-independent framework consisting of the
following steps:</p>
      <p>
        Diversity-based clustering. The goal of this clustering step is to assess the dissimilarity between samples (i.e., location photos described with either visual or text descriptors) of a given location and to group them according to such dissimilarity. To do that, we use random forest predictors, which allow us to define a dissimilarity measure between observations [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. For each tree of the forest, if samples/observations i and j land in the same terminal node, their similarity score is increased by one. The similarities are finally normalized by the number of trees.
      </p>
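The per-tree similarity count described above can be sketched in a few lines of NumPy. This is our own illustration (not the authors' code), assuming the forest has already been trained and the terminal-node index of every sample in every tree is available:

```python
import numpy as np

def rf_proximity(leaf_indices):
    """Random-forest proximity: for each tree, two samples landing in
    the same terminal node add 1 to their similarity; the total is
    normalized by the number of trees, giving values in [0, 1].

    leaf_indices: (n_samples, n_trees) array; entry [i, t] is the
    terminal node that sample i reaches in tree t.
    """
    n_samples, n_trees = leaf_indices.shape
    sim = np.zeros((n_samples, n_samples))
    for t in range(n_trees):
        leaves = leaf_indices[:, t]
        sim += (leaves[:, None] == leaves[None, :]).astype(float)
    return sim / n_trees  # normalize by the number of trees

# toy example: 4 samples, 3 trees
leaves = np.array([[0, 1, 2],
                   [0, 1, 0],
                   [1, 0, 2],
                   [1, 0, 0]])
SIM = rf_proximity(leaves)  # SIM[0, 1] = 2/3: leaves shared in 2 of 3 trees
```

By construction the matrix is symmetric with ones on the diagonal, which is what the next step relies on.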
      <p>
        The similarities between samples build up a matrix, SIM, which is symmetric, positive definite and with values in the unit interval [0, 1]. The dissimilarity matrix is defined as DISSIM_ij = sqrt(1 - SIM_ij), and is employed as input for partitioning-around-medoids clustering [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Each obtained cluster is likely to show a specific view of the considered location;
Cluster filtering by relevance. Starting from the diversity clusters, we then perform cluster filtering by relevance. In particular, let SW be the average of the maximum similarities between all the topic samples and the content available on Wikipedia, computed as follows:
      </p>
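A compact sketch of this step (our own illustration, not the authors' implementation): the proximity matrix is turned into DISSIM = sqrt(1 - SIM) and partitioned with a simplified PAM-style k-medoids; a production run would rather use a full PAM implementation as in [2]:

```python
import numpy as np

def k_medoids(dissim, k, n_iter=100):
    """Simplified PAM: greedy BUILD initialization, then alternate
    sample-to-medoid assignment and per-cluster medoid updates."""
    n = dissim.shape[0]
    # BUILD: start from the sample with minimum total dissimilarity,
    # then greedily add the medoid that most reduces the total cost
    medoids = [int(np.argmin(dissim.sum(axis=0)))]
    while len(medoids) < k:
        best, best_cost = -1, np.inf
        for c in range(n):
            if c in medoids:
                continue
            cost = np.minimum(dissim[:, medoids].min(axis=1), dissim[:, c]).sum()
            if cost < best_cost:
                best, best_cost = c, cost
        medoids.append(best)
    medoids = np.array(medoids)
    for _ in range(n_iter):
        labels = np.argmin(dissim[:, medoids], axis=1)
        new = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if members.size:
                # new medoid: cluster member with minimum intra-cluster cost
                new[c] = members[np.argmin(dissim[np.ix_(members, members)].sum(axis=0))]
        if np.array_equal(new, medoids):
            break
        medoids = new
    return labels, medoids

# toy proximity matrix from the random-forest step
SIM = np.array([[1.0, 0.9, 0.1],
                [0.9, 1.0, 0.2],
                [0.1, 0.2, 1.0]])
DISSIM = np.sqrt(1.0 - SIM)          # DISSIM_ij = sqrt(1 - SIM_ij)
labels, medoids = k_medoids(DISSIM, k=2)
```

On this toy input the two highly similar samples (0 and 1) fall into one cluster and the dissimilar sample 2 into the other.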
      <p>SW = (1/N) · Σ_{i=1}^{N} max_{j ∈ Wiki} SIM(i, j)   (1)</p>
      <p>with N being the number of topic samples and Wiki the set of samples describing the Wikipedia location content (e.g., in the case of visual features, Wiki contains the images available on Wikipedia for that location, while when employing text features |Wiki| = 1, since only one sample, describing the entire Wikipedia page text, is considered).</p>
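Equation (1) reduces to a couple of NumPy operations; a minimal sketch assuming the pairwise similarities between topic samples and Wikipedia samples are stored in an (N × |Wiki|) matrix:

```python
import numpy as np

def wiki_relevance(sim_to_wiki):
    """SW of Eq. (1): for each topic sample take its maximum similarity
    to the Wikipedia content, then average over the N topic samples."""
    sim_to_wiki = np.asarray(sim_to_wiki, dtype=float)
    return sim_to_wiki.max(axis=1).mean()

# 3 topic samples vs 2 Wikipedia images
sim = np.array([[0.2, 0.8],
                [0.5, 0.4],
                [0.9, 0.1]])
SW = wiki_relevance(sim)  # (0.8 + 0.5 + 0.9) / 3
```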
      <p>For each cluster C we carry out an unsupervised hierarchical tree clustering on the samples' features, thus obtaining C' sub-clusters. After that, we scale the similarities between the samples of cluster C and the Wikipedia content by C', i.e., SIM_C' = SIM(i, j) · C' for i ∈ C, j ∈ Wiki. This gives more weight to the samples which most resemble the Wikipedia content and that, at the same time, are diverse from the other samples. The relevance score RS_C of cluster C is, eventually, computed as the mean of SIM_C', and the cluster is removed if RS_C &lt; k · SW;
Final ranking according to a diversity score. The samples obtained at the previous step are again clustered using unsupervised hierarchical clustering and re-ranked according to a diversity score, computed for a sample j by integrating: 1) the number of samples in nearby clusters: if sample j is in cluster C_j, we count how many samples are in C_j, C_{j-1} and C_{j+1}, and its score is s1_j = 1 / (N_{C_j} + N_{C_{j-1}} + N_{C_{j+1}}). This again favors diversity, as samples with many items in nearby clusters are strongly penalized; 2) the photo-ID distance s2_j from sample j to all the other samples in the list: photos with close IDs are likely to refer to a similar view of a location; and 3) a random factor s3_j to enable the exploration of new solutions. The final ranking score of sample j is given by multiplying the three above scores.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption><p>Results of the four submitted runs on the development and test sets (precision, cluster recall and F1-score at cutoff 20).</p></caption>
        <table>
          <thead>
            <tr><th /><th>Run 1</th><th>Run 2</th><th>Run 3</th><th>Run 5</th></tr>
          </thead>
          <tbody>
            <tr><td>Dev P@20</td><td>86.67%</td><td>78.17%</td><td>85.33%</td><td>84.14%</td></tr>
            <tr><td>Dev CR@20</td><td>43.10%</td><td>44.02%</td><td>43.61%</td><td>42.97%</td></tr>
            <tr><td>Dev F1@20</td><td>56.87%</td><td>55.59%</td><td>56.93%</td><td>56.14%</td></tr>
            <tr><td>Test P@20</td><td>74.80%</td><td>75.53%</td><td>72.40%</td><td>72.93%</td></tr>
            <tr><td>Test CR@20</td><td>38.74%</td><td>39.02%</td><td>37.88%</td><td>37.31%</td></tr>
            <tr><td>Test F1@20</td><td>50.34%</td><td>50.63%</td><td>49.03%</td><td>48.49%</td></tr>
          </tbody>
        </table>
      </table-wrap>
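The three factors of the diversity score can be sketched as below. This is our own reading of the step: the paper does not specify the exact form of the photo-ID distance s2_j or the range of the random factor s3_j, so mean absolute ID distance and a uniform draw are used here as hypothetical choices:

```python
import numpy as np

def diversity_scores(cluster_of, photo_ids, seed=0):
    """Score each sample as s1 * s2 * s3:
    s1: inverse of the sample count in its cluster and the two
        neighboring clusters (crowded neighborhoods are penalized);
    s2: mean absolute photo-ID distance to all other samples
        (close IDs tend to show the same view of a location);
    s3: a random factor enabling exploration of new solutions.
    """
    rng = np.random.default_rng(seed)
    cluster_of = np.asarray(cluster_of)
    ids = np.asarray(photo_ids, dtype=float)
    counts = np.bincount(cluster_of)
    scores = np.empty(len(ids))
    for j, c in enumerate(cluster_of):
        near = sum(counts[k] for k in (c - 1, c, c + 1)
                   if 0 <= k < len(counts))
        s1 = 1.0 / near
        s2 = np.abs(ids - ids[j]).mean()
        s3 = rng.uniform(0.5, 1.5)   # hypothetical range
        scores[j] = s1 * s2 * s3
    return scores

scores = diversity_scores([0, 0, 1], [1001, 1002, 5000])
```

Samples are then re-ranked by this score in decreasing order.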
    </sec>
    <sec id="sec-3">
      <title>3. EXPERIMENTS</title>
    </sec>
    <sec id="sec-4">
      <title>3.1 Setup and features</title>
      <p>The algorithm described in Sect. 2 depends on the number of trees used for the diversity-based clustering and on the threshold k for cluster filtering by relevance, which were set, respectively, to 50 and to 3.5 for visual features and 1.0 for text features.</p>
      <p>For runs using only visual features, we used the visual
descriptors provided by the task's organizers for each photo
(including the Wikipedia ones) normalized between 0 and 1
and concatenated into a single 945-dimensional vector.</p>
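A sketch of the descriptor preparation, under our assumption that each provided descriptor is min-max normalized independently before concatenation (the paper does not state whether normalization is per descriptor or per dataset):

```python
import numpy as np

def normalize_concat(descriptors):
    """Min-max normalize each descriptor to [0, 1], then concatenate
    into a single feature vector."""
    parts = []
    for d in descriptors:
        d = np.asarray(d, dtype=float)
        span = d.max() - d.min()
        parts.append((d - d.min()) / span if span > 0 else np.zeros_like(d))
    return np.concatenate(parts)

# two toy descriptors combined into one 5-dimensional vector
vec = normalize_concat([[0.0, 5.0, 10.0], [2.0, 4.0]])  # -> [0, 0.5, 1, 0, 1]
```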
      <p>Textual descriptors were computed as TF-IDF vectors from a vocabulary made up of all words from titles, descriptions and tags of the photos in the development and test sets, plus words extracted from the Wikipedia page (available as part of the data provided by the task's organizers). In order to reduce the original vocabulary, which was too large (more than 90,000 words), we removed: 1) words shorter than four characters; 2) words starting with digits; 3) words with low maximum TF-IDF values, resulting in a vocabulary of 51,136 terms.</p>
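The three pruning rules translate directly into a filter over the vocabulary; a minimal sketch in which `min_tfidf` is a hypothetical threshold (the paper does not report the value used):

```python
def reduce_vocabulary(max_tfidf, min_len=4, min_tfidf=0.05):
    """Keep a word only if it is at least `min_len` characters long,
    does not start with a digit, and its maximum TF-IDF value over
    the corpus reaches `min_tfidf`."""
    return {w: s for w, s in max_tfidf.items()
            if len(w) >= min_len and not w[0].isdigit() and s >= min_tfidf}

# toy vocabulary with each word's maximum TF-IDF value
vocab = {"eiffel": 0.9, "the": 0.8, "1889": 0.7, "tower": 0.4, "zzzz": 0.01}
reduced = reduce_vocabulary(vocab)  # keeps only "eiffel" and "tower"
```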
      <p>For both the visual-based and the text-based classifiers, the random forest was trained on 10 locations randomly selected from the ones available in the training set (the remaining ones were used for testing). Increasing the number of locations led to higher training times without an actual improvement in accuracy on the training set.</p>
    </sec>
    <sec id="sec-5">
      <title>3.2 Results and discussion</title>
      <p>
        We submitted four runs, whose results are given in Table 1:
Run 1 (visual information only): we employed the algorithm described in Sect. 2 on the visual features, filtering out images: 1) having people, detected by the face detector in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], as subjects, and 2) whose Euclidean distance between the location's GPS coordinates (provided as part of each topic's description) and their own GPS coordinates (when provided) was over 10;
Run 2 (text information only): the same algorithm was employed on the reduced-vocabulary TF-IDF descriptors;
Run 3 (visual-text fusion): the results presented in this run were obtained by combining those computed for Run 1 and Run 2, i.e., by multiplying the ranking scores of the two ranked lists for images appearing in both lists and completing the final list with images from the list of Run 1;
Run 5 (any resources): same algorithm as in Run 1, without applying the face- and GPS-based filtering.
It is clear that our attempt at making the training phase as independent as possible from the development set succeeded only partially, since the results on the test set are noticeably lower than those on the training set. Also, textual features surprisingly performed better than visual ones, which had obtained the highest accuracy by far on the development set. The significant performance difference, in terms of precision, of the visual runs (about 12%) between the test set and the development set can be due to a different relevance distribution of the visual features, while text features seem to keep the same distribution on the two sets.
      </p>
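Run 3's fusion rule can be sketched as follows; the score dictionaries and the ordering of the appended Run 1 images are our assumptions for illustration:

```python
def fuse_rankings(visual_scores, text_scores):
    """Multiply the ranking scores of images appearing in both lists,
    then complete the list with the remaining Run 1 (visual) images."""
    fused = {img: s * text_scores[img]
             for img, s in visual_scores.items() if img in text_scores}
    ranked = sorted(fused, key=fused.get, reverse=True)
    # append images present only in the visual list, by visual score
    ranked += [img for img in sorted(visual_scores, key=visual_scores.get,
                                     reverse=True) if img not in fused]
    return ranked

order = fuse_rankings({"a": 0.9, "b": 0.5, "c": 0.8},
                      {"a": 0.2, "b": 0.9})  # -> ["b", "a", "c"]
```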
    </sec>
    <sec id="sec-6">
      <title>4. CONCLUSIONS</title>
      <p>In this paper we described our random-forest-based approach for tackling the MediaEval 2014 Retrieving Diverse Social Images Task. Our method, applied to text and visual features alike, leverages a diversity-based clustering using random forests and a noisy-cluster filtering to increase relevance. The final ranking is built so as to favor diversity over relevance.</p>
      <p>As a future improvement, we plan to better exploit the amount of information provided for development: in particular, the diversity ground truth will be used to improve the intra-cluster split in the diversity-based clustering, and user-credibility information will be integrated into the relevance estimation model.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>B.</given-names> <surname>Ionescu</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Popescu</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Lupu</surname></string-name>,
          <string-name><given-names>A. L.</given-names> <surname>Gînscă</surname></string-name>, and
          <string-name><given-names>H.</given-names> <surname>Müller</surname></string-name>.
          <article-title>Retrieving Diverse Social Images at MediaEval 2014: Challenge, Dataset and Evaluation</article-title>.
          In <source>MediaEval 2014 Workshop</source>, October 16-17, <year>2014</year>, Barcelona, Spain.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>H.-S.</given-names> <surname>Park</surname></string-name> and
          <string-name><given-names>C.-H.</given-names> <surname>Jun</surname></string-name>.
          <article-title>A simple and fast algorithm for k-medoids clustering</article-title>.
          <source>Expert Syst. Appl.</source>, <volume>36</volume>(<issue>2</issue>):<fpage>3336</fpage>-<lpage>3341</lpage>, Mar. <year>2009</year>.
        </mixed-citation>
      </ref>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>T.</given-names> <surname>Shi</surname></string-name> and
          <string-name><given-names>S.</given-names> <surname>Horvath</surname></string-name>.
          <article-title>Unsupervised Learning With Random Forest Predictors</article-title>.
          <source>Journal of Computational and Graphical Statistics</source>, <volume>15</volume>(<issue>1</issue>):<fpage>118</fpage>-<lpage>138</lpage>, <year>2006</year>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>P.</given-names> <surname>Viola</surname></string-name> and
          <string-name><given-names>M.</given-names> <surname>Jones</surname></string-name>.
          <article-title>Robust real-time object detection</article-title>.
          <source>International Journal of Computer Vision</source>, <volume>57</volume>:<fpage>137</fpage>-<lpage>154</lpage>, <year>2001</year>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>