<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Ghent University-iMinds at MediaEval 2014 Diverse Images: Adaptive Clustering with Deep Features</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Abhineshwar Tomar</string-name>
          <xref ref-type="aff" rid="aff0">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Baptist Vandersmissen</string-name>
          <xref ref-type="aff" rid="aff0">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wesley De Neve</string-name>
          <email>wesley.deneve@ugent.be</email>
          <xref ref-type="aff" rid="aff0">1</xref>
          <xref ref-type="aff" rid="aff1">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rik Van de Walle</string-name>
          <xref ref-type="aff" rid="aff0">1</xref>
        </contrib>
        <aff id="aff0">
          <label>1</label>
          <institution>Multimedia Lab, ELIS, Ghent University - iMinds</institution>
          ,
          <addr-line>Ghent</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
        <aff id="aff1">
          <label>2</label>
          <institution>Image and Video Systems Lab, KAIST</institution>
          ,
          <addr-line>Daejeon</addr-line>
          ,
          <country country="KR">South Korea</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>16</fpage>
      <lpage>17</lpage>
      <abstract>
        <p>In this paper, we attempt to tackle the MediaEval 2014 Retrieving Diverse Social Images challenge, a filter and refinement problem defined for a Flickr-based ranked set of social images. We build upon solutions proposed in [5] and mainly focus on exploiting the joint use of all modalities. The use of image features extracted from a deep convolutional neural network, combined with the use of distributed word representations, forms the basis of our approach.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        In this paper, we describe our approach for tackling the
MediaEval 2014 Retrieving Diverse Social Images Task [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
This task focuses on result diversification in the context of
image retrieval. We refer to [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] for a complete task overview.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. METHODOLOGY</title>
      <p>This section describes four different approaches created to
solve the aforementioned challenge. The approach used in
the last run uses external data sources; all other approaches
exclusively use data provided by the task organizers. We
focused on two parts: estimating the relevance of an image with
respect to a specific location, and estimating the similarity
between a pair of images. In particular, runs 2, 3, and 5 build
upon these parts.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Run 1: Visual-only</title>
      <p>
        We propose a hierarchical clustering-based approach for
the ranking of images in accordance with their relevance
and diversity for a specific location. We used the approach
proposed in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] (cf. "Visual run").
      </p>
    </sec>
    <sec id="sec-4">
      <title>2.2 Run 2: Textual-only</title>
      <p>
        The textual run makes use of information derived from the
provided tags and other textual metadata. This approach
aims at diversifying the results by optimizing an adapted
performance metric. We modified both the relevance and the
diversity estimation of the algorithm proposed in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] (cf.
"Textual run"), as presented in the following sections.
      </p>
      <sec id="sec-4-1">
        <title>2.2.1 Relevance Estimation</title>
        <p>The relevance of an image x is estimated as</p>
        <disp-formula id="eq1">
          <tex-math><![CDATA[Rel(x) = \alpha \cdot tags(x) + \beta \cdot \frac{1}{flickr(x)} \qquad (1)]]></tex-math>
        </disp-formula>
        <p>with α and β representing scalars,</p>
        <disp-formula id="eq2">
          <tex-math><![CDATA[tags(x) = \frac{|\{t \mid t \in T_x,\ tfidf_t > \gamma\}|}{|T_x|} \cdot \sum_{t \in T_x} tfidf_t \qquad (2)]]></tex-math>
        </disp-formula>
        <p>and flickr(x) denoting the original Flickr ranking of image
x. The TF-IDF score of tag t is denoted by tfidf<sub>t</sub>. The
tag score (cf. Equation 2) is the sum of each tag's
normalized TF-IDF score, multiplied by the relative number of
high-scoring tags. In our approach, γ is set to the average
TF-IDF score. This benefits images with a higher number of
more relevant tags.</p>
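        <p>As an illustration, a minimal Python sketch of Equations (1) and (2) could look as follows; the helper names and the default values for α and β are assumptions for illustration, as the scalars are left unspecified:</p>
        <preformat><![CDATA[
# Sketch of Equations (1)-(2); `tfidf` maps each tag of image x to its
# TF-IDF score, and `flickr_rank` is the 1-based original Flickr rank.
def tag_score(tfidf: dict) -> float:
    """tags(x): sum of TF-IDF scores, weighted by the fraction of
    high-scoring tags; the threshold gamma is the average TF-IDF score."""
    if not tfidf:
        return 0.0
    gamma = sum(tfidf.values()) / len(tfidf)
    high = sum(1 for s in tfidf.values() if s > gamma)
    return (high / len(tfidf)) * sum(tfidf.values())

def relevance(tfidf: dict, flickr_rank: int,
              alpha: float = 1.0, beta: float = 1.0) -> float:
    """Rel(x) = alpha * tags(x) + beta / flickr(x); alpha and beta are
    unspecified scalars, so the defaults here are arbitrary."""
    return alpha * tag_score(tfidf) + beta * (1.0 / flickr_rank)
]]></preformat>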
      </sec>
      <sec id="sec-4-2">
        <title>2.2.2 Diversity Estimation</title>
        <p>Estimating the semantic difference between two images is
based on the number of shared tags. Let x and y denote two
images, with T<sub>x</sub> and T<sub>y</sub> denoting their sets of tags,
respectively. The diversity is then calculated as follows:</p>
        <disp-formula id="eq3">
          <tex-math><![CDATA[Div(x, y) = 1 - \frac{|T_x \cap T_y|}{\max(|T_x|, |T_y|)} \qquad (3)]]></tex-math>
        </disp-formula>
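        <p>The same measure, written as a small Python function (the edge case of two untagged images is our own assumption, as it is not covered above):</p>
        <preformat><![CDATA[
def diversity(tags_x: set, tags_y: set) -> float:
    """Div(x, y) = 1 - |Tx intersect Ty| / max(|Tx|, |Ty|), cf. Equation (3)."""
    if not tags_x and not tags_y:
        return 1.0  # assumption: untagged images are treated as fully diverse
    return 1.0 - len(tags_x & tags_y) / max(len(tags_x), len(tags_y))
]]></preformat>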
      </sec>
    </sec>
    <sec id="sec-5">
      <title>2.3 Run 3: Visual and Textual</title>
      <p>
        The fusion of both visual and textual information results
in a relevance-based clustering approach (cf. "Combined
run" in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]). We modified the clustering technique to
adaptive hierarchical clustering: the optimal distance at which to form
clusters is determined by finding the "knee" point in the plot
of the number of clusters versus the inter-cluster distance
(similar to [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]). To estimate the relevance of an image, we use
our textual-only method (cf. Section 2.2.1). The diversity
between two images is estimated based on the Euclidean
distance between their visual descriptors, which are represented
by CN3x3 and LBP3x3 vectors [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
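      <p>A hypothetical sketch of the adaptive cut, using SciPy's hierarchical clustering; only the idea of cutting at the knee of the clusters-versus-distance curve (similar to [3]) comes from the text, so the maximum-curvature heuristic below is our own simplification:</p>
      <preformat><![CDATA[
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def adaptive_clusters(features: np.ndarray) -> np.ndarray:
    """Cluster visual descriptors hierarchically and cut the dendrogram
    at the "knee" of the (number of clusters, inter-cluster distance) curve."""
    Z = linkage(features, method="average", metric="euclidean")
    merge_dists = Z[:, 2]  # inter-cluster distance at each successive merge
    if len(merge_dists) <= 2:
        return fcluster(Z, t=merge_dists[-1], criterion="distance")
    # knee approximated as the point of maximum curvature (second difference)
    knee = np.argmax(np.diff(merge_dists, 2)) + 1
    return fcluster(Z, t=merge_dists[knee], criterion="distance")
]]></preformat>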
    </sec>
    <sec id="sec-6">
      <title>2.4 Run 5: External Sources</title>
      <p>The algorithm used to produce the fifth run is based on
the one used in Section 2.3. Both the relevance and the diversity
estimation components are adapted, as described below.</p>
      <sec id="sec-6-1">
        <title>2.4.1 Relevance Estimation</title>
        <p>In order to accurately estimate the relevance of an image,
a well-defined target location is necessary. Thus, each
location is first described in both a textual and a visual manner.</p>
        <p>To create this textual identity, related information for each
location is extracted from DBpedia. From this information,
textual keywords are extracted and combined with the top
k most frequently occurring tags in the set of images of a
location. The visual identity is formed on the basis of a
set of representative photos, retrieved via Wikipedia. The
relevance of an image is calculated based on a linear
combination of the following three factors: textual relevance,
visual relevance, and Flickr relevance.</p>
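        <p>A small sketch of how the textual identity could be assembled; the variable names are assumptions, and the DBpedia keywords are assumed to have been extracted beforehand (the value of k is not specified here):</p>
        <preformat><![CDATA[
from collections import Counter

def textual_identity(image_tag_sets: list, dbpedia_keywords: set,
                     k: int = 10) -> set:
    """Combine DBpedia-derived keywords with the top-k most frequently
    occurring tags over the location's image set."""
    counts = Counter(t for tags in image_tag_sets for t in tags)
    top_k = {t for t, _ in counts.most_common(k)}
    return dbpedia_keywords | top_k
]]></preformat>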
        <p>The textual relevance of an image is entirely based on its
tags. Again, assume that T<sub>x</sub> denotes the set of tags of
image x and that T<sub>a</sub> denotes the set of tags depicting location
a (i.e., its textual identity):</p>
        <disp-formula id="eq4">
          <tex-math><![CDATA[\frac{\sum_{t \in T_x} e^{\max_{k \in T_a} \{ sim(t, k) \}}}{|T_x|} \qquad (4)]]></tex-math>
        </disp-formula>
        <p>
          We propose a new method to compute the similarity
between tags and omit the use of the ubiquitous TF-IDF.
To this end, we make use of distributed word representations,
namely word2vec. A pretrained model (the Google News
Dataset-based dictionary, defined as T<sub>w</sub>) is used to convert
words to vectors. Such vectors preserve the semantic and
linguistic regularities among words [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. The following
formula describes this approach:
        </p>
        <disp-formula id="eq5">
          <tex-math><![CDATA[sim(t_a, t_b) = \begin{cases} \cos(\theta) & \text{if } t_a \in T_w \wedge t_b \in T_w \\ 1 & \text{if } (t_a \notin T_w \vee t_b \notin T_w) \wedge t_a = t_b \\ 0 & \text{otherwise} \end{cases} \qquad (5)]]></tex-math>
        </disp-formula>
        <p>with t<sub>a</sub> and t<sub>b</sub> depicting tags, and cos(θ) the cosine
similarity between their representative vectors. With this
technique, semantically similar but differently spelled tags
can still have an influence on the eventual relevance score.</p>
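        <p>A sketch of Equations (4) and (5) using the gensim library; the pretrained Google News vectors file name and all variable names are assumptions for illustration:</p>
        <preformat><![CDATA[
import numpy as np
from gensim.models import KeyedVectors

# Pretrained Google News word2vec model (assumed file name); its
# vocabulary plays the role of Tw in Equation (5).
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def sim(ta: str, tb: str) -> float:
    """Equation (5): cosine similarity if both tags are in the vocabulary,
    1 for identical out-of-vocabulary tags, 0 otherwise."""
    if ta in w2v and tb in w2v:
        return float(w2v.similarity(ta, tb))  # cos(theta)
    return 1.0 if ta == tb else 0.0

def textual_relevance(tags_x: set, tags_a: set) -> float:
    """Equation (4): mean over the image's tags of e raised to the best
    similarity against the location's textual-identity tags."""
    if not tags_x or not tags_a:
        return 0.0  # assumption: empty tag sets are not covered above
    return float(sum(np.exp(max(sim(t, k) for k in tags_a))
                     for t in tags_x) / len(tags_x))
]]></preformat>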
        <p>Visual relevance is calculated based on the maximum
similarity between the image and the representative Wikipedia
images. Finally, Flickr relevance is the inverse of the original
Flickr ranking of the image.</p>
      </sec>
      <sec id="sec-6-2">
        <title>2.4.2 Diversity Estimation</title>
        <p>
          To improve the similarity estimation, and thus the
dissimilarity estimation, between two images, we attempt to find more
effective visual descriptors. To this end, we make use of a deep
convolutional neural network named OverFeat, trained on 1.2 million images
from ImageNet, to extract high-level
features [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Each image is resized and cropped to a size of 231
pixels by 231 pixels; then, for each image, a representative
vector is extracted from the convolutional network. This is
done by feed-forward propagation through the network,
omitting the fully connected layers, which results in a vector
of size 4096 for each image. We thus assume that the
numerous filters in the convolutional layers extract high-level
and representative features. The diversity between two
images is then again estimated based on the Euclidean distance
between their descriptors.
        </p>
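        <p>Since the OverFeat tooling itself is not shown here, the following sketch approximates the pipeline with an AlexNet stand-in from torchvision; only the 231 by 231 input size and the idea of dropping the fully connected layers come from the description above:</p>
        <preformat><![CDATA[
import torch
from PIL import Image
from torchvision import models, transforms

# ImageNet-pretrained AlexNet as a stand-in for OverFeat (assumption);
# we keep only the convolutional part, as described above.
cnn = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()
prep = transforms.Compose([
    transforms.Resize(231),
    transforms.CenterCrop(231),
    transforms.ToTensor(),
])

def descriptor(path: str) -> torch.Tensor:
    """Feed-forward propagation through the convolutional layers only."""
    x = prep(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        f = cnn.features(x)
    return torch.flatten(f, 1)[0]

def visual_diversity(a: torch.Tensor, b: torch.Tensor) -> float:
    """Diversity as the Euclidean distance between two descriptors."""
    return torch.dist(a, b).item()
]]></preformat>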
      </sec>
    </sec>
    <sec id="sec-7">
      <title>3. EXPERIMENTS</title>
      <p>Table 1 lists the results of the original Flickr
ranking together with the results of all algorithms on the
development set. Table 2 shows the results on the test
set. Clearly, run 5 outperforms the other approaches when
observing the F1-measure: it reaches an F1-score of
57.16% on the development set and 54.55% on the test set.</p>
    </sec>
    <sec id="sec-8">
      <title>4. CONCLUSIONS</title>
      <p>We observe that run 5, using distributed word
representations for the relevance estimation and OverFeat features
for the diversity assessment, outperforms all other runs.
In particular, the use of advanced image features positively influences
the F1-score. For future work, the influence of more focused
distributed word representations will be investigated.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Popescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lupu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Ginsca</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          .
          <article-title>Retrieving diverse social images at MediaEval 2014: Challenge, dataset and evaluation</article-title>
          .
          <source>In MediaEval Working Notes</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Corrado</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>In Advances in Neural Information Processing Systems (NIPS)</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Salvador</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Chan</surname>
          </string-name>
          .
          <article-title>Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms</article-title>
          .
          <source>In Tools with Artificial Intelligence</source>
          , pages
          <fpage>576</fpage>
          -
          <lpage>584</lpage>
          , Nov.
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Sermanet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Eigen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mathieu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun. Overfeat</surname>
          </string-name>
          :
          <article-title>Integrated recognition, localization and detection using convolutional networks</article-title>
          .
          <source>CoRR</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Vandersmissen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tomar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Godin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>De Neve</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Van de Walle</surname>
          </string-name>
          .
          <article-title>Ghent University-iMinds at MediaEval 2013 Diverse Images: Relevance-Based Hierarchical Clustering</article-title>
          .
          <source>In Working Notes Proceedings of the MediaEval 2013 Workshop</source>
          , Barcelona, Spain, October 18-19, CEUR-WS, Vol.
          <volume>1043</volume>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>