<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CEA LIST's Participation at the MediaEval 2014 Retrieving Diverse Social Images Task</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Alexandru Lucian Ginsca</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>CEA, LIST, Vision &amp; Content Engineering Laboratory</institution>
          ,
          <addr-line>91190 Gif-sur-Yvette</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Faculty of Informatics, Vienna University of Technology</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>TELECOM Bretagne</institution>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>16</fpage>
      <lpage>17</lpage>
      <abstract>
<p>The MediaEval 2014 Retrieving Diverse Social Images Task aims to tackle the challenge of improving result diversity while keeping a high precision in a social image retrieval task. We base our approach on the retrieval performance of recently introduced visual descriptors, coupled with a mixed diversification method that explores the use of social cues together with a classic clustering setting. As a novelty, this year's task introduced user credibility features. We also describe how to use credibility in the diversification process and how to improve individual features by means of a regression model.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
<p>Social image retrieval presents an appropriate setting for
the use of multimodal approaches to improve both result
relevance and diversity. Recent works propose the use of
social cues alongside visual and textual data.</p>
<p>Our efforts are channeled towards exploiting visual
information and using credibility in the diversification
process. We first describe a couple of pre-filtering techniques,
followed by an image retrieval method that boosts precision.
Next, we describe how to predict a user's credibility score
and propose a user-based image filtering approach. After
showing how we improve diversity by clustering and cluster
ranking, we finally describe the submitted runs and discuss
the results we obtained on the testset.</p>
    </sec>
    <sec id="sec-2">
      <title>2. AIMING FOR PRECISION</title>
    </sec>
    <sec id="sec-3">
      <title>2.1 Initial pre-filtering</title>
      <p>
        We use two filtering steps with the goal of eliminating noise
from the image lists. Similar to [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], we eliminate geotagged
images whose distance from the POI is higher than 1
km. The second filter is a restriction on the presence of
faces in images. We use the standard OpenCV algorithm
to perform face detection and eliminate images with a
face coverage ratio higher than 0.4. The distance threshold
and the one for the percentage of faces are determined on
the devset. We keep the same pre-filtering steps for all the
runs.
      </p>
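      <p>As an illustration, a minimal sketch of the two filters, assuming per-image latitude/longitude metadata and OpenCV's stock frontal-face Haar cascade; the photo dictionary keys and helper names are hypothetical, while the 1 km and 0.4 thresholds are the devset-tuned values from above.</p>
      <preformat>
import math
import cv2

# Stock frontal-face Haar cascade shipped with opencv-python.
CASCADE = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in km."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def face_coverage(image_path):
    """Fraction of the image area covered by detected faces."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    faces = CASCADE.detectMultiScale(img, scaleFactor=1.1, minNeighbors=5)
    covered = sum(w * h for (_x, _y, w, h) in faces)
    return covered / float(img.shape[0] * img.shape[1])

def keep(photo, poi_lat, poi_lon):
    """Apply both pre-filters to one photo (a dict with lat/lon/path)."""
    if photo.get("lat") is not None:  # only geotagged images are checked
        if haversine_km(photo["lat"], photo["lon"], poi_lat, poi_lon) > 1.0:
            return False  # more than 1 km away from the POI
    if face_coverage(photo["path"]) > 0.4:
        return False  # faces cover more than 40% of the image area
    return True
      </preformat>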
    </sec>
    <sec id="sec-4">
      <title>2.2 Image retrieval</title>
    </sec>
    <sec id="sec-5">
      <title>3. LISTENING TO SOCIAL CUES</title>
    </sec>
    <sec id="sec-6">
      <title>3.1 Predicting user credibility</title>
      <p>We exploit the credibility set to train a regression model
that predicts a user's credibility score from the provided
features. We perform model selection and parameter tuning
by 5-fold cross-validation (cv) on the credibility set and
evaluate the performance of the predictions by Spearman's
rank correlation coefficient with the ground truth credibility
values. The highest cv correlation (0.47) is obtained using
gradient boosting regression trees with a Huber loss and 100
estimators. By comparison, the highest correlation of an
individual feature (visual score) is 0.36. The gain in terms
of Spearman correlation is also reflected in the competition
metrics. When fixing the rest of the parameters and using
the predicted credibility scores instead of the provided
visual credibility feature, F1@20 increases from 0.61 to 0.632
on the devset.</p>
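      <p>The model selection step can be sketched as follows, assuming the credibility features sit in a NumPy matrix X with ground-truth scores y; the function name is ours, while the Huber-loss gradient boosting configuration is the one retained above.</p>
      <preformat>
import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold

def cv_spearman(X, y, n_splits=5, seed=0):
    """5-fold CV Spearman correlation of predicted vs. true credibility."""
    preds = np.empty_like(y, dtype=float)
    for train, test in KFold(n_splits, shuffle=True,
                             random_state=seed).split(X):
        model = GradientBoostingRegressor(loss="huber", n_estimators=100)
        model.fit(X[train], y[train])
        preds[test] = model.predict(X[test])  # out-of-fold predictions
    return spearmanr(preds, y).correlation
      </preformat>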
    </sec>
    <sec id="sec-7">
      <title>3.2 User selection</title>
      <p>For each topic, we first keep the subset of users that have
contributions in the top n images of the ranking
produced by the image retrieval process described in Section 2.2.
Then, as an extra filter, in our final ranking we retain only
images coming from the selected user set. Given the good
precision of image retrieval, we have high confidence that
images found at the top of the ranking are relevant. This
gives us an ad-hoc insight into the topical expertise of the users
responsible for those images. We tune n on the devset and
fix it at 20. For comparison, when not using a user-based
filter, the F1@20 score drops from 0.632 to 0.597. We also
tried a similar approach by retaining contributions from top
users ranked according to the credibility score, but this did
not improve the results. This result hints at the need for a
topic-specific credibility score.</p>
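      <p>In code, the selection reduces to a set lookup over the owners of the top-n retrieval results; the (image, user) pair representation is a hypothetical simplification, with n = 20 as tuned above.</p>
      <preformat>
def user_filter(ranked_images, n=20):
    """Keep only images whose owner appears in the top-n of the ranking.

    ranked_images: list of (image_id, user_id) pairs, best first.
    """
    trusted = {user for _img, user in ranked_images[:n]}
    return [(img, user) for img, user in ranked_images if user in trusted]
      </preformat>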
    </sec>
    <sec id="sec-8">
      <title>4. IMPROVING DIVERSITY</title>
      <p>
        Building on previous work, we combine a more traditional
clustering approach for diversification with the use of social
cues [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
    </sec>
    <sec id="sec-9">
      <title>4.1 Clustering</title>
      <p>We first perform k-Means clustering on the complete set of
images. To ensure a stable cluster distribution, we initialize
the centroids by uniformly selecting images from the ranking
produced after image retrieval. For example, the i-th cluster
will have as initial centroid the image found at position
(i − 1) · n/k, where k is the desired number of clusters and
n is the number of images in the ranking. After validation
on the devset, k is set to 30.</p>
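      <p>A sketch of this deterministic initialization, assuming features holds one visual descriptor per image in retrieval-rank order; we use scikit-learn's explicit init-array convention, with k = 30 as above.</p>
      <preformat>
import numpy as np
from sklearn.cluster import KMeans

def rank_seeded_kmeans(features, k=30):
    """k-Means seeded with images taken uniformly along the ranking."""
    features = np.asarray(features)
    n = len(features)
    # Seed cluster i (1-based) with the image at position (i - 1) * n / k.
    seeds = features[[(i * n) // k for i in range(k)]]
    km = KMeans(n_clusters=k, init=seeds, n_init=1)  # keep the given seeds
    return km.fit_predict(features)
      </preformat>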
    </sec>
    <sec id="sec-10">
      <title>4.2 Cluster ranking</title>
      <p>We leverage the social component of this task by ordering
the clusters based on the average credibility score of the
users that contribute images to the cluster. For the
runs that do not permit the use of credibility, we rank the
clusters according to the number of unique users represented
in each cluster. In the case of a tie, we prefer the cluster that
has the best-ranked image after visual retrieval. Our final
ranked list is obtained by selecting, from each cluster in
turn, the image that is best placed in the visual retrieval
ranking.</p>
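      <p>The whole diversification stage can be sketched as below, assuming per-image cluster labels and retrieval ranks plus an optional per-user credibility map; the field names are hypothetical.</p>
      <preformat>
from collections import defaultdict

def diversify(images, credibility=None):
    """images: dicts with 'id', 'user', 'cluster', 'rank' (0 = best)."""
    clusters = defaultdict(list)
    for img in images:
        clusters[img["cluster"]].append(img)
    for members in clusters.values():
        members.sort(key=lambda im: im["rank"])  # best retrieval rank first

    def score(members):
        if credibility is not None:  # average credibility of contributors
            key = sum(credibility[im["user"]] for im in members) / len(members)
        else:  # fall back to the number of unique users in the cluster
            key = len({im["user"] for im in members})
        return (-key, members[0]["rank"])  # ties: best-ranked image wins

    ordered = sorted(clusters.values(), key=score)
    result, depth = [], 0
    while len(images) > len(result):  # one image per cluster per round
        for members in ordered:
            if len(members) > depth:
                result.append(members[depth])
        depth += 1
    return result
      </preformat>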
    </sec>
    <sec id="sec-11">
      <title>5. RESULTS AND DISCUSSION</title>
      <p>
        We submitted five different runs at this year's Retrieving
Diverse Social Images Task [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Our submissions are briefly
described below:
      </p>
      <p>RUN1 uses the provided LBP3x3 visual descriptor for
image retrieval and clustering. The clusters are then
ranked based on the number of users represented in
each cluster.</p>
      <p>
        RUN2 is a purely textual one. We concatenated the
title, tags and description of the photos to calculate the
text similarity. As a text pre-processing phase, we
decompounded the terms by applying a greedy approach
using a dictionary built from all the words
in the text. In the next step, in order to disambiguate
the places, we expanded the queries using the first
sentence of the corresponding Wikipedia article. After testing several language
models, a semantic similarity approach based on
Word2Vec [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] gave the best result. We trained a model
on Wikipedia and then used the vector representation
of words to calculate the text similarity of the query
to each photo. In addition to the text similarity, we
extracted three binary attributes: (1) whether the photo had
any views, (2) whether the distance between the photo and the
POI is greater than 8 kilometers, and (3) whether the
description has more than 2000 characters. All
features were then used in a Linear Regression model
to re-rank the list. Finally, following [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], in
order to diversify the ranking, we iterate over the
initial re-ranked list and keep one image from each user
at each iteration.
      </p>
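      <p>A minimal sketch of the Word2Vec similarity, assuming a gensim model trained on a Wikipedia dump; texts are compared through the cosine of their averaged word vectors, a common simplification rather than the exact setup used in the run.</p>
      <preformat>
import numpy as np
from gensim.models import Word2Vec

def text_vector(model, text):
    """Average word vector of the in-vocabulary tokens of a text."""
    vecs = [model.wv[w] for w in text.lower().split() if w in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

def text_similarity(model, query, photo_text):
    """Cosine similarity between the query and a photo's metadata text."""
    q, p = text_vector(model, query), text_vector(model, photo_text)
    denom = np.linalg.norm(q) * np.linalg.norm(p)
    return float(q @ p / denom) if denom else 0.0
      </preformat>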
      <p>RUN3 is a fusion of RUN1 and RUN2. Since the
scores for the visual and textual rankings are not in the
same range, fusion is performed based on the ranks of
the images in the two initial rankings. More specifically,
we perform a linear weighting in which the
individual ranks are given a weight of 0.5. Other weightings
were tested, but the results remain quite stable in
the 0.3 - 0.7 range, which speaks for the
robustness of the proposed fusion.</p>
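      <p>The rank-level fusion then amounts to a weighted sum of the two rank positions; this sketch uses hypothetical rank dictionaries, with the 0.5 weight retained above.</p>
      <preformat>
def fuse_ranks(visual_rank, text_rank, w=0.5):
    """visual_rank, text_rank: dicts mapping image id to rank (0 = best)."""
    ids = set(visual_rank) | set(text_rank)
    default = len(ids)  # images missing from one list go to the back
    fused = {i: w * visual_rank.get(i, default)
                + (1 - w) * text_rank.get(i, default) for i in ids}
    return sorted(ids, key=fused.get)  # best (lowest) fused score first
      </preformat>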
      <p>RUN4 is similar to RUN1, with the single difference
lying in the use of credibility for cluster ranking.</p>
      <p>RUN5 is obtained using the Caffe [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] visual descriptor for
image retrieval and clustering and the predicted credibility
scores for cluster ranking.</p>
      <p>Our textual run (RUN2) is the only one in which we do
not use clustering to improve diversity. This is reflected across
metrics, as can be seen in Table 1. Although it performs
well in terms of F1@20, this run sits at opposite poles
when looking at the other metrics: it has the highest P@20
and the lowest CR@20.</p>
      <p>The usefulness of credibility is best observed when
comparing RUN1 and RUN4. They share the same
configuration, the sole exception being the use of the predicted
credibility scores for cluster ranking in RUN4. Although the
difference is not as significant as on the devset, we can see
a slight improvement in F1@20.</p>
    </sec>
    <sec id="sec-12">
      <title>6. ACKNOWLEDGMENT</title>
      <p>This research was supported by the MUCKE project, partly
funded within the FP7 CHIST-ERA scheme.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          et al.
          <article-title>Retrieving diverse social images at MediaEval 2014: Challenge, dataset and evaluation</article-title>
          .
          <source>In MediaEval 2014 Workshop</source>
          , Barcelona, Spain, October 16-17,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N.</given-names>
            <surname>Jain</surname>
          </string-name>
          et al.
          <article-title>Experiments in diversifying Flickr result sets</article-title>
          .
          <source>In MediaEval 2013 Workshop</source>
          , Barcelona, Spain, October 18-19,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jia</surname>
          </string-name>
          .
          <article-title>Caffe: An open source convolutional architecture for fast feature embedding</article-title>
          . http://caffe.berkeleyvision.org,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          et al.
          <article-title>Efficient estimation of word representations in vector space</article-title>
          .
          <source>CoRR</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Popescu</surname>
          </string-name>
          .
          <article-title>CEA LIST's participation at the MediaEval 2013 Retrieving Diverse Social Images Task</article-title>
          .
          <source>In MediaEval 2013 Workshop</source>
          , Barcelona, Spain, October 18-19,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>