<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>UPMC at MediaEval 2016 Retrieving Diverse Social Images Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sabrina Tollari</string-name>
          <email>Sabrina.Tollari@lip6.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Sorbonne Universités, UPMC Univ Paris 06, UMR CNRS 7606 LIP6</institution>
          ,
          <addr-line>75252 PARIS cedex 05</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <fpage>20</fpage>
      <lpage>21</lpage>
      <abstract>
        <p>In the MediaEval 2016 Retrieving Diverse Social Images Task, we proposed a general framework based on agglomerative hierarchical clustering (AHC). We tested the provided credibility descriptors as a vector input for our AHC. The results on devset showed that the vector based on the credibility descriptors is the best feature, but unfortunately this is not confirmed on testset. To merge several features, we chose to merge feature similarities. Tests on devset showed that merging similarities using linear or weighted-max operators gave, most of the time, better results than using only one feature. This result is partially confirmed on testset.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        In contrast to previous years, in 2016 the task [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] addresses
the use case of a general ad-hoc image retrieval system.
General cases are more difficult to tackle, because the system
cannot be adapted to a particular application. Another
difference is the use of the F1@20 metric, which means that we
are not only interested in diversity, but also in finding a balance
between relevance and diversity, which is more difficult to
handle. For the 2013 task, we proposed a framework [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] which
first tries to improve relevance and then performs a
clustering to improve diversity. This strategy obtained good
results and can handle general cases. So this year, we use
the same strategy, but we adapt the parameters to the use
of the F1@20 metric, i.e., not only to improve diversity, but to
find a balance between relevance and diversity.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. FRAMEWORK</title>
      <p>For each query, we apply the following framework. Step 1
(optional): Re-rank the Flickr baseline to improve relevance
according to text features. Step 2: Cluster the N first results
using Agglomerative Hierarchical Clustering (AHC). Step 3:
Sort the images in each cluster using their rank from Step 1,
then sort the clusters according to the rank of the image at the
top of each cluster. Step 4: Re-rank the results by alternating
images from different clusters.</p>
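      <p>For illustration, the following minimal Python sketch shows
Steps 3 and 4; the paper does not describe its implementation, so the
clusters and rank data structures here are our own assumptions:</p>
      <preformat>
# Hypothetical sketch of Steps 3 and 4, not the authors' actual code.
# clusters: {cluster id: [image ids]}; rank: {image id: Step 1 rank}.
from itertools import chain, zip_longest

def diversify(clusters, rank):
    # Step 3: sort images inside each cluster by their Step 1 rank,
    # then sort the clusters by the rank of their top image.
    ordered = sorted(
        (sorted(imgs, key=rank.get) for imgs in clusters.values()),
        key=lambda imgs: rank[imgs[0]],
    )
    # Step 4: take one image per cluster in turn (round-robin),
    # skipping the padding added for exhausted clusters.
    interleaved = chain.from_iterable(zip_longest(*ordered))
    return [img for img in interleaved if img is not None]
      </preformat>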
      <p>
        The AHC [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is a robust method that can handle different
kinds of features. Applying the AHC to query results
provides a hierarchy of image clusters. In order to obtain
groups of similar images, we cut the hierarchy to obtain a
fixed number k of unordered clusters (see [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] for details).
      </p>
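      <p>As a concrete example, the hierarchy construction and the cut
at a fixed k can be done with SciPy (an assumption: the paper does not
name the AHC implementation it uses):</p>
      <preformat>
# Hypothetical sketch of Step 2 with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def ahc_clusters(features, k, method="complete"):
    # Build the hierarchy, then cut it to obtain k unordered clusters.
    tree = linkage(features, method=method, metric="euclidean")
    labels = fcluster(tree, t=k, criterion="maxclust")
    clusters = {}
    for img, label in enumerate(labels):
        clusters.setdefault(label, []).append(img)
    return clusters

# Example: 300 results described by 13-dimensional vectors, 50 clusters.
clusters = ahc_clusters(np.random.rand(300, 13), k=50)
      </preformat>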
      <p>The AHC needs a measure to compare two documents. A
document can be described by several features (text, visual,
etc.). To take advantage of several features, we need a way
to merge them. We choose to merge similarities. Some of
the features are associated with a distance, others with a
similarity. In order to have only similarities, all distances
are transformed using the classical formula: let d(x, y) be a
distance between x and y, then the similarity is defined as:
sim(x, y) = 1 / (1 + d(x, y)).</p>
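      <p>In code, the transform is simply:</p>
      <preformat>
# The classical distance-to-similarity transform from the text.
def dist_to_sim(d):
    # Maps a non-negative distance to a similarity in (0, 1].
    return 1.0 / (1.0 + d)
      </preformat>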
      <p>Let f1 and f2 be two features and α ∈ [0, 1]. We compute
a linear fusion of feature similarities by:
sim_Linear(f1,f2,α)(x, y) = α · sim_f1(x, y) + (1 − α) · sim_f2(x, y).</p>
      <p>Let n be the number of features. We carefully choose a
weight wi for each feature fi, such that ∑_{i=1}^{n} wi = 1. We
compute a weighted-max fusion of similarities by:
sim_WMax(f1,w1,...,fn,wn)(x, y) = max_{i ∈ {1,...,n}} wi · sim_fi(x, y).</p>
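      <p>A small Python sketch of the two fusion operators (the helper
names are ours; the example weights are those of the weighted-max run
reported in the experiments):</p>
      <preformat>
def sim_linear(sim1, sim2, alpha):
    # Linear fusion of two similarity values, with alpha in [0, 1].
    return alpha * sim1 + (1.0 - alpha) * sim2

def sim_wmax(sims, weights):
    # Weighted-max fusion; the weights are assumed to sum to 1.
    return max(w * s for w, s in zip(weights, sims))

# Example with the WMax(tdtu,0.014,ScalCol,0.97,cred,0.016) weights:
fused = sim_wmax([0.8, 0.6, 0.9], [0.014, 0.97, 0.016])
      </preformat>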
    </sec>
    <sec id="sec-3">
      <title>3. EXPERIMENTS AND RESULTS</title>
      <sec id="sec-3-1">
        <title>Text re-ranking (Step 1).</title>
        <p>Using a vector space model (VSM) with tf-idf weights and
cosine similarity, we tested the choice of textual information
fields (Title (t), Description (d), Tags (t), Username (u)).
We also tested several stemmers. We noticed no significant
difference with or without stemmers, perhaps
because there are only a few words in the query title. So we
chose not to use a stemmer in any of the experiments. Finally,
it seems that, globally, ttu gives a slightly better P@20.</p>
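        <p>A minimal sketch of this re-ranking with scikit-learn (an
assumption: the paper does not name its VSM implementation):</p>
        <preformat>
# Hypothetical sketch of the Step 1 text re-ranking.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rerank_by_text(query, docs):
    # docs: one string per image, e.g. the concatenated ttu fields.
    # tf-idf vectors without stemming, as in the paper; documents are
    # then sorted by decreasing cosine similarity to the query.
    matrix = TfidfVectorizer().fit_transform(docs + [query])
    scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    return sorted(range(len(docs)), key=lambda i: -scores[i])
        </preformat>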
      </sec>
      <sec id="sec-3-2">
        <title>Features for clustering in Step 2.</title>
        <p>We tested several combinations of textual information
fields. Finally, for text clustering, the best solution on
devset is to use all the fields (tdtu) and a similarity based on
the Euclidean distance. It seems that using the
Description field in addition tends to produce more diversity than using
ttu, because documents are more dissimilar from each other.</p>
        <p>
          We tested the provided visual features cnn_gen and
cnn_ad. In most of our experiments, cnn_ad
gives slightly better or clearly better results than cnn_gen. We also
tested several features from the Lire library [
          <xref ref-type="bibr" rid="ref1 ref6">1, 6</xref>
          ]: the
ScalableColor feature (ScalCol), a histogram in HSV color
space encoded by a Haar transform, gives the best results.
        </p>
        <p>Using the provided credibility descriptors, we built, for
each image, a normalized real vector of 13 dimensions (noted
cred). NaN, null and missing values (about 3.5% of the
credibility descriptor values) are replaced by random values.</p>
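        <p>A sketch of this construction (the normalization scheme and
the distribution of the random replacement values are our assumptions):</p>
        <preformat>
import numpy as np

def build_cred_vector(raw, rng=np.random.default_rng()):
    # raw: the 13 credibility descriptor values, possibly with gaps.
    v = np.array([np.nan if x is None else float(x) for x in raw])
    missing = np.isnan(v)
    v[missing] = rng.random(missing.sum())   # random values for the gaps
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
        </preformat>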
      </sec>
      <sec id="sec-3-3">
        <title>Clustering parameters in Step 2.</title>
        <p>When varying features and parameters, we noticed that,
on devset, globally, complete linkage (AHCCompl) gave better
results than single or average linkages.</p>
        <p>For each query, 300 results were provided. Usually, there
are more relevant documents among the first results than at the
end of the result list. Is it worthwhile for the system to spend
time clustering 300 results online in order to improve the
F1@20 of the first 20 documents? We made several
experiments varying the diversity methods, parameters, features
and number of input documents. Globally, we did not see large
differences in terms of F1@20 between 150, 200,
250 or 300 documents. The only real difference depends on
the number of clusters. Usually, the more documents there are
in the input set, the higher the number of clusters should be
to obtain good results: around 20 clusters for 150
documents, and around 50 clusters for 300 documents. Finally,
we chose to take 300 documents because the peak of the
curve is wider than with 150 documents.</p>
      </sec>
      <sec id="sec-3-4">
        <title>Reference runs and run results.</title>
        <p>The baseline run is the Flickr ranking. The VSM(ttu) run is
obtained using the VSM on ttu fields and without clustering. To
have some comparison elements, we also tested: a clustering
of random features (documents are represented by vectors of
5 random values) and a clustering using only the username
(two documents with the same username are similar).</p>
        <p>As the queries are only composed of text, we cannot apply
a Step 1 to improve relevance in the case of run 1 (visual
only). Figure 1 shows that AHCCompl(ScalCol) (clustering
on ScalCol features without Step 1) gives lower results than
VSM(ttu)+AHCCompl(ScalCol) (with Step 1), but in both cases,
visual features give lower results than tdtu or cred features.</p>
        <p>The best number of clusters to use is always an open
question. If we want the best CR@20, most of the time it is better
to take 20 clusters; unfortunately, with 20 clusters the P@20
is often the worst. So, in order to optimise F1@20 and
according to the curves on devset (see Figure 1), we chose,
for runs 2 to 5, to take 50 clusters. This choice seems to
give a good compromise between relevance and diversity.</p>
        <p>On devset, the best results using only one feature are
obtained with cred. One reason may be that, in the case of cred,
images with the same userid have the same vectors, so these
images end up in the same cluster, and such images are often
about the same subtopic. In Figure 1, we can notice that
the clustering on username gives better results than on text
only (tdtu) or visual only (ScalCol), but lower results than
on cred. So there must also be another reason. If some
images have similar credibility descriptors, their users have
the same characteristics, but it is not clear why these
characteristics are interesting for diversity. To check that cred is
a good feature for diversity whatever the diversity method, we
tried this feature with a greedy algorithm and obtained the
same conclusions (on devset). Unfortunately, on testset, the
text-only run (run 2) gives better results than the cred one
(run 4) (see Table 1). So this result cannot be generalized
and may depend on devset.</p>
        <p>[Figure 1: F1@20 on devset as a function of the number of
clusters for: baseline; VSM(ttu); VSM(ttu)+AHCCompl(random);
VSM(ttu)+AHCCompl(username); VSM(ttu)+AHCCompl(cnn-ad);
VSM(ttu)+AHCCompl(ScalCol); AHCCompl(ScalCol);
VSM(ttu)+AHCCompl(tdtu); VSM(ttu)+AHCCompl(Linear(tdtu,ScalCol,0.02));
VSM(ttu)+AHCCompl(cred);
VSM(ttu)+AHCCompl(WMax(tdtu,0.014,ScalCol,0.97,cred,0.016)).]</p>
        <p>As the visual similarities are not normalized, we needed
to carefully optimize the weights of the linear and of the
weighted-max fusion operators. On devset, the
weighted-max fusion using tdtu, ScalCol and cred gave the best
results in all our experiments. But as cred is not as good on
testset, run 5 does not give very good results. Finally, the
linear fusion between text (tdtu) and visual (ScalCol) gives
the best results on testset (run 3).</p>
        <p>Despite the fact that we use different kinds of features, the
F1@20 scores for runs 2 to 5 are very close (from 0.543 to 0.553),
which means it is difficult to draw reliable conclusions on the
best feature or on the value of similarity fusion.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] http://www.semanticmetadata.net/lire.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R. O.</given-names>
            <surname>Duda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. E.</given-names>
            <surname>Hart</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. G.</given-names>
            <surname>Stork</surname>
          </string-name>
          .
          <article-title>Pattern Classification</article-title>
          . John Wiley and Sons, Inc., p.
          <fpage>552</fpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Gînscă</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaharieva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Boteanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lupu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          .
          <article-title>Retrieving diverse social images at MediaEval 2016: Challenge, dataset and evaluation</article-title>
          .
          <source>In MediaEval 2016 Workshop</source>
          , Hilversum, Netherlands, October 20-21,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Kuoman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tollari</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Detyniecki</surname>
          </string-name>
          . UPMC at MediaEval 2013:
          <article-title>Relevance by text and diversity by visual clustering</article-title>
          .
          <source>In MediaEval 2013 Workshop</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>Kuoman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tollari</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Detyniecki</surname>
          </string-name>
          .
          <article-title>Using tree of concepts and hierarchical reordering for diversity in image retrieval</article-title>
          .
          <source>In CBMI</source>
          , pages
          <fpage>251</fpage>
          -
          <lpage>256</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lux</surname>
          </string-name>
          and
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Chatzichristofis</surname>
          </string-name>
          .
          <article-title>Lire: Lucene image retrieval: An extensible Java CBIR library</article-title>
          .
          <source>In ACM International Conference on Multimedia</source>
          , pages
          <fpage>1085</fpage>
          -
          <lpage>1088</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>