UPMC at MediaEval 2016 Retrieving Diverse Social Images Task

Sabrina Tollari
Sorbonne Universités, UPMC Univ Paris 06, UMR CNRS 7606 LIP6, 75252 PARIS cedex 05, France
Sabrina.Tollari@lip6.fr

ABSTRACT
In the MediaEval 2016 Retrieving Diverse Social Images Task, we proposed a general framework based on agglomerative hierarchical clustering (AHC). We tested the provided credibility descriptors as a vector input for our AHC. The results on devset showed that this vector based on the credibility descriptors is the best feature, but unfortunately this is not confirmed on testset. To merge several features, we chose to merge feature similarities. Tests on devset showed that merging similarities using linear or weighted-max operators gave, most of the time, better results than using only one feature. This result is partially confirmed on testset.

1. INTRODUCTION
Contrary to previous years, in 2016, the task [3] addresses the use case of a general ad-hoc image retrieval system. General cases are more difficult to tackle, because the system cannot be adapted to a particular application. Another difference is the use of the F1@20 metric, which means that we are not only interested in diversity, but also in finding a balance between relevance and diversity, which is more difficult to handle. In the 2013 task, we proposed a framework [4] which first tries to improve relevance and then performs a clustering to improve diversity. This strategy obtained good results and can handle general cases. So this year, we use the same strategy, but we adapt the parameters to the use of the F1@20 metric, i.e., not only to improve diversity, but to find a balance between relevance and diversity.

2. FRAMEWORK
For each query, we apply the following framework. Step 1 (optional): Re-rank the Flickr baseline to improve relevance according to text features. Step 2: Cluster the N first results using Agglomerative Hierarchical Clustering (AHC). Step 3: Sort the images in each cluster using their rank from Step 1, then sort the clusters according to the rank of the image at the top of each cluster. Step 4: Re-rank the results by alternating images from different clusters.
The AHC [2] is a robust method that can handle different kinds of features. Applying the AHC to query results provides a hierarchy of image clusters. In order to obtain groups of similar images, we cut the hierarchy to obtain a fixed number k of unordered clusters (see [5] for details).
The AHC needs a measure to compare two documents. A document can be described by several features (text, visual, etc.). To take advantage of several features, we need a way to merge them. We choose to merge similarities. Some of the features are associated with a distance, others with a similarity. In order to have only similarities, all distances are transformed using the classical formula: let $\delta(x, y)$ be a distance between $x$ and $y$, then the similarity is defined as $sim(x, y) = 1/(1 + \delta(x, y))$.
Let $f_1$ and $f_2$ be two features and $\tau \in [0, 1]$; we compute a linear fusion of feature similarities by:
$$sim_{Linear(f_1, f_2, \tau)}(x, y) = \tau \cdot sim_{f_1}(x, y) + (1 - \tau) \cdot sim_{f_2}(x, y).$$
Let $n$ be the number of features. Let us choose wisely a weight $w_i$ for each feature $f_i$, such that $\sum_{i=1}^{n} w_i = 1$. We compute a weighted-max fusion of similarities by:
$$sim_{WMax(f_1, w_1, f_2, w_2, \ldots, f_n, w_n)}(x, y) = \max_{i \in \{1, \ldots, n\}} w_i \cdot sim_{f_i}(x, y).$$
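The following Python sketch illustrates the distance-to-similarity transformation and the two fusion operators exactly as defined above; the function and variable names are ours, not taken from the paper's implementation:

```python
import numpy as np

def dist_to_sim(delta):
    """Turn a distance value (scalar or NumPy array) into a similarity in (0, 1]."""
    return 1.0 / (1.0 + delta)

def sim_linear(sim_f1, sim_f2, tau):
    """Linear fusion of two feature similarities, with tau in [0, 1]."""
    return tau * sim_f1 + (1.0 - tau) * sim_f2

def sim_wmax(sims, weights):
    """Weighted-max fusion: max over features of w_i * sim_i.
    `sims` and `weights` are sequences of equal length; the weights sum to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return max(w * s for w, s in zip(weights, sims))

# Illustrative call with the weights reported for run 5 (tdtu, ScalCol, cred);
# the similarity values 0.6, 0.4, 0.8 are made up for the example:
fused = sim_wmax([0.6, 0.4, 0.8], [0.014, 0.97, 0.016])
```

Note how, with such unbalanced weights, the weighted-max operator lets one dominant feature (here ScalCol) drive the fused similarity unless another feature's similarity is very high, which is consistent with the need, discussed later, to optimize these weights carefully.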
3. EXPERIMENTS AND RESULTS

Text re-ranking (Step 1).
Using a vector space model (VSM) with tf-idf weights and cosine similarity, we tested the choice of textual information fields (Title (t), Description (d), Tags (t), Username (u)). We also tested several stemmers. We noticed no significant difference with or without stemmers; the reason may be that there are only a few words in the query title. So we chose not to use a stemmer in any of the experiments. Finally, it seems that, globally, ttu gives a slightly better P@20.
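As a rough illustration of this re-ranking step, here is a minimal sketch using scikit-learn's tf-idf vectorizer; the paper does not specify its implementation, so the preprocessing and field concatenation shown here are assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rerank_by_text(query_title, docs_text):
    """Re-rank documents by cosine similarity between their tf-idf vectors
    and the query title (VSM, no stemming).
    `docs_text` holds, for each document, its concatenated ttu fields."""
    vectorizer = TfidfVectorizer()
    doc_matrix = vectorizer.fit_transform(docs_text)
    query_vec = vectorizer.transform([query_title])
    scores = cosine_similarity(query_vec, doc_matrix).ravel()
    # Indices of documents sorted from most to least similar to the query.
    return scores.argsort()[::-1]
```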
Features for clustering in Step 2.
We tested several combinations of textual information fields. Finally, for text clustering, the best solution on devset is to use all the fields (tdtu) and a similarity based on the Euclidean distance. It seems that using the Description field in addition tends to produce more diversity than using ttu, because documents are more dissimilar from each other.
We tested the provided visual features cnn_gen and cnn_ad. In most of our experiments, it seems that cnn_ad gives slightly better or better results than cnn_gen. We also tested several features from the Lire library [1, 6]: the ScalableColor feature (ScalCol), a histogram in HSV color space encoded by a Haar transform, gives the best results.
Using the provided credibility descriptors, we built, for each image, normalized real vectors of 13 dimensions (noted cred); NaN, null and missing values (about 3.5% of the credibility descriptor values) are replaced by random values.

Table 1: Run results. Between brackets, gain in percentage compared to the devset baseline or to the testset worst run. The number of documents for clustering per query is 300. k is the selected number of clusters.

Run      | Step 1   | Steps 2-4: AHCCompl features             | k  | devset P@20  | devset CR@20 | devset F1@20 | testset P@20 | testset CR@20 | testset F1@20
baseline | -        | -                                        | -  | 0.698 (ref.) | 0.371 (ref.) | 0.467 (ref.) | -            | -             | -
VSM      | VSM(ttu) | -                                        | -  | 0.772 (+11%) | 0.397 (+7%)  | 0.507 (+9%)  | -            | -             | -
rand     | VSM(ttu) | random features                          | 50 | 0.771 (+10%) | 0.410 (+11%) | 0.522 (+12%) | -            | -             | -
user     | VSM(ttu) | username                                 | 50 | 0.761 (+9%)  | 0.485 (+31%) | 0.578 (+24%) | -            | -             | -
run 1    | -        | ScalCol                                  | 20 | 0.631 (-10%) | 0.432 (+16%) | 0.498 (+7%)  | 0.520 (ref.) | 0.400 (ref.)  | 0.430 (ref.)
run 2    | VSM(ttu) | tdtu                                     | 50 | 0.768 (+10%) | 0.471 (+27%) | 0.569 (+22%) | 0.697 (+34%) | 0.486 (+22%)  | 0.552 (+28%)
run 3    | VSM(ttu) | Linear(tdtu,ScalCol,0.02)                | 50 | 0.767 (+10%) | 0.487 (+31%) | 0.582 (+25%) | 0.696 (+34%) | 0.494 (+24%)  | 0.553 (+29%)
run 4    | VSM(ttu) | cred                                     | 50 | 0.767 (+10%) | 0.491 (+32%) | 0.585 (+25%) | 0.681 (+31%) | 0.487 (+22%)  | 0.543 (+26%)
run 5    | VSM(ttu) | WMax(tdtu,0.014,ScalCol,0.97,cred,0.016) | 50 | 0.771 (+10%) | 0.493 (+33%) | 0.588 (+26%) | 0.686 (+32%) | 0.487 (+22%)  | 0.544 (+26%)

Number of queries: 70 (devset), 64 (testset).

Clustering parameters in Step 2.
When varying features and parameters, we noticed that, on devset, complete linkage (AHCCompl) globally gave better results than single or average linkage.
For each query, 300 results were provided. Usually, there are more relevant documents among the first results than at the end of the result list. Is it worth it for the system to take the time to cluster 300 results online in order to improve the F1@20 of the first 20 documents? We made several experiments varying the diversity methods, parameters, features and number of input documents. Globally, we did not see large differences in F1@20 between 150, 200, 250 or 300 documents. The only real difference depends on the number of clusters. Usually, the more documents there are in the input set, the higher the number of clusters should be to obtain good results: around 20 clusters for 150 documents, and around 50 clusters for 300 documents. Finally, we chose to take 300 documents because the peak of the curve is wider than with 150 documents.

[Figure 1: Some of the results on devset varying the number of clusters (300 documents per query). The figure plots F1@20 on devset against the number of clusters (0 to 150) for the baseline, VSM(ttu), AHCCompl(ScalCol), and VSM(ttu)+AHCCompl with the random, username, cnn_ad, ScalCol, tdtu, Linear(tdtu,ScalCol,0.02), cred and WMax(tdtu,0.014,ScalCol,0.97,cred,0.016) features.]

Reference runs and run results.
The baseline run is the Flickr ranking. The VSM(ttu) run is obtained using the VSM on the ttu fields, without clustering. To have some points of comparison, we also tested a clustering on random features (documents are represented by vectors of 5 random values) and a clustering using only the username (two documents with the same username are similar).
As the queries are only composed of text, we cannot apply Step 1 to improve relevance in the case of run 1 (visual only). Figure 1 shows that AHCCompl(ScalCol) (clustering on ScalCol features without Step 1) gives lower results than VSM(ttu)+AHCCompl(ScalCol) (with Step 1), but in both cases, visual features give lower results than the tdtu or cred features.
The best number of clusters to use is always an open question. If we want the best CR@20, most of the time it is better to take 20 clusters; unfortunately, with 20 clusters the P@20 is often the worst. So, in order to optimise F1@20 and according to the curves on devset (see Figure 1), we chose to take 50 clusters for run 2 to run 5. This choice seems to give a good compromise between relevance and diversity.
On devset, the best results using only one feature are obtained using cred. The reason may be that, in the case of cred, images with the same userid have the same vectors, so these images will be in the same cluster, and such images are often about the same subtopic. In Figure 1, we can notice that the clustering on username gives better results than on text only (tdtu) or visual only (ScalCol), but lower results than on cred. So there must also be another reason. If some images have similar credibility descriptors, that means that their users have the same characteristics. But it is not clear why these characteristics are interesting for diversity. To try to show that cred is a good feature for diversity whatever the diversity method, we tried this feature with a greedy algorithm and we obtained the same conclusions (on devset). Unfortunately, on testset, the text-only run (run 2) gives better results than the cred one (run 4) (see Table 1). So this result cannot be generalized and may depend on devset.
As the visual similarities are not normalized, we needed to carefully optimize the weights of the linear and of the weighted-max fusion operators. On devset, the weighted-max fusion using tdtu, ScalCol and cred gave the best results of all our experiments. But as cred is not so good on testset, run 5 does not give very good results. Finally, the linear fusion between text (tdtu) and visual (ScalCol) gives the best results on testset (run 3).
Despite the fact that we use different kinds of features, the F1@20 values for run 2 to run 5 are very close (from 0.543 to 0.553), which means it is difficult to draw reliable conclusions on the best feature or on the interest of similarity fusion.
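To make Steps 2-4 concrete, here is a minimal sketch of the clustering and diversification pipeline using SciPy's hierarchical clustering with complete linkage, cut at k flat clusters. The helper names are ours, and for brevity the sketch clusters raw feature vectors with the default Euclidean distance, whereas the paper compares documents through fused similarities:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def diversify(features, k):
    """Steps 2-4: cluster the ranked documents, then interleave clusters.
    `features` is an (n_docs, dim) array whose rows are ordered by Step 1 rank."""
    # Step 2: complete-linkage AHC; cut the hierarchy into k flat clusters.
    Z = linkage(features, method='complete')
    labels = fcluster(Z, t=k, criterion='maxclust')
    # Step 3: within each cluster, documents keep their Step 1 order
    # (rows are already rank-ordered); clusters are then sorted by the
    # rank of their best-ranked document.
    clusters = {}
    for rank, label in enumerate(labels):
        clusters.setdefault(label, []).append(rank)
    ordered = sorted(clusters.values(), key=lambda docs: docs[0])
    # Step 4: round-robin over the clusters to alternate images.
    result = []
    for i in range(max(len(docs) for docs in ordered)):
        for docs in ordered:
            if i < len(docs):
                result.append(docs[i])
    return result  # document indices in diversified order
```

In the paper's setting, one could instead pass `linkage` a condensed distance matrix derived from the fused similarities, e.g. by inverting the transformation of Section 2 with delta = 1/sim - 1.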
4. REFERENCES
[1] http://www.semanticmetadata.net/lire
[2] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley and Sons, Inc., p. 552, 2000.
[3] B. Ionescu, A. L. Gînscă, M. Zaharieva, B. Boteanu, M. Lupu, and H. Müller. Retrieving diverse social images at MediaEval 2016: Challenge, dataset and evaluation. In MediaEval 2016 Workshop, Hilversum, Netherlands, October 20-21, 2016.
[4] C. Kuoman, S. Tollari, and M. Detyniecki. UPMC at MediaEval 2013: Relevance by text and diversity by visual clustering. In MediaEval 2013 Workshop, 2013.
[5] C. Kuoman, S. Tollari, and M. Detyniecki. Using tree of concepts and hierarchical reordering for diversity in image retrieval. In CBMI, pages 251-256, 2013.
[6] M. Lux and S. A. Chatzichristofis. Lire: Lucene image retrieval: An extensible Java CBIR library. In ACM International Conference on Multimedia, pages 1085-1088, 2008.