=Paper=
{{Paper
|id=Vol-1263/paper22
|storemode=property
|title=CEA LIST's Participation at the MediaEval 2014 Retrieving Diverse Social Images Task
|pdfUrl=https://ceur-ws.org/Vol-1263/mediaeval2014_submission_22.pdf
|volume=Vol-1263
|dblpUrl=https://dblp.org/rec/conf/mediaeval/GinscaPR14
}}
==CEA LIST's Participation at the MediaEval 2014 Retrieving Diverse Social Images Task==
Alexandru Lucian Ginsca¹·², Adrian Popescu¹, Navid Rekabsaz³
¹ CEA, LIST, Vision & Content Engineering Laboratory, 91190 Gif-sur-Yvette, France
² TELECOM Bretagne, France
³ Faculty of Informatics, Vienna University of Technology
{alexandru.ginsca, adrian.popescu}@cea.fr, rekabsaz@ifs.tuwien.ac.at
ABSTRACT
The MediaEval 2014 Retrieving Diverse Social Images Task aims to tackle the challenge of improving result diversity while keeping a high precision in a social image retrieval task. We base our approach on the retrieval performance of recently introduced visual descriptors, coupled with a mixed diversification method that explores the use of social cues together with a classic clustering setting. As a novelty, this year's task introduced user credibility features. We also describe how to use credibility in the diversification process and how to improve individual features by means of a regression model.

1. INTRODUCTION
Social image retrieval presents an appropriate setting for the use of multimodal approaches to improve both result relevance and diversity. Recently, emerging works have proposed the use of social cues alongside visual and textual data. Our efforts are channeled towards exploiting visual information and the use of credibility in the diversification process. We first describe a couple of pre-filtering techniques, followed by an image retrieval method that boosts precision. Next, we describe how to predict a user's credibility score and propose a user-based image filtering approach. After showing how we improve diversity by clustering and cluster ranking, we finally describe the submitted runs and discuss the results we obtained on the testset.

2. AIMING FOR PRECISION

2.1 Initial pre-filtering
We use two filtering steps to eliminate noise from the image lists. Similar to [2], we eliminate geotagged images whose distance from the POI is higher than 1 km. The second filter is a restriction on the presence of faces in images. We use the standard OpenCV¹ algorithm to perform face detection and eliminate images with a face coverage ratio higher than 0.4. The distance threshold and the one for the percentage of faces are determined on the devset. We keep the same pre-filtering steps for all the runs.

2.2 Image retrieval
Following the latest advances in computer vision, we use Caffe [3], a powerful CNN-based feature, to extract representations for the images in the collection, as well as for the Wikipedia image examples. Following a standard content-based image retrieval approach, we rank the images for each topic by the average cosine similarity between the retrieved image and all of the example images. On the devset, we obtain a P@20 of 0.966 when doing retrieval with the Caffe features. This represents a significant improvement over the Flickr ranking (P@20 = 0.831) and over LBP3x3 (P@20 = 0.816), the descriptor provided by the organizers that gives the best performance in visual retrieval. One drawback of this method is the strong trade-off between precision and cluster recall: although P@20 on the devset is high, we get a CR@20 of 0.293, leading to an F1@20 of 0.438. This problem is directly addressed by first selecting images found in different clusters, as described in Section 4.

3. LISTENING TO SOCIAL CUES

3.1 Predicting user credibility
We exploit the credibility set to train a regression model that predicts a user's credibility score from the provided features. We perform model selection and parameter tuning by 5-fold cross-validation (CV) on the credibility set, and we evaluate the performance of the predictions by Spearman's rank correlation coefficient with the ground-truth credibility values. The highest CV correlation (0.47) is obtained using gradient boosting regression trees with a Huber loss and 100 estimators. By comparison, the highest correlation of an individual feature (the visual score) is 0.36. The gain in Spearman score is also reflected in the competition metrics: when fixing the rest of the parameters and using the predicted credibility scores instead of the provided visual credibility feature, F1@20 increases from 0.61 to 0.632 on the devset.

3.2 User selection
For each topic, we first keep a subset of users that have contributions in the top n images found in the ranking produced by the image retrieval process described in Section 2.2.
Then, as an extra filter, in our final ranking we retain only images coming from the selected user set. Given the good precision of image retrieval, we have high confidence that images found at the top of the ranking are relevant. This gives us an ad-hoc insight into the topical expertise of the users responsible for those images. We tune n on the devset and fix it at 20. For comparison, when not using a user-based filter, the F1@20 score drops from 0.632 to 0.597. We also tried a similar approach, retaining contributions from top users ranked according to the credibility score, but this did not improve the results. This result hints at the need for a topic-specific credibility score.

¹ http://opencv.org/

Copyright is held by the author/owner(s).
MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain

Table 1: Run performances with the three official metrics

 Run name   F1@20    P@20     CR@20
 RUN1       0.5182   0.7313   0.4103
 RUN2       0.5346   0.8089   0.4084
 RUN3       0.5525   0.798    0.4335
 RUN4       0.5243   0.7378   0.4157
 RUN5       0.571    0.7931   0.4563

4. IMPROVING DIVERSITY
Building on previous works, we combine a more traditional clustering approach for diversification with the use of social cues [5].

4.1 Clustering
We first perform k-means clustering on the complete set of images. To ensure a stable cluster distribution, we initialize the centroids by uniformly selecting images from the ranking produced after image retrieval. For example, the i-th cluster will have as initial centroid the image found at position (i − 1) ∗ n/k, where k is the desired number of clusters and n is the number of images in the ranking. After validation on the devset, k is set to 30.

4.2 Cluster ranking
We leverage the social component of this task by ordering the clusters based on the average credibility score of the users that contribute images to the cluster. For the runs that do not permit the use of credibility, we rank the clusters according to the number of unique users represented in each cluster. In the case of a tie, we prefer the cluster that has the best-ranked image after visual retrieval. Our final ranked list is obtained by selecting from each cluster, one at a time, the image that is best placed in the visual retrieval ranking.

5. RESULTS AND DISCUSSION
We submitted five different runs at this year's Retrieving Diverse Social Images Task [1]. Our submissions are briefly described below:

• RUN1 uses the provided LBP3x3 visual descriptor for image retrieval and clustering. The clusters are then ranked based on the number of users represented in each cluster.

• RUN2 is a purely textual one. We concatenated the title, tags and description of the photos to calculate the text similarity. In the text pre-processing phase, we decompounded the terms by applying a greedy approach using a dictionary built from all the words in the text. In the next step, in order to disambiguate the places, we expand the queries using the first sentence of Wikipedia. After testing several language models, a semantic similarity approach based on Word2Vec [4] gave the best result. We trained a model on Wikipedia and then used the vector representations of words to calculate the text similarity of the query to each photo. In addition to the text similarity, we extracted three binary attributes: (1) whether the photo had any views, (2) whether the distance between a photo and the POI is greater than 8 kilometers, and (3) whether the description length is more than 2000 characters. All features were then used in a linear regression model in order to re-rank the list. Finally, following [5], in order to diversify the ranking, we iterate over the initial re-ranked list and keep one image from each user at each iteration.

• RUN3 is a fusion between RUN1 and RUN2. Since the scores for the visual and textual rankings are not in the same range, fusion is performed based on the ranks of the images in the two initial rankings. More specifically, we perform a linear weighting in which the individual ranks are given a weight of 0.5. Other weightings have been tested, but the results remain quite stable in the range 0.3 - 0.7, a result which accounts for the robustness of the proposed fusion.

• RUN4 is similar to RUN1, with the single difference lying in the use of credibility for cluster ranking.

• RUN5 is obtained using the Caffe visual descriptor for image retrieval and clustering, and predicted credibility scores for cluster ranking.

Our textual run (RUN2) is the single one in which we do not use clustering to improve diversity. This is reflected across metrics, as can be seen in Table 1. Although it performs well in terms of F1@20, this run sits at opposite poles when looking at the other metrics: it has the highest P@20 and the lowest CR@20.

The usefulness of credibility can be best observed when comparing RUN1 and RUN4. They share the same configuration, with the sole exception being the use of the predicted credibility scores for cluster ranking in RUN4. Although the difference is not as significant as on the devset, we can see a slight improvement of F1@20.

6. ACKNOWLEDGMENT
This research was supported by the MUCKE project, partly funded within the FP7 CHIST-ERA scheme.

7. REFERENCES
[1] B. Ionescu et al. Retrieving diverse social images at MediaEval 2014: Challenge, dataset and evaluation. In MediaEval 2014 Workshop, Barcelona, Spain, October 16-17, 2014.
[2] N. Jain et al. Experiments in diversifying Flickr result sets. In MediaEval 2013 Workshop, Barcelona, Spain, October 18-19, 2013.
[3] Y. Jia. Caffe: An open source convolutional architecture for fast feature embedding. http://caffe.berkeleyvision.org, 2013.
[4] T. Mikolov et al. Efficient estimation of word representations in vector space. CoRR, 2013.
[5] A. Popescu. CEA LIST's participation at the MediaEval 2013 Retrieving Diverse Social Images Task. In MediaEval 2013 Workshop, Barcelona, Spain, October 18-19, 2013.