=Paper=
{{Paper
|id=Vol-1263/paper22
|storemode=property
|title=CEA LIST's Participation at the MediaEval 2014 Retrieving Diverse Social Images Task
|pdfUrl=https://ceur-ws.org/Vol-1263/mediaeval2014_submission_22.pdf
|volume=Vol-1263
|dblpUrl=https://dblp.org/rec/conf/mediaeval/GinscaPR14
}}
==CEA LIST's Participation at the MediaEval 2014 Retrieving Diverse Social Images Task==
Alexandru Lucian Ginsca¹·², Adrian Popescu¹, Navid Rekabsaz³
¹ CEA, LIST, Vision & Content Engineering Laboratory, 91190 Gif-sur-Yvette, France
² TELECOM Bretagne, France
³ Faculty of Informatics, Vienna University of Technology
{alexandru.ginsca, adrian.popescu}@cea.fr, rekabsaz@ifs.tuwien.ac.at
ABSTRACT
The MediaEval 2014 Retrieving Diverse Social Images Task aims to tackle the challenge of improving result diversity while keeping a high precision in a social image retrieval task. We base our approach on the retrieval performance of recently introduced visual descriptors, coupled with a mixed diversification method that explores the use of social cues together with a classic clustering setting. As a novelty, this year's task introduced user credibility features. We also describe how to use credibility in the diversification process and how to improve individual features by means of a regression model.

1. INTRODUCTION
Social image retrieval presents an appropriate setting for the use of multimodal approaches to improve both result relevance and diversity. Recently, emerging works have proposed the use of social cues alongside visual and textual data. Our efforts are channeled towards exploiting visual information and the use of credibility in the diversification process. We first describe a couple of pre-filtering techniques, followed by an image retrieval method that boosts precision. Next, we describe how to predict a user's credibility score and propose a user-based image filtering approach. After showing how we improve diversity by clustering and cluster ranking, we finally describe the submitted runs and discuss the results we obtained on the testset.

2. AIMING FOR PRECISION

2.1 Initial pre-filtering
We use two filtering steps to eliminate noise from the image lists. Similar to [2], we eliminate geotagged images whose distance from the POI is higher than 1 km. The second filter is a restriction on the presence of faces in images. We use the standard OpenCV¹ algorithm to perform face detection and eliminate images with a face coverage ratio higher than 0.4. The distance threshold and the one for the percentage of faces are determined on the devset. We keep the same pre-filtering steps for all the runs.

2.2 Image retrieval
Following the latest advances in computer vision, we use Caffe [3], a powerful CNN-based feature, to extract representations for the images in the collection, as well as for the Wikipedia image examples. Following a standard content-based image retrieval approach, we rank the images for each topic by the average cosine similarity between the retrieved image and all of the example images. On the devset, we obtain a P@20 of 0.966 when doing retrieval with the Caffe features. This represents a significant improvement over the Flickr ranking (P@20 = 0.831) and over LBP3x3 (P@20 = 0.816), the descriptor provided by the organizers that gives the best performance in visual retrieval. One drawback of this method is the strong trade-off between precision and cluster recall: although P@20 on the devset is high, we get a CR@20 of 0.293, leading to an F1@20 of 0.438. This problem is directly addressed by first selecting images found in different clusters, as described in Section 4.

3. LISTENING TO SOCIAL CUES

3.1 Predicting user credibility
We exploit the credibility set to train a regression model that predicts a user's credibility score from the provided features. We perform model selection and parameter tuning by 5-fold cross-validation (CV) on the credibility set, and we evaluate the performance of the predictions by Spearman's rank correlation coefficient with the ground-truth credibility values. The highest CV correlation (0.47) is obtained using gradient boosting regression trees with a Huber loss and 100 estimators. By comparison, the highest correlation of an individual feature (the visual score) is 0.36. The gain in Spearman score is also reflected in the competition metrics: when fixing the rest of the parameters and using the predicted credibility scores instead of the provided visual credibility feature, F1@20 increases from 0.61 to 0.632 on the devset.

3.2 User selection
For each topic, we first keep a subset of users that have contributions in the top n images found in the ranking produced by the image retrieval process described in Section 2.2.
Then, as an extra filter, in our final ranking we retain only images coming from the selected user set. Given the good precision of image retrieval, we have high confidence that images found at the top of the ranking are relevant. This gives us an ad-hoc insight into the topical expertise of the users responsible for those images. We tune n on the devset and fix it at 20. For comparison, when not using a user-based filter, the F1@20 score drops from 0.632 to 0.597. We also tried a similar approach, retaining contributions from top users ranked according to the credibility score, but this did not improve the results. This result hints at the need for a topic-specific credibility score.

¹ http://opencv.org/

Copyright is held by the author/owner(s).
MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain

Table 1: Run performances with the three official metrics

 Run name   F1@20    P@20     CR@20
 RUN1       0.5182   0.7313   0.4103
 RUN2       0.5346   0.8089   0.4084
 RUN3       0.5525   0.798    0.4335
 RUN4       0.5243   0.7378   0.4157
 RUN5       0.571    0.7931   0.4563

4. IMPROVING DIVERSITY
Building on previous works, we combine a more traditional clustering approach for diversification with the use of social cues [5].

4.1 Clustering
We first perform k-means clustering on the complete set of images. To ensure a stable cluster distribution, we initialize the centroids by uniformly selecting images from the ranking produced after image retrieval. For example, the i-th cluster will have as initial centroid the image found at position (i − 1) ∗ n/k, where k is the desired number of clusters and n is the number of images in the ranking. After validation on the devset, k is set to 30.

4.2 Cluster ranking
We leverage the social component of this task by ordering the clusters based on the average credibility score of the users that contribute images to the cluster. For the runs that do not permit the use of credibility, we rank the clusters according to the number of unique users represented in each cluster. In the case of a tie, we prefer the cluster that has the best-ranked image after visual retrieval. Our final ranked list is obtained by selecting from each cluster, one at a time, the image that is best placed in the visual retrieval ranking.

5. RESULTS AND DISCUSSION
We submitted five different runs at this year's Retrieving Diverse Social Images Task [1]. Our submissions are briefly described below:

• RUN1 uses the provided LBP3x3 visual descriptor for image retrieval and clustering. The clusters are then ranked based on the number of users represented in each cluster.

• RUN2 is a purely textual one. We concatenated the title, tags and description of the photos to calculate the text similarity. In the text pre-processing phase, we decompounded the terms by applying a greedy approach using a dictionary built from all the words in the text. In the next step, in order to disambiguate the places, we expand the queries using the first sentence of Wikipedia. After testing several language models, a semantic similarity approach based on Word2Vec [4] gave the best result. We trained a model on Wikipedia and then used the vector representations of words to calculate the text similarity of the query to each photo. In addition to the text similarity, we extracted three binary attributes: (1) whether the photo had any views, (2) whether the distance between a photo and the POI is greater than 8 kilometers, and (3) whether the description length is more than 2000 characters. All features were then used in a linear regression model in order to re-rank the list. Finally, following [5], in order to diversify the ranking, we iterate over the initial re-ranked list and keep one image from each user at each iteration.

• RUN3 is a fusion between RUN1 and RUN2. Since the scores for the visual and textual rankings are not in the same range, fusion is performed based on the ranks of the images in the two initial rankings. More specifically, we perform a linear weighting in which the individual ranks are given a weight of 0.5. Other weightings have been tested, but the results remain quite stable in the range 0.3 - 0.7, a result which accounts for the robustness of the proposed fusion.

• RUN4 is similar to RUN1, with the single difference lying in the use of credibility for cluster ranking.

• RUN5 is obtained using the Caffe visual descriptor for image retrieval and clustering, and predicted credibility scores for cluster ranking.

Our textual run (RUN2) is the single one in which we do not use clustering to improve diversity. This is reflected across metrics, as can be seen in Table 1. Although it performs well in terms of F1@20, this run sits at opposite poles when looking at the other metrics: it has the highest P@20 and the lowest CR@20.

The usefulness of credibility can be best observed when comparing RUN1 and RUN4. They share the same configuration, with the sole exception being the use of the predicted credibility scores for cluster ranking in RUN4. Although the difference is not as significant as on the devset, we can see a slight improvement of F1@20.

6. ACKNOWLEDGMENT
This research was supported by the MUCKE project, partly funded within the FP7 CHIST-ERA scheme.

7. REFERENCES
[1] B. Ionescu et al. Retrieving diverse social images at MediaEval 2014: Challenge, dataset and evaluation. In MediaEval 2014 Workshop, Barcelona, Spain, October 16-17, 2014.
[2] N. Jain et al. Experiments in diversifying Flickr result sets. In MediaEval 2013 Workshop, Barcelona, Spain, October 18-19, 2013.
[3] Y. Jia. Caffe: An open source convolutional architecture for fast feature embedding. http://caffe.berkeleyvision.org, 2013.
[4] T. Mikolov et al. Efficient estimation of word representations in vector space. CoRR, 2013.
[5] A. Popescu. CEA LIST's participation at the MediaEval 2013 Retrieving Diverse Social Images Task. In MediaEval 2013 Workshop, Barcelona, Spain, October 18-19, 2013.