=Paper=
{{Paper
|id=None
|storemode=property
|title=UPMC at MediaEval 2013: Relevance by Text and Diversity by Visual Clustering
|pdfUrl=https://ceur-ws.org/Vol-1043/mediaeval2013_submission_19.pdf
|volume=Vol-1043
|dblpUrl=https://dblp.org/rec/conf/mediaeval/KuomanTD13
}}
==UPMC at MediaEval 2013: Relevance by Text and Diversity by Visual Clustering==
Christian Kuoman, XILOPIX, 88000 Epinal, France, christian@xilopix.net
Sabrina Tollari, UPMC Univ Paris 06 / LIP6, 75005 Paris, France, Sabrina.Tollari@lip6.fr
Marcin Detyniecki, UPMC Univ Paris 06 / LIP6, 75005 Paris, France, Marcin.Detyniecki@lip6.fr
ABSTRACT
In the diversity task, our strategy was to, first, try to improve relevance, and then to cluster similar images to improve diversity. We propose a four-step framework, based on AHC clustering and different reranking strategies. A large number of tests on devset showed that most of the best strategies include text-based reranking for relevance, and visual clustering for diversity - even compared to location-based descriptors. Results on the expert and crowd-sourcing testset ground truths seem to confirm these observations.

1. INTRODUCTION
In the Retrieving Diverse Social Images Task of the MediaEval 2013 challenge [1], participants were provided with a ranked list (which we call the "baseline") of at most 150 photos of a location (the query) from Flickr.com. For each query, our strategy to induce diversity while keeping relevance is based on four steps. Step 1: Rerank the baseline to improve relevance. Step 2: Cluster the results using Agglomerative Hierarchical Clustering (AHC). Step 3: Sort the clusters based on a cluster priority criterion, and then sort the images in each cluster. Step 4: Finally, rerank the results, alternating images from different clusters.
It is important to notice that the AHC does not take the image rank into account, but when we sort the clusters and the images in each cluster (Step 3), the rank obtained in Step 1 is crucial information that we exploit. In fact, it is the only way to guarantee global relevance with respect to the query.
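The alternation in Step 4 amounts to a round-robin over the prioritized clusters. As an illustration, here is a minimal Python sketch of the four steps; the helpers rerank_fn, cluster_fn and order_fn are hypothetical stand-ins for the components detailed in Sections 2 and 3, not the authors' code:

 # Minimal sketch of the four-step framework (helper names are assumptions).
 def diversify(baseline, rerank_fn, cluster_fn, order_fn, top=50):
     """Return a result list alternating images from different clusters."""
     ranked = rerank_fn(baseline)              # Step 1: rerank for relevance
     clusters = cluster_fn(ranked)             # Step 2: AHC clustering
     ordered = [sorted(c, key=ranked.index)    # Step 3: prioritize clusters,
                for c in order_fn(clusters)]   # sort each one by Step-1 rank
     result = []
     queue = [c for c in ordered if c]
     while queue and len(result) < top:        # Step 4: round-robin; take the
         cluster = queue.pop(0)                # best remaining image of each
         result.append(cluster.pop(0))         # cluster in turn
         if cluster:                           # re-queue the cluster behind
             queue.append(cluster)             # the others if images remain
     return result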
2. SIMILARITIES AND DISTANCES
To rerank the baseline list according to the similarity to the query (Step 1) and to cluster images (Step 2), we need to compare the images. We tested on the devset several similarities and distances, for different types of descriptors: visual, textual, GPS and a geographic tree thesaurus. For all visual descriptors provided by the organizers [1] (CN3x3, LBP3x3, CSD, HOG...), we use the Euclidean distance. For textual descriptors, we use Dirichlet Prior Smoothing [3] for the probabilistic model; the cosine similarity for TF-IDF weighting; and the formula mentioned in [4] for Social TF-IDF weighting.
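For illustration, the cosine between two TF-IDF vectors can be computed as follows (a generic sketch over sparse term-to-weight dictionaries, not the authors' implementation):

 import math

 def cosine(u, v):
     """Cosine similarity between two sparse TF-IDF vectors (term -> weight)."""
     dot = sum(w * v.get(t, 0.0) for t, w in u.items())
     norm_u = math.sqrt(sum(w * w for w in u.values()))
     norm_v = math.sqrt(sum(w * w for w in v.values()))
     return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0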
To estimate the distance between two GPS coordinates, we compute the classical great-circle distance using the Haversine formula. For the keywordsGPS subset, all the retrieved images have GPS coordinates; but for the keywords subset, approximately 60% of the images do not have any coordinates. For these images, we choose to attribute to them the GPS coordinates of the nearest image, among the images of the same query, according to the visual distance. Moreover, if the smallest visual distance is greater than a threshold, the system associates the (0,0) GPS coordinates to the image in order to avoid some noisy results.
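The great-circle distance and the GPS fallback can be sketched as follows; the visual_dist helper, the .gps attribute and the threshold value are assumptions, since the paper does not give them:

 from math import asin, cos, radians, sin, sqrt

 EARTH_RADIUS_KM = 6371.0

 def haversine_km(lat1, lon1, lat2, lon2):
     """Great-circle distance between two GPS points (Haversine formula)."""
     phi1, phi2 = radians(lat1), radians(lat2)
     dphi, dlam = radians(lat2 - lat1), radians(lon2 - lon1)
     a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
     return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

 def attribute_gps(image, same_query_images, visual_dist, threshold):
     """GPS fallback for an image without coordinates: copy the coordinates
     of the visually nearest image of the same query (among those that have
     GPS), or (0, 0) when even the nearest image exceeds the threshold."""
     nearest = min(same_query_images, key=lambda o: visual_dist(image, o))
     if visual_dist(image, nearest) > threshold:
         return (0.0, 0.0)
     return nearest.gps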
To better exploit geographical granularity between images, we use the "thesaurus" developed by the commercial search engine Xilopix (see [2] for details). The "travel" domain of this thesaurus is organized into a tree of concepts: continents, countries, regions, departments and locations. For each concept, the thesaurus provides its name and its GPS coordinates. For images with GPS, the system calculates the great-circle distance between the GPS coordinates of the image and the GPS coordinates of each concept in the thesaurus, and finally selects the closest concept and its parent nodes (method called tree). For images without GPS, the system matches the terms of the image and the terms of each thesaurus node using TF-IDF weighting (method called tree-tfidf) or probabilistic models (method called tree-proba), and finally selects the closest concept and its parent nodes. To estimate the similarity between two concepts in the thesaurus, we use Wu-Palmer's similarity [5], which quantifies the similarity between two concepts of the same tree.
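Wu-Palmer's similarity relates two concepts through the depth of their least common subsumer in the tree. A sketch, where depth and lcs are assumed helpers over the thesaurus tree (not provided by the paper):

 def wu_palmer(c1, c2, depth, lcs):
     """Wu-Palmer similarity [5] between two concepts of the same tree:
     2 * depth(LCS) / (depth(c1) + depth(c2)), where LCS is the least
     common subsumer (deepest shared ancestor) of c1 and c2."""
     return 2.0 * depth(lcs(c1, c2)) / (depth(c1) + depth(c2))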
3. CLUSTERING BY AHC
Agglomerative Hierarchical Clustering (AHC) is a clustering method that provides a hierarchy of clusters of images. Applying the AHC to the query results provides a dendrogram. In order to obtain groups of similar images, we choose to cut the dendrogram to obtain a fixed number of unordered clusters (method called FixedN, where N is the number of clusters to obtain); see [2] for more details.
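With SciPy, the FixedN cut can be sketched as follows; the average linkage criterion is an assumption, as the paper does not specify which linkage was used:

 from scipy.cluster.hierarchy import fcluster, linkage

 def fixed_n_clusters(features, n):
     """Cluster images by AHC and cut the dendrogram into (at most) n
     unordered clusters (the FixedN method). `features` is an array of
     shape (n_images, descriptor_dim)."""
     dendrogram = linkage(features, method="average", metric="euclidean")
     labels = fcluster(dendrogram, t=n, criterion="maxclust")
     clusters = {}
     for image_idx, label in enumerate(labels):
         clusters.setdefault(label, []).append(image_idx)
     return list(clusters.values())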
The way most diversity methods work implies what we call "rank priority" (rank): we first choose the cluster containing the image of rank 1 in Step 1, then we choose a different cluster containing the next lowest possible rank. Other ways to prioritize the clusters may be interesting; we propose to consider the number of images contained in each cluster, and sort the clusters in decreasing order, from the cluster with the largest number of images to the cluster with the fewest images (dec priority). After sorting the clusters, we sort the images in each cluster according to their rank in Step 1.
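Both priorities reduce to a sort key over the clusters, followed by an in-cluster sort by Step-1 rank. A sketch, assuming rank_of maps an image to its Step-1 rank:

 def order_clusters(clusters, rank_of, priority="rank"):
     """Order clusters by 'rank' priority (the cluster holding the best
     Step-1 rank first) or 'dec' priority (largest cluster first), then
     sort the images inside each cluster by their Step-1 rank."""
     if priority == "rank":
         ordered = sorted(clusters, key=lambda c: min(rank_of(i) for i in c))
     else:  # 'dec' priority
         ordered = sorted(clusters, key=len, reverse=True)
     return [sorted(c, key=rank_of) for c in ordered]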
4. EXPERIMENTS AND RESULTS
On devset, we tested our model for all descriptors and for most of the parameters. For textual models, on the keywords subset, we choose to use the title, tags and description (ttd) fields, while on keywordsGPS, we choose to use only the title and tags (tt) fields. Among our large number of tests, Table 3 shows an example comparison of AHC results on devset keywordsGPS for tree, GPS and visual (CSD) descriptors, using the same parameters and the same Step 1 reranking approach (i.e. tfidf(tt)). The best diversity results are obtained with visual descriptors, compared to tree and GPS. According to the results on devset, we choose the methods and the parameters for each subset. Table 1 summarizes the parameters and the scores obtained on devset and testset according to the expert ground truth, while Table 2 compares the results on the crowd-sourcing ground truths.
Table 1: Submitted runs: parameters (top), results on devset (middle) and results on testset (bottom). Between brackets, gain in percentage compared with run1 visual.

SUBMITTED RUNS PARAMETERS
                   keywordsGPS                               keywords
                   Rerank     AHC        Priority       Rerank      AHC                Priority
run1 visual        baseline   CN3x3      dec, fixed35   LBP3x3      CSD                dec, fixed20
run2 text          tfidf(tt)  proba(tt)  rank, fixed35  tfidf(ttd)  social-tfidf(ttd)  rank, fixed25
run3 textvisual    tfidf(tt)  CSD        dec, fixed20   tfidf(ttd)  CSD                rank, fixed25
run5 allallowed    tree       CSD        dec, fixed30   tfidf(ttd)  tree-proba(ttd)    rank, fixed15

RESULTS ON DEVSET (25 queries per subset)
                   keywordsGPS                               keywords
                   P@10        CR@10       F1@10          P@10        CR@10       F1@10
baseline           0.860(-)    0.412(-)    0.544(-)       0.688(-)    0.464(-)    0.529(-)
run1 visual        0.868(ref)  0.498(ref)  0.623(ref)     0.696(ref)  0.543(ref)  0.575(ref)
run2 text          0.928(+7)   0.493(-1)   0.636(+2)      0.788(+13)  0.586(+8)   0.629(+9)
run3 textvisual    0.844(-3)   0.509(+2)   0.627(+1)      0.812(+17)  0.584(+8)   0.635(+10)
run5 allallowed    0.808(-7)   0.483(-3)   0.592(-5)      0.760(+9)   0.560(+3)   0.594(+3)

RESULTS ON TESTSET (210 keywordsGPS queries, 132 keywords queries)
                   keywordsGPS                               keywords
                   P@10        CR@10       F1@10          P@10        CR@10       F1@10
run1 visual        0.774(ref)  0.370(ref)  0.489(ref)     0.630(ref)  0.400(ref)  0.468(ref)
run2 text          0.844(+9)   0.404(+9)   0.531(+9)      0.746(+18)  0.412(+3)   0.507(+8)
run3 textvisual    0.823(+6)   0.426(+15)  0.547(+12)     0.718(+14)  0.417(+4)   0.503(+8)
run5 allallowed    0.766(-1)   0.378(+2)   0.496(+1)      0.705(+12)  0.388(-3)   0.475(+2)
Table 2: Scores obtained for 3 crowd-sourcing ground truths (GT1, GT2, GT3) and for the expert ground truth (GT0). All the results are averaged over a subset of 49 queries of testset. nb is the number of queries among the 49 which have CR@10=1. Between brackets, gain in % compared with run1 visual.

                   GT1,2,3     GT1              GT2              GT3              GT0
                   P@10        CR@10       nb   CR@10       nb   CR@10       nb   P@10        CR@10       nb
run1 visual        0.694(ref)  0.786(ref)  27   0.754(ref)  10   0.645(ref)  14   0.806(ref)  0.367(ref)  0
run2 text          0.757(+9)   0.836(+6)   31   0.756(+0)   16   0.645(-0)   14   0.851(+6)   0.408(+11)  1
run3 textvisual    0.749(+8)   0.886(+13)  33   0.792(+5)   21   0.687(+6)   16   0.841(+4)   0.415(+13)  0
run5 allallowed    0.708(+2)   0.828(+5)   29   0.768(+2)   20   0.643(-0)   15   0.794(-2)   0.377(+3)   0
Table 3: Comparison on devset keywordsGPS.

Rerank      AHC    Priority       P@10        CR@10
baseline    -      -              0.860(ref)  0.412(ref)
tfidf(tt)   -      -              0.896(+4)   0.429(+4)
tfidf(tt)   tree   dec, fixed30   0.840(-2)   0.438(+6)
tfidf(tt)   GPS    dec, fixed30   0.864(+0)   0.443(+8)
tfidf(tt)   CSD    dec, fixed30   0.864(+0)   0.485(+18)

5. CONCLUSION AND DISCUSSION
Results on the expert and crowd-sourcing ground truths suggest that an interesting and robust strategy to improve diversity - in the sense of this challenge - is to increase the relevance using the text, and then to exploit visual clustering to diversify the results. Preliminary tests on devset showed that the exploitation of these descriptors outperforms, in terms of diversity, the use of location descriptors (GPS or tree). This is an unexpected result, taking into account that the queries were formulated around the notion of location.
6. REFERENCES
[1] B. Ionescu, M. Menéndez, H. Müller, and A. Popescu. Retrieving diverse social images at MediaEval 2013: Objectives, dataset and evaluation. In MediaEval 2013 Workshop, Barcelona, Spain, October 18-19, 2013.
[2] C. Kuoman, S. Tollari, and M. Detyniecki. Using tree of concepts and hierarchical reordering for diversity in image retrieval. In CBMI, pages 251–256, 2013.
[3] T. M. Mitchell. Machine Learning. McGraw-Hill, 1997.
[4] A. Popescu and G. Grefenstette. Social media driven image retrieval. In ACM ICMR, 2011.
[5] Z. Wu and M. Palmer. Verbs semantics and lexical selection. In Ass. for Computational Linguistics, 1994.

Copyright is held by the author/owner(s). MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain.