<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>UPMC at MediaEval 2016 Retrieving Diverse Social Images Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sabrina Tollari</string-name>
          <email>Sabrina.Tollari@lip6.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Sorbonne Universités, UPMC Univ Paris 06, UMR CNRS 7606 LIP6</institution>
          ,
          <addr-line>75252 PARIS cedex 05</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <fpage>20</fpage>
      <lpage>21</lpage>
      <abstract>
        <p>In the MediaEval 2016 Retrieving Diverse Social Images Task, we proposed a general framework based on agglomerative hierarchical clustering (AHC). We tested the provided credibility descriptors as a vector input for our AHC. The results on devset showed that the vector based on the credibility descriptors is the best feature, but unfortunately this is not confirmed on testset. To merge several features, we chose to merge feature similarities. Tests on devset showed that merging similarities using linear or weighted-max operators gave, most of the time, better results than using only one feature. This result is partially confirmed on testset.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        In contrast to previous years, in 2016 the task [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] addresses
the use case of a general ad-hoc image retrieval system.
General cases are more difficult to tackle, because the system
cannot be adapted to a particular application. Another
difference is the use of the F1@20 metric, which means that we
are not only interested in diversity, but also in finding a balance
between relevance and diversity, which is more difficult to
handle. For the 2013 task, we proposed a framework [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] which
first tries to improve relevance and then performs a
clustering to improve diversity. This strategy obtained good
results and can handle general cases. So this year, we use
the same strategy, but we adapt the parameters to the use
of the F1@20 metric, i.e., not only to improve diversity, but to
find a balance between relevance and diversity.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. FRAMEWORK</title>
      <p>For each query, we apply the following framework. Step 1
(optional): Re-rank the Flickr baseline to improve relevance
according to text features. Step 2: Cluster the N first results
using Agglomerative Hierarchical Clustering (AHC). Step 3:
Sort the images in each cluster using their rank from Step 1,
then sort the clusters according to the rank of the image at the
top of each cluster. Step 4: Re-rank the results by alternating
images from different clusters.</p>
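      <p>For illustration, the following minimal Python sketch shows
Steps 3 and 4; the paper does not describe its implementation, so the
clusters and rank data structures here are our own assumptions:</p>
      <preformat>
# Hypothetical sketch of Steps 3 and 4, not the authors' actual code.
# clusters: {cluster id: [image ids]}; rank: {image id: Step 1 rank}.
from itertools import chain, zip_longest

def diversify(clusters, rank):
    # Step 3: sort images inside each cluster by their Step 1 rank,
    # then sort the clusters by the rank of their top image.
    ordered = sorted(
        (sorted(imgs, key=rank.get) for imgs in clusters.values()),
        key=lambda imgs: rank[imgs[0]],
    )
    # Step 4: take one image per cluster in turn (round-robin),
    # skipping the padding added for exhausted clusters.
    interleaved = chain.from_iterable(zip_longest(*ordered))
    return [img for img in interleaved if img is not None]
      </preformat>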
      <p>
        The AHC [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is a robust method that can handle different
kinds of features. Applying the AHC to query results
provides a hierarchy of image clusters. In order to obtain
groups of similar images, we cut the hierarchy to obtain a
fixed number k of unordered clusters (see [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] for details).
      </p>
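      <p>As a concrete example, the hierarchy construction and the cut
at a fixed k can be done with SciPy (an assumption: the paper does not
name the AHC implementation it uses):</p>
      <preformat>
# Hypothetical sketch of Step 2 with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def ahc_clusters(features, k, method="complete"):
    # Build the hierarchy, then cut it to obtain k unordered clusters.
    tree = linkage(features, method=method, metric="euclidean")
    labels = fcluster(tree, t=k, criterion="maxclust")
    clusters = {}
    for img, label in enumerate(labels):
        clusters.setdefault(label, []).append(img)
    return clusters

# Example: 300 results described by 13-dimensional vectors, 50 clusters.
clusters = ahc_clusters(np.random.rand(300, 13), k=50)
      </preformat>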
      <p>The AHC needs a measure to compare two documents. A
document can be described by several features (text, visual,
etc.). To take advantage of several features, we need a way
to merge them. We choose to merge similarities. Some of
the features are associated with a distance, others with a
similarity. In order to have only similarities, all distances
are transformed using the classical formula: let d(x, y) be a
distance between x and y, then the similarity is defined as:
sim(x, y) = 1 / (1 + d(x, y)).</p>
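      <p>In code, the transform is simply:</p>
      <preformat>
# The classical distance-to-similarity transform from the text.
def dist_to_sim(d):
    # Maps a non-negative distance to a similarity in (0, 1].
    return 1.0 / (1.0 + d)
      </preformat>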
      <p>Let f1 and f2 be two features and α ∈ [0, 1]. We compute
a linear fusion of feature similarities by:
sim_Linear(f1,f2,α)(x, y) = α · sim_f1(x, y) + (1 − α) · sim_f2(x, y).</p>
      <p>Let n be the number of features. We carefully choose a
weight wi for each feature fi, such that ∑_{i=1}^{n} wi = 1. We
compute a weighted-max fusion of similarities by:
sim_WMax(f1,w1,...,fn,wn)(x, y) = max_{i ∈ {1,...,n}} wi · sim_fi(x, y).</p>
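      <p>A small Python sketch of the two fusion operators (the helper
names are ours; the example weights are those of the weighted-max run
reported in the experiments):</p>
      <preformat>
def sim_linear(sim1, sim2, alpha):
    # Linear fusion of two similarity values, with alpha in [0, 1].
    return alpha * sim1 + (1.0 - alpha) * sim2

def sim_wmax(sims, weights):
    # Weighted-max fusion; the weights are assumed to sum to 1.
    return max(w * s for w, s in zip(weights, sims))

# Example with the WMax(tdtu,0.014,ScalCol,0.97,cred,0.016) weights:
fused = sim_wmax([0.8, 0.6, 0.9], [0.014, 0.97, 0.016])
      </preformat>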
    </sec>
    <sec id="sec-3">
      <title>3. EXPERIMENTS AND RESULTS</title>
      <sec id="sec-3-1">
        <title>Text re-ranking (Step 1).</title>
        <p>Using a vector space model (VSM) with tf-idf weights and
cosine similarity, we tested the choice of textual information
fields (Title (t), Description (d), Tags (t), Username (u)).
We also tested several stemmers. We noticed no significant
difference with or without stemmers, perhaps
because there are only a few words in the query title. So we
chose not to use a stemmer in any of the experiments. Finally,
it seems that, globally, ttu gives a slightly better P@20.</p>
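        <p>A minimal sketch of this re-ranking with scikit-learn (an
assumption: the paper does not name its VSM implementation):</p>
        <preformat>
# Hypothetical sketch of the Step 1 text re-ranking.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rerank_by_text(query, docs):
    # docs: one string per image, e.g. the concatenated ttu fields.
    # tf-idf vectors without stemming, as in the paper; documents are
    # then sorted by decreasing cosine similarity to the query.
    matrix = TfidfVectorizer().fit_transform(docs + [query])
    scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    return sorted(range(len(docs)), key=lambda i: -scores[i])
        </preformat>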
      </sec>
      <sec id="sec-3-2">
        <title>Features for clustering in Step 2.</title>
        <p>We tested several combinations of textual information
fields. Finally, for text clustering, the best solution on
devset is to use all the fields (tdtu) and a similarity based on
the Euclidean distance. It seems that using the
Description field in addition tends to produce more diversity than using
ttu, because documents are more dissimilar from each other.</p>
        <p>
          We tested the provided visual features cnn_gen and
cnn_ad. In most of our experiments, cnn_ad
gives slightly better or clearly better results than cnn_gen. We also
tested several features from the Lire library [
          <xref ref-type="bibr" rid="ref1 ref6">1, 6</xref>
          ]: the
ScalableColor feature (ScalCol), a histogram in HSV color
space encoded by a Haar transform, gives the best results.
        </p>
        <p>Using the provided credibility descriptors, we built, for
each image, a normalized real vector of 13 dimensions (noted
cred). NaN, null and missing values (about 3.5% of the
credibility descriptor values) are replaced by random values.</p>
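        <p>A sketch of this construction (the normalization scheme and
the distribution of the random replacement values are our assumptions):</p>
        <preformat>
import numpy as np

def build_cred_vector(raw, rng=np.random.default_rng()):
    # raw: the 13 credibility descriptor values, possibly with gaps.
    v = np.array([np.nan if x is None else float(x) for x in raw])
    missing = np.isnan(v)
    v[missing] = rng.random(missing.sum())   # random values for the gaps
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
        </preformat>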
      </sec>
      <sec id="sec-3-3">
        <title>Clustering parameters in Step 2.</title>
        <p>When varying features and parameters, we noticed that,
on devset, globally, complete linkage (AHCCompl) gave better
results than single or average linkages.</p>
        <p>For each query, 300 results were provided. Usually, there
are more relevant documents among the first results than at the
end of the result list. Is it worthwhile for the system to spend
time clustering 300 results online in order to improve the
F1@20 of the first 20 documents? We made several
experiments varying the diversity methods, parameters, features
and number of input documents. Globally, we did not see large
differences in terms of F1@20 between 150, 200,
250 or 300 documents. The only real difference depends on
the number of clusters. Usually, the more documents there are
in the input set, the higher the number of clusters should be
to obtain good results: around 20 clusters for 150
documents, and around 50 clusters for 300 documents. Finally,
we chose to take 300 documents because the peak of the
curve is wider than with 150 documents.</p>
      </sec>
      <sec id="sec-3-4">
        <title>Reference runs and run results.</title>
        <p>The baseline run is the Flickr ranking. The VSM(ttu) run is
obtained using the VSM on ttu fields and without clustering. To
have some comparison elements, we also tested: a clustering
of random features (documents are represented by vectors of
5 random values) and a clustering using only the username
(two documents with the same username are similar).</p>
        <p>As the queries are only composed of text, we cannot apply
a Step 1 to improve relevance in the case of run 1 (visual
only). Figure 1 shows that AHCCompl(ScalCol) (clustering
on ScalCol features without Step 1) gives lower results than
VSM(ttu)+AHCCompl(ScalCol) (with Step 1), but in both cases,
visual features give lower results than tdtu or cred features.</p>
        <p>The best number of clusters to use is always an open
question. If we want the best CR@20, most of the time it is better
to take 20 clusters; unfortunately, with 20 clusters the P@20
is often the worst. So, in order to optimise F1@20 and
according to the curves on devset (see Figure 1), we chose,
for runs 2 to 5, to take 50 clusters. This choice seems to
give a good compromise between relevance and diversity.</p>
        <p>On devset, the best results using only one feature are
obtained with cred. One reason may be that, in the case of cred,
images with the same userid have the same vectors, so these
images end up in the same cluster, and such images are often
about the same subtopic. In Figure 1, we can notice that
the clustering on username gives better results than on text
only (tdtu) or visual only (ScalCol), but lower results than
on cred. So there must also be another reason. If some
images have similar credibility descriptors, their users have
the same characteristics, but it is not clear why these
characteristics are interesting for diversity. To check that cred is
a good feature for diversity whatever the diversity method, we
tried this feature with a greedy algorithm and obtained the
same conclusions (on devset). Unfortunately, on testset, the
text-only run (run 2) gives better results than the cred one
(run 4) (see Table 1). So this result cannot be generalized
and may depend on devset.</p>
        <p>[Figure 1: F1@20 on devset as a function of the number of
clusters for: baseline; VSM(ttu); VSM(ttu)+AHCCompl(random);
VSM(ttu)+AHCCompl(username); VSM(ttu)+AHCCompl(cnn-ad);
VSM(ttu)+AHCCompl(ScalCol); AHCCompl(ScalCol);
VSM(ttu)+AHCCompl(tdtu); VSM(ttu)+AHCCompl(Linear(tdtu,ScalCol,0.02));
VSM(ttu)+AHCCompl(cred);
VSM(ttu)+AHCCompl(WMax(tdtu,0.014,ScalCol,0.97,cred,0.016)).]</p>
        <p>As the visual similarities are not normalized, we needed
to carefully optimize the weights of the linear and of the
weighted-max fusion operators. On devset, the
weighted-max fusion using tdtu, ScalCol and cred gave the best
results in all our experiments. But as cred is not as good on
testset, run 5 does not give very good results. Finally, the
linear fusion between text (tdtu) and visual (ScalCol) gives
the best results on testset (run 3).</p>
        <p>Despite the fact that we use different kinds of features, the
F1@20 scores for runs 2 to 5 are very close (from 0.543 to 0.553),
which means it is difficult to draw reliable conclusions on the
best feature or on the value of similarity fusion.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] http://www.semanticmetadata.net/lire.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R. O.</given-names>
            <surname>Duda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. E.</given-names>
            <surname>Hart</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. G.</given-names>
            <surname>Stork</surname>
          </string-name>
          .
          <article-title>Pattern Classification</article-title>
          . John Wiley and Sons, Inc., p.
          <fpage>552</fpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Gînscă</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaharieva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Boteanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lupu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          .
          <article-title>Retrieving diverse social images at MediaEval 2016: Challenge, dataset and evaluation</article-title>
          .
          <source>In MediaEval 2016 Workshop</source>
          , Hilversum, Netherlands, October 20-21,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Kuoman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tollari</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Detyniecki</surname>
          </string-name>
          . UPMC at MediaEval 2013:
          <article-title>Relevance by text and diversity by visual clustering</article-title>
          .
          <source>In MediaEval 2013 Workshop</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>Kuoman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tollari</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Detyniecki</surname>
          </string-name>
          .
          <article-title>Using tree of concepts and hierarchical reordering for diversity in image retrieval</article-title>
          .
          <source>In CBMI</source>
          , pages
          <fpage>251</fpage>
          -
          <lpage>256</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lux</surname>
          </string-name>
          and
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Chatzichristofis</surname>
          </string-name>
          .
          <article-title>Lire: Lucene image retrieval: An extensible Java CBIR library</article-title>
          .
          <source>In ACM International Conference on Multimedia</source>
          , pages
          <fpage>1085</fpage>
          -
          <lpage>1088</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>