UPMC at MediaEval 2016 Retrieving Diverse Social Images Task

Sabrina Tollari
Sorbonne Universités, UPMC Univ Paris 06, UMR CNRS 7606 LIP6, 75252 PARIS cedex 05, France
Sabrina.Tollari@lip6.fr

ABSTRACT
In the MediaEval 2016 Retrieving Diverse Social Images Task, we proposed a general framework based on agglomerative hierarchical clustering (AHC). We tested the provided credibility descriptors as a vector input for our AHC. The results on devset showed that this vector based on the credibility descriptors is the best feature, but unfortunately this is not confirmed on testset. To merge several features, we chose to merge feature similarities. Tests on devset showed that merging similarities using linear or weighted-max operators gave, most of the time, better results than using only one feature. This result is partially confirmed on testset.

1. INTRODUCTION
Contrary to previous years, in 2016, the task [3] addresses the use case of a general ad-hoc image retrieval system. General cases are more difficult to tackle, because the system cannot be adapted to a particular application. Another difference is the use of the F1@20 metric, which means that we are not only interested in diversity, but also in finding a balance between relevance and diversity, which is more difficult to handle. In the 2013 task, we proposed a framework [4] which first tries to improve relevance and then performs a clustering to improve diversity. This strategy obtained good results and can handle general cases. So this year, we use the same strategy, but we adapt the parameters to the use of the F1@20 metric, i.e., not only to improve diversity, but to find a balance between relevance and diversity.

2. FRAMEWORK
For each query, we apply the following framework. Step 1 (optional): Re-rank the Flickr baseline to improve relevance according to text features. Step 2: Cluster the N first results using Agglomerative Hierarchical Clustering (AHC). Step 3: Sort the images in each cluster using their rank from Step 1, then sort the clusters according to the rank of the image at the top of each cluster. Step 4: Re-rank the results by alternating images from different clusters.
The AHC [2] is a robust method that can handle different kinds of features. Applying the AHC to query results provides a hierarchy of image clusters. In order to obtain groups of similar images, we cut the hierarchy to obtain a fixed number k of unordered clusters (see [5] for details).
The AHC needs a measure to compare two documents. A document can be described by several features (text, visual, etc.). To take advantage of several features, we need a way to merge them. We choose to merge similarities. Some of the features are associated with a distance, others with a similarity. In order to have only similarities, all distances are transformed using the classical formula: let $\delta(x, y)$ be a distance between $x$ and $y$, then the similarity is defined as $sim(x, y) = 1/(1 + \delta(x, y))$.
Let $f_1$ and $f_2$ be two features and $\tau \in [0, 1]$; we compute a linear fusion of feature similarities by:
$$sim_{Linear(f_1, f_2, \tau)}(x, y) = \tau \cdot sim_{f_1}(x, y) + (1 - \tau) \cdot sim_{f_2}(x, y).$$
Let $n$ be the number of features. Let us choose wisely a weight $w_i$ for each feature $f_i$, such that $\sum_{i=1}^{n} w_i = 1$. We compute a weighted-max fusion of similarities by:
$$sim_{WMax(f_1, w_1, f_2, w_2, \ldots, f_n, w_n)}(x, y) = \max_{i \in \{1, \ldots, n\}} w_i \cdot sim_{f_i}(x, y).$$
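The following Python sketch illustrates the distance-to-similarity transformation and the two fusion operators exactly as defined above; the function and variable names are ours, not taken from the paper's implementation:

```python
import numpy as np

def dist_to_sim(delta):
    """Turn a distance value (scalar or NumPy array) into a similarity in (0, 1]."""
    return 1.0 / (1.0 + delta)

def sim_linear(sim_f1, sim_f2, tau):
    """Linear fusion of two feature similarities, with tau in [0, 1]."""
    return tau * sim_f1 + (1.0 - tau) * sim_f2

def sim_wmax(sims, weights):
    """Weighted-max fusion: max over features of w_i * sim_i.
    `sims` and `weights` are sequences of equal length; the weights sum to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return max(w * s for w, s in zip(weights, sims))

# Illustrative call with the weights reported for run 5 (tdtu, ScalCol, cred);
# the similarity values 0.6, 0.4, 0.8 are made up for the example:
fused = sim_wmax([0.6, 0.4, 0.8], [0.014, 0.97, 0.016])
```

Note how, with such unbalanced weights, the weighted-max operator lets one dominant feature (here ScalCol) drive the fused similarity unless another feature's similarity is very high, which is consistent with the need, discussed later, to optimize these weights carefully.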
3. EXPERIMENTS AND RESULTS

Text re-ranking (Step 1).
Using a vector space model (VSM) with tf-idf weights and cosine similarity, we tested the choice of textual information fields (Title (t), Description (d), Tags (t), Username (u)). We also tested several stemmers. We noticed no significant difference with or without stemmers; the reason may be that there are only a few words in the query title. So we chose not to use a stemmer in any of the experiments. Finally, it seems that, globally, ttu gives a slightly better P@20.
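As a rough illustration of this re-ranking step, here is a minimal sketch using scikit-learn's tf-idf vectorizer; the paper does not specify its implementation, so the preprocessing and field concatenation shown here are assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rerank_by_text(query_title, docs_text):
    """Re-rank documents by cosine similarity between their tf-idf vectors
    and the query title (VSM, no stemming).
    `docs_text` holds, for each document, its concatenated ttu fields."""
    vectorizer = TfidfVectorizer()
    doc_matrix = vectorizer.fit_transform(docs_text)
    query_vec = vectorizer.transform([query_title])
    scores = cosine_similarity(query_vec, doc_matrix).ravel()
    # Indices of documents sorted from most to least similar to the query.
    return scores.argsort()[::-1]
```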
Features for clustering in Step 2.
We tested several combinations of textual information fields. Finally, for text clustering, the best solution on devset is to use all the fields (tdtu) and a similarity based on the Euclidean distance. It seems that using the Description field in addition tends to produce more diversity than using ttu, because documents are more dissimilar from each other.
We tested the provided visual features cnn_gen and cnn_ad. In most of our experiments, it seems that cnn_ad gives slightly better or better results than cnn_gen. We also tested several features from the Lire library [1, 6]: the ScalableColor feature (ScalCol), a histogram in HSV color space encoded by a Haar transform, gives the best results.
Using the provided credibility descriptors, we built, for each image, normalized real vectors of 13 dimensions (noted cred); NaN, null and missing values (about 3.5% of the credibility descriptor values) are replaced by random values.

Table 1: Run results. Between brackets, gain in percentage compared to the devset baseline or to the testset worst run. The number of documents for clustering per query is 300. k is the selected number of clusters.

Run      | Step 1   | Steps 2-4: AHCCompl features             | k  | devset P@20  | devset CR@20 | devset F1@20 | testset P@20 | testset CR@20 | testset F1@20
baseline | -        | -                                        | -  | 0.698 (ref.) | 0.371 (ref.) | 0.467 (ref.) | -            | -             | -
VSM      | VSM(ttu) | -                                        | -  | 0.772 (+11%) | 0.397 (+7%)  | 0.507 (+9%)  | -            | -             | -
rand     | VSM(ttu) | random features                          | 50 | 0.771 (+10%) | 0.410 (+11%) | 0.522 (+12%) | -            | -             | -
user     | VSM(ttu) | username                                 | 50 | 0.761 (+9%)  | 0.485 (+31%) | 0.578 (+24%) | -            | -             | -
run 1    | -        | ScalCol                                  | 20 | 0.631 (-10%) | 0.432 (+16%) | 0.498 (+7%)  | 0.520 (ref.) | 0.400 (ref.)  | 0.430 (ref.)
run 2    | VSM(ttu) | tdtu                                     | 50 | 0.768 (+10%) | 0.471 (+27%) | 0.569 (+22%) | 0.697 (+34%) | 0.486 (+22%)  | 0.552 (+28%)
run 3    | VSM(ttu) | Linear(tdtu,ScalCol,0.02)                | 50 | 0.767 (+10%) | 0.487 (+31%) | 0.582 (+25%) | 0.696 (+34%) | 0.494 (+24%)  | 0.553 (+29%)
run 4    | VSM(ttu) | cred                                     | 50 | 0.767 (+10%) | 0.491 (+32%) | 0.585 (+25%) | 0.681 (+31%) | 0.487 (+22%)  | 0.543 (+26%)
run 5    | VSM(ttu) | WMax(tdtu,0.014,ScalCol,0.97,cred,0.016) | 50 | 0.771 (+10%) | 0.493 (+33%) | 0.588 (+26%) | 0.686 (+32%) | 0.487 (+22%)  | 0.544 (+26%)

Number of queries: 70 (devset), 64 (testset).

Clustering parameters in Step 2.
When varying features and parameters, we noticed that, on devset, complete linkage (AHCCompl) globally gave better results than single or average linkage.
For each query, 300 results were provided. Usually, there are more relevant documents among the first results than at the end of the result list. Is it worth it for the system to take the time to cluster 300 results online in order to improve the F1@20 of the first 20 documents? We made several experiments varying the diversity methods, parameters, features and number of input documents. Globally, we did not see large differences in F1@20 between 150, 200, 250 or 300 documents. The only real difference depends on the number of clusters. Usually, the more documents there are in the input set, the higher the number of clusters should be to obtain good results: around 20 clusters for 150 documents, and around 50 clusters for 300 documents. Finally, we chose to take 300 documents because the peak of the curve is wider than with 150 documents.

[Figure 1: Some of the results on devset varying the number of clusters (300 documents per query). The figure plots F1@20 on devset against the number of clusters (0 to 150) for the baseline, VSM(ttu), AHCCompl(ScalCol), and VSM(ttu)+AHCCompl with the random, username, cnn_ad, ScalCol, tdtu, Linear(tdtu,ScalCol,0.02), cred and WMax(tdtu,0.014,ScalCol,0.97,cred,0.016) features.]

Reference runs and run results.
The baseline run is the Flickr ranking. The VSM(ttu) run is obtained using the VSM on the ttu fields, without clustering. To have some points of comparison, we also tested a clustering on random features (documents are represented by vectors of 5 random values) and a clustering using only the username (two documents with the same username are similar).
As the queries are only composed of text, we cannot apply Step 1 to improve relevance in the case of run 1 (visual only). Figure 1 shows that AHCCompl(ScalCol) (clustering on ScalCol features without Step 1) gives lower results than VSM(ttu)+AHCCompl(ScalCol) (with Step 1), but in both cases, visual features give lower results than the tdtu or cred features.
The best number of clusters to use is always an open question. If we want the best CR@20, most of the time it is better to take 20 clusters; unfortunately, with 20 clusters the P@20 is often the worst. So, in order to optimise F1@20 and according to the curves on devset (see Figure 1), we chose to take 50 clusters for run 2 to run 5. This choice seems to give a good compromise between relevance and diversity.
On devset, the best results using only one feature are obtained using cred. The reason may be that, in the case of cred, images with the same userid have the same vectors, so these images will be in the same cluster, and such images are often about the same subtopic. In Figure 1, we can notice that the clustering on username gives better results than on text only (tdtu) or visual only (ScalCol), but lower results than on cred. So there must also be another reason. If some images have similar credibility descriptors, that means that their users have the same characteristics. But it is not clear why these characteristics are interesting for diversity. To try to show that cred is a good feature for diversity whatever the diversity method, we tried this feature with a greedy algorithm and we obtained the same conclusions (on devset). Unfortunately, on testset, the text-only run (run 2) gives better results than the cred one (run 4) (see Table 1). So this result cannot be generalized and may depend on devset.
As the visual similarities are not normalized, we needed to carefully optimize the weights of the linear and of the weighted-max fusion operators. On devset, the weighted-max fusion using tdtu, ScalCol and cred gave the best results of all our experiments. But as cred is not so good on testset, run 5 does not give very good results. Finally, the linear fusion between text (tdtu) and visual (ScalCol) gives the best results on testset (run 3).
Despite the fact that we use different kinds of features, the F1@20 values for run 2 to run 5 are very close (from 0.543 to 0.553), which means it is difficult to draw reliable conclusions on the best feature or on the interest of similarity fusion.
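To make Steps 2-4 concrete, here is a minimal sketch of the clustering and diversification pipeline using SciPy's hierarchical clustering with complete linkage, cut at k flat clusters. The helper names are ours, and for brevity the sketch clusters raw feature vectors with the default Euclidean distance, whereas the paper compares documents through fused similarities:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def diversify(features, k):
    """Steps 2-4: cluster the ranked documents, then interleave clusters.
    `features` is an (n_docs, dim) array whose rows are ordered by Step 1 rank."""
    # Step 2: complete-linkage AHC; cut the hierarchy into k flat clusters.
    Z = linkage(features, method='complete')
    labels = fcluster(Z, t=k, criterion='maxclust')
    # Step 3: within each cluster, documents keep their Step 1 order
    # (rows are already rank-ordered); clusters are then sorted by the
    # rank of their best-ranked document.
    clusters = {}
    for rank, label in enumerate(labels):
        clusters.setdefault(label, []).append(rank)
    ordered = sorted(clusters.values(), key=lambda docs: docs[0])
    # Step 4: round-robin over the clusters to alternate images.
    result = []
    for i in range(max(len(docs) for docs in ordered)):
        for docs in ordered:
            if i < len(docs):
                result.append(docs[i])
    return result  # document indices in diversified order
```

In the paper's setting, one could instead pass `linkage` a condensed distance matrix derived from the fused similarities, e.g. by inverting the transformation of Section 2 with delta = 1/sim - 1.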
4. REFERENCES
[1] http://www.semanticmetadata.net/lire
[2] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley and Sons, Inc., p. 552, 2000.
[3] B. Ionescu, A. L. Gînscă, M. Zaharieva, B. Boteanu, M. Lupu, and H. Müller. Retrieving diverse social images at MediaEval 2016: Challenge, dataset and evaluation. In MediaEval 2016 Workshop, Hilversum, Netherlands, October 20-21, 2016.
[4] C. Kuoman, S. Tollari, and M. Detyniecki. UPMC at MediaEval 2013: Relevance by text and diversity by visual clustering. In MediaEval 2013 Workshop, 2013.
[5] C. Kuoman, S. Tollari, and M. Detyniecki. Using tree of concepts and hierarchical reordering for diversity in image retrieval. In CBMI, pages 251-256, 2013.
[6] M. Lux and S. A. Chatzichristofis. Lire: Lucene image retrieval: An extensible Java CBIR library. In ACM International Conference on Multimedia, pages 1085-1088, 2008.