=Paper=
{{Paper
|id=Vol-1263/paper18
|storemode=property
|title=PeRCeiVe Lab@UNICT at MediaEval 2014 Diverse Images: Random Forests for Diversity-based Clustering
|pdfUrl=https://ceur-ws.org/Vol-1263/mediaeval2014_submission_18.pdf
|volume=Vol-1263
|dblpUrl=https://dblp.org/rec/conf/mediaeval/SpampinatoP14
}}
==PeRCeiVe Lab@UNICT at MediaEval 2014 Diverse Images: Random Forests for Diversity-based Clustering==
PeRCeiVe Lab@UNICT at MediaEval 2014 Diverse Images: Random Forests for Diversity-based Clustering Concetto Spampinato, Simone Palazzo Department of Electrical, Electronics and Computer Engineering University of Catania Viale Andrea Doria, 6 - 95125, Catania, Italy {cspampin, simone.palazzo}@dieei.unict.it ABSTRACT j land in the same terminal node, their similarity s- In this paper we describe the work done by the Pattern core is increased by one. The similarities are finally Recognition and Computer Vision Laboratory (PeRCeiVe normalized by the number of trees. Lab) of the University of Catania (Italy) for the MediaEval The similarities between samples build up a matrix, 2014 Retrieving Diverse Social Images Task. The main chal- SIM , which is symmetric, positive definite and with lenge consists of retrieving, for a given topic, a set of pho- values in the unit interval [0, 1]. The p dissimilarity ma- tos which are relevant to the topic but also showing diverse trix is defined as DISSIMij = 1 − SIMij , which views of it. We submitted four runs exploiting a common is employed as input for partitioning around medoids feature-independent clustering strategy based on random- clustering [2]. Each obtained cluster is likely to show forests for diversifying Flickr result images while preserving a specific view of the considered location; relevance. • Cluster filtering by relevance. Starting from the diversity clusters we then perform a cluster filtering 1. INTRODUCTION by relevance. In particular, let SW be the average The goal of the MediaEval 2014 Retrieving Diverse Social of maximum similarities between all the topic samples Images Task [1] is to refine a list of location photos, retrieved and the content available on Wikipedia computed as from Flickr through textual queries. Although photos re- follows: trieved in Flickr are often relevant, e.g., depict partially or N entirely the target location, a significant number is either 1 X SW = max SIM (i, j) (1) noisy or redundant. The objective is, therefore, to filter out N i=1 j∈W iki such photos in order to obtain a exhaustive, and compact summary for the considered location. with N being the number of topic samples, and W iki Conversely to most of the existing methods, our approach the number of samples describing Wikipedia location first looks for diversity of the Flickr photos by performing content (e.g., in case of visual features, W iki is the a diversity-based clustering and then removes the irrelevant number of images on Wikipedia for that location, while clusters by computing the similarities between the clustered when employing text features W iki = 1 since only photos and information available on Wikipedia. Final re- sample describing the entire Wikipedia page text is ranking is carried out by re-clustering the relevant images considered). and computing a diversity score for each location photo. For each cluster C we carry out an unsupervised hier- archical tree clustering on samples’ features, thus ob- 2. METHOD taining C 0 clusters. After that, we scale the similarities The strategy employed for dealing with the Retrieving between the samples of cluster C and the Wikipedia Diverse Social Images Task of MediaEval 2014 relies on a content by C 0 , i.e., SIMC0 = SIM (i, j)i∈C,j∈W iki · C 0 . common, feature-independent framework consisting of the This gives more weight to the samples which resemble following steps: mostly the Wikipedia content and that, at the same time, are diverse from other samples. The relevance s- • Diversity-based clustering. The goal of this clus- core RSC of cluster C is, eventually, computed as mean tering step is to assess dissimilarity between samples of SIMC0 and the cluster is removed if RSC ≥ k · SW ; (i.e., location photos described with either visual or text descriptors) of a given location and to group them • Final ranking according to a diversity score. according to such dissimilarity. To do that, we use The samples obtained at the previous step are again Random forest predictors which allow us to define a clustered using unsupervised hierarchical clustering and dissimilarity measure between observations [3]. For re-ranked according to a diversity score, computed for each tree of the forest, if samples/observations i and a sample j, by integrating: 1) number of samples in n- ear clusters: if sample j is in cluster Cj we count how main samples are in Cj , Cj−1 and Cj+1 and its score is s1j = NC +NC 1 +NC . This favors again diver- Copyright is held by the author/owner(s). j j−1 j+1 MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain sity as samples with many items in near clusters are Table 1: Training and official test results obtained by our method for each of the submitted runs. Development set Test set (official) Run P@20 CR@20 F1@20 P@20 CR@20 F1@20 run1 (visual) 86.67% 43.10% 56.87% 74.80% 38.74% 50.34% run2 (text) 78.17% 44.02% 55.59% 75.53% 39.02% 50.63% run3 (visual-text) 85.33% 43.61% 56.93% 72.40% 37.88% 49.03% run5 (any resources) 84.14% 42.97% 56.14% 72.93% 37.31% 48.49% strongly penalized; 2) photo id distance s2j from sam- for run 1 and run 2, i.e., by multiplying the ranking ple j to all other samples in the list: photos with close scores of the two ranked lists for images appearing in IDs are likely to refer to a similar view of a location; both lists and completing the final list with images of and 3) a random factor s3j to enable exploration of the list of Run 1; new solutions. The final ranking of sample j is given by multiplying the three above scores. • Run 5 (any resources): same algorithm as in Run 1, without applying the face and GPS-based filtering. 3. EXPERIMENTS It is clear that our attempt of making the training phase as independent as possible from the development set succeeded 3.1 Setup and features only partially, since the results on the test set are sensibly The algorithm described in Sect. 2 depends on the number lower than those on the training set. Also, textual features of trees used for the diversity based clustering and on the k performed surprisingly better than visual ones, which had threshold for cluster filtering by relevance, which were set, obtained the highest accuracy by far on the development respectively, to 50 and 3.5 for visual features and 1.0 for text set. The significant performance difference, in terms of pre- features. cision, of the visual runs (about 12%) on the test set and For runs using only visual features, we used the visual the development set can be due to a different relevance dis- descriptors provided by the task’s organizers for each photo tribution of visual features, while it seems that text features (including the Wikipedia ones) normalized between 0 and 1 keep the same distribution on the two sets. and concatenated into a single 945-dimensional vector. Textual descriptors were computed as TF-IDF vectors 4. CONCLUSIONS from a vocabulary made up of all words from titles, descrip- In this paper we describe our random-forest based ap- tions and tags for photos in the development and test sets, proach for tackling the MediaEval 2014 Retrieving Diverse plus words extracted from Wikipedia page (available as part Social Images Task. Our method, applied to text and visual of the data provided by the task’s organizers). In order to features indifferently, leverages on a diversity-based cluster- reduce the original vocabulary size, being too large (more ing using Random Forests and on noisy cluster filtering to than 90,000 words), we removed: 1) words shorter than four increase relevance. Final ranking is made in order to favor characters; 2) words starting with digits; 3) words with low diversity with respect to relevance. maximum TF-IDF values, thus resulting in a vocabulary size As future improvement, we mean to better exploit the of 51,136 terms. amount of information provided for development: in par- For both the visual-based and the text-based classifiers, ticular, the diversity ground truth will be used to improve the random forest was trained on 10 locations (the remaining intra-cluster split in the diversity-based clustering, and user ones were used for testing), randomly selected from the ones credibility information will be integrated into the relevance available in the training set. Increasing the number locations estimation model. led to higher training times without an actual improvement in accuracy on the training set. 5. REFERENCES 3.2 Results and discussion [1] B. Ionescu, A. Popescu, M. Lupu, A. L. Gı̂nsca, and We submitted four runs whose results are given in Table 1: H. Müller. Retrieving Diverse Social Images at MediaEval 2014: Challenge, Dataset and Evaluation. In • Run 1 (visual information only): we employed the al- MediaEval 2014 Workshop, October 16-17, 2014, gorithm described in Sect. 2 on the visual features, Barcelona, Spain, 2014. filtering out images: 1) having people, detected by [2] H.-S. Park and C.-H. Jun. A simple and fast algorithm the face detector in [4], as subjects, and 2) whose Eu- for k-medoids clustering. Expert Syst. Appl., clidean distance between the location’s GPS coordi- 36(2):3336–3341, Mar. 2009. nates (provided as part of each topic’s description) and [3] T. Shi and S. Horvath. Unsupervised Learning With their GPS coordinates (when provided) was over 10; Random Forest Predictors. Journal of Computational • Run 2 (text information only): the same algorithm was and Graphical Statistics, 15(1):118 – 138, 2006. employed on the reduced-vocabulary TF-IDF descrip- [4] P. Viola and M. Jones. Robust real-time object tors; detection. International Journal of Computer Vision, 57:137–154, 2001. • Run 3 (visual-text fusion): the results presented in this run were obtained by combining those computed