Ghent University-iMinds at MediaEval 2013 Diverse Images: Relevance-Based Hierarchical Clustering Baptist Vandersmissen1 Abhineshwar Tomar1 Fréderic Godin1 baptist.vandersmissen@ugent.be abhineshwar.tomar@ugent.be frederic.godin@ugent.be Wesley De Neve 1,2 Rik Van de Walle1 wesley.deneve@ugent.be rik.vandewalle@ugent.be 1 Multimedia Lab, ELIS, Ghent University – iMinds, Ghent, Belgium 2 Image and Video Systems Lab, KAIST, Daejeon, South Korea ABSTRACT cluster. The SRI for a set of images is calculated by tak- In this paper, we attempt to tackle the MediaEval 2013 Re- ing the mean of all the corresponding feature vectors of trieving Diverse Social Images challenge, which is a filter and these images. Intra-cluster ranking of all of the m images refinement problem on a Flickr-based ranked set of social im- is produced by calculating their Euclidean distance to the ages. We developed three different approaches, using visual SRI. The image with rank 1 (topmost rank) is closest to data, textual data and a combination thereof, respectively. the SRI of the cluster. Hierarchical clustering on highly relevant images, combined • Subsequently, we rank the k different clusters, again by with a greedy approach to complement the ranking, forms calculating the distance of the cluster SRIs to a general the basis of our approach. SRI value, which is the mean over all cluster SRIs. Finally, n images are selected by iterating over the ranked clusters 1. INTRODUCTION and taking the topmost ranked image within each cluster. In this paper, we describe our approach for tackling the MediaEval 2013 Retrieving Diverse Social Images Task [2]. 3. TEXTUAL RUN This task focuses on result diversification in the context of The textual run makes use of information derived from social image retrieval. We refer to [2] for a complete task tags and other textual metadata. This approach aims at di- overview. versifying the results by reranking the images retrieved from We suggest a cluster-based approach for the visual run and Flickr using textual relevance and semantic similarity. Our a semantic similarity-based approach for the textual run. solution is based on [3] and makes use of an adapted perfor- The third run focuses on hierarchical clustering of relevant mance metric to improve the ranking characteristics. Images images and represents a combination of the purely visual for a query can then be ordered by directly optimizing the and textual techniques. performance metric. This metric is named Average Diverse Precision (ADP) and is derived from the conventional Aver- 2. VISUAL RUN age Precision metric by adding a diversity component. We We propose a hierarchical clustering-based approach for refer to [3] for a comprehensive overview. the ranking of images in accordance with their relevance We implemented a greedy approach that optimizes an es- and diversity for a specific location. This method builds on timation of the ADP measurement. Let τ denote an ordering the approach provided in [1]. It introduces an inter-cluster of the images, and let τ(i) be the image at the position of ranking machanism and differs on use of feature vectors, rank i (a lower number indicates an image with a higher distance measure and Synthetic Representative Image (SRI) rank). With the top i − 1 documents established, we can calculation method. We want to refine a set of m images derive that the ith image should be decided as follows: retrieved from Flickr to a ranking of size n.  Rel(x)  τ(i) = arg max Div(x)(C + Div(x)) , (2) • The set of m images is hierarchically clustered to pro- x∈D−S i duce k clusters. Similarity between two images xi and xj where (represented by a CN3x3 and LBP3x3 feature vector) is measured using a Gaussian kernel: S = {τ(1), τ(2), . . . , τ(i − 1)} , (3) ||xi − xj ||22   i−1 s(xi , xj ) = exp − . (1) X σ2 C= Rel(τ(k))Div(τ(k)), (4) k=1 • For each of the k clusters produced, we calculate a Syn- thetic Representative Image (SRI). The SRI acts as a rep- with Rel(x) and Div(x) denoting the estimated relevance resentative image for all the images within a particular and diversity of the image, respectively. 3.1 Relevance Estimation Copyright is held by the author/owner(s). We estimate the relevance of an image by making use of MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain textual metadata (number of views, number of comments) Table 1: Results (in %). Numbers between brackets denote the relative difference to the Flickr ranking and CR denotes the cluster recall of the ranking. Development Set (comparison with Flickr) Test Set + GPS - GPS + GPS - GPS CR@10 P@10 CR@10 P@10 CR@10 P@10 CR@10 P@10 Visual 43.6 (2.4) 79.2 (-6.8) 48.4 (2.0) 71.6 (2.8) 37.5 76.1 34.7 56.8 Textual 44.2 (2.9) 81.6 (-4.4) 51.6 (5.2) 67.2 (-1.6) 39.7 74.9 37.5 58.6 Combined 49.8 (8.5) 85.6 (-0.4) 51.7 (5.3) 74.8 (6.0) 41.3 80.5 42.8 66.7 and social tags. This data was provided together with three based on a gain score. This score is higher for images textual features: TF–IDF, Social TF–IDF and a probabilis- that maximize the diversity and relevance related to the tic feature type. We suggest a linear combination of this current ranking. normalized data. • When the first l spots in the ranking are filled, the algo- Rel(x) = α × tagsx + β × viewsx + γ × commentsx , (5) rithm runs from the start until all n spots are taken. where viewsx and commentsx denote normalized number of views and number of comments, respectively and P Table 2: Results on subset of 50 locations from test t∈Tx a probt,x + b tfidft,x + c stfidft,x set (in %) with crowd-sourcing ground truth. tagsx = , (6) |Tx | GT1 GT2 GT3 with Tx denoting the set of tags linked to image x. We CR@10 P@10 CR@10 P@10 CR@10 P@10 found that the precision of image orderings, which are purely based on relevance information, can be maximized by setting Visual 83.0 72.5 78.7 72.5 69.9 72.5 parameters a and γ to zero and increasing the impact of the Textual 77.7 65.7 72.1 65.7 63.5 65.7 parameters linked to ’tag’ and ’views’ information. Combined 83.6 69.4 77.4 69.4 66.1 69.4 3.2 Diversity Estimation Diversity of an image to a ranking is defined as the mini- mal difference with the other images in that ranking: 5. EXPERIMENTS In Table 1, we can see the results of our algorithms (three Div(x) = min {1 − s(x, τ(i))} (7) runs), compared to the regular Flickr ranking and the re- i∈{1,...,n} sults on the test set. We notice that the combined method We use a semantic similarity measure based on Google creates the most precise and diverse rankings based on both distance to assess the difference between two images. The precision and cluster recall measures (averages 42.1% cluster average of the summation of the similarities between the recall on test set). Analysis of the results on the develop- different tags of both images gives us a value that describes ment set indicate the importance of finding the right bal- the semantic similarity between both images. ance between relevance and diversity. We also observe the similarity of the visual run and the textual run in terms of 4. COMBINED RUN effectiveness. Table 2 lists the results on 50 locations of the To estimate the relevance of an image, we use the textual- test set evaluated with crowd-sourcing ground truths. based method (cf. Section 3.1). Similarity between images is based on visual image features and a Gaussian kernel 6. CONCLUSIONS method to measure the difference between two image vec- We observe that a method clustering images focused on tors (cf. Equation 1). In order to provide both a relevant high relevance outperforms all others. This method uses and diverse ordering, we make use of hierarchical cluster- textual data for the relevance estimation and visual features ing techniques. We, again, try to refine a set of m images for the diversity assessment. retrieved from Flickr to a ranking of size n. • First, l images are selected with the highest estimated 7. REFERENCES relevance. l is an arbitrary number that depicts a subset [1] R. Anca-Livia, J. Stöttinger, B. Ionescu, M. Menéndez, of the final ranking. The larger l becomes, the more the and F. Giunchiglia. Representativeness and diversity in focus will shift from relevance to diversity. photos via crowd-sourced media analysis. 2012. [2] B. Ionescu, M. Menéndez, H. Müller, and A. Popescu. • Next, these l images are hierarchically clustered based on a Retrieving diverse social images at mediaeval 2013: distance matrix calculated with the above described visual Objectives, dataset and evaluation. In MediaEval 2013 similarity. Per cluster, the most relevant item is selected Workshop, CEUR-WS.org, Barcelona, Spain, 2013. and added to the final ranking. [3] M. Wang, K. Yang, X.-S. Hua, and H.-J. Zhang. • Depending on the number of remaining places of the first Towards a relevant and diverse search of social images. l spots in the final ranking, images are greedily added Multimedia, IEEE Transactions on, 2010.