A Unified Framework for Retrieving Diverse Social Images

Maia Zaharieva1,2 and Patrick Schwab1
1 Multimedia Information Systems, Faculty of Computer Science, University of Vienna, Austria
2 Interactive Media Systems, Institute of Software Technology and Interactive Systems, Vienna University of Technology, Austria
maia.zaharieva@[univie|tuwien].ac.at, a0927193@unet.univie.ac.at

Copyright is held by the author/owner(s).
MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain

ABSTRACT
In this paper we explore the performance of a generic, unified framework for the retrieval of relevant and diverse images from social photo collections. The approach allows for the easy evaluation of different visual and textual image descriptions, clustering algorithms, and similarity metrics. Preliminary results show that the achieved performance depends strongly on the choice of underlying technology and similarity metric.

1. INTRODUCTION
The immense daily growth of publicly shared media bears both potentials and challenges for automated media analysis and understanding. Currently, image access and retrieval are usually based on user-provided and camera-generated metadata. Although increasingly available, such metadata often suffers from limitations such as imprecise capture time and GPS information or misleading and ambiguous textual descriptions. Additionally, the large number of shared items often results in a large amount of visually very similar data. This challenge is the focus of the MediaEval 2014 Retrieving Diverse Social Images Task [2]. The aim of the task is the refinement of location images retrieved from Flickr while taking into consideration both their relevance and diversity.

Previous work in the context of this task shows a broad field of possible approaches ranging from re-ranking and clustering to greedy optimization and graph representations [3]. Several authors propose different systems for different feature types (e.g., [1][5]), which impedes reasoning about the selection of an approach or of particular features. Furthermore, some methods build upon assumptions that hold true in a limited setting only (e.g., that the relevance of an image is related to the number of views or the length of the descriptions [4]). While most of the presented approaches employ a combination of a re-ranking method (for relevance improvement) and a clustering method (for ensuring diversification), we build a unified framework that allows for a thorough evaluation of various textual and visual features, clustering algorithms, and similarity metrics.

2. APPROACH
We employ a multi-stage approach for the retrieval of diverse social images. The workflow passes through three main stages: 1) relevance ranking of input images, 2) image clustering for diversification, and 3) final image selection. The initial set of input images may optionally be pre-processed in order to filter potentially irrelevant images, such as images with a human as main subject.

In the first stage, relevance ranking, each image of the input set is first represented by a feature vector $\vec{v}$, where $\vec{v}$ is a concatenation of the standardized z-scores of the feature descriptors $d_1, \ldots, d_n$:

    \vec{v} = \mathrm{zscore}(d_1, \ldots, d_n)    (1)

Since the provided Wikipedia photos are by definition representative [2], we additionally compute a representative feature vector $\vec{v}_r$ for each referenced Wikipedia image. The relevance score $s$ of an image is then defined as the smallest distance between its feature vector $\vec{v}$ and all $\vec{v}_r$ from the set of representative feature vectors $W$:

    s = \min_{\vec{v}_r \in W} \mathrm{distance}(\vec{v}, \vec{v}_r)    (2)

The aim of the second stage, image clustering, is to find groups of similar images that can be used to diversify the final retrieval results. Note that the distance measures and image features used at this step are not necessarily the same ones employed for relevance ranking.

The third and last stage of the approach, final image selection, combines the results of the previous steps to retrieve images that are both relevant and diverse with respect to the initial image set. For this stage we use a Round-Robin algorithm. We start by selecting the image with the best relevance score from each cluster; since $s$ is a distance, the best score is the smallest one. These images, sorted in ascending order of $s$, constitute the m highest ranked results, where m is the number of detected clusters. The selected images are removed from their corresponding clusters and the selection process is repeated until the required number of retrieved results is reached.

In general, the clustering algorithm, the metric used to compare the feature vectors, and the underlying image features (for both image ranking and image clustering) can be chosen freely. In our experiments we tested:

- different clustering algorithms: k-means, Adaptive Hierarchical Clustering (AHC), MeanShift, and Lingo;
- several comparison metrics: Euclidean, city-block, χ², cosine, correlation, Mahalanobis, Spearman, Hamming, and Jaccard;
- all visual and textual features provided by the organizers [2]: term frequency - inverse document frequency (TF-IDF), Color Naming Histogram (CN), Histogram of Oriented Gradients (HOG), Color Moments (CM), Locally Binary Pattern (LBP), Statistics of Gray Level Run Length Matrix (GLRLM), and the corresponding spatial pyramid representations (3x3), in addition to Bag-of-Visual Words (BoVW) of dense SIFT descriptors.

3. EXPERIMENTS AND RESULTS
In our first experiments we compared the performance of the different clustering algorithms. Results on the development dataset showed that AHC significantly outperforms k-means, MeanShift, and Lingo for all explored features (significance t-test, p < 0.001). Thus, we employed AHC in all follow-up experiments.

We conducted a thorough evaluation of the performance of the employed features at the two main stages of our approach: relevance ranking and image clustering. Table 1 summarizes the results by means of ranked feature lists. The reported feature rankings and the selection of the corresponding best performing distance measures are the product of significance t-tests with overall p < 0.003. While the Color Naming Histograms (CN and CN3x3) are usually outperformed by any other feature, the Color Moments (CM and CM3x3) show robust performance in both the ranking and the clustering tasks. In contrast to the ranking, which is clearly dominated by the performance of CM3x3, TF-IDF, and SIFT, image clustering using AHC is more robust and the difference in the performance of global and local features decreases to a large extent.

Table 1: Best feature-metric combinations for AHC. Features sharing a rank are listed together; the best performing metric is given in parentheses.

Rank | Relevance ranking                                      | Image clustering
1    | CM3x3 (Euclidean)                                      | CM (χ²), CM3x3 (Euclidean)
2    | TF-IDF (Spearman), SIFT (Euclidean)                    | HOG (cosine)
3    | CM (χ²), LBP (χ²), LBP3x3 (χ²)                         | GLRLM3x3 (χ²), LBP3x3 (χ²)
4    | GLRLM (Euclidean), HOG (cosine), GLRLM3x3 (χ²)         | LBP (χ²), GLRLM (χ²), SIFT (Euclidean)
5    | CSD (cosine), CN (Euclidean)                           | CSD (cosine), CN (correlation), CN3x3 (Euclidean)
6    | CN3x3 (Euclidean)                                      | TF-IDF (Euclidean)

Eventually, we submitted four runs for the final evaluation (see Table 2 for the configurations). Table 3 shows the results for the submitted runs for both the development and the test datasets. The best CR@20 and F1@20 on both datasets are achieved by the combination of textual and visual information (run3). However, on the test dataset the differences between the performances of the different runs vanish. Overall, cluster recall (CR) remains relatively low due to the large number of irrelevant images building noisy clusters. In general, the achieved results outline the limitations of the available textual (and visual) information in assessing image relevance. This is mainly because user-provided textual descriptions on social media sites often contain ambiguous or irrelevant information. A possible way to improve the results is to consider the occasionally available GPS data and to employ external resources as an additional source of information.

Table 2: Official runs configurations (V: visual, T: textual descriptors employed).

Run       | Relevance ranking | Image clustering
run1 (V)  | CM3x3             | SIFT
run2 (T)  | TF-IDF            | TF-IDF
run3 (VT) | TF-IDF            | CSD
run5 (V)  | CM3x3             | CSD

Table 3: Evaluation results.

     | Development dataset       | Test dataset
Run  | CR@20   P@20    F1@20     | CR@20   P@20    F1@20
run1 | 0.4426  0.7600  0.5552    | 0.3901  0.6646  0.4863
run2 | 0.4132  0.7250  0.5188    | 0.3909  0.6809  0.4888
run3 | 0.4484  0.7567  0.5559    | 0.3982  0.6732  0.4949
run5 | 0.4369  0.7617  0.5499    | 0.3915  0.6752  0.4897

4. CONCLUSION
In this paper we presented a generic, unsupervised framework for the evaluation of various visual and textual features, similarity metrics, and clustering approaches for the retrieval of diverse social images. The performed experiments aim at evaluating the potentials and limitations of the provided visual and textual descriptions; thus, we refrain from employing any assumptions or external sources of information. Although there are significant differences in the performances of single features, the top performing features prove to be highly interchangeable. The achieved results indicate that, for the given datasets, the crucial part of the process is not so much the diversification as the assessment of image relevance.

Acknowledgment
This work has been partly funded by the Vienna Science and Technology Fund (WWTF) through project ICT12-010.

5. REFERENCES
[1] D. Corney, C. Martin, A. Göker, E. S. Xioufis, S. Papadopoulos, Y. Kompatsiaris, L. M. Aiello, and B. Thomee. SocialSensor: Finding diverse images at MediaEval 2013. In MediaEval 2013 Workshop, 2013.
[2] B. Ionescu, A. Popescu, M. Lupu, A. L. Gînscă, and H. Müller. Retrieving diverse social images at MediaEval 2014: Challenge, dataset, and evaluation. In MediaEval 2014 Workshop, 2014.
[3] B. Ionescu, A. Popescu, H. Müller, M. Menéndez, and A.-L. Radu. Benchmarking result diversification in social image retrieval. In IEEE International Conference on Image Processing, 2014.
[4] N. Jain, J. Hare, S. Samangooei, J. Preston, J. Davies, D. Dupplaw, and P. H. Lewis. Experiments in diversifying Flickr result sets. In MediaEval 2013 Multimedia Benchmark Workshop, 2013.
[5] B. Vandersmissen, A. Tomar, F. Godin, W. D. Neve, and R. V. de Walle. Ghent University-iMinds at MediaEval 2013 Diverse Images: Relevance-based hierarchical clustering. In MediaEval 2013 Workshop, 2013.
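
To make the relevance ranking and the Round-Robin selection of Section 2 more concrete, the following is a minimal Python sketch and not the implementation used for the submitted runs. Several points are assumptions made only for illustration: each input row is taken to be the already concatenated descriptors of one image, the z-score statistics are estimated on the input images and reused for the Wikipedia vectors, the Euclidean distance stands in for any of the tested metrics, and the cluster labels are supplied by the caller instead of being computed by AHC. All function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def fit_zscore(rows):
    """Per-dimension mean and standard deviation (assumed: estimated on the input images)."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sigmas = [pstdev(c) or 1.0 for c in cols]  # avoid division by zero for constant dimensions
    return mus, sigmas


def apply_zscore(rows, mus, sigmas):
    """Standardize every row with the given statistics, in the spirit of Eq. (1)."""
    return [[(x - m) / s for x, m, s in zip(row, mus, sigmas)] for row in rows]


def euclidean(a, b):
    """Euclidean distance; any other metric with the same signature could be passed instead."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def relevance_scores(vectors, representatives, distance=euclidean):
    """Eq. (2): smallest distance from each image vector to any representative vector."""
    return [min(distance(v, r) for r in representatives) for v in vectors]


def round_robin(scores, labels, k):
    """Repeatedly take the best remaining image of every cluster until k images are selected."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    for members in clusters.values():
        members.sort(key=lambda i: scores[i])  # smallest distance (most relevant) first
    selected = []
    while len(selected) < k and any(clusters.values()):
        picks = [members.pop(0) for members in clusters.values() if members]
        picks.sort(key=lambda i: scores[i])  # order each round by ascending score
        selected.extend(picks)
    return selected[:k]


# Toy data (hypothetical): four images, one Wikipedia reference, two given clusters.
images = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0], [5.0, 5.0]]
wiki = [[0.0, 0.0]]
mus, sigmas = fit_zscore(images)
scores = relevance_scores(apply_zscore(images, mus, sigmas), apply_zscore(wiki, mus, sigmas))
top3 = round_robin(scores, labels=[0, 0, 1, 1], k=3)  # -> [0, 2, 1]
```

In the toy example the first round returns the most relevant image of each cluster (indices 0 and 2) and the second round fills the remaining slot with index 1, so the result alternates between clusters instead of returning the three globally closest images.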