MIS @ Retrieving Diverse Social Images Task 2015

Maia Zaharieva(1,2) and Lukas Diem(2)
(1) Interactive Media Systems Group, Vienna University of Technology, Austria
(2) Multimedia Information Systems Group, University of Vienna, Austria
maia.zaharieva@[tuwien|univie].ac.at, l.diem@univie.ac.at

Copyright is held by the author/owner(s).
MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany

ABSTRACT
In this paper, we describe our approach for the MediaEval 2015 Retrieving Diverse Social Images Task. The proposed approach exploits available user-generated textual descriptions and the visual content of the images in combination with common, unsupervised clustering techniques in order to increase the diversification of retrieval results. Preliminary experiments indicate that the approach generalizes well across datasets and achieves comparable results for single- and multi-topic queries.

1. INTRODUCTION
Manual assessment of the relevance of publicly available images to a particular query is not feasible due to the immense amount of data captured and shared daily on social media platforms. As a result, the automated optimization of image retrieval results is constantly gaining in importance. Next to relevance, the diversification of retrieval results plays a crucial role in reducing the redundancy in the retrieved images and, thus, increasing the efficiency of overviewing the underlying data. The MediaEval 2015 Retrieving Diverse Social Images Task [4] addresses these challenges in the form of a tourist-oriented retrieval task, where the topics of interest represent sightseeing spots around the world. The aim of the task is to refine the set of images retrieved from Flickr while taking into account both the relevance and the diversity of the selected images.

Previous work in this context shows a broad range of possible approaches. The original Flickr ranking is commonly improved by a direct comparison with the corresponding Wikipedia images [5][8]. Other methods employ training with support vector machines (SVM) [6] or regression models [3]. The diversification of retrieval results is usually approached by means of conventional clustering algorithms, such as k-means [3][6], hierarchical clustering [1][2], and random forests [8], or by an ensemble of clustering approaches [5].

In this paper, we address relevance re-ranking by means of a similarity score to a reference set of images. This reference set is given by Wikipedia images (if available) or by the top-ranked images provided by Flickr. To increase diversification, we employ a hierarchical clustering algorithm and compare the performance of recently introduced, powerful visual features with text-based approaches, which are well established in the context of web mining and retrieval.

2. APPROACH
We employ a multi-stage workflow for the retrieval of diverse social images, which comprises the following steps: 1) data preprocessing, 2) relevance reranking, and 3) image clustering and final image selection.

In the first step, data preprocessing, we filter potentially irrelevant images, i.e., images with humans as the main subjects and images that are captured far away from the topic of interest. We employ the OpenCV (http://opencv.org) face detector and remove images with faces whose area exceeds 5% of the total image area. Additionally, if GPS data is available, we measure the distance between the topic of interest and the corresponding images and remove those with a Haversine distance [7] greater than 100 km. The reason for this strict threshold is the underlying tourist application scenario, where the precision of a location's specification varies strongly, from a particular spot (e.g., the Tower Bridge in London) to large-scale locations such as national parks or entire cities.

The aim of the second stage, relevance reranking, is to improve the original Flickr rating. Since the provided Wikipedia images are by definition representative [4], we measure the visual similarity between the images of a set and the associated Wikipedia images by means of the Euclidean distance between the corresponding adapted convolutional neural network (CNN) based descriptors. If no Wikipedia images are provided for a given query, we consider the top 10 images from the original Flickr ranking as reference images. Subsequently, all images are reranked according to the achieved similarity score.

In the third step, image clustering, we aim at finding groups of similar images which can be used to diversify the final image results. For the visual-based runs, preliminary experiments with the provided visual descriptors [4] and different clustering algorithms (k-means, k-medoids, X-means, and agglomerative hierarchical clustering (AHC)) showed that the best-performing method for the development data uses CNN as a visual feature and the AHC clustering method. The final selection of images from the clusters follows a round-robin approach. We start by selecting the image with the best relevance score from each cluster. These images, sorted in ascending order, constitute the m highest ranked results, where m is the number of detected clusters. The selected images are removed from their corresponding clusters and the selection process is repeated until the required number of retrieved results is achieved. We employ Ward's aggregation method and limit the number of final clusters to 50 based on preliminary experiments.

For the text-based runs we consider two approaches. First, we perform topic modeling on the textual descriptions of each image (title and tags) using Latent Dirichlet Allocation (LDA) and the MALLET Toolbox (http://mallet.cs.umass.edu) and extract T topics for the employed dataset. For each image, we estimate the likelihoods l1 and l2 of the first- and second-best matching topics. If the ratio of the likelihoods is below a threshold τ (l2/l1 < τ), the most likely topic (l1) is assigned to the photo; otherwise, no topic is assigned. We set T = 50 and τ = 0.8 for all experiments.

The second text-based approach considers the well-established term frequency-inverse document frequency (TF-IDF). We compute the TF-IDF vector for each image using the complete textual description (title, tags, and descriptions). The textual descriptions are first preprocessed to increase their expressiveness, i.e., we remove potential occurrences of the corresponding user name, web links, and stopwords, and we additionally stem all remaining terms. Furthermore, we account for images with missing textual descriptions. In such a case, we search for the temporally closest image with a description, which is either captured within a predefined radius (10 meters in our experiments) or by the same user within a predefined short time span (e.g., 5 minutes). Subsequently, we cluster the resulting TF-IDF vectors, again using the AHC method, where the similarity between the TF-IDF vectors is measured by the cosine similarity. The selection of the final image set follows the round-robin approach as described for the visual-based approach.

3. EXPERIMENTAL RESULTS
Table 1 presents a selection of our preliminary experiments on the development dataset. The results show that the preprocessing step (face and GPS filter) only marginally improves the performance for the top 20 retrieved images in comparison to the Flickr baseline results. Nevertheless, 95% of the rejected images are irrelevant with respect to the underlying search query. Experiments with the text-based runs show only minor differences in the performance of the TF-IDF and the LDA-based methods. While the achieved precision (P@20) is comparable to that of the Flickr baseline, the cluster recall (CR@20) improves notably, e.g., from 0.34 to 0.46 using the TF-IDF approach. For the visual-based runs, the relevance reranking step using the CNN features yields a significant increase in relevance (P@20 score of 0.94). However, the drop in the cluster recall indicates an increase of redundancy in the retrieved images as a side effect. Overall, the best-performing text-based and visual-based runs are comparable in terms of F1@20, with the computational costs for the text-based runs being significantly lower. The multimodal runs additionally improve both the cluster recall and the F1-scores slightly, by approximately 1%. Surprisingly, the reranking step in combination with the text-based image clustering and selection cannot compensate for the drop in the cluster recall.

Table 1: Experimental results on the development dataset in terms of precision (P@20), cluster recall (CR@20), and F1-score (F1@20). Employed runs consider visual (V) and/or textual (T) information.

Data  Preprocessing    Relevance reranking  Image clustering  P@20   CR@20  F1@20
–     Flickr baseline  –                    –                 0.812  0.343  0.471
T     GPS filter       –                    –                 0.820  0.350  0.478
V     Face filter      –                    –                 0.816  0.349  0.478
V,T   Face+GPS filter  –                    –                 0.825  0.355  0.485
T     –                –                    TF-IDF            0.784  0.455  0.569
T     GPS filter       –                    TF-IDF            0.799  0.462  0.577
T     –                –                    LDA               0.798  0.420  0.542
T     GPS filter       –                    LDA               0.815  0.429  0.553
V     –                CNN                  –                 0.936  0.282  0.420
V     –                –                    CNN               0.783  0.437  0.553
V     –                CNN                  CNN               0.831  0.454  0.578
V     Face filter      CNN                  CNN               0.835  0.461  0.584
V,T   Face+GPS filter  –                    TF-IDF            0.819  0.464  0.584
V,T   Face+GPS filter  CNN                  TF-IDF            0.925  0.318  0.460
V,T   Face+GPS filter  –                    LDA               0.830  0.437  0.564
V,T   Face+GPS filter  CNN                  LDA               0.933  0.318  0.459
V,T   Face+GPS filter  CNN                  CNN               0.849  0.468  0.593

Following our preliminary experiments, we submitted four runs corresponding to the best configuration for the respective modality (see Table 2). Table 3 summarizes the results of the official runs on the test dataset. In contrast to the development data, which contains the retrieval results of single-topic queries only, the test data differentiates between single-topic (e.g., Niagara Falls) and multi-topic queries (e.g., Academy Awards in Hollywood). Overall, there is no significant difference in the performance for the two subsets. While the (predominantly) visual-driven runs (runs 1 and 3) show a slight decrease in the cluster recall for the multi-topic queries, the text-driven runs (runs 2 and 5) indicate the opposite trend. Furthermore, in contrast to the results on the development data, the test runs show a notable difference between the performance of the text- and the visual-based runs. This reveals the better generalization ability of the visual-based runs to different datasets. Overall, the best performance in terms of F1-score (0.55) is achieved by the visual-based run which additionally considers the face and GPS filters to reject irrelevant images (run 3).

Table 2: Official runs configurations.

Run      Data preprocessing  Relevance reranking  Image clustering
1 (V)    Face filter         CNN                  CNN
2 (T)    GPS filter          –                    TF-IDF
3 (V,T)  Face+GPS filter     CNN                  CNN
5 (V,T)  Face+GPS filter     –                    TF-IDF

Table 3: MediaEval 2015 Benchmark results. Bold values indicate best values in terms of F1-score for the different types of test data.

       single-topic           multi-topic            overall
Run    P@20   CR@20  F1@20    P@20   CR@20  F1@20    P@20   CR@20  F1@20
1      0.779  0.450  0.552    0.763  0.434  0.534    0.771  0.442  0.543
2      0.690  0.407  0.501    0.697  0.434  0.519    0.694  0.421  0.511
3      0.794  0.455  0.560    0.764  0.435  0.535    0.778  0.445  0.547
5      0.688  0.409  0.501    0.699  0.429  0.517    0.694  0.419  0.509

4. CONCLUSION
In this paper we investigated both text- and visual-driven approaches for the diversification of Flickr image retrieval results. The achieved performances indicate that the visual-based approach copes well with different data and varying query types. Overall, the relevance reranking shows promising results in terms of precision. However, the diversification, measured by the cluster recall, increases only slowly. Our future work will exploit the potential of combining features of different modalities in the clustering process, e.g., by means of a late fusion approach.

Acknowledgment
This work has been partly funded by the Vienna Science and Technology Fund (WWTF) through project ICT12-010.

5. REFERENCES
[1] A. Castellanos, A. Garcia-Serrano, and J. Cigarran. UNED @ retrieving diverse social images task. In MediaEval Benchmark Workshop, 2014.
[2] D.-T. Dang-Nguyen, L. Piras, G. Giacinto, G. Boato, and F. D. Natale. Retrieval of diverse images by pre-filtering and hierarchical clustering. In MediaEval Benchmark Workshop, 2014.
[3] A. L. Ginsca, A. Popescu, and N. Rekabsaz. CEA LIST's participation at the MediaEval 2014 retrieving diverse social images task. In MediaEval Benchmark Workshop, 2014.
[4] B. Ionescu, A. L. Gînscă, B. Boteanu, A. Popescu, M. Lupu, and H. Müller. Retrieving diverse social images at MediaEval 2015: Challenge, dataset and evaluation. In MediaEval Benchmark Workshop, 2015.
[5] J. R. M. Palotti, N. Rekabsaz, M. Lupu, and A. Hanbury. TUW @ retrieving diverse social images task 2014. In MediaEval Benchmark Workshop, 2014.
[6] M. I. Sarac and P. Duygulu. Bilkent-RETINA at retrieving diverse social images task of MediaEval 2014. In MediaEval Benchmark Workshop, 2014.
[7] R. W. Sinnott. Virtues of the haversine. Sky and Telescope, 68(2):159, 1984.
[8] C. Spampinato and S. Palazzo. PeRCeiVe@UNICT at MediaEval 2014 diverse images: Random forests for diversity-based clustering. In MediaEval Benchmark Workshop, 2014.
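The GPS-based filter of the data preprocessing step (Section 2) rejects images whose Haversine distance [7] to the topic location exceeds 100 km. A minimal sketch of this filter; the Haversine formula is standard, while the image record fields (`lat`, `lon`) and the `gps_filter` helper name are illustrative assumptions, not part of the original system:

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def gps_filter(images, topic_lat, topic_lon, max_km=100.0):
    """Keep images without GPS data or within max_km of the topic location."""
    kept = []
    for img in images:
        if img.get("lat") is None or img.get("lon") is None:
            kept.append(img)  # no GPS metadata: the distance filter cannot apply
        elif haversine_km(img["lat"], img["lon"], topic_lat, topic_lon) <= max_km:
            kept.append(img)
    return kept
```

The generous 100 km threshold mirrors the paper's tourist scenario, where a topic may denote anything from a single spot to an entire national park.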
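The relevance reranking stage orders images by their Euclidean distance to a reference set (Wikipedia images, or the top-10 Flickr images). A sketch under one stated assumption: the paper does not specify how distances to multiple reference descriptors are aggregated, so taking the minimum over the reference set is our choice here, not the authors':

```python
def rerank_by_reference(descriptors, references):
    """Return image indices ordered by increasing minimum Euclidean
    distance to the reference descriptors (most relevant first).

    descriptors, references: lists of equal-length feature vectors
    (e.g., the CNN-based descriptors used in the paper).
    """
    def min_dist(d):
        # Smallest Euclidean distance from d to any reference vector.
        return min(sum((a - b) ** 2 for a, b in zip(d, r)) ** 0.5
                   for r in references)
    return sorted(range(len(descriptors)), key=lambda i: min_dist(descriptors[i]))
```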
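The round-robin selection over the AHC clusters (Section 2) repeatedly takes the most relevant remaining image from each cluster until the required number of results is reached. A sketch assuming a distance-based relevance score where lower is better (consistent with the paper's ascending sort); the data layout is illustrative:

```python
def round_robin_select(clusters, relevance, k):
    """Select k image ids from clusters in round-robin fashion.

    clusters:  list of lists of image ids (one list per cluster)
    relevance: dict mapping image id -> relevance score (lower = better)
    """
    # Sort each cluster once so the best-scored image is always at the front.
    pools = [sorted(c, key=lambda i: relevance[i]) for c in clusters]
    selected = []
    while len(selected) < k and any(pools):
        # One pass over the clusters = one round of the round robin.
        for pool in pools:
            if pool and len(selected) < k:
                selected.append(pool.pop(0))
    return selected
```

The first pass yields m images (one per non-empty cluster), matching the paper's description that the m cluster representatives head the final ranking.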
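The LDA topic-assignment rule of the text-based approach (assign the most likely topic only if l2/l1 < τ, with τ = 0.8) can be written directly. A minimal sketch; the input is assumed to be a plain list of per-topic likelihoods for one image:

```python
def assign_topic(topic_likelihoods, tau=0.8):
    """Assign the best-matching topic index only if the second-best topic
    is clearly less likely (l2 / l1 < tau); otherwise return None."""
    ranked = sorted(range(len(topic_likelihoods)),
                    key=lambda t: topic_likelihoods[t], reverse=True)
    l1 = topic_likelihoods[ranked[0]]  # best-matching topic likelihood
    l2 = topic_likelihoods[ranked[1]]  # second-best topic likelihood
    return ranked[0] if l2 / l1 < tau else None
```

Leaving ambiguous images unassigned keeps topics "clean": an image whose two best topics are nearly equally likely contributes to neither.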