<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maia Zaharieva</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lukas Diem</string-name>
          <email>l.diem@univie.ac.at</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Interactive Media Systems Group, Vienna University of Technology</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Multimedia Information Systems Group, University of Vienna</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
<p>In this paper, we describe our approach for the MediaEval 2015 Retrieving Diverse Social Images Task. The proposed approach exploits available user-generated textual descriptions and the visual content of the images, in combination with common, unsupervised clustering techniques, in order to increase the diversification of retrieval results. Preliminary experiments indicate that the approach generalizes well to different datasets and achieves comparable results for single- and multi-topic queries.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
Manual assessment of the relevance of publicly available images to a particular query is not feasible due to the immense amount of data captured and shared daily on social media platforms. As a result, the automated optimization of image retrieval results is constantly gaining importance. Next to relevance, the diversification of retrieval results plays a crucial role in reducing the redundancy in the retrieved images and, thus, in increasing the efficiency of overviewing the underlying data. The MediaEval 2015 Retrieving Diverse Social Images Task [<xref ref-type="bibr" rid="ref4">4</xref>] addresses these challenges in the form of a tourist-oriented retrieval task, where the topics of interest represent sightseeing spots around the world. The aim of the task is to refine the set of images retrieved from Flickr while taking into account both the relevance and the diversity of the selected images.
      </p>
      <p>
Previous work in this context shows a broad range of possible approaches. The original Flickr ranking is commonly improved by a direct comparison with the corresponding Wikipedia images [<xref ref-type="bibr" rid="ref5">5</xref>][<xref ref-type="bibr" rid="ref8">8</xref>]. Other methods employ training with support vector machines (SVMs) [<xref ref-type="bibr" rid="ref6">6</xref>] or regression models [<xref ref-type="bibr" rid="ref3">3</xref>]. The diversification of retrieval results is usually approached by means of conventional clustering algorithms, such as k-means [<xref ref-type="bibr" rid="ref3">3</xref>][<xref ref-type="bibr" rid="ref6">6</xref>], hierarchical clustering [<xref ref-type="bibr" rid="ref1">1</xref>][<xref ref-type="bibr" rid="ref2">2</xref>], and random forests [<xref ref-type="bibr" rid="ref8">8</xref>], or by an ensemble of clustering approaches [<xref ref-type="bibr" rid="ref5">5</xref>].
      </p>
      <p>In this paper, we address relevance reranking by means of a similarity score to a reference set of images. This reference set is given by Wikipedia images (if available) or by the top-ranked images provided by Flickr. To increase diversification, we employ a hierarchical clustering algorithm and compare the performance of recently introduced, powerful visual features with text-based approaches, which are well established in the context of web mining and retrieval.</p>
    </sec>
    <sec id="sec-2">
      <title>2. APPROACH</title>
      <p>We employ a multi-stage workflow for the retrieval of diverse social images, which comprises the following steps: 1) data preprocessing, 2) relevance reranking, and 3) image clustering and final image selection.</p>
      <p>
In the first step, data preprocessing, we filter potentially irrelevant images, i.e., images with humans as the main subjects and images that are captured far away from the topic of interest. We employ the OpenCV face detector and remove images in which the detected faces exceed 5% of the total image area. Additionally, if GPS data is available, we measure the distance between the topic of interest and the corresponding images and remove those with a haversine distance [<xref ref-type="bibr" rid="ref7">7</xref>] greater than 100 km. The reason for this generous threshold is the underlying tourist application scenario, where the precision of a location's specification varies strongly, from a particular spot (e.g., the Tower Bridge in London) to large-scale locations such as national parks or entire cities.
      </p>
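      <p>To make these filters concrete, the following sketch shows one possible implementation, assuming OpenCV's pretrained frontal-face cascade and simple (latitude, longitude) tuples for the GPS metadata; the 5% face-area and 100 km thresholds are the values stated above, and the haversine formula follows [<xref ref-type="bibr" rid="ref7">7</xref>].</p>
      <preformat><![CDATA[
import math
import cv2  # OpenCV, used here only for its pretrained face detector

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two WGS84 coordinates, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Assumption: any OpenCV frontal-face cascade serves as the detector.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def keep_image(img_bgr, img_gps, topic_gps):
    """Reject images dominated by faces or captured too far from the topic."""
    h, w = img_bgr.shape[:2]
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray)
    if sum(fw * fh for (_, _, fw, fh) in faces) > 0.05 * w * h:
        return False  # detected faces cover more than 5% of the image area
    if img_gps is not None:  # GPS filter applies only when metadata exists
        if haversine_km(*img_gps, *topic_gps) > 100.0:
            return False  # farther than 100 km from the topic of interest
    return True
]]></preformat>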
      <p>
The aim of the second stage, relevance reranking, is to improve the original Flickr ranking. Since the provided Wikipedia images are by definition representative [<xref ref-type="bibr" rid="ref4">4</xref>], we measure the visual similarity between the images of a set and the associated Wikipedia images by means of the Euclidean distance between the corresponding adapted convolutional neural network (CNN) descriptors. If no Wikipedia images are provided for a given query, we consider the top 10 images from the original Flickr ranking as reference images. Subsequently, all images are reranked according to the achieved similarity score.
      </p>
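      <p>A minimal sketch of this reranking step, assuming the adapted CNN descriptors are available as row vectors of a NumPy matrix; since the paper does not state how distances to multiple reference images are aggregated, the minimum distance used below is our assumption.</p>
      <preformat><![CDATA[
import numpy as np

def rerank_by_reference(descriptors, reference_descriptors):
    """Rerank candidate images by visual similarity to a reference set.

    descriptors:           (n, d) CNN features of the candidate images.
    reference_descriptors: (m, d) features of the Wikipedia images or,
                           if none exist, of the top-10 Flickr results.
    Returns candidate indices sorted from most to least relevant.
    """
    # Euclidean distance from every candidate to every reference image.
    diffs = descriptors[:, None, :] - reference_descriptors[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)  # shape (n, m)
    # Score each candidate by its distance to the closest reference image
    # (aggregation by minimum is an assumption, see the text above).
    scores = dists.min(axis=1)
    return np.argsort(scores)  # ascending distance = descending relevance
]]></preformat>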
      <p>
In the third step, image clustering, we aim at finding groups of similar images that can be used to diversify the final image results. For the visual-based runs, preliminary experiments with the provided visual descriptors [<xref ref-type="bibr" rid="ref4">4</xref>] and different clustering algorithms (k-means, k-medoids, X-means, and agglomerative hierarchical clustering (AHC)) showed that the best-performing method for the development data combines the CNN visual feature with AHC. The final selection of images from the clusters follows a round-robin approach. We start by selecting the image with the best relevance score from each cluster. These images, sorted in ascending order of their relevance scores, constitute the m highest-ranked results, where m is the number of detected clusters. The selected images are removed from their corresponding clusters, and the selection process is repeated until the required number of retrieved results is reached. We employ Ward's aggregation method and limit the number of final clusters to 50, based on preliminary experiments.
      </p>
      <p>[Table 1: Preliminary runs on the development data, configured by modality (V, T, V,T), preprocessing (Flickr baseline, GPS filter, Face filter, Face+GPS filter), and feature (CNN, TF-IDF, LDA).]</p>
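      <p>The round-robin selection described above can be sketched as follows; the cluster labels and the relevance-sorted image ids are assumed to come from the clustering and reranking stages, and the helper is our illustration rather than the exact implementation.</p>
      <preformat><![CDATA[
def round_robin_select(ranked_ids, labels, k):
    """Select k images by interleaving clusters.

    ranked_ids: image ids sorted by relevance score (best first).
    labels:     dict mapping image id -> cluster label.
    """
    # Group images by cluster; within each cluster the relevance
    # order of ranked_ids is preserved.
    clusters = {}
    for img in ranked_ids:
        clusters.setdefault(labels[img], []).append(img)
    queues = list(clusters.values())
    selected = []
    while len(selected) < k and queues:
        # One pass picks the best remaining image of every cluster,
        # so the first m selections are the m cluster representatives.
        for q in list(queues):
            selected.append(q.pop(0))
            if not q:
                queues.remove(q)  # cluster exhausted
            if len(selected) == k:
                return selected
    return selected
]]></preformat>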
      <p>For the text-based runs, we consider two approaches. First, we perform topic modeling on the textual descriptions of each image (title and tags) using Latent Dirichlet Allocation (LDA) and the MALLET toolbox (http://mallet.cs.umass.edu) and extract T topics for the employed dataset. For each image, we estimate the likelihoods l1 and l2 of the first- and second-best matching topics. If the two likelihoods differ sufficiently, i.e., l2/l1 &lt; θ, the most likely topic is assigned to the photo; otherwise, no topic is assigned. We set T = 50 and θ = 0.8 for all experiments.</p>
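      <p>A sketch of this topic-assignment rule, with scikit-learn's LDA standing in for the MALLET toolbox so that the example is self-contained; T = 50 and θ = 0.8 as above, and the input documents are the concatenated titles and tags.</p>
      <preformat><![CDATA[
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def assign_topics(docs, n_topics=50, theta=0.8):
    """Assign each document its most likely topic only when the
    second-best topic is clearly less likely (l2 / l1 < theta)."""
    counts = CountVectorizer(stop_words="english").fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    doc_topic = lda.fit_transform(counts)  # per-document topic distribution
    assignments = []
    for dist in doc_topic:
        second, best = np.argsort(dist)[-2:]  # two best-matching topics
        l1, l2 = dist[best], dist[second]
        assignments.append(best if l2 / l1 < theta else None)
    return assignments
]]></preformat>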
      <p>The second text-based approach considers the well-established term frequency-inverse document frequency (TF-IDF). We compute the TF-IDF vector for each image using the complete textual description (title, tags, and description). The textual descriptions are first preprocessed to increase their expressiveness, i.e., we remove potential occurrences of the corresponding user name, web links, and stopwords, and we additionally stem all remaining terms. Furthermore, we account for images with missing textual descriptions. In such a case, we search for the temporally closest image with a description that is either captured within a predefined radius (10 meters in our experiments) or by the same user within a predefined short time span (e.g., 5 minutes). Subsequently, we cluster the resulting TF-IDF vectors, again using the AHC method, where the similarity between the TF-IDF vectors is measured by the cosine similarity. The selection of the final image set follows the round-robin approach described for the visual-based runs.</p>
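      <p>A condensed sketch of this text-based pipeline, assuming the descriptions are already cleaned as described; since Ward's aggregation requires Euclidean distances, the cosine-based variant below uses average linkage, which is a substitution on our part.</p>
      <preformat><![CDATA[
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_descriptions(descriptions, n_clusters=50):
    """Cluster images by the TF-IDF vectors of their textual descriptions."""
    vectors = TfidfVectorizer().fit_transform(descriptions).toarray()
    # AHC over cosine similarity between the TF-IDF vectors; average
    # linkage stands in for Ward's method, which needs Euclidean distances.
    ahc = AgglomerativeClustering(
        n_clusters=n_clusters, metric="cosine", linkage="average")
    return ahc.fit_predict(vectors)
]]></preformat>
      <p>The resulting cluster labels feed directly into the round-robin selection sketched above.</p>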
    </sec>
    <sec id="sec-3">
      <title>3. EXPERIMENTAL RESULTS</title>
      <p>Table 1 presents a selection of our preliminary experiments on the development dataset. The results show that the preprocessing step (face and GPS filters) only marginally improves the performance for the top 20 retrieved images in comparison to the Flickr baseline results. Nevertheless, 95% of the rejected images are irrelevant with respect to the underlying search query. Experiments with the text-based runs show only minor differences in the performance of the TF-IDF- and LDA-based methods. While the achieved precision (P@20) is comparable to that of the Flickr baseline, the cluster recall (CR@20) improves notably, e.g., from 0.34 to 0.46 using the TF-IDF approach.</p>
      <p>For the visual-based runs, the consideration of the relevance reranking step using the CNN features demonstrates a significant increase in relevance (P@20 score of 0.94). However, the drop in the cluster recall indicates an increase of redundancy in the retrieved images as a side effect. Overall, the best-performing text-based and visual-based runs are comparable in terms of F1@20, with the computational costs for the text-based runs being significantly lower. The multimodal runs additionally improve both the cluster recall and the F1-scores slightly, by approximately 1%. Surprisingly, the consideration of the reranking step in combination with the text-based image clustering and selection cannot compensate for the drop in the cluster recall.</p>
      <p>Following our preliminary experiments, we submitted four runs corresponding to the best configuration for the respective modality (see Table 2). Table 3 summarizes the results of the official runs on the test dataset. In contrast to the development data, which contains the retrieval results of single-topic queries only, the test data differentiates between single-topic (e.g., Niagara Falls) and multi-topic queries (e.g., Academy Awards in Hollywood). Overall, there is no significant difference in the performance for the two subsets. While the (predominantly) visual-driven runs (runs 1 and 3) show a slight decrease in the cluster recall for the multi-topic queries, the text-driven runs (runs 2 and 4) indicate the opposite trend. Furthermore, in contrast to the results on the development data, the test runs show a notable difference between the performance of the text-based and the visual-based runs. This reveals the better generalization ability of the visual-based runs to different datasets. Overall, the best performance in terms of an F1-score of 0.55 is achieved by the visual-based run that additionally considers the face and GPS filters to reject irrelevant images (run 3).</p>
    </sec>
    <sec id="sec-4">
      <title>4. CONCLUSION</title>
      <p>In this paper, we investigated both text- and visual-driven approaches for the diversification of Flickr image retrieval results. The achieved performance indicates that the visual-based approach copes well with different data and varying query types. Overall, the relevance reranking shows promising results in terms of precision. However, the diversification, as measured by the cluster recall, improves only slightly. Our future work will exploit the potential of combining features of different modalities in the clustering process, e.g., by means of a late fusion approach.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgment</title>
      <p>This work has been partly funded by the Vienna Science and
Technology Fund (WWTF) through project ICT12-010.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Castellanos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Garcia-Serrano</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <surname>J. Cigarran.</surname>
          </string-name>
          <article-title>UNED @ retrieving diverse social images task</article-title>
          .
          <source>In MediaEval Benchmark Workshop</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D</given-names>
            <surname>.-T.</surname>
          </string-name>
          Dang-Nguyen,
          <string-name>
            <given-names>L.</given-names>
            <surname>Piras</surname>
          </string-name>
          , G. Giacinto, G. Boato, and
          <string-name>
            <given-names>F. D.</given-names>
            <surname>Natale</surname>
          </string-name>
          .
          <article-title>Retrieval of diverse images by pre- ltering and hierarchical clustering</article-title>
          .
          <source>In MediaEval Benchmark Workshop</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Ginsca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Popescu</surname>
          </string-name>
          , and
          <string-name>
            <surname>N. Rekabsaz. CEA</surname>
          </string-name>
          <article-title>LIST's participation at the MediaEval 2014 retrieving diverse social images task</article-title>
          .
          <source>In MediaEval Benchmark Workshop</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. L. G</surname>
          </string-name>
          ^nsca^,
          <string-name>
            <given-names>B.</given-names>
            <surname>Boteanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Popescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lupu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Mu</surname>
          </string-name>
          <article-title>ller. Retrieving diverse social images at MediaEval 2015: Challenge, dataset and evaluation</article-title>
          .
          <source>In MediaEval Benchmark Workshop</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J. R. M.</given-names>
            <surname>Palotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rekabsaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lupu</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanbury</surname>
          </string-name>
          .
          <article-title>TUW @ retrieving diverse social images task 2014</article-title>
          . In MediaEval Benchmark Workshop,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Sarac</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Duygulu</surname>
          </string-name>
          .
          <article-title>Bilkent-RETINA at retrieving diverse social images task of MediaEval 2014</article-title>
          . In MediaEval Benchmark Workshop,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R. W.</given-names>
            <surname>Sinnott</surname>
          </string-name>
          .
          <article-title>Virtues of the haversine</article-title>
          .
          <source>Sky and Telescope</source>
          ,
          <volume>68</volume>
          (
          <issue>2</issue>
          ):
          <fpage>159</fpage>
          ,
          <year>1984</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.</given-names>
            <surname>Spampinato</surname>
          </string-name>
          and
          <string-name>
            <surname>S. Palazzo.</surname>
          </string-name>
          <article-title>PeRCeiVe@UNICT at MediaEval 2014 diverse images: Random forests for diversity-based clustering</article-title>
          .
          <source>In MediaEval Benchmark Workshop</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>