<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Unified Framework for Retrieving Diverse Social Images</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maia Zaharieva</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Patrick Schwab</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Interactive Media Systems, Institute of Software Technology and Interactive Systems, Vienna University of Technology</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Multimedia Information Systems, Faculty of Computer Science, University of Vienna</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>16</fpage>
      <lpage>17</lpage>
      <abstract>
<p>In this paper we explore the performance of a generic, unified framework for the retrieval of relevant and diverse images from social photo collections. The approach allows for the easy evaluation of different visual and textual image descriptions, clustering algorithms, and similarity metrics. Preliminary results show a strong dependence between the choice of underlying technology and similarity metric and the achieved performance.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The immense daily growth of publicly shared media bears
both potential and challenges for automated media analysis
and understanding. Currently, image access and retrieval are
usually based on user-provided and camera-generated
metadata. Although increasingly available, such metadata
often suffers from limitations such as imprecise capture time and
GPS information or misleading and ambiguous textual
descriptions. Additionally, the large number of shared items
often results in many visually highly similar images.
This challenge is the focus of the MediaEval 2014
Retrieving Diverse Social Images Task [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The aim of the task is
the refinement of location images retrieved from Flickr while
taking into consideration both their relevance and diversity.
      </p>
      <p>
        Previous work in the context of this task shows a broad
field of possible approaches ranging from re-ranking and
clustering to greedy optimization and graph representations
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Several authors propose different systems for different
feature types (e.g., [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ][
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]), which impedes reasoning about
the selection of an approach or of particular features.
Furthermore, some methods build upon assumptions that hold
true in a limited setting only (e.g., that the relevance of an image is
related to the number of views or the length of the
description [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]). While most of the presented approaches employ a
combination of a re-ranking (for relevance improvement) and
a clustering (for ensuring diversification) method, we build
a unified framework that allows for a thorough evaluation
of various textual and visual features, clustering algorithms,
and similarity metrics.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. APPROACH</title>
      <p>We employ a multi-stage approach for the retrieval of
diverse social images. The workflow passes through three main stages:
1) relevance ranking of the input images, 2) image clustering for
diversification, and 3) final image selection. The initial set
of input images may optionally be pre-processed in order to
filter out potentially irrelevant images, such as images with a
human as the main subject.</p>
      <p>
        In the first stage, relevance ranking, each image of the
input set is first represented by a feature vector ~v, where ~v
is the concatenation of the standardized z-scores of the feature
descriptors d1, ..., dn:
~v = (z(d1), ..., z(dn))   (1)
Since the provided Wikipedia photos are by definition
representative [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], we additionally compute a representative feature
vector ~vr for each referenced Wikipedia image. The
relevance score s of an image is then defined as the smallest
distance between its feature vector ~v and all ~vr from the set
of representative feature vectors W:
s = min_{~vr in W} distance(~v, ~vr)   (2)
      </p>
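The two formulas above can be sketched in a few lines (a minimal sketch using NumPy; the array shapes and the choice of Euclidean distance are illustrative assumptions, since the framework deliberately leaves the metric open):

```python
import numpy as np

def zscore_concat(descriptors):
    """Build ~v: concatenate the standardized z-scores of the
    feature descriptors d1, ..., dn (Eq. 1).

    descriptors: list of (n_images, dim_i) arrays, one per descriptor.
    """
    parts = []
    for d in descriptors:
        mu, sigma = d.mean(axis=0), d.std(axis=0)
        sigma[sigma == 0] = 1.0  # guard against constant features
        parts.append((d - mu) / sigma)
    return np.hstack(parts)

def relevance_scores(v, v_repr):
    """Relevance score s: smallest distance between an image vector
    and all representative Wikipedia vectors ~vr in W (Eq. 2)."""
    # pairwise Euclidean distances, shape (n_images, n_representatives)
    dists = np.linalg.norm(v[:, None, :] - v_repr[None, :, :], axis=2)
    return dists.min(axis=1)
```

A smaller score means the image sits closer to some representative photo, i.e. it is considered more relevant.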
      <p>The aim of the second stage, image clustering, is to find
groups of similar images that can be used to diversify the
final retrieval results. Note that the distance measures and
image features at this step are not necessarily the same ones
employed for relevance ranking.</p>
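As one concrete instance of this stage, agglomerative hierarchical clustering can be sketched with SciPy (a sketch only; average linkage and the distance cut-off are assumptions for illustration):

```python
from scipy.cluster.hierarchy import fcluster, linkage

def cluster_images(features, cut_distance):
    """Group images into clusters of similar items.

    features: (n_images, dim) array-like of image feature vectors;
    cut_distance: distance at which the dendrogram is cut.
    Returns one integer cluster label per image.
    """
    z = linkage(features, method="average", metric="euclidean")
    return fcluster(z, t=cut_distance, criterion="distance")
```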
      <p>The third and last stage of the approach, final image
selection, combines the results of the previous steps to retrieve
images that are both relevant and diverse with respect to the
initial image set. For this stage we use a round-robin
algorithm. We start by selecting the image with the best
relevance score from each cluster. These images, sorted in
ascending order of their relevance score, constitute the m highest ranked results,
where m is the number of detected clusters. The selected
images are removed from their corresponding clusters and
the selection process is repeated until the required number
of retrieved results is reached.</p>
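The round-robin selection described above is straightforward to state in code (a sketch; the image ids and the score-dictionary layout are illustrative assumptions):

```python
def round_robin_select(clusters, relevance, k):
    """Pick k images that are both relevant and diverse: repeatedly take
    the best remaining image (smallest relevance score) from each
    cluster in turn until k images are selected.

    clusters: list of lists of image ids; relevance: id -> score, where
    lower is better (e.g. distance to a representative Wikipedia image).
    """
    # each cluster sorted by ascending relevance score, best image first
    pools = [sorted(c, key=lambda i: relevance[i]) for c in clusters]
    selected = []
    while len(selected) < k and any(pools):
        for pool in pools:
            if pool and len(selected) < k:
                selected.append(pool.pop(0))
    return selected
```

The first pass over the pools yields exactly the m highest ranked results (one image per cluster) described above.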
      <p>
        In general, the clustering algorithm, the metric used to
compare the feature vectors, and the underlying image
features (for both image ranking and image clustering) are
up to choice. In our experiments we tested different
clustering algorithms: k-means, Adaptive Hierarchical
Clustering (AHC), MeanShift, and Lingo; several comparison
metrics: Euclidean, city-block, χ², cosine, correlation,
Mahalanobis, Spearman, Hamming, and Jaccard; and all visual
and textual features provided by the organizers [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]: term
frequency-inverse document frequency (TF-IDF), Color
Naming Histogram (CN), Histogram of Oriented Gradients
(HOG), Color Moments (CM), Local Binary Patterns (LBP),
Statistics of Gray Level Run Length Matrix (GLRLM), and
the corresponding spatial pyramid representations (3x3), in
addition to Bag-of-Visual-Words (BoVW) of dense SIFT
descriptors.
      </p>
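Because features and metrics are interchangeable in the framework, the comparison step reduces to a single parameterized call (a sketch; SciPy's metric names cover most of the list above):

```python
import numpy as np
from scipy.spatial.distance import cdist

def pairwise_distances(features, metric="euclidean"):
    """All-pairs distance matrix between image feature vectors under a
    chosen metric, e.g. 'cityblock', 'cosine', 'correlation',
    'mahalanobis', 'hamming', or 'jaccard'."""
    features = np.asarray(features, dtype=float)
    return cdist(features, features, metric=metric)
```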
    </sec>
    <sec id="sec-3">
      <title>3. EXPERIMENTS AND RESULTS</title>
      <p>In our first experiments we compared the performance of
the different clustering algorithms. Results on the
development dataset showed that AHC significantly outperforms
k-means, MeanShift, and Lingo for all explored features
(significance t-test, p &lt; 0.001). Thus, we employed AHC in all
follow-up experiments.</p>
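The pairwise comparison of clustering algorithms can be reproduced along these lines (a sketch with hypothetical per-query scores; the paper does not state the exact test variant, so the paired two-sided form is an assumption):

```python
from scipy.stats import ttest_rel

def significantly_better(scores_a, scores_b, alpha=0.001):
    """Paired t-test over per-query scores of two methods: True when
    method A's mean score is higher and the two-sided p-value falls
    below alpha."""
    t, p = ttest_rel(scores_a, scores_b)
    return bool(t > 0 and p < alpha)
```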
      <p>We conducted a thorough evaluation of the performance
of the employed features at the two main stages of our
approach: relevance ranking and image clustering. Table 1
summarizes the results by means of ranked feature lists. The
reported feature rankings and the selection of the
corresponding best performing distance measures are the product of
significance t-tests with overall p &lt; 0.003. While the Color
Naming Histograms (CN and CN3x3) are usually
outperformed by any other feature, the Color Moments (CM and
CM3x3) show robust performance in both the ranking and
the clustering tasks. In contrast to the ranking, which is
clearly dominated by the performance of CM3x3, TF-IDF,
and SIFT, image clustering using AHC is more robust, and
the difference in the performance of global and local features
decreases to a large extent.</p>
      <p>Eventually, we submitted four runs for the final
evaluation (see Table 2 for the configurations). Table 3 shows the
results of the submitted runs for both the development and test
datasets. The best performance is achieved by the
combination of textual and visual information (run3). However, on
the test dataset, the differences between the
performances of the different runs vanish. Overall, cluster
recall (CR) remains relatively low due to the large number
of irrelevant images forming noisy clusters. In general, the
achieved results outline the limitations of the available
textual (and visual) information in assessing image relevance.
This is mainly due to the fact that user-provided textual
descriptions on social media sites often contain ambiguous or
irrelevant information. A possible approach to improving the
results may consider the occasionally available GPS data and
employ external resources as an additional source of
information.</p>
      <table-wrap id="tab3">
        <label>Table 3</label>
        <caption>
          <p>Results of the four submitted runs on the development and test datasets.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Run</th>
              <th>Dev CR@20</th>
              <th>Dev P@20</th>
              <th>Dev F1@20</th>
              <th>Test CR@20</th>
              <th>Test P@20</th>
            </tr>
          </thead>
          <tbody>
            <tr><td>run1</td><td>0.4426</td><td>0.7600</td><td>0.5552</td><td>0.3901</td><td>0.6646</td></tr>
            <tr><td>run2</td><td>0.4132</td><td>0.7250</td><td>0.5188</td><td>0.3909</td><td>0.6809</td></tr>
            <tr><td>run3</td><td>0.4484</td><td>0.7567</td><td>0.5559</td><td>0.3982</td><td>0.6732</td></tr>
            <tr><td>run4</td><td>0.4369</td><td>0.7617</td><td>0.5499</td><td>0.3915</td><td>0.6752</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-4">
      <title>4. CONCLUSION</title>
      <p>In this paper we presented a generic, unsupervised
framework for the evaluation of various visual and textual
features, similarity metrics, and clustering approaches for the
retrieval of diverse social images. The performed experiments
aim at evaluating the potential and limitations of
the provided visual and textual descriptions, and thus we
refrain from employing any assumptions or external sources
of information. Although there are significant differences in
the performance of single features, the top performing
features prove to be highly interchangeable. The achieved results
indicate that - for the given datasets - the crucial part of
the process is not so much the diversification but rather the
assessment of image relevance.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgment</title>
      <p>This work has been partly funded by the Vienna Science and
Technology Fund (WWTF) through project ICT12-010.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>D.</given-names> <surname>Corney</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Martin</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Goker</surname></string-name>,
          <string-name><given-names>E. S.</given-names> <surname>Xioufis</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Papadopoulos</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Kompatsiaris</surname></string-name>,
          <string-name><given-names>L. M.</given-names> <surname>Aiello</surname></string-name>, and
          <string-name><given-names>B.</given-names> <surname>Thomee</surname></string-name>.
          <article-title>SocialSensor: Finding diverse images at MediaEval 2013</article-title>.
          <source>In MediaEval 2013 Workshop</source>,
          <year>2013</year>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>B.</given-names> <surname>Ionescu</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Popescu</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Lupu</surname></string-name>,
          <string-name><given-names>A. L.</given-names> <surname>Gînscă</surname></string-name>, and
          <string-name><given-names>H.</given-names> <surname>Müller</surname></string-name>.
          <article-title>Retrieving diverse social images at MediaEval 2014: Challenge, dataset, and evaluation</article-title>.
          <source>In MediaEval 2014 Workshop</source>,
          <year>2014</year>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>B.</given-names> <surname>Ionescu</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Popescu</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Müller</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Menendez</surname></string-name>, and
          <string-name><given-names>A.-L.</given-names> <surname>Radu</surname></string-name>.
          <article-title>Benchmarking Result Diversification in Social Image Retrieval</article-title>.
          <source>IEEE International Conference on Image Processing</source>,
          <year>2014</year>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hare</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Samangooei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Preston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Davies</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dupplaw</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P. H.</given-names>
            <surname>Lewis</surname>
          </string-name>
          .
          <article-title>Experiments in diversifying Flickr result sets</article-title>
          .
          <source>In MediaEval 2013 Multimedia Benchmark Workshop</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Vandersmissen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tomar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Godin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. D.</given-names>
            <surname>Neve</surname>
          </string-name>
          , and
          <string-name><given-names>R. V.</given-names> <surname>de Walle</surname></string-name>.
          <article-title>Ghent University-iMinds at MediaEval 2013 Diverse Images: Relevance-Based Hierarchical Clustering</article-title>
          .
          <source>In MediaEval 2013 Workshop</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>