<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ETH-CVL @ MediaEval 2015: Learning Objective Functions for Improved Image Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sai Srivatsa R</string-name>
          <email>saisrivatsan12@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Gygli</string-name>
          <email>gygli@vision.ee.ethz.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luc Van Gool</string-name>
          <email>vangool@vision.ee.ethz.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Vision Laboratory, ETH Zurich</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Indian Institute of Technology</institution>,
          <addr-line>Kharagpur</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>In this paper, we present a method to select a refined subset of images, given an initial list of retrieved images. The goal of any image retrieval system is to present results that are maximally relevant as well as diverse. We formulate this as a subset selection problem and address it using submodularity. In order to select the best subset, we learn an objective function as a linear combination of submodular functions. This objective quantifies how relevant and representative a selected subset is. Using this method, we obtain promising results at MediaEval 2015.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Image retrieval using text queries is a central topic in
multimedia retrieval. While early approaches relied solely on
the text associated with images, more recent approaches
combine textual and visual cues to return more relevant
results [
        <xref ref-type="bibr" rid="ref12 ref6">12, 6</xref>
        ]. Nonetheless, the search engines of photo-sharing
sites such as Flickr still retrieve results that are often
irrelevant and redundant. The MediaEval 2015 Retrieving
Diverse Social Images Task fosters research on improving the results
retrieved by Flickr. It asks the participants to develop
algorithms that refine a ranked list of photos retrieved from Flickr
using the photos' visual, textual and meta information. An
overview of the task is presented in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. METHODOLOGY</title>
      <p>We formulate the task of diversifying image retrieval
results as a subset selection problem. Given a set of retrieved
images $I = \{I_1, I_2, \ldots, I_n\}$ and a budget $B$, the task is to
find a subset $S \subseteq I$, $|S| = B$, such that $S$ is maximally
relevant as well as diverse. Such problems are usually solved
using a scoring function $F : 2^I \to \mathbb{R}$ that assigns a higher
score to diverse and relevant subsets. Letting $V$ be the power
set of $I$, we obtain the best subset $S^\ast$ by computing
$$S^\ast = \operatorname*{argmax}_{S \in V,\, |S| = B} F(S). \qquad (1)$$</p>
      <p>Evaluating the scores of all $2^n$ possible subsets is
intractable. We address this issue with submodularity.</p>
      <p>
        A set function $f(\cdot)$ is said to be submodular if
$$f(A \cup \{v\}) - f(A) \ge f(B \cup \{v\}) - f(B), \qquad (2)$$
where $A \subseteq B \subseteq V \setminus \{v\}$ and $V$ is the ground set of elements [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
Submodular functions naturally model properties such as
representativeness and relevance, as they exhibit a
diminishing-returns property.
      </p>
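      <p>To make the diminishing-returns property concrete, the following sketch (an illustration we add for exposition, not part of the original paper) checks Equation 2 for a simple coverage function, a classic monotone submodular function:</p>
      <preformat>
# Coverage: f(S) = number of distinct items covered by the sets
# indexed by S. Coverage functions are monotone submodular, so the
# marginal gain of an element can only shrink as the set grows.
def coverage(subsets, S):
    return len(set().union(*(subsets[i] for i in S))) if S else 0

subsets = {0: {1, 2}, 1: {2, 3}, 2: {3, 4, 5}}
A, B, v = {0}, {0, 1}, 2            # A is contained in B, v not in B
gain_A = coverage(subsets, A | {v}) - coverage(subsets, A)   # 5 - 2 = 3
gain_B = coverage(subsets, B | {v}) - coverage(subsets, B)   # 5 - 3 = 2
assert gain_A >= gain_B             # Equation 2 holds
      </preformat>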
      <p>
        If the scoring function is monotone submodular, we can
find a near-optimal solution to Equation 1 using greedy
submodular maximization methods [
        <xref ref-type="bibr" rid="ref10 ref5">10, 5</xref>
        ]. A linear
combination of submodular functions with non-negative weights is
still submodular. Thus we define our scoring function as
$$F(S) = w^T f(S), \qquad (3)$$
where $f(S) = [f_1(S), f_2(S), \ldots, f_k(S)]^T$ are normalized
monotone submodular functions and $w \in \mathbb{R}^k_+$ is a weight vector.
We learn these weights with sub-gradient descent [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], using the implementation of [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] for submodular maximization and for learning the weights.
      </p>
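      <p>As an illustration of how Equation 1 is then solved (a sketch we add here; the authors rely on the toolbox of [<xref ref-type="bibr" rid="ref3">3</xref>]), the plain greedy algorithm of [<xref ref-type="bibr" rid="ref10">10</xref>] applied to the weighted mixture looks as follows:</p>
      <preformat>
import numpy as np

def greedy_maximize(fs, w, ground_set, budget):
    """Greedily maximize F(S) = w^T f(S) subject to |S| = budget.

    fs: list of monotone submodular functions mapping a set of image
    indices to a score; w: non-negative weights. For monotone
    submodular F this achieves a (1 - 1/e) approximation [10].
    """
    S = set()
    F = lambda T: float(np.dot(w, [f(T) for f in fs]))
    for _ in range(budget):
        # Add the element with the largest marginal gain F(S + v) - F(S).
        best = max(ground_set - S, key=lambda v: F(S | {v}) - F(S))
        S.add(best)
    return S
      </preformat>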
    </sec>
    <sec id="sec-3">
      <title>2.1 Submodular Scoring Functions</title>
      <p>We use several submodular functions, aimed at
quantifying how relevant or diverse the selected subset is.</p>
      <p>
        Visual Representativeness. We define the
representativeness score as 1 - (k-medoid loss). The k-medoid loss of
a subset is obtained by summing the Euclidean
distances between each image of the query and its nearest
selected medoid (an image in the selected subset) in feature
space [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] (using CNN features [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]). The k-medoid loss is thus
minimal when the selected subset is representative, which
results in a higher representativeness score.
      </p>
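      <p>A minimal sketch of this score (our illustration; the exact normalization is an assumption, as the paper only states that the component functions are normalized):</p>
      <preformat>
import numpy as np

def representativeness(features, S):
    """1 - normalized k-medoid loss for the selected subset S.

    features: (n, d) array of CNN features for all images of a query;
    S: indices of the selected images, which act as the medoids.
    """
    medoids = features[list(S)]                              # (|S|, d)
    # Euclidean distance from every image to every selected medoid.
    dist = np.linalg.norm(features[:, None, :] - medoids[None, :, :], axis=-1)
    loss = dist.min(axis=1).sum()                            # k-medoid loss
    # Assumed normalizer: loss of the trivial "mean feature" solution.
    norm = np.linalg.norm(features - features.mean(axis=0), axis=1).sum()
    return 1.0 - loss / max(norm, 1e-12)
      </preformat>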
      <p>Visual Relevance. We use the relevance ground truth
provided for the devset topics to train a generic SVM on
CNN features, with the relevance ground truth as labels. The
relevance score of a subset is the number of images in the
subset that are predicted as relevant.</p>
      <p>
        Text Relevance. To obtain a text-based score for
an image, given a query, we use a bag-of-words model. We
represent the Wikipedia page associated with the query as a vector.
Similarly, each image is represented as a vector obtained by
encoding its title, tags and description (with the same relative
weighting as [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]). The text relevance of an image is
computed as its cosine similarity to the Wikipedia page, using
tf-idf weighting (we use the implementation provided in scikit-learn [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]). Finally, the text relevance score of a set of
images is simply the sum of the relevance scores of its individual
elements.
      </p>
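      <p>A sketch of the text score (our illustration; wiki_text and image_texts are assumed inputs, and the relative weighting of title, tags and description from [<xref ref-type="bibr" rid="ref13">13</xref>] is omitted):</p>
      <preformat>
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# wiki_text: text of the Wikipedia page associated with the query;
# image_texts: one string per image (title, tags and description).
vec = TfidfVectorizer()
X = vec.fit_transform([wiki_text] + image_texts)
scores = cosine_similarity(X[0], X[1:]).ravel()   # one score per image

def text_relevance(S):
    """Text relevance of a subset: sum over its images' scores."""
    return float(scores[list(S)].sum())
      </preformat>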
      <p>
        Flickr Ranks. For an image with Flickr rank $i$ in a topic
containing $n$ images, its Flickr score is given by $(n - i)/n$;
e.g. the top-ranked image ($i = 1$) of a topic with $n = 100$ images
scores $0.99$. The Flickr score of a subset is the sum of the
Flickr scores of the images it contains.
      </p>
      <p>[Figure 1: Distribution of the learnt weights per objective (e.g. Vis. Rel, Vis. Rep); x-axis: Weight, ranging from 0.0 to 0.6.]</p>
      <p>Time Representativeness. This function quantifies how
diverse the images are with respect to the time they were taken. Photos
taken at different times of the day, or during different
seasons, can also increase diversity. This
score is computed using the same k-medoid loss as for visual
representativeness, but using the timestamp as the feature
representation.</p>
    </sec>
    <sec id="sec-4">
      <title>2.2 Learning</title>
      <p>
        Using the relevance and cluster ground truth, for a given
query and a budget $B$, we construct a ground-truth subset
$S_t^{gt}$ for each query $t$ in the devset. To learn the weights,
we optimize the following large-margin formulation [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]:
      </p>
      <p>
        $$\min_{w \ge 0} \; \frac{1}{T} \sum_{t=1}^{T} \hat{L}_t(w) + \frac{\lambda}{2} \|w\|^2, \qquad (4)$$
where $T$ is the total number of queries in the devset and
$\hat{L}_t(w)$, the generalized hinge loss for training example $t$, is given by
$$\hat{L}_t(w) = \max_{S_t \in V_t} \big( F(S_t) + \ell(S_t) \big) - F(S_t^{gt}), \qquad (5)$$
where $\ell(\cdot)$ is the loss function. We use the F1-loss
($\ell(S_t) = |S_t| - F1(S_t)$). As the F1-loss is not
submodular, we use its (pointwise) modular approximation [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
We perform the optimization using sub-gradient descent [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]
with an adaptive learning rate [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
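      <p>A schematic of this learning loop, as a sketch under simplifying assumptions (our illustration, not the authors' code, which relies on [<xref ref-type="bibr" rid="ref3">3</xref>]). The helper loss_aug_greedy, which greedily maximizes the loss-augmented objective, is hypothetical; the adaptive step follows the AdaGrad scheme of [<xref ref-type="bibr" rid="ref2">2</xref>]:</p>
      <preformat>
import numpy as np

def learn_weights(queries, k, budget, lr=1.0, lam=1e-3, epochs=10):
    """Sub-gradient descent for Eq. 4 with an AdaGrad-style step.

    queries: iterable of (fs, V, S_gt) with fs the k component
    functions, V the ground set and S_gt the ground-truth subset.
    loss_aug_greedy (hypothetical helper) greedily maximizes the
    loss-augmented objective F(S) + loss(S) at the given budget.
    """
    w = np.ones(k) / k
    G = np.zeros(k)                     # accumulated squared gradients
    for _ in range(epochs):
        for fs, V, S_gt in queries:
            S_hat = loss_aug_greedy(fs, w, V, budget)
            f = lambda S: np.array([fi(S) for fi in fs])
            # Sub-gradient of the hinge loss in Eq. 5 plus the regularizer.
            grad = f(S_hat) - f(S_gt) + lam * w
            G += grad ** 2
            w = w - lr * grad / np.sqrt(G + 1e-8)
            w = np.maximum(w, 0.0)      # project back onto w >= 0
    return w
      </preformat>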
    </sec>
    <sec id="sec-5">
      <title>3. RESULTS AND DISCUSSION</title>
      <p>
        We evaluated our method on the MediaEval 2015 Retrieving
Diverse Social Images task [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The test data consists of 139 queries
with more than 40,000 images. It includes single-topic
(location) queries as well as multi-topic queries (events associated with
locations). In Fig. 2 we show the performance for different
configurations and varying budgets. The configurations are: (i)
Run 1 - Visual only, i.e. relevance prediction and
representativeness. (ii) Run 2 - Meta only: in this run we use only the
information associated with an image, but not the image
itself, i.e. text relevance, Flickr rank and time
representativeness. (iii) Run 3 - a combination of the
above-mentioned objectives. In Tab. 1 we provide the results
using the official performance metrics computed by [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The
distribution of the weights learnt for each shell is shown in
Fig. 1.
      </p>
      <p>[Table 1: Official results for Run 1 (visual), Run 2 (meta) and Run 3 (combined), reported for all, single-topic and multi-topic queries.]</p>
      <p>
        The visual run yields a higher cluster recall, while the
textual run yields a better precision. This suggests
that visual information is effective for diversifying the
retrieval results, while textual information is more effective
for retrieving relevant images. The lower precision of the
visual run is not surprising, as it only uses a generic relevance
prediction. While this allows us to filter out images of
people and several non-landmarks, it does not score relevance
in a query-specific way. In order to improve our visual
approach it is thus necessary to compute similarities between
text queries and images. This could be done by learning a
joint embedding of text and images, similar to e.g. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. We
also note that our method performs better on
the single-topic sets than on the multi-topic sets.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Donahue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hoffman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tzeng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Darrell</surname>
          </string-name>
          .
          <article-title>DeCAF: A deep convolutional activation feature for generic visual recognition</article-title>
          .
          <source>In International Conference on Machine Learning (ICML)</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Duchi</surname>
          </string-name>
          , E. Hazan, and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Singer</surname>
          </string-name>
          .
          <article-title>Adaptive subgradient methods for online learning and stochastic optimization</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Gygli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Grabner</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Van Gool</surname>
          </string-name>
          .
          <article-title>Video Summarization by Learning Submodular Mixtures of Objectives</article-title>
          .
          <source>In Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Ginsca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Boteanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Popescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lupu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          .
          <article-title>Retrieving diverse social images at MediaEval 2015: Challenge, dataset and evaluation</article-title>
          .
          <source>In Proceedings of MediaEval Benchmarking Initiative for Multimedia Evaluation</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krause</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Golovin</surname>
          </string-name>
          .
          <article-title>Submodular function maximization</article-title>
          .
          <source>In Tractability: Practical Approaches to Hard Problems</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Lew</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Sebe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Djeraba</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Jain</surname>
          </string-name>
          .
          <article-title>Content-based multimedia information retrieval: State of the art and challenges</article-title>
          .
          <source>ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM)</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Lin</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Bilmes</surname>
          </string-name>
          .
          <article-title>Learning mixtures of submodular shells with application to document summarization</article-title>
          .
          <source>In Uncertainty in Artificial Intelligence (UAI)</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Che</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Luo</surname>
          </string-name>
          .
          <article-title>Multi-task deep visual-semantic embedding for video thumbnail selection</article-title>
          .
          <source>In Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Narasimhan</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Bilmes</surname>
          </string-name>
          .
          <article-title>A submodular-supermodular procedure with applications to discriminative structure learning</article-title>
          .
          <source>In Uncertainty in Artificial Intelligence (UAI)</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>G. L.</given-names>
            <surname>Nemhauser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Wolsey</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Fisher</surname>
          </string-name>
          .
          <article-title>An analysis of approximations for maximizing submodular set functions - I</article-title>
          .
          <source>Mathematical Programming</source>
          ,
          <year>1978</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Varoquaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gramfort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thirion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Grisel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blondel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dubourg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vanderplas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Passos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cournapeau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brucher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Perrot</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Duchesnay</surname>
          </string-name>
          .
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Rui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. S.</given-names>
            <surname>Huang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.-F.</given-names>
            <surname>Chang</surname>
          </string-name>
          .
          <article-title>Image retrieval: Current techniques, promising directions, and open issues</article-title>
          .
          <source>Journal of Visual Communication and Image Representation</source>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>E.</given-names>
            <surname>Spyromitros-Xioufis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Papadopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Ginsca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Popescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kompatsiaris</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.</given-names>
            <surname>Vlahavas</surname>
          </string-name>
          .
          <article-title>Improving diversity in image search via supervised relevance scoring</article-title>
          .
          <source>In ACM International Conference on Multimedia Retrieval (ICMR)</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>