<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Ghent University-iMinds at MediaEval 2014 Diverse Images: Adaptive Clustering with Deep Features</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Abhineshwar Tomar</string-name>
          <xref ref-type="aff" rid="aff0">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Baptist Vandersmissen</string-name>
          <xref ref-type="aff" rid="aff0">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wesley De Neve</string-name>
          <email>wesley.deneve@ugent.be</email>
          <xref ref-type="aff" rid="aff0">1</xref>
          <xref ref-type="aff" rid="aff1">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rik Van de Walle</string-name>
          <xref ref-type="aff" rid="aff0">1</xref>
        </contrib>
        <aff id="aff0">
          <label>1</label>
          <institution>Multimedia Lab, ELIS, Ghent University - iMinds</institution>
          ,
          <addr-line>Ghent</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
        <aff id="aff1">
          <label>2</label>
          <institution>Image and Video Systems Lab, KAIST</institution>
          ,
          <addr-line>Daejeon</addr-line>
          ,
          <country country="KR">South Korea</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>16</fpage>
      <lpage>17</lpage>
      <abstract>
        <p>In this paper, we attempt to tackle the MediaEval 2014 Retrieving Diverse Social Images challenge, a filter and refinement problem defined for a Flickr-based ranked set of social images. We build upon solutions proposed in [5] and mainly focus on exploiting the joint use of all modalities. The use of image features extracted from a deep convolutional neural network, combined with the use of distributed word representations, forms the basis of our approach.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        In this paper, we describe our approach for tackling the
MediaEval 2014 Retrieving Diverse Social Images Task [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
This task focuses on result diversification in the context of
image retrieval. We refer to [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] for a complete task overview.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. METHODOLOGY</title>
      <p>This section describes four different approaches created to
solve the aforementioned challenge. The approach used in
the last run uses external data sources; all other approaches
exclusively use data provided by the task organizers. We
focused on two parts: estimating the relevance of an image with
respect to a specific location, and estimating the similarity
between a pair of images. In particular, runs 2, 3, and 5 build
upon these parts.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Run 1: Visual-only</title>
      <p>
        We propose a hierarchical clustering-based approach for
the ranking of images in accordance with their relevance
and diversity for a specific location. We used the approach
proposed in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] (cf. "Visual run").
      </p>
    </sec>
    <sec id="sec-4">
      <title>2.2 Run 2: Textual-only</title>
      <p>
        The textual run makes use of information derived from the
provided tags and other textual metadata. This approach
aims at diversifying the results by optimizing an adapted
performance metric. We modified both the relevance and the
diversity estimation of the algorithm proposed in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] (cf.
"Textual run"), as presented in the following sections.
      </p>
      <sec id="sec-4-1">
        <title>2.2.1 Relevance Estimation</title>
        <p>The relevance of an image x is estimated as</p>
        <disp-formula id="eq1">
          <tex-math><![CDATA[Rel(x) = \alpha \cdot tags(x) + \beta \cdot \frac{1}{flickr(x)} \qquad (1)]]></tex-math>
        </disp-formula>
        <p>with α and β representing scalars,</p>
        <disp-formula id="eq2">
          <tex-math><![CDATA[tags(x) = \frac{|\{t \mid t \in T_x,\ tfidf_t > \gamma\}|}{|T_x|} \cdot \sum_{t \in T_x} tfidf_t \qquad (2)]]></tex-math>
        </disp-formula>
        <p>and flickr(x) denoting the original Flickr ranking of image
x. The TF-IDF score of tag t is denoted by tfidf<sub>t</sub>. The
tag score (cf. Equation 2) is the sum of each tag's
normalized TF-IDF score, multiplied by the relative number of
high-scoring tags. In our approach, γ is set to the average
TF-IDF score. This benefits images with a higher number of
more relevant tags.</p>
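        <p>As an illustration, a minimal Python sketch of Equations (1) and (2) could look as follows; the helper names and the default values for α and β are assumptions for illustration, as the scalars are left unspecified:</p>
        <preformat><![CDATA[
# Sketch of Equations (1)-(2); `tfidf` maps each tag of image x to its
# TF-IDF score, and `flickr_rank` is the 1-based original Flickr rank.
def tag_score(tfidf: dict) -> float:
    """tags(x): sum of TF-IDF scores, weighted by the fraction of
    high-scoring tags; the threshold gamma is the average TF-IDF score."""
    if not tfidf:
        return 0.0
    gamma = sum(tfidf.values()) / len(tfidf)
    high = sum(1 for s in tfidf.values() if s > gamma)
    return (high / len(tfidf)) * sum(tfidf.values())

def relevance(tfidf: dict, flickr_rank: int,
              alpha: float = 1.0, beta: float = 1.0) -> float:
    """Rel(x) = alpha * tags(x) + beta / flickr(x); alpha and beta are
    unspecified scalars, so the defaults here are arbitrary."""
    return alpha * tag_score(tfidf) + beta * (1.0 / flickr_rank)
]]></preformat>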
      </sec>
      <sec id="sec-4-2">
        <title>2.2.2 Diversity Estimation</title>
        <p>Estimating the semantic difference between two images is
based on the number of shared tags. Let x and y denote two
images, with T<sub>x</sub> and T<sub>y</sub> denoting their sets of tags,
respectively. The diversity is then calculated as follows:</p>
        <disp-formula id="eq3">
          <tex-math><![CDATA[Div(x, y) = 1 - \frac{|T_x \cap T_y|}{\max(|T_x|, |T_y|)} \qquad (3)]]></tex-math>
        </disp-formula>
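        <p>The same measure, written as a small Python function (the edge case of two untagged images is our own assumption, as it is not covered above):</p>
        <preformat><![CDATA[
def diversity(tags_x: set, tags_y: set) -> float:
    """Div(x, y) = 1 - |Tx intersect Ty| / max(|Tx|, |Ty|), cf. Equation (3)."""
    if not tags_x and not tags_y:
        return 1.0  # assumption: untagged images are treated as fully diverse
    return 1.0 - len(tags_x & tags_y) / max(len(tags_x), len(tags_y))
]]></preformat>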
      </sec>
    </sec>
    <sec id="sec-5">
      <title>2.3 Run 3: Visual and Textual</title>
      <p>
        The fusion of both visual and textual information results
in a relevance-based clustering approach (cf. "Combined
run" in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]). We modified the clustering technique to
adaptive hierarchical clustering: the optimal distance at which to form
clusters is determined by finding the "knee" point in the plot
of the number of clusters versus the inter-cluster distance
(similar to [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]). To estimate the relevance of an image, we use
our textual-only method (cf. Section 2.2.1). The diversity
between two images is estimated based on the Euclidean
distance between their visual descriptors, which are represented
by CN3x3 and LBP3x3 vectors [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
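      <p>A hypothetical sketch of the adaptive cut, using SciPy's hierarchical clustering; only the idea of cutting at the knee of the clusters-versus-distance curve (similar to [3]) comes from the text, so the maximum-curvature heuristic below is our own simplification:</p>
      <preformat><![CDATA[
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def adaptive_clusters(features: np.ndarray) -> np.ndarray:
    """Cluster visual descriptors hierarchically and cut the dendrogram
    at the "knee" of the (number of clusters, inter-cluster distance) curve."""
    Z = linkage(features, method="average", metric="euclidean")
    merge_dists = Z[:, 2]  # inter-cluster distance at each successive merge
    if len(merge_dists) <= 2:
        return fcluster(Z, t=merge_dists[-1], criterion="distance")
    # knee approximated as the point of maximum curvature (second difference)
    knee = np.argmax(np.diff(merge_dists, 2)) + 1
    return fcluster(Z, t=merge_dists[knee], criterion="distance")
]]></preformat>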
    </sec>
    <sec id="sec-6">
      <title>2.4 Run 5: External Sources</title>
      <p>The algorithm used to produce the fifth run is based on
the one used in Section 2.3. Both the relevance and the diversity
estimation components are adapted, as described below.</p>
      <sec id="sec-6-1">
        <title>2.4.1 Relevance Estimation</title>
        <p>In order to accurately estimate the relevance of an image,
a well-defined target location is necessary. Thus, each
location is first described in both a textual and a visual manner.</p>
        <p>To create this textual identity, related information for each
location is extracted from DBpedia. From this information,
textual keywords are extracted and combined with the top
k most frequently occurring tags in the set of images of a
location. The visual identity is formed on the basis of a
set of representative photos, retrieved via Wikipedia. The
relevance of an image is calculated based on a linear
combination of the following three factors: textual relevance,
visual relevance, and Flickr relevance.</p>
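        <p>A small sketch of how the textual identity could be assembled; the variable names are assumptions, and the DBpedia keywords are assumed to have been extracted beforehand (the value of k is not specified here):</p>
        <preformat><![CDATA[
from collections import Counter

def textual_identity(image_tag_sets: list, dbpedia_keywords: set,
                     k: int = 10) -> set:
    """Combine DBpedia-derived keywords with the top-k most frequently
    occurring tags over the location's image set."""
    counts = Counter(t for tags in image_tag_sets for t in tags)
    top_k = {t for t, _ in counts.most_common(k)}
    return dbpedia_keywords | top_k
]]></preformat>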
        <p>The textual relevance of an image is entirely based on its
tags. Again, assume that T<sub>x</sub> denotes the set of tags of
image x and that T<sub>a</sub> denotes the set of tags depicting location
a (i.e., its textual identity):</p>
        <disp-formula id="eq4">
          <tex-math><![CDATA[\frac{\sum_{t \in T_x} e^{\max_{k \in T_a} \{ sim(t, k) \}}}{|T_x|} \qquad (4)]]></tex-math>
        </disp-formula>
        <p>
          We propose a new method to compute the similarity
between tags and omit the use of the ubiquitous TF-IDF.
To this end, we make use of distributed word representations,
namely word2vec. A pretrained model (the Google News
Dataset-based dictionary, defined as T<sub>w</sub>) is used to convert
words to vectors. Such vectors preserve the semantic and
linguistic regularities among words [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. The following
formula describes this approach:
        </p>
        <disp-formula id="eq5">
          <tex-math><![CDATA[sim(t_a, t_b) = \begin{cases} \cos(\theta) & \text{if } t_a \in T_w \wedge t_b \in T_w \\ 1 & \text{if } (t_a \notin T_w \vee t_b \notin T_w) \wedge t_a = t_b \\ 0 & \text{otherwise} \end{cases} \qquad (5)]]></tex-math>
        </disp-formula>
        <p>with t<sub>a</sub> and t<sub>b</sub> depicting tags, and cos(θ) the cosine
similarity between their representative vectors. With this
technique, semantically similar but differently spelled tags
can still have an influence on the eventual relevance score.</p>
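        <p>A sketch of Equations (4) and (5) using the gensim library; the pretrained Google News vectors file name and all variable names are assumptions for illustration:</p>
        <preformat><![CDATA[
import numpy as np
from gensim.models import KeyedVectors

# Pretrained Google News word2vec model (assumed file name); its
# vocabulary plays the role of Tw in Equation (5).
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def sim(ta: str, tb: str) -> float:
    """Equation (5): cosine similarity if both tags are in the vocabulary,
    1 for identical out-of-vocabulary tags, 0 otherwise."""
    if ta in w2v and tb in w2v:
        return float(w2v.similarity(ta, tb))  # cos(theta)
    return 1.0 if ta == tb else 0.0

def textual_relevance(tags_x: set, tags_a: set) -> float:
    """Equation (4): mean over the image's tags of e raised to the best
    similarity against the location's textual-identity tags."""
    if not tags_x or not tags_a:
        return 0.0  # assumption: empty tag sets are not covered above
    return float(sum(np.exp(max(sim(t, k) for k in tags_a))
                     for t in tags_x) / len(tags_x))
]]></preformat>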
        <p>Visual relevance is calculated based on the maximum
similarity between the image and the representative Wikipedia
images. Finally, Flickr relevance is the inverse of the original
Flickr ranking of the image.</p>
      </sec>
      <sec id="sec-6-2">
        <title>2.4.2 Diversity Estimation</title>
        <p>
          To improve the similarity estimation, and thus the
dissimilarity estimation, between two images, we attempt to find more
effective visual descriptors. To this end, we make use of a deep
convolutional neural network named OverFeat, trained on 1.2 million images
from ImageNet, to extract high-level
features [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Each image is resized and cropped to a size of 231
pixels by 231 pixels; then, for each image, a representative
vector is extracted from the convolutional network. This is
done by feed-forward propagation through the network,
omitting the fully connected layers, which results in a vector
of size 4096 for each image. We thus assume that the
numerous filters in the convolutional layers extract high-level
and representative features. The diversity between two
images is then again estimated based on the Euclidean distance
between their descriptors.
        </p>
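        <p>Since the OverFeat tooling itself is not shown here, the following sketch approximates the pipeline with an AlexNet stand-in from torchvision; only the 231 by 231 input size and the idea of dropping the fully connected layers come from the description above:</p>
        <preformat><![CDATA[
import torch
from PIL import Image
from torchvision import models, transforms

# ImageNet-pretrained AlexNet as a stand-in for OverFeat (assumption);
# we keep only the convolutional part, as described above.
cnn = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()
prep = transforms.Compose([
    transforms.Resize(231),
    transforms.CenterCrop(231),
    transforms.ToTensor(),
])

def descriptor(path: str) -> torch.Tensor:
    """Feed-forward propagation through the convolutional layers only."""
    x = prep(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        f = cnn.features(x)
    return torch.flatten(f, 1)[0]

def visual_diversity(a: torch.Tensor, b: torch.Tensor) -> float:
    """Diversity as the Euclidean distance between two descriptors."""
    return torch.dist(a, b).item()
]]></preformat>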
      </sec>
    </sec>
    <sec id="sec-7">
      <title>3. EXPERIMENTS</title>
      <p>Table 1 lists the results of the original Flickr
ranking together with the results of all algorithms on the
development set. Table 2 shows the results on the test
set. Clearly, run 5 outperforms the other approaches when
observing the F1-measure: it reaches an F1-score of
57.16% on the development set and 54.55% on the test set.</p>
    </sec>
    <sec id="sec-8">
      <title>4. CONCLUSIONS</title>
      <p>We observe that run 5, using distributed word
representations for the relevance estimation and OverFeat features
for the diversity assessment, outperforms all other runs.
In particular, the use of advanced image features positively influences
the F1-score. For future work, the influence of more focused
distributed word representations will be investigated.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Popescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lupu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Ginsca</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          .
          <article-title>Retrieving diverse social images at MediaEval 2014: Challenge, dataset and evaluation</article-title>
          .
          <source>In MediaEval Working Notes</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Corrado</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>In Advances in Neural Information Processing Systems (NIPS)</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Salvador</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Chan</surname>
          </string-name>
          .
          <article-title>Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms</article-title>
          .
          <source>In Tools with Artificial Intelligence</source>
          , pages
          <fpage>576</fpage>
          -
          <lpage>584</lpage>
          , Nov.
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Sermanet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Eigen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mathieu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun. Overfeat</surname>
          </string-name>
          :
          <article-title>Integrated recognition, localization and detection using convolutional networks</article-title>
          .
          <source>CoRR</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Vandersmissen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tomar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Godin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>De Neve</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Van de Walle</surname>
          </string-name>
          .
          <article-title>Ghent University-iMinds at MediaEval 2013 Diverse Images: Relevance-Based Hierarchical Clustering</article-title>
          .
          <source>In Working Notes Proceedings of the MediaEval 2013 Workshop</source>
          , Barcelona, Spain, October 18-19, CEUR-WS, Vol.
          <volume>1043</volume>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>