<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Ghent University-iMinds at MediaEval 2014 Diverse Images: Adaptive Clustering with Deep Features</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Baptist</forename><surname>Vandersmissen</surname></persName>
							<email>baptist.vandersmissen@ugent.be</email>
							<affiliation key="aff0">
								<orgName type="department">ELIS</orgName>
								<orgName type="laboratory">Multimedia Lab</orgName>
								<orgName type="institution">Ghent University-iMinds</orgName>
								<address>
									<settlement>Ghent</settlement>
									<country key="BE">Belgium</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Abhineshwar</forename><surname>Tomar</surname></persName>
							<email>abhineshwar.tomar@ugent.be</email>
						</author>
						<author>
							<persName><forename type="first">Fréderic</forename><surname>Godin</surname></persName>
							<email>frederic.godin@ugent.be</email>
						</author>
						<author>
							<persName><forename type="first">Wesley</forename><surname>De Neve</surname></persName>
							<email>wesley.deneve@ugent.be</email>
							<affiliation key="aff0">
								<orgName type="department">ELIS</orgName>
								<orgName type="laboratory">Multimedia Lab</orgName>
								<orgName type="institution">Ghent University-iMinds</orgName>
								<address>
									<settlement>Ghent</settlement>
									<country key="BE">Belgium</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department">Image and Video Systems Lab</orgName>
								<orgName type="institution">KAIST</orgName>
								<address>
									<settlement>Daejeon</settlement>
									<country key="KR">South Korea</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Rik</forename><surname>Van De Walle</surname></persName>
							<email>rik.vandewalle@ugent.be</email>
							<affiliation key="aff0">
								<orgName type="department">ELIS</orgName>
								<orgName type="laboratory">Multimedia Lab</orgName>
								<orgName type="institution">Ghent University-iMinds</orgName>
								<address>
									<settlement>Ghent</settlement>
									<country key="BE">Belgium</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Ghent University-iMinds at MediaEval 2014 Diverse Images: Adaptive Clustering with Deep Features</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">8E15BE0AB85AA00F414C01E622B47D0F</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T16:10+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In this paper, we tackle the MediaEval 2014 Retrieving Diverse Social Images challenge, a filter-and-refinement problem defined over a Flickr-based ranked set of social images. We build upon the solutions proposed in [5] and mainly focus on exploiting the joint use of all modalities. The use of image features extracted from a deep convolutional neural network, combined with the use of distributed word representations, forms the basis of our approach.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">INTRODUCTION</head><p>In this paper, we describe our approach for tackling the MediaEval 2014 Retrieving Diverse Social Images Task <ref type="bibr" target="#b1">[1]</ref>. This task focuses on result diversification in the context of image retrieval. We refer to <ref type="bibr" target="#b1">[1]</ref> for a complete task overview.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">METHODOLOGY</head><p>This section describes the four different approaches created to solve the aforementioned challenge. The approach used in the last run makes use of external data sources; all other approaches exclusively use data provided by the task organizers. We focused on two components: estimating the relevance of an image with respect to a specific location and estimating the similarity between a pair of images. In particular, runs 2, 3, and 5 build upon these components.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Run 1: Visual-only</head><p>We propose a hierarchical clustering-based approach for the ranking of images in accordance with their relevance and diversity for a specific location. We used the approach proposed in <ref type="bibr">[5]</ref> (cf. "Visual run").</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Run 2: Textual-only</head><p>The textual run makes use of information derived from the provided tags and other textual metadata. This approach aims at diversifying the results by optimizing an adapted performance metric. We modified both the relevance and diversity estimation of the algorithm proposed in <ref type="bibr">[5]</ref> (cf. "Textual run") as presented in the following sections.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.1">Relevance Estimation</head><p>The relevance of an image is estimated from its textual metadata. Let T_x denote the set of tags assigned to image x. The following formula predicts the relevance of image x:</p><formula xml:id="formula_0">Rel(x) = α × tags(x) + β × (1 / flickr(x)),<label>(1)</label></formula><p>with α and β representing scalar weights,</p><formula xml:id="formula_1">tags(x) = (|{t | t ∈ T_x, tfidf_t &gt; λ}| / |T_x|) × Σ_{t ∈ T_x} tfidf_t,<label>(2)</label></formula><p>and flickr(x) denoting the original Flickr rank of image x. The TF-IDF score of tag t is denoted by tfidf_t. The tag score (cf. Equation <ref type="formula" target="#formula_1">2</ref>) is the sum of each tag's normalized TF-IDF score, multiplied by the relative number of high-scoring tags. In our approach, λ is set to the average TF-IDF score. This benefits images that have a larger number of more relevant tags.</p></div>
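As a minimal sketch of Equations 1 and 2 (not the authors' implementation), the tag-based relevance score can be written as follows; all tag names, TF-IDF scores, and the default weights alpha and beta are illustrative assumptions:

```python
# Sketch of Equations 1-2 (tag-based relevance estimation).
# Tag names, TF-IDF scores, and the weights alpha/beta are
# illustrative assumptions, not values from the paper.

def avg_tfidf(tfidf):
    """The paper sets the threshold lambda to the average TF-IDF score."""
    return sum(tfidf.values()) / len(tfidf) if tfidf else 0.0

def tag_score(tags, tfidf, lam):
    """Equation 2: sum of the tags' TF-IDF scores, weighted by the
    fraction of tags scoring above the threshold lam."""
    if not tags:
        return 0.0
    high = [t for t in tags if tfidf.get(t, 0.0) > lam]
    total = sum(tfidf.get(t, 0.0) for t in tags)
    return (len(high) / len(tags)) * total

def relevance(tags, tfidf, flickr_rank, alpha=1.0, beta=1.0):
    """Equation 1: weighted sum of the tag score and the inverse
    of the image's original Flickr rank."""
    return alpha * tag_score(tags, tfidf, avg_tfidf(tfidf)) + beta / flickr_rank
```

Images whose tags mostly score above the average TF-IDF therefore keep their full tag-score sum, while images with many low-scoring tags are penalized.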
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.2">Diversity Estimation</head><p>The semantic difference between two images is estimated based on the number of shared tags. Let x and y denote two images, with T_x and T_y denoting their sets of tags, respectively. The diversity is then calculated as follows:</p><formula xml:id="formula_2">Div(x, y) = 1 − |T_x ∩ T_y| / max(|T_x|, |T_y|).<label>(3)</label></formula></div>
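A hedged one-function sketch of Equation 3 (the tag sets below are made up for the example):

```python
# Sketch of Equation 3: tag-overlap diversity between two images.
def diversity(tags_x, tags_y):
    """1 minus the shared-tag ratio; 1.0 means fully diverse.
    The empty-vs-empty case is an assumption (treated as diverse)."""
    sx, sy = set(tags_x), set(tags_y)
    if not sx and not sy:
        return 1.0
    shared = len(sx & sy)
    return 1.0 - shared / max(len(sx), len(sy))
```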
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Run 3: Visual and Textual</head><p>The fusion of visual and textual information results in a relevance-based clustering approach (cf. "Combined run" in <ref type="bibr">[5]</ref>). We modified the clustering technique into adaptive hierarchical clustering: the optimal distance at which to form clusters is determined by finding the "knee" point in the plot of the number of clusters versus the inter-cluster distance (similar to <ref type="bibr" target="#b3">[3]</ref>). To estimate the relevance of an image, we use our textual-only method (cf. Section 2.2.1). The diversity between two images is estimated based on the Euclidean distance between their visual descriptors, each represented by a CN3x3 and an LBP3x3 vector <ref type="bibr" target="#b1">[1]</ref>.</p></div>
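The knee-point idea can be sketched as follows; this uses the maximum second difference as a simple stand-in for the L-method of [3], and the input distance curve is invented for the example:

```python
# Sketch of the knee-point heuristic of Section 2.3: cut the
# dendrogram at the merge distance where the curve of number of
# clusters vs. inter-cluster distance bends most sharply.
# Maximum second difference is used as a simplified stand-in
# for the L-method of [3]; the inputs are illustrative.

def knee_point(distances):
    """Return the index of maximum curvature (largest second
    difference) in a sorted sequence of merge distances."""
    if len(distances) < 3:
        return 0
    best_i, best_curv = 0, float("-inf")
    for i in range(1, len(distances) - 1):
        curv = distances[i + 1] - 2 * distances[i] + distances[i - 1]
        if curv > best_curv:
            best_i, best_curv = i, curv
    return best_i
```

Cutting the hierarchy at the distance indexed by the knee point yields the adaptive number of clusters, instead of fixing it in advance.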
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4">Run 5: External Sources</head><p>The algorithm used to produce the fifth run is based on the one used in Section 2.3. Both the relevance and diversity estimation components are adapted, as described below.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4.1">Relevance Estimation</head><p>In order to accurately estimate the relevance of an image, a well-defined target location is necessary. Thus, each location is first described in both a textual and a visual manner.</p><p>To create the textual identity, related information for each location is extracted from DBpedia<ref type="foot" target="#foot_0">1</ref>. From this information, textual keywords are extracted and combined with the top k most frequently occurring tags in the set of images of the location. The visual identity is formed on the basis of a set of representative photos, retrieved via Wikipedia. The relevance of an image is then calculated as a linear combination of three factors: textual relevance, visual relevance, and Flickr relevance.</p><p>The textual relevance of an image is entirely based on its tags. Again, let T_x denote the set of tags of image x and let T_a denote the set of tags depicting location a (i.e., its textual identity):</p><formula xml:id="formula_3">Rel(x) = (1 / |T_x|) × Σ_{t ∈ T_x} e^(max_{k ∈ T_a} sim(t, k)).<label>(4)</label></formula><p>We propose a new method to compute the similarity between tags that omits the ubiquitous TF-IDF. Instead, we make use of distributed word representations, namely word2vec<ref type="foot" target="#foot_1">2</ref>. A pretrained model (the Google News dataset-based dictionary, whose vocabulary is denoted by T_w) is used to convert words to vectors. Such vectors preserve the semantic and linguistic regularities among words <ref type="bibr" target="#b2">[2]</ref>. The following formula describes this approach:</p><formula xml:id="formula_4">sim(t_a, t_b) = cos(Θ) if t_a ∈ T_w ∧ t_b ∈ T_w; 1 if (t_a ∉ T_w ∨ t_b ∉ T_w) ∧ t_a = t_b; 0 otherwise,<label>(5)</label></formula><p>with t_a and t_b depicting tags, and cos(Θ) the cosine similarity between their representative vectors.
With this technique, tags that are semantically similar but spelled differently can still influence the eventual relevance score.</p><p>The visual relevance is calculated based on the maximum similarity between the image and the representative Wikipedia images. Finally, the Flickr relevance is the inverse of the original Flickr rank of the image.</p></div>
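Equations 4 and 5 can be sketched in a few lines of Python; this is a minimal illustration, not the authors' implementation, and the tiny vector dictionary below stands in for the pretrained word2vec model (all tags and vectors are invented):

```python
# Sketch of Equations 4-5: tag similarity via word vectors,
# with an exact-match fallback for out-of-vocabulary tags.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def sim(ta, tb, vectors):
    """Equation 5: cosine similarity if both tags are in the
    vocabulary; otherwise 1 for identical tags, 0 else."""
    if ta in vectors and tb in vectors:
        return cosine(vectors[ta], vectors[tb])
    return 1.0 if ta == tb else 0.0

def textual_relevance(image_tags, location_tags, vectors):
    """Equation 4: mean over the image's tags of
    e^(best similarity to any location tag)."""
    if not image_tags:
        return 0.0
    total = sum(math.exp(max(sim(t, k, vectors) for k in location_tags))
                for t in image_tags)
    return total / len(image_tags)
```

The exponential rewards tags with at least one close match in the location's textual identity, while unmatched tags still contribute a baseline of e^0 = 1.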
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4.2">Diversity Estimation</head><p>To improve the similarity estimation, and thus the dissimilarity estimation, between two images, we attempt to find more effective visual descriptors. To this end, we make use of a deep convolutional neural network named OverFeat<ref type="foot" target="#foot_2">3</ref>, trained on 1.2 million images from ImageNet, to extract high-level features <ref type="bibr" target="#b4">[4]</ref>. Each image is resized and cropped to 231 × 231 pixels, after which a representative vector is extracted by feed-forward propagation through the network, omitting the fully connected layers; this results in a vector of size 4096 for each image. We thus assume that the numerous filters in the convolutional layers extract high-level and representative features. The diversity between two images is then again estimated based on the Euclidean distance between their descriptors.</p></div>
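The distance step above can be sketched as follows; the network forward pass itself is not reproduced here, and the short made-up vectors stand in for the 4096-dimensional OverFeat descriptors:

```python
# Sketch of the Section 2.4.2 diversity step: Euclidean distance
# between per-image deep feature vectors. The low-dimensional
# vectors here are illustrative stand-ins for 4096-dim descriptors.
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def pairwise_diversity(features):
    """Full distance matrix over a list of feature vectors,
    as consumed by the hierarchical clustering step."""
    n = len(features)
    return [[euclidean(features[i], features[j]) for j in range(n)]
            for i in range(n)]
```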
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">EXPERIMENTS</head><p>Table <ref type="table" target="#tab_0">1</ref> lists the results of the original Flickr ranking, together with the results of all our algorithms, on the development set. Table <ref type="table" target="#tab_1">2</ref> shows the results on the test set. Run 5 clearly outperforms the other approaches in terms of F1-measure, reaching an F1-score of 57.16% on the development set and 54.55% on the test set.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">CONCLUSIONS</head><p>We observe that run 5, which uses distributed word representations for the relevance estimation and OverFeat features for the diversity assessment, outperforms all other runs. In particular, the use of advanced image features positively influences the F1-score. For future work, the influence of more focused distributed word representations will be investigated.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Results on development set.</figDesc><table><row><cell></cell><cell>Flickr</cell><cell>Run 1</cell><cell>Run 2</cell><cell>Run 3</cell><cell>Run 5</cell></row><row><cell>P@20</cell><cell>0.8333</cell><cell>0.7083</cell><cell>0.7500</cell><cell>0.7700</cell><cell>0.8567</cell></row><row><cell>CR@20</cell><cell>0.3455</cell><cell>0.3967</cell><cell>0.4441</cell><cell>0.4043</cell><cell>0.4289</cell></row><row><cell>F1@20</cell><cell>0.4885</cell><cell>0.5086</cell><cell>0.5579</cell><cell>0.5302</cell><cell>0.5716</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 :</head><label>2</label><figDesc>Results on test set.</figDesc><table><row><cell></cell><cell>Run 1</cell><cell>Run 2</cell><cell>Run 3</cell><cell>Run 5</cell></row><row><cell>P@20</cell><cell>0.6232</cell><cell>0.7480</cell><cell>0.7557</cell><cell>0.8008</cell></row><row><cell>CR@20</cell><cell>0.3600</cell><cell>0.4279</cell><cell>0.4035</cell><cell>0.4252</cell></row><row><cell>F1@20</cell><cell>0.4503</cell><cell>0.5369</cell><cell>0.5181</cell><cell>0.5455</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">http://dbpedia.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://code.google.com/p/word2vec/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">http://cilvr.nyu.edu/doku.php?id=code:start</note>
		</body>
		<back>
			<div type="references">

				<listBibl>


<biblStruct xml:id="b1">
	<analytic>
	<title level="a" type="main">Retrieving Diverse Social Images at MediaEval 2014: Challenge, Dataset and Evaluation</title>
		<author>
			<persName><forename type="first">B</forename><surname>Ionescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Popescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lupu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">L</forename><surname>Ginsca</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Müller</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes Proceedings of the MediaEval 2014 Workshop</title>
				<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Distributed representations of words and phrases and their compositionality</title>
		<author>
			<persName><forename type="first">T</forename><surname>Mikolov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Corrado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dean</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<imprint>
			<publisher>NIPS</publisher>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms</title>
		<author>
			<persName><forename type="first">S</forename><surname>Salvador</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Chan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Tools with Artificial Intelligence</title>
				<imprint>
			<date type="published" when="2004-11">Nov 2004</date>
			<biblScope unit="page" from="576" to="584" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Overfeat: Integrated recognition, localization and detection using convolutional networks</title>
		<author>
			<persName><forename type="first">P</forename><surname>Sermanet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Eigen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mathieu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Fergus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lecun</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2013">2013</date>
			<publisher>CoRR</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Ghent University-iMinds at MediaEval 2013 Diverse Images: Relevance-Based Hierarchical Clustering</title>
		<author>
			<persName><forename type="first">B</forename><surname>Vandersmissen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Tomar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Godin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>De Neve</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Van De Walle</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes Proceedings of the MediaEval 2013 Workshop</title>
				<meeting><address><addrLine>Barcelona, Spain</addrLine></address></meeting>
		<imprint>
			<publisher>CEUR-WS</publisher>
			<biblScope unit="volume">1043</biblScope>
			<date type="published" when="2013">October 18-19, 2013</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
