=Paper=
{{Paper
|id=Vol-1263/paper14
|storemode=property
|title=Ghent University-iMinds at MediaEval 2014 Diverse Images: Adaptive Clustering with Deep Features
|pdfUrl=https://ceur-ws.org/Vol-1263/mediaeval2014_submission_14.pdf
|volume=Vol-1263
|dblpUrl=https://dblp.org/rec/conf/mediaeval/VandersmissenTGNW14
}}
==Ghent University-iMinds at MediaEval 2014 Diverse Images: Adaptive Clustering with Deep Features==
Baptist Vandersmissen¹ (baptist.vandersmissen@ugent.be), Abhineshwar Tomar¹ (abhineshwar.tomar@ugent.be), Fréderic Godin¹ (frederic.godin@ugent.be), Wesley De Neve¹,² (wesley.deneve@ugent.be), Rik Van de Walle¹ (rik.vandewalle@ugent.be)

¹ Multimedia Lab, ELIS, Ghent University – iMinds, Ghent, Belgium
² Image and Video Systems Lab, KAIST, Daejeon, South Korea
ABSTRACT

In this paper, we attempt to tackle the MediaEval 2014 Retrieving Diverse Social Images challenge, a filter and refinement problem defined for a Flickr-based ranked set of social images. We build upon solutions proposed in [5] and mainly focus on exploiting the joint use of all modalities. The use of image features extracted from a deep convolutional neural network, combined with the use of distributed word representations, forms the basis of our approach.

1. INTRODUCTION

In this paper, we describe our approach for tackling the MediaEval 2014 Retrieving Diverse Social Images Task [1]. This task focuses on result diversification in the context of image retrieval. We refer to [1] for a complete task overview.

2. METHODOLOGY

This section describes four different approaches created to solve the aforementioned challenge. The approach used in the last run uses external data sources; all other approaches exclusively use data provided by the task organizers. We focused on two parts: the relevance estimation of an image with respect to a specific location, and the similarity estimation between a pair of images. In particular, runs 2, 3, and 5 build upon these parts.

2.1 Run 1: Visual-only

We propose a hierarchical clustering-based approach for ranking images in accordance with their relevance and diversity for a specific location. We used the approach proposed in [5] (cf. "Visual run").

2.2 Run 2: Textual-only

The textual run makes use of information derived from the provided tags and other textual metadata. This approach aims at diversifying the results by optimizing an adapted performance metric. We modified both the relevance and diversity estimation of the algorithm proposed in [5] (cf. "Textual run"), as presented in the following sections.

2.2.1 Relevance Estimation

The relevance of an image is estimated by making use of textual metadata. Let T_x denote the set of tags assigned to image x. The following formula predicts the relevance of image x:

    Rel(x) = \alpha \times tags(x) + \beta \times \frac{1}{flickr(x)},    (1)

with \alpha and \beta representing scalars,

    tags(x) = \frac{|\{t \mid t \in T_x, tfidf_t > \lambda\}|}{|T_x|} \times \sum_{t \in T_x} tfidf_t,    (2)

and flickr(x) denoting the original Flickr ranking of image x. The TF-IDF score of tag t is denoted by tfidf_t. The tag score (cf. Equation 2) is the sum of each tag's normalized TF-IDF score, multiplied by the relative number of high-scoring tags. In our approach, \lambda is set to the average TF-IDF score. This benefits images with a higher number of more relevant tags.

2.2.2 Diversity Estimation

Estimating the semantic difference between two images is based on the number of shared tags. Let x and y denote two images, with T_x and T_y denoting their sets of tags, respectively. The diversity is then calculated as follows:

    Div(x, y) = 1 - \frac{|T_x \cap T_y|}{\max(|T_x|, |T_y|)}.    (3)

2.3 Run 3: Visual and Textual

The fusion of both visual and textual information results in a relevance-based clustering approach (cf. "Combined run" in [5]). We modified the clustering technique to adaptive hierarchical clustering: the optimal distance at which to form clusters is determined by finding the "knee" point in the plot of the number of clusters versus the inter-cluster distance (similar to [3]). To estimate the relevance of an image, we use our textual-only method (cf. Section 2.2.1). The diversity between two images is estimated based on the Euclidean distance between their visual descriptors, each represented by a CN3x3 and an LBP3x3 vector [1].

2.4 Run 5: External Sources

The algorithm used to produce the fifth run is based on the one used in Section 2.3. Both the relevance and diversity estimation components are adapted, as described below.

Copyright is held by the author/owner(s). MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain
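The tag-based relevance and diversity scores of Run 2 (Equations 1–3) can be sketched as follows; the function names, the default weights `alpha` and `beta`, and the way per-tag TF-IDF scores are passed in are illustrative assumptions, not values from the paper:

```python
def tags_score(tag_tfidf, lam):
    """Equation 2: fraction of tags with TF-IDF above lambda, times the TF-IDF sum."""
    if not tag_tfidf:
        return 0.0
    high = sum(1 for score in tag_tfidf.values() if score > lam)
    return (high / len(tag_tfidf)) * sum(tag_tfidf.values())

def relevance(tag_tfidf, flickr_rank, lam, alpha=1.0, beta=1.0):
    """Equation 1: weighted tag score plus the inverse of the Flickr rank."""
    return alpha * tags_score(tag_tfidf, lam) + beta * (1.0 / flickr_rank)

def diversity(tags_x, tags_y):
    """Equation 3: one minus the shared-tag ratio between two tag sets."""
    return 1.0 - len(tags_x & tags_y) / max(len(tags_x), len(tags_y))
```

Following the paper, `lam` would be set to the average TF-IDF score, while `alpha` and `beta` are left unspecified.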
Table 1: Results on the development set.

           Flickr   Run 1    Run 2    Run 3    Run 5
  P@20     0.8333   0.7083   0.7500   0.7700   0.8567
  CR@20    0.3455   0.3967   0.4441   0.4043   0.4289
  F1@20    0.4885   0.5086   0.5579   0.5302   0.5716

Table 2: Results on the test set.

           Run 1    Run 2    Run 3    Run 5
  P@20     0.6232   0.7480   0.7557   0.8008
  CR@20    0.3600   0.4279   0.4035   0.4252
  F1@20    0.4503   0.5369   0.5181   0.5455
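As a sanity check on the tables, F1@20 can be recomputed from P@20 and CR@20, assuming the harmonic-mean definition used at MediaEval; note that the official figures are averaged per location, so the harmonic mean of the averaged P@20 and CR@20 need not match the reported F1@20 exactly:

```python
def f1_at_20(p, cr):
    """Harmonic mean of precision@20 and cluster recall@20."""
    return 2 * p * cr / (p + cr) if p + cr else 0.0

# Run 5 on the development set (Table 1): P@20 = 0.8567, CR@20 = 0.4289
print(round(f1_at_20(0.8567, 0.4289), 4))  # → 0.5716, matching Table 1
```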
2.4.1 Relevance Estimation

In order to accurately estimate the relevance of an image, a well-defined target location is necessary. Thus, each location is first described in both a textual and a visual manner. To create this textual identity, related information for each location is extracted from DBpedia (http://dbpedia.org/). From this information, textual keywords are extracted and combined with the top k most frequently occurring tags in the set of images of a location. The visual identity is formed on the basis of a set of representative photos, retrieved via Wikipedia. The relevance of an image is calculated as a linear combination of the following three factors: textual relevance, visual relevance, and Flickr relevance.

The textual relevance of an image is entirely based on its tags. Again, assume that T_x denotes the set of tags of image x and that T_a denotes the set of tags describing location a (i.e., its textual identity):

    Rel(x) = \frac{\sum_{t \in T_x} e^{\max_{k \in T_a} sim(t, k)}}{|T_x|}.    (4)

We propose a new method to compute the similarity between tags and omit the use of the ubiquitous TF-IDF. Instead, we make use of distributed word representations, namely word2vec (https://code.google.com/p/word2vec/). A pretrained model (the Google News dataset-based dictionary, defined as T_w) is used to convert words to vectors. Such vectors preserve the semantic and linguistic regularities among words [2]. The following formula describes this approach:

    sim(t_a, t_b) = \begin{cases} \cos(\Theta) & \text{if } t_a \in T_w \wedge t_b \in T_w, \\ 1 & \text{if } (t_a \notin T_w \vee t_b \notin T_w) \wedge t_a = t_b, \\ 0 & \text{otherwise}, \end{cases}    (5)

with t_a and t_b denoting tags, and \cos(\Theta) the cosine similarity between their representative vectors. With this technique, tags that are semantically similar yet spelled differently can still influence the eventual relevance score.

Visual relevance is calculated based on the maximum similarity between the image and the representative Wikipedia images. Finally, Flickr relevance is the inverse of the original Flickr ranking of the image.

2.4.2 Diversity Estimation

To improve the similarity estimation, and thus the dissimilarity estimation, between two images, we attempt to find more effective visual descriptors. Therefore, we make use of a deep convolutional neural network trained on 1.2 million images from ImageNet, named OverFeat (http://cilvr.nyu.edu/doku.php?id=code:start), to extract high-level features [4]. Each image is resized and cropped to 231 by 231 pixels; then, for each image, a representative vector is extracted from the convolutional network. This is done by feed-forward propagation through the network while omitting the fully connected layers, which results in a vector of size 4096 for each image. Thus, we assume that the numerous filters in the convolutional layers extract high-level and representative features. The diversity between two images is then again estimated based on the Euclidean distance between their descriptors.

3. EXPERIMENTS

Table 1 shows the results of the original Flickr ranking together with the results of all our algorithms on the development set. Table 2 shows the results on the test set. Clearly, run 5 outperforms the other approaches in terms of the F1-measure, reaching an F1-score of 57.16% on the development set and 54.55% on the test set.

4. CONCLUSIONS

We observe that run 5, which uses distributed word representations for the relevance estimation and OverFeat features for the diversity assessment, outperforms all others. In particular, the use of advanced image features positively influences the F1-score. For future work, the influence of more focused distributed word representations will be investigated.

5. REFERENCES

[1] B. Ionescu, A. Popescu, M. Lupu, A. L. Ginsca, and H. Müller. Retrieving diverse social images at MediaEval 2014: Challenge, dataset and evaluation. In MediaEval Working Notes, 2014.
[2] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. NIPS, 2013.
[3] S. Salvador and P. Chan. Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms. In Tools with Artificial Intelligence, pages 576–584, Nov 2004.
[4] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. CoRR, 2013.
[5] B. Vandersmissen, A. Tomar, F. Godin, W. De Neve, and R. Van de Walle. Ghent University-iMinds at MediaEval 2013 Diverse Images: Relevance-Based Hierarchical Clustering. In Working Notes Proceedings of the MediaEval 2013 Workshop, Barcelona, Spain, October 18-19, CEUR-WS, 1043, 2013.
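For illustration, the tag similarity and textual relevance of Run 5 (Equations 4 and 5) can be sketched as follows; a plain dict of word vectors stands in for the pretrained word2vec model, and all function and variable names are illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def sim(ta, tb, vectors):
    """Equation 5: cosine similarity when both tags are in the vocabulary;
    exact string match as a fallback for out-of-vocabulary tags; 0 otherwise."""
    if ta in vectors and tb in vectors:
        return cosine(vectors[ta], vectors[tb])
    if ta == tb:
        return 1.0
    return 0.0

def textual_relevance(image_tags, location_tags, vectors):
    """Equation 4: mean over the image tags of e^(best similarity
    with any tag in the location's textual identity)."""
    if not image_tags:
        return 0.0
    total = sum(math.exp(max(sim(t, k, vectors) for k in location_tags))
                for t in image_tags)
    return total / len(image_tags)
```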