         Ghent University-iMinds at MediaEval 2014 Diverse
          Images: Adaptive Clustering with Deep Features

           Baptist Vandersmissen1                    Abhineshwar Tomar1                    Fréderic Godin1
        baptist.vandersmissen@ugent.be            abhineshwar.tomar@ugent.be            frederic.godin@ugent.be

                                    Wesley De Neve1,2                Rik Van de Walle1
                                   wesley.deneve@ugent.be          rik.vandewalle@ugent.be

1 Multimedia Lab, ELIS, Ghent University – iMinds, Ghent, Belgium
2 Image and Video Systems Lab, KAIST, Daejeon, South Korea

ABSTRACT

In this paper, we attempt to tackle the MediaEval 2014 Retrieving Diverse Social Images challenge, a filter-and-refinement problem defined for a Flickr-based ranked set of social images. We build upon the solutions proposed in [5] and mainly focus on exploiting the joint use of all modalities. The use of image features extracted from a deep convolutional neural network, combined with the use of distributed word representations, forms the basis of our approach.

1. INTRODUCTION

In this paper, we describe our approach for tackling the MediaEval 2014 Retrieving Diverse Social Images Task [1]. This task focuses on result diversification in the context of image retrieval. We refer to [1] for a complete task overview.

2. METHODOLOGY

This section describes four different approaches created to solve the aforementioned challenge. The approach used in the last run makes use of external data sources; all other approaches exclusively use data provided by the task organizers. We focused on two parts: relevance estimation of an image with respect to a specific location, and similarity estimation between a pair of images. In particular, runs 2, 3, and 5 build upon these parts.

2.1 Run 1: Visual-only

We propose a hierarchical clustering-based approach for ranking images in accordance with their relevance and diversity for a specific location. We used the approach proposed in [5] (cf. "Visual run").

2.2 Run 2: Textual-only

The textual run makes use of information derived from the provided tags and other textual metadata. This approach aims at diversifying the results by optimizing an adapted performance metric. We modified both the relevance and diversity estimation of the algorithm proposed in [5] (cf. "Textual run"), as presented in the following sections.

Copyright is held by the author/owner(s).
MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain

2.2.1 Relevance Estimation

The relevance of an image is estimated by making use of textual metadata. Let Tx denote the set of tags assigned to image x. The following formula predicts the relevance of image x:

    Rel(x) = α × tags(x) + β × 1/flickr(x),    (1)

with α and β representing scalar weights,

    tags(x) = (|{t | t ∈ Tx, tfidf_t > λ}| / |Tx|) × Σ_{t ∈ Tx} tfidf_t,    (2)

and flickr(x) denoting the original Flickr rank of image x. The TF-IDF score of tag t is denoted by tfidf_t. The tag score (cf. Equation 2) is the sum of each tag's normalized TF-IDF score, multiplied by the relative number of high-scoring tags. In our approach, λ is set to the average TF-IDF score. This benefits images with a higher number of more relevant tags.

2.2.2 Diversity Estimation

Estimating the semantic difference between two images is based on the number of shared tags. Let x and y denote two images, with Tx and Ty denoting their sets of tags, respectively. The diversity is then calculated as follows:

    Div(x, y) = 1 − |Tx ∩ Ty| / max(|Tx|, |Ty|).    (3)

2.3 Run 3: Visual and Textual

The fusion of both visual and textual information results in a relevance-based clustering approach (cf. "Combined run" in [5]). We modified the clustering technique to adaptive hierarchical clustering: the optimal distance at which to form clusters is determined by finding the "knee" point in the plot of the number of clusters versus the inter-cluster distance (similar to [3]). To estimate the relevance of an image, we use our textual-only method (cf. Section 2.2.1). The diversity between two images is estimated based on the Euclidean distance between their visual descriptors, each represented by a CN3x3 and LBP3x3 vector [1].

2.4 Run 5: External Sources

The algorithm used to produce the fifth run is based on the one used in Section 2.3. Both the relevance and diversity estimation components are adapted, as described below.
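As an illustrative sketch of the textual relevance score of Equations (1)–(2) (our reimplementation, not the authors' released code; the tfidf mapping and the default values for α and β are placeholder assumptions):

```python
def tag_score(tags, tfidf, lam):
    """Equation (2): the fraction of tags whose TF-IDF exceeds lambda,
    multiplied by the summed TF-IDF mass of all tags of the image."""
    if not tags:
        return 0.0
    high = [t for t in tags if tfidf.get(t, 0.0) > lam]
    return (len(high) / len(tags)) * sum(tfidf.get(t, 0.0) for t in tags)


def relevance(tags, flickr_rank, tfidf, lam, alpha=1.0, beta=1.0):
    """Equation (1): weighted combination of the tag score and the
    inverse of the original Flickr rank (rank 1 = highest)."""
    return alpha * tag_score(tags, tfidf, lam) + beta * (1.0 / flickr_rank)
```

In the paper, lam is set to the average TF-IDF score over the collection; the weights alpha and beta are left unspecified and are shown here with arbitrary defaults.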
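The tag-based diversity of Equation (3) can be sketched in a few lines (again an illustrative reimplementation, with an assumed value of 0 for the degenerate case of two untagged images):

```python
def tag_diversity(tags_x, tags_y):
    """Equation (3): 1 minus the tag overlap, normalised by the
    size of the larger of the two tag sets."""
    tx, ty = set(tags_x), set(tags_y)
    if not tx and not ty:
        return 0.0  # assumption: two untagged images count as identical
    return 1.0 - len(tx & ty) / max(len(tx), len(ty))
```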
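The adaptive cut-off of Section 2.3 can be illustrated with a simple knee heuristic; note that [3] describes a more elaborate method (the L-method), and the maximum-jump criterion below is only a simplified stand-in:

```python
def knee_index(distances):
    """Given the sequence of inter-cluster merge distances (one per
    number of clusters, in decreasing order), return the index of the
    'knee', taken here as the point of maximum drop to the next value."""
    jumps = [distances[i] - distances[i + 1] for i in range(len(distances) - 1)]
    return max(range(len(jumps)), key=jumps.__getitem__)
```

The hierarchical clustering would then be cut at the merge distance corresponding to this index.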
Table 1: Results on the development set.

         Flickr   Run 1    Run 2    Run 3    Run 5
P@20     0.8333   0.7083   0.7500   0.7700   0.8567
CR@20    0.3455   0.3967   0.4441   0.4043   0.4289
F1@20    0.4885   0.5086   0.5579   0.5302   0.5716

Table 2: Results on the test set.

         Run 1    Run 2    Run 3    Run 5
P@20     0.6232   0.7480   0.7557   0.8008
CR@20    0.3600   0.4279   0.4035   0.4252
F1@20    0.4503   0.5369   0.5181   0.5455

2.4.1 Relevance Estimation

In order to accurately estimate the relevance of an image, a well-defined target location is necessary. Thus, each location is first described in both a textual and a visual manner. To create the textual identity, related information on each location is extracted from DBpedia (http://dbpedia.org/). From this information, textual keywords are extracted and combined with the top k most frequently occurring tags in the set of images of that location. The visual identity is formed on the basis of a set of representative photos retrieved via Wikipedia. The relevance of an image is then calculated as a linear combination of the following three factors: textual relevance, visual relevance, and Flickr relevance.

The textual relevance of an image is entirely based on its tags. Again, let Tx denote the set of tags of image x, and let Ta denote the set of tags describing location a (i.e., its textual identity):

    Rel(x) = ( Σ_{t ∈ Tx} e^{max_{k ∈ Ta} sim(t, k)} ) / |Tx|.    (4)

We propose a new method to compute the similarity between tags, omitting the ubiquitous TF-IDF. To this end, we make use of distributed word representations, namely word2vec (https://code.google.com/p/word2vec/). A pretrained model (the Google News dataset-based dictionary, defined as Tw) is used to convert words to vectors. Such vectors preserve the semantic and linguistic regularities among words [2]. The following formula describes this approach:

    sim(ta, tb) = { cos(θ)   if ta ∈ Tw ∧ tb ∈ Tw
                  { 1        if (ta ∉ Tw ∨ tb ∉ Tw) ∧ ta = tb       (5)
                  { 0        otherwise,

with ta and tb denoting tags, and cos(θ) the cosine similarity between their representative vectors. With this technique, tags that are semantically similar but spelled differently can still influence the eventual relevance score.

Visual relevance is calculated based on the maximum similarity between the image and the representative Wikipedia images. Finally, Flickr relevance is the inverse of the original Flickr rank of the image.

2.4.2 Diversity Estimation

To improve the similarity estimation, and thus the dissimilarity estimation, between two images, we attempt to find more effective visual descriptors. To that end, we make use of a deep convolutional neural network trained on 1.2 million images from ImageNet, named OverFeat (http://cilvr.nyu.edu/doku.php?id=code:start), to extract high-level features [4]. Each image is resized and cropped to 231 by 231 pixels, after which a representative vector is extracted from the convolutional network. This is done by feed-forward propagation through the network, omitting the fully connected layers, which results in a vector of size 4096 for each image. We thus assume that the numerous filters in the convolutional layers extract high-level and representative features. The diversity between two images is then again estimated based on the Euclidean distance between their descriptors.

3. EXPERIMENTS

Table 1 shows the results of the original Flickr ranking together with the results of all our runs on the development set. Table 2 shows the results on the test set. Clearly, run 5 outperforms the other approaches in terms of F1-measure, reaching an F1-score of 57.16% on the development set and 54.55% on the test set.

4. CONCLUSIONS

We observe that run 5, which uses distributed word representations for relevance estimation and OverFeat features for diversity assessment, outperforms all other runs. In particular, the use of advanced image features positively influences the F1-score. For future work, the influence of more focused distributed word representations will be investigated.

5. REFERENCES

[1] B. Ionescu, A. Popescu, M. Lupu, A. L. Ginsca, and H. Müller. Retrieving diverse social images at MediaEval 2014: Challenge, dataset and evaluation. In MediaEval Working Notes, 2014.
[2] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. NIPS, 2013.
[3] S. Salvador and P. Chan. Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms. In Tools with Artificial Intelligence, pages 576–584, Nov 2004.
[4] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. CoRR, 2013.
[5] B. Vandersmissen, A. Tomar, F. Godin, W. De Neve, and R. Van de Walle. Ghent University-iMinds at MediaEval 2013 Diverse Images: Relevance-Based Hierarchical Clustering. In Working Notes Proceedings of the MediaEval 2013 Workshop, CEUR-WS, Vol. 1043, Barcelona, Spain, October 18-19, 2013.
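As a supplementary sketch, the word2vec-based textual relevance of Equations (4)–(5) can be expressed as follows. This is an illustrative reimplementation under stated assumptions: the vocabulary Tw is modelled as a plain dict mapping tags to vectors, standing in for a loaded word2vec model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def tag_similarity(ta, tb, vectors):
    """Equation (5): cosine similarity if both tags are in the
    vocabulary; exact string match as the out-of-vocabulary fallback."""
    if ta in vectors and tb in vectors:
        return cosine(vectors[ta], vectors[tb])
    return 1.0 if ta == tb else 0.0


def textual_relevance(image_tags, location_tags, vectors):
    """Equation (4): for each image tag, exponentiate its best
    similarity against any location tag, then average over the tags."""
    if not image_tags:
        return 0.0
    total = sum(
        math.exp(max(tag_similarity(t, k, vectors) for k in location_tags))
        for t in image_tags
    )
    return total / len(image_tags)
```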
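The diversity measure of Section 2.4.2 reduces to a plain Euclidean distance between two deep feature vectors (4096-dimensional in the paper's OverFeat setup); a minimal sketch, with the descriptors assumed to be given as plain lists of floats:

```python
import math


def euclidean_diversity(desc_x, desc_y):
    """Diversity between two images as the Euclidean distance
    between their deep convolutional feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(desc_x, desc_y)))
```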