Ghent University-iMinds at MediaEval 2013 Diverse
         Images: Relevance-Based Hierarchical Clustering

           Baptist Vandersmissen1                    Abhineshwar Tomar1                        Fréderic Godin1
        baptist.vandersmissen@ugent.be            abhineshwar.tomar@ugent.be               frederic.godin@ugent.be
                                  Wesley De Neve         1,2
                                                                        Rik Van de Walle1
                                 wesley.deneve@ugent.be            rik.vandewalle@ugent.be
                           1
                               Multimedia Lab, ELIS, Ghent University – iMinds, Ghent, Belgium
                               2
                                 Image and Video Systems Lab, KAIST, Daejeon, South Korea

ABSTRACT                                                               cluster. The SRI for a set of images is calculated by tak-
In this paper, we attempt to tackle the MediaEval 2013 Re-             ing the mean of all the corresponding feature vectors of
trieving Diverse Social Images challenge, which is a filter and        these images. Intra-cluster ranking of all of the m images
refinement problem on a Flickr-based ranked set of social im-          is produced by calculating their Euclidean distance to the
ages. We developed three different approaches, using visual            SRI. The image with rank 1 (topmost rank) is closest to
data, textual data and a combination thereof, respectively.            the SRI of the cluster.
Hierarchical clustering on highly relevant images, combined       • Subsequently, we rank the k different clusters, again by
with a greedy approach to complement the ranking, forms             calculating the distance of the cluster SRIs to a general
the basis of our approach.                                          SRI value, which is the mean over all cluster SRIs. Finally,
                                                                    n images are selected by iterating over the ranked clusters
1.   INTRODUCTION                                                   and taking the topmost ranked image within each cluster.
  In this paper, we describe our approach for tackling the
MediaEval 2013 Retrieving Diverse Social Images Task [2].         3.     TEXTUAL RUN
This task focuses on result diversification in the context of       The textual run makes use of information derived from
social image retrieval. We refer to [2] for a complete task       tags and other textual metadata. This approach aims at di-
overview.                                                         versifying the results by reranking the images retrieved from
  We suggest a cluster-based approach for the visual run and      Flickr using textual relevance and semantic similarity. Our
a semantic similarity-based approach for the textual run.         solution is based on [3] and makes use of an adapted perfor-
The third run focuses on hierarchical clustering of relevant      mance metric to improve the ranking characteristics. Images
images and represents a combination of the purely visual          for a query can then be ordered by directly optimizing the
and textual techniques.                                           performance metric. This metric is named Average Diverse
                                                                  Precision (ADP) and is derived from the conventional Aver-
2.   VISUAL RUN                                                   age Precision metric by adding a diversity component. We
  We propose a hierarchical clustering-based approach for         refer to [3] for a comprehensive overview.
the ranking of images in accordance with their relevance            We implemented a greedy approach that optimizes an es-
and diversity for a specific location. This method builds on      timation of the ADP measurement. Let τ denote an ordering
the approach provided in [1]. It introduces an inter-cluster      of the images, and let τ(i) be the image at the position of
ranking machanism and differs on use of feature vectors,          rank i (a lower number indicates an image with a higher
distance measure and Synthetic Representative Image (SRI)         rank). With the top i − 1 documents established, we can
calculation method. We want to refine a set of m images           derive that the ith image should be decided as follows:
retrieved from Flickr to a ranking of size n.                                            
                                                                                           Rel(x)
                                                                                                                       
                                                                      τ(i) = arg max              Div(x)(C + Div(x)) , (2)
• The set of m images is hierarchically clustered to pro-                         x∈D−S       i
  duce k clusters. Similarity between two images xi and xj
                                                                  where
  (represented by a CN3x3 and LBP3x3 feature vector) is
  measured using a Gaussian kernel:                                                S = {τ(1), τ(2), . . . , τ(i − 1)} ,       (3)
                                  ||xi − xj ||22
                                                
                                                                                         i−1
              s(xi , xj ) = exp −                  .   (1)                               X
                                       σ2                                           C=         Rel(τ(k))Div(τ(k)),            (4)
                                                                                         k=1
• For each of the k clusters produced, we calculate a Syn-
  thetic Representative Image (SRI). The SRI acts as a rep-       with Rel(x) and Div(x) denoting the estimated relevance
  resentative image for all the images within a particular        and diversity of the image, respectively.

                                                                  3.1      Relevance Estimation
Copyright is held by the author/owner(s).                           We estimate the relevance of an image by making use of
MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain    textual metadata (number of views, number of comments)
Table 1: Results (in %). Numbers between brackets denote the relative difference to the Flickr ranking and
CR denotes the cluster recall of the ranking.
                    Development Set (comparison with Flickr)                    Test Set
                               + GPS                                - GPS                       + GPS                    - GPS
                     CR@10                P@10            CR@10             P@10         CR@10       P@10            CR@10    P@10
         Visual      43.6 (2.4)         79.2 (-6.8)      48.4 (2.0)     71.6 (2.8)         37.5       76.1            34.7       56.8
        Textual      44.2 (2.9)         81.6 (-4.4)      51.6 (5.2)    67.2 (-1.6)         39.7       74.9            37.5       58.6
      Combined      49.8 (8.5)         85.6 (-0.4)       51.7 (5.3)    74.8 (6.0)          41.3       80.5            42.8    66.7


and social tags. This data was provided together with three                  based on a gain score. This score is higher for images
textual features: TF–IDF, Social TF–IDF and a probabilis-                    that maximize the diversity and relevance related to the
tic feature type. We suggest a linear combination of this                    current ranking.
normalized data.
                                                                        • When the first l spots in the ranking are filled, the algo-
  Rel(x) = α × tagsx + β × viewsx + γ × commentsx , (5)                   rithm runs from the start until all n spots are taken.
where viewsx and commentsx denote normalized number of
views and number of comments, respectively and
           P                                                            Table 2: Results on subset of 50 locations from test
              t∈Tx a probt,x + b tfidft,x + c stfidft,x                 set (in %) with crowd-sourcing ground truth.
   tagsx =                                              , (6)
                             |Tx |                                                           GT1                 GT2                    GT3
with Tx denoting the set of tags linked to image x. We                                  CR@10 P@10           CR@10 P@10       CR@10 P@10
found that the precision of image orderings, which are purely
based on relevance information, can be maximized by setting                    Visual    83.0      72.5       78.7     72.5       69.9        72.5
parameters a and γ to zero and increasing the impact of the                   Textual    77.7      65.7       72.1     65.7       63.5        65.7
parameters linked to ’tag’ and ’views’ information.
                                                                            Combined     83.6      69.4       77.4     69.4       66.1        69.4
3.2    Diversity Estimation
 Diversity of an image to a ranking is defined as the mini-
mal difference with the other images in that ranking:                   5.     EXPERIMENTS
                                                                           In Table 1, we can see the results of our algorithms (three
             Div(x) =      min        {1 − s(x, τ(i))}        (7)       runs), compared to the regular Flickr ranking and the re-
                        i∈{1,...,n}
                                                                        sults on the test set. We notice that the combined method
  We use a semantic similarity measure based on Google                  creates the most precise and diverse rankings based on both
distance to assess the difference between two images. The               precision and cluster recall measures (averages 42.1% cluster
average of the summation of the similarities between the                recall on test set). Analysis of the results on the develop-
different tags of both images gives us a value that describes           ment set indicate the importance of finding the right bal-
the semantic similarity between both images.                            ance between relevance and diversity. We also observe the
                                                                        similarity of the visual run and the textual run in terms of
4.    COMBINED RUN                                                      effectiveness. Table 2 lists the results on 50 locations of the
   To estimate the relevance of an image, we use the textual-           test set evaluated with crowd-sourcing ground truths.
based method (cf. Section 3.1). Similarity between images
is based on visual image features and a Gaussian kernel                 6.     CONCLUSIONS
method to measure the difference between two image vec-                   We observe that a method clustering images focused on
tors (cf. Equation 1). In order to provide both a relevant              high relevance outperforms all others. This method uses
and diverse ordering, we make use of hierarchical cluster-              textual data for the relevance estimation and visual features
ing techniques. We, again, try to refine a set of m images              for the diversity assessment.
retrieved from Flickr to a ranking of size n.
• First, l images are selected with the highest estimated               7.     REFERENCES
  relevance. l is an arbitrary number that depicts a subset             [1] R. Anca-Livia, J. Stöttinger, B. Ionescu, M. Menéndez,
  of the final ranking. The larger l becomes, the more the                  and F. Giunchiglia. Representativeness and diversity in
  focus will shift from relevance to diversity.                             photos via crowd-sourced media analysis. 2012.
                                                                        [2] B. Ionescu, M. Menéndez, H. Müller, and A. Popescu.
• Next, these l images are hierarchically clustered based on a              Retrieving diverse social images at mediaeval 2013:
  distance matrix calculated with the above described visual                Objectives, dataset and evaluation. In MediaEval 2013
  similarity. Per cluster, the most relevant item is selected               Workshop, CEUR-WS.org, Barcelona, Spain, 2013.
  and added to the final ranking.
                                                                        [3] M. Wang, K. Yang, X.-S. Hua, and H.-J. Zhang.
• Depending on the number of remaining places of the first                  Towards a relevant and diverse search of social images.
  l spots in the final ranking, images are greedily added                   Multimedia, IEEE Transactions on, 2010.