A Unified Framework for Retrieving Diverse Social Images

Maia Zaharieva¹,² and Patrick Schwab¹
¹ Multimedia Information Systems, Faculty of Computer Science, University of Vienna, Austria
² Interactive Media Systems, Institute of Software Technology and Interactive Systems, Vienna University of Technology, Austria
maia.zaharieva@[univie|tuwien].ac.at, a0927193@unet.univie.ac.at


ABSTRACT
In this paper we explore the performance of a generic, unified framework for the retrieval of relevant and diverse images from social photo collections. The approach allows for the easy evaluation of different visual and textual image descriptions, clustering algorithms, and similarity metrics. Preliminary results show a strong dependence between the achieved performance and the choice of underlying technology and similarity metric.

Copyright is held by the author/owner(s).
MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain

1.   INTRODUCTION
   The immense daily growth of publicly shared media bears both potential and challenges for automated media analysis and understanding. Currently, image access and retrieval are usually based on user-provided and camera-generated metadata. Although increasingly available, such metadata often suffers from limitations such as imprecise capture time and GPS information or misleading and ambiguous textual descriptions. Additionally, the large number of shared items often results in a large amount of visually highly similar data. This challenge is the focus of the MediaEval 2014 Retrieving Diverse Social Images Task [2]. The aim of the task is the refinement of location images retrieved from Flickr while taking into consideration both their relevance and diversity.
   Previous work in the context of this task shows a broad field of possible approaches ranging from re-ranking and clustering to greedy optimization and graph representations [3]. Several authors propose different systems for different feature types (e.g., [1][5]), which impedes reasoning about the selection of an approach or of particular features. Furthermore, some methods build upon assumptions that hold true in a limited setting only (e.g., that the relevance of an image is related to the number of views or the length of the descriptions [4]). While most of the presented approaches employ a combination of a re-ranking method (for relevance improvement) and a clustering method (for ensuring diversification), we build a unified framework that allows for a thorough evaluation of various textual and visual features, clustering algorithms, and similarity metrics.

2.   APPROACH
   We employ a multi-stage approach for the retrieval of diverse social images. The workflow passes through three main stages: 1) relevance ranking of input images, 2) image clustering for diversification, and 3) final image selection. The initial set of input images may optionally be pre-processed in order to filter potentially irrelevant images, such as images with a human as the main subject.
   In the first stage, relevance ranking, each image of the input set is first represented by a feature vector $\vec{v}$, where $\vec{v}$ is a concatenation of the standardized z-scores of the feature descriptors $d_1, \ldots, d_n$:

   $$\vec{v} = \mathrm{zscore}(d_1, \ldots, d_n) \qquad (1)$$

Since the provided Wikipedia photos are by definition representative [2], we additionally compute a representative feature vector $\vec{v}_r$ for each referenced Wikipedia image. The relevance score $s$ of an image is then defined as the smallest distance between its feature vector $\vec{v}$ and all $\vec{v}_r$ from the set of representative feature vectors $W$:

   $$s = \min_{\vec{v}_r \in W} \mathrm{distance}(\vec{v}, \vec{v}_r) \qquad (2)$$

   The aim of the second stage, image clustering, is to find groups of similar images that can be used to diversify the final retrieval results. Note that the distance measures and image features at this step are not necessarily the same ones employed for relevance ranking.
   The third and last stage of the approach, final image selection, combines the results of the previous steps to retrieve images that are both relevant and diverse with respect to the initial image set. For this stage we use a round-robin algorithm. We start by selecting the image with the best relevance score from each cluster. These images, sorted in ascending order of their relevance score $s$ (a smaller distance indicates higher relevance), constitute the m highest ranked results, where m is the number of detected clusters. The selected images are removed from their corresponding clusters and the selection process is repeated until the required number of retrieved results is reached.
   In general, the clustering algorithm, the metric used to compare the feature vectors, and the underlying image features (for both image ranking and image clustering) can be chosen freely. In our experiments we tested different clustering algorithms: k-means, Adaptive Hierarchical Clustering (AHC), MeanShift, and Lingo; several comparison metrics: Euclidean, city-block, χ2, cosine, correlation, Mahalanobis, Spearman, Hamming, and Jaccard; and all visual and textual features provided by the organizers [2]: term frequency-inverse document frequency (TF-IDF), Color Naming Histogram (CN), Histogram of Oriented Gradients (HOG), Color Moments (CM), Local Binary Patterns (LBP), Statistics of Gray Level Run Length Matrix (GLRLM), and the corresponding spatial pyramid representations (3x3), in addition to Bag-of-Visual-Words (BoVW) of dense SIFT descriptors.

Table 1: Best feature-metric combinations for AHC.

        Relevance ranking                 Image clustering
  Rank  Feature     Metric          Rank  Feature     Metric
  1     CM3x3       Euclidean       1     CM          χ2
  2     TF-IDF      Spearman              CM3x3       Euclidean
        SIFT        Euclidean       2     HOG         cosine
  3     CM          χ2              3     GLRLM3x3    χ2
        LBP         χ2                    LBP3x3      χ2
        LBP3x3      χ2              4     GLRLM       Euclidean
  4     HOG         cosine                LBP         χ2
        GLRLM       χ2                    SIFT        Euclidean
        GLRLM3x3    χ2              5     CSD         cosine
  5     CSD         cosine                CN          Euclidean
        CN          correlation           CN3x3       Euclidean
  6     CN3x3       Euclidean       6     TF-IDF      Euclidean

Table 2: Official runs configurations (V: visual, T: textual descriptors employed).

  Run         Relevance ranking   Image clustering
  run1 (V)    CM3x3               SIFT
  run2 (T)    TF-IDF              TF-IDF
  run3 (VT)   TF-IDF              CSD
  run5 (V)    CM3x3               CSD

Table 3: Evaluation results.

          Development dataset          Test dataset
  Run     CR@20    P@20     F1@20      CR@20    P@20     F1@20
  run1    0.4426   0.7600   0.5552     0.3901   0.6646   0.4863
  run2    0.4132   0.7250   0.5188     0.3909   0.6809   0.4888
  run3    0.4484   0.7567   0.5559     0.3982   0.6732   0.4949
  run5    0.4369   0.7617   0.5499     0.3915   0.6752   0.4897
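The three stages of Section 2 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the z-scores are assumed to be computed per descriptor dimension over the whole input set, the Euclidean distance stands in for the configurable metric of Eq. (2), and the cluster labels are assumed to come from any clustering algorithm (stage 2 itself is not shown).

```python
import math
from statistics import mean, pstdev


def zscore_concat(descriptors):
    """Eq. (1): z-score every descriptor dimension, then concatenate per image.

    descriptors[j][i] is the vector of descriptor j for image i.
    Assumption: mean and standard deviation are taken over all input images.
    """
    num_images = len(descriptors[0])
    vectors = [[] for _ in range(num_images)]
    for table in descriptors:
        for column in zip(*table):        # one dimension across all images
            mu, sigma = mean(column), pstdev(column)
            sigma = sigma or 1.0          # constant dimension: avoid division by zero
            for i, x in enumerate(column):
                vectors[i].append((x - mu) / sigma)
    return vectors


def relevance_score(v, representatives):
    """Eq. (2): smallest distance between v and any representative vector.

    Euclidean distance is an assumption; the paper treats the metric as a parameter.
    """
    return min(math.dist(v, vr) for vr in representatives)


def round_robin_select(scores, labels, k):
    """Stage 3: per round, take the best remaining image of every cluster.

    scores[i] is the relevance score of image i (lower is better) and
    labels[i] its cluster label. Returns the indices of the k selected images.
    """
    clusters = {}
    for i in sorted(range(len(scores)), key=lambda i: scores[i]):
        clusters.setdefault(labels[i], []).append(i)   # members sorted by score
    result = []
    while len(result) < k and any(clusters.values()):
        heads = [members.pop(0) for members in clusters.values() if members]
        heads.sort(key=lambda i: scores[i])            # ascending within the round
        result.extend(heads)
    return result[:k]
```

For example, `round_robin_select([0.9, 0.1, 0.5, 0.3, 0.7], [0, 0, 1, 1, 2], 4)` returns `[1, 3, 4, 2]`: the first round contributes the best image of each of the three clusters, and the second round adds the best of the remaining images.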
3.   EXPERIMENTS AND RESULTS
   In our first experiments we compared the performance of the different clustering algorithms. Results on the development dataset showed that AHC significantly outperforms k-means, MeanShift, and Lingo for all explored features (t-test, p < 0.001). Thus, we employed AHC in all follow-up experiments.
   We conducted a thorough evaluation of the performance of the employed features at the two main stages of our approach: relevance ranking and image clustering. Table 1 summarizes the results by means of ranked feature lists. The reported feature rankings and the selection of the corresponding best-performing distance measures are the product of significance t-tests with overall p < 0.003. While the Color Naming Histograms (CN and CN3x3) are usually outperformed by any other feature, the Color Moments (CM and CM3x3) show robust performance in both the ranking and the clustering tasks. In contrast to the ranking, which is clearly dominated by the performance of CM3x3, TF-IDF, and SIFT, image clustering using AHC is more robust and the difference in the performance of global and local features decreases to a large extent.
   Eventually, we submitted four runs for the final evaluation (see Table 2 for the configurations). Table 3 shows the results for the submitted runs on both the development and test datasets. The best cluster recall and F1 scores on both datasets are achieved by the combination of textual and visual information (run3). However, on the test dataset the differences between the runs largely vanish (F1@20 between 0.4863 and 0.4949). Overall, cluster recall (CR) remains relatively low due to the large number of irrelevant images building noisy clusters. In general, the achieved results outline the limitations of the available textual (and visual) information in assessing image relevance. This is mainly because user-provided textual descriptions on social media sites often contain ambiguous or irrelevant information. A possible way to improve the results is to consider the occasionally available GPS data and to employ external resources as an additional source of information.

4.   CONCLUSION
   In this paper we presented a generic, unsupervised framework for the evaluation of various visual and textual features, similarity metrics, and clustering approaches for the retrieval of diverse social images. The performed experiments aim at evaluating the potential and limitations of the provided visual and textual descriptions and, thus, we refrain from employing any assumptions or external sources of information. Although there are significant differences in the performance of single features, the top-performing features prove to be highly interchangeable. The achieved results indicate that, for the given datasets, the crucial part of the process is not so much the diversification as the assessment of image relevance.

Acknowledgment
This work has been partly funded by the Vienna Science and Technology Fund (WWTF) through project ICT12-010.

5.   REFERENCES
[1] D. Corney, C. Martin, A. Göker, E. S. Xioufis, S. Papadopoulos, Y. Kompatsiaris, L. M. Aiello, and B. Thomee. SocialSensor: Finding diverse images at MediaEval 2013. In MediaEval 2013 Workshop, 2013.
[2] B. Ionescu, A. Popescu, M. Lupu, A. L. Gînscă, and H. Müller. Retrieving diverse social images at MediaEval 2014: Challenge, dataset, and evaluation. In MediaEval 2014 Workshop, 2014.
[3] B. Ionescu, A. Popescu, H. Müller, M. Menéndez, and A.-L. Radu. Benchmarking result diversification in social image retrieval. In IEEE International Conference on Image Processing, 2014.
[4] N. Jain, J. Hare, S. Samangooei, J. Preston, J. Davies, D. Dupplaw, and P. H. Lewis. Experiments in diversifying Flickr result sets. In MediaEval 2013 Multimedia Benchmark Workshop, 2013.
[5] B. Vandersmissen, A. Tomar, F. Godin, W. D. Neve, and R. V. de Walle. Ghent University-iMinds at MediaEval 2013 Diverse Images: Relevance-based hierarchical clustering. In MediaEval 2013 Workshop, 2013.