MUCKE Participation at the Retrieving Diverse Social Images Task of MediaEval 2013

Anil Armagan (Bilkent University, Department of Computer Engineering, 06800 Ankara, Turkey) anil.armagan@bilkent.edu.tr
Adrian Popescu (CEA, LIST, Vision & Content Engineering Laboratory, 91190 Gif-sur-Yvette, France) adrian.popescu@cea.fr
Pinar Duygulu (Bilkent University, Department of Computer Engineering, 06800 Ankara, Turkey) duygulu@cs.bilkent.edu.tr

ABSTRACT
The MediaEval 2013 Retrieving Diverse Social Images Task addresses the challenge of improving both the relevance and the diversity of photos in a retrieval task on Flickr. We propose a clustering-based technique that exploits both textual and visual information. We introduce a k-Nearest Neighbor (k-NN) inspired re-ranking algorithm that is applied before clustering to clean the dataset. After the clustering step, we exploit social cues to rank clusters by social relevance. From the ranked clusters, images are retrieved according to their distance to the cluster centroids.

1. INTRODUCTION
Photo sharing is a frequent activity on social media platforms such as Flickr and Facebook. Since there is no quality control over the photos or their annotations, the images retrieved for a given query are usually not sufficiently relevant. Even when the relevance criterion is met, it is still difficult to obtain diversified results [4].
Considering these challenges, the MediaEval 2013 Retrieving Diverse Social Images Task addresses the problem of diversifying social image search results (see [2] for a detailed description of the task and the dataset).
The dataset used in the task is collected from Flickr. Annotations are usually limited to a small number of tags and are therefore incomplete [3]. Flickr images also come with social cues, such as user IDs.
In the following, we summarize our efforts to solve the challenges mentioned above by exploiting both textual and visual features, as well as social cues.

2. FEATURES
2.1 Visual Features
We exploit visual features in order to overcome the sparsity of the textual annotations, since there are only a few tags per image. We use Histograms of Oriented Gradients (HOG) in our experiments. In addition, we extracted GIST and bag-of-visual-words (BOVW) features based on dense SIFT, which have proved efficient in large-scale image retrieval. Dense SIFT descriptors are quantized using a codebook of size 1024. A spatial pyramid model with 2 levels is used, and the resulting feature size is 8192. HOG, GIST and BOVW features capture different low-level characteristics of images and can be combined into more comprehensive visual representations. Since these features are combined in different kinds of runs, all features were L1-normalized so that each feature contributes equally, regardless of its size.

2.2 Textual Features
We exploit a classical TF-IDF weighting scheme to model the textual information associated with points of interest (POIs). Different re-weighting schemes were tested, and the best results were obtained when we took the square root of the TF-IDF scores. The dimension of the model is equal to the number of unique tags associated with each POI. Given the limited number of images per POI (up to 150), the dimension is usually in the range of hundreds. As with the visual features, L1-normalization is applied to the textual features.

3. RESULT RERANKING
The initial Flickr result set is noisy, and we introduce a k-NN inspired approach that exploits visual and social cues to rerank the results and reduce this noise. We consider all the images of a POI as a positive set and construct a negative set of the same size by sampling images of other POIs from the collection. We then compare the GIST features of each image to the features of all other images in the positive and negative sets. The top five most visually similar images are retained for the next steps.
The top five neighbors, drawn from both the positive and the negative sets, of each image depicting a POI are used to produce a reranked result list according to social cues. For each image, we count the number of different users that contributed to the top five neighbors and the number of top five neighbors that belong to the positive set. Finally, the average distance of the target image to its first five positive neighbors is considered. Images are then ranked by cascading these three scores, each score breaking ties in the previous one, and we experimented with keeping the top 70%, 80% and 90% of the reranked images as input for the clustering process. The best results are obtained on the devset when 70% of the initial list is retained, and this threshold is kept for clustering.

4. CLUSTERING
The k-means++ algorithm [1] is used to cluster the images of a topic, using the previously mentioned feature types for the different kinds of runs. Different values of k were used in the experiments: 10, 15 and 20.

Table 1: Run performances with the three official metrics, using the expert ground truth
Run name                   CR@10   P@10    F1@10
Run1 - visual              0.3892  0.7243  0.4905
Run2 - textual             0.3869  0.7333  0.4890
Run3 - textual-visual      0.3848  0.7272  0.4868
Run5 - everything allowed  0.3742  0.7161  0.4753

Table 2: Run performances with the three official metrics, using the crowd-sourcing ground truth
Run name                   CR@10   P@10    F1@10
Run1 - visual              0.7446  0.7449  0.7135
Run2 - textual             0.7406  0.7204  0.6997
Run3 - textual-visual      0.7503  0.7245  0.7050
Run5 - everything allowed  0.7332  0.7184  0.6951

Copyright is held by the author/owner(s).
MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain
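To make the textual model of Section 2.2 concrete, here is a minimal Python sketch. It is an illustrative reconstruction, not the authors' code: the exact TF and IDF formulas, the `build_tag_model` name and the data layout are our assumptions, since the paper only specifies the square-root re-weighting and the L1 normalization.

```python
import math
from collections import Counter

def build_tag_model(poi_image_tags, collection_df, n_docs):
    """Square-rooted TF-IDF over the unique tags of one POI, L1-normalized.

    poi_image_tags: list of tag lists, one per image of the POI.
    collection_df:  document frequency of each tag in the whole collection.
    n_docs:         total number of documents in the collection.
    """
    tf = Counter(tag for tags in poi_image_tags for tag in tags)
    weights = {}
    for tag, freq in tf.items():
        idf = math.log(n_docs / (1 + collection_df.get(tag, 0)))
        # best-performing re-weighting in the paper: sqrt of TF-IDF
        weights[tag] = math.sqrt(max(freq * idf, 0.0))
    total = sum(weights.values()) or 1.0
    return {tag: w / total for tag, w in weights.items()}  # L1 normalization
```

The model dimension is the number of unique tags of the POI, matching the "range of hundreds" noted in the text.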
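The k-NN inspired re-ranking of Section 3 can be sketched as follows, again as an assumption-laden reconstruction rather than the authors' implementation: Euclidean distance on GIST features, the function signature and the exact cascading order (distinct users, then positive neighbors, then lower average positive distance) are our reading of the text.

```python
import numpy as np

def rerank(pos_feats, neg_feats, pos_users, k=5, keep=0.7):
    """Rerank the positive images of a POI and keep the top fraction.

    pos_feats / neg_feats: (n, d) GIST feature arrays, same n for both sets.
    pos_users: user ID of each positive image.
    """
    all_feats = np.vstack([pos_feats, neg_feats])
    n_pos = len(pos_feats)
    scores = []
    for i in range(n_pos):
        d = np.linalg.norm(all_feats - pos_feats[i], axis=1)
        d[i] = np.inf                       # exclude the image itself
        nn = np.argsort(d)[:k]              # k nearest neighbors, both sets
        n_from_pos = int(np.sum(nn < n_pos))
        users = len({pos_users[j] for j in nn if j < n_pos})
        # average distance to the first k positive neighbors (self excluded)
        pos_d = sorted(np.linalg.norm(pos_feats - pos_feats[i], axis=1))[1:k + 1]
        scores.append((users, n_from_pos, -float(np.mean(pos_d)), i))
    # cascaded criteria: later scores only break ties in earlier ones
    ranked = [i for *_, i in sorted(scores, reverse=True)]
    return ranked[: int(round(keep * n_pos))]
```

With `keep=0.7` this reproduces the 70% threshold retained for clustering.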
The value k = 15 is selected, since it gives the best results on the devset in terms of precision, recall and F1-measure at cutoff 10; k = 15 is also close to the cutoff of the official metrics (CR@10, P@10, F1@10) used in the evaluation.

5. CLUSTER AND RESULT RANKING
Clusters are not all born equal, and we need to rank them by the probability that the images they contain are relevant. Inspired by [4], we exploit social cues for cluster ranking and propose a simple scheme based on user and date information. For each cluster, we count the number of different users that contributed to the cluster and the number of different dates on which the photos of the cluster were taken. The first count prioritizes clusters that are socially diverse, while the second surfaces clusters that are temporally stable. We then take the product of these two counts as a social ranking score. To break ties, we also use the number of images present in each cluster.
For each POI, we retain only the top 10 clusters obtained with this ranking procedure and then diversify the results by choosing one image from each cluster, in descending order of similarity to the cluster centroid.

6. RESULTS AND DISCUSSION
To address the diversified social image retrieval problem, participants are asked to submit different types of runs. We submitted four runs, produced by using different types of features and their combinations on the same dataset. The four submitted runs are described below. Run1 - visual is based on visual features: we concatenate the HOG and GIST features described in Section 2.1. Run2 - textual uses only the textual features described in Section 2.2. Run3 - textual-visual combines the textual features of Section 2.2 with visual features from Section 2.1: visual and textual features are concatenated to produce the feature vectors, using linear weights of 0.7 and 0.3 for the visual and textual parts respectively. These weights were chosen empirically by testing different combinations on the devset. Run5 - everything allowed is similar to Run3 - textual-visual, the only modification being the replacement of the HOG features by BOVW features.
The results in Table 1 are based on expert evaluation of the 346 test-set locations. Table 2 shows the average results of a crowd-sourcing evaluation carried out, via the CrowdFlower platform, on a subset of only 50 locations from the test set. The relevance ground truth is based on majority voting over the annotations, and the diversity ground truth is determined by the same three annotators. Both tables show that there are only small differences between the four runs. We were surprised to see how well the textual run (Run2) performed compared to the visual and multimodal runs, since we had expected visual diversification to work better for POIs, which usually have a limited number of visual aspects. The performance drop from Run3 to Run5, due to the replacement of HOG features with BOVW features, also came as a surprise, since the latter usually work well for retrieval over visually diversified datasets.

7. CONCLUSION
In this paper we proposed a clustering-based technique that uses textual and visual features. To re-rank the images before clustering, we introduced a k-NN inspired technique that helped us spot the strongly connected images of a topic. We further applied a cluster ranking method based on social cues to increase both diversity and relevance. Tables 1 and 2 show that there are no big differences between the runs. The performance of Run2 - textual and Run5 - everything allowed surprised us: since BOVW representations generally work well for retrieval, we would have expected the visual representations to improve the results more than the textual features did.

8. ACKNOWLEDGMENT
This research was supported by the MUCKE project, funded within the FP7 CHIST-ERA scheme, and by the Scientific and Technological Research Council of Turkey (TUBITAK) under grant number 112E174.

9. REFERENCES
[1] D. Arthur and S. Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, SODA '07, pages 1027-1035, Philadelphia, PA, USA, 2007. Society for Industrial and Applied Mathematics.
[2] B. Ionescu, M. Menendez, H. Muller, and A. Popescu. Retrieving diverse social images at MediaEval 2013: objectives, dataset and evaluation. In MediaEval 2013 Workshop, CEUR-WS.org, ISSN 1613-0073, Barcelona, Spain, October 18-19, 2013.
[3] L. S. Kennedy, S.-F. Chang, and I. V. Kozintsev. To search or to label?: predicting the performance of search-based automatic image classifiers. In Proceedings of the 8th ACM international workshop on Multimedia information retrieval, MIR '06, pages 249-258, New York, NY, USA, 2006. ACM.
[4] L. S. Kennedy and M. Naaman. Generating diverse and representative image search results for landmarks. In Proceedings of the 17th international conference on World Wide Web, WWW '08, pages 297-306, New York, NY, USA, 2008. ACM.
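As an illustration of the cluster ranking and diversification scheme of Section 5, here is a minimal sketch. The data layout, the function name and the round-robin emission over the ranked clusters are our assumptions (the paper says one image is chosen from each cluster by descending centroid similarity, but does not spell out the interleaving); the clustering itself, e.g. k-means++, is assumed to have been run already.

```python
def rank_and_diversify(clusters, centroid_sim, top_clusters=10):
    """Rank clusters by social cues, then emit a diversified image list.

    clusters:     cluster id -> list of (image_id, user_id, date) tuples.
    centroid_sim: image_id -> similarity to its cluster centroid.
    """
    def social_score(item):
        _, imgs = item
        users = len({u for _, u, _ in imgs})
        dates = len({d for _, _, d in imgs})
        # product of user and date counts; cluster size breaks ties
        return (users * dates, len(imgs))

    ranked = sorted(clusters.items(), key=social_score, reverse=True)
    ranked = ranked[:top_clusters]                  # top 10 clusters per POI
    queues = [sorted(imgs, key=lambda x: centroid_sim[x[0]], reverse=True)
              for _, imgs in ranked]
    result = []
    while any(queues):                              # one image per cluster per pass
        for q in queues:
            if q:
                result.append(q.pop(0)[0])
    return result
```

Each pass over the queues takes the next-best image from each surviving cluster, so the head of the output list mixes the top 10 clusters.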
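The feature fusion used for Run3 in Section 6 (L1 normalization per modality followed by weighted concatenation) can be sketched as below; the `combine` helper and the application of the weight after normalization are our assumptions, with only the 0.7/0.3 weights taken from the paper.

```python
import numpy as np

def combine(visual, textual, w_vis=0.7, w_txt=0.3):
    """Concatenate L1-normalized visual and textual vectors with linear weights."""
    def l1(v):
        v = np.asarray(v, dtype=float)
        s = np.abs(v).sum()
        return v / s if s else v
    # normalize first so each modality contributes regardless of its size
    return np.concatenate([w_vis * l1(visual), w_txt * l1(textual)])
```

Because each modality sums to 1 after normalization, the combined (non-negative) vector sums to w_vis + w_txt.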