USEMP: Finding Diverse Images at MediaEval 2015

E. Spyromitros-Xioufis1, A. Popescu2, S. Papadopoulos1, I. Kompatsiaris1
1 CERTH-ITI, Thermi-Thessaloniki, Greece, {espyromi,papadop,ikom}@iti.gr
2 CEA, LIST, 91190 Gif-sur-Yvette, France, adrian.popescu@cea.fr

Copyright is held by the author/owner(s). MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany

ABSTRACT
We describe the participation of the USEMP team in the Retrieving Diverse Social Images Task of MediaEval 2015. Our runs are produced with a supervised diversification method that jointly optimizes relevance and diversity. All runs are automated and use only resources provided by the task organizers. On the one-topic part of the test set, our best results in terms of the official ranking metric were obtained by the runs that combine visual and textual information, while the textual-only run performed better on the multi-topic part.

1. INTRODUCTION
The Retrieving Diverse Social Images task of MediaEval 2015 [7] deals with the problem of result diversification in social image retrieval. This year there are two notable differences from previous editions: (1) a larger development set is available and (2) "multi-concept" queries, which are mainly related to events rather than specific places or landmarks, were introduced.
We deal with the task using supervised Maximal Marginal Relevance (sMMR) [10], a refined version of the supervised diversification method that we developed in [12]. sMMR and earlier versions of the method are discussed in Section 2. Section 3 gives further details about our methodology and describes modifications compared to [10]. Section 4 provides descriptions of the employed features and Section 5 describes the submitted runs. Finally, Section 6 presents and discusses the experimental results.

2. PREVIOUS WORK
In the 2013 edition of the task, the SocialSensor team developed a supervised diversification method [3] and applied it to produce the visual-only run that achieved the best performance among runs of this type. Similarly to previous diversification methods [2, 5], that method greedily optimized a utility function that jointly accounts for relevance and diversity. The main difference compared to earlier approaches was the replacement of the unsupervised definition of relevance with a task-specific definition that is learned directly from the ground truth. More specifically, instead of computing an image's relevance score by measuring its similarity to a reference image (e.g., the Wikipedia image of a query topic), the approach exploited the available relevance annotations (for the images of the development set) to train a classifier which was then used to assign relevance scores to images of unseen queries. Although that classifier was not adapted to any particular query, it could accurately estimate relevance by capturing its task-specific notion.
In 2014, the SocialSensor team refined their approach [12], topping the scoreboard in several categories: best visual-only run, third textual run and best visual+textual run, which was also ranked second overall, slightly surpassed by a run that used specialized filters (face/blur detectors) and user credibility information [4]. The main addition compared to [3] was that a different relevance classifier was trained for each query, using the query's Wikipedia images as additional positive examples. These examples were assigned a larger weight to have increased influence on the learned model. Thus, the query-specific notion of relevance was captured in addition to the task-specific notion captured in [3].
The approach was evaluated in more detail in [10], identifying a link between relevance detection accuracy and diversification performance. Furthermore, a multimodal ensemble classifier, called Multi-Modal Stacking (MMS), was proposed for combining different types of features for relevance detection in a principled manner. Due to the addition of this multimodal scheme and to the use of state-of-the-art convolutional neural network features for relevance detection, [10] managed to achieve a 5.7% relative increase over the best result obtained in the 2014 edition of the task [4].

3. METHOD
Given the effectiveness of sMMR in previous editions of the task, we opted for applying it in this year's task as well. In particular, we applied the sMMRaq variant: the relevance detection model for each query was trained using relevant and irrelevant examples from other queries, combined with representative examples of the query itself (in the form of either the corresponding Wikipedia images or the Wikipedia page). For multi-topic queries, which were not accompanied by representative Wikipedia images, the visual relevance models were trained using only examples from other queries (sMMRa variant).
To further improve the relevance detection models compared to [10], we performed careful tuning of two parameters: a) the number no of examples from other queries employed by each model and b) the ratio r = ne/no, defined1 as the number of examples of this query divided by the number of examples from other queries, by performing a grid search over the values no = {1K, 5K, 10K, 20K} and r = {0.0, 0.1, ..., 1.0}. Model selection was based on area under ROC (AUC), computed using a modified leave-one(-query)-out cross-validation procedure. For each query of the development set, no examples were randomly selected from other queries and combined (if possible) with ne representative (Wikipedia) examples of that query to build a model that was evaluated on the remaining (Flickr) examples of that query. The per-query AUC scores were then averaged to obtain a single estimate. As in [10], an L2-regularized Logistic Regression classifier was used [6] with appropriate tuning of the c parameter. Besides the parameters of the relevance model, we also tuned the w and N parameters of the sMMR approach, so as to maximize F1@20 on the development set, as done in [12].

1 ne is modified by repeating each representative example ne/n*e times, where n*e is the actual number of representative examples.
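For illustration, the greedy optimization underlying sMMR can be sketched as follows. This is a minimal MMR-style re-ranker in the spirit of [2]: at each step it selects the image that maximizes a weighted combination of relevance and dissimilarity to the images already selected. Function and variable names are ours and the utility is a simplified stand-in, not the exact formulation used in our runs.

```python
def greedy_diversify(relevance, similarity, w=0.5, k=20):
    """Greedy MMR-style selection (illustrative sketch, not the authors' code).

    relevance: dict mapping image id -> relevance score in [0, 1]
    similarity: dict mapping frozenset({a, b}) -> pairwise similarity in [0, 1]
    w: trade-off weight between relevance and diversity
    k: number of images to return
    """
    candidates = set(relevance)
    selected = []
    while candidates and len(selected) < k:
        def utility(img):
            # Redundancy = similarity to the closest already-selected image.
            redundancy = max((similarity[frozenset((img, s))] for s in selected),
                             default=0.0)
            return w * relevance[img] - (1 - w) * redundancy
        best = max(candidates, key=utility)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With two near-duplicate relevant images and one distinct image, this sketch places the distinct image in second position even though the duplicate has a higher relevance score, which is the intended relevance/diversity trade-off.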
4. FEATURES
VLAD: VLAD+CSURF [11] vectors were computed from a 128-dimensional visual vocabulary and projected to 128 dimensions with PCA and whitening. Both the visual vocabulary and the PCA projection matrix are learned using the images of the development set.
CNN: Convolutional neural network features were adapted to the tourism use case using ≈1,000 Points Of Interest (POIs) instead of ImageNet classes. These features were computed by fine-tuning the VGG model proposed in [9]. Approximately 1,200 images were collected for each POI and fed directly to Caffe [8] for training. This change of training classes was inspired by recent domain adaptation work presented in [1], which shows that feature transfer is more efficient when the training classes are conceptually close to the target dataset. The features consist of the outputs of the fc7 layer and have 4,096 dimensions.
BOW: To generate textual features, we transformed each query and each Flickr image into a text document. For queries, we used a parsed version of the corresponding Wikipedia page and for Flickr images we used a concatenation of the words in their titles, descriptions and tags. Bag-of-words (BOW) features were then computed for each document, using all terms that appear at least twice in the collection to form the dictionary, and word frequencies as term weights. This led to an 80K-dimensional representation.
META: The following one-dimensional features were also computed from textual metadata and used as additional features in the meta input space of the MMS algorithm: distance from POI (only for one-topic queries) and Flickr rank.
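As an illustration of the BOW representation described above, the following sketch builds the dictionary from terms occurring at least twice across the collection and weights terms by raw frequency. It is a simplified re-implementation for clarity; the tokenization and dictionary-building details of our actual system differ.

```python
from collections import Counter

def build_bow(documents):
    """Bag-of-words sketch (illustrative, not the authors' code).

    documents: list of strings (one text document per query/image).
    Returns (vocab, vectors): the dictionary, restricted to terms that occur
    at least twice in the whole collection, and one frequency vector per
    document.
    """
    # Count term occurrences over the entire collection.
    counts = Counter(tok for doc in documents for tok in doc.lower().split())
    # Keep only terms appearing at least twice, in a fixed (sorted) order.
    vocab = sorted(t for t, c in counts.items() if c >= 2)
    index = {t: i for i, t in enumerate(vocab)}
    vectors = []
    for doc in documents:
        vec = [0] * len(vocab)
        for tok in doc.lower().split():
            if tok in index:
                vec[index[tok]] += 1  # word frequency as term weight
        vectors.append(vec)
    return vocab, vectors
```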
5. RUNS
Run 1: CNN features were used for relevance and VLAD features for diversity. no was set to 20K in both instantiations and r was set to 0.5 for the one-topic instantiation. {w = 0.55, N = 170} and {w = 0.00, N = 120} were used for the one-topic and the multi-topic instantiation respectively.
Run 2: BOW features were used for both relevance and diversity. The following parameters were used: no = 20K, r = 0.4, N = 80 and w = 0.8.
Run 3: A different instantiation was used for each part of the collection. MMS was used to combine the outputs of relevance detection models built using CNN and BOW features with one-dimensional META features, and VLAD features were used for diversity in both instantiations. {w = 0.50, N = 220} and {w = 0.55, N = 170} were used for the one-topic and the multi-topic instantiation respectively.
Run 5: This is a variation of run 3 where we use the same relevance detection models but also tune the M parameter of the approach in addition to N and w, as done in [12]. This resulted in setting {w = 0.4, N = 210, M = 2} for the one-topic instantiation and {w = 0.4, N = 300, M = 5} for the multi-topic instantiation.
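The MMS combination used in runs 3 and 5 is, in essence, a second-level classifier trained on the outputs of per-modality relevance models together with the one-dimensional META features. The sketch below illustrates this kind of stacked late fusion with a plain logistic regression trained by batch gradient descent; this stacker, its hyperparameters and all names are illustrative assumptions, not the implementation described in [10] (which uses L2-regularized logistic regression via LIBLINEAR [6]).

```python
import math

def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_stacker(meta_inputs, labels, lr=0.5, epochs=2000):
    """Train an (unregularized) logistic-regression stacker.

    meta_inputs: rows of base-model scores + META features, e.g.
                 [cnn_score, bow_score, distance_from_poi, flickr_rank]
    labels: 1 for relevant, 0 for irrelevant.
    Illustrative sketch only, not the authors' MMS implementation.
    """
    n_dim = len(meta_inputs[0])
    w, b = [0.0] * n_dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_dim, 0.0
        for x, y in zip(meta_inputs, labels):
            err = _sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        # Averaged-gradient update.
        w = [wi - lr * gi / len(labels) for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b / len(labels)
    return w, b

def stacked_score(w, b, x):
    """Fused relevance score produced by the second-level model."""
    return _sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
```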
Table 1: Estimated (one-topic/multi-topic) and final performance of the submitted runs.

                    Development Set                         | Test Set (One-Topic)  | Test Set (Multi-Topic) | Test Set (Overall)
Run  auc          p@20         cr@20        f1@20          | p@20   cr@20  f1@20   | p@20   cr@20  f1@20    | p@20   cr@20  f1@20
1    0.821/0.773  0.860/0.763  0.489/0.468  0.616/0.573    | 0.805  0.478  0.587   | 0.598  0.453  0.499    | 0.701  0.465  0.542
2    0.688        0.836        0.459        0.586          | 0.824  0.455  0.569   | 0.734  0.442  0.530    | 0.779  0.448  0.549
3    0.857/0.816  0.893/0.840  0.515/0.489  0.646/0.609    | 0.833  0.504  0.618   | 0.617  0.408  0.471    | 0.724  0.456  0.544
5    0.857/0.816  0.877/0.823  0.526/0.499  0.650/0.613    | 0.802  0.509  0.611   | 0.608  0.417  0.474    | 0.704  0.462  0.542

6. RESULTS AND DISCUSSION
Table 1 shows the performance of the submitted runs on each part of the test collection, together with estimates of their performance obtained from the development set. The best overall performance on the test set is obtained by run 2. We observe that performance is much better on the one-topic part. This was expected, given that model and parameter tuning was performed on the development set, which did not contain examples of multi-topic queries. The best performance in terms of F1@20 on the one-topic part is obtained by runs 3 and 5, followed by run 1 and then run 2. Although slightly over-optimistic (≈5% on average), our F1@20 estimates for the one-topic part are strongly correlated with the final results and are indicative of the relative run strength.
On the multi-topic part, the best performance is obtained by run 2, followed by run 1 and then runs 3 and 5. The superiority of run 2 over run 1 on this part of the collection is attributed to the fact that representative examples of multi-topic queries were available only in textual form. Comparing the final results on this part with our estimates, we see very poor correlation2. This suggests that this part of the collection has significantly different characteristics from the development set and that performing model selection and parameter tuning on the development set was not helpful. We expect that better results could have been achieved on the multi-topic part, had the development set contained queries of this type.

2 Nevertheless, these estimates could serve as an indication of the performance of a system that has no access to the Wikipedia images of the one-topic queries.
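For reference, the metrics reported above can be computed as follows. This is a sketch of the standard definitions of precision at cutoff k, cluster recall at cutoff k and their harmonic mean (F1@k); the official task evaluation tooling remains authoritative.

```python
def precision_at_k(ranked, relevant, k=20):
    """Fraction of the top-k ranked images that are relevant."""
    return sum(1 for img in ranked[:k] if img in relevant) / k

def cluster_recall_at_k(ranked, clusters, k=20):
    """Fraction of ground-truth clusters represented in the top-k results.

    clusters: dict mapping each relevant image to its cluster id.
    """
    total = len(set(clusters.values()))
    found = {clusters[img] for img in ranked[:k] if img in clusters}
    return len(found) / total

def f1_at_k(ranked, relevant, clusters, k=20):
    """Harmonic mean of precision@k and cluster recall@k."""
    p = precision_at_k(ranked, relevant, k)
    cr = cluster_recall_at_k(ranked, clusters, k)
    return 0.0 if p + cr == 0 else 2 * p * cr / (p + cr)
```

A ranking can thus score high on precision while scoring low on cluster recall (e.g., by returning many relevant images from a single cluster), which is exactly the behavior the diversification step is meant to avoid.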
7. ACKNOWLEDGEMENTS
This work is supported by the USEMP FP7 project, partially funded by the EC under contract number 611596.

8. REFERENCES
[1] A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky. Neural codes for image retrieval. In ECCV, 2014.
[2] J. Carbonell and J. Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In ACM SIGIR, 1998.
[3] D. Corney, C. Martin, A. Göker, E. Spyromitros-Xioufis, S. Papadopoulos, Y. Kompatsiaris, L. Aiello, and B. Thomee. SocialSensor: Finding diverse images at MediaEval 2013. In MediaEval, 2013.
[4] D.-T. Dang-Nguyen, L. Piras, G. Giacinto, G. Boato, and F. De Natale. Retrieval of diverse images by pre-filtering and hierarchical clustering. In MediaEval, 2014.
[5] T. Deselaers, T. Gass, P. Dreuw, and H. Ney. Jointly optimising relevance and diversity in image retrieval. In ACM CIVR, 2009.
[6] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.
[7] B. Ionescu, A. Gînsca, B. Boteanu, A. Popescu, M. Lupu, and H. Müller. Retrieving diverse social images at MediaEval 2015: Challenge, dataset and evaluation. In MediaEval, 2015.
[8] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[9] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
[10] E. Spyromitros-Xioufis, A. Ginsca, A. Popescu, S. Papadopoulos, Y. Kompatsiaris, and I. Vlahavas. Improving diversity in image search via supervised relevance scoring. In International Conference on Multimedia Retrieval (ICMR), 2015.
[11] E. Spyromitros-Xioufis, S. Papadopoulos, I. Kompatsiaris, G. Tsoumakas, and I. Vlahavas. A comprehensive study over VLAD and product quantization in large-scale image retrieval. IEEE Transactions on Multimedia, 2014.
[12] E. Spyromitros-Xioufis, S. Papadopoulos, Y. Kompatsiaris, and I. Vlahavas. SocialSensor: Finding diverse images at MediaEval 2014. In MediaEval, 2014.