NLE @ MediaEval'17: Combining Cross-Media Similarity and Embeddings for Retrieving Diverse Social Images

Jean-Michel Renders and Gabriela Csurka
Naver Labs Europe, Meylan, France
firstname.lastname@naverlabs.com

Copyright held by the owner/author(s). MediaEval'17, 13-15 September 2017, Dublin, Ireland.

ABSTRACT
In this working note we briefly describe the methods we used in the MediaEval'17 Retrieving Diverse Social Images Task and give details on the submitted runs.

1 INTRODUCTION
One of the main motivations for participating in the MediaEval'17 Retrieving Diverse Social Images Task [22] was to evaluate the cross-media similarity measure we proposed in [3, 4], which has proven to give top-ranked retrieval results on several ImageCLEF multimedia search tasks between 2007 and 2011 [12] (for more details, please visit www.imageclef.org).

The cross-media similarity we adopted this year differs from the one we used previously in the features used for both the visual and textual modalities. Previously, we used Fisher Vectors [15] for image representation and a standard Dirichlet language model (LM) or Lexical Entailment [2] for text representation. However, recent progress has shown that using activation layers of deep Convolutional Networks pre-trained on ImageNet as image representation outperforms Fisher Vectors [15] on visual tasks, thanks to the large amount of knowledge learned from ImageNet. Similarly, word embedding-based representations such as word2vec, which rely on information learned from large textual collections, outperform standard tfidf-based and previous LM-based representations. Therefore, in our cross-media similarity model we used features extracted from deep models pre-trained on ImageNet and word embeddings learned from a large corpus of queries (the models used to obtain these representations were built prior to, and independently from, the challenge).

A second motivation was to compare this simple approach with more recent image and text combination strategies, such as joint image and text embedding [5, 6, 8, 20, 21]. These methods, in contrast to our fully unsupervised cross-media similarity, exploit labels or relevance scores to learn the embedding.

Finally, our third motivation was to evaluate several methods for making the top-ranked images more diverse. In particular, we explored a clustering-based method with several visual, textual and joint similarity measures: images were re-ranked based on the number of times a document shared clusters with documents already present among the higher-ranked images (the lower, the better). While this family of methods allowed us to significantly increase the cluster recall, it turned out to perform below the classical Maximal Marginal Relevance (MMR) method proposed in [1], at least on the development set.

2 INCREASING THE TOP RELEVANCE

2.1 Cross-media and Mono-media Relevance
First, we describe our cross-media similarity measure, which we already proposed in [3, 4]. This measure is a relatively simple extension of pseudo-relevance feedback and can also be applied to a single medium (text or image). It can be seen as a two-step similarity measure, where the final similarity between a query and a document (here, a document refers to a Flickr image together with its textual and visual representations) is simply the average visual similarity between the document and the top-K documents most "textually" similar (i.e. relevant) to the query.

More formally, if we denote by S_V(d, d') the normalized visual similarity between documents d and d', and by S_T(d, q) the textual relevance score of document d with respect to query q, the new relevance score of a document d is defined as a weighted average of its visual similarity with the top retrieved documents, weighted by their textual relevance scores:

    S_{T,V}(d, q) = \frac{\sum_{d_i \in NN_T^K(q)} S_T(d_i, q) \, S_V(d, d_i)}{\sum_{d_i \in NN_T^K(q)} S_T(d_i, q)}    (1)

where NN_T^K(q) denotes the top-K documents most similar to the query q using only the textual modality. We call this a cross-media similarity because it represents, in some sense, the similarity between a textual query and the visual part of a document.

From our experiments, we observed better performance when recombining this score with the initial relevance scores as a convex linear combination: S̃_{T,V}(d, q) = (1 − α) S_{T,V}(d, q) + α S_T(d, q). This algorithm is the core of NLE-RUN3.

Note that we can apply a similar two-step similarity measure using only visual (or textual, respectively) features in both steps. Concretely, we obtain a pure text-based retrieval model (more or less equivalent to classical pseudo-relevance feedback) by replacing S_V(d, d') in (1) with S_T(d, d'), the normalized textual similarity between documents d and d'; this results in a purely textual relevance score S̃_{T,T}(d, q). This method corresponds to NLE-RUN2.

By analogy, assuming (abusively) that the Flickr ranking is based on the image only, we can replace the term S_T(d_i, q) in (1) with S_F(d_i, q), the normalized "Flickr" relevance score, defined as (n − r)/r, where n is the number of images returned by Flickr and r is the provided Flickr rank of document d_i; this results in a purely visual relevance score S̃_{F,V}(d, q). This method corresponds to NLE-RUN1.
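For illustration, the two-step scoring of Eq. (1) and its convex recombination can be sketched in a few lines of NumPy. The function and variable names below are purely illustrative, the snippet assumes precomputed, normalized similarity matrices, and it is a sketch of the formula rather than the exact code behind our runs.

```python
import numpy as np

def two_step_score(first_step_rel, second_step_sim, K=25):
    """Generic two-step relevance (Eq. 1): weighted average of the second-step
    similarities between each document and the top-K documents ranked by the
    first-step relevance, the weights being those first-step scores."""
    top_k = np.argsort(-first_step_rel)[:K]          # NN^K(q) w.r.t. the first step
    w = first_step_rel[top_k]                        # e.g. S_T(d_i, q)
    return second_step_sim[:, top_k] @ w / w.sum()   # e.g. S_{T,V}(d, q)

def recombine(two_step, initial_rel, alpha=0.15):
    """Convex combination with the initial relevance scores (core of NLE-RUN3)."""
    return (1.0 - alpha) * two_step + alpha * initial_rel

# Assumed inputs (illustrative names):
#   S_T_q : (N,)   textual relevance scores S_T(d_i, q) of the N candidate documents
#   S_V_dd: (N, N) visual similarities S_V(d, d') between documents
#   S_T_dd: (N, N) textual similarities S_T(d, d') between documents
#   ranks : (N,)   Flickr ranks r of the documents, n = number of returned images
#
# NLE-RUN3 (cross-media): s3 = recombine(two_step_score(S_T_q, S_V_dd), S_T_q)
# NLE-RUN2 (text only)  : s2 = two_step_score(S_T_q, S_T_dd)
# NLE-RUN1 (visual only): s1 = two_step_score((n - ranks) / ranks, S_V_dd)
```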
2.2 Joint visual and textual embedding
We considered the joint textual and visual embedding model proposed in [20], where the idea is to use a two-view neural network with two layers of non-linearities on top of any representation of the image and text views. To train this network, in a way reminiscent of some "learning to rank" strategies, we use four different triplet losses (visual-visual, textual-textual, visual-textual and textual-visual). The aim is to enforce that two documents relevant to the same query have both their textual and visual embeddings close in the new common (i.e. joint) latent space, while a document relevant to a query q should be far from documents non-relevant to the same query and from documents relevant to other queries. More formally, given a set of triplets (d_i, d_j, d_k) built from the set of queries and their associated documents, the method amounts to minimizing the following loss function:

    \mathcal{L}(d_i, d_j, d_k) = \max[0, m + d(p_i^V, p_j^V) - d(p_i^V, p_k^V)]
        + \max[0, m + d(p_i^T, p_j^T) - d(p_i^T, p_k^T)]
        + \max[0, m + d(p_i^V, p_j^T) - d(p_i^V, p_k^T)]
        + \max[0, m + d(p_i^T, p_j^V) - d(p_i^T, p_k^V)]

where p_i^V and p_i^T are the projections of the visual, respectively textual, representation of document d_i into the common embedding space. To select such triplets for training, we experimented with the ground-truth relevance scores provided with the development set, but we observed that they do not generalize to unseen topics. Therefore, we instead used pseudo-relevance scores (derived from our cross-media similarity scores), considering the top-ranked documents as relevant to the query (documents with scores > mean + std, where mean and std are the mean and standard deviation of the scores within the topic); the bottom-ranked documents (scores < mean + std) as well as all documents associated with the other queries were assumed to be non-relevant.

After the model was trained, we computed embeddings for both the textual queries and the documents. For the documents, which have two embeddings, we considered the centroid of their visual and textual embeddings and ranked them according to their distance to the query in the embedding space. This approach was used to build our NLE-RUN4 and NLE-RUN5 runs.
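The loss above can be sketched as follows; this is a minimal NumPy illustration of the four margin terms only (the margin value, the Euclidean distance and all names are illustrative assumptions, and the two-layer projection network and its optimization are not shown).

```python
import numpy as np

def joint_triplet_loss(pV, pT, i, j, k, m=0.1):
    """Sum of the four margin-based triplet terms over the joint embedding space:
    visual-visual, textual-textual, visual-textual and textual-visual.

    pV, pT : (N, D) arrays of visual and textual projections of the N documents.
    i, j, k: anchor, positive (relevant to the same query) and negative document.
    m      : margin (illustrative value; not reported in this note).
    """
    d = lambda a, b: float(np.linalg.norm(a - b))   # distance in the joint space
    triplets = [
        (pV[i], pV[j], pV[k]),   # visual anchor, visual positive/negative
        (pT[i], pT[j], pT[k]),   # textual anchor, textual positive/negative
        (pV[i], pT[j], pT[k]),   # visual anchor, textual positive/negative
        (pT[i], pV[j], pV[k]),   # textual anchor, visual positive/negative
    ]
    return sum(max(0.0, m + d(a, p) - d(a, n)) for a, p, n in triplets)
```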
3 PROMOTING DIVERSITY
Note that, in general, promoting diversity comes with a risk of decreased precision, as we discard relevant elements from the top of the ranking that are too similar to other top elements. Our aim was therefore to find a good trade-off between keeping the relevance as high as possible and introducing diversity. The best performance on the development set was obtained with the classical Maximal Marginal Relevance (MMR) method proposed in [1]. The main idea of the method is to re-rank documents by considering new scores that correspond to their initial relevance scores diminished by the maximum similarity to the documents already selected, weighted by a penalty factor β.
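A minimal sketch of this greedy re-ranking is given below, using the score form stated above (initial relevance minus β times the maximum similarity to the already selected documents). Names and the default β are illustrative; this illustrates the idea rather than reproducing our exact implementation.

```python
import numpy as np

def mmr_rerank(relevance, similarity, beta=0.5, top_n=None):
    """Greedy MMR-style re-ranking.

    relevance : (N,) initial relevance scores of the documents.
    similarity: (N, N) pairwise similarity used to penalize redundancy
                (e.g. visual similarity, or text-embedding cosine for a purely
                textual run).
    beta      : penalty on the similarity to already selected documents
                (illustrative default; tuned on the development set in practice).
    Returns the list of document indices in the new order.
    """
    n = len(relevance)
    top_n = n if top_n is None else top_n
    selected, remaining = [], list(range(n))
    while remaining and len(selected) < top_n:
        if not selected:
            scores = {i: relevance[i] for i in remaining}
        else:
            scores = {i: relevance[i] - beta * max(similarity[i, j] for j in selected)
                      for i in remaining}
        best = max(scores, key=scores.get)   # pick the highest penalized score
        selected.append(best)
        remaining.remove(best)
    return selected
```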
4 RESULTS AND ANALYSIS
The methods presented above, based on pseudo-relevance feedback, depend heavily on the choice of the mono-modal similarity measures and, consequently, on a good textual and visual representation of the queries and the documents.

For the textual facet, after trying word2vec and GloVe [14] embeddings, we finally decided to adopt the Dual Embedding Space Model for Document Ranking [13], pre-trained on the Bing query corpus (see http://research.microsoft.com/projects/DESM). This choice was motivated by the fact that this embedding, specifically designed for IR applications, experimentally turned out to give better performance on the development set. Document and query embeddings are simply computed as the average of the embeddings of their constituent words; we then use a simple mixture of the Dirichlet-smoothed LM relevance score and the cosine similarity of the textual embeddings as the textual relevance score S_T(d, q).
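As an illustration, this textual relevance score can be sketched as follows, assuming a linear mixture with an unspecified weight (the mixture form, the weight and all names are illustrative assumptions; the Dirichlet-smoothed LM score is taken as a precomputed input).

```python
import numpy as np

def textual_relevance(lm_score, query_vecs, doc_vecs, lam=0.5):
    """Mixture of a Dirichlet-smoothed LM score and the cosine similarity of
    averaged word embeddings, used as S_T(d, q).

    lm_score  : Dirichlet-smoothed language-model relevance of d w.r.t. q.
    query_vecs: (n_q, D) embeddings of the query words.
    doc_vecs  : (n_d, D) embeddings of the document words.
    lam       : mixture weight (illustrative; would be tuned on the development set).
    """
    q = query_vecs.mean(axis=0)   # query embedding = average of its word embeddings
    d = doc_vecs.mean(axis=0)     # document embedding = average of its word embeddings
    cos = q @ d / (np.linalg.norm(q) * np.linalg.norm(d) + 1e-12)
    return (1.0 - lam) * lm_score + lam * cos
```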
As visual representations, we considered several deep CNN models pre-trained on ImageNet. We experimented with AlexNet [10], GoogleNet Inception V3 [18], Inception-ResNet [17] and RMAC [7, 19] models (the RMAC model is trained with a triplet loss instead of a classification loss, so as to make the distance between images from the same class smaller than the distance to images from other classes plus a margin). The pre-trained models were used as such, without any fine-tuning on the task collection. As visual representation we used the activations of the last fully connected layer preceding the class-prediction layer. The features were L2-normalized and the dot product was used as similarity.

We used the provided ground truth on the development set and considered P@50 to select the best visual similarity and to set the parameters. The best results were obtained with the features extracted from the Inception-ResNet [17] model. As the best choice for the parameters in (1), we found K = 25 and α = 0.15.

To promote diversity, we applied for all runs the classical MMR to the initial relevance scores computed by the methods described above. The metric used in MMR to penalize documents similar to higher-ranked documents was the RMAC visual similarity between images, except for NLE-RUN2, where we used the cosine similarity between text embeddings to keep the run purely textual. The weight factor β that penalizes a too high similarity with higher-ranked documents was tuned on the development set.

Table 1: The retrieval results for our main runs

Results (@20)   P      CR     F1     ERR-IA   α-nDCG
Run1 (V)        73.2   59.4   63.3   66.0     62.3
Run2 (T)        72.7   61.7   64.3   66.3     62.8
Run3 (VT)       78.2   67.9   70.5   73.3     68.9
Run4 (VT)       79.3   66.3   69.8   72.3     67.9
Run5 (VT)       78.1   66.4   69.4   73.0     68.6

Our runs are summarized in Table 1. We can see that our visual-only and textual-only runs have similar performance, the visual one having slightly higher precision and the textual one higher diversity. Using the cross-media similarity allowed us to obtain a much better ranking, both in terms of precision and of diversity. Learning a joint visual and textual embedding using the relevance scores did not help, or even slightly degraded the results. The main reason is that the embedding only learned from information already captured by the cross-media similarity. (Originally, we intended to use external data such as Visual Genome [9] or Flickr30K Entities [16] to learn embeddings of, for example, relationships between objects and persons; due to time constraints, we will investigate this in the future.)

Acknowledgement: We would like to thank Jon Almazan, who provided us with the RMAC representations of the images.

REFERENCES
[1] Jaime Carbonell and Jade Goldstein. 1998. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR).
[2] Stéphane Clinchant, Cyril Goutte, and Éric Gaussier. 2006. Lexical Entailment for Information Retrieval. In European Conference on Information Retrieval Research (ECIR).
[3] Stéphane Clinchant, Jean-Michel Renders, and Gabriela Csurka. 2007. XRCE's participation to ImageCLEF. In CLEF Online Working Notes.
[4] Stéphane Clinchant, Jean-Michel Renders, and Gabriela Csurka. 2008. Trans-Media Pseudo-Relevance Feedback Methods in Multimedia Retrieval. In Advances in Multilingual and Multimodal Information Retrieval. LNCS 5152. Springer, 569–576.
[5] Andrea Frome, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc'Aurelio Ranzato, and Tomas Mikolov. 2013. DeViSE: A Deep Visual-Semantic Embedding Model. In Annual Conference on Neural Information Processing Systems (NIPS).
[6] Yunchao Gong, Liwei Wang, Micah Hodosh, Julia Hockenmaier, and Svetlana Lazebnik. 2014. Improving Image-Sentence Embeddings Using Large Weakly Annotated Photo Collections. In European Conference on Computer Vision (ECCV).
[7] Albert Gordo, Jon Almazán, Jerome Revaud, and Diane Larlus. 2016. Deep Image Retrieval: Learning global representations for image search. In European Conference on Computer Vision (ECCV).
[8] Albert Gordo and Diane Larlus. 2017. Beyond instance-level image retrieval: Leveraging captions to learn a global visual representation for semantic retrieval. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[9] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Fei-Fei Li. 2017. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. International Journal of Computer Vision 123 (2017), 32–73.
[10] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Annual Conference on Neural Information Processing Systems (NIPS).
[11] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision (ECCV).
[12] Henning Müller, Paul Clough, Theo Deselaers, and Barbara Caputo (Eds.). 2010. ImageCLEF: Experimental Evaluation in Visual Information Retrieval. Springer.
[13] Eric Nalisnick, Bhaskar Mitra, Nick Craswell, and Rich Caruana. 2016. Improving Document Ranking with Dual Word Embeddings.
[14] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP). 1532–1543.
[15] Florent Perronnin and Chris Dance. 2007. Fisher Kernels on Visual Vocabularies for Image Categorization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[16] Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. 2015. Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models. In IEEE International Conference on Computer Vision (ICCV).
[17] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alex Alemi. 2016. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. CoRR arXiv:1602.07261.
[18] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2016. Rethinking the Inception Architecture for Computer Vision. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[19] Giorgos Tolias, Ronan Sicre, and Hervé Jégou. 2016. Particular object retrieval with integral max-pooling of CNN activations. In International Conference on Learning Representations (ICLR).
[20] Liwei Wang, Yin Li, and Svetlana Lazebnik. 2016. Learning Deep Structure-Preserving Image-Text Embeddings. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[21] Jason Weston, Samy Bengio, and Nicolas Usunier. 2011. WSABIE: Scaling Up To Large Vocabulary Image Annotation. In International Joint Conference on Artificial Intelligence (IJCAI).
[22] Maia Zaharieva, Bogdan Ionescu, Alexandru Lucian Gînscă, Rodrygo L. T. Santos, and Henning Müller. 2017. Retrieving Diverse Social Images at MediaEval 2017: Challenges, Dataset and Evaluation. In MediaEval 2017, Multimedia Benchmark Workshop.