    NLE @ MediaEval’17: Combining Cross-Media Similarity and
         Embeddings for Retrieving Diverse Social Images
                                                      Jean-Michel Renders and Gabriela Csurka
                                                            Naver Labs Europe, Meylan, France
                                                            firstname.lastname@naverlabs.com
ABSTRACT
In this working note we briefly describe the methods we used in the MediaEval’17 Retrieving Diverse Social Images Task and give details on the submitted runs.

1 INTRODUCTION
One of the main motivations for participating in the MediaEval’17 Retrieving Diverse Social Images Task [22] was to evaluate the cross-media similarity measure we proposed in [3, 4], which has proven to give top-ranked retrieval results on several ImageCLEF multimedia search tasks between 2007 and 2011^1 [12].
   The cross-media similarity we adopted this year differs from the one we used previously in the features used for both the visual and textual modalities. Previously, we used Fisher Vectors [15] for image representation and a standard Dirichlet language model (LM) or Lexical Entailment [2] for text representation. However, recent progress has shown that using the activation layers of deep Convolutional Networks pre-trained on ImageNet as image representation performs better than Fisher Vectors [15] on visual tasks, thanks to the large amount of knowledge learned from ImageNet. Similarly, word embedding-based representations such as word2vec, which rely on information learned from large textual collections, outperform standard tf-idf-based and earlier LM-based representations. Therefore, in our cross-media similarity model we used features extracted from deep models pre-trained on ImageNet and word embeddings learned from a large corpus of queries^2.
   A second motivation was to compare this simple approach with more recent image and text combination strategies, such as joint image and text embedding [5, 6, 8, 20, 21]. These methods, in contrast to our fully unsupervised cross-media similarity, exploit labels or relevance scores to learn the embedding.
   Finally, our third motivation was to evaluate several methods to make the top-ranked images more diverse. In particular, we explored a clustering-based method, with several visual, textual and joint similarity measures: images were re-ranked based on the number of times a document shared clusters with documents already present among the higher-ranked images (the lower, the better). While this family of methods allowed us to significantly increase the cluster recall, these methods turned out to perform below the classical Maximal Marginal Relevance (MMR) method proposed in [1], at least on the development set.

^1 For more details, please visit www.imageclef.org
^2 The models used to get these representations were built prior to and independently from the challenge.

Copyright held by the owner/author(s).
MediaEval’17, 13-15 September 2017, Dublin, Ireland

2 INCREASING THE TOP RELEVANCE
2.1 Cross-media and Mono-media Relevance
First, we describe our cross-media similarity measure, which we already proposed in [3, 4]. This cross-media similarity measure is a relatively simple extension of pseudo-relevance feedback and can be applied to a single medium (text or image) as well. It can be considered as a two-step similarity measure, where the final similarity between a query and a document^3 is nothing else but the average visual similarity between the document and the top-K documents most “textually” similar (i.e. relevant) to the query.

^3 Here a document refers to a Flickr image with its textual and visual representations.

   More formally, if we denote by S_V(d, d') the normalized visual similarity measure between documents d and d', and by S_T(d, q) the textual relevance score of document d with respect to query q, the new relevance score of a document d is defined as the average of its visual similarity with the top retrieved documents, weighted by their textual relevance scores:

    S_{T,V}(d, q) = [ Σ_{d_i ∈ NN_T^K(q)} S_T(d_i, q) S_V(d, d_i) ] / [ Σ_{d_i ∈ NN_T^K(q)} S_T(d_i, q) ]    (1)

where NN_T^K(q) denotes the top-K documents most similar to the query q using only the textual modality. We call it a cross-media similarity because it represents, in some sense, the similarity between a textual query and the visual part of a document.
   From our experiments, we observe better performance if we recombine this score with the initial relevance score through a convex linear combination: S̃_{T,V}(d, q) = (1 − α) S_{T,V}(d, q) + α S_T(d, q). This algorithm is the core of NLE-RUN3.
   Note that we can apply a similar two-step similarity measure using only visual (or, respectively, textual) features in both steps. Concretely, we obtain a purely text-based retrieval model (more or less equivalent to classical pseudo-relevance feedback) by replacing S_V(d, d') in (1) with S_T(d, d'), the normalized textual similarity measure between documents d and d'; this results in a purely textual relevance score S̃_{T,T}(d, q). This method corresponds to NLE-RUN2.
   By analogy, assuming (abusively) that the Flickr ranking is based on the image only, we can replace the term S_T(d_i, q) in (1) with S_F(d_i, q), the normalized “Flickr” relevance score, defined as (n − r)/r, where n is the number of images returned by Flickr and r is the provided Flickr rank of document d_i; this results in a purely visual relevance score S̃_{F,V}(d, q). This method corresponds to NLE-RUN1.
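To make the two-step computation concrete, the following is a minimal NumPy sketch of equation (1) together with the convex recombination that forms the core of NLE-RUN3. It assumes the mono-media scores are precomputed (text_scores holds S_T(d_i, q) for every document, visual_sim the pairwise S_V matrix); the function and variable names are ours, not the authors', and the default parameter values are those reported in Section 4.

    import numpy as np

    def cross_media_scores(text_scores, visual_sim, K=25, alpha=0.15):
        """Equation (1) followed by the convex recombination with the initial text scores.

        text_scores: 1-D array with S_T(d_i, q) for every document d_i (single query q).
        visual_sim:  2-D array with S_V(d, d') for every pair of documents.
        """
        # NN_T^K(q): indices of the top-K documents according to the textual scores
        top_k = np.argsort(text_scores)[::-1][:K]
        weights = text_scores[top_k]                 # S_T(d_i, q) for d_i in NN_T^K(q)
        # weighted average of the visual similarities to the top-K textual neighbours
        s_tv = visual_sim[:, top_k] @ weights / weights.sum()
        # convex combination with the initial textual relevance scores (NLE-RUN3)
        return (1 - alpha) * s_tv + alpha * text_scores

The purely textual variant (NLE-RUN2) follows by passing a textual similarity matrix instead of visual_sim, and the Flickr-based variant (NLE-RUN1) by replacing text_scores with the normalized Flickr scores (n − r)/r described above.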
2.2 Joint visual and textual embedding
We considered the joint textual and visual embedding model proposed in [20], where the idea is to use a two-view neural network with two layers of non-linearities on top of any representation of the image and text views.


To train this network, in a way reminiscent of some “learning to rank” strategies, we use 4 different triplet losses (visual-visual, textual-textual, visual-textual and textual-visual). The aim is to enforce that two documents relevant to the same query have both their textual and visual embeddings close in the new common (i.e. joint) latent space, while a document relevant to a query q should be far from documents non-relevant to the same query and from documents relevant to other queries. More formally, given a set of triplets (d_i, d_j, d_k) built from the set of queries and their associated documents, the method amounts to minimizing the following loss function:

    ℒ(d_i, d_j, d_k) = max[0, m + d(p_i^V, p_j^V) − d(p_i^V, p_k^V)]
                     + max[0, m + d(p_i^T, p_j^T) − d(p_i^T, p_k^T)]
                     + max[0, m + d(p_i^V, p_j^T) − d(p_i^V, p_k^T)]
                     + max[0, m + d(p_i^T, p_j^V) − d(p_i^T, p_k^V)]

where p_i^V and p_i^T are the projections of the visual and textual representations, respectively, of document d_i into the common embedded space. To select such triplets for training, we experimented with using the ground-truth relevance scores provided with the development set, but we observed that they do not generalize to unseen topics. Therefore, we instead used pseudo-relevance scores (based on our cross-media similarity scores), considering the top-ranked documents^4 as relevant to the query; the bottom-ranked documents, as well as all documents associated with the other queries, were assumed to be non-relevant.

^4 We considered as relevant the documents with scores > mean + std and as non-relevant those with scores < mean − std, where mean and std are the mean and standard deviation of the scores within the topic.

   After the model was trained, we computed embeddings for both the textual queries and the documents. For the documents, which have two embeddings, we considered the centroid of their visual and textual embeddings and ranked them according to their distance to the query in the embedding space. This approach was used to build our NLE-RUN4 and NLE-RUN5 runs.
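The four-view triplet loss ℒ above can be written compactly once the projections are available. Below is a minimal NumPy sketch for a single triplet (d_i anchor, d_j relevant, d_k non-relevant); it assumes the two-view network has already produced the projection matrices pV and pT (one row per document), and the Euclidean distance as well as the names are illustrative choices of ours.

    import numpy as np

    def four_view_triplet_loss(pV, pT, i, j, k, m=0.1):
        """Sum of the four hinge losses L(d_i, d_j, d_k) for one triplet."""
        def d(a, b):
            return np.linalg.norm(a - b)      # distance in the joint latent space

        loss = 0.0
        # visual-visual, textual-textual, visual-textual and textual-visual terms
        for anchor, pos, neg in [(pV[i], pV[j], pV[k]),
                                 (pT[i], pT[j], pT[k]),
                                 (pV[i], pT[j], pT[k]),
                                 (pT[i], pV[j], pV[k])]:
            loss += max(0.0, m + d(anchor, pos) - d(anchor, neg))
        return loss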
3 PROMOTING DIVERSITY
Note that, in general, promoting diversity comes with a risk of decreased precision, as we discard relevant elements from the top of the ranking because they are similar to other top-ranked elements. Our aim was therefore to find a good trade-off between keeping the relevance as high as possible and introducing diversity. The best performance on the development set was obtained with the Maximal Marginal Relevance (MMR) method proposed in [1]. The main idea of the method is to re-rank documents by considering new scores which correspond to their initial relevance scores diminished by the maximum similarity to the documents already selected, weighted by a penalty factor β.
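As a concrete illustration, here is a minimal greedy MMR-style re-ranking sketch in NumPy. It assumes the initial relevance scores and a pairwise similarity matrix are precomputed; the function name, the default β and the cut-off are illustrative assumptions, not the authors' exact implementation.

    import numpy as np

    def mmr_rerank(relevance, similarity, beta=0.5, top_k=50):
        """Greedily pick documents, penalizing each candidate's relevance by beta
        times its maximum similarity to the documents already selected."""
        candidates = set(range(len(relevance)))
        selected = []
        while candidates and len(selected) < top_k:
            best, best_score = None, -np.inf
            for doc in candidates:
                penalty = max(similarity[doc, s] for s in selected) if selected else 0.0
                score = relevance[doc] - beta * penalty
                if score > best_score:
                    best, best_score = doc, score
            selected.append(best)
            candidates.remove(best)
        return selected                       # document indices in their new order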
4 RESULTS AND ANALYSIS
The methods presented above, being based on pseudo-relevance feedback, depend heavily on the choice of the mono-modal similarity measures and, consequently, on a good textual/visual representation of the query and the documents.

Table 1: The retrieval results for our main runs

    Results (@20)    P      CR     F1     ERR-IA   α-nDCG
    Run1 (V)         73.2   59.4   63.3   66.0     62.3
    Run2 (T)         72.7   61.7   64.3   66.3     62.8
    Run3 (VT)        78.2   67.9   70.5   73.3     68.9
    Run4 (VT)        79.3   66.3   69.8   72.3     67.9
    Run5 (VT)        78.1   66.4   69.4   73.0     68.6

   For the textual facet, after trying word2vec and GloVe [14] embeddings, we finally decided to adopt the Dual Embedding Space Model for Document Ranking [13], pre-trained on the Bing query corpus^5. This choice was motivated by the fact that this embedding, specifically designed for IR applications, experimentally turned out to give better performance on the development set. Document and query embeddings are simply computed as the average of the embeddings of their constituent words; we then use a simple mixture of the Dirichlet-smoothed LM relevance score with the cosine similarity of the textual embeddings as the textual relevance score S_T(d, q).

^5 See http://research.microsoft.com/projects/DESM
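A minimal sketch of this textual relevance score, assuming a precomputed word-embedding lookup table and a Dirichlet-smoothed LM score computed elsewhere; the linear form of the mixture and the weight name mix are our assumptions for illustration.

    import numpy as np

    def embed_text(tokens, word_vectors, dim):
        """Average the embeddings of the words found in the lookup table."""
        vecs = [word_vectors[w] for w in tokens if w in word_vectors]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

    def textual_relevance(lm_score, query_emb, doc_emb, mix=0.5):
        """S_T(d, q): mixture of the Dirichlet-smoothed LM score with the cosine
        similarity between the averaged query and document embeddings."""
        denom = np.linalg.norm(query_emb) * np.linalg.norm(doc_emb)
        cos = float(query_emb @ doc_emb / denom) if denom > 0 else 0.0
        return (1.0 - mix) * lm_score + mix * cos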
   As visual representation, we considered several deep CNN models pretrained on ImageNet. We experimented with AlexNet [10], GoogLeNet Inception V3 [18], Inception-ResNet [17] and RMAC^6 [7, 19] deep models. The pretrained models were used as such, without any fine-tuning on the task collection. We used the activations of the last fully connected layer preceding the class-prediction layer as the image representation. The features were L2-normalized and the dot product was used as similarity.
   We used the provided ground truth on the development set and considered P@50 to select the best visual similarity and to set the parameters. The best results were obtained with the features extracted from the Inception-ResNet [17] model. As the best choice of the parameters in (1), we found K = 25 and α = 0.15.

^6 The RMAC model [7, 19] is trained with a triplet loss instead of a classification loss, so as to make the distance between images from the same class smaller than the distance to images from other classes plus a margin.
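A minimal sketch of how such L2-normalized deep features yield the visual similarity S_V used above, assuming the features have been extracted beforehand into a matrix with one row per image (the function name is ours):

    import numpy as np

    def visual_similarity(features, eps=1e-12):
        """L2-normalize the deep features and use the dot product as similarity,
        so that entry (i, j) is the cosine similarity between images i and j."""
        norms = np.linalg.norm(features, axis=1, keepdims=True)
        normalized = features / np.maximum(norms, eps)
        return normalized @ normalized.T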
   To promote diversity, we used for all runs the classical MMR applied to the initial relevance scores computed by the methods described above. The metric used in MMR to penalize documents similar to higher-ranked documents was the RMAC visual similarity between images, except for NLE-RUN2, where we used the cosine similarity between text embeddings to keep the run purely textual. The weight factor β that penalizes a too high similarity with higher-ranked documents was tuned on the development set.
   Our runs are summarized in Table 1. We can see that our visual-only and textual-only runs have similar performance, the visual one having slightly higher precision and the textual one higher diversity. Using the cross-media similarity allowed us to obtain a much better ranking, both in terms of precision and of diversity. Learning a joint visual and textual embedding using the relevance scores did not help, or even slightly degraded the results. The main reason is that the embedding only learned from information already captured by the cross-media similarity^7.

   Acknowledgement: We would like to thank Jon Almazán, who provided us with the RMAC representations for the images.

^7 Originally, we intended to use external data such as Visual Genome [9] or Flickr30K Entities [16] to learn embeddings capturing, e.g., relationships between objects and persons. Due to time constraints, we will investigate this in the future.


REFERENCES
[1] Jaime Carbonell and Jade Goldstein. 1998. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR).
[2] Stéphane Clinchant, Cyril Goutte, and Éric Gaussier. 2006. Lexical Entailment for Information Retrieval. In European Conference on Information Retrieval Research.
[3] Stéphane Clinchant, Jean-Michel Renders, and Gabriela Csurka. 2007. XRCE's participation to ImageCLEF. In CLEF Online Working Notes.
[4] Stéphane Clinchant, Jean-Michel Renders, and Gabriela Csurka. 2008. Trans-Media Pseudo-Relevance Feedback Methods in Multimedia Retrieval. In Advances in Multilingual and Multimodal Information Retrieval. LNCS 5152. Springer, 569–576.
[5] Andrea Frome, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc'Aurelio Ranzato, and Tomas Mikolov. 2013. DeViSE: A Deep Visual-Semantic Embedding Model. In Annual Conference on Neural Information Processing Systems (NIPS).
[6] Yunchao Gong, Liwei Wang, Micah Hodosh, Julia Hockenmaier, and Svetlana Lazebnik. 2014. Improving Image-Sentence Embeddings Using Large Weakly Annotated Photo Collections. In European Conference on Computer Vision (ECCV).
[7] Albert Gordo, Jon Almazán, Jerome Revaud, and Diane Larlus. 2016. Deep Image Retrieval: Learning global representations for image search. In European Conference on Computer Vision (ECCV).
[8] Albert Gordo and Diane Larlus. 2017. Beyond instance-level image retrieval: Leveraging captions to learn a global visual representation for semantic retrieval. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[9] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Fei-Fei Li. 2017. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. International Journal of Computer Vision 123 (2017), 32–73.
[10] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Annual Conference on Neural Information Processing Systems (NIPS).
[11] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision (ECCV).
[12] Henning Müller, Paul Clough, Theo Deselaers, and Barbara Caputo (Eds.). 2010. ImageCLEF - Experimental Evaluation in Visual Information Retrieval. Vol. INRE. Springer.
[13] Eric Nalisnick, Bhaskar Mitra, Nick Craswell, and Rich Caruana. 2016. Improving Document Ranking with Dual Word Embeddings.
[14] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP). 1532–1543.
[15] Florent Perronnin and Chris Dance. 2007. Fisher Kernels on Visual Vocabularies for Image Categorization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[16] Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. 2015. Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models. In IEEE International Conference on Computer Vision (ICCV).
[17] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alex Alemi. 2016. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. CoRR arXiv:1602.07261 (2016).
[18] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2016. Rethinking the Inception Architecture for Computer Vision. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[19] Giorgos Tolias, Ronan Sicre, and Hervé Jégou. 2016. Particular object retrieval with integral max-pooling of CNN activations. In International Conference on Learning Representations (ICLR).
[20] Liwei Wang, Yin Li, and Svetlana Lazebnik. 2016. Learning Deep Structure-Preserving Image-Text Embeddings. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[21] Jason Weston, Samy Bengio, and Nicolas Usunier. 2011. WSABIE: Scaling Up To Large Vocabulary Image Annotation. In International Joint Conference on Artificial Intelligence (IJCAI).
[22] Maia Zaharieva, Bogdan Ionescu, Alexandru Lucian Gînscă, Rodrygo L.T. Santos, and Henning Müller. 2017. Retrieving Diverse Social Images at MediaEval 2017: Challenges, Dataset and Evaluation. In MediaEval 2017, Multimedia Benchmark Workshop.