NLE @ MediaEval'17: Combining Cross-Media Similarity and Embeddings for Retrieving Diverse Social Images

Jean-Michel Renders and Gabriela Csurka
Naver Labs Europe, Meylan, France
firstname.lastname@naverlabs.com

Copyright held by the owner/author(s). MediaEval'17, 13-15 September 2017, Dublin, Ireland.

ABSTRACT
In this working note we briefly describe the methods we used in the MediaEval'17 Retrieving Diverse Social Images Task and give details on the submitted runs.

1 INTRODUCTION
One of the main motivations for participating in the MediaEval'17 Retrieving Diverse Social Images Task [22] was to evaluate the cross-media similarity measure we proposed in [3, 4], which has proven to give top-ranked retrieval results on several ImageCLEF multimedia search tasks between 2007 and 2011 [12] (for more details, please visit www.imageclef.org).

The cross-media similarity we adopted this year differs from the one we used previously in the features used for both the visual and textual modalities. Previously, we used Fisher Vectors [15] for image representation and a standard Dirichlet language model (LM) or Lexical Entailment [2] for text representation. However, recent progress has shown that using activation layers of deep Convolutional Networks pre-trained on ImageNet as image representation outperforms Fisher Vectors [15] on visual tasks, thanks to the large amount of knowledge learned from ImageNet. Similarly, word embedding-based representations such as word2vec, which rely on information learned from large textual collections, outperform standard tfidf-based and previous LM-based representations. Therefore, in our cross-media similarity model we used features extracted from deep models pre-trained on ImageNet and word embeddings learned from a large corpus of queries (the models used to obtain these representations were built prior to, and independently from, the challenge).

A second motivation was to compare this simple approach with more recent image and text combination strategies, such as joint image and text embedding [5, 6, 8, 20, 21]. These methods, in contrast to our fully unsupervised cross-media similarity, exploit labels or relevance scores to learn the embedding.

Finally, our third motivation was to evaluate several methods for making the top-ranked images more diverse. In particular, we explored a clustering-based method with several visual, textual and joint similarity measures: images were re-ranked based on the number of times a document shared clusters with documents already present among the higher-ranked images (the lower, the better). While this family of methods allowed us to significantly increase the cluster recall, it turned out to perform below the classical Maximal Marginal Relevance (MMR) method proposed in [1], at least on the development set.

2 INCREASING THE TOP RELEVANCE

2.1 Cross-media and Mono-media Relevance
First, we describe our cross-media similarity measure, which we already proposed in [3, 4]. This measure is a relatively simple extension of pseudo-relevance feedback and can also be applied to a single medium (text or image). It can be seen as a two-step similarity measure, where the final similarity between a query and a document (here, a document refers to a Flickr image together with its textual and visual representations) is simply the average visual similarity between the document and the top-K documents most "textually" similar (i.e. relevant) to the query.

More formally, if we denote by S_V(d, d') the normalized visual similarity between documents d and d', and by S_T(d, q) the textual relevance score of document d with respect to query q, the new relevance score of a document d is defined as a weighted average of its visual similarity with the top retrieved documents, weighted by their textual relevance scores:

    S_{T,V}(d, q) = \frac{\sum_{d_i \in NN_T^K(q)} S_T(d_i, q) \, S_V(d, d_i)}{\sum_{d_i \in NN_T^K(q)} S_T(d_i, q)}    (1)

where NN_T^K(q) denotes the top-K documents most similar to the query q using only the textual modality. We call this a cross-media similarity because it represents, in some sense, the similarity between a textual query and the visual part of a document.

From our experiments, we observed better performance when recombining this score with the initial relevance scores as a convex linear combination: S̃_{T,V}(d, q) = (1 − α) S_{T,V}(d, q) + α S_T(d, q). This algorithm is the core of NLE-RUN3.

Note that we can apply a similar two-step similarity measure using only visual (or textual, respectively) features in both steps. Concretely, we obtain a pure text-based retrieval model (more or less equivalent to classical pseudo-relevance feedback) by replacing S_V(d, d') in (1) with S_T(d, d'), the normalized textual similarity between documents d and d'; this results in a purely textual relevance score S̃_{T,T}(d, q). This method corresponds to NLE-RUN2.

By analogy, assuming (abusively) that the Flickr ranking is based on the image only, we can replace the term S_T(d_i, q) in (1) with S_F(d_i, q), the normalized "Flickr" relevance score, defined as (n − r)/r, where n is the number of images returned by Flickr and r is the provided Flickr rank of document d_i; this results in a purely visual relevance score S̃_{F,V}(d, q). This method corresponds to NLE-RUN1.
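For illustration, the two-step scoring of Eq. (1) and its convex recombination can be sketched in a few lines of NumPy. The function and variable names below are purely illustrative, the snippet assumes precomputed, normalized similarity matrices, and it is a sketch of the formula rather than the exact code behind our runs.

```python
import numpy as np

def two_step_score(first_step_rel, second_step_sim, K=25):
    """Generic two-step relevance (Eq. 1): weighted average of the second-step
    similarities between each document and the top-K documents ranked by the
    first-step relevance, the weights being those first-step scores."""
    top_k = np.argsort(-first_step_rel)[:K]          # NN^K(q) w.r.t. the first step
    w = first_step_rel[top_k]                        # e.g. S_T(d_i, q)
    return second_step_sim[:, top_k] @ w / w.sum()   # e.g. S_{T,V}(d, q)

def recombine(two_step, initial_rel, alpha=0.15):
    """Convex combination with the initial relevance scores (core of NLE-RUN3)."""
    return (1.0 - alpha) * two_step + alpha * initial_rel

# Assumed inputs (illustrative names):
#   S_T_q : (N,)   textual relevance scores S_T(d_i, q) of the N candidate documents
#   S_V_dd: (N, N) visual similarities S_V(d, d') between documents
#   S_T_dd: (N, N) textual similarities S_T(d, d') between documents
#   ranks : (N,)   Flickr ranks r of the documents, n = number of returned images
#
# NLE-RUN3 (cross-media): s3 = recombine(two_step_score(S_T_q, S_V_dd), S_T_q)
# NLE-RUN2 (text only)  : s2 = two_step_score(S_T_q, S_T_dd)
# NLE-RUN1 (visual only): s1 = two_step_score((n - ranks) / ranks, S_V_dd)
```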
2.2 Joint visual and textual embedding
We considered the joint textual and visual embedding model proposed in [20], where the idea is to use a two-view neural network with two layers of non-linearities on top of any representation of the image and text views. To train this network, in a way reminiscent of some "learning to rank" strategies, we use four different triplet losses (visual-visual, textual-textual, visual-textual and textual-visual). The aim is to enforce that two documents relevant to the same query have both their textual and visual embeddings close in the new common (i.e. joint) latent space, while a document relevant to a query q should be far from documents non-relevant to the same query and from documents relevant to other queries. More formally, given a set of triplets (d_i, d_j, d_k) built from the set of queries and their associated documents, the method amounts to minimizing the following loss function:

    \mathcal{L}(d_i, d_j, d_k) = \max[0, m + d(p_i^V, p_j^V) - d(p_i^V, p_k^V)]
        + \max[0, m + d(p_i^T, p_j^T) - d(p_i^T, p_k^T)]
        + \max[0, m + d(p_i^V, p_j^T) - d(p_i^V, p_k^T)]
        + \max[0, m + d(p_i^T, p_j^V) - d(p_i^T, p_k^V)]

where p_i^V and p_i^T are the projections of the visual, respectively textual, representation of document d_i into the common embedding space. To select such triplets for training, we experimented with the ground-truth relevance scores provided with the development set, but we observed that they do not generalize to unseen topics. Therefore, we instead used pseudo-relevance scores (derived from our cross-media similarity scores), considering the top-ranked documents as relevant to the query (documents with scores > mean + std, where mean and std are the mean and standard deviation of the scores within the topic); the bottom-ranked documents (scores < mean + std) as well as all documents associated with the other queries were assumed to be non-relevant.

After the model was trained, we computed embeddings for both the textual queries and the documents. For the documents, which have two embeddings, we considered the centroid of their visual and textual embeddings and ranked them according to their distance to the query in the embedding space. This approach was used to build our NLE-RUN4 and NLE-RUN5 runs.
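The loss above can be sketched as follows; this is a minimal NumPy illustration of the four margin terms only (the margin value, the Euclidean distance and all names are illustrative assumptions, and the two-layer projection network and its optimization are not shown).

```python
import numpy as np

def joint_triplet_loss(pV, pT, i, j, k, m=0.1):
    """Sum of the four margin-based triplet terms over the joint embedding space:
    visual-visual, textual-textual, visual-textual and textual-visual.

    pV, pT : (N, D) arrays of visual and textual projections of the N documents.
    i, j, k: anchor, positive (relevant to the same query) and negative document.
    m      : margin (illustrative value; not reported in this note).
    """
    d = lambda a, b: float(np.linalg.norm(a - b))   # distance in the joint space
    triplets = [
        (pV[i], pV[j], pV[k]),   # visual anchor, visual positive/negative
        (pT[i], pT[j], pT[k]),   # textual anchor, textual positive/negative
        (pV[i], pT[j], pT[k]),   # visual anchor, textual positive/negative
        (pT[i], pV[j], pV[k]),   # textual anchor, visual positive/negative
    ]
    return sum(max(0.0, m + d(a, p) - d(a, n)) for a, p, n in triplets)
```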
3 PROMOTING DIVERSITY
Note that, in general, promoting diversity comes with a risk of decreased precision, as we discard relevant elements from the top of the ranking that are too similar to other top elements. Our aim was therefore to find a good trade-off between keeping the relevance as high as possible and introducing diversity. The best performance on the development set was obtained with the classical Maximal Marginal Relevance (MMR) method proposed in [1]. The main idea of the method is to re-rank documents by considering new scores that correspond to their initial relevance scores diminished by the maximum similarity to the documents already selected, weighted by a penalty factor β.
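A minimal sketch of this greedy re-ranking is given below, using the score form stated above (initial relevance minus β times the maximum similarity to the already selected documents). Names and the default β are illustrative; this illustrates the idea rather than reproducing our exact implementation.

```python
import numpy as np

def mmr_rerank(relevance, similarity, beta=0.5, top_n=None):
    """Greedy MMR-style re-ranking.

    relevance : (N,) initial relevance scores of the documents.
    similarity: (N, N) pairwise similarity used to penalize redundancy
                (e.g. visual similarity, or text-embedding cosine for a purely
                textual run).
    beta      : penalty on the similarity to already selected documents
                (illustrative default; tuned on the development set in practice).
    Returns the list of document indices in the new order.
    """
    n = len(relevance)
    top_n = n if top_n is None else top_n
    selected, remaining = [], list(range(n))
    while remaining and len(selected) < top_n:
        if not selected:
            scores = {i: relevance[i] for i in remaining}
        else:
            scores = {i: relevance[i] - beta * max(similarity[i, j] for j in selected)
                      for i in remaining}
        best = max(scores, key=scores.get)   # pick the highest penalized score
        selected.append(best)
        remaining.remove(best)
    return selected
```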
4 RESULTS AND ANALYSIS
The methods presented above, based on pseudo-relevance feedback, depend heavily on the choice of the mono-modal similarity measures and, consequently, on a good textual and visual representation of the queries and the documents.

For the textual facet, after trying word2vec and GloVe [14] embeddings, we finally decided to adopt the Dual Embedding Space Model for Document Ranking [13], pre-trained on the Bing query corpus (see http://research.microsoft.com/projects/DESM). This choice was motivated by the fact that this embedding, specifically designed for IR applications, experimentally turned out to give better performance on the development set. Document and query embeddings are simply computed as the average of the embeddings of their constituent words; we then use a simple mixture of the Dirichlet-smoothed LM relevance score and the cosine similarity of the textual embeddings as the textual relevance score S_T(d, q).
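As an illustration, this textual relevance score can be sketched as follows, assuming a linear mixture with an unspecified weight (the mixture form, the weight and all names are illustrative assumptions; the Dirichlet-smoothed LM score is taken as a precomputed input).

```python
import numpy as np

def textual_relevance(lm_score, query_vecs, doc_vecs, lam=0.5):
    """Mixture of a Dirichlet-smoothed LM score and the cosine similarity of
    averaged word embeddings, used as S_T(d, q).

    lm_score  : Dirichlet-smoothed language-model relevance of d w.r.t. q.
    query_vecs: (n_q, D) embeddings of the query words.
    doc_vecs  : (n_d, D) embeddings of the document words.
    lam       : mixture weight (illustrative; would be tuned on the development set).
    """
    q = query_vecs.mean(axis=0)   # query embedding = average of its word embeddings
    d = doc_vecs.mean(axis=0)     # document embedding = average of its word embeddings
    cos = q @ d / (np.linalg.norm(q) * np.linalg.norm(d) + 1e-12)
    return (1.0 - lam) * lm_score + lam * cos
```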
As visual representations, we considered several deep CNN models pre-trained on ImageNet. We experimented with AlexNet [10], GoogleNet Inception V3 [18], Inception-ResNet [17] and RMAC [7, 19] models (the RMAC model is trained with a triplet loss instead of a classification loss, so as to make the distance between images from the same class smaller than the distance to images from other classes plus a margin). The pre-trained models were used as such, without any fine-tuning on the task collection. As visual representation we used the activations of the last fully connected layer preceding the class-prediction layer. The features were L2-normalized and the dot product was used as similarity.

We used the provided ground truth on the development set and considered P@50 to select the best visual similarity and to set the parameters. The best results were obtained with the features extracted from the Inception-ResNet [17] model. As the best choice for the parameters in (1), we found K = 25 and α = 0.15.

To promote diversity, we applied for all runs the classical MMR to the initial relevance scores computed by the methods described above. The metric used in MMR to penalize documents similar to higher-ranked documents was the RMAC visual similarity between images, except for NLE-RUN2, where we used the cosine similarity between text embeddings to keep the run purely textual. The weight factor β that penalizes a too high similarity with higher-ranked documents was tuned on the development set.

Table 1: The retrieval results for our main runs

Results (@20)   P      CR     F1     ERR-IA   α-nDCG
Run1 (V)        73.2   59.4   63.3   66.0     62.3
Run2 (T)        72.7   61.7   64.3   66.3     62.8
Run3 (VT)       78.2   67.9   70.5   73.3     68.9
Run4 (VT)       79.3   66.3   69.8   72.3     67.9
Run5 (VT)       78.1   66.4   69.4   73.0     68.6

Our runs are summarized in Table 1. We can see that our visual-only and textual-only runs have similar performance, the visual one having slightly higher precision and the textual one higher diversity. Using the cross-media similarity allowed us to obtain a much better ranking, both in terms of precision and of diversity. Learning a joint visual and textual embedding using the relevance scores did not help, or even slightly degraded the results. The main reason is that the embedding only learned from information already captured by the cross-media similarity. (Originally, we intended to use external data such as Visual Genome [9] or Flickr30K Entities [16] to learn embeddings of, for example, relationships between objects and persons; due to time constraints, we will investigate this in the future.)

Acknowledgement: We would like to thank Jon Almazan, who provided us with the RMAC representations of the images.

REFERENCES
[1] Jaime Carbonell and Jade Goldstein. 1998. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR).
[2] Stéphane Clinchant, Cyril Goutte, and Éric Gaussier. 2006. Lexical Entailment for Information Retrieval. In European Conference on Information Retrieval Research (ECIR).
[3] Stéphane Clinchant, Jean-Michel Renders, and Gabriela Csurka. 2007. XRCE's participation to ImageCLEF. In CLEF Online Working Notes.
[4] Stéphane Clinchant, Jean-Michel Renders, and Gabriela Csurka. 2008. Trans-Media Pseudo-Relevance Feedback Methods in Multimedia Retrieval. In Advances in Multilingual and Multimodal Information Retrieval. LNCS 5152. Springer, 569–576.
[5] Andrea Frome, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc'Aurelio Ranzato, and Tomas Mikolov. 2013. DeViSE: A Deep Visual-Semantic Embedding Model. In Annual Conference on Neural Information Processing Systems (NIPS).
[6] Yunchao Gong, Liwei Wang, Micah Hodosh, Julia Hockenmaier, and Svetlana Lazebnik. 2014. Improving Image-Sentence Embeddings Using Large Weakly Annotated Photo Collections. In European Conference on Computer Vision (ECCV).
[7] Albert Gordo, Jon Almazán, Jerome Revaud, and Diane Larlus. 2016. Deep Image Retrieval: Learning global representations for image search. In European Conference on Computer Vision (ECCV).
[8] Albert Gordo and Diane Larlus. 2017. Beyond instance-level image retrieval: Leveraging captions to learn a global visual representation for semantic retrieval. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[9] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Fei-Fei Li. 2017. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. International Journal of Computer Vision 123 (2017), 32–73.
[10] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Annual Conference on Neural Information Processing Systems (NIPS).
[11] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision (ECCV).
[12] Henning Müller, Paul Clough, Theo Deselaers, and Barbara Caputo (Eds.). 2010. ImageCLEF: Experimental Evaluation in Visual Information Retrieval. Springer.
[13] Eric Nalisnick, Bhaskar Mitra, Nick Craswell, and Rich Caruana. 2016. Improving Document Ranking with Dual Word Embeddings.
[14] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP). 1532–1543.
[15] Florent Perronnin and Chris Dance. 2007. Fisher Kernels on Visual Vocabularies for Image Categorization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[16] Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. 2015. Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models. In IEEE International Conference on Computer Vision (ICCV).
[17] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alex Alemi. 2016. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. CoRR arXiv:1602.07261.
[18] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2016. Rethinking the Inception Architecture for Computer Vision. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[19] Giorgos Tolias, Ronan Sicre, and Hervé Jégou. 2016. Particular object retrieval with integral max-pooling of CNN activations. In International Conference on Learning Representations (ICLR).
[20] Liwei Wang, Yin Li, and Svetlana Lazebnik. 2016. Learning Deep Structure-Preserving Image-Text Embeddings. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[21] Jason Weston, Samy Bengio, and Nicolas Usunier. 2011. WSABIE: Scaling Up To Large Vocabulary Image Annotation. In International Joint Conference on Artificial Intelligence (IJCAI).
[22] Maia Zaharieva, Bogdan Ionescu, Alexandru Lucian Gînscă, Rodrygo L. T. Santos, and Henning Müller. 2017. Retrieving Diverse Social Images at MediaEval 2017: Challenges, Dataset and Evaluation. In MediaEval 2017, Multimedia Benchmark Workshop.