Image-Text Rematching for News Items using Optimized Embeddings and CNNs in MediaEval NewsImages 2021

Tom Sühr, Ajay Madhavan Ravichandran, Nasim Jamshidi Avanaki, René Berk, Andreas Lommatzsch
Technische Universität Berlin, Berlin, Germany
{tom.suehr,jamshidiavanaki,ajay.m.ravichandran,rene.m.berk,andreas.lommatzsch}@campus.tu-berlin.de

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'21, 13-15 December 2021, Online

ABSTRACT
Finding a matching image for a news article is a core problem in the creation of traditional and online newspapers. The task of image-text matching has thus become a vibrant research area in computer science. The performance of state-of-the-art image retrieval systems on various benchmarks is excellent. However, they all rely on datasets with detailed textual descriptions of the images or on very large training collections. In this work, we optimize image-text matching algorithms for a small dataset based on the data of a single newspaper. Our optimized processing pipeline and the computed configurations reach precise results. The evaluation results obtained in the MediaEval NewsImages benchmark significantly outperform the algorithms from previous years.

1 INTRODUCTION
The process of selecting images for news articles in the multimedia industry is crucial. Images play a significant role in the storytelling process. They are used to attract the user's attention, thus achieving a high number of clicks or a high average dwell time per user. However, finding a single picture that matches a news article well is a hard task. Automating this task can provide benefits in several areas, e.g. improving the efficiency of publishing articles and saving costs and human resources. Finding the relationship between a text and an image is a problem researched in the field of recommender systems. Several papers have achieved good results, but most works rely on huge generic data collections. In this paper we develop models for a specific newspaper that has its own image database, a distinct journalistic style, and a significantly smaller amount of data. We evaluate our models using the data provided in the MediaEval 2021 NewsImages challenge. A detailed description of the dataset and the evaluation metrics is given in the Task Overview paper [11].

Our approach is inspired by recent works in the domain of text and image encoding as well as advanced image-text matching methods. We analyzed commonly used CNNs (pretrained on ImageNet [6]) for the image encoding, such as ResNet [7], VGG [8], and DenseNet [9]. For the efficient encoding of texts and their contexts, the use of text embeddings has shown promising results [2, 13, 18]. Recent image-text matching algorithms are usually based on two branches for extracting image and text representations; the computed representations of both modalities are then aligned in a joint semantic space [1, 3, 15, 20]. Critical aspects are the size of the dataset and its features, the specific vocabulary of the domain, as well as the models for transforming the textual and visual data.

In this work we research the degree to which the textual and visual contents of a news article are related. Our model should be able to recommend a ranked list of related images for a given text input. We analyze whether state-of-the-art image-text matching architectures like VSE work for a small and homogeneous dataset from just one newspaper. Furthermore, we research which adaptations are needed to improve the performance in the MediaEval NewsImages scenario.

The rest of this paper is organized as follows: Sec. 2 explains our approach and its implementation. In Sec. 3 we present the performance results and discuss the specific strengths of the models. Finally, we summarize our work and discuss extensions in Sec. 5.

2 APPROACH
Our approach follows the general architecture of Visual Semantic Embeddings [5]. The core idea of this architecture is to embed both text input and image input into a joint embedding space. In this joint embedding, matching text-image pairs can then be found based on distance or similarity measures such as cosine similarity. Thus, the challenge of this approach is to learn such a joint embedding and to extract those features which characterize image and text pairs best. Fig. 1 shows our architecture and its components.

Figure 1: Our system architecture. (Title, text, and category pass through separate preprocessors, encoders, and fusers, followed by a feature fuser and a linear transformation that yield the article embedding; images pass through a preprocessor and an ImageNet-pretrained VGG-19, yielding the image embedding; both branches are trained with a contrastive cosine-similarity loss.)

Image Encoding. The image encoding consists of three steps: (i) preprocessing, (ii) feature extraction, and (iii) linear mapping into the joint embedding size. In the preprocessing, we normalize the RGB values of the pixels and resize the images to 250 pixels. In the second step, the preprocessed images are fed into a pretrained CNN (VGG19 [16]).
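As an illustration, the following sketch shows how this three-step image encoding could look in PyTorch/torchvision. Only the 250-pixel resize, the RGB normalization, the pretrained VGG19 backbone, and the final linear mapping come from the description above; the square resize, the ImageNet normalization statistics, the frozen backbone, and the joint embedding size `joint_dim` are our assumptions.

```python
import torch.nn as nn
from torchvision import models, transforms

class ImageEncoder(nn.Module):
    """Image branch: (i) preprocessing, (ii) VGG19 features,
    (iii) linear mapping into the joint embedding space."""

    def __init__(self, joint_dim=1024):  # joint embedding size d (assumed)
        super().__init__()
        # (i) Resize to 250 pixels (square shape assumed) and normalize
        #     the RGB values (ImageNet statistics assumed).
        self.preprocess = transforms.Compose([
            transforms.Resize((250, 250)),
            transforms.ToTensor(),  # scales RGB values to [0, 1]
            transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225]),
        ])
        # (ii) Pretrained VGG19 up to its penultimate (4096-d) layer.
        vgg = models.vgg19(pretrained=True)
        self.features = nn.Sequential(
            vgg.features, vgg.avgpool, nn.Flatten(),
            *list(vgg.classifier.children())[:-1])
        for p in self.features.parameters():
            p.requires_grad = False  # frozen backbone (assumption)
        # (iii) Linear mapping into the joint embedding size.
        self.to_joint = nn.Linear(4096, joint_dim)

    def forward(self, pil_image):
        x = self.preprocess(pil_image).unsqueeze(0)  # add batch dimension
        return self.to_joint(self.features(x))
```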
Text Encoding. The text encoding stands at the center of this work. A special feature of news image retrieval is that more than one textual input may exist. In the NewsImages task the article title, the snippet, and the article category are provided. We apply three preprocessing steps to each textual input: we remove stop words, we apply stemming (using nltk), and, in order to get the same number of word vectors for each input, we pick a constant length and crop or extend the input to that length. Subsequently, we vectorize the text and compute a semantic embedding [4, 13, 14]. Due to the limited amount of data, we test pretrained embeddings.
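A minimal sketch of this preprocessing, assuming German-language articles, NLTK's Snowball stemmer, and a `<pad>` token for the extension step; the example length follows the title size given for the first fusion layer below.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)

STOPWORDS = set(stopwords.words("german"))  # German news data (assumed)
STEMMER = SnowballStemmer("german")

def preprocess(text, length):
    """Stop-word removal, stemming, then crop/pad to a constant length."""
    tokens = [STEMMER.stem(tok)
              for tok in nltk.word_tokenize(text.lower(), language="german")
              if tok.isalpha() and tok not in STOPWORDS]
    tokens = tokens[:length]                      # crop if too long
    tokens += ["<pad>"] * (length - len(tokens))  # extend if too short
    return tokens

# One constant length per input, e.g. 5 word vectors for the title:
print(preprocess("Neue Regeln für den Berliner Nahverkehr beschlossen", 5))
```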
First Fusion Layer. The task of the first fusion layer is to reduce the three matrices to three vector representations. Embedding each textual input on the word level yields three matrices of sizes $(a = 5, w)$ for the title input, $(b = 25, w)$ for the text input, and $(c = 1, w)$ for the category input; the word embedding size $w$ is 300. For the reduction of each matrix to a $(1, w)$ vector we compare two components: adaptive max pooling and a fully connected (linear) layer (cf. Sec. 3).

Stacking and Second Fusion Layer. Receiving three inputs of size $(1, w)$ for title, text, and category, the next step is to fuse all three representations and transform them into the size $(1, d)$ of the joint embedding space. To achieve this, we stack the three input representations of size $(1, w)$, which yields one vector of size $(1, 3w)$. A fully connected layer of size $(3w, d)$ then maps the stacked representation to the size $(1, d)$ of the joint embedding space.

Contrastive Loss. A multitude of loss functions exists to train the joint embedding space of article and image embeddings. The loss function should ensure that the learned model assigns an article a higher similarity to its true matching image than to any other image, and vice versa; a margin-based contrastive loss fulfills these requirements [3, 10, 12, 13]. For the image embedding $x_i$ and the article embedding $x_t$ we first define the similarity measure as the inner product of both vectors: $s(i, t) = \langle x_i, x_t \rangle : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$. In our implementation we use the L2-normalized vectors $x_i$, $x_t$ for computing the similarity, so $s(i, t)$ equals the cosine similarity.
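The following sketch mirrors the two fusion layers and the loss in PyTorch, assuming the linear variant of the first fusion layer (models B and D in Sec. 3). The matrix sizes, the $(3w, d)$ stacking layer, the L2 normalization, and the margin-based contrastive objective follow the description above; the joint size, the margin value, the batched formulation, and all module names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

W, D = 300, 1024        # word embedding size w = 300; joint size d (assumed)
A, B, C = 5, 25, 1      # constant word counts for title, text, category

class ArticleEncoder(nn.Module):
    """Text branch: first fusion (linear variant), stacking, second fusion."""

    def __init__(self):
        super().__init__()
        # First fusion layer: reduce each (n, w) matrix to a (1, w) vector.
        self.fuse_title = nn.Linear(A * W, W)
        self.fuse_text = nn.Linear(B * W, W)
        self.fuse_cat = nn.Linear(C * W, W)
        # Second fusion layer: map the stacked (1, 3w) vector to (1, d).
        self.to_joint = nn.Linear(3 * W, D)

    def forward(self, title, text, cat):  # shapes: (batch, n, W)
        t = self.fuse_title(title.flatten(1))
        x = self.fuse_text(text.flatten(1))
        c = self.fuse_cat(cat.flatten(1))
        return self.to_joint(torch.cat([t, x, c], dim=1))

def contrastive_loss(img_emb, art_emb, margin=0.2):  # margin value assumed
    """Margin-based contrastive loss; on L2-normalized vectors the inner
    product s(i, t) = <x_i, x_t> equals the cosine similarity."""
    xi = F.normalize(img_emb, dim=1)
    xt = F.normalize(art_emb, dim=1)
    s = xi @ xt.T                    # s[i, t] for all pairs in the batch
    pos = s.diag().unsqueeze(1)      # similarities of the true pairs
    # Hinge terms: non-matching pairs must stay `margin` below the true
    # pair, in both directions (image -> articles, article -> images).
    cost_i = (margin + s - pos).clamp(min=0)
    cost_t = (margin + s - pos.T).clamp(min=0)
    mask = torch.eye(s.size(0), dtype=torch.bool, device=s.device)
    return (cost_i.masked_fill(mask, 0).sum()
            + cost_t.masked_fill(mask, 0).sum())
```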
3 EXPERIMENTS AND RESULTS
We tested different configurations, focusing on finding optimal embeddings and hyperparameters. The experimental results on the MediaEval test set of size 3022 are shown in Table 1. The experiments reveal that the linear layer for the dimension reduction of the textual inputs outperforms adaptive max pooling in all compositions, with a margin of almost 10% in settings D and C. However, the adaptive max pooling component performed extremely well in most recent works. The reason seems to be the difference between pretrained and fine-tuned word embeddings. Adaptive max pooling can take positions in the textual input into account. The fully connected layer, on the other hand, is better suited to pretrained embeddings because it learns an average importance of the different positions. This suggests that the linear layer adapts better to the pretrained word embeddings than adaptive max pooling does.

In addition, we find that models B and D differ in performance, although they differ only in the data used for learning the embeddings. While model B, with the word embedding trained on Wikipedia, achieves a higher recall at positions 5 and 10, the same model with our word embedding trained on German news article data (model D) performs better at recall at 50 and 100. The wiki-based embedding differentiates more finely between words: given a word and a slight modification of it, the wiki embedding produces two significantly different representations. Furthermore, the vocabulary of the wiki embedding is much larger than the vocabulary of our custom embedding. The custom embedding performs better over a large interval of the ranking (r@50, r@100) because it is better suited to embedding news article words. In summary, the custom embedding provides a better representation of the articles than the embeddings computed on the wiki corpus; however, when fine-grained differentiation between words is relevant, the wiki-based embedding performs better.

Table 1: The evaluation results obtained on the evaluation set for the analyzed models.

Model                                  r@5     r@10    r@50     r@100
A: Word Embeddings MaxPool + wiki      1.93%   3.76%   12.59%   19.37%
B: Word Embeddings Linear + wiki       4.49%   7.26%   20.99%   31.91%
C: Word Embeddings MaxPool + custom    2.92%   4.60%   14.36%   24.86%
D: Word Embeddings Linear + custom     3.97%   7.10%   21.57%   33.26%
E: Word/Subw. Emb. Linear + wiki       2.56%   4.70%   16.19%   26.68%
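For reference, recall@k in this setting can be read as the fraction of articles whose true image appears among the k most similar images. A minimal sketch of this computation (our reading of the metric, not code from the benchmark):

```python
import torch
import torch.nn.functional as F

def recall_at_k(art_emb, img_emb, k):
    """Fraction of articles whose true image (same row index) appears
    among the k images with the highest cosine similarity."""
    sims = F.normalize(art_emb, dim=1) @ F.normalize(img_emb, dim=1).T
    topk = sims.topk(k, dim=1).indices                # (n, k) image indices
    target = torch.arange(sims.size(0)).unsqueeze(1)  # true image per article
    return (topk == target).any(dim=1).float().mean().item()

# e.g. r@5 on a test set of 3022 article-image pairs:
# recall_at_k(article_embeddings, image_embeddings, 5)
```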
4 CONTRIBUTIONS
In this work we made the following contributions. First, we showed that state-of-the-art architectures perform significantly worse on a small, non-descriptive, and homogeneous dataset. Second, we showed that embeddings trained on large corpora such as Wikipedia improve the performance for the top 10 retrieved images, while embeddings tailored to the specific style of a newspaper improve the top 100 performance. Third, we provide our code¹ for future MediaEval participants and other researchers, for benchmarking purposes and to build upon.

5 CONCLUSION
We have investigated how to adapt state-of-the-art image-text matching systems to a small, homogeneous, and specific dataset. We analyzed existing, well-performing image-text matching systems like VSE, identified components which do not work well with our dataset, and systematically tested possible substitutions for them. Our experiments show that the non-viability of components such as trainable word embeddings affects the viability of other components, e.g. the adaptive max pooling. We further showed that we can substitute these components in a simple way and achieve reasonable performance on our data. Future work could investigate other substitutions for the identified components, e.g. optimizing the word embeddings with respect to the loss. Furthermore, future projects could research other configurations or even other inputs for the image encoding layer, as well as investigate fairness aspects. It might be that our strategy works well for political articles but not for sports articles. Thus, analyzing and incorporating fairness aspects of matching and ranking [17, 19] could equalize the performance of our model across various article subjects.

¹ https://github.com/tsuehr/News-text-image-matching

REFERENCES
[1] Yanbei Chen and Loris Bazzani. 2020. Learning joint visual semantic matching embeddings for language-guided retrieval. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXII. Springer, 136–152.
[2] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[3] Fartash Faghri, David J. Fleet, Jamie Ryan Kiros, and Sanja Fidler. 2017. VSE++: Improving visual-semantic embeddings with hard negatives. arXiv preprint arXiv:1707.05612 (2017).
[4] Fartash Faghri, David J. Fleet, Jamie Ryan Kiros, and Sanja Fidler. 2018. VSE++: Improving Visual-Semantic Embeddings with Hard Negatives. (2018). https://github.com/fartashf/vsepp
[5] Andrea Frome, Greg Corrado, Jonathon Shlens, Samy Bengio, Jeffrey Dean, Marc'Aurelio Ranzato, and Tomas Mikolov. 2013. DeViSE: A deep visual-semantic embedding model. (2013).
[6] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. 2017. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4700–4708.
[7] Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. 2333–2338.
[8] Yao-Hung Hubert Tsai, Liang-Kang Huang, and Ruslan Salakhutdinov. 2017. Learning robust visual-semantic embeddings. In Proceedings of the IEEE International Conference on Computer Vision. 3571–3580.
[9] Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, and others. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics 5 (2017), 339–351.
[10] Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3128–3137.
[11] Benjamin Kille, Andreas Lommatzsch, Özlem Özgöbek, Mehdi Elahi, and Duc-Tien Dang-Nguyen. 2021. News Images in MediaEval 2021. In Proceedings of the MediaEval Benchmarking Initiative for Multimedia Evaluation 2021. CEUR Workshop Proceedings. http://ceur-ws.org/Vol-2882/
[12] Ryan Kiros, Ruslan Salakhutdinov, and Richard S. Zemel. 2014. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539 (2014).
[13] Fangyu Liu, Rémi Lebret, Didier Orel, Philippe Sordet, and Karl Aberer. 2020. Upgrading the Newsroom: An Automated Image Selection System for News Articles. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 16, 3 (2020), 1–28.
[14] Fangyu Liu, Rongtian Ye, Xun Wang, and Shuaipeng Li. 2020. HAL: Improved text-image matching by mitigating visual semantic hubs. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 11563–11571.
[15] Lin Ma, Wenhao Jiang, Zequn Jie, Yu-Gang Jiang, and Wei Liu. 2019. Matching image and sentence with multi-faceted representations. IEEE Transactions on Circuits and Systems for Video Technology 30, 7 (2019), 2250–2261.
[16] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[17] Tom Sühr, Asia J. Biega, Meike Zehlike, Krishna P. Gummadi, and Abhijnan Chakraborty. 2019. Two-sided fairness for repeated matchings in two-sided markets: A case study of a ride-hailing platform. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 3082–3092.
[18] Keyu Wen, Xiaodong Gu, and Qingrong Cheng. 2020. Learning Dual Semantic Relations with Graph Attention for Image-Text Matching. IEEE Transactions on Circuits and Systems for Video Technology (2020).
[19] Meike Zehlike, Tom Sühr, Carlos Castillo, and Ivan Kitanovski. 2020. FairSearch: A tool for fairness in ranked search results. In Companion Proceedings of the Web Conference 2020. 172–175.
[20] Ying Zhang and Huchuan Lu. 2018. Deep cross-modal projection learning for image-text matching. In Proceedings of the European Conference on Computer Vision (ECCV). 686–701.