Deep Embedding-based Multimodal Matching for News Articles: Exploring the Effects of Transfer Learning & Data Augmentation

Martin Ludwig Zehetner, Technische Universität Berlin, Berlin, Germany, m.zehetner@tu-berlin.de
Mohamed Amine Dhiab, Technische Universität Berlin, Berlin, Germany, m.dhiab@campus.tu-berlin.de

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'21, December 13-15 2021, Online.

ABSTRACT
Choosing the right combination of news article image and title is critical for attracting new users and converting latent interest into clicks. In the context of the MediaEval 2021 NewsImages Challenge, we therefore investigated the underlying multimodal matching problem between images and titles of news articles by employing deep embedding-based models to map and match images and text in the same semantic space. Additionally, we explored the impact of transfer learning and paraphrasing-based data augmentation schemes on task performance. We observed a clear improvement in performance using transfer learning approaches, but no consistent improvement using the data augmentation technique we selected. Our best model achieved a mean recall@100 of 0.3488.

1 INTRODUCTION
Online news articles commonly try to convey content in a distinctive and direct manner through combinations of expressive titles and images. Understanding the relationships between these news images and texts, e.g. news headlines, can therefore provide insight into a wide variety of tasks in the news domain. In this sense, the MediaEval 2021 NewsImages Re-Matching Task aims to advance this investigation by setting a task to re-match decoupled real-world news articles with the images used in said articles [8]. In particular, given a news article, the corresponding article image is to be selected from the set of all images.

We attempted to further the understanding of these relationships and to solve the given task using multimodal embeddings, specifically embeddings that map images and texts into the same semantic "news" vector space. These embeddings are generated using variations of the Self-Attention Embeddings for Image-Text Matching (SAEM) model framework [18] and are intended to enable the computation of semantic similarity between texts and images, with regard to the "news" domain, using basic similarity metrics. In this context, we placed a special emphasis on investigating the effects of a transfer learning scheme, using information exploited from image caption data sets, and of augmenting the given text data with a paraphrase-based approach.

2 RELATED WORK
Matching media objects of different modalities, e.g. text and images, is essential for various multimedia tasks. Learning a common space into which text and image feature vectors can be embedded and then compared is a typical approach for such tasks [2, 7].

Wu et al. [18] present such an embedding-based model framework, SAEM, to solve the cross-media retrieval task posed by image captioning data sets, i.e. images with associated descriptive sentences, such as Flickr30k [19] or MS-COCO [13]. SAEM models can be understood as neural networks divided into two branches. The first branch takes feature vectors representing salient regions in images as inputs. These vectors and their intra-modal relations are encoded using self-attention layers, and the final image embeddings are generated by average pooling. In the second branch, text inputs are converted into continuous, context-sensitive word embeddings using a transformer encoder initialized with a pre-trained BERT [3] architecture. The continuous representations are then fed into three distinct 1D Convolutional Neural Network (CNN) layers [9], and a fully connected layer subsequently generates the global text embeddings. Furthermore, to guarantee the mapping of similar images to texts, a weighted combination of a triplet loss [17] and an angular loss [16] is used while training the models with image-text pairs. This allows the computed embeddings to be compared directly using the cosine similarity measure; a sketch of such a ranking objective is given below.
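For concreteness, the following is a minimal PyTorch-style sketch of the hinge-based bidirectional triplet ranking term over cosine similarities that such frameworks build on. It is an illustration under stated assumptions, not the exact SAEM objective: the angular loss component [16] and the weighting between the two terms are omitted, and the function names are ours.

```python
import torch
import torch.nn.functional as F

def cosine_sim_matrix(img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
    # L2-normalise both sets of embeddings so the dot product equals cosine similarity
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    return img_emb @ txt_emb.t()  # shape: (batch, batch)

def bidirectional_triplet_loss(img_emb, txt_emb, margin=0.2):
    sim = cosine_sim_matrix(img_emb, txt_emb)
    pos = sim.diag().view(-1, 1)  # similarity of the matching (diagonal) pairs
    # hinge: every non-matching pair should score at least `margin` below
    # its corresponding matching pair, in both retrieval directions
    cost_txt = (margin + sim - pos).clamp(min=0)      # image -> wrong text
    cost_img = (margin + sim - pos.t()).clamp(min=0)  # text -> wrong image
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return (cost_txt.masked_fill(mask, 0).sum()
            + cost_img.masked_fill(mask, 0).sum())
```

In the full SAEM objective this ranking term is combined with the angular loss via a weighting hyperparameter; summing over all violations, as above, is one common variant of the triplet formulation.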
3 APPROACH

3.1 Deep Embedding-based Multimodal Matching using SAEM
A central component of our approach is the generation of directly comparable text and image embeddings using variations of SAEM models. We chose the SAEM model framework because it allows easy customization of the input image feature sources and achieved good results in other cross-media retrieval tasks on the Flickr30k and MS-COCO image captioning data sets [18].

We decided to use only the images and article titles from the MediaEval data set [8] in our SAEM variations, due to superior performance in initial tests and restricted translation and paraphrasing resources. For the generation of image feature vectors representing the salient image regions, we used a bottom-up attention mechanism [1] consisting of a Faster R-CNN with ResNet-101 trained on Visual Genome [11]. We set the dimensions of these feature vectors to (36, 2048). In addition, we used the WordPiece tokenizer [3] to process the article titles into the initial text input.

It is important to point out that, unlike the original introduction of the SAEM model framework [18], in which the focus was only on matching semantically identical images and texts, our approach focuses on the somewhat more abstract task of mapping the similarity of image and text elements within the semantics of the online news article context, i.e. which image would be selected to match a given article title. This is achieved by training and fine-tuning on the provided MediaEval image-title pairs [8]. Concurrently, the training process is used to optimize the hyperparameters, e.g. the learning parameters, the use of transfer learning, and the use of data augmentation. The trained models are then used to compute the images that best match the article titles, based on the cosine similarity of the corresponding embeddings.

3.2 Transfer Learning
Transfer learning broadly describes the use of knowledge learned in one scenario to improve the training process or results in another [6]. Among the most commonly used strategies in the neural network field are various approaches using pre-trained models [14]. In our approach, due to the relatively small size of the MediaEval data set, we decided to investigate and exploit the effect of a direct pre-trained model strategy in our task scenario. To this end, we first train our SAEM variants on either the Flickr30k or the MS-COCO data set. We then fine-tune the SAEM models on the news images and titles, using the best versions of the pre-trained models as the initial model states.

3.3 Data Preparation & Augmentation
In general, the performance of neural networks can be influenced heavily by the size and quality of the training data sets [15]. In many real-world tasks, relatively few data points are available for the training process, as in our MediaEval task. A potential solution is data augmentation, i.e. the generation of new data points to increase training data diversity [4]. We therefore investigated the effects of an augmentation approach that generates a new paraphrased title for each article title in the training set, with the image mapped to both titles. By doing so, we doubled the amount of training data; a sketch of this scheme is given below.
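As a minimal sketch of the scheme from Section 3.3: each image is paired with both its translated title and a paraphrase of that title, exactly doubling the training pairs. The function names are ours, and `paraphrase` stands in for the QuillBot step, which we ran through its web interface.

```python
from typing import Callable, List, Tuple

def augment_with_paraphrases(
    pairs: List[Tuple[str, str]],        # (image_id, translated_title)
    paraphrase: Callable[[str], str],    # stand-in for the QuillBot paraphraser
) -> List[Tuple[str, str]]:
    """Map each image to its original title AND a paraphrase of it,
    exactly doubling the number of image-title training pairs."""
    augmented: List[Tuple[str, str]] = []
    for image_id, title in pairs:
        augmented.append((image_id, title))              # original pair
        augmented.append((image_id, paraphrase(title)))  # paraphrased pair
    return augmented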
4 EXPERIMENT

4.1 Data Pre-Processing & Augmentation
Initial image feature vectors are generated following the procedure used in the SCAN [12] project. As the first step of text pre-processing, we translate the article titles into English using DeepL. For the text augmentation step, we then used QuillBot to generate paraphrases of the article titles, using the default mode with the highest possible abstraction level for the generated phrases [5].

4.2 Model Implementation
For the selected models, we used the Adam [10] optimizer with an initial learning rate of 0.001 and a batch size of 64 during training. During pre-training, a decay rate of 0.1 was applied after every 10 epochs; when training on the MediaEval data, the decay was applied only after every 15 epochs (a sketch of this setup follows at the end of this section). The dimension of the internal word embeddings was set to 300 and the dimension of the final multimodal embeddings was fixed to 256.

4.3 Experiment Protocol
Training & Evaluation of Pre-trained Models. The training described below is performed separately for Flickr30k and MS-COCO. First, the image-annotation pairs are split according to the public split [12]. The SAEM models are then trained for 30 epochs, and after each epoch the ratio of recommendations in which at least one relevant image was ranked among the top 1, 5, and 10 is determined on the validation data. Afterwards, the model with the highest mean across these three metrics is selected as the best pre-trained model. In our experiment, this best model was trained on MS-COCO.

Training & Evaluation of MediaEval Models. The second phase of the experiment focuses on comparing the performance of SAEM models with different configuration combinations. The "Initial Model State" M and the "Textual Input Type" T can be seen as the variables of the configurations: M represents initializing the SAEM model either randomly or from the best pre-trained model, and T represents using either only the translated article titles or the translated titles together with the paraphrased titles. The data is split such that the first two batches of the MediaEval data set represent the training data and the third batch serves as validation data [8]. Afterwards, for each of the four possible (M, T) combinations, the following steps are performed. First, the input data is augmented and processed according to the current T. Then the model is initialized according to the current M. Thereafter, the SAEM model is trained for 50 epochs using the pre-processed training data. Subsequently, the best model of the current configuration is selected by calculating MRR@N for N = 1, ..., 100 on the matchings, ranked by cosine similarity, that the trained model produces on the validation data set (a sketch of such ranking metrics is given below). The four submission models are the best performing models of each configuration.
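As referenced in Section 4.2, the following is a sketch of the optimization setup, assuming a PyTorch implementation of the SAEM variants; the helper name and the boolean flag are ours.

```python
import torch

def make_optimizer(model: torch.nn.Module, on_mediaeval: bool):
    # Adam with the initial learning rate of 0.001 from Section 4.2
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    # learning rate multiplied by 0.1 every 10 epochs during pre-training,
    # but only every 15 epochs when training on the MediaEval data
    scheduler = torch.optim.lr_scheduler.StepLR(
        optimizer, step_size=15 if on_mediaeval else 10, gamma=0.1)
    return optimizer, scheduler
```

With this setup, `scheduler.step()` is called once at the end of each training epoch so the decay takes effect on the configured schedule.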
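The ranking metrics used in Section 4.3 (and the mean recall@k reported in Table 1) can both be derived from the rank of the ground-truth image for each title. A sketch under the assumption of one ground-truth image per title; the function name and cutoff parameter are ours.

```python
import numpy as np

def ranking_metrics(sim: np.ndarray, ks=(5, 10, 50, 100), n_cut=100):
    """Recall@k and MRR@N from a title-to-image similarity matrix.

    sim[i, j] is the cosine similarity between title i and image j;
    sim[i, i] is assumed to be the single ground-truth pair.
    """
    order = np.argsort(-sim, axis=1)  # best-first image ranking per title
    # 0-based rank position of the ground-truth image for each title
    ranks = np.array([int(np.where(order[i] == i)[0][0])
                      for i in range(sim.shape[0])])
    recall_at_k = {k: float((ranks < k).mean()) for k in ks}
    # reciprocal rank, counted as 0 if the true image falls outside the cutoff
    rr = np.where(ranks < n_cut, 1.0 / (ranks + 1), 0.0)
    return recall_at_k, float(rr.mean())
```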
Table 1: Results: Mean recall@k (MR@k) of the submitted SAEM variants (pre: pre-trained, para: paraphrased titles used)

SAEM Variation        MR@5     MR@10    MR@50    MR@100
not pre & not para    0.04073  0.07050  0.20836  0.30966
not pre & para        0.04856  0.08512  0.21253  0.30287
pre & not para        0.0700   0.1159   0.2585   0.3488
pre & para            0.0653   0.1003   0.2381   0.3248

5 RESULTS & FUTURE WORK
The performance of our four submitted models on the MediaEval test set [8] is shown in Table 1. While no massive performance differences are present, clear performance gaps can nevertheless be observed. Our best performing variant is pre-trained but does not use the paraphrase-based data augmentation method, reflecting the general observed tendencies: consistently better performance is observed for the pre-trained variants, while our data augmentation approach shows no consistent performance improvements. In conclusion, our results indicate that exploiting learned information from similar task domains through transfer learning can be highly beneficial in news re-matching scenarios with small amounts of training data, such as the considered MediaEval 2021 NewsImages task.

In this sense, our observations suggest that further investigation into the exploitation of externally learned information may be worthwhile. In addition to a more detailed analysis of the influence of information learned in more or less related tasks and with more or less available data, investigating the explicit addition of available contextual information, such as knowledge related to identities or locations, could allow for further valuable insights.
REFERENCES
[1] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In CVPR.
[2] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450 (2016).
[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR abs/1810.04805 (2018). arXiv:1810.04805
[4] Steven Feng, Varun Gangal, Jason Wei, Sarath Chandar, Soroush Vosoughi, Teruko Mitamura, and Eduard Hovy. 2021. A Survey of Data Augmentation Approaches for NLP. CoRR abs/2105.03075 (2021).
[5] Tira Fitria. 2021. QuillBot as an online tool: Students' alternative in paraphrasing and rewriting of English writing. Englisia: Journal of Language, Education, and Humanities 9, 1 (2021), 183–196.
[6] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep learning. MIT Press.
[7] Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991 (2015).
[8] Benjamin Kille, Andreas Lommatzsch, Özlem Özgöbek, Mehdi Elahi, and Duc-Tien Dang-Nguyen. 2021. News Images in MediaEval 2021. In MediaEval 2021 Proceedings. MediaEval.
[9] Yoon Kim. 2014. Convolutional Neural Networks for Sentence Classification. CoRR abs/1408.5882 (2014). arXiv:1408.5882
[10] Diederik Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Yoshua Bengio and Yann LeCun (Eds.).
[11] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Fei-Fei Li. 2016. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. CoRR abs/1602.07332 (2016). arXiv:1602.07332
[12] Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. 2018. Stacked cross attention for image-text matching. In Proceedings of the European Conference on Computer Vision (ECCV). 201–216.
[13] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. CoRR abs/1405.0312 (2014). arXiv:1405.0312
[14] Leeja Mathew and Bindu. 2020. A Review of Natural Language Processing Techniques for Sentiment Analysis using Pre-trained Models. In 2020 Fourth International Conference on Computing Methodologies and Communication (ICCMC). 340–345.
[15] Luis Perez and Jason Wang. 2017. The Effectiveness of Data Augmentation in Image Classification using Deep Learning. CoRR abs/1712.04621 (2017). arXiv:1712.04621
[16] Jian Wang, Feng Zhou, Shilei Wen, Xiao Liu, and Yuanqing Lin. 2017. Deep Metric Learning with Angular Loss. CoRR abs/1708.01682 (2017). arXiv:1708.01682
[17] Liwei Wang, Yin Li, and Svetlana Lazebnik. 2015. Learning Deep Structure-Preserving Image-Text Embeddings. CoRR abs/1511.06078 (2015). arXiv:1511.06078
[18] Yiling Wu, Shuhui Wang, Guoli Song, and Qingming Huang. 2019. Learning fragment self-attention embeddings for image-text matching. In Proceedings of the 27th ACM International Conference on Multimedia. 2088–2096.
[19] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL 2 (2014), 67–78.