NewsSeek-NOVA at MediaEval 2021: Context-Enriched Multimodal Transformers for News Images Re-matching

Cláudio Bartolomeu, Rui Nóbrega, David Semedo
NOVA LINCS, NOVA School of Science and Technology, Lisbon, Portugal
c.bartolomeu@campus.fct.unl.pt, {rui.nobrega,df.semedo}@fct.unl.pt

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'21, December 13-15 2021, Online.

ABSTRACT
In this paper we present our participation in the NewsImages task, where we address the complex challenge of connecting images to news text. We leverage transformer-based multimodal models to jointly attend to different contextual news elements when performing predictions, and transfer learning to improve performance. Our experiments demonstrate that the models benefit from jointly attending to context-enriched samples, supporting our hypothesis. We also extract rich insights on the principles underlying the connection between images and news text.

1 INTRODUCTION
News articles are rich multimodal pieces that aim to inform users in a concise and accurate manner. They are often composed of a title, a headline and a body of text. To better convey the topic and events being covered, journalists use images as illustrations. Providing visual elements helps the news reader visualize the event and get a better sense of what happened [11]. Connecting news text and images is a complex endeavour, as it goes beyond matching what we see in an image (visual concepts) to words. Instead, it is often explained by a combination of journalistic criteria involving authenticity, topical semantic relevance and aesthetics [9, 10].

In this paper we present our approach to the NewsImages task [6], which asks researchers to re-match news images to articles, towards devising a systematic approach that captures the intricacies of how the two modalities are connected, from a journalistic perspective.

2 METHODOLOGY
The connection between news and images goes beyond visual concept matching [1, 3, 20]. In this scenario, not only are the challenges of image-text matching inherited [4], but also the underlying journalistic subjectivity that stems from trade-off aspects such as aesthetics or authenticity. Thus, we focused on adopting a model capable of considering multiple views of news articles to predict whether an image matches a news article. Accordingly, we hypothesize that by leveraging self-attention models, specifically the Transformer [17], and by providing extra contextual information, we allow the model to jointly reason over multiple data views and learn the relationships between text and images directly from data.
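For reference, the core operation stacked by these models is the scaled dot-product self-attention of [17], restated here for completeness:

\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V \]

where Q, K and V are learned query, key and value projections of the input tokens and d_k is the key dimensionality. Cross-modality encoders such as LXMERT [14] apply this operation both within and across the textual and visual streams, which is what enables the joint reasoning we rely on.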
News pieces are multimodal documents composed of a title, a text body (as a set of paragraphs) and images, which are used throughout the news piece to illustrate specific paragraphs, providing extra context. We also find several named entities, such as persons or locations, that are crucial to define the piece's topic and scope. The challenge lies in jointly using all this information to match news to images. We observed the following difficulties: a) the topic of a news article cannot always be extracted from its images, b) the news title is highly concise and lacks context (e.g. "63-year-old pedestrian succumbs to his injuries"), c) to correctly capture the news context it is important to consider the mentioned entities, as well as the news' central topic, and, finally, d) the inherent subjectivity, evidenced by situations where multiple images could actually be used, with the choice depending on the journalist's preference.

Multimodal Transformer-based architectures [2, 8, 14] have proven highly effective at modeling image and text semantics. These can be a) encoder-based, such as LXMERT [14], which is composed of an object-relationship encoder, a language encoder and a cross-modality encoder, or b) decoder-based (hence generative), like VL-T5/BART [2], which adopts a single decoder architecture to tackle multiple visio-linguistic tasks in a generative manner. We investigate how suited these self-attention models are to news content.

In particular, in the developed models we followed the LXMERT architecture and exploited different ways to enrich the model's context. The idea is to provide complementary views of the two modalities and to leverage the model's capability of jointly attending to different news elements when performing a prediction. Then, through transfer learning, we were able to significantly improve performance on the NewsImages task dataset. The results confirm the importance of providing extra context in order to bridge the semantic gap between images and news text. Namely, our best-performing variant, which achieved an MRR@100 of 9.31%, is the one that has access to the most complementary views of the news piece.

Data Pre-processing and Protocol. The dataset is comprised of news articles, each composed of a title, a text snippet (in German) and an image. Since most multimodal pre-trained models were trained in English, and assuming that we do not lose information in the translation process, we used a combination of Google's API and OPUS-MT [15], available in HuggingFace [18], to translate the texts. For each news article, we extracted entities from the title and text snippet using spaCy [5]. For images, we used a Faster R-CNN [12], trained on Visual Genome [1, 7], and extracted a total of 36 region embeddings per image, which allows the model to attend individually to specific parts of an image. We split the development set (7530 samples) by using 500 samples for validation, 1000 for testing and the remaining for training. A sketch of the textual part of this pipeline is given below.
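The following is a minimal sketch of the textual pre-processing described above, assuming the `Helsinki-NLP/opus-mt-de-en` checkpoint for OPUS-MT and spaCy's `en_core_web_sm` pipeline (the paper does not name exact model variants, and the Google API and Faster R-CNN steps are omitted here):

```python
# German-to-English translation with OPUS-MT (via HuggingFace), followed by
# named-entity extraction with spaCy on the translated text.
from transformers import MarianMTModel, MarianTokenizer
import spacy

MT_NAME = "Helsinki-NLP/opus-mt-de-en"  # assumed de->en checkpoint
mt_tokenizer = MarianTokenizer.from_pretrained(MT_NAME)
mt_model = MarianMTModel.from_pretrained(MT_NAME)
nlp = spacy.load("en_core_web_sm")      # English NER pipeline

def translate(texts):
    """Translate a batch of German strings into English."""
    batch = mt_tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    generated = mt_model.generate(**batch)
    return [mt_tokenizer.decode(g, skip_special_tokens=True) for g in generated]

def extract_entities(text):
    """Return (surface form, label) pairs, e.g. persons and locations."""
    return [(ent.text, ent.label_) for ent in nlp(text).ents]

title_en, snippet_en = translate([
    "63-jähriger Fußgänger erliegt seinen Verletzungen",
    "Der Unfall ereignete sich am Montagabend.",
])
entities = extract_entities(title_en + " " + snippet_en)
```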
Approach. We tackled the previously discussed challenges by adopting a multimodal transformer, LXMERT [14], and learning enriched multimodal representations of news pieces. In particular, we trained the model end-to-end, optimizing all of its loss functions except the visual question-answering one. The model jointly learns internal data representations and optimizes for matching images to news texts by scoring individual (image, news text) pairs.

Exploiting News Context. We investigated different ways to provide extra context to the model. The first baseline takes as input the news title + snippet and the extracted image regions. Then, since entities play a major role in news, and inspired by [19], which extends masked language modeling to account for coarse-grained (n-gram) information, we force the model to pay special attention to entities. Namely, we added a separate masked language modeling loss with an increased masking probability for entity tokens, as sketched below.
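A minimal sketch of this entity-biased masking, assuming BERT-style dynamic masking; the 15%/50% rates and the helper names are illustrative assumptions, not the paper's exact values:

```python
# Mask entity tokens with a higher probability than regular tokens, so the
# extra masked-language-modeling loss concentrates on named entities.
import torch

BASE_MASK_PROB = 0.15    # standard MLM masking rate (assumed)
ENTITY_MASK_PROB = 0.50  # boosted rate for entity tokens (assumed)

def entity_biased_mask(input_ids, entity_mask, mask_token_id):
    """input_ids: (seq_len,) token ids; entity_mask: (seq_len,) booleans
    marking tokens that fall inside a named-entity span."""
    probs = torch.full(input_ids.shape, BASE_MASK_PROB)
    probs[entity_mask] = ENTITY_MASK_PROB
    selected = torch.bernoulli(probs).bool()
    # Positions labeled -100 are ignored by the cross-entropy loss.
    labels = torch.where(selected, input_ids, torch.full_like(input_ids, -100))
    masked_ids = torch.where(selected, torch.full_like(input_ids, mask_token_id), input_ids)
    return masked_ids, labels
```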
Faces-entity context. We noticed that a large portion of the images contain faces of persons who are then mentioned in the news piece. To support this new input, we added an extra projection layer to LXMERT, mapping face features to internal representations. As a result, the visual sub-network is augmented with face embeddings, such that the model is able to jointly reason over image regions, faces, news text and entities, and eventually learn the relations between faces and entities. Faces were extracted from all images using MTCNN [21], and face recognition embeddings from FaceNet [13] were used as features; a sketch of this step follows.
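A minimal sketch of the face-feature extraction and projection, assuming the facenet-pytorch package (which bundles MTCNN and a FaceNet model) and a 768-dimensional LXMERT hidden size; both are assumptions, not details given in the paper:

```python
# Detect faces with MTCNN [21], embed them with FaceNet [13], and project
# the 512-d face embeddings into the transformer's visual input space via
# the extra linear layer added to LXMERT.
import torch
import torch.nn as nn
from PIL import Image
from facenet_pytorch import MTCNN, InceptionResnetV1

mtcnn = MTCNN(keep_all=True)                               # face detector
facenet = InceptionResnetV1(pretrained="vggface2").eval()  # face embeddings
face_projection = nn.Linear(512, 768)                      # added projection layer

image = Image.open("news_image.jpg")
faces = mtcnn(image)                      # (num_faces, 3, 160, 160) or None
if faces is not None:
    with torch.no_grad():
        face_embeds = facenet(faces)      # (num_faces, 512)
    face_tokens = face_projection(face_embeds)  # appended to the visual input
```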
Transfer Learning. Given the reduced size of the NewsImages task dataset, we resorted to pre-training to improve the model's representations. Namely, we pre-trained using the NYTimes800k [16] dataset, which comprises 440k news articles (≈ 100 times bigger), and then fine-tuned the model on the development dataset. This exposes the model to a much greater number of different news pieces, thereby improving its representations and its ability to capture cross-modality relationships. In NYTimes800k, each article can have multiple images. Moreover, in addition to the title and news text body, each image also has a caption. Since these captions describe an individual image, they sometimes do not reflect the news article's main topic. Thus, to obtain rich context, we considered the headline, a snippet and the image caption.

3 RESULTS AND DISCUSSION
In all experiments we pre-trained LXMERT on NYTimes800k, with different combinations of the article's headline, snippet and image caption, and fine-tuned it on the task's dataset (development split). For the task dataset, we fixed the language input to always use the article's title and text. Table 1 reports the results of our runs. We chose our best runs based on the results on our test split.

Table 1: Run results in MRR@100 and Recall at K (R@K).

Run                    MRR     R@5     R@10    R@50    R@100
1-NT-CS + ME-TS        0.0922  0.1248  0.1927  0.4433  0.5990
2-NT-CS + ME-TSE       0.0877  0.1232  0.1922  0.4439  0.6068
3-NT-CHS + ME-TS       0.0931  0.1274  0.1906  0.4381  0.5875
4-NT-CSEF + ME-TSEF    0.0931  0.1269  0.2052  0.4611  0.6057
5-RRF (1 + 2 + 4)      0.1043  0.1467  0.2183  0.4789  0.6277

Run 1 - NT-CS + ME-TS. In this experiment we used the NYTimes800k articles' snippet (S) and image caption (C) during pre-training. From the task dataset we used the articles' title (T) and snippet (S). Compared to our baseline without pre-training, we observed an improvement of ≈ 47.1% in MRR@100 on our test split, which shows the importance of transfer learning for improving model performance.

Run 2 - NT-CS + ME-TSE. For this run, we used the NYTimes800k articles' snippet (S) and image caption (C). From the task dataset we used the title (T), snippet (S) and named entities (E). We noticed improvements in R@50 and R@100, while the other metrics worsened, which may be due to not using entities during pre-training.

Run 3 - NT-CHS + ME-TS. In this run we wanted to assess the impact of considering the headline (H) in pre-training, together with the news snippet (S) and image caption (C). In this scenario there is a better alignment between the elements used in pre-training and fine-tuning. We observed an improvement in R@5 and a deterioration in R@10, R@50 and R@100.

Run 4 - NT-CSEF + ME-TSEF. In this last configuration, we used our augmented LXMERT architecture to incorporate both face features (F) and entities (E). This experiment achieved the best results in MRR@100, R@10 and R@50, which corroborates our development experiments, in which this configuration achieved the best overall results. This shows that allowing the model to jointly attend to faces and entities better leverages the context to establish the connection between images and news text.

Run 5 - RRF. During our experiments, we observed that different configurations obtained better results at different recall thresholds (the K value in R@K). Thus, in our last run, we used Reciprocal Rank Fusion (RRF) to merge the rankings of the first, second and fourth runs. This yielded the best results across all metrics. A sketch of RRF is given below.
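A minimal sketch of Reciprocal Rank Fusion as used in Run 5; the smoothing constant k = 60 is the common default from the RRF literature, not a value reported in the paper:

```python
# Each image's fused score is the sum of its reciprocal ranks across the
# individual runs; images ranked high by several runs rise to the top.
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """rankings: list of ranked lists of image ids (best first).
    Returns the image ids sorted by fused score, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, image_id in enumerate(ranking, start=1):
            scores[image_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy example fusing the per-article rankings of runs 1, 2 and 4:
run1 = ["img_a", "img_b", "img_c"]
run2 = ["img_b", "img_a", "img_c"]
run4 = ["img_b", "img_c", "img_a"]
fused = reciprocal_rank_fusion([run1, run2, run4])  # -> ["img_b", "img_a", "img_c"]
```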
3.1 Models' Predictions Inspection
[Figure 1: Models' predictions inspection example.]

To understand our model's decisions, we inspect in Figure 1 two sample predictions (using Run 4). In the top row the model succeeds, while in the bottom row it ranks the correct image at position 45. Both examples illustrate the inherent subjectivity of the task, as, semantically, any of the shown images seems to match the article's text. Notwithstanding, we can see that, in general, the model captures the complex relations between modalities.

4 CONCLUSIONS AND FUTURE WORK
In this work we proposed a set of context-enriched variants of a multimodal transformer model to address the task of news re-matching. These variants alternate the type of context (textual and visual) provided to the model and used to learn the connection between images and text. We confirmed that going beyond the news title and a small snippet is crucial. Despite our promising results, we posit that two key challenges remain: a) learning how different news entities are related and how they are visually materialized, and b) dealing with the inherent journalistic subjectivity involved in opting for a specific image.

Acknowledgments. This work has been partially funded by the iFetch project, Ref. 45920, co-financed by ERDF, COMPETE 2020, NORTE 2020 and FCT under CMU Portugal, and by the FCT project NOVA LINCS (Ref. UIDB/04516/2020).

REFERENCES
[1] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[2] Jaemin Cho, Jie Lei, Hao Tan, and Mohit Bansal. 2021. Unifying Vision-and-Language Tasks via Text Generation. In Proceedings of the 38th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 139). PMLR, 1931–1942. https://proceedings.mlr.press/v139/cho21a.html
[3] Fartash Faghri, David J. Fleet, Jamie Ryan Kiros, and Sanja Fidler. 2018. VSE++: Improving Visual-Semantic Embeddings with Hard Negatives. https://github.com/fartashf/vsepp
[4] Yan Gong, Georgina Cosma, and Hui Fang. 2021. On the Limitations of Visual-Semantic Embedding Networks for Image-to-Text Information Retrieval. Journal of Imaging 7, 8 (2021). https://doi.org/10.3390/jimaging7080125
[5] Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. https://spacy.io/
[6] Benjamin Kille, Andreas Lommatzsch, Özlem Özgöbek, Mehdi Elahi, and Duc-Tien Dang-Nguyen. 2021. News Images in MediaEval 2021. In Working Notes Proceedings of the MediaEval 2021 Workshop, Online, 13-15 December 2021.
[7] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. 2017. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. Int. J. Comput. Vision 123, 1 (May 2017), 32–73. https://doi.org/10.1007/s11263-016-0981-7
[8] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. In Advances in Neural Information Processing Systems, Vol. 32. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2019/file/c74d97b01eae257e44aa9d5bade97baf-Paper.pdf
[9] Gonçalo Marcelino, Ricardo Pinto, and João Magalhães. 2018. Ranking News-Quality Multimedia. In Proceedings of the 2018 ACM International Conference on Multimedia Retrieval (ICMR '18). Association for Computing Machinery, New York, NY, USA, 10–18. https://doi.org/10.1145/3206025.3206053
[10] Gonçalo Marcelino, David Semedo, André Mourão, Saverio Blasi, João Magalhães, and Marta Mrak. 2021. Assisting News Media Editors with Cohesive Visual Storylines. Association for Computing Machinery, New York, NY, USA, 3257–3265. https://doi.org/10.1145/3474085.3475476
[11] Nelleke Oostdijk, Hans van Halteren, Erkan Başar, and Martha Larson. 2020. The Connection between the Text and Images of News Articles: New Insights for Multimedia Analysis. In Proceedings of the 12th Language Resources and Evaluation Conference. European Language Resources Association, Marseille, France, 4343–4351. https://aclanthology.org/2020.lrec-1.535
[12] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems, Vol. 28. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2015/file/14bfa6bb14875e45bba028a21ed38046-Paper.pdf
[13] Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. FaceNet: A Unified Embedding for Face Recognition and Clustering. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 815–823.
[14] Hao Tan and Mohit Bansal. 2019. LXMERT: Learning Cross-Modality Encoder Representations from Transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 5100–5111. https://doi.org/10.18653/v1/d19-1514
[15] Jörg Tiedemann and Santhosh Thottingal. 2020. OPUS-MT — Building open translation services for the World. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT). Lisbon, Portugal.
[16] Alasdair Tran, Alexander Mathews, and Lexing Xie. 2020. Transform and Tell: Entity-Aware News Image Captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 13032–13042. https://doi.org/10.1109/CVPR42600.2020.01305
[17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS '17). arXiv:1706.03762
[18] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Association for Computational Linguistics, Online, 38–45. https://www.aclweb.org/anthology/2020.emnlp-demos.6
[19] Dongling Xiao, Yu-Kun Li, Han Zhang, Yu Sun, Hao Tian, Hua Wu, and Haifeng Wang. 2021. ERNIE-Gram: Pre-Training with Explicitly N-Gram Masked Language Modeling for Natural Language Understanding. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Online, 1702–1715. https://doi.org/10.18653/v1/2021.naacl-main.136
[20] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In Proceedings of the 32nd International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 37). PMLR, Lille, France, 2048–2057. https://proceedings.mlr.press/v37/xuc15.html
[21] Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. 2016. Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks. IEEE Signal Processing Letters 23, 10 (2016), 1499–1503. https://doi.org/10.1109/LSP.2016.2603342