The Relation between Texts and Images in News: News Images in MediaEval 2023 Andreas Lommatzsch1,† , Benjamin Kille2 , Özlem Özgöbek2 , Mehdi Elahi3 and Duc-Tien Dang-Nguyen3 1 Technische Universität Berlin, Berlin, Germany 2 Norwegian University of Science and Technology, Trondheim, Norway 3 University of Bergen, Bergen, Norway Abstract News articles typically consist of text and images. Images plays a crucial role in catching the user’s attention and emphasising the article’s message. For each news text, the editor must select the best photos from the available set of recent photos, archived photos or stock images to both attract user’s attention and best fitting with the news article text. The NewsImages benchmark aims to shed a light on this real-world relation between news texts and the accompanying images. The task provides datasets and evaluation components for studying this relation. The datasets includes AI-generated images as an additional research challenge. This paper describes the NewsImages task in detail, giving the explanations for the dataset and evaluation metrics. It also discusses the connections to existing research and the addressed challenges. 1. Introduction In the fast-paced world of digital journalism, news articles are inherently multi-modal, seamlessly intertwining text and images to convey information. Among the various components of a news article, images occupy a pivotal role. They not only serve as a visual aid; they catch the readers’ interest, compelling them to delve into the text. Furthermore, images reinforce the central message of the article, often providing context or offering a visual perspective that words alone might fail to capture. With the rise of generative artificial intelligence, there has been a shift towards automating news article creation. This automation includes the generation of text and images that align perfectly with the content. NewsImages task aims to support the research in understanding the relationship between news texts and their accompanying images on news portals. This relationship is full of challenges. The vast expanse of news topics, the diversity in domains, the plethora of news portals, and the myriad styles of news articles, all culminate in a complex web of considerations when matching text with images. Delving deeper into the scenario, NewsImages is driven by several pertinent questions: How can the connection between texts and images in news articles be re-established? Multimedia Evaluation Workshop, 1–2 Feb. 2024, Amsterdam, Netherlands † Corresponding author. $ andreas.lommatzsch@dai-labor.de (A. Lommatzsch); benjamin.u.kille@ntnu.no (B. Kille); ozlem.ozgobek@ntnu.no (Ö. Özgöbek); mehdi.elahi@uib.no (M. Elahi); ductien.dangnguyen@uib.no (D. Dang-Nguyen) © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings To what extent do generated images alter this re-establishment? Are there discernible patterns or principles that guide editors when they select images for news pieces? And, in the grand scheme of automated news generation, are there innovative methods to generate better-suited images for given news texts? 2. Background and Related Work Deciphering the relationship between text and images in news articles is an important task for understanding both the creation and the perception of content in the news sector. The depiction gap between texts and images is a major problem [1]. Significant advances in image comprehension have recently been made through deep neural networks, enabling systems not only to detect intricate concepts within images but also to identify pertinent objects with high precision. The embedding of concepts extracted from images and texts within a unified vector space is central to this advancement, facilitating nuanced correlations. While there are multiple datasets tailored for optimizing learning strategies in image labeling (e.g. MS COCO [2]), a new frontier lies in generative AI’s capability to produce high-resolution images from text descriptors. The landscape of news imagery, dominated by stock photos, portraits, and loosely related archival images, presents unique challenges, often accentuated by the absence of directly relevant visuals. This inspires the pivotal research question: How are images and text interconnected in the context of news? Furthermore, this opens a broader inquiry into AI’s potential role in enhancing news article formulation, opening avenues for automated, contextually appropriate visual representation. For the 5th time, the NewsImages challenge explores the aspects of multimedia content in news. The first editions (NewsREEL Multimedia [3, 4, 5]) focused on predicting the popularity of news items based on multimedia content. In 2021, the focus shifted to understanding the relationship between text and images [6]. In 2023, we extend the task by adding AI generated images to further explore the relation between news and AI generated images. The NewsImages task is related to several research topics, such as multi-modal recommender systems [7, 8, 9], the detection of fake news [10], and multi-modal embedding methods [11]. The task supports the research toward multi-modality in different news related domains. 3. Task Description The NewsImages benchmark investigates the connection between textual news content and associated imagery. This year’s task draws its data from two distinct news dissemination chan- nels: official publishers’ portals and RSS feeds. Participants are provided with a comprehensive training dataset, encompassing linked text-image pairs, complemented by a test dataset with disassociated pairs. The challenge mandates the development and critical evaluation of innova- tive methodologies to accurately re-associate news articles with corresponding images. The dataset represents a challenge with instances of images, such as conceptual stock photographs, potentially aligning with multiple articles. Participants are required to submit a prioritized list of plausible image matches, with the evaluation metric favoring early correct re-associations. 4. Dataset NewsImages provides a dataset compris- Batch Source Purpose No. Cases ing three parts built on news from news portals and an RSS news feed. As source GDELT-P1-a Web sites Training 8500 for the crawled web sites, we use the GDELT-P1-b Web sites Test 1500 GDELT project (https://www.gdeltproject. GDELT-P2-a Web sites Training 12 041 org/) that aggregates news from all over GDELT-P2-b Web sites Test 1500 the world. For the RT part the RSS feed RT-a RSS Feed Training 9755 rtde1 has been used. The dataset has RT-b RSS Feed Test 3000 been created using the following three steps: (1) Crawling: Crawl news items Table 1: Dataset Statistics. The dataset comes in six from the selected sources and eliminate batches. The number of cases refers to the news articles that do not consist of an article-image pairs. image and a suitable text. We use news items published in the period November 2022–August 2023. For the GDELT part the news title and the entities (extracted by GDELT from the news text for creating knowledge graphs) are used (http://data.gdeltproject.org/gkg/index.html. For the RT part, the news title and the snippet (both German originals and English machine translation of these fields) are used. (2) Cleaning: For ensuring the quality of images, we use different heuristics for removing duplicates, low quality images, and logos. In addition, we remove images mainly consist of text. (3) Image generation: For studying the problem of matching generated images we use Stable Diffusion. We use the news article’s headline as the prompt. The generated images are used to replace some of the original images. In the three parts of the dataset, the fraction of generated images differs. Part GDELT-P1 does not contain any generated images; GDELT-P2 contains 80% generated images, and RT has 50% generated images. (4) Splitting: Each part of the dataset is split into a training and a test set as Table 1 illustrates. The data set contains information related to articles and images. Articles’ metadata include the URL, title, and a text snippet (RT batch) or the entities extracted from the news text (GDELT batch). Image captions or image filenames must not be used in the task. 5. Evaluation The NewsImages benchmark is designed to analyze the relation between news texts and the accompanying images. As a concrete task, the participants must assign a matching image to for each news text in the given test set. Concretely, for each news article an ordered list of 100 images must be submitted. The participants provide a text file that provides a tab separated list of 100 image IDs for each news article ID. The participants’ submissions are evaluated against a ground truth defined by the originally crawled connection between the images and the text. The ground truth ensures that a 1:1 relation between the images and the texts exists. 5.1. Evaluation Metric The participants’ submissions are evaluated using the Mean ∑︀ Reciprocal Rank (MRR) [12] as the main evaluation criteria. MRR is defined as MRR = 𝑁1 𝑁 1 𝑛=1 rank(𝑥𝑛 ) , where rank(𝑥𝑛 ) returns 1 http://de.rt.com/feeds/news/ the rank at which the matching image was listed. The earlier the matching image appears on average, the higher the score. The Mean Reciprocal favors the top of the list and penalizes finding a match further down. In addition to MRR, we also compute the Average Recall (AR) at ranks 𝑁 for 𝑁 ∈ {1, 5, 10, 20, 50, 100}. AR computes the average over the recall scores calculated for each news article. The evaluation scores are computed separately for each batch. 5.2. Run Description Participants are encouraged to contribute working notes that elucidate their innovative concepts, fostering an in-depth exploration of the intricate relationship between textual content and images in news media. In this pursuit, participants have the opportunity to submit a maximum of five runs for each of the three test datasets. Each run entails a set of predictions tailored to these test datasets. We encourage participants to engage in a comprehensive comparative analysis of their various runs, encompassing assessments of quality, computational complexity, and resource utilization. Furthermore, the discussion of results should be characterized by a nuanced consideration of the datasets’ idiosyncrasies, illuminating how the discoveries made can be extrapolated to diverse scenarios. To culminate, participants are expected to articulate their insights and reflect on their potential contributions towards advancing cutting-edge research in this field. 6. Conclusion The linking between news texts and images is still a complicated problem due to the news domain’s diversity, editors’ habits, and readers’ expectations. The mixture of real photos, stock images, archived photos, and AI generated images makes it very challenging to extract not only concepts from images but also to understand the principles applying when selecting the images. The NewsImages challenge provides a medium-sized, real-world data set for investigating the existing principles for connecting images and texts. Participants can develop, optimize, and evaluate innovative re-matching methods for news texts and images.With the growing popularity and enhancement of AI methods for generating images, images that are more representative of the text could replace the partially matching images like stock photos. These artificial images could be used to reinforce the credibility of fake news but also avoid misinterpretation of news caused by ill-fitted stock images. Thus, understanding the relation between news texts and images remains a highly relevant and challenging research topic. News Images provides the foundation to foster the development and evaluation of innovative approaches. Acknowledgments We gratefully thank Marc Gallofré Ocaña and Sohail Ahmed Khan for supporting the data set creation. We acknowledge the contributions of the GDELT (https://www.gdeltproject.org/) project for providing the data which made the dataset creation possible. References [1] A. Lommatzsch, B. Kille, O. Özgöbek, Y. Zhou, J. Tešić, C. Bartolomeu, D. Semedo, L. Pivovarova, M. Liang, M. Larson, NewsImages: Addressing the Depiction Gap with an Online News Dataset for Text-Image Rematching, in: Proceedings of the 13th ACM Multimedia Systems Conference, MMSys ’22, Association for Computing Machinery, New York, NY, USA, 2022, p. 227–233. URL: https://doi.org/10.1145/3524273.3532891. [2] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft COCO: Common Objects in Context, in: European Conference on Computer Vision, Springer, 2014, pp. 740–755. doi:10.1007/978-3-319-10602-1_48. [3] A. Lommatzsch, B. Kille, F. Hopfgartner, L. Ramming, MediaEval 2018 - Overview on NewsREEL Multimedia, in: Proceedings of the MediaEval Benchmarking Initiative for Multimedia Evaluation 2018, CEUR Workshop Proceedings, 2018. URL: http://ceur-ws.org/Vol-2283/. [4] Y. Deldjoo, B. Kille, M. Schedl, A. Lommatzsch, J. Shen, The 2019 Multimedia for Recommender System Task: MovieREC and NewsREEL at MediaEval, in: Procs. of the MediaEval Benchmarking Initiative for Multimedia Evaluation 2019, CEUR WS Procs., 2019. URL: http://ceur-ws.org/Vol-2670/. [5] B. Kille, A. Lommatzsch, O. Özgöbek, NewsImages: The Role of Images in Online News, in: Proceedings of the MediaEval Benchmarking Initiative for Multimedia Evaluation 2020, CEUR Workshop Proceedings, 2020. URL: http://ceur-ws.org/Vol-2882/. [6] B. Kille, A. Lommatzsch, Ö. Özgöbek, M. Elahi, D.-T. Dang-Nguyen, News Images in MediaEval 2021, in: Proceedings of the MediaEval Benchmarking Initiative for Multimedia Evaluation 2021, CEUR Workshop Proceedings, 2021. URL: http://ceur-ws.org/Vol-3181/paper2.pdf. [7] A. Salah, Q.-T. Truong, H. W. Lauw, Cornac: A Comparative Framework for Multimodal Recom- mender Systems., J. Mach. Learn. Res. 21 (2020) 95–1. [8] S. Oramas, O. Nieto, M. Sordo, X. Serra, A Deep Multimodal Approach for Cold-start Music Recommendation, in: Procs. of the WS on Deep Learning for Recommender Systems, 2017, pp. 32–37. [9] Y. Deldjoo, M. Schedl, P. Cremonesi, G. Pasi, Recommender Systems Leveraging Multimedia Content, ACM Computing Surveys (CSUR) 53 (2020) 1–38. [10] X. Zhou, R. Zafarani, A Survey of Fake News, ACM Computing Surveys 53 (2020) 1–40. URL: http://dx.doi.org/10.1145/3395046. doi:10.1145/3395046. [11] L. Cui, S. Wang, D. Lee, SAME: Sentiment-Aware Multi-Modal Embedding for Detecting Fake News, in: Pros of the 2019 Intl. Con. on Advances in Social Networks Analysis and Mining, ASONAM ’19, ACM, New York, NY, USA, 2020, p. 41–48. doi:10.1145/3341161.3342894. [12] E. M. Voorhees, et al., The TREC-8 Question Answering Track Report., in: TREC, volume 99, 1999, pp. 77–82.