NewsImages Fusion: Bridging Textual Context and Visual Content in Media Representation

Dr. R. Priyadharsini1, Arvind.V1,*, Harish.J1, P.Vettri Chezian1 and MohanaPriya E1
1 Department of Computer Science and Engineering, Sri Sivasubramaniya Nadar College of Engineering, Chennai - 603110, Tamil Nadu, India

Abstract
As the consumption of news content becomes increasingly visual, the evaluation of news images plays a pivotal role in media understanding and interpretation. This research addresses the challenges associated with the automated assessment of news images and their mapping to textual information using Convolutional Neural Networks (CNNs). The work leverages a comprehensive dataset of news images and proposes a CNN architecture tailored to the intricacies of media content. The research first delves into the existing landscape of news image evaluation, highlighting gaps and limitations in current methodologies. Motivated by the need for robust and efficient image assessment tools, our work focuses on the design and implementation of a CNN tailored for news media. Upon further investigation, the proposed system was found to achieve a testing accuracy of 14.11%.

Keywords: NewsImages Fusion, Text-Image Relationship, Image Captioning

1. Introduction
In the contemporary landscape of digital media, news dissemination is increasingly characterized by the integration of visual content, with news images serving as crucial elements in shaping public perception. As society navigates an era inundated with information, the ability to assess the credibility, relevance, and impact of news images becomes paramount. This research addresses the imperative need for automated and efficient methodologies to evaluate news images, a challenge exacerbated by the sheer volume and diversity of media content. Online news articles are multimodal: the textual content of an article is often accompanied by a multimedia item such as an image.
The image is important for illustrating the content of the text, but also for attracting readers' attention. Research in multimedia and recommender systems generally assumes a simple relationship between images and text occurring together. For example, in image captioning [1] the caption is often assumed to describe the literally depicted content of the image. In contrast, when images accompany news articles, the relationship becomes less clear [2]. Since there are often no images available for the most recent news messages, stock images, archived photos, or even generated photos are used. An additional challenge is the wide spectrum of news domains, reaching from politics to economics to sports, health, and entertainment. The goal of this task is to investigate these intricacies in more depth, in order to understand the implications they may have for the areas of journalism and news personalization. The task takes a large set of news articles paired with their corresponding images. The two entities have been paired, but we do not know how. For instance, journalists could have selected an appropriate picture manually, generated an illustration using generative AI, or a machine could have selected an image from a stock photo database.

MediaEval’23: Multimedia Evaluation Workshop, February 1–2, 2024, Amsterdam, The Netherlands and Online
* Corresponding author.
† These authors contributed equally.
priyadharsinir@ssn.edu.in (Dr. R. Priyadharsini); arvind2320028@ssn.edu.in (Arvind.V); harish2320045@ssn.edu.in (Harish.J); pvettri2320071@ssn.edu.in (P.Vettri Chezian); mohanapriya2320034@ssn.edu.in (MohanaPriya E)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073
The image can have a semantic relation to the story, but it has not necessarily been taken directly at the reported event, and the event need not even exist (in the case of synthetic images). Automatic image captioning is therefore insufficient to map the images to articles.

2. Related Work
The evolving landscape of multimedia content in news articles has spurred significant research efforts to understand and enhance the interaction between text and images. This section provides a comprehensive overview of the background and related work in this domain. Recent work by Lommatzsch et al. [3] has made substantial strides in bridging the "Depiction Gap" with the introduction of NewsImages, an online news dataset focused on text-image rematching that offers valuable insights into the intricate relationship between news articles and their associated images. The authors highlight the challenges in accurately pairing textual and visual content, setting the stage for a deeper exploration. Garcin et al. [4] contribute to the discourse on recommendation systems, emphasizing the limitations of offline evaluations in predicting the performance of diverse recommendation techniques. Their study underscores the need for sophisticated models that incorporate novelty into recommendations and questions the reliability of Click-Through Rate (CTR) as a sole metric, especially for popular items. These findings resonate with the challenges encountered in multimedia recommendation tasks. Ge and Persia [5] provide a comprehensive survey of multimedia recommender systems, shedding light on challenges and opportunities in this domain.
Their work spans research communities, delving into the intersection of multimedia information systems and recommender systems. Categorizing papers based on recommender algorithm, multimedia object, and application domain, the survey identifies key features that pave the way for potential research opportunities. Continuous evaluation in large-scale information access systems is explored by Hopfgartner et al. [6]. They advocate for the adoption of living labs, presenting a case for ongoing evaluation. The relevance of their approach extends to the evaluation of multimedia recommendation systems, providing a framework for refining algorithms and adapting to evolving user preferences. Hossain et al. [1] contribute to the landscape of multimedia understanding with a comprehensive survey of deep learning for image captioning. The survey encompasses the evolving techniques used to bridge the semantic gap between textual descriptions and visual content, a challenge inherent in the news domain explored by our work. The stream-based recommender task overview presented by Lommatzsch et al. [7] at CLEF 2017 is particularly relevant to our study. It emphasizes the need for ongoing evaluation and education in the field of recommender systems, aligning with our goal of refining algorithms based on insights gained from continuous assessments. Oostdijk et al. [2] contribute insights into the connection between text and images in news articles; their work offers new perspectives for multimedia analysis that resonate with our exploration of the interaction between text and images. Lops et al. [8] provide a comprehensive survey of content-based recommender systems, addressing fundamental aspects characterizing this category of systems.
Their exploration of techniques for representing items to be recommended aligns with the challenges posed by diverse multimedia content in news articles. Li and Xie [9] leverage observational data to explore the impact of image content on consumer engagement with social media posts, in the context of posts related to major U.S. airlines and compact SUV models. The study introduces pathways through which image content influences engagement, aligning with our investigation into the interaction between text and images in the realm of news articles. Finally, Liu, Qiao, and Chilton [10] present a significant contribution to the field with their work on multimodal image generation for news illustration. Their exploration of generating images for news articles aligns with the overarching theme of our study, emphasizing the importance of understanding the relationship between textual and visual content.

3. Objective
- Develop a comprehensive dataset of news images representative of diverse media contexts.
- Design and implement a CNN architecture tailored to the specific characteristics of news images.
- Evaluate the performance of the proposed CNN against benchmark methods using carefully selected metrics.
- Provide insights into the potential applications and limitations of CNNs in the realm of news image evaluation.

This task explores the relationship between text and images in news articles. The dataset includes paired news articles and images, with undisclosed pairing methods: manual selection, generative AI, or automatic machine choice. The images may have semantic ties to the story but need not depict the reported event. Conventional image captioning falls short in accurately mapping images to articles in this diverse context. The dataset is curated from web news articles, providing crucial details for each article, including URL, Title, and initial news text.
Paired with each article is a corresponding image, and the dataset covers both English and German articles, with machine-translated versions for the latter. With a 1:1 relationship, the dataset follows a structure akin to the NewsImages 2022 data structures.

4. Approach
The provided code defines a convolutional neural network (CNN) model for image classification using PyTorch. The CNN architecture consists of two convolutional layers followed by max pooling operations and two fully connected layers. The model is trained on a custom dataset, NewsDataset, which combines textual and image data. It loads image data from a specified folder and transforms it using resize and tensor-conversion operations. The training process involves iterating through the dataset, computing predictions, and optimizing the model parameters, with the Mean Reciprocal Rank (MRR) tracked to monitor ranking quality. Evaluation metrics such as MRR, Precision@K, and Recall@K are calculated during both the training and testing phases to assess the model's performance. Finally, the model is evaluated on a separate test dataset, and Precision@K and Recall@K values are reported. Overall, the code represents a pipeline for training and evaluating a CNN model on classification tasks involving textual and image data.

5. Evaluation Methodology
The computation involves the Mean Reciprocal Rank (MRR) as the official metric and a series of Precision@K and Recall@K values, where K takes values in {1, 5, 10, 20, 50, 100}. The primary metric for the task is the average MRR, which reflects the average position at which the linked image appears. Additionally, the average precision scores offer a comprehensive evaluation of performance across the various ranks within the list.

Figure 1: Architecture Diagram

6. Results and Analysis
A series of experiments was conducted, and the proposed system was evaluated using the MRR metrics. The training accuracy was found to be 76.52% and the testing accuracy was found to be 14.11%.
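As a concrete illustration of the architecture described in Section 4, the following is a minimal PyTorch sketch of a two-convolutional-layer network with max pooling and two fully connected layers. The channel widths, the 224×224 input resolution, and the number of output classes are illustrative assumptions, not the exact hyperparameters used in our experiments.

```python
import torch
import torch.nn as nn

class NewsImageCNN(nn.Module):
    """Sketch of the described CNN: two conv+pool stages, then two FC layers."""

    def __init__(self, num_classes: int = 10):  # num_classes is an assumption
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # RGB input
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 224 -> 112
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 112 -> 56
        )
        self.classifier = nn.Sequential(
            nn.Linear(32 * 56 * 56, 128),  # flattened feature map -> hidden layer
            nn.ReLU(),
            nn.Linear(128, num_classes),   # hidden layer -> class scores
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = torch.flatten(x, 1)  # keep the batch dimension
        return self.classifier(x)

model = NewsImageCNN(num_classes=10)
out = model(torch.randn(2, 3, 224, 224))  # a batch of two resized images
print(out.shape)  # torch.Size([2, 10])
```

The images are resized and converted to tensors before being fed to the network, matching the transform pipeline described above; such a model would typically be trained with a standard classification loss while MRR is computed separately from the ranked outputs.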
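The metrics of Section 5 can be computed directly from a ranked list of candidate images per article; the function names and the example ranking below are illustrative, not taken from our implementation.

```python
def reciprocal_rank(ranked_ids, correct_id):
    """1/position of the correct image in the ranked list; 0.0 if absent.
    Averaging this value over all articles yields the MRR."""
    for pos, cand in enumerate(ranked_ids, start=1):
        if cand == correct_id:
            return 1.0 / pos
    return 0.0

def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k candidates that are relevant."""
    return sum(1 for c in ranked_ids[:k] if c in relevant_ids) / k

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items retrieved within the top k."""
    return sum(1 for c in ranked_ids[:k] if c in relevant_ids) / len(relevant_ids)

# Example: the true image "img3" is ranked third for this article.
ranking = ["img7", "img1", "img3", "img9", "img2"]
print(reciprocal_rank(ranking, "img3"))      # 1/3 ~ 0.333
print(precision_at_k(ranking, {"img3"}, 5))  # 0.2
print(recall_at_k(ranking, {"img3"}, 5))     # 1.0
```

With a single linked image per article (the 1:1 setting of this task), Recall@K is simply 1 when the linked image appears in the top K and 0 otherwise, and these per-article values are averaged over the collection for each K in {1, 5, 10, 20, 50, 100}.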
K-Values   Precision   Recall
1          0.1598      0.3211
5          0.1566      0.1566
10         0.1566      3.3199
20         0.1568      6.2695
50         0.1001      10.0000
100        0.0500      10.0000
Table 1: Training Values

K-Values   Precision   Recall
1          0.0000      0.0000
5          0.2000      0.0625
10         0.3000      0.1875
20         0.5000      0.6250
50         0.3200      1.0000
100        0.1600      1.0000
Table 2: Testing Values

7. Discussion and Outlook
The insights gathered from the referenced works pave the way for a comprehensive discussion on the intricate relationship between text and images in news articles. The diverse perspectives offered by researchers in multimedia recommender systems, continuous evaluation, image captioning, and content-based recommendation systems provide a rich foundation for our analysis. We have also observed that the architecture involves two convolutional layers for feature extraction, followed by fully connected layers for further processing and classification. The convolutional layers extract and learn features from the input image, while the fully connected layers combine these features to make predictions about the input image's class.

References
[1] M. Z. Hossain, F. Sohel, M. F. Shiratuddin, H. Laga, A comprehensive survey of deep learning for image captioning, ACM Computing Surveys (CSUR) 51 (2019) 1–36.
[2] N. Oostdijk, H. van Halteren, E. Başar, M. Larson, The connection between the text and images of news articles: New insights for multimedia analysis (2020) 4343–4351.
[3] A. Lommatzsch, B. Kille, Y. Zhou, J. Tesic, C. Bartolomeu, D. Semedo, L. Pivovarova, M. Liang, M. Larson, NewsImages: Addressing the depiction gap with an online news dataset for text-image rematching (2022) 227–233.
[4] F. Garcin, B. Faltings, O. Donatsch, A. Alazzawi, C. Bruttin, A. Huber, Offline and online evaluation of news recommender systems at swissinfo.ch (2014) 169–176.
[5] M. Ge, F. Persia, A survey of multimedia recommender systems: Challenges and opportunities, International Journal of Semantic Computing 11 (2017) 411–428.
[6] F. Hopfgartner, K. Balog, A.
Lommatzsch, L. Kelly, B. Kille, A. Schuth, M. Larson, Continuous evaluation of large-scale information access systems: A case for living labs (2019) 511–543.
[7] A. Lommatzsch, B. Kille, F. Hopfgartner, M. Larson, T. Brodt, J. Seiler, Ö. Özgöbek, CLEF 2017 NewsREEL overview: A stream-based recommender task for evaluation and education (2017) 239–254.
[8] P. Lops, M. De Gemmis, G. Semeraro, Content-based recommender systems: State of the art and trends (2011) 73–105.
[9] Y. Li, Y. Xie, Is a picture worth a thousand words? An empirical study of image content and social media engagement, Journal of Marketing Research 57 (2020) 1–19.
[10] V. Liu, H. Qiao, L. Chilton, Multimodal image generation for news illustration (2022). doi:10.1145/3526113.3545621.