An Empirical Exploration of Perceived Similarity between News Article Texts and Images

Lucien Heitz1,2,*, Abraham Bernstein1 and Luca Rossetto1

1 University of Zurich, Switzerland
2 UZH - Digital Society Initiative, Switzerland

Abstract

The NewsImages task at MediaEval implicitly assumes that there is a one-to-one mapping between news articles and images, given that exactly one image is considered a fit in the evaluation phase. In this Quest for Insight, we empirically explore this assumption. We conduct a user study in which we show participants images from different sources and ask how well each image fits a given article from the NewsImages task. We find that 1.) there can be multiple images per article that are considered equally fitting, 2.) images from within the task dataset can beat the ground-truth images for certain articles, and 3.) AI-generated images underperform in comparison with editorially selected images. Based on these insights, we suggest an alternative evaluation strategy for the task and a clear separation of editorial images and AI-generated content.

1. Introduction

The MediaEval NewsImages benchmark [1] aims at deepening the understanding of the relation between news articles and editorially selected images. Task participants are asked to come up with computational means to find the correct mapping between a set of articles and images in a test set. The quality of the restored mapping is assessed using measures such as Mean Reciprocal Rank and Hits@k. These evaluation metrics assume, at least implicitly, that only one image matches an article; all other options are deemed equally incorrect. Given the non-literal relationship between the content of an article and its image, the assumption that there could only ever be exactly one matching image appears overly restrictive. In this Quest for Insight, we empirically investigate the perceived fit between a given news headline plus teaser and several candidate images.
We sampled a subset of the task dataset and paired every article with four images. We then asked participants to rate how fitting the images are. The evaluated images include 1.) the ground truth/baseline provided by the dataset, 2.) an image retrieved from an external stock image platform, 3.) an AI-generated image, and 4.) an alternative image from the task dataset. We show that the NewsImages evaluation procedure can lead to a situation where the task formulation promotes sub-optimal image selection. We, therefore, suggest and discuss alternative evaluation strategies. Our motivation for doing so is to increase the external validity and practical applicability of the insights gained from the NewsImages task. There is still a need for a better understanding of how headlines are written and images are assigned to stories, as well as of the tools supporting news editors. The focus on perceived image fit is, therefore, only a first step in assessing the practical relevance of the text-image matching strategies in this task.

MediaEval’23: Multimedia Evaluation Workshop, February 1–2, 2024, Amsterdam, The Netherlands and Online
* Corresponding author.
Email: heitz@ifi.uzh.ch (L. Heitz); bernstein@ifi.uzh.ch (A. Bernstein); rossetto@ifi.uzh.ch (L. Rossetto)
ORCID: 0000-0001-7987-8446 (L. Heitz); 0000-0002-0128-4602 (A. Bernstein); 0000-0002-5389-9465 (L. Rossetto)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

2. Approach

We implemented our experiment as an online survey on Qualtrics and recruited participants via Prolific.1 We chose a within-subject experimental design with repeated measurements for the user survey. We recruited a total of N = 73 users. Participation required users to be fluent English speakers; no additional requirements were imposed.
Each participant rated the perceived fit between the title plus lead of an article and four different images. The articles were randomly chosen from the GDELT2 training dataset. For each article headline and lead, users were shown a selection of four images. This image selection was created using the following methods:

Ground truth: As a baseline, we use the image defined as the corresponding fit in the GDELT2 training set. For a given article, the dataset includes either an editorially selected image from the respective news outlet or an AI-generated image from the task organizers. We use the GDELT2 dataset, as it is the only task dataset containing news in English with both editorially selected and AI-generated images.

Stock image: To be able to compare images from both within and outside the dataset, we included an image obtained from the free stock photo platform Unsplash.2 We retrieved images using the article headline and lead as search terms.3 The first image from the search results was selected without any further manual curation. We added stock images to see whether outside-domain sources serve as an alternative image pool.

AI generated: The AI image was generated with Stable Diffusion [2] using the Realistic Vision V6 model.4 The article headline and lead were used as positive prompts, together with the recommended negative prompts of the model, and DPM++ SDE Karras (25 steps) as the sampler, with a randomized seed for each picture. We included this additional source of AI images, as it enables us to tell whether people like or dislike AI-generated images in general, or whether there are model-specific preferences.

CLIP-based retrieval: We included an additional image from the dataset that was not considered a match. We used the best-ranked image that was different from the ground truth, retrieved using an OpenCLIP model [3] pre-trained on the LAION-5B dataset [4].5 For more details on our retrieval approach, please see our Working Notes paper [5].
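The core of this retrieval step can be sketched as follows. This is a minimal illustration of the ranking logic only: `cosine` and `rank_images` are hypothetical helpers, and the toy vectors stand in for the OpenCLIP text and image embeddings used in the actual pipeline.

```python
# Sketch of the CLIP-based retrieval step: rank candidate images by cosine
# similarity between a text embedding and each image embedding. In the
# actual pipeline the embeddings come from the OpenCLIP model cited above;
# the toy 3-dimensional vectors here only illustrate the ranking logic.

import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_images(text_emb, image_embs):
    """Return image indices sorted by descending similarity to the text."""
    scored = sorted(((cosine(text_emb, e), i) for i, e in enumerate(image_embs)),
                    reverse=True)
    return [i for _, i in scored]

# Toy embeddings: image 2 points in almost the same direction as the text.
text = [1.0, 0.2, 0.0]
images = [[0.0, 1.0, 0.0], [0.5, 0.5, 0.5], [0.9, 0.1, 0.1]]
ranking = rank_images(text, images)  # [2, 1, 0]
# In the study, the best-ranked image different from the ground truth was
# used as the alternative in-dataset image.
```

Ranking by cosine similarity of L2-normalized embeddings is the standard retrieval mode for CLIP-style models, which is why this reduced form captures the essential step.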
We featured alternative images from the task dataset to 1.) better assess the performance of our OpenCLIP approach and 2.) examine the performance of inside-domain images.

Each user was given five article units. One such unit consists of four questions in which one article title and lead are paired with each of the four image choices. Participants had to rate how well each of the four image choices fits the given article headline and lead. Overall, we showed 20 different news article-image pairs to each user. A total of 20 news articles were randomly selected from the task dataset. Between 16 and 19 user ratings were recorded for each question unit of these news articles. Question units were randomly assigned to users, as was the ordering of article-image pairings within each unit.

1 Official websites of Qualtrics: https://www.qualtrics.com, and Prolific: https://www.prolific.com/
2 Official website of Unsplash: https://www.unsplash.com/
3 As GDELT2 is a collection of articles from different news outlets, not every article includes a lead. In cases where no lead was available, we instead selected the article’s first sentence.
4 Available online: https://www.huggingface.co/SG161222/Realistic_Vision_V6.0_B1_noVAE
5 Available online: https://www.huggingface.co/laion/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k

We asked users the following question to assess the fit of each of the four images: “Please indicate to what extent you agree with the following statement: I think this image fits the news article.” The agreement for each article-image pair was expressed on a Likert scale from 1 (Strongly disagree) to 7 (Strongly agree).

3. Results and Analysis

Figure 1 shows the agreement on the fit of an image with a given news article, grouped by the four different image sources. Ratings in green express the perceived fit of natural images, i.e., editorially selected images; ratings in orange express the perceived fit of generated images.
In our analysis in this section, we focus on 1.) participants’ disagreement on fit across image sources and 2.) the choice of model for image generation, together with the impact of AI-generated content.

The perceived fit of an image ranges from “Strongly agree” to “Strongly disagree” across all four groups. This spread indicates a certain degree of disagreement among users: not all users agree on what a good fit is. Multiple images per article can be perceived as equally fitting. This holds across all image sources, for both natural and generated images. If there truly were only one fitting image, as the evaluation process of the task assumes, we would expect predominantly disagreeing ratings for all but the ground truth. This is, however, not what we observe in our results: both our generated AI images and the CLIP results can match the user ratings of the ground truth.

One source that cannot match the ground truth is stock images. Outside-domain images seem to provide a worse perceived fit than inside-domain images retrieved via CLIP.

Regarding the added AI-generated content, we see a substantial difference in the rating distributions when comparing the natural and generated images in the ground truth. And while our own AI-generation pipeline achieves a more favorable rating distribution, more closely mirroring the ratings of natural images in the ground truth, it nevertheless fails to yield a high count of “Strongly agree” ratings. While leveraging CLIP’s capacity for content curation does shift the ratings of the ground-truth AI images in a more favorable direction, it still falls short of providing results comparable with natural images. We, therefore, see a clear qualitative difference between the two image types, one that seemingly cannot be alleviated by the image selection/retrieval technique.
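The per-source comparison above amounts to tabulating Likert ratings into one distribution per image source. A minimal sketch follows; the ratings are hypothetical placeholders, while the 7-point scale and the four sources match the study design.

```python
# Sketch of the per-source analysis: tabulating 7-point Likert ratings into
# one distribution per image source. The ratings below are hypothetical;
# in the study, 16 to 19 ratings were collected per article-image pair.

from collections import Counter

SCALE = {1: "Strongly disagree", 2: "Disagree", 3: "Somewhat disagree",
         4: "Neither agree nor disagree", 5: "Somewhat agree",
         6: "Agree", 7: "Strongly agree"}

# One (image_source, rating) pair per participant answer (hypothetical data).
answers = [("ground_truth", 7), ("ground_truth", 6), ("stock", 3),
           ("stock", 2), ("ai", 5), ("clip", 6), ("clip", 7)]

def distributions(answers):
    """Map each image source to a Counter over Likert labels."""
    dists = {}
    for source, rating in answers:
        dists.setdefault(source, Counter())[SCALE[rating]] += 1
    return dists

dists = distributions(answers)
# e.g. dists["stock"] counts one "Somewhat disagree" and one "Disagree".
```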
At the same time, however, we find that there is no general difference in perceived fit when it comes to AI-generated content: the fit of an image depends less on the nature of the image and more on the model used to generate it.

4. Discussion and Outlook

Based on our survey results, we would like to suggest two potential changes for future NewsImages tasks.

Our first discussion point centers on the evaluation procedure. The implicit assumption of the evaluation of the retrieval task is that there is exactly one fitting image for each article. If this one-to-one relation between articles and images existed, we would expect Figure 1 to mirror it: only top ratings in the ground truth group and high disagreement ratings across all other image sources. However, this is not the case. Hence, the task evaluation procedure fails to account for the fact that there are multiple images, even within the same task dataset, that fit an article equally well, if not better.

Figure 1: Distribution of participants’ agreement with the statement “I think this image fits the news article.” with respect to the four different image sources (Ground Truth, Stock, AI, and CLIP, each marked as natural or generated; ratings range from “Strongly agree” to “Strongly disagree”).

Therefore, we think the evaluation procedure should be adjusted, as the implicit assumption of exactly one relevant image per news article does not hold. It might be necessary to switch from an ex-ante relevance assessment to an ex-post one. A possible inspiration is the Inferred Mean Average Precision [6] metric used by the TRECVid Ad-Hoc Video Search task.

The second discussion point focuses on the inclusion of AI-generated content. Our results show that AI images tend to receive lower overall ratings.
As a consequence, we see that including both natural and generated images creates a certain tension between the task’s goals and its execution. Suppose finding a fitting image is the primary objective of this task, and AI-generated content is part of the ground truth. In that case, the task promotes retrieval techniques that select images empirically shown to be perceived as less fitting than comparable natural ones. Competing systems would be required to focus on re-creating or guessing the image-generation pipeline of the AI content. Shifting the task focus from finding fitting images to recreating image-generation pipelines should be avoided, as this conflicts with the goal of the task.

AI content should, therefore, be treated differently from editorial content. One possible solution is to have two dedicated sub-tasks with two different datasets, one with and one without AI images. Furthermore, the decision not to communicate the image-generation pipeline needs careful reconsideration, as this has significant implications for system design.

Using our survey findings, we would like to conclude our paper by giving an outlook on additional real-world challenges regarding text-image matching for news articles. Among the most important questions to investigate is the process of how editors select images. Do they primarily focus on the title, the lead, or the article text? The practical implication of this question is that the task organizers might need to provide additional information, as the dataset currently does not feature full articles. Related aspects worth investigating for building retrieval pipelines are the intentions of editors when selecting images, the tools at their disposal, domain-specific requirements, user expectations, and varying preferences.
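The evaluation adjustment proposed above can be made concrete with a small sketch: under the current procedure, reciprocal rank is computed against a single ground-truth image, whereas a multi-relevant variant would credit any image that users judge as fitting. The image names, ranking, and relevance threshold below are illustrative assumptions, not task data; the paper points to Inferred MAP [6] as one concrete alternative metric.

```python
# Sketch contrasting the current evaluation (exactly one relevant image per
# article, scored via reciprocal rank) with a multi-relevant variant that
# credits any image users rated as fitting. All data here is illustrative.

def reciprocal_rank(ranking, relevant):
    """Return 1/rank of the first relevant image in the ranking, 0 if none."""
    for pos, image in enumerate(ranking, start=1):
        if image in relevant:
            return 1.0 / pos
    return 0.0

# Hypothetical system output for one article.
ranking = ["img_b", "img_a", "img_c"]

# Current setup: only the dataset's single ground-truth image counts.
single = reciprocal_rank(ranking, {"img_a"})          # 0.5

# Suggested direction: every image judged as fitting counts as relevant,
# e.g. all images with a median user rating of 5 or higher on the 7-point
# scale (the threshold is an assumption for illustration).
multi = reciprocal_rank(ranking, {"img_a", "img_b"})  # 1.0
```

The same system output thus scores very differently depending on whether the relevance judgments admit one or several fitting images, which is the crux of the suggested change.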
Acknowledgments

This work was partially funded by the Digital Society Initiative (DSI) of the University of Zurich under a grant of the DSI Excellence Program and by the Swiss National Science Foundation through project MediaGraph (contract no. 202125).

References

[1] A. Lommatzsch, B. Kille, Özlem Özgöbek, M. Elahi, D.-T. Dang-Nguyen, News Images in MediaEval 2023, in: Working Notes Proceedings of the MediaEval 2023 Workshop, 2024, p. 4. URL: https://irml.dailab.de/wp-content/uploads/2023/11/NewsImages2023-LabOverview-v20231101.pdf.
[2] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, B. Ommer, High-resolution image synthesis with latent diffusion models, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, IEEE, 2022, pp. 10674–10685. doi:10.1109/CVPR52688.2022.01042.
[3] M. Cherti, R. Beaumont, R. Wightman, M. Wortsman, G. Ilharco, C. Gordon, C. Schuhmann, L. Schmidt, J. Jitsev, Reproducible scaling laws for contrastive language-image learning, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, IEEE, 2023, pp. 2818–2829. doi:10.1109/CVPR52729.2023.00276.
[4] C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, P. Schramowski, S. Kundurthy, K. Crowson, L. Schmidt, R. Kaczmarczyk, J. Jitsev, LAION-5B: An open large-scale dataset for training next generation image-text models, in: NeurIPS, 2022. URL: http://papers.nips.cc/paper_files/paper/2022/hash/a1859debfb3b59d094f3504d5ebb6c25-Abstract-Datasets_and_Benchmarks.html.
[5] L. Heitz, Y. K. Chan, H. Li, K. Zeng, A. Bernstein, L. Rossetto, Prompt-based alignment of headlines and images using OpenCLIP, in: Working Notes Proceedings of the MediaEval 2023 Workshop, 2024.
[6] E. Yilmaz, J. A. Aslam, Estimating average precision when judgments are incomplete, Knowl. Inf. Syst. 16 (2008) 173–211. doi:10.1007/s10115-007-0101-7.