Prompt-based Alignment of Headlines and Images Using OpenCLIP

Lucien Heitz1,2,*, Yuin Kwan Chan1, Hongji Li1, Kerui Zeng1, Abraham Bernstein1 and Luca Rossetto1
1 University of Zurich, Switzerland
2 UZH - Digital Society Initiative, Switzerland

MediaEval’23: Multimedia Evaluation Workshop, February 1–2, 2024, Amsterdam, The Netherlands and Online
* Corresponding author.
Email: heitz@ifi.uzh.ch (L. Heitz); yuinkwan.chan@uzh.ch (Y. K. Chan); hongji.li@uzh.ch (H. Li); kerui.zeng@uzh.ch (K. Zeng); bernstein@ifi.uzh.ch (A. Bernstein); rossetto@ifi.uzh.ch (L. Rossetto)
ORCID: 0000-0001-7987-8446 (L. Heitz); 0009-0004-6727-6471 (Y. K. Chan); 0009-0005-6729-0190 (H. Li); 0009-0008-6174-5272 (K. Zeng); 0000-0002-0128-4602 (A. Bernstein); 0000-0002-5389-9465 (L. Rossetto)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
In this paper, we describe how we leverage OpenCLIP to generate automated image recommendations for online news articles for the MediaEval 2023 NewsImages task. By exploring different text prompting techniques, we devised a total of five retrieval approaches. Results show, however, that the best-performing approach is an unmodified CLIP version with the raw article headline as input. We reflect on this finding and its implications for future NewsImages tasks.

1. Introduction

In recent years, methods for aligning visual media, such as images, with short textual descriptions covering their semantic content have enjoyed increased attention. The introduction of the first CLIP model [1] can be considered a step-change in this regard. In this paper, we leverage these methods for our contribution to the MediaEval 2023 NewsImages task, which aims to align news articles with fitting images [2]. Given the often non-literal relationship between the content of a news article and its corresponding image, this task differs slightly from the more classical problem of semantic text and image alignment. An additional complicating factor is that a news article is often substantially longer than an image caption, which introduces further challenges.

In our approach outlined in this paper, we rely on an OpenCLIP model [3] pre-trained on the LAION-5B dataset [4]. The dataset consists of over five billion web-sourced image-caption pairs, which is six orders of magnitude larger than the provided task training set. Furthermore, as the LAION-5B dataset is web-sourced, it features a relevant subset of online news images. We opt not to fine-tune the OpenCLIP model but instead experiment with different ways of generating textual input from the available article data. The textual input we generate serves as a pseudo caption for a news article. The motivation for doing so lies in the distinct linguistic features of news headlines (e.g., frequent use of noun strings and omission of auxiliary verbs [5]). These features set headlines apart from image captions, the latter of which were used to train the CLIP model. With the input sensitivity of CLIP in mind [1], we therefore tried to close this linguistic gap when using headlines as input prompts to obtain better query results.
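The retrieval mechanism shared by all of our runs is plain zero-shot CLIP scoring of a text prompt against a pool of candidate images. The following is a minimal sketch of this setup using the open_clip_torch package; the architecture ("ViT-B-32"), the pretrained tag ("laion2b_s34b_b79k"), and the file paths are illustrative assumptions rather than the exact configuration used for the submitted runs.

```python
import torch
import open_clip
from PIL import Image

# Load an unmodified OpenCLIP model; architecture and checkpoint tag are
# illustrative placeholders (any LAION-trained checkpoint works the same way).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

headline = "Example news headline used directly as the text prompt"
image_paths = ["candidate_0.jpg", "candidate_1.jpg", "candidate_2.jpg"]  # hypothetical pool

with torch.no_grad():
    # Encode the headline (pseudo caption) and the candidate images.
    text_features = model.encode_text(tokenizer([headline]))
    images = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in image_paths])
    image_features = model.encode_image(images)

# Cosine similarity between the headline and every candidate image.
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
scores = (image_features @ text_features.T).squeeze(-1)

# Rank candidates from best to worst match.
ranking = scores.argsort(descending=True).tolist()
print([image_paths[i] for i in ranking])
```

Across the submitted runs, only the construction of the text prompt varies; the scoring and ranking step stays the same.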
The remainder of this paper is structured as follows: Section 2 describes the setup of our retrieval pipeline, Section 3 presents the run evaluations, and Section 4 concludes the paper with a discussion of these results, together with the lessons learned and their implications for the next iterations of this task.

2. Approach

Our main approach to the image retrieval task is to generate a representative pseudo caption from the textual content of an article. This text is then used as an input prompt for an unmodified OpenCLIP model to select fitting images. The motivation behind focusing on the text prompt is twofold: First, there are stylistic differences between article headlines and leads on the one hand and image captions, on which CLIP was trained, on the other hand. Second, the dataset provided for the task was too small to meaningfully fine-tune the CLIP model, which is why an unmodified version was used instead.

In total, we used five different text generation strategies and submitted one run for each approach. We discuss the details of each of the five approaches in the overview below; a simplified code sketch of the simpler prompt-construction strategies follows the overview.

Run 1 - Raw Title
For the first run, we simply use the article’s title as it is stored in the provided dataset. No additional tags, leads, or outlet information were used. This approach serves as the internal baseline for assessing the performance of the subsequent approaches.

Run 2 - Pre-processed Title
For the second run, we included a text-cleaning pre-processing step. This included removing stop words, punctuation, and other special characters that could result from encoding mismatches (e.g., ‘Ã’ or ‘¢’). Furthermore, we removed any mention of the news outlet and converted the entire text to lowercase. The motivation behind doing so is to create a structurally more caption-like input text without performing any semantic manipulation.

Run 3 - Raw Tags
Run three uses the tags provided in the dataset rather than the article title. The tags are concatenated into a string using a comma as a separator. As the RT dataset did not contain any tags, we used the article text instead. Using tags allows us to include more information, potentially about what is depicted in the image, without exceeding the token limit of the text encoder of the CLIP model.

Run 4 - T5
For the fourth run, we use a pre-trained T5 model [6] to automatically rephrase the article text into a descriptive statement. The goal of this text transformation is to represent the information contained in an article’s title in a form that is closer to a traditional image caption or alt-text, which comprises the training data of the OpenCLIP model. This rewriting aims to produce text that is both structurally and semantically similar to an image caption.

Run 5 - NER-TextRank
For run five, we used named entity recognition provided by the spaCy¹ framework to extract relevant entities from the article title and text. The extracted entities were scored using TextRank [7] to sort them by predicted relevance and remove generic ones. The entities were combined into a string (using the same procedure as in Run 3) to again produce text similar to an image caption.

¹ Official website of spaCy: https://www.spacy.io/
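To make the prompt-construction strategies more concrete, the sketch below outlines simplified versions of the inputs used for Runs 1 to 3. The stop-word list, outlet names, and article field names are hypothetical placeholders, and the actual pre-processing may differ in detail; Runs 4 and 5 additionally require a T5 model and spaCy’s named entity recognition and are therefore omitted here.

```python
import string

# Illustrative resources; the actual lists used for Run 2 are not specified in the paper.
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "on", "for", "and", "is", "are", "with"}
OUTLET_NAMES = {"rt", "reuters"}  # hypothetical outlet mentions to strip


def raw_title(article: dict) -> str:
    """Run 1: use the article title exactly as stored in the dataset."""
    return article["title"]


def cleaned_title(article: dict) -> str:
    """Run 2: lowercase, drop punctuation, encoding debris, stop words, and outlet names."""
    text = article["title"].lower()
    text = text.encode("ascii", errors="ignore").decode()  # crude removal of mis-encoded characters
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in text.split() if t not in STOP_WORDS and t not in OUTLET_NAMES]
    return " ".join(tokens)


def tag_prompt(article: dict, max_chars: int = 300) -> str:
    """Run 3: concatenate the dataset tags with commas; fall back to the
    (truncated) article text when no tags are available, as for the RT articles."""
    if article.get("tags"):
        return ", ".join(article["tags"])
    return article.get("text", "")[:max_chars]


article = {"title": "The Example Headline: A Story, in Brief", "tags": ["politics", "economy"]}
print(raw_title(article))      # The Example Headline: A Story, in Brief
print(cleaned_title(article))  # example headline story brief
print(tag_prompt(article))     # politics, economy
```

The resulting strings then feed the same scoring and ranking pipeline sketched above.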
The different methods used for the run submissions are a first approach to creating pseudo captions for news headlines. However, it remains an open question which recommendations and precise requirements lead to an optimal rephrasing of the article title. Please see Section 4 for a more detailed discussion of alternative approaches.

Table 1: Results for all five submitted runs. Results in bold indicate the best performance for Hits@k, and underlined numbers indicate the second-best results. The numbers are listed separately for each of the three task datasets, together with the average score.

Run                     Hits@k   GDELT1   GDELT2   RT      AVG
Run 1 - Raw title       5        0.691    0.629    0.281   0.534
                        10       0.779    0.715    0.366   0.620
                        50       0.907    0.864    0.556   0.775
                        100      0.943    0.915    0.635   0.831
Run 2 - Pre-processed   5        0.651    0.597    0.279   0.509
                        10       0.734    0.684    0.353   0.590
                        50       0.879    0.849    0.536   0.755
                        100      0.923    0.902    0.628   0.818
Run 3 - Raw tags        5        0.622    0.569    0.213   0.468
                        10       0.714    0.662    0.276   0.551
                        50       0.878    0.842    0.458   0.726
                        100      0.925    0.892    0.545   0.788
Run 4 - T5              5        0.657    0.605    0.190   0.484
                        10       0.747    0.686    0.256   0.563
                        50       0.881    0.854    0.413   0.716
                        100      0.927    0.906    0.491   0.774
Run 5 - NER             5        0.559    0.525    0.185   0.423
                        10       0.647    0.619    0.243   0.503
                        50       0.817    0.785    0.419   0.674
                        100      0.871    0.848    0.507   0.742

3. Results and Analysis

Table 1 summarizes the results achieved by the five submitted runs. The numbers show that Run 1 achieved the highest scores for all Hits@k across all three task datasets; using the raw title as input to the CLIP model substantially outperformed all other approaches. Text cleaning (Run 2) and rephrasing (Run 4) did not improve the retrieval process by producing more caption-like input text. Rewriting news article headlines and teasers into statement sentences with T5 seems to have the opposite effect, as it performed worse than the raw title. Similarly, adding named entities to the input text via text augmentation processes (Run 3 and Run 5) seems to mainly introduce more noise into the retrieval process. TextRank’s keyword extraction was especially detrimental to the retrieval task, resulting in the overall lowest scores (see Table 1, Run 5).

Comparing the achieved scores across datasets, we see our approach performing best on GDELT1, followed by GDELT2 and RT. The analysis of the results suggests that this is mainly due to the inclusion of AI-generated content in GDELT2 and RT. Generated content might perform worse because the contents of an image, e.g., human subjects, can be heavily stylized in the GDELT2 and RT datasets. Evaluations of the training runs showed that retrieving the correct image, if the matching image was AI-generated, was highly dependent on knowing the details of the model used to create the image in the first place. As this information was not communicated for the provided task datasets, it was very difficult to include any AI-specific text rephrasing or augmentation technique that properly accounts for the characteristics of the AI images.

Overall, we take the performance of the raw title baseline as an indicator that editors select images mainly based on the headline of a news story. As such, we think generating pseudo captions remains a worthwhile strategy to pursue.
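For reference, the Hits@k values reported in Table 1 correspond to the fraction of articles whose single ground-truth image appears among the top-k ranked candidates. A minimal sketch of this computation, assuming a precomputed article-to-image similarity matrix, could look as follows; the toy numbers are purely illustrative.

```python
import numpy as np


def hits_at_k(similarities: np.ndarray, gold_indices: np.ndarray, k: int) -> float:
    """Fraction of articles whose ground-truth image is among the top-k candidates.

    similarities: (num_articles, num_images) article-to-image similarity scores
    gold_indices: (num_articles,) index of the original image for each article
    """
    # Rank images per article by descending similarity and keep the top k.
    topk = np.argsort(-similarities, axis=1)[:, :k]
    hits = (topk == gold_indices[:, None]).any(axis=1)
    return float(hits.mean())


# Toy example: 3 articles, 5 candidate images.
sims = np.array([
    [0.9, 0.1, 0.3, 0.2, 0.4],
    [0.2, 0.8, 0.1, 0.7, 0.3],
    [0.1, 0.2, 0.3, 0.4, 0.5],
])
gold = np.array([0, 3, 4])

print(hits_at_k(sims, gold, k=1))  # 0.667 -> two of the three gold images rank first
print(hits_at_k(sims, gold, k=2))  # 1.0   -> all gold images are within the top 2
```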
4. Discussion and Outlook

Leading up to the submission, we explored several alternative text-image embedding approaches, including feed-forward networks, LSTM-based models, and Siamese networks. We combined these approaches with various techniques to rewrite the text prompts, such as employing vector combinations of title, lead, tags, and text. Unfortunately, no approach was able to beat the baseline of using the raw article title as input for the OpenCLIP model. We believe this shortcoming is partly due to the limited amount of training data available when exploring these alternative text-image embedding approaches. The task dataset was too small to serve as a training set for model fine-tuning. For future iterations of the NewsImages task, having access to a larger training dataset is, therefore, critical.

Looking at future iterations of the NewsImages task, we would like to highlight two possible strategies that could improve the retrieval pipeline. The first option to explore is creating a dedicated model that transforms article headlines into caption-like descriptions. For that, we would need to investigate outlet- and story-specific requirements for rephrasing more closely. The second option is to take the current pipeline and reverse it. The resulting workflow would start with the images, generate a caption for each image, and then find the most closely matching news headline.

In concluding our working notes paper, we want to briefly comment on two shortcomings we saw in connection with the evaluation process and the goal of the task. The first point we want to address is that, by allowing one and only one image to be a valid match for a given news article, the evaluation process seemingly implies a one-to-one relationship between article text and image (cf. [8]). This introduces an artificial quality standard that does not exist in the editorial process of selecting an image for a news article. For a given story, editors can select from among multiple images. Ideally, this would be reflected in the evaluation process: the fit of a given image-article pair should receive a more fine-grained score than the current binary one of either being the original image or not.

Our second point concerns AI-generated content. We found that the inclusion of AI-generated images in GDELT2 and RT not only led to an overall lower score compared to GDELT1 but also potentially entails a major shift in the task’s goal. Instead of providing meaningful image recommendations for article headlines, the task becomes more focused on trying to recreate the exact image generation pipelines. For more details on the two highlighted aspects, please see our Quest for Insight paper [8].

Acknowledgments

This work was partially funded by the Digital Society Initiative (DSI) of the University of Zurich under a grant of the DSI Excellence Program and by the Swiss National Science Foundation through project MediaGraph (contract no. 202125).

References

[1] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever, Learning transferable visual models from natural language supervision, in: M. Meila, T. Zhang (Eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, PMLR, 2021, pp. 8748–8763. URL: http://proceedings.mlr.press/v139/radford21a.html.
[2] A. Lommatzsch, B. Kille, Ö. Özgöbek, M. Elahi, D.-T. Dang-Nguyen, News Images in MediaEval 2023, in: Working Notes Proceedings of the MediaEval 2023 Workshop, 2024, p. 4.
[3] M. Cherti, R. Beaumont, R. Wightman, M. Wortsman, G. Ilharco, C. Gordon, C. Schuhmann, L. Schmidt, J. Jitsev, Reproducible scaling laws for contrastive language-image learning, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, IEEE, 2023, pp. 2818–2829. URL: https://doi.org/10.1109/CVPR52729.2023.00276. doi:10.1109/CVPR52729.2023.00276.
[4] C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, P. Schramowski, S. Kundurthy, K. Crowson, L. Schmidt, R. Kaczmarczyk, J. Jitsev, LAION-5B: an open large-scale dataset for training next generation image-text models, in: NeurIPS, 2022. URL: http://papers.nips.cc/paper_files/paper/2022/hash/a1859debfb3b59d094f3504d5ebb6c25-Abstract-Datasets_and_Benchmarks.html.
[5] S. Marcoci, et al., Some typical linguistic features of English newspaper headlines, Linguistic and Philosophical Investigations (2014) 708–714.
[6] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, J. Mach. Learn. Res. 21 (2020) 140:1–140:67. URL: http://jmlr.org/papers/v21/20-074.html.
[7] F. Barrios, F. López, L. Argerich, R. Wachenchauzer, Variations of the similarity function of TextRank for automated summarization, CoRR abs/1602.03606 (2016). URL: http://arxiv.org/abs/1602.03606. arXiv:1602.03606.
[8] L. Heitz, A. Bernstein, L. Rossetto, An empirical exploration of perceived similarity between news article texts and images, in: Working Notes Proceedings of the MediaEval 2023 Workshop, 2024.