<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Investigating the Performance of the CLIP Model and Concept Matching in Text-Image Retrieval Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xiaomeng Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mingliang Liang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martha Larson</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Radboud University</institution>
          ,
          <country country="NL">Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Improving the comprehension of textual and visual interaction in news articles significantly improves the efficiency of news text-image retrieval. We evaluate the performance of the pre-trained CLIP model on the MediaEval 2023 NewsImages benchmark. Additionally, we investigate the contribution of concept matching to our text-image matching system by tokenizing, part-of-speech tagging, and filtering to extract concepts from the news title. In addition, by analyzing the training datasets, we gain insights into what leads to better performance in text-image matching. Our working notes report the official results of our submitted runs and present additional experiments.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Retrieving a suitable image (or text) that corresponds well to a given text (or image) is a challenging
task in the vision-language domain [
        <xref ref-type="bibr" rid="ref1">1, 2, 3</xref>
        ], especially in the domain of news articles [4],
because the connection between a news image and the related article is often loose [4].
Consequently, recognizing the interactions between images and text is particularly important
in the news domain, as it helps to develop better models for matching news images and text.
The MediaEval 2023 NewsImages benchmark [4] offers datasets and evaluation components
specifically designed to explore the relationship between news articles and their accompanying
images; participants are required to retrieve the correct images based on the given news
items’ titles and texts.
      </p>
      <p>Large-scale vision-language pre-trained models have shown remarkable zero-shot
performance on text-image retrieval tasks [5, 6]. Therefore, we employ the CLIP [5] (Contrastive
Language-Image Pre-training) model to perform news text-image retrieval on the given
datasets. Because OpenCLIP provides open-source code and pre-trained models at different
scales, we can apply it directly to the NewsImages task without fine-tuning. The OpenCLIP
model achieves good performance according to the evaluation metrics.</p>
      <p>Further, we investigate the capability of matching between concepts and images. The
motivation for this experiment is that nouns tend to have a more direct correlation with the
content visually represented in an image than other parts of speech do. Therefore, we extract
nouns and proper nouns from the news titles as concepts; subsequently, we employ these
extracted concepts to retrieve the corresponding news images. Experimental results indicate
that the text-image retrieval system performs better when concepts are embedded in natural
language structures, such as news titles.</p>
      <p>Finally, we manually inspect the training subsets to see how news titles, text snippets, and
entities correlate with their accompanying news images. We gain the impression that text-image
matching performs better when the text literally describes the news image.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Approach</title>
      <sec id="sec-2-1">
        <title>2.1. Extracting concepts from the news title</title>
        <p>To explore the capability of concept matching for the text-image matching system, we extract
concepts from the given news titles. This extraction process consists of three primary steps:
tokenization, achieved by breaking the news title down into individual tokens or words using
libraries like NLTK [7]; part-of-speech tagging, which assigns each token its specific part of
speech (e.g., noun, verb, adjective); and lastly, filtering, where we keep nouns and proper
nouns (NN, NNP) to create a text consisting only of elements that can be considered
concepts. We present some examples in Table 1.</p>
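        <p>The three steps can be sketched with NLTK as follows. This is a minimal illustration; the function names and the example usage are ours, not necessarily the exact pipeline used for the submitted runs:</p>

```python
import nltk

# Resources for tokenization and POS tagging (no-op if already present)
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def filter_concepts(tagged):
    """Keep only tokens tagged as nouns (NN) or proper nouns (NNP)."""
    return [tok for tok, tag in tagged if tag in ("NN", "NNP")]

def extract_concepts(title):
    """Tokenize, POS-tag, and filter a news title down to its concepts."""
    tokens = nltk.word_tokenize(title)        # step 1: tokenization
    tagged = nltk.pos_tag(tokens)             # step 2: part-of-speech tagging
    return " ".join(filter_concepts(tagged))  # step 3: filtering
```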
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Utilizing CLIP model for news text-image retrieval</title>
        <p>We employ the CLIP [5] model to extract features from both images and texts. Our choice of
pre-trained model is OpenCLIP [8], an open-source implementation of CLIP. Specifically, we
directly use the ViT-B-16 [8] model pre-trained on the LAION-400M dataset [9] without
fine-tuning. For the training and test datasets, we first pre-process and encode the news texts
and images separately using their respective encoders. Subsequently, we measure the similarity
between the text embedding and the embeddings of all images using cosine similarity. Finally,
we compile a top-100 list of the most relevant images based on their similarity scores.</p>
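        <p>Assuming the text and image features have already been extracted (e.g., by the OpenCLIP ViT-B-16 encoders), the ranking step can be sketched as follows; the function name is our own illustrative choice:</p>

```python
import numpy as np

def rank_images(text_emb, image_embs, k=100):
    """Rank images for one text query by cosine similarity.

    text_emb:   (d,) text embedding
    image_embs: (N, d) image embeddings
    Returns the indices of the top-k most similar images, best first.
    """
    # L2-normalize so that the dot product equals cosine similarity
    text_emb = text_emb / np.linalg.norm(text_emb)
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = image_embs @ text_emb          # (N,) cosine similarities
    k = min(k, len(sims))
    return np.argsort(-sims)[:k]          # indices sorted by descending similarity
```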
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Sampling examples from the training datasets</title>
        <p>We also conduct text-image retrieval on training subsets, and we manually inspect the
well-performing examples from these subsets. This experiment is additional to the runs that
we officially submitted to the task. The training subsets are sampled from the provided training
datasets (GDELT-P1, GDELT-P2, and RT) to match the sizes of their respective test
datasets. As a result, the GDELT-P1, GDELT-P2, and RT training subsets used for this experiment
contain 1500, 1500, and 3000 examples, respectively.</p>
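        <p>The subsampling itself is straightforward; a sketch with a fixed seed for reproducibility (the seed value and function name are our assumptions, not part of the original setup):</p>

```python
import random

def sample_subset(examples, size, seed=0):
    """Draw a training subset whose size matches the corresponding test set."""
    rng = random.Random(seed)           # fixed seed so the subset is reproducible
    return rng.sample(examples, size)

# e.g., 1500 examples for GDELT-P1 and GDELT-P2, 3000 for RT
```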
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results and Analysis</title>
      <sec id="sec-3-1">
        <title>3.1. Retrieval results on test datasets</title>
        <p>The results of the news text-image retrieval task across the three test datasets (see Sections 2.1
and 2.2) are presented in Table 2. Three text types are evaluated: “title only”, where the news
title was used for news image retrieval; “concepts only”, where concepts extracted from the
news title were utilized; and “entities/text snippet only”, where the entities or text snippet
provided in the dataset were used for retrieval. The
evaluation metrics are Mean Reciprocal Rank (MRR) and Recall@k (R@k) (k=5, 10, 50, 100).</p>
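        <p>For reference, both metrics can be computed from the 1-based rank of each query’s correct image; a minimal sketch (the function names are ours):</p>

```python
def mrr(ranks):
    """Mean Reciprocal Rank over the 1-based ranks of the correct images."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def recall_at_k(ranks, k):
    """Fraction of queries whose correct image appears within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)
```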
        <p>Across all datasets, the “title only” approach consistently outperformed the “concepts only”
approach on all evaluation metrics. Specifically, on the GDELT-P1 test dataset, using the
news title resulted in an MRR of 0.49178, with R@5 and R@10 values of 0.63467 and 0.71400,
respectively. In contrast, using only the extracted concepts yielded lower scores,
with an MRR of 0.36364, R@5 of 0.48933, and R@10 of 0.57733. Similar trends were
observed on the GDELT-P2 and RT test datasets. In short, the capability of the text-image retrieval
system goes beyond simple concept matching, specifically beyond the matching of nouns and proper
nouns.</p>
        <p>[Table 2: MRR and R@k for the text types “title only”, “concepts only”, and “entities/text snippet only” on the GDELT-P1, GDELT-P2, and RT test datasets.]</p>
        <p>[Table 3: MRR and R@k for the text types “title only”, “concepts only”, and “entities/text snippet only” on the GDELT-P1, GDELT-P2, and RT training subsets.]</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Retrieval results on training subsets</title>
        <p>The results of the news text-image retrieval task across the three training subsets (see Section 2.3)
are presented in Table 3. The “title only” text type achieves the best results, followed by “concepts
only” and finally “entities or text snippet only”, across the respective subsets. In other words, it
is easier to retrieve the accompanying news image using the news title than
relying solely on entities or text snippets. As illustrated by the examples from the three
training subsets in Table 4, the news title demonstrates high relevance to the accompanying
news image. Conversely, the contents of text snippets or entities tend to have a more contextual
and inferential connection to the visual content of the news image. We had the impression that
text-image matching is more effective with texts that describe the visual content of the
news image.</p>
        <p>Also, we manually inspect the well-performing and the poorly-performing examples from
the training subsets, i.e., examples whose matching ranks are 1 and beyond 100, respectively.
We perceived that the text-image matching system is more successful when the text explicitly
describes the visual elements present in the news image. Specifically, text-image matching
appears to perform better when the news image includes objects that correspond to words
mentioned in the text, or itself contains words that match words in the text.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion and Outlook</title>
      <p>In this paper, we propose utilizing the pre-trained CLIP model for news text-image retrieval.
We conduct a comprehensive analysis of training subsets and test datasets, comparing evaluation
results when utilizing the news title, concepts extracted from the news title, and entities/text
snippets, respectively.</p>
      <p>The experimental results show that the text-image matching system is capable of going
beyond mere concept matching, specifically beyond the matching of nouns and proper nouns. In
addition, the system is more effective at processing visually descriptive texts
that contain concepts in natural phrases, such as news titles. Nevertheless, the system still has
difficulties in understanding the relationship between inferable texts and the corresponding
news images. For example, the system faces challenges when matching texts that contain
metadata such as attributes, sources, or other information related to the content of the image.
A previous paper [10] introduced the concept of the "depiction gap", which refers to the gap
between the textual representation of an image and its accompanying text. In the future, it
is crucial to enhance the understanding and reasoning capabilities of text-image
matching systems.</p>
      <p>
[2] R. Yan, A. G. Hauptmann, A review of text and image retrieval approaches for broadcast news
video, Information Retrieval 10 (2007) 445–484.
[3] T. Yu, J. Liu, Z. Jin, Y. Yang, H. Fei, P. Li, Multi-scale multi-modal dictionary bert for effective
text-image retrieval in multimedia advertising, in: Proc. of the 31st ACM International Conference
on Information &amp; Knowledge Management, 2022, pp. 4655–4660.
[4] A. Lommatzsch, B. Kille, Ö. Özgöbek, M. Elahi, D.-T. Dang-Nguyen, News images in MediaEval
2023, in: Proc. of the MediaEval 2023 Workshop, 2024.
[5] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin,
J. Clark, et al., Learning transferable visual models from natural language supervision, in: Proc. of
the International Conference on Machine Learning, PMLR, 2021, pp. 8748–8763.
[6] C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, T. Duerig, Scaling
up visual and vision-language representation learning with noisy text supervision, in: Proc. of the
International Conference on Machine Learning, 2021.
[7] S. Bird, E. Klein, E. Loper, Natural Language Processing with Python: Analyzing Text with the
Natural Language Toolkit, O’Reilly Media, Inc., 2009.
[8] M. Cherti, R. Beaumont, R. Wightman, M. Wortsman, G. Ilharco, C. Gordon, C. Schuhmann,
L. Schmidt, J. Jitsev, Reproducible scaling laws for contrastive language-image learning, in: Proc.
of the IEEE Conference on Computer Vision and Pattern Recognition, 2023, pp. 2818–2829.
[9] C. Schuhmann, R. Vencu, R. Beaumont, R. Kaczmarczyk, C. Mullis, A. Katta, T. Coombes, J. Jitsev,
A. Komatsuzaki, LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs, in: Proc.
of the Neural Information Processing Systems, 2021.
[10] A. Lommatzsch, B. Kille, Ö. Özgöbek, Y. Zhou, J. Tešić, C. Bartolomeu, D. Semedo, L. Pivovarova,
M. Liang, M. Larson, NewsImages: Addressing the depiction gap with an online news dataset for
text-image re-matching, in: Proc. of the 13th ACM Multimedia Systems Conference, 2022, pp. 227–233.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Rong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Remote sensing cross-modal text-image retrieval based on global and local information</article-title>
          ,
          <source>IEEE Transactions on Geoscience and Remote Sensing</source>
          <volume>60</volume>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          . doi:10.1109/TGRS.2022.3163706.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>