Cross-modal Networks, Fine-Tuning, Data Augmentation and Dual Softmax Operation for MediaEval NewsImages 2023

Antonios Leventakis1,*, Damianos Galanopoulos1,* and Vasileios Mezaris1
1 Information Technologies Institute / Centre for Research and Technology Hellas, Thessaloniki, Greece
MediaEval'23, 1-2 February 2024, Amsterdam, The Netherlands and Online
* Corresponding authors: aleventakis@iti.gr (A. Leventakis), dgalanop@iti.gr (D. Galanopoulos), bmezaris@iti.gr (V. Mezaris)

Abstract
Matching images to articles is challenging and can be considered a special case of the cross-media retrieval problem. This notebook paper presents our solution for the MediaEval NewsImages 2023 benchmarking task. We investigate the performance of pre-trained cross-modal networks. Specifically, we examine two pre-trained CLIP model variations and fine-tune one of them for domain adaptation. Additionally, we utilize a data augmentation technique and a method for revising the similarities produced by either of the networks, i.e., a dual softmax operation, to improve our solution's performance. We report the official results for our submitted runs, as well as additional experiments we conducted to evaluate our runs internally. We conclude that fine-tuning benefits performance, and that it is important to consider the nature of the data when selecting the appropriate pre-trained CLIP model.

1. Introduction
In this paper, we deal with the text-to-image retrieval task adapted to the needs of the MediaEval NewsImages 2023 task [1]. Nowadays, news sites publish multimedia content in their online news articles to better convey the message of the textual article to readers. Associating news articles with multimedia content is therefore crucial for several research tasks, such as cross-modal retrieval and disinformation detection.
Our participation [2] in the NewsImages 2022 task showed that cross-modal networks trained on large sets of data, such as CLIP [3], perform best. Building on that outcome, this year's approach to retrieving images from textual articles is based on pre-trained versions of CLIP [3]. To further adapt them to this specific task, we fine-tune them with additional news-article datasets to improve performance. Moreover, similarly to our previous works [2, 4], we adopt a dual-softmax operation (DS) to recalculate the initially computed title-image similarities, an approach that in some cases leads to improved performance. Lastly, we apply a data augmentation technique to the textual part of the data, which increases the amount of available training data and improves the models' robustness through the diversity that the augmentation introduces.

2. Related Work
Text-image association is a challenging task that has gained a lot of interest in recent years. The task has been extensively examined in the multimedia research community (e.g., see [5, 6]), and there is consensus that the evolution of deep learning methods has boosted performance. Indicative relevant methods include VinVL [7], where an object detector is pre-trained to encode images and the visual objects within them, and a cross-modal model is trained to associate visual and textual features.
Regarding the NewsImages 2021 participations, HCMUS [8] proposed a solution based on the pre-trained CLIP model [3] along with sophisticated text preprocessing, which achieved the best performance. In NewsImages 2022, the best-performing approach [2] explored CLIP's capabilities alongside a trainable cross-modal network, and concluded that using CLIP was, by a small margin, better than training a custom cross-modal network. Therefore, utilizing the power of CLIP models seems to be the most suitable approach for the task.

3. Approach

3.1. Data, pre-processing and augmentation
To adapt the CLIP model to the specific needs of the task, we explore its fine-tuning capabilities. We preprocess the textual data of the training, evaluation and official test sets in order to fully exploit our approach's potential. For training, we gathered around 4.8 million image-title pairs from the news domain to fine-tune the pre-trained CLIP model. Specifically, we utilize the NYTimes800k [9], N24News [10] and BreakingNews [11] datasets, along with data publicly available on kaggle.com from news websites including Al Jazeera1, CNN2, BBC3, HuffPost News4 and Bloomberg5. To internally evaluate our approach, we merge last year's NewsImages training data [12] and use them to investigate its performance.

1 https://data.world/opensnippets/al-jazeera-news-dataset
2 https://data.world/opensnippets/cnn-news-dataset
3 https://data.world/opensnippets/bbc-uk-news-dataset
4 https://data.world/crawlfeeds/huffspot-news-dataset
5 https://data.world/crawlfeeds/bloomberg-quint-news-dataset

For each of these datasets, we utilize a data augmentation technique to double the amount of available data. Specifically, we exploit the paraphrasing ability of the Text-to-Text Transformer [13] to create diverse but semantically similar text titles for every image. This approach not only provides more training data but also lets us compute the image-title similarities of the evaluation and test datasets from both the original and the generated text title of each image. A mean pooling operation over the resulting values then produces our final predictions.

3.2. Pre-trained models
As pre-trained cross-modal networks, we utilize two different implementations of the CLIP [3] model in order to examine their performance. More specifically, we utilize "ViT-L/14@336px", the largest version of the CLIP model currently made publicly available by OpenAI, and, as a second variation, the "ViT-H/14" model of openCLIP [14], the open-source implementation of CLIP. We use these models to calculate text and image feature representations. For a given article, in order to retrieve the most relevant images from the test set, we calculate the cosine similarity between the CLIP embedding of the article's title and the embeddings of all test images, and the top-100 most relevant images are selected in a ranked list, from the most relevant to the least relevant image.

3.3. Fine-tuned model
We also examined fine-tuning the "ViT-L/14@336px" CLIP model using the aforementioned training datasets to improve its performance. We keep the image encoder of the model frozen and train only the text encoder's parameters for one epoch with a batch size of 480, performing gradient accumulation to handle GPU memory limitations. The Adam optimizer is employed, with the learning rate set to 3e-7.
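To make the above setup more concrete, the following Python sketch outlines how such a fine-tuning loop could be implemented with the publicly available OpenAI CLIP package and PyTorch. Only the frozen image encoder, the single training epoch, the effective batch size of 480 obtained via gradient accumulation, the Adam optimizer and the 3e-7 learning rate come from our description above; the contrastive loss, the micro-batch split and the `train_loader` object are illustrative assumptions rather than the exact implementation we used.

```python
# Illustrative sketch of the fine-tuning setup of Section 3.3 (assumptions noted in the text above).
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP package: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14@336px", device=device)
model = model.float()  # train in fp32 for simplicity; mixed-precision handling is omitted
# `preprocess` would be applied to the images inside the (hypothetical) training Dataset.

# Keep the image encoder frozen; only the text-side parameters are updated.
for p in model.visual.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=3e-7)

ACCUM_STEPS = 12  # e.g., 12 micro-batches of 40 pairs -> effective batch size of 480
                  # (the exact micro-batch split is an assumption)

def contrastive_loss(image_feats, text_feats, logit_scale):
    """Symmetric image-text InfoNCE loss, as used to train CLIP (assumed here)."""
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = logit_scale * image_feats @ text_feats.t()
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# `train_loader` is a hypothetical DataLoader yielding (preprocessed image batch, list of titles).
model.train()
optimizer.zero_grad()
for step, (images, titles) in enumerate(train_loader):  # a single pass corresponds to one epoch
    images = images.to(device)
    tokens = clip.tokenize(titles, truncate=True).to(device)
    with torch.no_grad():                      # image encoder is frozen
        image_feats = model.encode_image(images)
    text_feats = model.encode_text(tokens)
    loss = contrastive_loss(image_feats, text_feats, model.logit_scale.exp())
    (loss / ACCUM_STEPS).backward()            # gradient accumulation
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()
        optimizer.zero_grad()
```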
3.4. Dual-softmax similarity revision
At the retrieval stage, we calculate the similarities between all images from the test set and all testing articles, resulting in a similarity matrix Z ∈ ℝ^{C×D}, where C is the number of testing article queries and D the number of test images. Following [2, 4], to revise the calculated similarities, we apply two softmax operations, one along each dimension of Z, as follows:
Z* = Softmax(Z, dim = 0) ⊙ Softmax(Z, dim = 1),
where ⊙ denotes the element-wise product.

3.5. Inference-stage scores aggregation
As mentioned before, we also augment the textual part of the test data, resulting in two article-image pairs for each original pair contained in the dataset. So, in all our runs (i.e., regardless of whether we use a pre-trained CLIP model or a fine-tuned one), we end up with two article-image similarity scores. To aggregate these scores, we experimented with different aggregation methods (not presented here for brevity), and we chose to perform mean pooling to obtain our final prediction.
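For clarity, the sketch below illustrates how the similarity computation, the dual-softmax revision and the mean-pooling aggregation described above could be combined at inference time. The function and variable names are illustrative, and applying the dual-softmax revision before the mean pooling (rather than after it) is an assumption of this sketch.

```python
# Illustrative sketch of the dual-softmax revision (Section 3.4) and the mean-pooling
# aggregation over original/paraphrased titles (Section 3.5); names and the ordering of
# the two steps are assumptions, see the text above.
import torch
import torch.nn.functional as F

def dual_softmax(Z: torch.Tensor) -> torch.Tensor:
    """Revise a C x D article-image similarity matrix with a softmax along each dimension."""
    return F.softmax(Z, dim=0) * F.softmax(Z, dim=1)  # element-wise product

def rank_images(title_emb, paraphrase_emb, image_emb, use_ds=True, top_k=100):
    """title_emb / paraphrase_emb: C x d CLIP embeddings of the original / augmented titles;
    image_emb: D x d CLIP image embeddings. Returns the top-k image indices per article."""
    t = F.normalize(title_emb, dim=-1)
    p = F.normalize(paraphrase_emb, dim=-1)
    v = F.normalize(image_emb, dim=-1)
    sims = torch.stack([t @ v.t(), p @ v.t()])              # 2 x C x D cosine similarities
    if use_ds:
        sims = torch.stack([dual_softmax(s) for s in sims])  # revise each similarity matrix
    scores = sims.mean(dim=0)                                # mean pooling of the two score sets
    return scores.topk(min(top_k, scores.size(1)), dim=1).indices  # ranked list per article
```

For example, rank_images(title_emb, paraphrase_emb, image_emb)[i] would return the ranked indices of the 100 most relevant test images for the i-th article.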
4. Submitted Runs and Results
We submitted five runs for each testing dataset (GDELT-P1, GDELT-P2, RT), as detailed below:
• Run #1 (ViT-H/14_ds): This uses the text and image embeddings of the "ViT-H/14" pre-trained openCLIP model and calculates the cosine similarity between the embedding of an article and all images. Then, the dual-softmax revision method is used to recalculate the similarities. Finally, for each article, the 100 most relevant images are selected.
• Run #2 (ViT-L/14@336px): This uses the text and image embeddings of the "ViT-L/14@336px" pre-trained CLIP model and calculates the cosine similarity between the embedding of an article and all images. Then, for each article, the 100 most relevant images are selected.
• Run #3 (ViT-L/14@336px_ds): Same as Run #2, but additionally applying the dual-softmax revision to the computed similarities.
• Run #4 (ViT-L/14@336px_ft): We fine-tune the "ViT-L/14@336px" pre-trained model using the original and the augmented data from the collected datasets.
• Run #5 (ViT-L/14@336px_ft_ds): Same as Run #4, but additionally applying the dual-softmax revision to the computed similarities.

We present the official results on the three testing datasets, as well as results from the internal experiments we conducted in order to evaluate our methods and select our final runs. Recall@K (K = 5, 10, 50, 100) and Mean Reciprocal Rank (MRR) are used as evaluation metrics.

Table 1: Evaluation results for the five submitted runs.

A. Official evaluation results on the three testing datasets.

Test dataset  Run     R@5      R@10     R@50     R@100    MRR
GDELT-P1      Run #1  0.76733  0.84000  0.93533  0.96000  0.62368
              Run #2  0.77800  0.85133  0.94267  0.96867  0.62431
              Run #3  0.76933  0.84467  0.93933  0.97067  0.62380
              Run #4  0.77933  0.84867  0.94533  0.97067  0.62972
              Run #5  0.76933  0.84400  0.93733  0.96867  0.62716
GDELT-P2      Run #1  0.69067  0.77600  0.90133  0.93200  0.56156
              Run #2  0.64133  0.73533  0.86933  0.92267  0.52082
              Run #3  0.63867  0.72667  0.87067  0.91533  0.51986
              Run #4  0.64400  0.73267  0.87800  0.92867  0.52615
              Run #5  0.64267  0.73200  0.87333  0.91933  0.52025
RT            Run #1  0.34400  0.43800  0.63333  0.71300  0.26153
              Run #2  0.33467  0.41100  0.60033  0.68633  0.24712
              Run #3  0.34733  0.43267  0.63000  0.71300  0.26048
              Run #4  0.33967  0.41700  0.60900  0.69300  0.25292
              Run #5  0.35400  0.43633  0.63300  0.71933  0.26162

B. Results on our internal evaluation dataset.

Test dataset                   Run     R@5      R@10     R@50     R@100    MRR
NewsImages 2022 training data  Run #1  0.43720  0.51466  0.69190  0.75926  0.343
                               Run #2  0.45129  0.53137  0.71286  0.77548  0.354
                               Run #3  0.45503  0.53711  0.71261  0.77959  0.356
                               Run #4  0.44917  0.53561  0.71373  0.78047  0.356
                               Run #5  0.45603  0.54010  0.71673  0.78358  0.357

Table 1 (A) presents the results on the three testing datasets, evaluated officially by the task organizers. Run #1 (ViT-H/14 + DS) performs the best on the GDELT-P2 dataset on all metrics. Run #4 (ViT-L/14@336px_ft) and Run #5 (ViT-L/14@336px_ft_ds) perform the best in MRR terms on GDELT-P1 and RT, respectively, while in Recall@K terms the results are mixed. The dual-softmax operation is beneficial on the RT dataset but not on GDELT-P1 and GDELT-P2, while the CLIP fine-tuning (comparison between Run #2 and Run #4) is beneficial on all datasets in the majority of the metrics but achieves the best results only on GDELT-P1.
The above official results contrast with the findings of our internal experiments, which were conducted prior to the release of the official results. Table 1 (B) presents our internal results on the dataset we used for selecting our best models and examining our runs' performance. From these preliminary experiments, we concluded that Run #5 consistently outperforms the rest of the runs, i.e., the use of the "ViT-L/14@336px" model, our fine-tuning and the dual-softmax revision seemed to be beneficial for performance. The contrast between our findings and the official results on the GDELT-P2 dataset is probably explained by the significant proportion (80%) of generated images in that dataset. Our results suggest that the "ViT-H/14" model is more capable of handling such synthetic data than "ViT-L/14@336px", but the reasons for this need to be investigated further.

5. Conclusion
In this work we proposed a solution for the MediaEval NewsImages task using state-of-the-art text and image representations calculated from a pre-trained cross-modal network, a fine-tuned cross-modal network and a similarity revision approach. We concluded from the official evaluation results that for generated images the "ViT-H/14" model is more suitable for the task, while the "ViT-L/14@336px" model performs better for real images. Also, fine-tuning pre-trained models for domain adaptation seems beneficial in most cases, while employing a different CLIP version can significantly affect the final performance.

Acknowledgements
This work was supported by the EU's Horizon Europe and Horizon 2020 research and innovation programmes under grant agreements 101070190 AI4Trust and 101021866 CRiTERIA, respectively.

References
[1] A. Lommatzsch, B. Kille, Ö. Özgöbek, M. Elahi, D.-T. Dang-Nguyen, News Images in MediaEval 2023, in: Proceedings of the MediaEval Benchmarking Initiative 2023, CEUR Workshop Proceedings, 2024. URL: http://ceur-ws.org/.
[2] D. Galanopoulos, V. Mezaris, Cross-modal Networks and Dual Softmax Operation for MediaEval NewsImages 2022, in: Working Notes Proceedings of the MediaEval 2022 Workshop, volume 3583 of CEUR Workshop Proceedings, CEUR-WS.org, 2023.
[3] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, et al., Learning Transferable Visual Models From Natural Language Supervision, in: Proc. of the 38th Int. Conf. on Machine Learning (ICML), 2021.
[4] D. Galanopoulos, V. Mezaris, Are all combinations equal? Combining textual and visual features with multiple space learning for text-based video retrieval, in: European Conference on Computer Vision Workshops (ECCVW), Springer, 2022.
[5] N. Borah, U. Baruah, Image retrieval using neural networks for word image spotting—a review, in: H. K. Deva Sarma, V. Piuri, A. K. Pujari (Eds.), Machine Learning in Information and Communication Technology, Springer Nature Singapore, Singapore, 2023, pp. 243–268.
[6] K. Ueki, Survey of Visual-Semantic Embedding Methods for Zero-Shot Image Retrieval, in: 2021 20th IEEE International Conference on Machine Learning and Applications (ICMLA), IEEE, 2021, pp. 628–634.
[7] P. Zhang, X. Li, X. Hu, J. Yang, L. Zhang, L. Wang, Y. Choi, J. Gao, VinVL: Revisiting visual representations in vision-language models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 5579–5588.
[8] T. Cao, N. Ngô, T. D. Le, T. Huynh, N. T. Nguyen, H. Nguyen, M. Tran, HCMUS at MediaEval 2021: Fine-tuning CLIP for Automatic News-Images Re-Matching, in: Working Notes Proceedings of the MediaEval 2021 Workshop, Online, 13-15 December 2021, volume 3181 of CEUR Workshop Proceedings, CEUR-WS.org, 2021.
[9] A. Tran, A. Mathews, L. Xie, Transform and tell: Entity-aware news image captioning, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[10] W. Zhen, S. Xu, Z. Xiangxie, Y. Jie, N24News: A New Dataset for Multimodal News Classification, in: Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), 2022, pp. 6768–6775.
[11] R. Arnau, Y. Fei, M.-N. Francesc, M. Krystian, BreakingNews: Article Annotation by Image and Text Processing, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, pp. 1072–1085.
[12] A. Lommatzsch, B. Kille, Ö. Özgöbek, M. Elahi, D.-T. Dang-Nguyen, News Images in MediaEval 2022, in: Working Notes Proceedings of the MediaEval 2022 Workshop, volume 3583 of CEUR Workshop Proceedings, CEUR-WS.org, 2023.
[13] R. Colin, S. Noam, R. Adam, L. Katherine, N. Sharan, M. Michael, Z. Yanqi, W. Li, P. J. Liu, Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, Journal of Machine Learning Research, 2020, pp. 1–67.
[14] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever, Learning Transferable Visual Models From Natural Language Supervision, in: ICML, 2021.