HCMUS at MediaEval 2021: Fine-tuning CLIP for Automatic News-Images Re-Matching

Thien-Tri Cao1,2, Nhat-Khang Ngo1,2, Thanh-Danh Le1,2, Tuan-Luc Huynh1,2, Ngoc-Thien Nguyen1,2, Hai-Dang Nguyen1,2, Minh-Triet Tran1,2,3
1 University of Science, VNU-HCM
2 Vietnam National University, Ho Chi Minh City, Vietnam
3 John von Neumann Institute, VNU-HCM
{cttri,ltdanh,htluc,nnkhang,nnthien}19@apcs.fitus.edu.vn, nhdang@selab.hcmus.edu.vn, tmtriet@fit.hcmus.edu.vn

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'21, December 13-15 2021, Online

ABSTRACT
Matching text and images based on their semantics plays an essential role in cross-media retrieval. The NewsImages task at MediaEval 2021 explores the challenge of building accurate and high-performance algorithms for this problem. We propose different approaches that leverage the advantages of fine-tuning CLIP for the multi-class retrieval task. Our best-performing method reaches a recall@100 score of 0.77441.

1 INTRODUCTION
In the context of journalism, authors often use images to represent the main content of a particular article. A study in 2020 indicates that the textual content and the accompanying images might not be related [6]. Many previous studies in the multimedia and recommendation system domains mostly investigate image-text pairs with simple relationships, e.g., [3]. The MediaEval 2021 NewsImages Task calls for researchers to investigate the real-world relationship of news text and images in more depth, in order to understand its implications for journalism and news recommendation systems [1].
The HCMUS team participates in the Image-Text Re-Matching task. Given a set of image-text pairs in the wild, the task requires us to correctly re-assign images to their decoupled articles, with the aim of understanding how journalism chooses illustrative images.

2 RELATED WORK
Learning correspondences between images and texts is challenging because of their representation discrepancies. A majority of studies focus on connecting objects with corresponding semantic words in sentences. Lee et al. [4] proposed a stacked cross-attention mechanism to find correspondence scores between objects and words. As an improvement, Liu et al. [5] introduced a graph-structured network to capture both image-sentence level relations and object-word level correspondences. On the other hand, Wang et al. [8] combine early and late fusion strategies; this incorporation helps models learn both intra-modal and inter-modal information efficiently.

3 APPROACH
CLIP (Contrastive Language-Image Pre-training) [7], proposed by Radford et al., is a powerful pretrained model for text-image matching tasks. In our survey, CLIP is the best choice as the baseline for fine-tuning. The CLIP model has been trained on more than 400 million text-image pairs, and its training domain is broad, covering the domain of the NewsImages task. The inference results of the CLIP model (without training on the NewsImages dataset) on the dataset provided by the organizers outperform both the models we built ourselves and models that use CLIP as a backbone and are trained on the NewsImages dataset. In addition, the number of text-image pairs in the NewsImages dataset is relatively small and does not represent the specificity of the dataset. Therefore, we decided not to retrain the model on the NewsImages task dataset but to use it only as an evaluation set, tuning the number of input words and the preprocessing step based on the performance of the model on this set. Our fine-tuning thus determines how many words to feed into the model as well as which words should be kept or discarded. Our approach consists of five steps: (1) translation, (2) text preprocessing, (3) image and text vectorization, (4) feeding to CLIP, and (5) evaluation.

3.1 Translation
The language of the articles in the NewsImages dataset is German, but CLIP expects English, so we need to translate all articles into English. Google Translate is a useful API for this, as it is free and highly accurate.

3.2 Text Preprocessing
The conventional preprocessing includes dropping NA instances, converting categorical labels into numerical labels, converting all text into lowercase, etc. Additionally, we expand contractions such as "He's" and "She's" and remove stop words such as "an", "a", and "the". We believe this extra preprocessing helps extract even more useful information for our embedding features. Finally, the Ekphrasis library [2] helps us segment words that are intentionally or unintentionally run together and correct misspellings and typos for cleaner text. After the preprocessing step, we determined the number of words to be fed into the model, as we realized it greatly affects the model's performance: we gradually adjust the number of words fed into the model and observe the change in performance. Experimental results on its influence are described in more detail in the Experiments and Experimental Results sections.
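The extra preprocessing described above (lowercasing, expanding contractions, removing stop words) and the word-count truncation varied across our runs can be sketched roughly as below. The contraction map and stop-word list here are small illustrative placeholders, not the exact resources we used; our runs additionally relied on the Ekphrasis library for word segmentation and spell correction.

```python
import re

# Illustrative (not exhaustive) contraction map and stop-word list.
CONTRACTIONS = {"he's": "he is", "she's": "she is", "it's": "it is",
                "don't": "do not", "can't": "cannot"}
STOP_WORDS = {"a", "an", "the"}

def preprocess(text: str) -> str:
    """Lowercase, expand contractions, and drop stop words."""
    text = text.lower()
    # Naive substring replacement; adequate for this sketch.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Keep only alphanumeric tokens that are not stop words.
    tokens = re.findall(r"[a-z0-9']+", text)
    return " ".join(t for t in tokens if t not in STOP_WORDS)

def truncate(text: str, n_words: int) -> str:
    """Keep only the first n_words words of the (preprocessed) article."""
    return " ".join(text.split()[:n_words])
```

For example, `preprocess("He's reading the news")` yields `"he is reading news"`, and `truncate(...)` with n_words of 10, 20, 30, or 40 produces the inputs used in runs 1 to 4.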
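Steps (3)-(5) of the pipeline — vectorizing texts and images, scoring them against each other, and evaluating recall@k — can be sketched as follows. The embeddings here stand in for the outputs of CLIP's text and image encoders, and the helper names are hypothetical; this is a minimal sketch of the evaluation, not our exact implementation.

```python
import numpy as np

def match_scores(text_emb: np.ndarray, image_emb: np.ndarray) -> np.ndarray:
    """Cosine-similarity matrix between rows of the two embedding
    matrices, as CLIP computes between its text and image encodings."""
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    return text_emb @ image_emb.T

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """Fraction of articles whose true image (assumed to share the
    article's index) appears in the top-k images by similarity."""
    # Indices of the k highest-scoring images for every article.
    topk = np.argsort(-sim, axis=1)[:, :k]
    hits = sum(i in topk[i] for i in range(sim.shape[0]))
    return hits / sim.shape[0]
```

Raising k (or feeding in more words per article) trades recall@1 against recall@100 in exactly the way we observed across runs.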
Table 1: Submission results

Method  MRR@100   MeanRecall@5  MeanRecall@10  MeanRecall@50  MeanRecall@100
Run01   0.23576   0.30601       0.37285        0.54674        0.61984
Run02   0.25521   0.34987       0.42611        0.60522        0.67258
Run03   0.27323   0.36971       0.44804        0.63708        0.71018
Run04   0.27446   0.368677      0.44752        0.64961        0.71906
Run05   0.29434   0.38172       0.48460        0.68825        0.77441

4 EXPERIMENTS AND EXPERIMENTAL RESULTS
4.1 Experiments
We submitted five runs for this task. They are all generated from the CLIP model but differ in the number of words of the article included in the model, resulting in different results. From run 1 to run 4, the number of words of each article fed into CLIP is 10, 20, 30, and 40 words, respectively. We tuned the word count of each article because, during our experiments on the NewsImages set, we noticed that as we gradually increased the number of words fed into CLIP, the recall@1 gradually decreased while the recall@100 increased. That is, the influence of the word count of an article on the model's performance is significant. The last run is an average ensemble submission that combines the results of the run 1 to run 4 methods; in this method, all runs have the same weight of 0.25.

4.2 Experimental Results
Table 1 shows the results of our experiments from run 1 to run 5 using the following metrics: MRR@100, MeanRecall@5, MeanRecall@10, MeanRecall@50, and MeanRecall@100.
Through the experiments, we found that the MeanRecall@5 scores are relatively low, but given the large number of pairs and the very high difficulty of the problem, these results are acceptable. The results show that as the number of words fed into the CLIP model grows, the model's performance on all metrics gradually increases from run 01 to run 04, which agrees with the hypothesis that we set out from the beginning. Overall, Run 05 gives the best performance, obtaining the best result on all metrics; this is understandable, since an ensemble strategy generally increases the performance of models on a benchmark dataset. As mentioned, our evaluation on NewsImages (open set) shows that MeanRecall@1 is higher when the number of words of the articles fed into the model is small, but on the test dataset (secret set), because we do not obtain the MeanRecall@1 value for each run, we cannot conclude whether this hypothesis holds.

5 CONCLUSION AND FUTURE WORKS
NewsImages is a difficult task, as it requires exactly matching images with texts for nearly 2,000 pairs, but we have obtained relatively satisfactory results, with 0.77441 on the MeanRecall@100 metric. This demonstrates the efficiency of the model structure, as well as the benefits that the pretrained model brings when the dataset used for training in the NewsImages task is not too large. In the future, we wish to investigate more methods and delve deeper into this topic, as it is a promising field that still has many open problems.

ACKNOWLEDGMENTS
This work was funded by Gia Lam Urban Development and Investment Company Limited, Vingroup, and supported by Vingroup Innovation Foundation (VINIF) under project code VINIF.2019.DA19.

REFERENCES
[1] Andreas Lommatzsch, Benjamin Kille, Özlem Özgöbek, Duc-Tien Dang-Nguyen, and Mehdi Elahi. 2021. News Images in MediaEval 2021. In Proc. of the MediaEval 2021 Workshop. Online. https://multimediaeval.github.io/editions/2021/tasks/newsimages/
[2] Christos Baziotis, Nikos Pelekis, and Christos Doulkeridis. 2017. DataStories at SemEval-2017 Task 4: Deep LSTM with Attention for Message-level and Topic-based Sentiment Analysis. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). Association for Computational Linguistics, Vancouver, Canada, 747-754.
[3] MD. Zakir Hossain, Ferdous Sohel, Mohd Fairuz Shiratuddin, and Hamid Laga. 2019. A Comprehensive Survey of Deep Learning for Image Captioning. ACM Comput. Surv. 51, 6, Article 118 (Feb 2019), 36 pages. https://doi.org/10.1145/3295748
[4] Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. 2018. Stacked Cross Attention for Image-Text Matching. In Proceedings of the European Conference on Computer Vision (ECCV). 201-216.
[5] Chunxiao Liu, Zhendong Mao, Tianzhu Zhang, Hongtao Xie, Bin Wang, and Yongdong Zhang. 2020. Graph Structured Network for Image-Text Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10921-10930.
[6] Nelleke Oostdijk, Hans van Halteren, Erkan Başar, and Martha Larson. 2020. The Connection between the Text and Images of News Articles: New Insights for Multimedia Analysis. In Proceedings of the 12th Language Resources and Evaluation Conference. European Language Resources Association, Marseille, France, 4343-4351. https://aclanthology.org/2020.lrec-1.535
[7] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. CoRR abs/2103.00020 (2021). arXiv:2103.00020 https://arxiv.org/abs/2103.00020
[8] Yifan Wang, Xing Xu, Wei Yu, Ruicong Xu, Zuo Cao, and Heng Tao Shen. 2021. Combine Early and Late Fusion Together: A Hybrid Fusion Framework for Image-Text Matching. In 2021 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 1-6.