HCMUS at MediaEval 2021: Fine-tuning CLIP for Automatic News-Images Re-Matching

Thien-Tri Cao1,2, Nhat-Khang Ngo1,2, Thanh-Danh Le1,2, Tuan-Luc Huynh1,2, Ngoc-Thien Nguyen1,2, Hai-Dang Nguyen1,2, Minh-Triet Tran1,2,3
1 University of Science, VNU-HCM
2 Vietnam National University, Ho Chi Minh City, Vietnam
3 John von Neumann Institute, VNU-HCM
{cttri,ltdanh,htluc,nnkhang,nnthien}19@apcs.fitus.edu.vn, nhdang@selab.hcmus.edu.vn, tmtriet@fit.hcmus.edu.vn

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'21, December 13-15 2021, Online

ABSTRACT
Matching text and images based on their semantics plays an essential role in cross-media retrieval. The NewsImages task at MediaEval 2021 explores the challenge of building accurate and high-performance algorithms for this problem. We propose different approaches that leverage the advantages of fine-tuning CLIP for the multi-class retrieval task. Our best-performing method reaches a recall@100 score of 0.77441.

1 INTRODUCTION
In the context of journalism, authors often use images to represent the main content of a particular article. A study in 2020 indicates that the textual content and the accompanying images might not be related [6]. Many previous studies in the multimedia and recommendation system domains mostly investigate image-text pairs with simple relationships, e.g., [3]. The MediaEval 2021 NewsImages Task calls for researchers to investigate the real-world relationship of news text and images in more depth, in order to understand its implications for journalism and news recommendation systems [1].
The HCMUS team participates in the Image-Text Re-Matching task. Given a set of image-text pairs in the wild, the task requires us to correctly re-assign images to their decoupled articles, with the aim of understanding how journalism chooses illustrative images.

2 RELATED WORK
Learning correspondences between images and texts is challenging because of their representation discrepancies. A majority of studies focus on connecting objects with corresponding semantic words in sentences. Lee et al. [4] proposed a stacked cross-attention mechanism to find correspondence scores between objects and words. As an improvement, Liu et al. [5] introduced a graph-structured network to capture both image-sentence level relations and object-word level correspondences. On the other hand, Wang et al. [8] combine early and late fusion strategies; this incorporation helps models learn both intra-modal and inter-modal information efficiently.

3 APPROACH
CLIP (Contrastive Language-Image Pre-training) [7], proposed by Radford et al., is a powerful pretrained model for text-image matching tasks. In our survey, CLIP is the best choice as the baseline for fine-tuning. The CLIP model has been trained on more than 400 million text-image pairs, and its training domain is broad, covering the domain of the NewsImages task. The inference results of the CLIP model (without training on the NewsImages dataset) on the dataset provided by the organizers outperform both the models we built ourselves and models that use CLIP as a backbone and are trained on the NewsImages dataset. In addition, the number of text-image pairs in the NewsImages dataset is relatively small and does not represent the specificity of the dataset. Therefore, we decided not to retrain the model on the NewsImages task dataset but to use it only as an evaluation set, tuning the number of input words and the preprocessing step based on the performance of the model on this set. Our fine-tuning thus determines how many words to feed into the model as well as which words should be kept or discarded. Our approach consists of five steps: (1) translation, (2) text preprocessing, (3) image and text vectorization, (4) feeding to CLIP, and (5) evaluation.

3.1 Translation
The language of the articles in the NewsImages dataset is German, but CLIP expects English, so we need to translate all articles into English. Google Translate is a useful API for this, as it is free and highly accurate.

3.2 Text Preprocessing
The conventional preprocessing includes dropping NA instances, converting categorical labels into numerical labels, converting all text into lowercase, etc. Additionally, we expand contractions such as "He's" and "She's" and remove stop words such as "an", "a", and "the". We believe this extra preprocessing helps extract even more useful information for our embedding features. Finally, the Ekphrasis library [2] helps us segment words that are intentionally or unintentionally run together and correct misspellings and typos for cleaner text. After the preprocessing step, we determined the number of words to be fed into the model, as we realized it greatly affects the model's performance: we gradually adjust the number of words fed into the model and observe the change in performance. Experimental results on its influence are described in more detail in the Experiments and Experimental Results sections.
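The extra preprocessing described above (lowercasing, expanding contractions, removing stop words) and the word-count truncation varied across our runs can be sketched roughly as below. The contraction map and stop-word list here are small illustrative placeholders, not the exact resources we used; our runs additionally relied on the Ekphrasis library for word segmentation and spell correction.

```python
import re

# Illustrative (not exhaustive) contraction map and stop-word list.
CONTRACTIONS = {"he's": "he is", "she's": "she is", "it's": "it is",
                "don't": "do not", "can't": "cannot"}
STOP_WORDS = {"a", "an", "the"}

def preprocess(text: str) -> str:
    """Lowercase, expand contractions, and drop stop words."""
    text = text.lower()
    # Naive substring replacement; adequate for this sketch.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Keep only alphanumeric tokens that are not stop words.
    tokens = re.findall(r"[a-z0-9']+", text)
    return " ".join(t for t in tokens if t not in STOP_WORDS)

def truncate(text: str, n_words: int) -> str:
    """Keep only the first n_words words of the (preprocessed) article."""
    return " ".join(text.split()[:n_words])
```

For example, `preprocess("He's reading the news")` yields `"he is reading news"`, and `truncate(...)` with n_words of 10, 20, 30, or 40 produces the inputs used in runs 1 to 4.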
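Steps (3)-(5) of the pipeline — vectorizing texts and images, scoring them against each other, and evaluating recall@k — can be sketched as follows. The embeddings here stand in for the outputs of CLIP's text and image encoders, and the helper names are hypothetical; this is a minimal sketch of the evaluation, not our exact implementation.

```python
import numpy as np

def match_scores(text_emb: np.ndarray, image_emb: np.ndarray) -> np.ndarray:
    """Cosine-similarity matrix between rows of the two embedding
    matrices, as CLIP computes between its text and image encodings."""
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    return text_emb @ image_emb.T

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """Fraction of articles whose true image (assumed to share the
    article's index) appears in the top-k images by similarity."""
    # Indices of the k highest-scoring images for every article.
    topk = np.argsort(-sim, axis=1)[:, :k]
    hits = sum(i in topk[i] for i in range(sim.shape[0]))
    return hits / sim.shape[0]
```

Raising k (or feeding in more words per article) trades recall@1 against recall@100 in exactly the way we observed across runs.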
Table 1: Submission results

Method  MRR@100   MeanRecall@5  MeanRecall@10  MeanRecall@50  MeanRecall@100
Run01   0.23576   0.30601       0.37285        0.54674        0.61984
Run02   0.25521   0.34987       0.42611        0.60522        0.67258
Run03   0.27323   0.36971       0.44804        0.63708        0.71018
Run04   0.27446   0.368677      0.44752        0.64961        0.71906
Run05   0.29434   0.38172       0.48460        0.68825        0.77441

4 EXPERIMENTS AND EXPERIMENTAL RESULTS
4.1 Experiments
We submitted five runs for this task. They are all generated from the CLIP model but differ in the number of words of the article included in the model, resulting in different results. From run 1 to run 4, the number of words of each article fed into CLIP is 10, 20, 30, and 40 words, respectively. We tuned the word count of each article because, during our experiments on the NewsImages set, we noticed that as we gradually increased the number of words fed into CLIP, the recall@1 gradually decreased while the recall@100 increased. That is, the influence of the word count of an article on the model's performance is significant. The last run is an average ensemble submission that combines the results of the run 1 to run 4 methods; in this method, all runs have the same weight of 0.25.

4.2 Experimental Results
Table 1 shows the results of our experiments from run 1 to run 5 using the following metrics: MRR@100, MeanRecall@5, MeanRecall@10, MeanRecall@50, and MeanRecall@100.
Through the experiments, we found that the MeanRecall@5 scores are relatively low, but given the large number of pairs and the very high difficulty of the problem, these results are acceptable. The results show that as the number of words fed into the CLIP model grows, the model's performance on all metrics gradually increases from run 01 to run 04, which agrees with the hypothesis that we set out from the beginning. Overall, Run 05 gives the best performance, obtaining the best result on all metrics; this is understandable, since an ensemble strategy generally increases the performance of models on a benchmark dataset. As mentioned, our evaluation on NewsImages (open set) shows that MeanRecall@1 is higher when the number of words of the articles fed into the model is small, but on the test dataset (secret set), because we do not obtain the MeanRecall@1 value for each run, we cannot conclude whether this hypothesis holds.

5 CONCLUSION AND FUTURE WORKS
NewsImages is a difficult task, as it requires exactly matching images with texts for nearly 2,000 pairs, but we have obtained relatively satisfactory results, with 0.77441 on the MeanRecall@100 metric. This demonstrates the efficiency of the model structure, as well as the benefits that the pretrained model brings when the dataset used for training in the NewsImages task is not too large. In the future, we wish to investigate more methods and delve deeper into this topic, as it is a promising field that still has many open problems.

ACKNOWLEDGMENTS
This work was funded by Gia Lam Urban Development and Investment Company Limited, Vingroup, and supported by Vingroup Innovation Foundation (VINIF) under project code VINIF.2019.DA19.

REFERENCES
[1] Andreas Lommatzsch, Benjamin Kille, Özlem Özgöbek, Duc-Tien Dang-Nguyen, and Mehdi Elahi. 2021. News Images in MediaEval 2021. In Proc. of the MediaEval 2021 Workshop. Online. https://multimediaeval.github.io/editions/2021/tasks/newsimages/
[2] Christos Baziotis, Nikos Pelekis, and Christos Doulkeridis. 2017. DataStories at SemEval-2017 Task 4: Deep LSTM with Attention for Message-level and Topic-based Sentiment Analysis. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). Association for Computational Linguistics, Vancouver, Canada, 747-754.
[3] MD. Zakir Hossain, Ferdous Sohel, Mohd Fairuz Shiratuddin, and Hamid Laga. 2019. A Comprehensive Survey of Deep Learning for Image Captioning. ACM Comput. Surv. 51, 6, Article 118 (Feb 2019), 36 pages. https://doi.org/10.1145/3295748
[4] Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. 2018. Stacked Cross Attention for Image-Text Matching. In Proceedings of the European Conference on Computer Vision (ECCV). 201-216.
[5] Chunxiao Liu, Zhendong Mao, Tianzhu Zhang, Hongtao Xie, Bin Wang, and Yongdong Zhang. 2020. Graph Structured Network for Image-Text Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10921-10930.
[6] Nelleke Oostdijk, Hans van Halteren, Erkan Başar, and Martha Larson. 2020. The Connection between the Text and Images of News Articles: New Insights for Multimedia Analysis. In Proceedings of the 12th Language Resources and Evaluation Conference. European Language Resources Association, Marseille, France, 4343-4351. https://aclanthology.org/2020.lrec-1.535
[7] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. CoRR abs/2103.00020 (2021). arXiv:2103.00020 https://arxiv.org/abs/2103.00020
[8] Yifan Wang, Xing Xu, Wei Yu, Ruicong Xu, Zuo Cao, and Heng Tao Shen. 2021. Combine Early and Late Fusion Together: A Hybrid Fusion Framework for Image-Text Matching. In 2021 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 1-6.