<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Image-Text Re-Matching with Zero-shot and Finetuning of CLIP</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yuta Fukatsu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Masaki Aono</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Toyohashi University of Technology</institution>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Images play an important role in the perception of online news. We aim to gain more insight into the interplay of images and texts across different news domains. In this paper, we describe our method for Image-Text Re-Matching based on the CLIP model with zero-shot inference and finetuning. Specifically, we introduce WISE-FT, a method that linearly interpolates the weights of the zero-shot and finetuned models, to improve recall for re-matching. WISE-FT has been reported to boost the accuracy of CLIP in classification experiments. We obtained a certain level of MRR and Recall with these methods.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 INTRODUCTION</title>
      <p>Online news articles in recent years have been a mixture of text and images. Images are often added to text articles to attract attention and help readers understand the article intuitively. Typically, studies of multimedia and recommendation systems assume a simple relationship between images and text. For example, in the study of image captioning [1], it is assumed that the caption is a textual description of the scene in the image. However, news-specific studies point to a more complex relationship [2]. The NewsImages task [3] in MediaEval 2022 investigates this relationship to understand its implications for journalism and news personalization. In MediaEval 2021, Thien-Tri et al. [4] used CLIP (Contrastive Language-Image Pre-Training) [5] without training on the task dataset. In contrast, we train on the dataset published by the organizers and observe improvements in Recall and MRR. However, news data exhibits a different image-text relationship than datasets such as MSCOCO [6]. Therefore, CLIP's inherent ability may be lost due to shifts in the data distribution caused by finetuning. As a solution, we apply WISE-FT [7], which has been reported to be robust to shifts in data distribution in classification, to our retrieval task. The paper is organized as follows: Sec. 2 reviews related work; Sec. 3 presents our approach; Sec. 4 reports experimental results, analyzes them, and discusses trends through visualization. Finally, Sec. 5 presents conclusions and remaining challenges.</p>
    </sec>
    <sec id="sec-2">
      <title>2 RELATED WORK</title>
    </sec>
    <sec id="sec-3">
      <title>2.1 CLIP</title>
      <p>
        CLIP (Contrastive Language-Image Pre-Training) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] is a neural network trained on a large dataset of image-text pairs. CLIP can predict the most relevant text given an image without being directly optimized for the dataset of a particular task. This process of making predictions for a task other than the pre-training task, without further optimization, is called zero-shot inference. CLIP is very powerful in this zero-shot setting, comparable to the performance of the original ResNet50 on ImageNet. The feature representations obtained by CLIP through pre-training can be used to vectorize data in retrieval tasks.
      </p>
    </sec>
    <sec id="sec-4">
      <title>2.2 WISE-FT on classification</title>
      <p>
        CLIP shows consistent accuracy in zero-shot inference across a variety of datasets. Finetuning can further improve accuracy on specific datasets. However, the shift in data distribution introduced by finetuning can reduce robustness. Wortsman et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] introduced WISE-FT, an ensemble of the weights of the zero-shot and finetuned models, to address this problem, and found that it improves accuracy in classification problems. The ensemble of weights in WISE-FT is achieved by linear interpolation. For a hyperparameter α, the model weights θ are determined by the following equation (1):
        θ<sub>WISE-FT</sub> = (1 − α) × θ<sub>zero-shot</sub> + α × θ<sub>finetune</sub>  (1)
      </p>
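<p>As a minimal illustration of Equation 1, the interpolation can be sketched over plain weight dictionaries. The parameter names and values below are hypothetical; in practice the arguments would be full CLIP state dicts holding tensors, but the interpolation rule is identical.</p>

```python
# Sketch of WISE-FT weight interpolation (Equation 1).
# theta_zs and theta_ft stand in for the zero-shot and finetuned model
# weights; real checkpoints would hold tensors instead of floats.

def wise_ft(theta_zs, theta_ft, alpha):
    """Linearly interpolate two weight dictionaries with the same keys."""
    assert theta_zs.keys() == theta_ft.keys()
    return {
        name: (1.0 - alpha) * theta_zs[name] + alpha * theta_ft[name]
        for name in theta_zs
    }

# alpha = 0 recovers the zero-shot weights, alpha = 1 the finetuned ones.
zs = {"proj.weight": 1.0, "proj.bias": -2.0}
ft = {"proj.weight": 3.0, "proj.bias": 0.0}
mixed = wise_ft(zs, ft, alpha=0.5)  # {"proj.weight": 2.0, "proj.bias": -1.0}
```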
    </sec>
    <sec id="sec-5">
      <title>3 APPROACH</title>
    </sec>
    <sec id="sec-6">
      <title>3.1 Training CLIP</title>
      <p>
        Three datasets are provided by the organizer in this task [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Thus, we train CLIP separately on each dataset. Finetuning of CLIP was done using the experimental method of Eslami et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The CLIP model consists of a Vision Encoder and a Text Encoder. The Vision Encoder can be implemented by CNN-based models such as ResNet or by Transformer-based models such as the Vision Transformer. We used a Vision Transformer-based model with a patch size of 32 as the Vision Encoder. The model structure of CLIP and the hyperparameters used in training are the same for each dataset. In addition, only the Online News portal dataset is in German. Since CLIP is pre-trained on an English dataset, the German data must be translated in order to benefit from the pre-training. We used the method of Tiedemann and Thottingal [9] for German-English translation.
      </p>
    </sec>
    <sec id="sec-7">
      <title>3.2 Applying WISE-FT</title>
      <p>News datasets have two characteristics. First, they may differ in content from domain to domain. Second, their content depends on when the news is released. For the first characteristic, finetuning adapted to the data in each domain is expected to improve Recall. However, for the second characteristic, there is a possibility that finetuning may degrade performance due to differences in data distribution caused by the timing of news releases. Therefore, we use WISE-FT, which has been reported to improve accuracy in classification problems, for the retrieval problem. WISE-FT linearly interpolates the weights as shown in Equation 1. Since training is performed separately for each dataset, WISE-FT is also applied per dataset. The hyperparameter α of the linear interpolation likewise differs for each dataset.</p>
    </sec>
    <sec id="sec-8">
      <title>3.3 Splitting the Dataset and Submitted Runs</title>
      <p>Three datasets are provided by the organizer for this task: Online News portals, Twitter, and RSS news feed. We sort each dataset in chronological order, using the first 80% as training data and the remaining 20% as validation data. Predictions for the test data were made by extracting features from all the test data and then computing cosine similarity over those features to obtain the top 100 candidates.</p>
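<p>The chronological split and cosine-similarity retrieval described above can be sketched as follows. The toy item names and feature vectors are hypothetical; in practice the vectors would come from the CLIP encoders.</p>

```python
import math

def chronological_split(items, train_ratio=0.8):
    """Sort (timestamp, item) pairs by time and split into train/validation."""
    ordered = [item for _, item in sorted(items, key=lambda pair: pair[0])]
    cut = int(len(ordered) * train_ratio)
    return ordered[:cut], ordered[cut:]

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query_vec, candidates, k=100):
    """Rank candidate (id, vector) pairs by cosine similarity to the query."""
    ranked = sorted(candidates, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [cid for cid, _ in ranked[:k]]

train, val = chronological_split([(3, "c"), (1, "a"), (2, "b"), (4, "d"), (5, "e")])
# train == ["a", "b", "c", "d"], val == ["e"]
best = top_k([1.0, 0.0], [("x", [0.0, 1.0]), ("y", [1.0, 0.1])], k=1)
# best == ["y"]
```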
      <p>In Run1, we use zero-shot prediction by CLIP pre-trained on WebImageText. In Run2, we finetune CLIP on each dataset and predict with the finetuned models. In Run3, we apply WISE-FT to the CLIP models trained on each dataset. The parameter α is the value with the highest Recall@1 on each validation set. Specifically, 0.5 is used for Online News portals, 0.5 for Twitter, and 0.4 for RSS news feeds.</p>
    </sec>
    <sec id="sec-9">
      <title>4 RESULTS AND ANALYSIS</title>
    </sec>
    <sec id="sec-10">
      <title>4.1 Submission Result</title>
      <p>The results of the submitted runs are summarized in Table 1 for Online News portals, Table 2 for Twitter, and Table 3 for RSS news feed. The left column shows the names of the Runs. The evaluation metrics shown are MRR@100, Recall@5, Recall@10, Recall@50, and Recall@100. Each table shows the results of the zero-shot, finetuned, and WISE-FT variants of CLIP.</p>
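<p>For reference, the metrics above can be computed from ranked candidate lists as sketched below. The query and candidate ids are hypothetical; each query maps to its ordered top-100 candidates and a single ground-truth item.</p>

```python
# Sketch of the evaluation metrics MRR@n and Recall@k over ranked lists.

def mrr_at_n(ranked_lists, truths, n=100):
    """Mean reciprocal rank of the ground truth within the top n."""
    total = 0.0
    for query, ranked in ranked_lists.items():
        top = ranked[:n]
        if truths[query] in top:
            total += 1.0 / (top.index(truths[query]) + 1)
    return total / len(ranked_lists)

def recall_at_k(ranked_lists, truths, k):
    """Fraction of queries whose ground truth appears in the top k."""
    hits = sum(1 for q, ranked in ranked_lists.items() if truths[q] in ranked[:k])
    return hits / len(ranked_lists)

ranked = {"q1": ["a", "b", "c"], "q2": ["d", "e", "f"]}
truth = {"q1": "b", "q2": "f"}
# MRR@100 = (1/2 + 1/3) / 2 = 5/12; Recall@2 = 0.5
```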
    </sec>
    <sec id="sec-11">
      <title>4.2 Analysis of Each Dataset</title>
    </sec>
    <sec id="sec-12">
      <title>4.2.1 Ranking Changes by WISE-FT</title>
      <p>The analysis in this section uses our own split of the data for validation. Table 4 shows the average improvement and decrease in ranking from finetuning to WISE-FT. The tabulation here is restricted to cases with an improvement from zero-shot to finetuning. It should be noted that the size of the validation set differs for each dataset. Recall values improved the most for RSS news feeds, but the average increase in ranking was the lowest.</p>
      <p>The left side of Figure 1 shows a case that was worsened by finetuning but improved by WISE-FT, and the right side shows a case that was improved by finetuning but worsened by WISE-FT. Since the original text is too long to display, only the first sentence is shown. Among the cases worsened by WISE-FT, we found news about specific persons. Among the cases improved by WISE-FT, we found news where the text described the scene of the image. It is possible that WISE-FT brought general information into focus.</p>
      <p>The left side of Figure 3 likewise shows a case that was worsened by finetuning but improved by WISE-FT, and the right side shows a case that was improved by finetuning but worsened by WISE-FT. Again, only the first sentence of each text is displayed. Despite the higher accuracy compared to other datasets, the text is not a direct description of the image in either case.</p>
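<p>The Table 4 tabulation described above can be sketched as follows. The query ids and ranks are hypothetical; each mapping gives the rank of the ground-truth item (1 = best) under one model variant, and only queries that finetuning already improved over zero-shot are counted.</p>

```python
# Sketch of the rank-change tabulation: average improvement and decline
# when moving from the finetuned model to WISE-FT.

def average_rank_changes(ranks_zs, ranks_ft, ranks_wise):
    """Return (avg improvement, avg decline) of WISE-FT vs. finetuning.

    Each argument maps query id -> rank of the ground truth (1 = best).
    Queries where finetuning did not improve on zero-shot are skipped.
    """
    improvements, declines = [], []
    for q in ranks_ft:
        if ranks_ft[q] >= ranks_zs[q]:       # finetuning did not improve
            continue
        delta = ranks_ft[q] - ranks_wise[q]  # positive = WISE-FT ranks higher
        if delta > 0:
            improvements.append(delta)
        elif delta < 0:
            declines.append(-delta)
    avg = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return avg(improvements), avg(declines)

zs = {"q1": 10, "q2": 5, "q3": 8}
ft = {"q1": 4, "q2": 2, "q3": 9}    # q3 was not improved by finetuning
wise = {"q1": 1, "q2": 6, "q3": 3}
up, down = average_rank_changes(zs, ft, wise)  # up == 3.0, down == 4.0
```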
    </sec>
    <sec id="sec-13">
      <title>5 CONCLUSIONS</title>
      <p>We adopted CLIP and implemented finetuning and WISE-FT. As a result, we achieved an MRR@100 score of 0.240 and a Recall@100 score of 0.735 on the Online News portals test set, an MRR@100 score of 0.476 and a Recall@100 score of 0.595 on the Twitter test set, and an MRR@100 score of 0.455 and a Recall@100 score of 0.859 on the RSS news feed test set. We confirmed that the zero-shot method can obtain a consistent level of Recall. Furthermore, on all datasets, finetuning increased Recall more than zero-shot. This indicates that CLIP pre-trained on datasets from different domains transfers to some degree to the news datasets, and that learning to adapt to the news data is an effective method. On the other hand, the improvement in Recall by WISE-FT was significant only for the RSS news feed dataset. This may be because the hyperparameter α in the linear interpolation of WISE-FT was determined on the validation data, which did not yield optimal model weights for the test data. Alternatively, the finetuning may have over-adapted to the content of a certain time period.</p>
      <p>Future work is needed to understand why similar images were attached to different news articles. For example, the two images on the left in Figure 3 are images of buildings, but in the news articles they are associated with different information, such as place names. For this reason, the application of methods other than deep learning, such as pre-associating images with named entities such as place names, may improve Recall.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] MD. Zakir Hossain, Ferdous Sohel, Mohd Fairuz Shiratuddin, and Hamid Laga. 2019. A Comprehensive Survey of Deep Learning for Image Captioning. ACM Comput. Surv. 51, 6, Article 118 (Feb. 2019). https://doi.org/10.1145/3295748</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] Nelleke Oostdijk, Hans van Halteren, Erkan Başar, and Martha Larson. The Connection between the Text and Images of News Articles: New Insights for Multimedia Analysis. In Proceedings of the 12th Language Resources and Evaluation Conference. 4343-4351.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] Benjamin Kille, Andreas Lommatzsch, Özlem Özgöbek, Mehdi Elahi, and Duc-Tien Dang-Nguyen. News Images in MediaEval 2022. In Proc. of the MediaEval 2022 Workshop, Bergen, Norway and Online, 12-13 January 2023.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] Thien-Tri Cao, Nhat-Khang Ngo, Thanh-Danh Le, Tuan-Luc Huynh, Ngoc-Thien Nguyen, Hai-Dang Nguyen, and Minh-Triet Tran. 2021. HCMUS at MediaEval 2021: Fine-tuning CLIP for Automatic News-Images Re-Matching. In Proceedings of the MediaEval 2021 Workshop, Online, 13-15 December 2021.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. CoRR abs/2103.00020 (2021). arXiv:2103.00020 https://arxiv.org/abs/2103.00020</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. 2014. Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision, Springer, 2014, pp. 740-755. doi:10.1007/978-3-319-10602-1_48</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Li, Simon Kornblith, Rebecca Roelofs, Raphael Gontijo-Lopes, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, and Ludwig Schmidt. 2021. Robust fine-tuning of zero-shot models. arXiv preprint arXiv:2109.01903. https://arxiv.org/abs/2109.01903</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] Sedigheh Eslami, Gerard de Melo, and Christoph Meinel. 2021. Does CLIP Benefit Visual Question Answering in the Medical Domain as Much as it Does in the General Domain? CoRR abs/2112.13906 (2021). arXiv:2112.13906 https://arxiv.org/abs/2112.13906</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] Jörg Tiedemann and Santhosh Thottingal. 2020. OPUS-MT – Building open translation services for the World. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, pages 479-480, Lisboa, Portugal. European Association for Machine Translation.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>