<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Re-matching Images and News Using CLIP Pretrained Model</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Huu-Nghia Vu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hai-Dang Nguyen</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Minh-Triet Tran</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>John von Neumann Institute</institution>
          ,
          <addr-line>VNU-HCM</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Science</institution>
          ,
          <addr-line>VNU-HCM</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Vietnam National University</institution>
          ,
          <addr-line>Ho Chi Minh city</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Discovering the relationship between images and news articles is an extremely complex problem due to long and often irrelevant text. The NewsImages 2022 task aims to describe the relation between the textual and visual (image) content of news articles. In recent years, image-text matching has gained increasing popularity, as it bridges the heterogeneous image-text gap and plays an essential role in understanding images and language. We propose fine-tuning the CLIP model for this task. The evaluation shows that our method produces promising results for the image-text matching task but needs further optimization.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Nowadays, articles become common in daily life to update daily news in a concise and accurate
manner. This is usually done by highlighting the title and main idea of each section. Besides,
to make the article more intuitive, journalists often insert images. Readers from there have an
overview and complete view of the problem mentioned in the article, and what is happening. And
images are becoming one of the most popular ways not only to summarize content for articles or
sections of articles but also to attract readers’ attention. The MediaEval 2022 NewsImages task
expects researchers to discover and develop patterns/models to describe the relation between
images and texts of news articles (including text body and headlines), serving to improve
multimedia and recommended systems.</p>
<p>We participate in this task and propose a method for it. Given pairs of matched images
and articles, our task is to correctly reassign images to articles in order to understand how
illustrations are selected from the perspective of journalism. We fine-tune the CLIP model
because of its strength on image-text matching problems.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Learning the correspondence between images and texts is quite complicated. Research in
multimedia and recommender systems on image captioning [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] assumes a simple relationship
between images and text occurring together, but the caption often only describes the literally
depicted content of the image. Wang et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] investigate two-branch neural networks for learning the
similarity between two data modalities, retrieving sentences given images and vice versa, but
the approach struggles on image-sentence retrieval. Li et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] propose a simple and interpretable
reasoning model that generates a visual representation capturing the key objects and semantic
concepts of a scene, and use gate and memory mechanisms to perform global semantic
reasoning on these relationship-enhanced features, so that image-text similarity can be
measured on an enhanced whole-image representation. Liu et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] present a novel Graph
Structured Matching Network (GSMN) to learn fine-grained correspondence between image
and text.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Approach</title>
      <p>
        CLIP (Contrastive Language–Image Pre-training) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] is a powerful pretrained model for
image-text matching tasks. The model was trained on more than 400 million image-text pairs,
a dataset far larger than that of this task. Because the contest’s dataset lacks diversity, we only
use the pre-trained model on the training set, fine-tune it, and evaluate on the test set.
      </p>
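<p>Fine-tuning drives CLIP’s symmetric contrastive objective: image and text embeddings of matched pairs are pulled together while mismatched pairs within a batch are pushed apart. The following NumPy sketch of that loss is illustrative only, not our training code; in CLIP the temperature is a learned parameter, fixed here for simplicity:</p>

```python
import numpy as np

def clip_symmetric_loss(img_emb, txt_emb, temperature=0.07):
    """CLIP-style symmetric cross-entropy over cosine-similarity logits.

    img_emb, txt_emb: (N, d) arrays of paired embeddings; row i of each
    side belongs to the same news article.
    """
    # L2-normalize so the dot product is cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (N, N) similarity matrix

    def cross_entropy(l):
        # Row-wise softmax cross-entropy with targets on the diagonal.
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average of the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

<p>During fine-tuning this value is minimized with gradient descent over batches of matched news-image pairs.</p>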
<p>The CLIP model consists of two sub-models: an image encoder and a text encoder. Images
are resized to 224 × 224 using bicubic interpolation and normalized before being fed to the
image encoder, which is implemented with the ViT-B/32 model. Processing the text, however,
is more complicated. Our text processing includes the following steps, in order:
• Translate text into English: The text covers three domains: RSS, RT (Russia Today),
and TW (Twitter). The RT news is written in German, so we translated it into English
to be compatible with the CLIP model.
• Text selection: Before being fed to the text encoder, the text needs to be vectorized. The
maximum length of this vector is 77 (the context length assigned in the model for
computational efficiency). In fact, some samples in the RSS and RT data are extremely
long (about 10,000 words or more). We therefore only keep the title together with a few
sentences that carry the general content of the article; the sentence at the beginning of
each paragraph or section is selected because it often summarizes that part of the news.
• Basic text processing: We apply the following steps sequentially: lowercase the text,
remove punctuation, remove extra spaces, remove default stop-words, stem, lemmatize,
and expand contractions. We do this step with the help of the NLTK library.</p>
      <p>
        We believe that these steps improve the extracted features.
• Emoji processing: For the Twitter data, emojis and icons make up the majority of the
textual content. We therefore use the Ekphrasis [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] library to replace them with tags, and additionally
correct misspellings and typos (e.g., teen slang words) for cleaner text. This step is not
applied to the RSS and RT data.
      </p>
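<p>The text selection and basic processing steps above can be sketched as follows. This is a simplified, hypothetical stand-in: the real pipeline uses NLTK’s default stop-word list, stemming, and lemmatization (and Ekphrasis for the Twitter data), while here sentence splitting and the stop-word set are deliberately minimal:</p>

```python
import re
import string

# A tiny illustrative stop-word set; the actual pipeline uses
# NLTK's default English stop-words instead.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "in", "on", "and"}

def select_text(title, paragraphs):
    """Keep the title plus the first sentence of each paragraph."""
    leads = [p.split(".")[0].strip() for p in paragraphs if p.strip()]
    return ". ".join([title] + leads)

def basic_clean(text):
    """Lowercase, strip punctuation and extra spaces, drop stop-words."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in re.split(r"\s+", text) if t and t not in STOP_WORDS]
    return " ".join(tokens)
```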
<p>We only apply the first three steps to the RSS and RT data, and only the emoji processing
to the TW data. The text is then vectorized and fed to the Transformer-based text encoder.</p>
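<p>Because the text encoder has a fixed context length of 77 tokens, every token sequence must be clamped to that length before encoding. A minimal sketch of this clamping (CLIP’s own tokenizer performs the equivalent padding and offers a truncate option for overlong input):</p>

```python
def pad_or_truncate(token_ids, context_length=77, pad_id=0):
    """Clamp a token-id sequence to the encoder's fixed context length.

    Sequences longer than context_length are cut off; shorter ones are
    padded with pad_id so every input has the same shape.
    """
    ids = list(token_ids)[:context_length]
    return ids + [pad_id] * (context_length - len(ids))
```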
<p>Given the problem statement, one could use two independent models and match their
extracted features. However, besides the advantage that CLIP is trained on a huge dataset,
pairing images and text during training allows the model to learn and explore the relationship
between the features of images and texts.</p>
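<p>With both encoders fine-tuned, re-matching reduces to ranking all candidate articles for each image by the cosine similarity of their embeddings. A minimal NumPy sketch of this step (array shapes and function names are illustrative, not from our codebase):</p>

```python
import numpy as np

def rank_articles(image_embs, text_embs):
    """For each image, rank all candidate articles by cosine similarity.

    Returns an (N_img, N_txt) array where row i lists article indices
    from best to worst match for image i.
    """
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = img @ txt.T                 # cosine similarity matrix
    return np.argsort(-sims, axis=1)   # descending similarity
```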
    </sec>
    <sec id="sec-4">
      <title>4. Results and Analysis</title>
<p>We performed our experiment on two sets of text: the title with (T1) and without (T2) the
summary of each section/paragraph. We first evaluate on the training set to see how the
CLIP model performs on the contest dataset.</p>
<p>Table 1 shows the results of our experiment using the following metrics: Mean Reciprocal
Rank (MRR) and MeanRecall@{1, 5, 10, 50, 100}. Note that we did not run the experiment with
the T2 set of the TW data (because its text is short). From the results, we can see that with more
text (T1), MRR and MeanRecall@{10, 50, 100} are better, but MeanRecall@{1, 5} are not. Thus
the length of the text, i.e., the amount of input information, may have a great influence
on the model.</p>
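<p>The metrics in Table 1 can be computed directly from the ranked candidate lists. A small sketch of MRR and MeanRecall@k, assuming each image has exactly one correct article (function and variable names are ours, not from an evaluation toolkit):</p>

```python
import numpy as np

def mrr_and_recall(ranked, gold, ks=(1, 5, 10, 50, 100)):
    """Compute MRR and MeanRecall@k from ranked candidate lists.

    ranked: (N, M) array, row i = article indices sorted best-first
            for image i; gold: length-N array of correct indices.
    """
    # 1-based rank position of the correct article for each image.
    positions = np.array(
        [np.where(ranked[i] == gold[i])[0][0] + 1 for i in range(len(gold))]
    )
    mrr = float(np.mean(1.0 / positions))
    recall = {k: float(np.mean(positions <= k)) for k in ks}
    return mrr, recall
```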
<p>In addition, the results obtained on the TW data are higher than those on the RSS and RT
data. This can be explained by the fact that in the text processing step we omitted part of the
text in the RSS and RT data but not in the TW data; this loss of information leads to lower
results. Moreover, taking only the first sentence of each section makes it impossible for us to
strictly control the amount of information lost. Still, we were quite surprised that the TW data,
with so little information, gives the best results.</p>
<p>Table 2 shows the results of our experiment with the text that does not contain the summary
of each paragraph (T2). The results on the RT data are lower than on RSS and TW, as was also
observed on the training set. This may be due to information lost in translation. However,
MeanRecall@{10, 50, 100} scale very well. Once again, the results on TW are the best of the
three domains.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion and Outlook</title>
<p>Learning and discovering the relationship between image and text is quite challenging. We
obtained good results for Recall@{10, 50, 100}, which shows the benefit and power of
fine-tuning a pre-trained model. However, we found that the data processing, especially the text
processing before entering the model, ran into problems: we could not control the amount of
information lost by our one-sided selection (taking the first sentence of each section). This leads
to unexpected results on the RSS and RT domains compared with the TW domain. We need to
improve our text processing techniques so that they can handle very long documents while
retaining valuable features, and handle large texts in the model while maintaining stable accuracy.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] M. Z. Hossain, F. Sohel, M. F. Shiratuddin, H. Laga, A comprehensive survey of deep learning for image captioning, 2018. URL: https://arxiv.org/abs/1810.04020. doi:10.48550/ARXIV.1810.04020.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] L. Wang, Y. Li, J. Huang, S. Lazebnik, Learning two-branch neural networks for image-text matching tasks, IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (2019) 394-407. doi:10.1109/TPAMI.2018.2797921.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] K. Li, Y. Zhang, K. Li, Y. Li, Y. Fu, Visual semantic reasoning for image-text matching, in: ICCV, 2019.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] C. Liu, Z. Mao, T. Zhang, H. Xie, B. Wang, Y. Zhang, Graph structured network for image-text matching, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10921-10930.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever, Learning transferable visual models from natural language supervision, 2021. URL: https://arxiv.org/abs/2103.00020. doi:10.48550/ARXIV.2103.00020.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] C. Baziotis, N. Pelekis, C. Doulkeridis, DataStories at SemEval-2017 Task 4: Deep LSTM with attention for message-level and topic-based sentiment analysis, in: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Association for Computational Linguistics, Vancouver, Canada, 2017, pp. 747-754.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>