<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Investigating the Performance of the CLIP Model and Concept Matching in Text-Image Retrieval Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xiaomeng Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mingliang Liang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martha Larson</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Radboud University</institution>
          ,
          <country country="NL">Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Improving the comprehension of textual and visual interaction in news articles significantly improves the efficiency of news text-image retrieval. We evaluate the performance of the pre-trained CLIP model on the MediaEval 2023 NewsImages benchmark. Additionally, we investigate the contribution of concept matching to our text-image matching system by tokenizing, part-of-speech tagging, and filtering to extract concepts from the news title. In addition, by analyzing the training datasets, we gain insights into what leads to better performance in text-image matching. Our working notes report the official results of our submitted runs and present additional experiments.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Retrieving a suitable image (or text) that corresponds well to a given text (or image) is a challenging
task in the vision-language domain [
        <xref ref-type="bibr" rid="ref1">1, 2, 3</xref>
        ], especially in the domain of news articles [4],
because the connection between a news image and the related article is often loose [4].
Consequently, recognizing the interactions between images and text is particularly important
in the news domain, as it helps to develop better models for matching news images and text.
The MediaEval 2023 NewsImages benchmark [4] offers datasets and evaluation components
specifically designed to explore the relationship between news articles and their accompanying
images; participants are required to retrieve the correct images based on the given news
items’ titles and texts.
      </p>
      <p>Large-scale vision-language pre-trained models have shown remarkable zero-shot
performance on text-image retrieval tasks [5, 6]. Therefore, we employ the CLIP [5] (Contrastive
Language-Image Pre-training) model to perform news text-image retrieval on the given
datasets. Because OpenCLIP provides open-source code and pre-trained models at different
scales, we can apply it directly to the NewsImages task without fine-tuning. The OpenCLIP
model achieves good performance according to the evaluation metrics.</p>
      <p>Further, we investigate the capability of matching between concepts and images. The
motivation for this experiment is that nouns tend to have a more direct correlation with the
content visually represented in an image than other parts of speech do. Therefore, we extract
nouns and proper nouns from the news titles as concepts; subsequently, we employ these
extracted concepts to retrieve the corresponding news images. Experimental results indicate
that the text-image retrieval system performs better when concepts are embedded in natural
language structures, such as news titles.</p>
      <p>Finally, we manually inspect the training subsets to see how news titles, text snippets, and
entities correlate with their accompanying news images. We gain the impression that text-image
matching performs better when the text literally describes the news image.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Approach</title>
      <sec id="sec-2-1">
        <title>2.1. Extracting concepts from the news title</title>
        <p>To explore the capability of concept matching for the text-image matching system, we extract
concepts from the given news titles. This extraction process consists of three primary steps:
tokenization, achieved by breaking the news title down into individual tokens or words using
libraries like NLTK [7]; part-of-speech tagging, which assigns each token its specific part of
speech (e.g., noun, verb, adjective); and lastly, filtering, where we keep nouns and proper
nouns (NN, NNP) to create a text consisting only of elements that can be considered
concepts. We present some examples in Table 1.</p>
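        <p>The three steps can be sketched with NLTK as follows. This is a minimal illustration; the function names and the example usage are ours, not necessarily the exact pipeline used for the submitted runs:</p>

```python
import nltk

# Resources for tokenization and POS tagging (no-op if already present)
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def filter_concepts(tagged):
    """Keep only tokens tagged as nouns (NN) or proper nouns (NNP)."""
    return [tok for tok, tag in tagged if tag in ("NN", "NNP")]

def extract_concepts(title):
    """Tokenize, POS-tag, and filter a news title down to its concepts."""
    tokens = nltk.word_tokenize(title)        # step 1: tokenization
    tagged = nltk.pos_tag(tokens)             # step 2: part-of-speech tagging
    return " ".join(filter_concepts(tagged))  # step 3: filtering
```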
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Utilizing CLIP model for news text-image retrieval</title>
        <p>We employ the CLIP [5] model to extract features from both images and texts. Our choice of
pre-trained model is OpenCLIP [8], an open-source implementation of CLIP. Specifically, we
directly use the ViT-B-16 [8] model pre-trained on the LAION-400M dataset [9] without
fine-tuning. For the training and test datasets, we first pre-process and encode the news texts
and images separately using their respective encoders. Subsequently, we measure the similarity
between the text embedding and the embeddings of all images using cosine similarity. Finally,
we compile a top-100 list of the most relevant images based on their similarity scores.</p>
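        <p>Assuming the text and image features have already been extracted (e.g., by the OpenCLIP ViT-B-16 encoders), the ranking step can be sketched as follows; the function name is our own illustrative choice:</p>

```python
import numpy as np

def rank_images(text_emb, image_embs, k=100):
    """Rank images for one text query by cosine similarity.

    text_emb:   (d,) text embedding
    image_embs: (N, d) image embeddings
    Returns the indices of the top-k most similar images, best first.
    """
    # L2-normalize so that the dot product equals cosine similarity
    text_emb = text_emb / np.linalg.norm(text_emb)
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = image_embs @ text_emb          # (N,) cosine similarities
    k = min(k, len(sims))
    return np.argsort(-sims)[:k]          # indices sorted by descending similarity
```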
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Sampling examples from the training datasets</title>
        <p>We also conduct text-image retrieval on training subsets, and we manually inspect the
well-performing examples from these subsets. This experiment is additional to the runs that
we officially submitted to the task. The training subsets are sampled from the provided training
datasets (GDELT-P1, GDELT-P2, and RT) to match the sizes of their respective test
datasets. As a result, the GDELT-P1, GDELT-P2, and RT training subsets used for this experiment
contain 1500, 1500, and 3000 examples, respectively.</p>
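        <p>The subsampling itself is straightforward; a sketch with a fixed seed for reproducibility (the seed value and function name are our assumptions, not part of the original setup):</p>

```python
import random

def sample_subset(examples, size, seed=0):
    """Draw a training subset whose size matches the corresponding test set."""
    rng = random.Random(seed)           # fixed seed so the subset is reproducible
    return rng.sample(examples, size)

# e.g., 1500 examples for GDELT-P1 and GDELT-P2, 3000 for RT
```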
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results and Analysis</title>
      <sec id="sec-3-1">
        <title>3.1. Retrieval results on test datasets</title>
        <p>The results of the news text-image retrieval task across the three test datasets (see Sections 2.1
and 2.2) are presented in Table 2. Three text types are evaluated: “title only”, where the news
title was used for news image retrieval; “concepts only”, where concepts extracted from the
news title were utilized; and “entities/text snippet only”, where the entities or text snippet
provided in the dataset were used for retrieval. The
evaluation metrics are Mean Reciprocal Rank (MRR) and Recall@k (R@k) (k=5, 10, 50, 100).</p>
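        <p>For reference, both metrics can be computed from the 1-based rank of each query’s correct image; a minimal sketch (the function names are ours):</p>

```python
def mrr(ranks):
    """Mean Reciprocal Rank over the 1-based ranks of the correct images."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def recall_at_k(ranks, k):
    """Fraction of queries whose correct image appears within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)
```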
        <p>Across all datasets, the “title only” approach consistently outperformed the “concepts only”
approach on all evaluation metrics. Specifically, on the GDELT-P1 test dataset, using the
news title resulted in an MRR of 0.49178, with R@5 and R@10 values of 0.63467 and 0.71400,
respectively. In contrast, using only the extracted concepts yielded lower scores,
with an MRR of 0.36364, R@5 of 0.48933, and R@10 of 0.57733. Similar trends were
observed on the GDELT-P2 and RT test datasets. In short, the capability of the text-image retrieval
system goes beyond simple concept matching, specifically beyond the matching of nouns and proper
nouns.</p>
        <p>[Table 2: MRR and R@k for the text types “title only”, “concepts only”, and “entities/text snippet only” on the GDELT-P1, GDELT-P2, and RT test datasets.]</p>
        <p>[Table 3: MRR and R@k for the text types “title only”, “concepts only”, and “entities/text snippet only” on the GDELT-P1, GDELT-P2, and RT training subsets.]</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Retrieval results on training subsets</title>
        <p>The results of the news text-image retrieval task across the three training subsets (see Section 2.3)
are presented in Table 3. The “title only” text type achieves the best results, followed by “concepts
only” and finally “entities or text snippet only”, across the respective subsets. In other words, it
is easier to retrieve the accompanying news image using the news title than
relying solely on entities or text snippets. As illustrated by the examples from the three
training subsets in Table 4, the news title demonstrates high relevance to the accompanying
news image. Conversely, the contents of text snippets or entities tend to have a more contextual
and inferential connection to the visual content of the news image. We had the impression that
text-image matching is more effective with texts that describe the visual content of the
news image.</p>
        <p>Also, we manually inspect the well-performing and the poorly-performing examples from
the training subsets, i.e., examples whose matching ranks are 1 and beyond 100, respectively.
We perceived that the text-image matching system is more successful when the text explicitly
describes the visual elements present in the news image. Specifically, text-image matching
appears to perform better when the news image includes objects that correspond to words
mentioned in the text, or itself contains words that match words in the text.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion and Outlook</title>
      <p>In this paper, we propose utilizing the pre-trained CLIP model for news text-image retrieval.
We conduct a comprehensive analysis of training subsets and test datasets, comparing evaluation
results when utilizing the news title, concepts extracted from the news title, and entities/text
snippets, respectively.</p>
      <p>The experimental results show that the text-image matching system is capable of going
beyond mere concept matching, specifically beyond the matching of nouns and proper nouns. In
addition, the system is more effective at processing visually descriptive texts
that contain concepts in natural phrases, such as news titles. Nevertheless, the system still has
difficulties in understanding the relationship between inferable texts and the corresponding
news images. For example, the system faces challenges when matching texts that contain
metadata such as attributes, sources, or other information related to the content of the image.
A previous paper [10] introduced the concept of the "depiction gap", which refers to the gap
between the textual representation of an image and its accompanying text. In the future, it
is crucial to enhance the understanding and reasoning capabilities of text-image
matching systems.</p>
      <p>
[2] R. Yan, A. G. Hauptmann, A review of text and image retrieval approaches for broadcast news
video, Information Retrieval 10 (2007) 445–484.
[3] T. Yu, J. Liu, Z. Jin, Y. Yang, H. Fei, P. Li, Multi-scale multi-modal dictionary bert for effective
text-image retrieval in multimedia advertising, in: Proc. of the 31st ACM International Conference
on Information &amp; Knowledge Management, 2022, pp. 4655–4660.
[4] A. Lommatzsch, B. Kille, Ö. Özgöbek, M. Elahi, D.-T. Dang-Nguyen, News images in MediaEval
2023, in: Proc. of the MediaEval 2023 Workshop, 2024.
[5] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin,
J. Clark, et al., Learning transferable visual models from natural language supervision, in: Proc. of
the International Conference on Machine Learning, PMLR, 2021, pp. 8748–8763.
[6] C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, T. Duerig, Scaling
up visual and vision-language representation learning with noisy text supervision, in: Proc. of the
International Conference on Machine Learning, 2021.
[7] S. Bird, E. Klein, E. Loper, Natural Language Processing with Python: Analyzing Text with the
Natural Language Toolkit, O’Reilly Media, Inc., 2009.
[8] M. Cherti, R. Beaumont, R. Wightman, M. Wortsman, G. Ilharco, C. Gordon, C. Schuhmann,
L. Schmidt, J. Jitsev, Reproducible scaling laws for contrastive language-image learning, in: Proc.
of the IEEE Conference on Computer Vision and Pattern Recognition, 2023, pp. 2818–2829.
[9] C. Schuhmann, R. Vencu, R. Beaumont, R. Kaczmarczyk, C. Mullis, A. Katta, T. Coombes, J. Jitsev,
A. Komatsuzaki, LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs, in: Proc.
of the Neural Information Processing Systems, 2021.
[10] A. Lommatzsch, B. Kille, Ö. Özgöbek, Y. Zhou, J. Tešić, C. Bartolomeu, D. Semedo, L. Pivovarova,
M. Liang, M. Larson, NewsImages: Addressing the depiction gap with an online news dataset for
text-image re-matching, in: Proc. of the 13th ACM Multimedia Systems Conference, 2022, pp. 227–233.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Rong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Remote sensing cross-modal text-image retrieval based on global and local information</article-title>
          ,
          <source>IEEE Transactions on Geoscience and Remote Sensing</source>
          <volume>60</volume>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          . doi:10.1109/TGRS.2022.3163706.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>