<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Ensemble Pre-trained Multimodal Models for Image-text Retrieval in the NewsImages MediaEval 2023</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Taihang Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jianxiang Tian</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiangrun Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaoman Xu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ye Jiang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Qingdao University of Science and Technology</institution>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents the investigation of two pre-trained multimodal models, BLIP-2 and CLIP, in the MediaEval 2023 NewsImages task. The pre-trained models are utilized to extract text and image features, whose cosine similarities are then computed. We also use the Dual Softmax operation and an ensemble of three models to enhance the retrieval quality of the extracted features. The experimental results demonstrate that the multimodal features extracted by the CLIP model significantly outperform those of BLIP-2, and that the Dual Softmax and ensemble methods further improve retrieval performance. We release our code at https://github.com/xxm1215/qust_mediaeval2023.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Text-image retrieval is a task that aims to retrieve the text/images that are semantically similar
to their query image/text. To better understand the relationship between the text and images
of news articles, the BLIP-2 model[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is utilized. BLIP-2 is a general and efficient pre-training
strategy that leverages a frozen pre-trained image encoder and a frozen large language model (LLM). It
trains a lightweight 12-layer Transformer encoder between them, thereby achieving
state-of-the-art performance on many vision-language tasks.
      </p>
      <p>
        Focusing on NewsImages 2022, Galanopoulos and Mezaris[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] proposed a text-image retrieval method based on a
pre-trained CLIP model. To address the new challenges posed by NewsImages this year, the
CLIP model[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], developed by OpenAI, is utilized. This pre-trained neural network, designed for
matching text and images, was trained on a large number of text-image pairs through contrastive
learning and performs well across various visual tasks.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Approach</title>
      <sec id="sec-3-1">
        <title>3.1. Data pre-processing</title>
        <p>The NewsImages task of MediaEval 2023 provides three datasets: RT, GDELT-P1, and GDELT-P2.
We preprocess both the training and testing textual data, retaining only the url, titleEN, and
textEN fields. The titleEN and textEN fields are then concatenated, as sketched below.</p>
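        <p>A minimal sketch of this pre-processing step follows. It assumes the textual data can be loaded as a CSV-like table with pandas; the actual file layout of the NewsImages releases may differ.</p>
        <preformat>
import pandas as pd

def preprocess(path):
    """Keep only url, titleEN and textEN, and build one query string per article."""
    df = pd.read_csv(path)
    df = df[["url", "titleEN", "textEN"]].fillna("")
    # Concatenate the English title and body, as described above.
    df["text"] = df["titleEN"].str.strip() + ". " + df["textEN"].str.strip()
    return df[["url", "text"]]
</preformat>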
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Pre-trained models</title>
        <p>
          First, we use the BLIP-2 model as the feature extractor, encoding images and text
separately with its official library, LAVIS[
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Because the raw image and text features have inconsistent
dimensions, we map them into a shared lower-dimensional space and then compute the cosine
similarity ranking between the low-dimensional image and text feature vectors.
        </p>
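        <p>The sketch below shows how these BLIP-2 features can be obtained with LAVIS. It follows the library's published feature-extraction example; the checkpoint name and the projected feature shapes (32 query tokens of size 256) come from that example and are assumptions with respect to our exact configuration.</p>
        <preformat>
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"
model, vis_proc, txt_proc = load_model_and_preprocess(
    name="blip2_feature_extractor", model_type="pretrain",
    is_eval=True, device=device)

image = vis_proc["eval"](Image.open("news.jpg").convert("RGB")).unsqueeze(0).to(device)
text = txt_proc["eval"]("headline. article body")  # concatenated titleEN + textEN

# Projected (lower-dimensional) features: image (1, 32, 256), text (1, 256).
img_feat = model.extract_features({"image": image}, mode="image").image_embeds_proj
txt_feat = model.extract_features({"text_input": [text]}, mode="text").text_embeds_proj[:, 0, :]

# Both projections are L2-normalised, so the dot product is a cosine
# similarity; we take the maximum over the 32 image query tokens.
score = (img_feat @ txt_feat.t()).max().item()
</preformat>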
        <p>
          We also use the CLIP model, encoding text and images separately through the pre-trained
ViT-H/14 and ViT-H/14@336px models. We calculate the cosine similarity between the features
of the article text (or article titles) and all test image features. We randomly split 10% of the RT
training dataset provided by NewsImages to train our model. Galanopoulos and Mezaris proposed the
Dual Softmax method[
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], which improved video retrieval by revising query-video similarities.
Inspired by this, we apply the Dual Softmax method to recalculate the similarity ranking between
text and images; a sketch is given below.
        </p>
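        <p>The sketch below illustrates this run. We use open_clip here as one plausible source of a ViT-H/14 checkpoint (the checkpoint tag is an assumption), and show one common formulation of the Dual Softmax re-weighting, in which the raw similarity matrix is multiplied elementwise by a softmax taken over the query axis[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]; the temperature value is likewise an assumption.</p>
        <preformat>
import torch
import torch.nn.functional as F
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k")   # checkpoint tag is an assumption
tokenizer = open_clip.get_tokenizer("ViT-H-14")

texts = ["headline one. body one", "headline two. body two"]         # placeholder queries
images = [Image.open(p).convert("RGB") for p in ["a.jpg", "b.jpg"]]  # placeholder gallery

def dual_softmax(sim, temp=100.0):
    # Re-weight each image column by how selectively it matches this text
    # compared with all other texts, then rank with the revised scores.
    return F.softmax(sim * temp, dim=0) * sim

with torch.no_grad():
    txt = F.normalize(model.encode_text(tokenizer(texts)), dim=-1)
    img = F.normalize(model.encode_image(torch.stack([preprocess(i) for i in images])), dim=-1)
    sim = txt @ img.t()                                   # (num_texts, num_images)
    top100 = dual_softmax(sim).argsort(dim=1, descending=True)[:, :100]
</preformat>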
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Multi-task Contrastive Learning Model</title>
        <p>We transformed the datasets by labeling the matched text-image pairs in the training set as 1, and each
text paired with non-matching images as 0. To address data volume and sample distribution issues, we
designed a threshold θ to control random sampling of the negative pairs. When θ = 0.0002, the ratio of 1s to 0s is
1:1.</p>
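        <p>This sampling procedure can be sketched as follows; the function and variable names are ours, and only the threshold value 0.0002 comes from our experiments.</p>
        <preformat>
import random

def build_pairs(items, theta=0.0002, seed=0):
    """items: list of ground-truth (text, image_id) pairs."""
    rng = random.Random(seed)
    pairs = [(text, image, 1) for text, image in items]        # positives, label 1
    for text, pos_image in items:
        for _, image in items:
            # Keep a random non-matching image with probability theta.
            # With N positives there are about N * (N - 1) candidate
            # negatives, so a 1:1 ratio needs theta close to 1 / (N - 1).
            if image != pos_image and theta > rng.random():
                pairs.append((text, image, 0))                 # negatives, label 0
    rng.shuffle(pairs)
    return pairs
</preformat>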
        <p>We use the pre-trained ViT-H/14@336px model to encode article text and images separately.
Two Self-attention and MLP (Multi-Layer Perceptron) modules are used to extract text and image
features, respectively, outputting two 768-dimensional feature matrices. After concatenating
these feature matrices, they are fed into an MLP for a binary classification task. The model
is trained using a multi-task learning approach, designed with a contrastive loss and binary
cross-entropy loss. We also introduce a scalar parameter λ to balance the multi-task learning,
and we set λ to 0.8 to make the model focus more on the contrastive loss. The final loss is
L = λ · L_con + (1 − λ) · L_bce, where L_con and L_bce are the contrastive loss and binary
cross-entropy loss, respectively.</p>
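        <p>A minimal PyTorch sketch of this architecture and loss follows. The attention head count, MLP layer sizes, and the concrete form of the contrastive loss (a cosine-embedding loss here) are assumptions; only the 768-dimensional features, the binary cross-entropy head, and λ = 0.8 come from the description above.</p>
        <preformat>
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskMatcher(nn.Module):
    # Pre-extracted CLIP features pass through per-modality
    # self-attention + MLP blocks; the concatenated features feed a
    # binary classifier, as described in Section 3.3.
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.txt_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.img_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.classifier = nn.Linear(2 * dim, 1)

    def forward(self, txt, img):
        # txt, img: (batch, 768) feature matrices from ViT-H/14@336px.
        t, _ = self.txt_attn(txt.unsqueeze(1), txt.unsqueeze(1), txt.unsqueeze(1))
        i, _ = self.img_attn(img.unsqueeze(1), img.unsqueeze(1), img.unsqueeze(1))
        t = self.txt_mlp(t.squeeze(1))
        i = self.img_mlp(i.squeeze(1))
        logit = self.classifier(torch.cat([t, i], dim=-1)).squeeze(-1)
        return t, i, logit

def total_loss(t, i, logit, label, lam=0.8, margin=0.5):
    # Final loss: L = lam * L_con + (1 - lam) * L_bce, with lam = 0.8.
    # A cosine-embedding loss stands in for the unspecified contrastive loss.
    target = label.float() * 2 - 1                    # {0, 1} to {-1, +1}
    con = F.cosine_embedding_loss(t, i, target, margin=margin)
    bce = F.binary_cross_entropy_with_logits(logit, label.float())
    return lam * con + (1 - lam) * bce
</preformat>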
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Ensemble Pre-trained Multimodal Models</title>
        <p>Voting in ensemble learning is a method of combining predictions from multiple models to
make a final decision. We utilize the hard voting approach for its simplicity, integrating the
predictions of the CLIP model and the multi-task contrastive learning model through voting
to make the final decision, thereby enhancing accuracy.</p>
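        <p>A minimal sketch of hard voting over per-model top-100 lists follows; the tie-breaking rule (the best rank assigned by any model) is our assumption, as the exact rule is not spelled out above.</p>
        <preformat>
from collections import Counter

def hard_vote(rankings, k=100):
    # rankings: per-model lists of top-k image ids for one text query.
    # Each model casts one vote per image it retrieved; ties are broken
    # by the best (lowest) rank any model assigned to that image.
    votes = Counter()
    best_rank = {}
    for ranked in rankings:
        for rank, img in enumerate(ranked):
            votes[img] += 1
            best_rank[img] = min(best_rank.get(img, rank), rank)
    ordered = sorted(votes, key=lambda img: (-votes[img], best_rank[img]))
    return ordered[:k]
</preformat>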
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Implementation Details</title>
        <p>For each of the test datasets (RT, GDELT-P1, GDELT-P2), we submitted the results of five runs
separately. The implementation details are presented as follows:</p>
        <p>Run #1: Using the BLIP-2 model as the feature extractor, we encode article text and images
separately to obtain their features respectively. We rank the images by calculating the cosine
similarity between text and all test images. From these ranking results, we select the top 100
most relevant images as our predicted results.</p>
        <p>Run #2: Using the ViT-H/14 model of CLIP as the feature extractor, we encode article text
and images separately. We calculate the similarity between text features and all test image
features. We utilize the Dual Softmax method to calculate the similarity ranking between text
and images. The top 100 most relevant images are selected as our predicted results.</p>
        <p>Run #3: By designing a multi-task contrastive learning model, we process the test set
similarly to the training set. For each text, we calculate cosine similarity with all test images
and keep only the top 100 text-image pairs based on similarity as our predicted results.</p>
        <p>Run #4: As in Run #2, but we use the ViT-H/14@336px model of CLIP to encode article text (or
article titles) and images separately.</p>
        <p>Run #5: Building on Runs #2, #3, and #4, we retrained the three models. The results of each
model cover all texts, with each text corresponding to 100 images and the cosine similarity
between the text and each image. For a given text URL, we sum the cosine similarities of all
identical images across the three models, sort the results in descending order, and select the
top 100 most relevant images as our predicted results; a sketch of this fusion is given below.</p>
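        <p>The fusion step of Run #5 can be sketched as follows; the data layout (one dictionary of image-to-similarity scores per model for a given text URL) is an assumption about how the run files are organized.</p>
        <preformat>
from collections import defaultdict

def fuse_by_score_sum(model_results, k=100):
    # model_results: for one text URL, a list (one entry per model) of
    # dicts mapping image_id to its cosine similarity in that model's
    # top-100 list. Images retrieved by several models accumulate their
    # similarities, which pushes consensus images up the final ranking.
    total = defaultdict(float)
    for result in model_results:
        for img, score in result.items():
            total[img] += score
    ranked = sorted(total.items(), key=lambda kv: kv[1], reverse=True)
    return [img for img, _ in ranked[:k]]
</preformat>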
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Analysis</title>
      <p>We present the official evaluation results in Table 1 for the three testing datasets, using Recall@K
(where K=5, 10, 50, 100) and Mean Reciprocal Rank (MRR) as our evaluation metrics.</p>
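      <p>For reference, these metrics can be computed as follows; the sketch assumes a single relevant gold image per text query.</p>
      <preformat>
def recall_at_k(ranked_lists, gold, k):
    # Fraction of queries whose gold image appears in the top-k results.
    hits = sum(1 for q, ranked in ranked_lists.items() if gold[q] in ranked[:k])
    return hits / len(ranked_lists)

def mrr(ranked_lists, gold):
    # Mean reciprocal rank of the gold image (0 when it is not retrieved).
    total = 0.0
    for q, ranked in ranked_lists.items():
        if gold[q] in ranked:
            total += 1.0 / (ranked.index(gold[q]) + 1)
    return total / len(ranked_lists)
</preformat>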
      <p>The experimental results show that Run #1 (BLIP-2 model) performed the worst on all three
test sets. This indicates that the BLIP-2 model underperforms in zero-shot text-image retrieval
scenarios.</p>
      <p>Run #2 (CLIP’s ViT-H/14) significantly outperformed Run #1 (BLIP-2 model). In Run #4,
which is similar to Run #2, we use the CLIP’s ViT-H/14@336px model. The performance was
slightly better than that of the ViT-H/14 model used in Run #2, ranking 2nd. This suggests that in
text-image retrieval tasks, using a model with a higher input resolution correlates with better
matching accuracy.</p>
      <p>Run #3 (multi-task contrastive learning model) performed slightly worse on the three testing
datasets. We attribute this to the model mapping image-text features with a simple one-layer
fully connected network, which is overly simplistic.</p>
      <p>Run #5, which was based on Runs #2, #3, and #4, combined the predictive results of the three
models for the final decision. It scored the highest and performed the best on all three testing
datasets. This illustrates the effectiveness of our voting strategy within ensemble learning,
which played a pivotal role in mitigating overfitting risks and enhancing generalizability.</p>
      <p>Initially, we attempted to use the BLIP-2 model for image captioning on the images in the
training set, calculating the cosine similarity between the generated text and the original text.
We found that the similarity between them was very low. We also added prompts to enrich the
generated captions, but the results were still unsatisfactory.</p>
      <p>We examined the RT dataset and found that some images have a very limited correlation
with the news articles, as shown in Table 2. We suspect that the image-text pairs provided in the
RT dataset are less correlated than those in the GDELT datasets, which explains why our runs on
GDELT are generally better than those on the RT dataset.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this working note, we proposed different solutions and investigated the performance of
the BLIP-2 model, the CLIP model, and a multi-task contrastive learning model in this task. Our
findings revealed that for the datasets provided by the NewsImages task (characterized by small
scale and concentrated information), the CLIP model significantly outperforms both the BLIP-2
model and our designed multi-task contrastive learning model when the pre-trained models are
used only as feature extractors. Ongoing work is examining how enhancements to the BLIP-2
model could meet the challenges posed by text-image retrieval tasks.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Acknowledgements</title>
      <p>This work is funded by the Youth Project of the Shandong Provincial Natural Science Foundation
of China (ZR2023QF151).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Lommatzsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kille</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ö.</given-names>
            <surname>Özgöbek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Elahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.-T.</given-names>
            <surname>Dang-Nguyen</surname>
          </string-name>
          ,
          <article-title>News images in MediaEval 2023</article-title>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Savarese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hoi</surname>
          </string-name>
          ,
          <article-title>Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2301.12597</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Galanopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Mezaris</surname>
          </string-name>
          ,
          <article-title>Cross-modal networks and dual softmax operation for MediaEval NewsImages 2022</article-title>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Goh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          , et al.,
          <article-title>Learning transferable visual models from natural language supervision</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>8748</fpage>
          -
          <lpage>8763</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Savarese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. C.</given-names>
            <surname>Hoi</surname>
          </string-name>
          ,
          <article-title>Lavis: A library for language-vision intelligence</article-title>
          ,
          <source>arXiv preprint arXiv:2209.09019</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Galanopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Mezaris</surname>
          </string-name>
          ,
          <article-title>Are all combinations equal? combining textual and visual features with multiple space learning for text-based video retrieval</article-title>
          ,
          <source>in: European Conference on Computer Vision</source>
          , Springer,
          <year>2022</year>
          , pp.
          <fpage>627</fpage>
          -
          <lpage>643</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>