<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Integrated Multi-stage Contextual Attention Network for Text-Image Matching</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yi Shao</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yawen Chen</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yang Zhang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tianlin Zhang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xuan Zhang</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ye Jiang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jing Li</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jiande Sun</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Qingdao University of Science and Technology</institution>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Shandong Normal University</institution>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Shandong Police College</institution>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>For the NewsImages track of the MediaEval competition, we introduce the Integrated Multi-stage Contextual Attention Network (IMCAN) for effective text-image matching. Our method harnesses the representational power of BERT and the Vision Transformer to extract textual and visual features, which are then enhanced by a series of Transformer encoders integrated with a contextual multi-modal attention mechanism. This architecture improves the alignment and fusion of modality-specific features, which is crucial for the cross-modal retrieval task. An end-to-end training strategy optimizes the model for precise feature matching and supports the robustness of our approach. In our experimental evaluation, we primarily use MRR and mean Recall@K to measure performance. Comparison with CLIP results suggests that there is still considerable room for improving our model on the text-image matching task, but also that purpose-built architectures remain worth exploring for text-image matching.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>News articles frequently combine text and images to disseminate information, with textual
narratives incorporating visual representations to attract attention and help readers understand
the content more intuitively. Image-text matching measures the semantic similarity between
images and their accompanying text, and it is especially critical for cross-modal retrieval tasks.
Although this area has seen considerable progress in recent years, image-text matching continues
to pose a significant challenge due to intricate matching patterns and pronounced semantic
divergences between the two modalities. The NewsImages task within MediaEval 2023 explores
this challenge.</p>
      <p>In this paper, we propose a novel Integrated Multi-stage Contextual Attention Network
(IMCAN) for the text-image matching task, offering a comprehensive solution to its intricate
demands by building a deep synergy between the two modalities. The multi-stage feature
extraction and fusion approach, coupled with cosine similarity as the text-image distance measure,
effectively selects images with a high degree of similarity to the text, and the model’s performance
on the three datasets highlights its capability for text-image matching.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Text-image matching has attracted extensive research in the multimedia research community,
and the emergence of deep learning techniques has significantly improved performance in
this area. Tom Sühr et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] devised a method to embed text and image inputs into a unified
embedding space, enabling matching of text-image pairs through distance or similarity metrics.
Yuta Fukatsu et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] leveraged the ADAPT model for improved cross-modal retrieval in
image-text matching, utilizing Swin Transformer and DistilBERT for extracting image and text
features, respectively.
      </p>
      <p>
        Nikolaos Sarafianos et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] introduced TIMAM, a text-image modal adversarial matching
         method, which exploits both adversarial and cross-modal matching objectives to learn modality-
invariant feature representations and demonstrates that BERT, a publicly available language
model for extracting word embeddings, can be successfully applied to the field of text-to-image
matching. Notably, the emergence of the CLIP [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] model has inspired several enhancements in
image-text matching models, harnessing its powerful capabilities for improved performance [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Approach</title>
      <p>
        In the NewsImages track of the MediaEval 2023 competition, we introduce the IMCAN model
for cross-modal retrieval, which integrates multi-stage Transformer encoders and a
Multi-modal Contextual Attention Network (MCAN) for refining and fusing the textual and visual
features extracted via BERT [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and ViT [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The overall architecture is illustrated in Figure 1.
      </p>
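      <p>The working notes do not include source code; the sketch below is a minimal PyTorch reading of this pipeline, assuming pre-extracted BERT token features and ViT patch features of width 768. The stage count, the use of nn.TransformerEncoder, the mean pooling, and all hyper-parameters are illustrative assumptions rather than the authors' exact configuration.</p>
      <preformat>
# Minimal sketch of an IMCAN-style forward pass (illustrative assumptions, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class IMCANSketch(nn.Module):
    def __init__(self, dim=768, num_stages=2, num_heads=8):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                                        batch_first=True)
        # One Transformer encoder per stage and per modality.
        self.text_stages = nn.ModuleList(
            [nn.TransformerEncoder(make_layer(), num_layers=1) for _ in range(num_stages)])
        self.image_stages = nn.ModuleList(
            [nn.TransformerEncoder(make_layer(), num_layers=1) for _ in range(num_stages)])

    def forward(self, text_feats, image_feats):
        # text_feats:  (batch, text_len, dim), e.g. BERT last hidden states
        # image_feats: (batch, num_patches, dim), e.g. ViT patch embeddings
        for t_enc, v_enc in zip(self.text_stages, self.image_stages):
            text_feats = t_enc(text_feats)
            image_feats = v_enc(image_feats)
        # Mean-pool each modality and L2-normalise so the dot product is cosine similarity.
        t = F.normalize(text_feats.mean(dim=1), dim=-1)
        v = F.normalize(image_feats.mean(dim=1), dim=-1)
        return t @ v.t()  # (batch, batch) text-to-image similarity matrix

# Example with random stand-ins for BERT/ViT outputs.
model = IMCANSketch(num_stages=2)
sims = model(torch.randn(4, 32, 768), torch.randn(4, 197, 768))
print(sims.shape)  # torch.Size([4, 4])
      </preformat>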
      <p>Specifically, we construct a multi-stage Transformer encoder designed to further deepen and
refine the textual and visual features extracted via BERT and ViT. Each stage of the encoder
concentrates on extracting higher-level semantic representations, thereby enhancing the model’s
capability to comprehend complex semantic information. This multi-stage feature extraction
approach aids our model in capturing richer and more nuanced features. To facilitate effective
fusion of textual and image features, we introduce a Multi-modal Contextual Attention Network
(MCAN). This network comprises two contextual attention blocks, which process the text and
image information from different Transformer encoding stages. These blocks accurately align
and integrate features from both modalities, enhancing the model’s ability to capture cross-modal
correlations. MCAN not only fuses multi-modal information but also ensures the richness and
consistency of the final feature representation, providing robust support for retrieval tasks.</p>
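      <p>The paper does not spell out the internals of the contextual attention blocks; the following sketch is one plausible reading in PyTorch, in which each block lets one modality attend to the other through cross-attention and fuses the result residually. The block structure, residual connection, and layer normalisation are assumptions made for illustration.</p>
      <preformat>
# Illustrative sketch of the two contextual attention blocks in MCAN (an assumed formulation).
import torch
import torch.nn as nn

class ContextualAttentionBlock(nn.Module):
    """One modality attends over the other; the result is fused residually."""
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_feats, context_feats):
        # query_feats attends over context_feats coming from another encoding stage.
        attended, _ = self.cross_attn(query_feats, context_feats, context_feats)
        return self.norm(query_feats + attended)

class MCANSketch(nn.Module):
    """Two blocks: text conditioned on image features, and image conditioned on text features."""
    def __init__(self, dim=768):
        super().__init__()
        self.text_block = ContextualAttentionBlock(dim)
        self.image_block = ContextualAttentionBlock(dim)

    def forward(self, text_feats, image_feats):
        fused_text = self.text_block(text_feats, image_feats)
        fused_image = self.image_block(image_feats, text_feats)
        return fused_text, fused_image

mcan = MCANSketch()
t, v = mcan(torch.randn(4, 32, 768), torch.randn(4, 197, 768))
print(t.shape, v.shape)  # torch.Size([4, 32, 768]) torch.Size([4, 197, 768])
      </preformat>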
    </sec>
    <sec id="sec-4">
      <title>4. Results and Analysis</title>
      <sec id="sec-4-1">
        <title>4.1. Comparative Experiment</title>
        <p>Table 1 shows the performance of CLIP fine-tuned on the three datasets. CLIP shows excellent
retrieval performance on GDELT-1 and degrades slightly on GDELT-2, but still maintains high
retrieval recall on the larger subsets. Standard CLIP performs poorly when applied directly to
the German RT dataset because it was trained on English data. We therefore applied standard
CLIP to the machine-translated English text of the RT dataset and multilingual CLIP to the
original German text. The results indicate that machine-translating the German text into English
and then fine-tuning standard CLIP is an effective strategy.</p>
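        <p>For reference, MRR and Recall@K over a text-to-image similarity matrix can be computed as in the short sketch below; this is a generic implementation written for illustration, not the official evaluation script of the track.</p>
        <preformat>
# Generic MRR and Recall@K over a text-to-image similarity matrix (not the official scorer).
import numpy as np

def mrr_and_recall(sims, ks=(5, 10)):
    # sims[i, j]: similarity between text i and image j; image i is the ground-truth match.
    n = sims.shape[0]
    # Rank of the correct image for each text (1 = retrieved first).
    ranks = np.array([
        int(np.where(np.argsort(-sims[i]) == i)[0][0]) + 1 for i in range(n)
    ])
    metrics = {"MRR": float(np.mean(1.0 / ranks))}
    for k in ks:
        metrics[f"Recall@{k}"] = float(np.mean(np.less_equal(ranks, k)))
    return metrics

# Example on a random 100-image candidate subset.
print(mrr_and_recall(np.random.rand(100, 100)))
        </preformat>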
        <p>Table 2 shows the performance of the proposed IMCAN model on the two English datasets.
Limited by time and computing resources, we were only able to submit the results trained on the
English datasets to the organizers in time. The main function of the proposed model is to project
text features and image features into a shared feature space, so we trained it with cosine similarity
loss and with triplet loss, respectively. Specifically, for training with triplet loss, we augment each
text and image separately using the nlpaug package and affine transformations, generating 5
augmented samples as the positive sample set. We randomly select 100 samples from all
non-matching samples and choose the 5 samples with the lowest loss as the negative sample set
to construct the triplets. The triplet results shown in Table 2 are the best results obtained by
sweeping the margin from 0.1 to 0.9 in steps of 0.1. Table 2 shows that the MRR of the model
trained with cosine similarity loss is higher than that of the model trained with triplet loss. When
the test subset is small, the recall of the model trained with cosine similarity loss is significantly
higher than that of the triplet-loss model. When the subset size reaches 100, the performance of
the models trained with the two losses is close. This suggests that it is difficult for the triplet
loss to produce an optimal match among similar samples in this task, although it can still find
relatively similar options among a large number of candidates.</p>
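        <p>The exact loss implementations are not given in the paper; the sketch below shows the two objectives in PyTorch for a single text-image pair. The hard-negative selection here keeps the most similar of 100 sampled candidates, which is a simplification of the loss-based selection described above, and the margin, sample counts, and embedding size are placeholder assumptions.</p>
        <preformat>
# Sketch of the two training objectives (hyper-parameters and negative selection are assumptions).
import torch
import torch.nn.functional as F

def cosine_similarity_loss(text_emb, image_emb):
    # Push matched text/image embeddings towards cosine similarity 1.
    return (1.0 - F.cosine_similarity(text_emb, image_emb, dim=-1)).mean()

def triplet_loss_with_mining(text_emb, image_emb, candidate_images,
                             margin=0.3, num_sampled=100, num_hard=5):
    # candidate_images: (num_candidates, dim) pool of non-matching image embeddings.
    idx = torch.randperm(candidate_images.size(0))[:num_sampled]
    sampled = candidate_images[idx]                                 # (num_sampled, dim)
    neg_sims = F.cosine_similarity(text_emb.unsqueeze(0), sampled, dim=-1)
    hard_negs = sampled[neg_sims.topk(num_hard).indices]            # most similar negatives
    pos_sim = F.cosine_similarity(text_emb, image_emb, dim=0)
    neg_sim = F.cosine_similarity(text_emb.unsqueeze(0), hard_negs, dim=-1)
    # Standard margin-based triplet objective over the mined negatives.
    return F.relu(margin - pos_sim + neg_sim).mean()

t, v = torch.randn(768), torch.randn(768)
pool = torch.randn(500, 768)  # placeholder pool of non-matching image embeddings
print(cosine_similarity_loss(t.unsqueeze(0), v.unsqueeze(0)).item())
print(triplet_loss_with_mining(t, v, pool).item())
        </preformat>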
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Ablation Experiment</title>
        <p>According to Table 1 and Table 2, even though the proposed model shares obvious structural
commonalities with CLIP, its performance on the English datasets lags significantly behind CLIP.
In addition to differences in computing resources, such as the amount of training data and batch
size, we also examined whether the structure of the model itself is reasonable. Specifically, we ran
ablation experiments with different numbers of stages in IMCAN to study whether the current
number of stages is appropriate.</p>
        <p>Table 3 shows the performance for different numbers of stages. Here, 0 stages means directly
computing the similarity between the text features extracted by BERT and the image features
extracted by ViT. These ablation models were trained on the GDELT-1 dataset with a training
set:test set ratio of 8:2 for 100 epochs using the cosine similarity loss, and all of them converged.
Table 3 demonstrates that as the number of stages increases from 0 to 2, the MRR and recall
for all subset sizes improve steadily. However, as the number of stages increases from 2 to 4,
the MRR and recall for all subset sizes decline steadily. With 3 or 4 stages, we found that
although the training loss steadily decreased, the MRR and Recall@K on the test set improved
only slowly and with severe fluctuations, and this phenomenon was more pronounced in the
4-stage model than in the 3-stage model. This indicates that with 3 or 4 stages the model may have
memorized too much noise due to its excessive number of parameters, leading to overfitting.
Therefore, we did not attempt models with 5 or more stages and consider the 2-stage model
optimal.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>ACKNOWLEDGMENTS</title>
      <p>We thank the organizers of MediaEval 2023, and especially the organizers of the NewsImages task.
This work was supported in part by the Joint Project for Innovation and Development of the
Shandong Natural Science Foundation (Grant No. ZR2022LZH012), and in part by the Joint Project
for Smart Computing of the Shandong Natural Science Foundation (Grant No. ZR2020LZH015).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Sühr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Madhavanr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. J.</given-names>
            <surname>Avanaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Berk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lommatzsch</surname>
          </string-name>
          ,
          <article-title>Image-Text Rematching for News Items using Optimized Embeddings and CNNs in MediaEval NewsImages 2021</article-title>
          , in: Proceedings of the MediaEval workshop, CEUR-WS.org,
          <year>2021</year>
          . URL: https://ceur-ws.org/Vol-3181/paper11.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fukatsu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Aono</surname>
          </string-name>
          ,
          <article-title>Image-Text Re-Matching Using Swin Transformer and DistilBERT</article-title>
          , in: Proceedings of the MediaEval workshop, CEUR-WS.org,
          <year>2021</year>
          . URL: https://ceur-ws.org/Vol-3181/paper26.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Sarafianos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. A.</given-names>
            <surname>Kakadiaris</surname>
          </string-name>
          ,
          <article-title>Adversarial Representation Learning for Text-to-Image Matching</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF international conference on computer vision</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>5814</fpage>
          -
          <lpage>5824</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Goh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          , et al.,
          <article-title>Learning Transferable Visual Models from Natural Language Supervision</article-title>
          , in: International conference on machine learning,
          <source>PMLR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>8748</fpage>
          -
          <lpage>8763</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>CLIP Pre-trained Models for Cross-modal Retrieval in NewsImages 2022</article-title>
          , in: Proceedings of the MediaEval workshop, CEUR-WS.org,
          <year>2022</year>
          . URL: https://2022.multimediaeval.com/paper3975.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          , arXiv preprint arXiv:1810.04805 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Heigold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          , et al.,
          <article-title>An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale</article-title>
          , arXiv preprint arXiv:2010.11929 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>