<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Cross-modal Networks and Dual Softmax Operation for MediaEval NewsImages 2022</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Damianos Galanopoulos</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vasileios Mezaris</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Information Technologies Institute / Centre for Research &amp; Technology Hellas</institution>
          ,
          <addr-line>Thessaloniki</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Matching images to articles is challenging and can be considered a special version of the cross-media retrieval problem. This working note paper presents our solution for the MediaEval NewsImages benchmarking task. We investigated the performance of two cross-modal networks, a pre-trained network and a trainable one, the latter originally developed for text-video retrieval tasks and adapted to the NewsImages task. Moreover, we utilize a method for revising the similarities produced by either one of the cross-modal networks, i.e., a dual softmax operation, to improve our solutions' performance. We report the official results for our submitted runs and additional experiments we conducted to evaluate our runs internally.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Text-image association is a challenging task that has gained a lot of interest in recent years.
The task has been extensively examined in the multimedia research community, e.g., see [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ],
and there is consensus that the evolution of deep learning methods has boosted performance.
Indicative relevant methods include [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], where an object detector is pre-trained to encode
images and visual objects on images and a cross-modal model is trained to associate visual and
textual features; and [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], where a context-aware attention network is proposed that focuses on
important regions within images to extract possible correlations between image regions and
words.
      </p>
      <p>
        NewsImages is a relatively new and highly specific task, and limited research has been done
on it. Focusing on the previous year’s NewsImages participations, HCMUS [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] proposed a
solution based on the power of the pre-trained model CLIP [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] along with sophisticated text
preprocessing, which achieved the best performance. In [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] a visual topic model was proposed
to align topics illustrated on images with textual topics using knowledge distillation training.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Approach</title>
      <sec id="sec-3-1">
        <title>3.1. Data pre-processing</title>
        <p>
          We preprocess both training and testing textual data in order to fully exploit our approach's
power. First, we use the language detector of the lingua Python package to detect each article's
language. Then we use a translation model from the Hugging Face Transformers package [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] to
translate German articles (title and text) into English.
        </p>
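        <p>
          For illustration, a minimal Python sketch of this pre-processing step is given below. It assumes the lingua-language-detector and transformers packages; the Helsinki-NLP/opus-mt-de-en checkpoint is only an example choice, since the exact translation model is not specified above.
        </p>
        <preformat>
# Illustrative sketch of the pre-processing described above (not necessarily the
# exact pipeline of the submitted runs). The opus-mt-de-en checkpoint is an
# assumed example; any German-to-English translation model could be used.
from lingua import Language, LanguageDetectorBuilder
from transformers import pipeline

detector = LanguageDetectorBuilder.from_languages(
    Language.ENGLISH, Language.GERMAN
).build()
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")

def preprocess_article(title: str, text: str) -> tuple[str, str]:
    """Translate German articles (title and text) into English."""
    if detector.detect_language_of(title + " " + text) == Language.GERMAN:
        title = translator(title)[0]["translation_text"]
        text = translator(text)[0]["translation_text"]
    return title, text
        </preformat>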
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Pre-trained model: CLIP</title>
        <p>
          We utilize an open-source implementation of CLIP [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], the openCLIP [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], as our pre-trained
model. To obtain text and image feature representations, we use the ViT-H/14 pre-trained
model. For a given article, in order to retrieve the most relevant images from the test set, we
calculate the cosine similarity between the article’s title (or article’s text) CLIP embedding and
the embeddings of all test images. Then the top-100 most relevant images are selected in a
ranked list, from the most relevant to the least relevant image.
        </p>
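        <p>
          A minimal sketch of this retrieval step with the open_clip (open_clip_torch) package is shown below; the laion2b_s32b_b79k ViT-H/14 weights are an assumption about the exact checkpoint, and the function is illustrative rather than our actual implementation.
        </p>
        <preformat>
# Sketch: rank test images for one article by CLIP cosine similarity (top-100).
import open_clip
import torch
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k"  # assumed checkpoint
)
tokenizer = open_clip.get_tokenizer("ViT-H-14")

def rank_images(article_title, image_paths, top_k=100):
    """Return the top_k image paths, ordered from most to least similar."""
    with torch.no_grad():
        text_emb = model.encode_text(tokenizer([article_title]))
        imgs = torch.stack([preprocess(Image.open(p)) for p in image_paths])
        img_emb = model.encode_image(imgs)
    # Cosine similarity = dot product of L2-normalised embeddings.
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    sims = (img_emb @ text_emb.T).squeeze(1)
    order = sims.argsort(descending=True)[:top_k]
    return [image_paths[int(i)] for i in order]
        </preformat>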
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Trainable model: T×V</title>
        <p>
          In parallel to CLIP, we examined a modification of the T×V model [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], adapted to deal with images instead of videos. The T×V model utilizes textual and visual
features and encodes them into multiple joint feature spaces. In these spaces, instances from
different modalities (e.g., textual snippets, images, etc.) are directly comparable; thus, their
similarity can be calculated. In contrast to the original version of T×V, here we treat the image
as a special video that consists of only one frame. Moreover, we use only one textual and one
image feature (obtained from the openCLIP ViT-H/14 pre-trained model) as the initial
representation instead of multiple ones. In essence, in this way we try to adapt the pre-trained
CLIP representations specifically to the NewsImages task.
        </p>
        <p>
          Since the NewsImages-provided training datasets are relatively small, we first utilize a large
dataset that contains news articles, images, and captions, the NYTimes800k dataset [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], to
pre-train our T×V model. Subsequently, we merge the NewsImages-provided training datasets
and split this overall dataset in an 80-20% manner to fine-tune our model. We use the 80%
portion of the dataset to train the model and the remaining 20% to validate the performance of
our approach for selecting the best possible model.
        </p>
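        <p>
          To make the multiple-joint-spaces idea concrete, the following toy sketch projects single text and image embeddings into a few joint spaces and averages the per-space cosine similarities. It only illustrates the general principle and is not the T×V architecture of [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]; the number of spaces and their dimensions are placeholders.
        </p>
        <preformat>
# Toy multi-space text-image similarity (illustration only, not the T×V model).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiSpaceTextImage(nn.Module):
    def __init__(self, feat_dim=1024, space_dims=(512, 256)):  # placeholder sizes
        super().__init__()
        self.text_proj = nn.ModuleList([nn.Linear(feat_dim, d) for d in space_dims])
        self.image_proj = nn.ModuleList([nn.Linear(feat_dim, d) for d in space_dims])

    def forward(self, text_feat, image_feat):
        # text_feat: (num_articles, feat_dim); image_feat: (num_images, feat_dim)
        sims = []
        for tp, ip in zip(self.text_proj, self.image_proj):
            t = F.normalize(tp(text_feat), dim=-1)
            v = F.normalize(ip(image_feat), dim=-1)
            sims.append(t @ v.T)  # cosine similarity in this joint space
        return torch.stack(sims).mean(dim=0)  # (num_articles, num_images)
        </preformat>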
      </sec>
      <sec id="sec-3-3">
        <title>3.4. Dual-softmax similarity revision</title>
        <p>In order to improve the performance of our method we utilize a similarity revision approach at
the retrieval stage, both for CLIP and T×V. We calculate the similarities between all images
from the test set and all testing articles, resulting in a similarity matrix Z ∈ ℝ^(N_a × N_i), where
N_a is the number of testing article queries and N_i the number of test images. To revise
the calculated similarities, we apply two cross-dimension softmax operations (one row-wise:
dim = 0, and one column-wise: dim = 1) as follows:</p>
        <p>Z* = Softmax(Z, dim = 0) ⊙ Softmax(Z, dim = 1),
where ⊙ denotes the Hadamard product.</p>
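        <p>
          For reference, a minimal PyTorch sketch equivalent to the formula above is:
        </p>
        <preformat>
import torch

def dual_softmax(Z: torch.Tensor) -> torch.Tensor:
    """Revise an (articles x images) similarity matrix Z by the element-wise
    (Hadamard) product of a softmax over dim=0 and a softmax over dim=1."""
    return torch.softmax(Z, dim=0) * torch.softmax(Z, dim=1)

# Example: Z_star = dual_softmax(torch.randn(100, 500))
        </preformat>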
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Submitted Runs and Results</title>
      <p>We submitted five runs for each testing dataset (TW, RT, RSS), as detailed below:
• Run #1 (iti_certh_clip_run_1): This uses the text and image CLIP embeddings and
calculates the cosine similarity between the embedding of an article and all images. Then
for each article, the 100 most relevant images are selected.
• Run #2 (iti_certh_clip_ds_run_2): As Run #1, additionally using the dual softmax (DS)
revision method to recalculate the article-image similarities.
• Run #3 (iti_certh_TxV_run_3): We train the T×V model using a merged dataset
consisting of 80% of the three provided training datasets. We use this trained model
to calculate the T×V article title and image embeddings for the three testing datasets. Finally,
we use the cosine similarity to compute the similarities between a testing article and all
images, and the 100 most relevant images are selected.
• Run #4 (iti_certh_TxV_ds_run_4): Similarly to Run #3, additionally using dual softmax
revision to revise the computed similarities.
• Run #5 (iti_certh_TxV_text_ds_run_5): Similarly to Run #4 but using the full text of the
articles instead of just the title that was used in all the above runs.</p>
      <p>We present the official results on the three testing datasets, as well as results from the internal
experiments we conducted in order to evaluate our methods and select our final runs. The Recall@K,
where K = 5, 10, 50, 100, and the Mean Reciprocal Rank (MRR) are used as evaluation metrics.</p>
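      <p>
        For clarity, a simple sketch of how these metrics can be computed from ranked result lists is given below; it is an illustrative implementation, not the official evaluation script of the task.
      </p>
      <preformat>
# ranked_lists: {article_id: [image_id, ...]} ordered from most to least relevant.
# ground_truth: {article_id: correct_image_id}
def recall_at_k(ranked_lists, ground_truth, k):
    """Fraction of articles whose correct image appears in the top-k results."""
    hits = sum(1 for q, ranked in ranked_lists.items() if ground_truth[q] in ranked[:k])
    return hits / len(ranked_lists)

def mean_reciprocal_rank(ranked_lists, ground_truth):
    """Mean of 1/rank of the correct image (contributes 0 if not retrieved)."""
    total = 0.0
    for q, ranked in ranked_lists.items():
        if ground_truth[q] in ranked:
            total += 1.0 / (ranked.index(ground_truth[q]) + 1)
    return total / len(ranked_lists)
      </preformat>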
      <p>Table 1.A presents the results on the three testing datasets evaluated officially by the task
organizers. Run #2 (CLIP + DS) performs the best on all datasets in MRR terms and on the RSS
and RT datasets in Recall@K terms, while on the TW dataset the results are mixed. The dual softmax
operation is beneficial for the raw CLIP embeddings, but it has limited effect on our trainable
solution (T×V). Moreover, Run #5 (T×V using the articles’ full text) achieves lower scores than
the other runs on the RSS and RT datasets, but on the TW dataset it performs comparably to Runs
#3 and #4.</p>
      <p>The above official results contrast with the findings of our internal experiments, conducted
prior to the release of the official results. Table 1.B presents our internal results on the 20%
of the provided training dataset (using the remaining 80% for training and validation where
necessary). We conducted these experiments to select our best-trained models and examine our
runs’ performance. From these preliminary experiments, we had concluded that Runs #3 and #4
consistently outperform the rest of the runs on every dataset, i.e., our training step seemed to be
beneficial for performance.</p>
      <p>The difference in data distribution between the task’s official training and testing datasets could explain
the contrast between the official results and our internal findings. Our experiments were conducted on
an 80-20% split of the official training set, so our internal-experiments test set is closely related
to our training set, and this is beneficial for our experiments. In contrast, the official test set is
probably quite different, as it was collected at a much later time than the training set; in this
case the original CLIP model, which was trained on much larger and more diverse datasets, is
more suitable for this task.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this work we proposed a solution for the MediaEval NewsImages task using state-of-the-art
text and image representations calculated from a pre-trained cross-modal network, a
task-adapted trainable cross-modal network, and a similarity revision approach. We concluded from
the official evaluation results that utilizing cutting-edge models trained on huge-scale
datasets (i.e., CLIP) performs better than our cross-modal network, which is trained on a
rather small but task-specific dataset. Moreover, our proposed DS similarity revision approach
was shown to improve performance.</p>
      <p>In our future work we will aim to improve textual pre-processing, combine more text-video
and text-image retrieval methods, and introduce explainable AI methods in order to achieve
improved results and to better understand which model components influence the results the
most.</p>
      <p>Acknowledgements This work was supported by the EU Horizon 2020 programme under
grant agreement H2020-101021866 CRiTERIA.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Kille</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lommatzsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Ozgobek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Elahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.-T.</given-names>
            <surname>Dang-Nguyen</surname>
          </string-name>
          ,
          <article-title>News Images in MediaEval 2022</article-title>
          , in:
          <source>Proceedings of the MediaEval 2022 Workshop</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Goh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          , et al.,
          <article-title>Learning Transferable Visual Models From Natural Language Supervision</article-title>
          ,
          <source>in: Proceedings of the 38th Int. Conf. on Machine Learning (ICML)</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Galanopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Mezaris</surname>
          </string-name>
          ,
          <article-title>Are all combinations equal? Combining textual and visual features with multiple space learning for text-based video retrieval</article-title>
          , in: Computer Vision - ECCVW 2022, Springer,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Borah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Baruah</surname>
          </string-name>
          ,
          <article-title>Image Retrieval Using Neural Networks for Word Image Spotting-A Review</article-title>
          ,
          <source>Machine Learning in Information and Communication Technology</source>
          (
          <year>2023</year>
          )
          <fpage>243</fpage>
          -
          <lpage>268</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>K.</given-names>
            <surname>Ueki</surname>
          </string-name>
          ,
          <article-title>Survey of Visual-Semantic Embedding Methods for Zero-Shot Image Retrieval</article-title>
          ,
          <source>in: 2021 20th IEEE International Conference on Machine Learning and Applications (ICMLA)</source>
          , IEEE,
          <year>2021</year>
          , pp.
          <fpage>628</fpage>
          -
          <lpage>634</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <article-title>VinVL: Revisiting visual representations in vision-language models</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>5579</fpage>
          -
          <lpage>5588</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Context-aware attention network for image-text retrieval</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>3536</fpage>
          -
          <lpage>3545</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ngô</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. D.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Huynh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. T.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <article-title>HCMUS at MediaEval 2021: Fine-tuning CLIP for Automatic News-Images Re-Matching</article-title>
          ,
          <source>in: Working Notes Proceedings of the MediaEval 2021 Workshop</source>
          , Online, 13-15 December 2021, volume
          <volume>3181</volume>
          <source>of CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Pivovarova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Zosa</surname>
          </string-name>
          ,
          <article-title>Visual Topic Modelling for NewsImage Task at MediaEval 2021</article-title>
          ,
          <source>in: Working Notes Proceedings of the MediaEval 2021 Workshop</source>
          , Online, 13-15 December 2021, volume
          <volume>3181</volume>
          <source>of CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          , et al.,
          <article-title>Transformers: State-of-the-Art Natural Language Processing</article-title>
          ,
          <source>in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations</source>
          ,
          Association for Computational Linguistics
          , Online,
          <year>2020</year>
          , pp.
          <fpage>38</fpage>
          -
          <lpage>45</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>G.</given-names>
            <surname>Ilharco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wortsman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wightman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gordon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Carlini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Taori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Shankar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Namkoong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hajishirzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <source>OpenCLIP</source>
          ,
          <year>2021</year>
          . URL: https://doi.org/10.5281/zenodo.7439141.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mathews</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <article-title>Transform and Tell: Entity-Aware News Image Captioning</article-title>
          , in: IEEE/CVF Conference on
          <source>Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>