<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Exploring a Pre-trained Model for Re-Matching News Texts and Images</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mingliang Liang</string-name>
          <email>mingliang.liang@ru.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martha Larson</string-name>
          <email>martha.larson@ru.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Radboud University</institution>
          ,
          <country country="NL">Netherlands</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <fpage>13</fpage>
      <lpage>15</lpage>
      <abstract>
<p>We investigate the use of a pre-trained model to address the task of re-matching images and texts collected from online news sites. Our aim is to explore the potential of pre-training in learning the connection between the visual and textual modalities. Online news is challenging because it covers a large number of semantic concepts and because the correlation between the modalities can be weak. The results show that the proposed method performs well in text-image retrieval, reaching 46.58% (R@100).</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Pre-trained models are used for various vision and language tasks [
        <xref ref-type="bibr" rid="ref10 ref15 ref5 ref7">5,
7, 10, 15</xref>
        ]. They have proven effective in downstream tasks, such
as image-text retrieval and Visual Question Answering (VQA).
Pre-trained models have several advantages. First, they can increase
performance on small datasets, where we need to improve the
generalization ability of the model. An important example of a domain
in which data is limited is news. Topics in the news develop rapidly
and the amount of up-to-date data at any given moment is
naturally limited. Second, the basic elements of images and text are not
always associated in the same way across data sets. Investigating
pre-trained models can also help us to better understand the
connection between image and text modalities. Third, a pre-trained
model can be quickly validated on a new dataset without the need
to spend a lot of time and computing power on retraining.
      </p>
      <p>
        In this paper, we use a pre-trained model to address image-text
re-matching, a subtask of NewsImages at MediaEval 2021 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. We
evaluate our approach on a test set containing 1915 articles that
was collected from online news sites. Each article consists of an
image and an associated text (title and snippet), cf. Figure 1. In the
test data, the text has been disassociated from the images, and the
task is to re-match them.
      </p>
      <p>The task is challenging because there is not a 1-to-1 relationship
between the concepts depicted in the images and those described
in the text. As illustrated by Figure 1, only a few concepts may
be common to both image and text. Also, in the collection a very
broad range of topics and concepts are present. Our hope is that
pre-training introduces prior knowledge that allows more effective
learning of the relationship between text and images.</p>
    </sec>
    <sec id="sec-2">
      <title>RELATED WORK</title>
      <p>
        Cross-modal retrieval generally leverages a common space.
Individual modalities express similar semantics differently, and the
common space homogenizes the semantic representation. VSE++,
DeViSE, and CLIP learn a robust shared space in which the learned
feature representations can be compared across modalities while
preserving the correlations in paired samples [
        <xref ref-type="bibr" rid="ref13 ref3 ref4">3, 4, 13</xref>
        ].
      </p>
      <p>
        ViT shows that by mapping sequences of image patches to
embeddings, which replace the word embeddings as input, the
transformer can perform very well in visual tasks [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Inspired by ViT,
ViLT uses patch projection embedding to encode images [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Then,
in vision and language interaction tasks, transformers are used
instead of dot products for interaction between features [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. This
interaction method can not only increase the model’s reasoning
speed, but can also capture detailed relationships between vision
and language. Building on this work, we used ViLT to address the
Image-Text Re-Matching subtask of MediaEval 2021.
      </p>
    </sec>
    <sec id="sec-3">
      <title>APPROACH</title>
    </sec>
    <sec id="sec-4">
      <title>Dataset and data pre-processing</title>
      <p>
        The dataset comprises 7530 training samples and 1915 test
samples, released for the MediaEval 2021 Image-Text Re-Matching
subtask [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. We crawl the images ourselves using the URLs
provided by the task organizers and drop the 404 Not Found URLs
from the training and test sets. For compatibility with the pre-trained
model, we translate the text into English using Google Translate.
      </p>
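      <p>A minimal sketch of this pre-processing step is given below. The CSV layout and the column name image_url are illustrative assumptions, not the exact format of the task release.</p>
      <preformat>
# Sketch of the crawling step; the metadata layout is a hypothetical
# CSV with an "image_url" column, not the exact task release format.
import csv
import os
import requests

def crawl_images(csv_path, out_dir):
    """Download article images, dropping entries whose URL returns 404."""
    os.makedirs(out_dir, exist_ok=True)
    kept = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            resp = requests.get(row["image_url"], timeout=10)
            if resp.status_code == 404:
                continue  # drop 404 Not Found URLs from the set
            image_path = os.path.join(out_dir, os.path.basename(row["image_url"]))
            with open(image_path, "wb") as img:
                img.write(resp.content)
            kept.append({**row, "image_path": image_path})
    return kept
      </preformat>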
    </sec>
    <sec id="sec-5">
      <title>Model</title>
      <p>In this section, we describe the architecture of the model that we
used to address the Image-Text Re-Matching subtask. The model
consists of a text encoder, an image encoder, and a transformer for
modality interaction.</p>
      <p>[Figure 1: Example article from the dataset. Text: “The refugee home has been on Langenbergstrasse in Blumenberg since the end of 2014. Originally - and announced by the city at the time - the residential containers were to remain for two years.”]</p>
      <p>
        Text encoder: The text is processed in the same way as BERT [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>A modality-type embedding is introduced to distinguish the
modalities. The text encoder consists of three parts: word
embedding, token position embedding, and modality-type embedding.</p>
      <p>
        First, it converts the input tokens to embeddings with a word
embedding matrix. Then, the word embedding is summed with the
position embedding and modality-type embedding to create the
input of the encoder [
        <xref ref-type="bibr" rid="ref10 ref16">10, 16</xref>
        ].
      </p>
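      <p>As a minimal sketch, assuming illustrative sizes rather than the exact configuration of the model we used, the summation of the three embeddings can be written in PyTorch as follows:</p>
      <preformat>
import torch
import torch.nn as nn

class TextEmbedding(nn.Module):
    """Sketch of the BERT-style text input: word, token position, and
    modality-type embeddings are summed (sizes are illustrative)."""
    def __init__(self, vocab_size=30522, hidden=768, max_len=40, n_modalities=2):
        super().__init__()
        self.word = nn.Embedding(vocab_size, hidden)
        self.pos = nn.Embedding(max_len, hidden)
        self.mod = nn.Embedding(n_modalities, hidden)  # 0 = text, 1 = image

    def forward(self, token_ids):  # token_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        modality = torch.zeros_like(token_ids)  # text modality id = 0
        return self.word(token_ids) + self.pos(positions) + self.mod(modality)
      </preformat>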
      <p>Image encoder: Inspired by ViT, the model uses a 32 × 32 patch
projection to embed the image. The patch projection slices the
image into patches, flattens the patches, and maps them into the
hidden embedding dimension with a trainable linear projection.</p>
      <p>
In other words, patch projections, rather than heavyweight region
or grid features, are used to create the image
embedding [
        <xref ref-type="bibr" rid="ref2 ref7">2, 7</xref>
        ]. Similar to the text encoder, the image embedding,
position embedding, and modality-type embedding are summed.
      </p>
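      <p>The following PyTorch sketch illustrates the patch projection. A strided convolution is a standard equivalent of slicing, flattening, and linearly projecting the patches; the hidden size is an illustrative assumption.</p>
      <preformat>
import torch.nn as nn

class PatchProjection(nn.Module):
    """Sketch of the 32 x 32 patch projection used to embed the image."""
    def __init__(self, patch=32, channels=3, hidden=768):
        super().__init__()
        # One strided convolution slices, flattens, and linearly
        # projects each patch in a single step.
        self.proj = nn.Conv2d(channels, hidden, kernel_size=patch, stride=patch)

    def forward(self, images):  # images: (batch, 3, H, W)
        patches = self.proj(images)  # (batch, hidden, H/32, W/32)
        return patches.flatten(2).transpose(1, 2)  # (batch, num_patches, hidden)
      </preformat>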
      <p>Modality interaction schema: After we have the image and
text embeddings, a transformer is used for both intra-modal and
inter-modal interaction. It outputs a contextualized feature
sequence for prediction. During training, the image is replaced by a
different image with a probability of 0.5. Then, we compute the
binary matching loss.</p>
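      <p>A hypothetical sketch of this sampling step is shown below; the function and variable names are our own illustration, not the exact training code.</p>
      <preformat>
import random

def make_itm_examples(images, texts):
    """With probability 0.5, swap the paired image for a different one;
    the label marks whether the pair matches (1) or not (0)."""
    examples = []
    for idx, (img, txt) in enumerate(zip(images, texts)):
        if random.getrandbits(1):  # fair coin: build a negative pair
            other = random.choice([j for j in range(len(images)) if j != idx])
            examples.append((images[other], txt, 0))  # mismatched pair
        else:
            examples.append((img, txt, 1))  # original, matching pair
    return examples
      </preformat>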
      <p>Loss function: We keep the loss function used in pre-training,
which contains two parts: the first is image-text matching (ITM)
and the second is masked language modeling (MLM). Words in the
sentence are randomly masked with a probability of 0.15 before
being input into the model.</p>
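      <p>Conceptually, the two-part objective can be sketched as follows; the tensor shapes and the ignore label for unmasked positions follow common BERT-style conventions and are assumptions on our part.</p>
      <preformat>
import torch.nn.functional as F

def total_loss(itm_logits, itm_labels, mlm_logits, mlm_labels):
    """Sum of image-text matching (ITM) and masked language modeling
    (MLM) losses; unmasked positions carry the ignore label -100."""
    itm = F.cross_entropy(itm_logits, itm_labels)  # match vs. no match
    mlm = F.cross_entropy(mlm_logits.view(-1, mlm_logits.size(-1)),
                          mlm_labels.view(-1), ignore_index=-100)
    return itm + mlm
      </preformat>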
    </sec>
    <sec id="sec-6">
      <title>RESULTS AND ANALYSIS</title>
      <p>
        The task asks participants to predict a ranked list of images
corresponding to each text and report the R@100 result. Our results
are shown in Table 1. The model uses four datasets for pre-training:
Microsoft COCO (MSCOCO), Visual Genome (VG), SBU Captions
(SBU), and Google Conceptual Captions (GCC) [
        <xref ref-type="bibr" rid="ref11 ref14 ref8 ref9">8, 9, 11, 14</xref>
        ]. In
the experiment, we kept the default parameters of the model and
fine-tuned it on a single GPU from Google Colab with a batch size
of 32. We loaded the pre-trained model and fine-tuned it on the task
dataset. We also trained the model without loading the pre-trained
model; the result without the pre-trained model is much worse.
      </p>
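      <p>For reference, the R@100 score can be computed from a text-image similarity matrix as in the following sketch, assuming that the ground-truth image for text i is image i:</p>
      <preformat>
import numpy as np

def recall_at_k(similarity, k=100):
    """similarity[i, j] scores text i against image j; the ground-truth
    match for text i is image i. Returns the fraction of texts whose
    correct image appears in the top-k ranked images."""
    n = similarity.shape[0]
    topk = np.argsort(-similarity, axis=1)[:, :k]  # best k images per text
    hits = sum(i in topk[i] for i in range(n))
    return hits / n
      </preformat>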
      <p>The result demonstrates that the model learns useful features
from the pre-training datasets. In turn, this means that the
characteristics of the pre-training dataset provide an adequate match
with the news domain. Our experiment confirms the benefits of
pre-training for the NewsImages rematching task.</p>
      <p>To better understand the performance of cross-modal alignment,
we inspected the top 5 results of our text-image retrieval for a
selection of test texts. We also generated a keyword heatmap for
each image in the results. Figure 2 shows a random test text from
this analysis. The fourth result (with the border) is correct according
to the ground truth. However, we can see that the other
four images that are returned match the text with respect to the
word “containers”, which is a key concept in the text. This example
demonstrates that, thanks to pre-training, the model returns retrieval
results that are close to the scene described by the text, even though
they may not be the correct match.</p>
    </sec>
    <sec id="sec-7">
      <title>CONCLUSION AND FUTURE WORK</title>
      <p>
        In this work, we explored the performance of the pre-trained model
in the text-image re-matching subtask of NewsImages at MediaEval
2021. Compared to the model without pre-training, the pre-trained
model achieved better performance, as expected with a relatively
small dataset. However, there is still a large gap compared to the
performance improvement delivered by pre-training when text-image
retrieval is carried out on other datasets, such as MSCOCO and
Flickr30K [
        <xref ref-type="bibr" rid="ref12 ref9">9, 12</xref>
        ]. In future work, we will try to collect more data in
the news domain to help the model improve its performance. We
would like to develop a new pre-trained model with more complex
modality interaction.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ming-Wei</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          .
          <source>In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Alexey</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          , Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and
          <string-name>
            <given-names>Neil</given-names>
            <surname>Houlsby</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale</article-title>
          . In International Conference on Learning Representations.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Fartash</given-names>
            <surname>Faghri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>David J.</given-names>
            <surname>Fleet</surname>
          </string-name>
          , Jamie Ryan Kiros, and
          <string-name>
            <given-names>Sanja</given-names>
            <surname>Fidler</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>VSE++: Improving Visual-Semantic Embeddings with Hard Negatives</article-title>
          .
          <source>In British Machine Vision Conference</source>
          <year>2018</year>
          , BMVC 2018, Newcastle, UK, September 3-6, 2018.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Andrea</given-names>
            <surname>Frome</surname>
          </string-name>
          , Gregory S. Corrado, Jonathon Shlens, Samy Bengio,
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Marc'Aurelio</given-names>
            <surname>Ranzato</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Tomás</given-names>
            <surname>Mikolov</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>DeViSE: A Deep Visual-Semantic Embedding Model</article-title>
          .
          <source>In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8</source>
          ,
          <year>2013</year>
          , Lake Tahoe, Nevada, United States.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Benjamin</given-names>
            <surname>Kille</surname>
          </string-name>
          , Frank Hopfgartner, Torben Brodt, and
          <string-name>
            <given-names>Tobias</given-names>
            <surname>Heintz</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>The Plista dataset</article-title>
          .
          <source>In Proceedings of the 2013 international news recommender systems workshop and challenge.</source>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Benjamin</given-names>
            <surname>Kille</surname>
          </string-name>
          , Andreas Lommatzsch, Özlem Özgöbek, Mehdi Elahi, and
          <string-name>
            <given-names>Duc-Tien</given-names>
            <surname>Dang-Nguyen</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>News Images in MediaEval 2021</article-title>
          .
          <source>In Proc. of the MediaEval 2021 Workshop</source>
          , Online, 13-15 December
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Wonjae</given-names>
            <surname>Kim</surname>
          </string-name>
          , Bokyung Son, and
          <string-name>
            <given-names>Ildoo</given-names>
            <surname>Kim</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision</article-title>
          .
          <source>In Proceedings of the 38th International Conference on Machine Learning (Proceedings of Machine Learning Research)</source>
          , Vol.
          <volume>139</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Ranjay</given-names>
            <surname>Krishna</surname>
          </string-name>
          , Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis,
          <string-name>
            <given-names>Li-Jia</given-names>
            <surname>Li</surname>
          </string-name>
          , David A. Shamma, and others.
          <year>2017</year>
          .
          <article-title>Visual genome: Connecting language and vision using crowdsourced dense image annotations</article-title>
          .
          <source>International Journal of Computer Vision 123</source>
          , 1 (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Tsung-Yi</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Maire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Serge J.</given-names>
            <surname>Belongie</surname>
          </string-name>
          , James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and
          <string-name>
            <given-names>C. Lawrence</given-names>
            <surname>Zitnick</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Microsoft COCO: Common Objects in Context</article-title>
          .
          <source>In Proceedings of the European Conference on Computer Vision</source>
          (ECCV).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Jiasen</given-names>
            <surname>Lu</surname>
          </string-name>
          , Dhruv Batra, Devi Parikh, and
          <string-name>
            <given-names>Stefan</given-names>
            <surname>Lee</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks</article-title>
          .
          <source>In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems</source>
          <year>2019</year>
          , NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Vicente</given-names>
            <surname>Ordonez</surname>
          </string-name>
          , Girish Kulkarni, and
          <string-name>
            <given-names>Tamara L.</given-names>
            <surname>Berg</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Im2Text: Describing Images Using 1 Million Captioned Photographs</article-title>
          .
          <source>In Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011. Proceedings of a meeting held 12-14 December</source>
          <year>2011</year>
          , Granada, Spain.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Bryan A.</given-names>
            <surname>Plummer</surname>
          </string-name>
          , Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and
          <string-name>
            <given-names>Svetlana</given-names>
            <surname>Lazebnik</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models</article-title>
          .
          <source>In 2015 IEEE International Conference on Computer Vision</source>
          , ICCV 2015, Santiago, Chile, December 7-13, 2015.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Alec</given-names>
            <surname>Radford</surname>
          </string-name>
          , Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and
          <string-name>
            <given-names>Ilya</given-names>
            <surname>Sutskever</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>Learning Transferable Visual Models From Natural Language Supervision</article-title>
          .
          <source>In Proceedings of the 38th International Conference on Machine Learning.</source>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Piyush</given-names>
            <surname>Sharma</surname>
          </string-name>
          , Nan Ding,
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Goodman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Radu</given-names>
            <surname>Soricut</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning</article-title>
          .
          <source>In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).</source>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Weijie</given-names>
            <surname>Su</surname>
          </string-name>
          , Xizhou Zhu, Yue Cao,
          <string-name>
            <given-names>Bin</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Lewei</given-names>
            <surname>Lu</surname>
          </string-name>
          , Furu Wei, and
          <string-name>
            <given-names>Jifeng</given-names>
            <surname>Dai</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>VL-BERT: Pre-training of Generic Visual-Linguistic Representations</article-title>
          .
          <source>In 8th International Conference on Learning Representations, ICLR</source>
          <year>2020</year>
          , Addis Ababa, Ethiopia, April 26-30,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Ashish</given-names>
            <surname>Vaswani</surname>
          </string-name>
          , Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
          <string-name>
            <given-names>Aidan N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , Lukasz Kaiser, and
          <string-name>
            <given-names>Illia</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Attention is All you Need</article-title>
          .
          <source>In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9</source>
          ,
          <year>2017</year>
          , Long Beach, CA, USA.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>