<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Textual Concept Expansion for Text-Image Matching within Online News Content</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mingliang Liang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martha Larson</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Radboud University</institution>
          ,
          <country country="NL">Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
<p>We investigate a Textual Concept Expansion (TCE) approach to address the NewsImages task at MediaEval'22. Specifically, we use a pre-trained multi-label classifier to predict concepts beyond the words in the captions and use these concepts to enrich the captions. We explore TCE because it leverages commonsense knowledge, which can improve performance on news datasets. The results show that the proposed method achieves strong performance in text-image retrieval on the NewsImages task.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The goal of the NewsImages task is to learn the relationship between images and articles.
Task participants design and implement systems that return images that are related to a query
article [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The task is challenging because the relationship between images
and articles is complex. Specifically, due to the nature of news, not everything depicted in the image is
described in the article. As a result, information related to the image is missing from the article.
The loose connection between images and articles in news datasets prompted us to explore external
knowledge to enrich the articles via textual concept expansion (TCE) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which can provide
possible co-occurring concepts related to the images.
      </p>
      <p>
        In this paper, we propose a new approach to address the NewsImages task [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which expands
the concepts of the articles with a multi-label classifier pre-trained on the MS COCO dataset [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
We combine these concepts with the text from the articles as the input of the text encoder. We then
fine-tune our model starting from a model that was pre-trained on a dataset of 4M images [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        In vision-and-language (VL) tasks, such as image-text retrieval and visual question
answering (VQA), co-attention (cross-modal attention) and merged-attention transformers have
been shown to perform strongly at learning the relationship between image and
text [
        <xref ref-type="bibr" rid="ref5 ref6 ref4 ref7 ref8">5, 6, 4, 7, 8</xref>
        ]. The co-attention transformer layer proposed by ViLBERT [7] allows the model
to have a deep interaction between different modalities. VisualBERT [9] combines image regions
and language with a transformer to align image and text, which is called merged-attention. In
this paper, we apply merged-attention to the NewsImages dataset.
      </p>
      <p>
        In addition, vision-and-language pre-training (VLP) has become a popular approach to
tackle image-text retrieval tasks [
        <xref ref-type="bibr" rid="ref10 ref5 ref6 ref4 ref7 ref8">10, 5, 6, 4, 7, 8</xref>
        ]. Learning pre-trained representations from large
numbers of image-text pairs can lead to better baseline performance of the model on
vision-and-language tasks. Also, pre-trained models demonstrated substantial improvements in
performance on the NewsImages task dataset [11, 12] at MediaEval 2021. For this reason, we
also take advantage of a pre-trained model, which we fine-tune to obtain strong performance
on the NewsImages task at MediaEval 2022.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Approach</title>
      <p>In this section, we begin with a brief introduction to the NewsImages dataset and data
preprocessing. Then we present the technical details of the pre-trained model and our textual
concept expansion (TCE) approach.</p>
      <sec id="sec-2-1">
        <title>2.1. Dataset and data pre-processing</title>
        <p>The dataset for the NewsImages task includes about 9,300 training samples and 4,500 test samples
released for MediaEval 2022 [13]. The dataset was crawled from three different news source
websites: online news portals (rt), Twitter (tw), and RSS news feeds (rss). The articles of the rt news
feed come from a German news publisher. Therefore, we translated the German text into English
via Google Translate in order to keep it consistent with the rest of the text.</p>
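        <p>As an illustration of this pre-processing step, the sketch below translates the German rt articles into English. It is only a sketch: it assumes the open-source deep-translator package as a Google Translate client and a hypothetical list of article records with a "text" field, which may differ from the tooling actually used.</p>
        <preformat>
# Sketch of the German-to-English translation step (assumption: the
# deep-translator package is used as the Google Translate client; the
# "text" field name is illustrative, not the task's actual schema).
from deep_translator import GoogleTranslator

translator = GoogleTranslator(source="de", target="en")

def translate_articles(articles):
    """Translate the 'text' field of each rt article from German to English."""
    for article in articles:
        article["text"] = translator.translate(article["text"])
    return articles
        </preformat>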
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Textual Concept Expansion</title>
        <p>
          We tackled the NewsImages task with a pre-trained merged-attention model with Triple
Contrastive Learning (TCL) as our baseline model [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Then, we extend the model with textual concept
expansion.
        </p>
        <p>
          Pre-trained Model with Triple Contrastive Learning (TCL): TCL applies triple
contrastive learning both cross-modally and intra-modally, which maintains the similarity of
image-text pairs as well as of similar samples from the same modality [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. TCL has a vision encoder, a text
encoder, and a fusion encoder. For both the vision encoder and the text encoder, TCL utilizes two
separate data augmentation operators to generate the inputs of an encoder and a momentum
encoder [14]. The outputs of both the vision and text encoders are fed into the fusion encoder,
which predicts whether the image-text pairs match. TCL is pre-trained on 4.0M images and
5.1M image-text pairs drawn from four datasets: MS COCO [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], Visual Genome
(VG) [15], Conceptual Captions (CC) [16], and SBU Captions [17]. Additionally, we also tested
the zero-shot performance of TCL without fine-tuning on the target dataset.
        </p>
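        <p>To make the baseline concrete, the sketch below shows a simplified symmetric image-text contrastive (InfoNCE) loss of the kind that underlies TCL-style dual encoders. It is an illustrative reduction, not the full TCL objective, which additionally includes intra-modal contrastive and local mutual-information terms [4].</p>
        <preformat>
# Simplified cross-modal contrastive (InfoNCE) loss for matched image-text
# pairs; a sketch of the alignment idea, not the exact TCL implementation.
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: (batch, dim) embeddings of matched image-text pairs."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature        # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)            # image-to-text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)        # text-to-image direction
    return (loss_i2t + loss_t2i) / 2
        </preformat>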
        <p>
          Textual Concept Expansion: The information in online news articles and the accompanying
images is often complementary. To address this challenge, we propose to use Textual Concept
Expansion (TCE) [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], which expands a text with additional concepts that may co-occur in the
same context. In addition, TCE as a method of query expansion has been demonstrated to be effective on
the image-text matching task [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Here, we explore the effectiveness of TCE in text-image
matching for online news. For this purpose, we train a multi-label classifier on the MS
COCO dataset, which can predict co-occurring concepts to enrich a text. Specifically, when
we train the multi-label classifier, we first select the top K most frequent concepts of three types
(i.e., Object, Motion, and Property) from the MS COCO dataset as the concept vocabulary V =
{v1, v2, . . . , vK}. Then, we label the training dataset. In the MS COCO dataset, each image has five
captions that were created by different people to describe the image and that contain complementary concepts.
As we know, objects that appear in the same scene are closer together in terms of commonsense
knowledge than objects that do not appear in the same scene. Therefore, we merge the captions
of each image i in MS COCO into a single text Ti = {t1, t2, . . . , tm}, where m is the number of
captions per image and i indexes the images. The target labels of the classifier are Yi = {y1, y2, . . . , yn},
where n is the number of co-occurring concepts for Ti. Finally, we use a pre-trained BERT [18] model with an added
multi-label classification layer and train the classifier with a multi-label classification
loss.
        </p>
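        <p>A minimal sketch of this classifier is given below. It uses a pre-trained BERT encoder with a linear multi-label head over the concept vocabulary and trains it with a binary cross-entropy loss; the vocabulary size, loss choice, and hyperparameters are illustrative assumptions rather than the exact configuration used in our runs.</p>
        <preformat>
# Sketch of the BERT-based multi-label concept classifier (assumptions:
# binary cross-entropy as the multi-label loss; K and lr are illustrative).
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class ConceptClassifier(nn.Module):
    def __init__(self, num_concepts, bert_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        self.head = nn.Linear(self.bert.config.hidden_size, num_concepts)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return self.head(out.pooler_output)      # one logit per concept in V

K = 512                                           # size of the concept vocabulary (illustrative)
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = ConceptClassifier(num_concepts=K)
criterion = nn.BCEWithLogitsLoss()                # multi-label classification loss
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def training_step(merged_captions, labels):
    """merged_captions: the five captions of each image joined into one string;
    labels: (batch, K) multi-hot co-occurrence concept targets."""
    batch = tokenizer(merged_captions, padding=True, truncation=True, return_tensors="pt")
    logits = model(batch["input_ids"], batch["attention_mask"])
    loss = criterion(logits, labels.float())
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
        </preformat>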
        <p>When we fine-tune the pre-trained TCL model, the pre-trained multi-label classifier is used
to predict the concepts of an input caption. Then, we combine each caption with its predicted
concepts and use the result as the input of the text encoder of TCL in order to fine-tune our model. We set
the confidence threshold of the multi-label classifier to 0.1 to select the predicted concepts.</p>
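        <p>The expansion step at fine-tuning time can be sketched as follows: the classifier scores every concept in the vocabulary, concepts whose score exceeds the 0.1 threshold are kept, and the kept concepts are appended to the caption before it is passed to the text encoder of TCL. Function and variable names here are illustrative.</p>
        <preformat>
# Sketch of Textual Concept Expansion applied to one caption (names illustrative).
import torch

def expand_caption(caption, model, tokenizer, concept_vocab, threshold=0.1):
    """Append predicted co-occurrence concepts to a caption."""
    batch = tokenizer([caption], padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        probs = torch.sigmoid(model(batch["input_ids"], batch["attention_mask"]))[0]
    keep = probs.ge(threshold).nonzero().flatten().tolist()   # concepts scoring at least 0.1
    selected = [concept_vocab[i] for i in keep]
    return caption + " " + " ".join(selected)     # expanded text fed to the TCL text encoder
        </preformat>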
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results and Analysis</title>
      <p>
        The task asks participants to predict a ranked list of images corresponding to each text and
to report the text-image retrieval R@K and MRR@K results. In all of our experiments, we kept the default parameters
of TCL [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and fine-tuned it on two 3090Ti GPUs with a batch size of 16.
      </p>
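      <p>For reference, the reported metrics can be computed from a text-to-image similarity matrix as in the generic sketch below, which assumes exactly one relevant image per caption and that the i-th caption matches the i-th image.</p>
      <preformat>
# Generic sketch of R@K and MRR@K for text-to-image retrieval, assuming the
# i-th caption's ground-truth image is the i-th candidate image.
import numpy as np

def retrieval_metrics(sim, k_recall=(50, 100), k_mrr=100):
    """sim: (num_texts, num_images) similarity matrix."""
    ranks = []
    for i in range(sim.shape[0]):
        order = np.argsort(-sim[i])                  # images sorted by descending similarity
        rank = int(np.where(order == i)[0][0]) + 1   # 1-based rank of the correct image
        ranks.append(rank)
    ranks = np.array(ranks)
    metrics = {}
    for k in k_recall:
        metrics["R@%d" % k] = float(np.mean(np.less_equal(ranks, k)))
    reciprocal = np.where(np.less_equal(ranks, k_mrr), 1.0 / ranks, 0.0)
    metrics["MRR@%d" % k_mrr] = float(np.mean(reciprocal))
    return metrics
      </preformat>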
      <p>First, we evaluate TCL pre-trained on the 4M-image dataset without fine-tuning
on the NewsImages dataset. As shown in the first row of Table 1, the zero-shot result of
the text-image retrieval task on the NewsImages dataset already provides a strong baseline. Then, we
fine-tune the model on the NewsImages dataset, and the MRR@100 increases from 11.39, 4.29, and 15.48 to
13.40, 7.59, and 21.58 on the three test sets, respectively. Next, we go on to evaluate the performance of
TCE fine-tuned on top of pre-trained TCL. Comparing row 2 and row 3 of rss and tw in Table 1,
TCE further improves MRR@100 on the rss and tw test data. On the rt test
data, TCE improves R@50 and R@100 from 28.40 and 36.67 to 30.00 and 39.13.
TCE yields larger improvements on rss and tw than on rt. We conjecture that the training
dataset of the multi-label classifier is closer to rss and tw than to rt. This encourages us to train
a better-generalizing TCE model to expand captions and improve the performance of most
vision-language tasks.</p>
      <p>The results of our experiments show that textual concept expansion works well on the
NewsImages dataset, even though we pre-trained our multi-label classifier on another dataset that is not
very close to the news domain. The predicted concepts can help the vision-and-language model
to learn more common and general knowledge.</p>
      <p>Visualisation Analysis: To better understand TCE, we give some visualisation examples that
show the expanded concepts of the captions. [Table 2: example captions (one of which describes a
solar-powered house on wheels) with their expanded concepts; row 1: “sitting, white, large, down,
blue, background, air, flying, day, ground, sky, taking, airplane, plane, passenger, airport”;
row 2: “man, two, people, group, young, field, other, side, couple, green, playing, men, game,
together, ball, ready, tennis, court, four, play, players”; row 3: “sitting, next, white, top,
small, front, has, street, black, sits, area, building, wooden, parked, back, middle, lot, car,
parking, bike, home, house”.] The “airplane, passenger and airport” in row 1, the “man, people,
ball and player” in row 2, and the “building and car” in row 3 are very useful for recalling the
right image. As we observe from Table 2, these concepts do not appear in the captions, but we can
train a model to learn the commonsense knowledge needed to expand the captions.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion and Outlook</title>
      <p>In this work, we have explored the performance of textual concept expansion (TCE) on the
text-image matching task of NewsImages at MediaEval 2022. Compared with the model without
expanded concepts, the captions that are expanded with concepts achieve better performance.
In future work, we would like to train a stronger and more generalizable textual concept
expansion model that predicts more useful concepts for captions in vision-language tasks.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Kille</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lommatzsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Özlem</given-names>
            <surname>Özgöbek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Elahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.-T.</given-names>
            <surname>Dang-Nguyen</surname>
          </string-name>
          ,
          <article-title>News Images in MediaEval 2021</article-title>
          ,
          <source>in: Proc. of the MediaEval 2021 Workshop</source>
          , Online, 13-15 December
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Larson</surname>
          </string-name>
          ,
          <article-title>Textual Concept Expansion with Commonsense Knowledge to Improve Dual-Stream Image-Text Matching</article-title>
          ,
          <source>in: International Conference on Multimedia Modeling</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Maire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Belongie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hays</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Perona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ramanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dollár</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Zitnick</surname>
          </string-name>
          ,
          <article-title>Microsoft COCO: Common Objects in Context</article-title>
          ,
          <source>in: Proceedings of the European Conference on Computer Vision</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chanda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chilimbi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>Vision-Language Pre-Training with Triple Contrastive Learning</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Selvaraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. D.</given-names>
            <surname>Gotmare</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Joty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hoi</surname>
          </string-name>
          ,
          <article-title>Align before Fuse: Vision and Language Representation Learning with Momentum Distillation</article-title>
          ,
          <source>in: Advances in neural information processing systems</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] W. Kim, B. Son, I. Kim, ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision, in: International Conference on Machine Learning, 2021.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] J. Lu, D. Batra, D. Parikh, S. Lee, ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, in: Advances in Neural Information Processing Systems, 2019.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] Z.-Y. Dou, Y. Xu, Z. Gan, J. Wang, S. Wang, L. Wang, C. Zhu, N. V. Peng, Z. Liu, M. Zeng, An Empirical Study of Training End-to-End Vision-and-Language Transformers, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] L. H. Li, M. Yatskar, D. Yin, C.-J. Hsieh, K.-W. Chang, VisualBERT: A Simple and Performant Baseline for Vision and Language, arXiv:1908.03557 (2019).</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever, Learning Transferable Visual Models from Natural Language Supervision, in: International Conference on Machine Learning, 2021.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] M. Liang, M. Larson, Exploring a Pre-trained Model for Re-Matching News Texts and Images, in: Proceedings of the MediaEval Benchmarking Initiative for Multimedia Evaluation 2021, CEUR Workshop Proceedings, 2021.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] C. Bartolomeu, R. Nóbrega, D. Semedo, NewsSeek-NOVA at MediaEval 2021: Context-enriched Multimodal Transformers for News Images Re-matching, in: Proceedings of the MediaEval Benchmarking Initiative for Multimedia Evaluation 2021, CEUR Workshop Proceedings, 2021.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] A. Lommatzsch, B. Kille, O. Özgöbek, Y. Zhou, J. Tešić, C. Bartolomeu, D. Semedo, L. Pivovarova, M. Liang, M. Larson, NewsImages: Addressing the Depiction Gap with an Online News Dataset for Text-Image Rematching, in: Proceedings of the 13th ACM Multimedia Systems Conference, 2022.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] K. He, H. Fan, Y. Wu, S. Xie, R. Girshick, Momentum Contrast for Unsupervised Visual Representation Learning, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, M. S. Bernstein, L. Fei-Fei, Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations, International Journal of Computer Vision (2017).</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] P. Sharma, N. Ding, S. Goodman, R. Soricut, Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset for Automatic Image Captioning, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] V. Ordonez, G. Kulkarni, T. Berg, Im2Text: Describing Images Using 1 Million Captioned Photographs, in: Advances in Neural Information Processing Systems, 2011.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, in: Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, 2019.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>