Deep Embedding-based Multimodal Matching for News Articles: Exploring the Effects of Transfer Learning & Data Augmentation

Martin Ludwig Zehetner, Technische Universität Berlin, Berlin, Germany, m.zehetner@tu-berlin.de
Mohamed Amine Dhiab, Technische Universität Berlin, Berlin, Germany, m.dhiab@campus.tu-berlin.de

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'21, December 13-15 2021, Online.

ABSTRACT
Choosing the right combination of news article image and title is critical for attracting new users and converting latent interest into clicks. In the context of the MediaEval 2021 NewsImages Challenge, we therefore investigated the underlying multimodal matching problem between images and titles of news articles by employing deep embedding-based models to map and match images and text in the same semantic space. Additionally, we explored the impact of transfer learning and paraphrasing-based data augmentation schemes on task performance. We observed a clear improvement in performance using transfer learning approaches, but no consistent improvement using the data augmentation technique we selected. Our best model achieved a mean recall@100 of 0.3488.

1 INTRODUCTION
Online news articles commonly try to convey content in a distinctive and direct manner through combinations of expressive titles and images. Understanding the relationships between these news images and texts, e.g. news headlines, can therefore provide insight into a wide variety of tasks in the news domain. In this sense, the MediaEval 2021 NewsImages Re-Matching Task aims to advance this investigation by setting a task to re-match decoupled real-world news articles with the images used in said articles [8]. In particular, given a news article, the corresponding article image is to be selected from the set of all images.

We attempted to further the understanding of these relationships and to solve the given task using multimodal embeddings, specifically embeddings that map images and texts into the same semantic "news" vector space. These embeddings are generated using variations of the Self-Attention Embeddings for Image-Text Matching (SAEM) model framework [18] and are intended to enable the computation of semantic similarity between texts and images, with regard to the "news" domain, using basic similarity metrics. In this context, we placed a special emphasis on investigating the effects of a transfer learning scheme, using information exploited from image caption data sets, and of augmenting the given text data with a paraphrase-based approach.

2 RELATED WORK
Matching media objects of different modalities, e.g. text and images, is essential for various multimedia tasks. Learning a common space into which text and image feature vectors can be embedded and then compared is a typical approach for such tasks [2, 7].

Wu et al. [18] present such an embedding-based model framework, SAEM, to solve the cross-media retrieval task posed by image captioning data sets, i.e. images with associated descriptive sentences, such as Flickr30k [19] or MS-COCO [13]. SAEM models can be understood as neural networks divided into two branches. The first branch takes feature vectors representing salient regions in images as inputs. These vectors and their intra-modal relations are encoded using self-attention layers, and the final image embeddings are generated by average pooling. In the second branch, text inputs are converted into continuous, context-sensitive word embeddings using a transformer encoder initialized with a pre-trained BERT [3] architecture. The continuous representations are then fed into three distinct 1D Convolutional Neural Network (CNN) layers [9], and a fully connected layer subsequently generates the global text embeddings. Furthermore, to guarantee the mapping of similar images to texts, a weighted combination of a triplet loss [17] and an angular loss [16] is used while training the models with image-text pairs. This allows the computed embeddings to be compared directly using the cosine similarity measure; a sketch of such a ranking objective is given below.
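For concreteness, the following is a minimal PyTorch-style sketch of the hinge-based bidirectional triplet ranking term over cosine similarities that such frameworks build on. It is an illustration under stated assumptions, not the exact SAEM objective: the angular loss component [16] and the weighting between the two terms are omitted, and the function names are ours.

```python
import torch
import torch.nn.functional as F

def cosine_sim_matrix(img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
    # L2-normalise both sets of embeddings so the dot product equals cosine similarity
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    return img_emb @ txt_emb.t()  # shape: (batch, batch)

def bidirectional_triplet_loss(img_emb, txt_emb, margin=0.2):
    sim = cosine_sim_matrix(img_emb, txt_emb)
    pos = sim.diag().view(-1, 1)  # similarity of the matching (diagonal) pairs
    # hinge: every non-matching pair should score at least `margin` below
    # its corresponding matching pair, in both retrieval directions
    cost_txt = (margin + sim - pos).clamp(min=0)      # image -> wrong text
    cost_img = (margin + sim - pos.t()).clamp(min=0)  # text -> wrong image
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return (cost_txt.masked_fill(mask, 0).sum()
            + cost_img.masked_fill(mask, 0).sum())
```

In the full SAEM objective this ranking term is combined with the angular loss via a weighting hyperparameter; summing over all violations, as above, is one common variant of the triplet formulation.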
3 APPROACH

3.1 Deep Embedding-based Multimodal Matching using SAEM
A central component of our approach is the generation of directly comparable text and image embeddings using variations of SAEM models. We chose the SAEM model framework because it allows easy customization of the input image feature sources and achieved good results in other cross-media retrieval tasks on the Flickr30k and MS-COCO image captioning data sets [18].

We decided to use only the images and article titles from the MediaEval data set [8] in our SAEM variations, due to superior performance in initial tests and restricted translation and paraphrasing resources. For the generation of image feature vectors representing the salient image regions, we used a bottom-up attention mechanism [1] consisting of a Faster R-CNN with ResNet-101 trained on Visual Genome [11]. We set the dimensions of these feature vectors to (36, 2048). In addition, we used the WordPiece tokenizer [3] to process the article titles into the initial text input.

It is important to point out that, unlike the original introduction of the SAEM model framework [18], in which the focus was only on matching semantically identical images and texts, our approach focuses on the somewhat more abstract task of mapping the similarity of image and text elements within the semantics of the online news article context, i.e. which image would be selected to match a given article title. This is achieved by training and fine-tuning on the provided MediaEval image-title pairs [8]. Concurrently, the training process is used to optimize the hyperparameters, e.g. the learning parameters, the use of transfer learning, and the use of data augmentation. The trained models are then used to compute the images that best match the article titles, based on the cosine similarity of the corresponding embeddings.

3.2 Transfer Learning
Transfer learning broadly describes the use of knowledge learned in one scenario to improve the training process or results in another [6]. Among the most commonly used strategies in the neural network field are various approaches using pre-trained models [14]. In our approach, due to the relatively small size of the MediaEval data set, we decided to investigate and exploit the effect of a direct pre-trained model strategy in our task scenario. To this end, we first train our SAEM variants on either the Flickr30k or the MS-COCO data set. We then fine-tune the SAEM models on the news images and titles, using the best versions of the pre-trained models as the initial model states.

3.3 Data Preparation & Augmentation
In general, the performance of neural networks can be influenced heavily by the size and quality of the training data sets [15]. In many real-world tasks, relatively few data points are available for the training process, as in our MediaEval task. A potential solution is data augmentation, i.e. the generation of new data points to increase training data diversity [4]. We therefore investigated the effects of an augmentation approach that generates a new paraphrased title for each article title in the training set, with the image mapped to both titles. By doing so, we doubled the amount of training data; a sketch of this scheme is given below.
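As a minimal sketch of the scheme from Section 3.3: each image is paired with both its translated title and a paraphrase of that title, exactly doubling the training pairs. The function names are ours, and `paraphrase` stands in for the QuillBot step, which we ran through its web interface.

```python
from typing import Callable, List, Tuple

def augment_with_paraphrases(
    pairs: List[Tuple[str, str]],        # (image_id, translated_title)
    paraphrase: Callable[[str], str],    # stand-in for the QuillBot paraphraser
) -> List[Tuple[str, str]]:
    """Map each image to its original title AND a paraphrase of it,
    exactly doubling the number of image-title training pairs."""
    augmented: List[Tuple[str, str]] = []
    for image_id, title in pairs:
        augmented.append((image_id, title))              # original pair
        augmented.append((image_id, paraphrase(title)))  # paraphrased pair
    return augmented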
4 EXPERIMENT

4.1 Data Pre-Processing & Augmentation
Initial image feature vectors are generated following the procedure used in the SCAN [12] project. As the first step of text pre-processing, we translate the article titles into English using DeepL. For the text augmentation step, we then used QuillBot to generate paraphrases of the article titles, using the default mode with the highest possible abstraction level for the generated phrases [5].

4.2 Model Implementation
For the selected models, we used the Adam [10] optimizer with an initial learning rate of 0.001 and a batch size of 64 during training. During pre-training, a decay rate of 0.1 was applied after every 10 epochs; when training on the MediaEval data, the decay was applied only after every 15 epochs (a sketch of this setup follows at the end of this section). The dimension of the internal word embeddings was set to 300 and the dimension of the final multimodal embeddings was fixed to 256.

4.3 Experiment Protocol
Training & Evaluation of Pre-trained Models. The training described below is performed separately for Flickr30k and MS-COCO. First, the image-annotation pairs are split according to the public split [12]. The SAEM models are then trained for 30 epochs, and after each epoch the ratio of recommendations in which at least one relevant image was ranked among the top 1, 5, and 10 is determined on the validation data. Afterwards, the model with the highest mean across these three metrics is selected as the best pre-trained model. In our experiment, this best model was trained on MS-COCO.

Training & Evaluation of MediaEval Models. The second phase of the experiment focuses on comparing the performance of SAEM models with different configuration combinations. The "Initial Model State" M and the "Textual Input Type" T can be seen as the variables of the configurations: M represents initializing the SAEM model either randomly or from the best pre-trained model, and T represents using either only the translated article titles or the translated titles together with the paraphrased titles. The data is split such that the first two batches of the MediaEval data set represent the training data and the third batch serves as validation data [8]. Afterwards, for each of the four possible (M, T) combinations, the following steps are performed. First, the input data is augmented and processed according to the current T. Then the model is initialized according to the current M. Thereafter, the SAEM model is trained for 50 epochs using the pre-processed training data. Subsequently, the best model of the current configuration is selected by calculating MRR@N for N = 1, ..., 100 on the matchings, ranked by cosine similarity, that the trained model produces on the validation data set (a sketch of such ranking metrics is given below). The four submission models are the best performing models of each configuration.
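As referenced in Section 4.2, the following is a sketch of the optimization setup, assuming a PyTorch implementation of the SAEM variants; the helper name and the boolean flag are ours.

```python
import torch

def make_optimizer(model: torch.nn.Module, on_mediaeval: bool):
    # Adam with the initial learning rate of 0.001 from Section 4.2
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    # learning rate multiplied by 0.1 every 10 epochs during pre-training,
    # but only every 15 epochs when training on the MediaEval data
    scheduler = torch.optim.lr_scheduler.StepLR(
        optimizer, step_size=15 if on_mediaeval else 10, gamma=0.1)
    return optimizer, scheduler
```

With this setup, `scheduler.step()` is called once at the end of each training epoch so the decay takes effect on the configured schedule.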
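The ranking metrics used in Section 4.3 (and the mean recall@k reported in Table 1) can both be derived from the rank of the ground-truth image for each title. A sketch under the assumption of one ground-truth image per title; the function name and cutoff parameter are ours.

```python
import numpy as np

def ranking_metrics(sim: np.ndarray, ks=(5, 10, 50, 100), n_cut=100):
    """Recall@k and MRR@N from a title-to-image similarity matrix.

    sim[i, j] is the cosine similarity between title i and image j;
    sim[i, i] is assumed to be the single ground-truth pair.
    """
    order = np.argsort(-sim, axis=1)  # best-first image ranking per title
    # 0-based rank position of the ground-truth image for each title
    ranks = np.array([int(np.where(order[i] == i)[0][0])
                      for i in range(sim.shape[0])])
    recall_at_k = {k: float((ranks < k).mean()) for k in ks}
    # reciprocal rank, counted as 0 if the true image falls outside the cutoff
    rr = np.where(ranks < n_cut, 1.0 / (ranks + 1), 0.0)
    return recall_at_k, float(rr.mean())
```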
Table 1: Results: Mean recall@k (MR@k) of the submitted SAEM variants (pre: pre-trained, para: paraphrased titles used)

SAEM Variation        MR@5     MR@10    MR@50    MR@100
not pre & not para    0.04073  0.07050  0.20836  0.30966
not pre & para        0.04856  0.08512  0.21253  0.30287
pre & not para        0.0700   0.1159   0.2585   0.3488
pre & para            0.0653   0.1003   0.2381   0.3248

5 RESULTS & FUTURE WORK
The performance of our four submitted models on the MediaEval test set [8] is shown in Table 1. While no massive performance differences are present, clear performance gaps can nevertheless be observed. Our best performing variant is pre-trained but does not use the paraphrase-based data augmentation method, reflecting the general observed tendencies: consistently better performance is observed for the pre-trained variants, while our data augmentation approach shows no consistent performance improvements. In conclusion, our results indicate that exploiting learned information from similar task domains through transfer learning can be highly beneficial in news re-matching scenarios with small amounts of training data, such as the considered MediaEval 2021 NewsImages task.

In this sense, our observations suggest that further investigation into the exploitation of externally learned information may be worthwhile. In addition to a more detailed analysis of the influence of information learned in more or less related tasks and with more or less available data, investigating the explicit addition of available contextual information, such as knowledge related to identities or locations, could allow for further valuable insights.
REFERENCES
[1] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In CVPR.
[2] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450 (2016).
[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR abs/1810.04805 (2018). arXiv:1810.04805
[4] Steven Feng, Varun Gangal, Jason Wei, Sarath Chandar, Soroush Vosoughi, Teruko Mitamura, and Eduard Hovy. 2021. A Survey of Data Augmentation Approaches for NLP. CoRR abs/2105.03075 (2021).
[5] Tira Fitria. 2021. QuillBot as an online tool: Students' alternative in paraphrasing and rewriting of English writing. Englisia: Journal of Language, Education, and Humanities 9, 1 (2021), 183–196.
[6] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep learning. MIT Press.
[7] Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991 (2015).
[8] Benjamin Kille, Andreas Lommatzsch, Özlem Özgöbek, Mehdi Elahi, and Duc-Tien Dang-Nguyen. 2021. News Images in MediaEval 2021. In MediaEval 2021 Proceedings. MediaEval.
[9] Yoon Kim. 2014. Convolutional Neural Networks for Sentence Classification. CoRR abs/1408.5882 (2014). arXiv:1408.5882
[10] Diederik Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Yoshua Bengio and Yann LeCun (Eds.).
[11] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Fei-Fei Li. 2016. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. CoRR abs/1602.07332 (2016). arXiv:1602.07332
[12] Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. 2018. Stacked cross attention for image-text matching. In Proceedings of the European Conference on Computer Vision (ECCV). 201–216.
[13] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. CoRR abs/1405.0312 (2014). arXiv:1405.0312
[14] Leeja Mathew and Bindu. 2020. A Review of Natural Language Processing Techniques for Sentiment Analysis using Pre-trained Models. In 2020 Fourth International Conference on Computing Methodologies and Communication (ICCMC). 340–345.
[15] Luis Perez and Jason Wang. 2017. The Effectiveness of Data Augmentation in Image Classification using Deep Learning. CoRR abs/1712.04621 (2017). arXiv:1712.04621
[16] Jian Wang, Feng Zhou, Shilei Wen, Xiao Liu, and Yuanqing Lin. 2017. Deep Metric Learning with Angular Loss. CoRR abs/1708.01682 (2017). arXiv:1708.01682
[17] Liwei Wang, Yin Li, and Svetlana Lazebnik. 2015. Learning Deep Structure-Preserving Image-Text Embeddings. CoRR abs/1511.06078 (2015). arXiv:1511.06078
[18] Yiling Wu, Shuhui Wang, Guoli Song, and Qingming Huang. 2019. Learning fragment self-attention embeddings for image-text matching. In Proceedings of the 27th ACM International Conference on Multimedia. 2088–2096.
[19] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL 2 (2014), 67–78.