Image-Text Rematching for News Items using Optimized Embeddings and CNNs in MediaEval NewsImages 2021

Tom Sühr, Ajay Madhavan Ravichandran, Nasim Jamshidi Avanaki, René Berk, Andreas Lommatzsch
Technische Universität Berlin, Berlin, Germany
{tom.suehr,jamshidiavanaki,ajay.m.ravichandran,rene.m.berk,andreas.lommatzsch}@campus.tu-berlin.de

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'21, 13-15 December 2021, Online

ABSTRACT
Finding a matching image for a news article is a core problem in the creation of traditional and online newspapers. The task of image-text matching has thus become a vibrant research area in computer science. The performance of state-of-the-art image retrieval systems on various benchmarks is excellent. However, they all rely on datasets with detailed textual descriptions of the images or on very large training collections. In this work, we optimize image-text matching algorithms for a small dataset based on the data of a single newspaper. Our optimized processing pipeline and the computed configurations reach precise results. The evaluation results obtained in the MediaEval NewsImages benchmark significantly outperform the algorithms from previous years.

1 INTRODUCTION
The process of selecting images for news articles in the multimedia industry is crucial. Images play a significant role in the storytelling process. They are used to attract the user's attention, thus achieving a high number of clicks or a high average dwell time per user. However, finding a single picture that matches a news article well is a hard task. Automating this task can provide benefits in several areas, e.g. improving the efficiency of publishing articles and saving costs and human resources. Finding the relationship between a text and an image is a problem researched in the field of recommender systems. Several papers have achieved good results, but most works rely on huge generic data collections. In this paper we develop models for a specific newspaper that has its own image database, a distinct journalistic style, and a significantly smaller amount of data. We evaluate our models using the data provided in the MediaEval 2021 NewsImages challenge. A detailed description of the dataset and the evaluation metrics is given in the Task Overview paper [11].

Our approach is inspired by recent works in the domain of text and image encoding as well as advanced image-text matching methods. We analyzed commonly used CNNs (pretrained on ImageNet [6]) for the image encoding, such as ResNet [7], VGG [8], and DenseNet [9]. For the efficient encoding of texts and their contexts, the use of text embeddings has shown promising results [2, 13, 18]. Recent image-text matching algorithms are usually based on two branches for extracting image and text representations; the computed representations of both modalities are then aligned in a joint semantic space [1, 3, 15, 20]. Critical aspects are the size of the dataset and its features, the specific vocabulary of the domain, as well as the models for transforming the textual and visual data.

In this work we research the degree to which the textual and visual contents of a news article are related. Our model should be able to recommend a ranked list of related images for a given text input. We analyze whether state-of-the-art image-text matching architectures like VSE work for a small and homogeneous dataset from just one newspaper. Furthermore, we research which adaptations are needed to improve the performance in the MediaEval NewsImages scenario.

The rest of this paper is organized as follows: Sec. 2 explains our approach and its implementation. In Sec. 3 we present the performance results and discuss the specific strengths of the models. Finally, we summarize our work and discuss extensions in Sec. 5.

2 APPROACH
Our approach follows the general architecture of Visual Semantic Embeddings [5]. The core idea of this architecture is to embed both text input and image input into a joint embedding space. In this joint embedding, matching text-image pairs can then be found based on distance or similarity measures such as cosine similarity. Thus, the challenge of this approach is to learn such a joint embedding and to extract those features which characterize image and text pairs best. Fig. 1 shows our architecture and its components.

Figure 1: Our system architecture. (Title, text, and category pass through separate preprocessors, encoders, and fusers, followed by a feature fuser and a linear transformation that yield the article embedding; images pass through a preprocessor and an ImageNet-pretrained VGG-19, yielding the image embedding; both branches are trained with a contrastive cosine-similarity loss.)

Image Encoding. The image encoding consists of three steps: (i) preprocessing, (ii) feature extraction, and (iii) linear mapping into the joint embedding size. In the preprocessing, we normalize the RGB values of the pixels and resize the images to 250 pixels. In the second step, the preprocessed images are fed into a pretrained CNN (VGG19 [16]).
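As an illustration, the following sketch shows how this three-step image encoding could look in PyTorch/torchvision. Only the 250-pixel resize, the RGB normalization, the pretrained VGG19 backbone, and the final linear mapping come from the description above; the square resize, the ImageNet normalization statistics, the frozen backbone, and the joint embedding size `joint_dim` are our assumptions.

```python
import torch.nn as nn
from torchvision import models, transforms

class ImageEncoder(nn.Module):
    """Image branch: (i) preprocessing, (ii) VGG19 features,
    (iii) linear mapping into the joint embedding space."""

    def __init__(self, joint_dim=1024):  # joint embedding size d (assumed)
        super().__init__()
        # (i) Resize to 250 pixels (square shape assumed) and normalize
        #     the RGB values (ImageNet statistics assumed).
        self.preprocess = transforms.Compose([
            transforms.Resize((250, 250)),
            transforms.ToTensor(),  # scales RGB values to [0, 1]
            transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225]),
        ])
        # (ii) Pretrained VGG19 up to its penultimate (4096-d) layer.
        vgg = models.vgg19(pretrained=True)
        self.features = nn.Sequential(
            vgg.features, vgg.avgpool, nn.Flatten(),
            *list(vgg.classifier.children())[:-1])
        for p in self.features.parameters():
            p.requires_grad = False  # frozen backbone (assumption)
        # (iii) Linear mapping into the joint embedding size.
        self.to_joint = nn.Linear(4096, joint_dim)

    def forward(self, pil_image):
        x = self.preprocess(pil_image).unsqueeze(0)  # add batch dimension
        return self.to_joint(self.features(x))
```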
Text Encoding. The text encoding stands at the center of this work. A special feature of news image retrieval is that more than one textual input may exist. In the NewsImages task the article title, the snippet, and the article category are provided. We apply three preprocessing steps to each textual input: we remove stop words, we apply stemming (using nltk), and, in order to get the same number of word vectors for each input, we pick a constant length and crop or extend the input to that length. Subsequently, we vectorize the text and compute a semantic embedding [4, 13, 14]. Due to the limited amount of data, we test pretrained embeddings.
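A minimal sketch of this preprocessing, assuming German-language articles, NLTK's Snowball stemmer, and a `<pad>` token for the extension step; the example length follows the title size given for the first fusion layer below.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)

STOPWORDS = set(stopwords.words("german"))  # German news data (assumed)
STEMMER = SnowballStemmer("german")

def preprocess(text, length):
    """Stop-word removal, stemming, then crop/pad to a constant length."""
    tokens = [STEMMER.stem(tok)
              for tok in nltk.word_tokenize(text.lower(), language="german")
              if tok.isalpha() and tok not in STOPWORDS]
    tokens = tokens[:length]                      # crop if too long
    tokens += ["<pad>"] * (length - len(tokens))  # extend if too short
    return tokens

# One constant length per input, e.g. 5 word vectors for the title:
print(preprocess("Neue Regeln für den Berliner Nahverkehr beschlossen", 5))
```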
First Fusion Layer. The task of the first fusion layer is to reduce the three matrices to three vector representations. Embedding each textual input on the word level yields three matrices of sizes $(a = 5, w)$ for the title input, $(b = 25, w)$ for the text input, and $(c = 1, w)$ for the category input; the word embedding size $w$ is 300. For the reduction of each matrix to a $(1, w)$ vector we compare two components: adaptive max pooling and a fully connected (linear) layer (cf. Sec. 3).

Stacking and Second Fusion Layer. Receiving three inputs of size $(1, w)$ for title, text, and category, the next step is to fuse all three representations and transform them into the size $(1, d)$ of the joint embedding space. To achieve this, we stack the three input representations of size $(1, w)$, which yields one vector of size $(1, 3w)$. A fully connected layer of size $(3w, d)$ then maps the stacked representation to the size $(1, d)$ of the joint embedding space.

Contrastive Loss. A multitude of loss functions exists to train the joint embedding space of article and image embeddings. The loss function should ensure that the learned model assigns an article a higher similarity to its true matching image than to any other image, and vice versa; a margin-based contrastive loss fulfills these requirements [3, 10, 12, 13]. For the image embedding $x_i$ and the article embedding $x_t$ we first define the similarity measure as the inner product of both vectors: $s(i, t) = \langle x_i, x_t \rangle : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$. In our implementation we use the L2-normalized vectors $x_i$, $x_t$ for computing the similarity, so $s(i, t)$ equals the cosine similarity.
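The following sketch mirrors the two fusion layers and the loss in PyTorch, assuming the linear variant of the first fusion layer (models B and D in Sec. 3). The matrix sizes, the $(3w, d)$ stacking layer, the L2 normalization, and the margin-based contrastive objective follow the description above; the joint size, the margin value, the batched formulation, and all module names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

W, D = 300, 1024        # word embedding size w = 300; joint size d (assumed)
A, B, C = 5, 25, 1      # constant word counts for title, text, category

class ArticleEncoder(nn.Module):
    """Text branch: first fusion (linear variant), stacking, second fusion."""

    def __init__(self):
        super().__init__()
        # First fusion layer: reduce each (n, w) matrix to a (1, w) vector.
        self.fuse_title = nn.Linear(A * W, W)
        self.fuse_text = nn.Linear(B * W, W)
        self.fuse_cat = nn.Linear(C * W, W)
        # Second fusion layer: map the stacked (1, 3w) vector to (1, d).
        self.to_joint = nn.Linear(3 * W, D)

    def forward(self, title, text, cat):  # shapes: (batch, n, W)
        t = self.fuse_title(title.flatten(1))
        x = self.fuse_text(text.flatten(1))
        c = self.fuse_cat(cat.flatten(1))
        return self.to_joint(torch.cat([t, x, c], dim=1))

def contrastive_loss(img_emb, art_emb, margin=0.2):  # margin value assumed
    """Margin-based contrastive loss; on L2-normalized vectors the inner
    product s(i, t) = <x_i, x_t> equals the cosine similarity."""
    xi = F.normalize(img_emb, dim=1)
    xt = F.normalize(art_emb, dim=1)
    s = xi @ xt.T                    # s[i, t] for all pairs in the batch
    pos = s.diag().unsqueeze(1)      # similarities of the true pairs
    # Hinge terms: non-matching pairs must stay `margin` below the true
    # pair, in both directions (image -> articles, article -> images).
    cost_i = (margin + s - pos).clamp(min=0)
    cost_t = (margin + s - pos.T).clamp(min=0)
    mask = torch.eye(s.size(0), dtype=torch.bool, device=s.device)
    return (cost_i.masked_fill(mask, 0).sum()
            + cost_t.masked_fill(mask, 0).sum())
```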
3 EXPERIMENTS AND RESULTS
We tested different configurations, focusing on finding optimal embeddings and hyperparameters. The experimental results on the MediaEval test set of size 3022 are shown in Table 1. The experiments reveal that the linear layer for the dimension reduction of the textual inputs outperforms adaptive max pooling in all compositions, with a margin of almost 10% in settings D and C. However, the adaptive max pooling component performed extremely well in most recent works. The reason seems to be the difference between pretrained and fine-tuned word embeddings. Adaptive max pooling can take positions in the textual input into account. The fully connected layer, on the other hand, is better suited to pretrained embeddings because it learns an average importance of the different positions. This suggests that the linear layer adapts better to the pretrained word embeddings than adaptive max pooling does.

In addition, we find that models B and D differ in performance, although they differ only in the data used for learning the embeddings. While model B, with the word embedding trained on Wikipedia, achieves a higher recall at positions 5 and 10, the same model with our word embedding trained on German news article data (model D) performs better at recall at 50 and 100. The wiki-based embedding differentiates more finely between words: given a word and a slight modification of it, the wiki embedding produces two significantly different representations. Furthermore, the vocabulary of the wiki embedding is much larger than the vocabulary of our custom embedding. The custom embedding performs better over a large interval of the ranking (r@50, r@100) because it is better suited to embedding news article words. In summary, the custom embedding provides a better representation of the articles than the embeddings computed on the wiki corpus; however, when fine-grained differentiation between words is relevant, the wiki-based embedding performs better.

Table 1: The evaluation results obtained on the evaluation set for the analyzed models.

Model                                  r@5     r@10    r@50     r@100
A: Word Embeddings MaxPool + wiki      1.93%   3.76%   12.59%   19.37%
B: Word Embeddings Linear + wiki       4.49%   7.26%   20.99%   31.91%
C: Word Embeddings MaxPool + custom    2.92%   4.60%   14.36%   24.86%
D: Word Embeddings Linear + custom     3.97%   7.10%   21.57%   33.26%
E: Word/Subw. Emb. Linear + wiki       2.56%   4.70%   16.19%   26.68%
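For reference, recall@k in this setting can be read as the fraction of articles whose true image appears among the k most similar images. A minimal sketch of this computation (our reading of the metric, not code from the benchmark):

```python
import torch
import torch.nn.functional as F

def recall_at_k(art_emb, img_emb, k):
    """Fraction of articles whose true image (same row index) appears
    among the k images with the highest cosine similarity."""
    sims = F.normalize(art_emb, dim=1) @ F.normalize(img_emb, dim=1).T
    topk = sims.topk(k, dim=1).indices                # (n, k) image indices
    target = torch.arange(sims.size(0)).unsqueeze(1)  # true image per article
    return (topk == target).any(dim=1).float().mean().item()

# e.g. r@5 on a test set of 3022 article-image pairs:
# recall_at_k(article_embeddings, image_embeddings, 5)
```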
4 CONTRIBUTIONS
In this work we made the following contributions. First, we showed that state-of-the-art architectures perform significantly worse on a small, non-descriptive, and homogeneous dataset. Second, we showed that embeddings trained on large corpora such as Wikipedia improve the performance for the top 10 retrieved images, while embeddings tailored to the specific style of a newspaper improve the top 100 performance. Third, we provide our code¹ for future MediaEval participants and other researchers, for benchmarking purposes and to build upon.

5 CONCLUSION
We have investigated how to adapt state-of-the-art image-text matching systems to a small, homogeneous, and specific dataset. We analyzed existing, well-performing image-text matching systems like VSE, identified components which do not work well with our dataset, and systematically tested possible substitutions for them. Our experiments show that the non-viability of components such as trainable word embeddings affects the viability of other components, e.g. the adaptive max pooling. We further showed that we can substitute these components in a simple way and achieve reasonable performance on our data. Future work could investigate other substitutions for the identified components, e.g. optimizing the word embeddings with respect to the loss. Furthermore, future projects could research other configurations or even other inputs for the image encoding layer, as well as investigate fairness aspects. It might be that our strategy works well for political articles but not for sports articles. Thus, analyzing and incorporating fairness aspects of matching and ranking [17, 19] could equalize the performance of our model across various article subjects.

¹ https://github.com/tsuehr/News-text-image-matching

REFERENCES
[1] Yanbei Chen and Loris Bazzani. 2020. Learning joint visual semantic matching embeddings for language-guided retrieval. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXII. Springer, 136–152.
[2] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[3] Fartash Faghri, David J. Fleet, Jamie Ryan Kiros, and Sanja Fidler. 2017. VSE++: Improving visual-semantic embeddings with hard negatives. arXiv preprint arXiv:1707.05612 (2017).
[4] Fartash Faghri, David J. Fleet, Jamie Ryan Kiros, and Sanja Fidler. 2018. VSE++: Improving Visual-Semantic Embeddings with Hard Negatives. (2018). https://github.com/fartashf/vsepp
[5] Andrea Frome, Greg Corrado, Jonathon Shlens, Samy Bengio, Jeffrey Dean, Marc'Aurelio Ranzato, and Tomas Mikolov. 2013. DeViSE: A deep visual-semantic embedding model. (2013).
[6] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. 2017. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4700–4708.
[7] Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. 2333–2338.
[8] Yao-Hung Hubert Tsai, Liang-Kang Huang, and Ruslan Salakhutdinov. 2017. Learning robust visual-semantic embeddings. In Proceedings of the IEEE International Conference on Computer Vision. 3571–3580.
[9] Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, and others. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics 5 (2017), 339–351.
[10] Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3128–3137.
[11] Benjamin Kille, Andreas Lommatzsch, Özlem Özgöbek, Mehdi Elahi, and Duc-Tien Dang-Nguyen. 2021. News Images in MediaEval 2021. In Proceedings of the MediaEval Benchmarking Initiative for Multimedia Evaluation 2021. CEUR Workshop Proceedings. http://ceur-ws.org/Vol-2882/
[12] Ryan Kiros, Ruslan Salakhutdinov, and Richard S. Zemel. 2014. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539 (2014).
[13] Fangyu Liu, Rémi Lebret, Didier Orel, Philippe Sordet, and Karl Aberer. 2020. Upgrading the Newsroom: An Automated Image Selection System for News Articles. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 16, 3 (2020), 1–28.
[14] Fangyu Liu, Rongtian Ye, Xun Wang, and Shuaipeng Li. 2020. HAL: Improved text-image matching by mitigating visual semantic hubs. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 11563–11571.
[15] Lin Ma, Wenhao Jiang, Zequn Jie, Yu-Gang Jiang, and Wei Liu. 2019. Matching image and sentence with multi-faceted representations. IEEE Transactions on Circuits and Systems for Video Technology 30, 7 (2019), 2250–2261.
[16] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[17] Tom Sühr, Asia J. Biega, Meike Zehlike, Krishna P. Gummadi, and Abhijnan Chakraborty. 2019. Two-sided fairness for repeated matchings in two-sided markets: A case study of a ride-hailing platform. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 3082–3092.
[18] Keyu Wen, Xiaodong Gu, and Qingrong Cheng. 2020. Learning Dual Semantic Relations with Graph Attention for Image-Text Matching. IEEE Transactions on Circuits and Systems for Video Technology (2020).
[19] Meike Zehlike, Tom Sühr, Carlos Castillo, and Ivan Kitanovski. 2020. FairSearch: A tool for fairness in ranked search results. In Companion Proceedings of the Web Conference 2020. 172–175.
[20] Ying Zhang and Huchuan Lu. 2018. Deep cross-modal projection learning for image-text matching. In Proceedings of the European Conference on Computer Vision (ECCV). 686–701.