HCMUS at MediaEval 2020: Image-Text Fusion for Automatic News-Images Re-Matching

Thuc Nguyen-Quang*1,3, Tuan-Duy H. Nguyen*1,3, Thang-Long Nguyen-Ho*1,3, Anh-Kiet Duong*1,3, Nhat Hoang-Xuan*1,3, Vinh-Thuyen Nguyen-Truong*1,3, Hai-Dang Nguyen1,3, Minh-Triet Tran1,2,3
1 University of Science, VNU-HCM, 2 John von Neumann Institute, VNU-HCM, 3 Vietnam National University, Ho Chi Minh City, Vietnam
{nqthuc,nhtduy,ntvthuyen}@apcs.vn, {nhtlong,hxnhat,nhdang}@selab.hcmus.edu.vn, 18120046@student.hcmus.edu.vn, tmtriet@fit.hcmus.edu.vn

Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'20, 14-15 December 2020, Online.

ABSTRACT
Matching text and images based on their semantics plays an important role in cross-media retrieval. In news in particular, the connection between text and images is highly ambiguous. In the context of the MediaEval 2020 Challenge, we propose three multi-modal methods for mapping the text and images of news articles to a shared space in order to perform efficient cross-retrieval. Our methods show systematic improvement and validate our hypotheses, while the best-performing method reaches a recall@100 score of 0.2064.

1 INTRODUCTION
News articles represent a complex class of multimedia, whose textual content and accompanying images might not be explicitly related [25]. Existing research in the multimedia and recommendation system domains mostly investigates image-text pairs with simple relationships, e.g., image captions that literally describe components of the images [16]. To address this, the MediaEval 2020 NewsImages Task calls for researchers to investigate the real-world relationship of news text and images in more depth, in order to understand its implications for journalism and news recommendation systems [19]. Our team at HCMUS responds to this call by addressing the Image-Text Re-Matching task. Particularly, given a set of image-text pairs in the wild, the task requires us to correctly re-assign images to their decoupled articles, with the aim of understanding the implications for journalism in choosing illustrative images.

Our methods mainly concern fusing cross-modal embeddings for automatic matching. We experimented with a range of embedded information, including simple set intersection, deep neural features, and knowledge-graph-enhanced neural features. We combine such features in various ways for various experiments. Finally, we obtain our best result with an ensemble of the experimented methods.

2 METHODS
2.1 Metric Learning
The primary idea of this baseline method is to use metric learning to project the embeddings of image-text pairs onto bases of significant similarity. Particularly, we use two approaches to embed image features: global context embedding and local context embedding. In the first approach, we use EfficientNet [30], a state-of-the-art classification architecture, to extract features of the image before taking the flattened output features. Our motivation in the latter approach is to harness critical local information from the extracted global context. Thus, we use the bottom-up-attention model [3] to extract the top-k objects based on their confidence scores, before passing them to a self-attention sequential model. For both routines, we employ the BERT [12] language model to embed textual content, then project the textual and image embeddings to the same dimension. Finally, we train our Triplet Loss [15] model with positive and negative pairs from a hard sample miner.

2.2 Image-Text Matching via Categorization
In this method, we train two gradient boosting decision trees [18], one for categorizing images and the other for categorizing articles. The target categories are ['nrw', 'kultur', 'region', 'panorama', 'sport', 'wirtschaft', 'koeln', 'ratgeber', 'politik', 'unknown'], which are deduced from URLs in the train set.

We use features extracted from images and text to train the decision trees. To augment the data, we use VGG16, InceptionResNetV2, MobileNetV2, EfficientNetB1-7, Xception, ResNet152V2, NASNetLarge, and DenseNet201 [10, 14, 17, 27-30, 32] for images, while using pretrained BERT models [2, 8, 9, 11] and pretrained ELECTRA models [1, 9] to extract contextual features.

We presume that images and articles of the same category might have some relation. Moreover, the rank of matching categories also affects the ranking. For example, an image-text pair sharing a 3rd-ranked category might be less relevant than a pair sharing a 1st-ranked category. Hence, instead of using Jaccard similarity, we propose an iterative ranking method that takes into account the order of matched categories. At the k-th iteration, our method first finds the top-k categories for each image and the top-k categories for each article. Then, for each article, we create a list of candidate images whose top-k categories intersect those of the article. This list of candidates at the k-th iteration is concatenated to the final list. Finally, the remaining images that are not candidates are kept in their original order and concatenated to the end of the final list.

2.3 Graph-based Face-Name Matching
Based on our observation, in many instances the publisher uses a portrait of somebody mentioned in the text. We build a face-name graph to represent the relations between names and faces.

Person name extraction: To automatically extract people's names from the text, we use entity-fishing [23], an open-source high-performance entity recognition and disambiguation tool. It relies on Random Forest and Gradient Tree Boosting to recognize named entities, in our case people's names, and links them against Wikidata entities using their word embeddings and Wikidata entity embeddings.

Face encoding: We use the face_recognition open-source library [13] to detect and represent each face as a 128-dimensional vector. The tool uses a pre-trained model from the dlib-models repository [20] and chooses ResNet as the backbone for face feature extraction.

Using the train set, we connect each person mentioned in the articles with features extracted from the accompanying faces. During testing, we encode the faces in the image and aggregate the number of matched faces connected to the people mentioned in the text. Two faces are matched if the L2-distance between their vectors is less than 0.6. Images are ranked by the total number of matches.

2.4 Image-Text Fusion with Image Captioning and Contextual Embeddings
Based on the hypothesis that the description of an image is semantically similar to the article title, we build an image captioning model inspired by the tutorial "Image captioning with visual attention" [31]. The model has three main parts:
• Image feature extractor: We use EfficientNet [30] for feature extraction. The features have shape (8, 8, 2048).
• Feature encoder: The features pass through a fully connected layer, giving a 256-dimensional vector.
• Decoder: To generate the caption, we use Bahdanau attention [4] and a GRU to predict the next word.

We merge the train set with Flickr and COCO for training. We use fuzzywuzzy ratio and partial ratio string matching to compare captions and article titles. To represent the caption and the title as vectors, we use RoBERTa and the doc2vec [22] models enwiki_dbow and apnews_dbow. Then, we calculate the similarity of two vectors by cosine similarity. The final score is calculated as:

S_total = S_wiki + S_apnews + S_RoBERTa + (1 - D_fuzzy) + (1 - D_partial)

where S_wiki, S_apnews, and S_RoBERTa are the cosine similarities of the vector pairs generated by enwiki_dbow, apnews_dbow, and RoBERTa, and D_fuzzy and D_partial are the fuzzywuzzy and partial ratios, respectively.

2.5 Image-Text Fusion with Knowledge Graph-based Contextual Embeddings
We observe that image-text pairs may not have any explicit relationship. Yet, such pairs could still be remotely related through layers of abstraction. For example, an article about violence could feature a stock photo of a gun barrel. Although such a stock photo does not literally illustrate the textual content, we understand that a gun conveys a sense of threat, which, in turn, is related to violence. Thus, we consider exploiting knowledge graphs. On a knowledge graph such as BabelNet [24], the concept node of gun is also remotely connected with violence through intermediate nodes. Thus, we hypothesize that the projections of the textual and imagery content of a news article onto a knowledge graph would be connected, and their embeddings, in turn, could be in close proximity.

To implement this projection, we use the EWISER word sense disambiguator [6] to link textual entities from texts to their synsets in the WordNet subset of BabelNet. Then, the mean of the accompanying SensEmBERT+LMMS embeddings corresponding to these extracted synsets represents the texts. For the images, we first map images to the textual domain. To enhance the method by featuring abstract human-level concepts in the mapping, we decide to use TResNet-L with Asymmetric Loss (ASL) [5, 26] pre-trained on OpenImagesV6 [21] to extract multi-labels from images. Our decision is grounded since OpenImagesV6 features image-level labels conforming with the Freebase [7] knowledge graph, with figurative labels, e.g., festivals, sport, comedy, etc., while TResNet-L with ASL is the state-of-the-art method on the OpenImagesV6 multi-label benchmark. The extracted lists of labels are also linked with synsets using EWISER, and the mean of these synset embedding vectors represents the images. We then train a canonical correlation analysis (CCA) module with the vector representations on the train set before using it to transform the test set vectors. For relatedness measurement, for each test article, we rank all images in the test set using the L2-distance between the article vector and the image vectors.

3 EXPERIMENTAL RESULTS
3.1 Data preprocessing
The MediaEval 2020 Image-Text Re-Matching benchmark releases three batches of data in total, consisting of the ledes and titles of German news articles and their accompanying images. The first two batches are used for training, and the last one is used for testing.

To allow manual verification, we decide to translate all the text to English using Google Translate and employ this translated text in our experiments. All data batches are cleaned automatically, with images crawled using the given URLs and pairs with 404 Not Found URLs dropped from the train set.

3.2 Submissions

Table 1: Submission results

Method          Acc.    Recall@100  MRR@100
TripletLocal    0.0000  0.0248      0.0012
TripletGlobal   0.0002  0.0238      0.0013
Group-Face&Cap  0.0194  0.1322      0.0237
KG-Fusion       0.0051  0.1667      0.0164
Ensemble        0.0075  0.2064      0.0222

First, TripletLocal and TripletGlobal demonstrate the respective methods in Section 2.1. In both submissions, we empirically choose k = 30 to embed images with their top-k objects, then sort the candidate images for each article by the similarity of their embeddings to that of the article.

The Group-Face&Cap submission, meanwhile, combines three different methods. First, we match image-article pairs using the method in Section 2.2 with k = 5. However, at each iteration, we sort the candidates by the S_total score from Section 2.4. Finally, candidate images matched with the article through the method in Section 2.3 are prioritized to the top of the final result.

The KG-Fusion submission implements the method described in Section 2.5. Specifically, the TResNet-L with ASL model used for multi-label extraction uses a sigmoid threshold of 0.7, the EWISER disambiguator consumes chunks of 5 tokens, and the target decomposition of the CCA module has 64 components.

Finally, the Ensemble submission combines all described methods, weighting each model based on its effectiveness. As such, the final ranking of a candidate image is:

R_Ensemble = w1 * R_Caption + w2 * R_Triplet + w3 * R_Face + w4 * R_KG-Fusion

where R_Caption, R_Triplet, R_Face, and R_KG-Fusion are the ranks of the image produced by the Group-Face&Cap, TripletGlobal, Face Matching, and KG-Fusion methods, respectively. The weighting factors are empirically chosen to be w1 = w4 = 1, w2 = 0.02, and w3 = 0.25.

4 CONCLUSION AND FUTURE WORKS
Although our methods show poor accuracy, they systematically increase performance on the recall@100 metric. This fact validates our hypothesis that incorporating high-level semantics increases performance. Moreover, our methods yield consistent results, i.e., high-ranking images are relevant to the queried articles. Thus, they can still be useful for building news image recommendation systems, as news-image suitability is not injective in practice. The ensemble method's performance also suggests that practical system builders use multiple methods to handle different aspects of the complex image-text multimodal relation. In future works, we wish to investigate better fusion methods, conduct a thorough ablation study of the proposed methods, and enhance the dataset for thorough evaluation with information retrieval metrics like NDCG.

Acknowledgments: Research is supported by Vingroup Innovation Foundation (VINIF) in project code VINIF.2019.DA19.

REFERENCES
[1] 2020. Model from https://huggingface.co/german-nlp-group/electra-base-german-uncased. (2020).
[2] 2020. Model from https://huggingface.co/T-Systems-onsite/bert-german-dbmdz-uncased-sentence-stsb. (2020).
[3] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2017. Bottom-Up and Top-Down Attention for Image Captioning and VQA. CoRR abs/1707.07998 (2017). http://arxiv.org/abs/1707.07998
[4] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2016. Neural Machine Translation by Jointly Learning to Align and Translate. (2016). arXiv:cs.CL/1409.0473
[5] Emanuel Ben-Baruch, Tal Ridnik, Nadav Zamir, Asaf Noy, Itamar Friedman, Matan Protter, and Lihi Zelnik-Manor. 2020. Asymmetric Loss For Multi-Label Classification. arXiv preprint arXiv:2009.14119 (2020).
[6] Michele Bevilacqua and Roberto Navigli. 2020. Breaking through the 80% glass ceiling: Raising the state of the art in Word Sense Disambiguation by incorporating knowledge graph information. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2854-2864.
[7] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data. 1247-1250.
[8] Branden Chan, Timo Möller, Malte Pietsch, and Tanay Soni. 2020. Model from https://huggingface.co/bert-base-german-cased. (2020).
[9] Branden Chan, Stefan Schweter, and Timo Möller. 2020. German's Next Language Model. (2020). arXiv:cs.CL/2010.10906
[10] François Chollet. 2017. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition. 1251-1258.
[11] dbmdz. 2020. Model from https://huggingface.co/dbmdz/bert-base-german-uncased. (2020).
[12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. (2019). arXiv:cs.CL/1810.04805
[13] Adam Geitgey. 2018. Face Recognition. (2018). https://github.com/ageitgey/face_recognition
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Identity mappings in deep residual networks. In European conference on computer vision. Springer, 630-645.
[15] Elad Hoffer and Nir Ailon. 2018. Deep metric learning using Triplet network. (2018). arXiv:cs.LG/1412.6622
[16] MD Zakir Hossain, Ferdous Sohel, Mohd Fairuz Shiratuddin, and Hamid Laga. 2019. A comprehensive survey of deep learning for image captioning. ACM Computing Surveys (CSUR) 51, 6 (2019), 1-36.
[17] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. 2017. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 4700-4708.
[18] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Advances in Neural Information Processing Systems, Vol. 30. Curran Associates, Inc., 3146-3154. https://proceedings.neurips.cc/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf
[19] Benjamin Kille, Andreas Lommatzsch, and Özlem Özgöbek. 2020. News Images in MediaEval 2020. In Proc. of the MediaEval 2020 Workshop. Online.
[20] Davis E. King. 2018. dlib-models. (2018). https://github.com/davisking/dlib-models
[21] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, Tom Duerig, and Vittorio Ferrari. 2020. The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. IJCV (2020).
[22] Jey Han Lau and Timothy Baldwin. 2016. An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation. (2016). arXiv:cs.CL/1607.05368
[23] Patrice Lopez. 2020. Entity Fishing. (2020). https://github.com/kermitt2/entity-fishing
[24] Roberto Navigli and Simone Paolo Ponzetto. 2010. BabelNet: Building a very large multilingual semantic network. In Proceedings of the 48th annual meeting of the association for computational linguistics. 216-225.
[25] NHJ Oostdijk, H van Halteren, Erkan Basar, and Martha A Larson. 2020. The Connection between the Text and Images of News Articles: New Insights for Multimedia Analysis. (2020).
[26] Tal Ridnik, Hussam Lawen, Asaf Noy, and Itamar Friedman. 2020. TResNet: High Performance GPU-Dedicated Architecture. arXiv preprint arXiv:2003.13630 (2020).
[27] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 4510-4520.
[28] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[29] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alex Alemi. 2016. Inception-v4, Inception-ResNet and the impact of residual connections on learning. arXiv preprint arXiv:1602.07261 (2016).
[30] Mingxing Tan and Quoc V. Le. 2020. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. (2020). arXiv:cs.LG/1905.11946
[31] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. 2016. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. (2016). arXiv:cs.LG/1502.03044
[32] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. 2018. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 8697-8710.
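The iterative category-ranking procedure of Section 2.2 can be sketched in a few lines. This is a minimal illustration, not the paper's actual code: the function and variable names (`rank_by_categories`, `image_cats`) are hypothetical, and each category list is assumed to be sorted by classifier confidence, best category first.

```python
def rank_by_categories(article_cats, image_cats, k_max=5):
    """Rank candidate images for one article by the earliest iteration k
    at which their top-k categories intersect the article's top-k.

    article_cats: the article's predicted categories, best first.
    image_cats: dict mapping image id -> its predicted categories, best first.
    """
    ranked, seen = [], set()
    for k in range(1, k_max + 1):
        article_topk = set(article_cats[:k])
        for img, cats in image_cats.items():
            if img not in seen and article_topk & set(cats[:k]):
                ranked.append(img)
                seen.add(img)
    # Non-candidate images keep their original order at the end of the list.
    ranked += [img for img in image_cats if img not in seen]
    return ranked
```

Candidates that share a 1st-ranked category are thus always listed before candidates that only share lower-ranked categories, which plain Jaccard similarity over the category sets would not guarantee.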
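The triplet objective with hard sample mining used in the metric-learning baseline can be illustrated with a small NumPy sketch. The real model is trained end-to-end on the projected embeddings; the function name, margin value, and mining-over-a-pool setup below are illustrative assumptions.

```python
import numpy as np

def triplet_loss_hard_mining(anchors, positives, negatives, margin=0.2):
    """Triplet loss: for each (anchor, positive) pair, mine the hardest
    negative (the one closest to the anchor) from the negative pool, then
    compute max(0, d(a, p) - d(a, n) + margin)."""
    losses = []
    for a, p in zip(anchors, positives):
        d_pos = np.linalg.norm(a - p)
        # Hard mining: the closest negative gives the largest loss term.
        d_neg = min(np.linalg.norm(a - n) for n in negatives)
        losses.append(max(0.0, d_pos - d_neg + margin))
    return float(np.mean(losses))
```

When the mined negative is already farther from the anchor than the positive plus the margin, the loss term is zero and that triplet contributes no gradient.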
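The face-matching rule of the face-name method (two faces match when the L2-distance between their encodings is below 0.6) reduces to a short helper; `count_matched_faces` is an illustrative name, not the paper's code, and the vectors here stand in for the 128-dimensional face encodings.

```python
import numpy as np

FACE_MATCH_THRESHOLD = 0.6  # L2-distance below which two encodings match

def count_matched_faces(image_encodings, known_encodings):
    """Count faces detected in a test image that match any face linked
    (via the train set) to a person named in the article.
    Each detected face counts at most once."""
    count = 0
    for enc in image_encodings:
        if any(np.linalg.norm(enc - known) < FACE_MATCH_THRESHOLD
               for known in known_encodings):
            count += 1
    return count
```

Candidate images are then sorted in descending order of this count.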
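The S_total score combines three cosine similarities with two fuzzy-string distances. A sketch, under the assumption that the fuzzywuzzy ratios (normally on a 0-100 scale) have been rescaled to distances D in [0, 1]; the helper names are illustrative.

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def s_total(v_wiki, v_apnews, v_roberta, d_fuzzy, d_partial):
    """Each v_* argument is a (caption_vector, title_vector) pair from one
    embedding model (enwiki_dbow, apnews_dbow, RoBERTa); d_fuzzy and
    d_partial are normalized string distances in [0, 1]."""
    return (cosine_sim(*v_wiki) + cosine_sim(*v_apnews)
            + cosine_sim(*v_roberta) + (1 - d_fuzzy) + (1 - d_partial))
```

With identical caption/title vectors and zero string distances, every term reaches its maximum, so the score is bounded above by 5.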
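The ensemble's weighted combination of per-method ranks can be sketched as follows, with the paper's weights (w1 = w4 = 1, w2 = 0.02, w3 = 0.25) as defaults; `combine_rankings` is a hypothetical helper, and a lower weighted rank sum is treated as better.

```python
def combine_rankings(rank_lists, weights=(1.0, 0.02, 0.25, 1.0)):
    """rank_lists: one dict per method mapping image id -> rank (1 = best),
    in the order Caption, Triplet, Face, KG-Fusion.
    weights: one weight per method.
    Returns image ids sorted by the weighted sum of their ranks."""
    scores = {}
    for ranks, w in zip(rank_lists, weights):
        for img, r in ranks.items():
            scores[img] = scores.get(img, 0.0) + w * r
    return sorted(scores, key=scores.get)
```

The small weight on the triplet rank means it acts mostly as a tie-breaker, while the caption-based and knowledge-graph ranks dominate the final ordering.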