HCMUS at MediaEval 2020: Image-Text Fusion for Automatic News-Images Re-Matching

Thuc Nguyen-Quang*1,3, Tuan-Duy H. Nguyen*1,3, Thang-Long Nguyen-Ho*1,3, Anh-Kiet Duong*1,3, Nhat Hoang-Xuan*1,3, Vinh-Thuyen Nguyen-Truong*1,3, Hai-Dang Nguyen1,3, Minh-Triet Tran1,2,3
1 University of Science, VNU-HCM, 2 John von Neumann Institute, VNU-HCM, 3 Vietnam National University, Ho Chi Minh City, Vietnam
{nqthuc,nhtduy,ntvthuyen}@apcs.vn, {nhtlong,hxnhat,nhdang}@selab.hcmus.edu.vn, 18120046@student.hcmus.edu.vn, tmtriet@fit.hcmus.edu.vn

Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'20, 14-15 December 2020, Online.

ABSTRACT
Matching text and images based on their semantics plays an important role in cross-media retrieval. In news in particular, the connection between text and images is highly ambiguous. In the context of the MediaEval 2020 Challenge, we propose three multi-modal methods for mapping the text and images of news articles to a shared space in order to perform efficient cross-retrieval. Our methods show systematic improvement and validate our hypotheses, while the best-performing method reaches a recall@100 score of 0.2064.

1 INTRODUCTION
News articles represent a complex class of multimedia, whose textual content and accompanying images might not be explicitly related [25]. Existing research in the multimedia and recommendation system domains mostly investigates image-text pairs with simple relationships, e.g., image captions that literally describe components of the images [16]. To address this, the MediaEval 2020 NewsImages Task calls for researchers to investigate the real-world relationship of news text and images in more depth, in order to understand its implications for journalism and news recommendation systems [19]. Our team at HCMUS responds to this call by addressing the Image-Text Re-Matching task. Particularly, given a set of image-text pairs in the wild, the task requires us to correctly re-assign images to their decoupled articles, with the aim of understanding the implications for journalism in choosing illustrative images.

Our methods mainly concern fusing cross-modal embeddings for automatic matching. We experimented with a range of embedded information, including simple set intersection, deep neural features, and knowledge-graph-enhanced neural features. We combine such features in various ways for various experiments. Finally, we obtain our best result with an ensemble of the experimented methods.

2 METHODS
2.1 Metric Learning
The primary idea of this baseline method is to use metric learning to project the embeddings of image-text pairs onto bases of significant similarity. Particularly, we use two approaches to embed image features: global context embedding and local context embedding. In the first approach, we use EfficientNet [30], a state-of-the-art classification architecture, to extract features of the image before taking the flattened output features. Our motivation in the latter approach is to harness critical local information from the extracted global context. Thus, we use the bottom-up-attention model [3] to extract the top-k objects based on their confidence scores, before passing them to a self-attention sequential model. For both routines, we employ the BERT [12] language model to embed textual content, then project the textual and image embeddings to the same dimension. Finally, we train our Triplet Loss [15] model with positive and negative pairs from a hard sample miner.

2.2 Image-Text Matching via Categorization
In this method, we train two gradient boosting decision trees [18], one for categorizing images and the other for categorizing articles. The target categories are ['nrw', 'kultur', 'region', 'panorama', 'sport', 'wirtschaft', 'koeln', 'ratgeber', 'politik', 'unknown'], which are deduced from URLs in the train set.

We use features extracted from images and text to train the decision trees. To augment the data, we use VGG16, InceptionResNetV2, MobileNetV2, EfficientNetB1-7, Xception, ResNet152V2, NASNetLarge, and DenseNet201 [10, 14, 17, 27-30, 32] for images, while using pretrained BERT models [2, 8, 9, 11] and pretrained ELECTRA models [1, 9] to extract contextual features.

We presume that images and articles of the same category might have some relation. Moreover, the rank of matching categories also affects the ranking. For example, an image-text pair sharing a 3rd-ranked category might be less relevant than a pair sharing a 1st-ranked category. Hence, instead of using Jaccard similarity, we propose an iterative ranking method that takes into account the order of matched categories. At the k-th iteration, our method first finds the top-k categories for each image and the top-k categories for each article. Then, for each article, we create a list of candidate images whose top-k categories intersect those of the article. This list of candidates at the k-th iteration is concatenated to the final list. Finally, the remaining images that are not candidates are kept in their original order and concatenated to the end of the final list.

2.3 Graph-based Face-Name Matching
Based on our observation, in many instances the publisher uses a portrait of somebody mentioned in the text. We build a face-name graph to represent the relations between names and faces.

Person name extraction: To automatically extract people's names from the text, we use entity-fishing [23], an open-source high-performance entity recognition and disambiguation tool. It relies on Random Forest and Gradient Tree Boosting to recognize named entities, in our case people's names, and links them against Wikidata entities using their word embeddings and Wikidata entity embeddings.

Face encoding: We use the face_recognition open-source library [13] to detect and represent each face as a 128-dimensional vector. The tool uses a pre-trained model from the dlib-models repository [20] and chooses ResNet as the backbone for face feature extraction.

Using the train set, we connect each person mentioned in the articles with features extracted from the accompanying faces. During testing, we encode the faces in the image and aggregate the number of matched faces connected to the people mentioned in the text. Two faces are matched if the L2-distance between their vectors is less than 0.6. Images are ranked by the total number of matches.

2.4 Image-Text Fusion with Image Captioning and Contextual Embeddings
Based on the hypothesis that the description of an image is semantically similar to the article title, we build an image captioning model inspired by the tutorial "Image captioning with visual attention" [31]. The model has three main parts:
• Image feature extractor: We use EfficientNet [30] for feature extraction. The features have shape (8, 8, 2048).
• Feature encoder: The features pass through a fully connected layer, giving a 256-dimensional vector.
• Decoder: To generate the caption, we use Bahdanau attention [4] and a GRU to predict the next word.

We merge the train set with Flickr and COCO for training. We use fuzzywuzzy ratio and partial ratio string matching to compare captions and article titles. To represent the caption and the title as vectors, we use RoBERTa and the doc2vec [22] models enwiki_dbow and apnews_dbow. Then, we calculate the similarity of two vectors by cosine similarity. The final score is calculated as:

S_total = S_wiki + S_apnews + S_RoBERTa + (1 - D_fuzzy) + (1 - D_partial)

where S_wiki, S_apnews, and S_RoBERTa are the cosine similarities of the vector pairs generated by enwiki_dbow, apnews_dbow, and RoBERTa, and D_fuzzy and D_partial are the fuzzywuzzy and partial ratios, respectively.

2.5 Image-Text Fusion with Knowledge Graph-based Contextual Embeddings
We observe that image-text pairs may not have any explicit relationship. Yet, such pairs could still be remotely related through layers of abstraction. For example, an article about violence could feature a stock photo of a gun barrel. Although such a stock photo does not literally illustrate the textual content, we understand that a gun conveys a sense of threat, which, in turn, is related to violence. Thus, we consider exploiting knowledge graphs. On a knowledge graph such as BabelNet [24], the concept node of gun is also remotely connected with violence through intermediate nodes. Thus, we hypothesize that the projections of the textual and imagery content of a news article onto a knowledge graph would be connected, and their embeddings, in turn, could be in close proximity.

To implement this projection, we use the EWISER word sense disambiguator [6] to link textual entities from texts to their synsets in the WordNet subset of BabelNet. Then, the mean of the accompanying SensEmBERT+LMMS embeddings corresponding to these extracted synsets represents the texts. For the images, we first map images to the textual domain. To enhance the method by featuring abstract human-level concepts in the mapping, we decide to use TResNet-L with Asymmetric Loss (ASL) [5, 26] pre-trained on OpenImagesV6 [21] to extract multi-labels from images. Our decision is grounded since OpenImagesV6 features image-level labels conforming with the Freebase [7] knowledge graph, with figurative labels, e.g., festivals, sport, comedy, etc., while TResNet-L with ASL is the state-of-the-art method on the OpenImagesV6 multi-label benchmark. The extracted lists of labels are also linked with synsets using EWISER, and the mean of these synset embedding vectors represents the images. We then train a canonical correlation analysis (CCA) module with the vector representations on the train set before using it to transform the test set vectors. For relatedness measurement, for each test article, we rank all images in the test set using the L2-distance between the article vector and the image vectors.

3 EXPERIMENTAL RESULTS
3.1 Data preprocessing
The MediaEval 2020 Image-Text Re-Matching benchmark releases three batches of data in total, consisting of the ledes and titles of German news articles and their accompanying images. The first two batches are used for training, and the last one is used for testing.

To allow manual verification, we decide to translate all the text to English using Google Translate and employ this translated text in our experiments. All data batches are cleaned automatically, with images crawled using the given URLs and pairs with 404 Not Found URLs dropped from the train set.

3.2 Submissions

Table 1: Submission results

Method          Acc.    Recall@100  MRR@100
TripletLocal    0.0000  0.0248      0.0012
TripletGlobal   0.0002  0.0238      0.0013
Group-Face&Cap  0.0194  0.1322      0.0237
KG-Fusion       0.0051  0.1667      0.0164
Ensemble        0.0075  0.2064      0.0222

First, TripletLocal and TripletGlobal demonstrate the respective methods in Section 2.1. In both submissions, we empirically choose k = 30 to embed images with their top-k objects, then sort the candidate images for each article by the similarity of their embeddings to that of the article.

The Group-Face&Cap submission, meanwhile, combines three different methods. First, we match image-article pairs using the method in Section 2.2 with k = 5. However, at each iteration, we sort the candidates by the S_total score from Section 2.4. Finally, candidate images matched with the article through the method in Section 2.3 are prioritized to the top of the final result.

The KG-Fusion submission implements the method described in Section 2.5. Specifically, the TResNet-L with ASL model used for multi-label extraction uses a sigmoid threshold of 0.7, the EWISER disambiguator consumes chunks of 5 tokens, and the target decomposition of the CCA module has 64 components.

Finally, the Ensemble submission combines all described methods, weighting each model based on its effectiveness. As such, the final ranking of a candidate image is:

R_Ensemble = w1 * R_Caption + w2 * R_Triplet + w3 * R_Face + w4 * R_KG-Fusion

where R_Caption, R_Triplet, R_Face, and R_KG-Fusion are the ranks of the image produced by the Group-Face&Cap, TripletGlobal, Face Matching, and KG-Fusion methods, respectively. The weighting factors are empirically chosen to be w1 = w4 = 1, w2 = 0.02, and w3 = 0.25.

4 CONCLUSION AND FUTURE WORKS
Although our methods show poor accuracy, they systematically increase performance on the recall@100 metric. This fact validates our hypothesis that incorporating high-level semantics increases performance. Moreover, our methods yield consistent results, i.e., high-ranking images are relevant to the queried articles. Thus, they can still be useful for building news image recommendation systems, as news-image suitability is not injective in practice. The ensemble method's performance also suggests that practical system builders use multiple methods to handle different aspects of the complex image-text multimodal relation. In future works, we wish to investigate better fusion methods, conduct a thorough ablation study of the proposed methods, and enhance the dataset for thorough evaluation with information retrieval metrics like NDCG.

Acknowledgments: Research is supported by Vingroup Innovation Foundation (VINIF) in project code VINIF.2019.DA19.

REFERENCES
[1] 2020. Model from https://huggingface.co/german-nlp-group/electra-base-german-uncased. (2020).
[2] 2020. Model from https://huggingface.co/T-Systems-onsite/bert-german-dbmdz-uncased-sentence-stsb. (2020).
[3] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2017. Bottom-Up and Top-Down Attention for Image Captioning and VQA. CoRR abs/1707.07998 (2017). http://arxiv.org/abs/1707.07998
[4] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2016. Neural Machine Translation by Jointly Learning to Align and Translate. (2016). arXiv:cs.CL/1409.0473
[5] Emanuel Ben-Baruch, Tal Ridnik, Nadav Zamir, Asaf Noy, Itamar Friedman, Matan Protter, and Lihi Zelnik-Manor. 2020. Asymmetric Loss For Multi-Label Classification. arXiv preprint arXiv:2009.14119 (2020).
[6] Michele Bevilacqua and Roberto Navigli. 2020. Breaking through the 80% glass ceiling: Raising the state of the art in Word Sense Disambiguation by incorporating knowledge graph information. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2854-2864.
[7] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data. 1247-1250.
[8] Branden Chan, Timo Möller, Malte Pietsch, and Tanay Soni. 2020. Model from https://huggingface.co/bert-base-german-cased. (2020).
[9] Branden Chan, Stefan Schweter, and Timo Möller. 2020. German's Next Language Model. (2020). arXiv:cs.CL/2010.10906
[10] François Chollet. 2017. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition. 1251-1258.
[11] dbmdz. 2020. Model from https://huggingface.co/dbmdz/bert-base-german-uncased. (2020).
[12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. (2019). arXiv:cs.CL/1810.04805
[13] Adam Geitgey. 2018. Face Recognition. (2018). https://github.com/ageitgey/face_recognition
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Identity mappings in deep residual networks. In European conference on computer vision. Springer, 630-645.
[15] Elad Hoffer and Nir Ailon. 2018. Deep metric learning using Triplet network. (2018). arXiv:cs.LG/1412.6622
[16] MD Zakir Hossain, Ferdous Sohel, Mohd Fairuz Shiratuddin, and Hamid Laga. 2019. A comprehensive survey of deep learning for image captioning. ACM Computing Surveys (CSUR) 51, 6 (2019), 1-36.
[17] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. 2017. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 4700-4708.
[18] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Advances in Neural Information Processing Systems, Vol. 30. Curran Associates, Inc., 3146-3154. https://proceedings.neurips.cc/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf
[19] Benjamin Kille, Andreas Lommatzsch, and Özlem Özgöbek. 2020. News Images in MediaEval 2020. In Proc. of the MediaEval 2020 Workshop. Online.
[20] Davis E. King. 2018. dlib-models. (2018). https://github.com/davisking/dlib-models
[21] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, Tom Duerig, and Vittorio Ferrari. 2020. The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. IJCV (2020).
[22] Jey Han Lau and Timothy Baldwin. 2016. An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation. (2016). arXiv:cs.CL/1607.05368
[23] Patrice Lopez. 2020. Entity Fishing. (2020). https://github.com/kermitt2/entity-fishing
[24] Roberto Navigli and Simone Paolo Ponzetto. 2010. BabelNet: Building a very large multilingual semantic network. In Proceedings of the 48th annual meeting of the association for computational linguistics. 216-225.
[25] NHJ Oostdijk, H van Halteren, Erkan Basar, and Martha A Larson. 2020. The Connection between the Text and Images of News Articles: New Insights for Multimedia Analysis. (2020).
[26] Tal Ridnik, Hussam Lawen, Asaf Noy, and Itamar Friedman. 2020. TResNet: High Performance GPU-Dedicated Architecture. arXiv preprint arXiv:2003.13630 (2020).
[27] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 4510-4520.
[28] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[29] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alex Alemi. 2016. Inception-v4, Inception-ResNet and the impact of residual connections on learning. arXiv preprint arXiv:1602.07261 (2016).
[30] Mingxing Tan and Quoc V. Le. 2020. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. (2020). arXiv:cs.LG/1905.11946
[31] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. 2016. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. (2016). arXiv:cs.LG/1502.03044
[32] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. 2018. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 8697-8710.
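The iterative category-ranking procedure of Section 2.2 can be sketched in a few lines. This is a minimal illustration, not the paper's actual code: the function and variable names (`rank_by_categories`, `image_cats`) are hypothetical, and each category list is assumed to be sorted by classifier confidence, best category first.

```python
def rank_by_categories(article_cats, image_cats, k_max=5):
    """Rank candidate images for one article by the earliest iteration k
    at which their top-k categories intersect the article's top-k.

    article_cats: the article's predicted categories, best first.
    image_cats: dict mapping image id -> its predicted categories, best first.
    """
    ranked, seen = [], set()
    for k in range(1, k_max + 1):
        article_topk = set(article_cats[:k])
        for img, cats in image_cats.items():
            if img not in seen and article_topk & set(cats[:k]):
                ranked.append(img)
                seen.add(img)
    # Non-candidate images keep their original order at the end of the list.
    ranked += [img for img in image_cats if img not in seen]
    return ranked
```

Candidates that share a 1st-ranked category are thus always listed before candidates that only share lower-ranked categories, which plain Jaccard similarity over the category sets would not guarantee.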
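The triplet objective with hard sample mining used in the metric-learning baseline can be illustrated with a small NumPy sketch. The real model is trained end-to-end on the projected embeddings; the function name, margin value, and mining-over-a-pool setup below are illustrative assumptions.

```python
import numpy as np

def triplet_loss_hard_mining(anchors, positives, negatives, margin=0.2):
    """Triplet loss: for each (anchor, positive) pair, mine the hardest
    negative (the one closest to the anchor) from the negative pool, then
    compute max(0, d(a, p) - d(a, n) + margin)."""
    losses = []
    for a, p in zip(anchors, positives):
        d_pos = np.linalg.norm(a - p)
        # Hard mining: the closest negative gives the largest loss term.
        d_neg = min(np.linalg.norm(a - n) for n in negatives)
        losses.append(max(0.0, d_pos - d_neg + margin))
    return float(np.mean(losses))
```

When the mined negative is already farther from the anchor than the positive plus the margin, the loss term is zero and that triplet contributes no gradient.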
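The face-matching rule of the face-name method (two faces match when the L2-distance between their encodings is below 0.6) reduces to a short helper; `count_matched_faces` is an illustrative name, not the paper's code, and the vectors here stand in for the 128-dimensional face encodings.

```python
import numpy as np

FACE_MATCH_THRESHOLD = 0.6  # L2-distance below which two encodings match

def count_matched_faces(image_encodings, known_encodings):
    """Count faces detected in a test image that match any face linked
    (via the train set) to a person named in the article.
    Each detected face counts at most once."""
    count = 0
    for enc in image_encodings:
        if any(np.linalg.norm(enc - known) < FACE_MATCH_THRESHOLD
               for known in known_encodings):
            count += 1
    return count
```

Candidate images are then sorted in descending order of this count.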
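The S_total score combines three cosine similarities with two fuzzy-string distances. A sketch, under the assumption that the fuzzywuzzy ratios (normally on a 0-100 scale) have been rescaled to distances D in [0, 1]; the helper names are illustrative.

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def s_total(v_wiki, v_apnews, v_roberta, d_fuzzy, d_partial):
    """Each v_* argument is a (caption_vector, title_vector) pair from one
    embedding model (enwiki_dbow, apnews_dbow, RoBERTa); d_fuzzy and
    d_partial are normalized string distances in [0, 1]."""
    return (cosine_sim(*v_wiki) + cosine_sim(*v_apnews)
            + cosine_sim(*v_roberta) + (1 - d_fuzzy) + (1 - d_partial))
```

With identical caption/title vectors and zero string distances, every term reaches its maximum, so the score is bounded above by 5.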
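The ensemble's weighted combination of per-method ranks can be sketched as follows, with the paper's weights (w1 = w4 = 1, w2 = 0.02, w3 = 0.25) as defaults; `combine_rankings` is a hypothetical helper, and a lower weighted rank sum is treated as better.

```python
def combine_rankings(rank_lists, weights=(1.0, 0.02, 0.25, 1.0)):
    """rank_lists: one dict per method mapping image id -> rank (1 = best),
    in the order Caption, Triplet, Face, KG-Fusion.
    weights: one weight per method.
    Returns image ids sorted by the weighted sum of their ranks."""
    scores = {}
    for ranks, w in zip(rank_lists, weights):
        for img, r in ranks.items():
            scores[img] = scores.get(img, 0.0) + w * r
    return sorted(scores, key=scores.get)
```

The small weight on the triplet rank means it acts mostly as a tie-breaker, while the caption-based and knowledge-graph ranks dominate the final ordering.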