<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">HCMUS at MediaEval 2020: Image-Text Fusion for Automatic News-Images Re-Matching</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Thuc</forename><surname>Nguyen-Quang</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution" key="instit1">University of Science</orgName>
								<orgName type="institution" key="instit2">VNU-HCM</orgName>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="institution">Vietnam National University</orgName>
								<address>
									<settlement>Ho Chi Minh City</settlement>
									<country key="VN">Vietnam</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Tuan-Duy</forename><forename type="middle">H</forename><surname>Nguyen</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution" key="instit1">University of Science</orgName>
								<orgName type="institution" key="instit2">VNU-HCM</orgName>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="institution">Vietnam National University</orgName>
								<address>
									<settlement>Ho Chi Minh City</settlement>
									<country key="VN">Vietnam</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Thang-Long</forename><surname>Nguyen-Ho</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution" key="instit1">University of Science</orgName>
								<orgName type="institution" key="instit2">VNU-HCM</orgName>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="institution">Vietnam National University</orgName>
								<address>
									<settlement>Ho Chi Minh City</settlement>
									<country key="VN">Vietnam</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Anh-Kiet</forename><surname>Duong</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution" key="instit1">University of Science</orgName>
								<orgName type="institution" key="instit2">VNU-HCM</orgName>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="institution">Vietnam National University</orgName>
								<address>
									<settlement>Ho Chi Minh City</settlement>
									<country key="VN">Vietnam</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Nhat</forename><surname>Hoang-Xuan</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution" key="instit1">University of Science</orgName>
								<orgName type="institution" key="instit2">VNU-HCM</orgName>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="institution">Vietnam National University</orgName>
								<address>
									<settlement>Ho Chi Minh City</settlement>
									<country key="VN">Vietnam</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Vinh-Thuyen</forename><surname>Nguyen-Truong</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution" key="instit1">University of Science</orgName>
								<orgName type="institution" key="instit2">VNU-HCM</orgName>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="institution">Vietnam National University</orgName>
								<address>
									<settlement>Ho Chi Minh City</settlement>
									<country key="VN">Vietnam</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Hai-Dang</forename><surname>Nguyen</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution" key="instit1">University of Science</orgName>
								<orgName type="institution" key="instit2">VNU-HCM</orgName>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="institution">Vietnam National University</orgName>
								<address>
									<settlement>Ho Chi Minh City</settlement>
									<country key="VN">Vietnam</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Minh-Triet</forename><surname>Tran</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution" key="instit1">University of Science</orgName>
								<orgName type="institution" key="instit2">VNU-HCM</orgName>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department">John von Neumann Institute</orgName>
								<orgName type="institution">VNU-HCM</orgName>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="institution">Vietnam National University</orgName>
								<address>
									<settlement>Ho Chi Minh City</settlement>
									<country key="VN">Vietnam</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">HCMUS at MediaEval 2020: Image-Text Fusion for Automatic News-Images Re-Matching</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">5EE52B0781D19BF481CCE0E08F0F4EF6</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T07:11+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Matching text and images based on their semantics has an important role in cross-media retrieval. Especially, in terms of news, text and images connection is highly ambiguous. In the context of MediaEval 2020 Challenge, we propose three multi-modal methods for mapping text and images of news articles to the shared space in order to perform efficient cross-retrieval. Our methods show systemic improvement and validate our hypotheses, while the best-performed method reaches a recall@100 score of 0.2064.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>News articles represent a complex class of multimedia, whose textual content and accompanying images might not be explicitly related <ref type="bibr" target="#b22">[25]</ref>. Existing research in multimedia and recommendation system domains mostly investigate image-text pairs with simple relationships, e.g., image captions that literally describe components of the images <ref type="bibr" target="#b13">[16]</ref>. To address this, the MediaEval 2020 NewsImages Task calls for researchers to investigate the real-world relationship of news text and images in more depth, in order to understand its implications for journalism and news recommendation systems <ref type="bibr" target="#b16">[19]</ref>.</p><p>Our team at HCMUS responds to this call by addressing the Image-Text Re-Matching task. Particularly, given a set of image-text pairs in the wild, the task requires us to correctly re-assign images to their decoupled articles, with the aim to understand the implication of journalism in choosing illustrative images.</p><p>Our methods mainly concern fusing cross-modal embeddings for automatic matching. We experimented with a range of embedded information, including simple set intersection, deep neural features, and knowledge-graph-enhanced neural features. We combine such features in various ways for various experiments. Finally, we obtain our best result with the ensemble of experimented methods.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">METHODS 2.1 Metric Learning</head><p>The primary idea of this baseline method is using metric learning to project embeddings of image-text pairs to bases of significant similarity. Particularly, we use two approaches to embed image features: global context embedding and local context embedding. In the first approach, we use the EfficientNet <ref type="bibr" target="#b27">[30]</ref>, a SOTA classification architecture, to extract features of the image before taking the flatten output features. Our motivation in the latter approach is to harness critical local information from the extracted global context. Thus, we use the bottom-up-attention model <ref type="bibr" target="#b0">[3]</ref> to extract the top-𝑘 objects based on their confidence score, before passing them over to a self-attention sequential model. For both routines, we employ BERT <ref type="bibr" target="#b9">[12]</ref> language model to embed textual content, then project the textual and image embeddings to the same dimension. Finally, we train our Triplet Loss <ref type="bibr" target="#b12">[15]</ref> model with positive and negative pairs from a hard sample miner.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Image-Text Matching via Categorization</head><p>In this method, we train two gradient boosting decision trees <ref type="bibr" target="#b15">[18]</ref>, one for categorizing images, and the other for categorizing articles. The target categories are ['nrw', 'kultur', 'region', 'panorama', 'sport', 'wirtschaft', 'koeln', 'ratgeber', 'politik', 'unknown'], which are deduced from URLs in the train set.</p><p>We use features extracted for images and text to train the decision tree. To augment the data, we use VGG16, InceptionResNetV2, Mo-bileNetV2, EfficientNetB1-7, Xception, ResNet152V2, NASNetLarge, DenseNet201 <ref type="bibr" target="#b7">[10,</ref><ref type="bibr" target="#b11">14,</ref><ref type="bibr" target="#b14">17,</ref><ref type="bibr" target="#b24">[27]</ref><ref type="bibr" target="#b25">[28]</ref><ref type="bibr" target="#b26">[29]</ref><ref type="bibr" target="#b27">[30]</ref><ref type="bibr" target="#b29">32]</ref> for images, while using pretrained BERT models[2, 8, 9, 11], and pretrained ELECTRA models <ref type="bibr">[1,</ref><ref type="bibr" target="#b6">9]</ref> to extract contextual features.</p><p>We presume that images and articles of the same category might have some relations. Moreover, the rank of matching categories also affects ranking. For example, an image-text pair sharing a 3rd-ranked category might be less relevant than the pair sharing a 1st-ranked category. Hence, instead of using Jaccard similarity, we propose an iterative ranking method that takes into account the order of matched categories. At the 𝑘-th iteration, our method first finds top-𝑘 categories for each image and top-𝑘 categories for each article. Then for each article, we create a list of candidate images whose top-𝑘 categories intersect that of the article. This list of candidates at the 𝑘-th iteration is concatenated to the final list. Finally, the remaining images that are not candidates are kept in their order and concatenated to the end of the final list.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Graph-based Face-Name Matching</head><p>Based on our observation, in a lot of instances, the publisher uses a portrait of somebody mentioned in the text. We build the face-name graph to represent the relation between the name and the face.</p><p>Person name extraction: To automatically extract people's name from the text, we use entity-fishing <ref type="bibr" target="#b20">[23]</ref> -an open-source highperformance entity recognition and disambiguation tool. It relies on Random Forest and Gradient Tree Boosting to recognize named entities, in our case people's names, and link them against Wikidata entities using their word embeddings and Wikidata entities' embeddings.</p><p>Face encoding: We use face recognition open-source library <ref type="bibr" target="#b10">[13]</ref> to detect and represent the face as 128-dims vector. The tool uses a pre-trained model from the dlib-models repository <ref type="bibr" target="#b17">[20]</ref> and chooses ResNet as the backbone for face feature extraction.</p><p>Using the train set, we connect each person mentioned in the articles with features extracted from accompanying faces. During testing, we encode the face from the image and aggregate the number of matched faces connected to the people mentioned in the text. Two faces are matched if 𝐿2-distance between two vectors less than 0.6. The ranking of images is sorted by the total matched. T. Nguyen-Quang et al. Based on the hypothesis that the description of the image is semantically similar to the title, we build an image captioning model which is inspired by the tutorial Image captioning with visual attention <ref type="bibr" target="#b28">[31]</ref>.</p><p>The model has three main parts:</p><p>• Image feature extractor: We use EfficientNet <ref type="bibr" target="#b27">[30]</ref> for feature extraction. The feature has the shape (8, 8, 2048) • Feature encoder: The features pass through fully connected giving a vector 256-dims. • Decoder: To generate the caption, we use Bahdanau attention <ref type="bibr" target="#b1">[4]</ref> and GRU to predict the next word. We merge the train set with Flickr and COCO for training. We use fuzzywuzzy ratio and partial ratio string matching to compare captions and articles title. To represent the caption and the title as a vector, we use RoBERTa and doc2vec <ref type="bibr" target="#b19">[22]</ref> enwiki_dbow, apnews_dbow. Then, we calculate the similarity of two vectors by cosine similarity. The final score is calculated by: 𝑆 total =𝑆 wiki +𝑆 apnews +𝑆 RoBERTa + (1−𝐷 fuzzy ) + (1−𝐷 partial ) where 𝑆 wiki , 𝑆 apnews , 𝑆 RoBERTa are cosine similarity of two vectors generated by enwiki_dbow, apnews_dbow, RoBERTa, and 𝐷 𝑓 𝑢𝑧𝑧𝑦 , 𝐷 𝑝𝑎𝑟𝑡𝑖𝑎𝑙 are fuzzywuzzy and partial ratios, respectively.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.5">Image-Text Fusion with Knowledge Graph-based Contextual Embeddings</head><p>We observe that image-text pairs may not have any explicit relationships. Yet, such text-image pairs could still remotely related through layers of abstraction. For example, an article about violence could feature a stock photo of a gun barrel. Although such a stock photo does not literally illustrate the textual content, we understand that a gun conveys a sense of threat, which, in turn, is related to violence. Thus, we consider exploiting knowledge graphs. On a knowledge graph, such as BabelNet <ref type="bibr" target="#b21">[24]</ref>, the concept node of gun is also remotely connected with violence through intermediate nodes. Thus, we hypothesize that the projection of the textual and imagery content of a news article onto a knowledge graph would be connected, and their embeddings, in turns, could be in close proximity.</p><p>To implement this projection, we use EWISER word sense disambiguator <ref type="bibr" target="#b3">[6]</ref> to link textual entities from texts to their synsets in the WordNet subset of BabelNet. Then, the mean of accompanied SenSemBERT+LMMS embeddings corresponds to these extracted synsets representing the texts. For the images, we first map images to the textual domain. To enhance the method by featuring abstract human-level concepts in the mapping, we decide to use TResNET-L with Asymmetric Loss (ASL) <ref type="bibr" target="#b2">[5,</ref><ref type="bibr" target="#b23">26]</ref> pre-trained on OpenImagesV6 <ref type="bibr" target="#b18">[21]</ref> to extract multi-label from images. Our decision is grounded since OpenImagesV6 features image-level labels conform with Freebase <ref type="bibr" target="#b4">[7]</ref> knowledge graph with figurative labels, e.g., festivals, sport, comedy, etc., while TResNET-L with ASL is the stateof-the-art method for OpenImagesV6 multi-label benchmark. The extracted lists of labels are also linked with synsets using EWISER, and the mean of these synset embedding vectors represent images.</p><p>We then train a canonical correlation analysis (CCA) module with the vector representation on the train set before using it to transform test set vectors. For relatedness measurement, for each test article, we rank all images in the test set using the 𝐿2-distance between the article vector and image vectors.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">EXPERIMENTAL RESULTS</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Data preprocessing</head><p>The MediaEval 2020 Image-Text Re-Matching benchmark releases three batches of data in total consists of the lede and titles of German news articles and their accompanying images. The first two are used for training, and the last one is used for testing.</p><p>For the sake of manual assertion, we decide to translate all the text to English using Google Translate and employ this translated text in our experiments. All data batches are cleaned automatically, with images crawled using the given URLs and pairs with 404 Not Found URLs dropped from the train set.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Submissions</head><p>First, TripletLocal and TripletGlobal demonstrate respective methods in Section 2.1. In both submissions, we empirically choose 𝑘 = 30 to embed images with top-𝑘 objects, then sort candidate images for each article by the similarity of their embedding to that of the article.</p><p>The Group-Face&amp;Cap submission, meanwhile, combine three different methods. First, we matches image-article pairs using the method in Section 2.2 with 𝑘 = 5. However, at each iteration, we sort the candidates by 𝑆 𝑡𝑜𝑡𝑎𝑙 score mentioned in 2.4. Finally, candidate images matched with the article through the method in Section 2.3 are prioritized to the top of the final result.</p><p>The KG-Fusion submission manifest the method described in Section 2.5. Specifically, the TResNet-L with ASL model used for multilabel extraction accepts a sigmoid threshold of 0.7, the EWISER disambiguator consumes chunks of 5 tokens, and the target decomposition of the CCA module has 64 components.</p><p>Finally, the Ensemble submission combines all described methods, weighting each models based on their efficiency. As such, the final ranking of a candidate image is:</p><formula xml:id="formula_0">𝑅 Ensemble =𝑤 1 𝑅 Caption +𝑤 2 𝑅 Triplet +𝑤 3 𝑅 Face +𝑤 4 𝑅 KG−Fusion .</formula><p>where 𝑅 𝐸𝑛𝑠𝑒𝑚𝑏𝑙𝑒 , 𝑅 𝐶𝑎𝑝𝑡𝑖𝑜𝑛 , 𝑅 𝑇 𝑟𝑖𝑝𝑙𝑒𝑡 , 𝑅 𝐹𝑎𝑐𝑒 , 𝑅 𝐾𝐺−𝐹𝑢𝑠𝑖𝑜𝑛 are ranks of the image produced by Group-Face&amp;Cap, TripletGlobal, Face Matching, and KG-Fusion methods, respectively. Weighting factors are empirically chosen to be 𝑤 1 =𝑤 4 = 1, 𝑤 2 = 0.02 and 𝑤 3 = 0.25.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">CONCLUSION AND FUTURE WORKS</head><p>Although, our methods show poor accuracy, they systematically increase the performance on the recall@100 metric. This fact validates our hypotheses that incorporating high-level semantics increase performance. Moreover, our methods yield consistent results, i.e., high-ranking images are of relevance to queried articles. Thus, they can still be useful for building news image recommendation systems as the news-images suitability is not injective in practice. The ensemble method's performance also suggests practical system builders to use multiple methods to handle different aspects of the complex image-text multimodal relation. In future works, we wish to investigate better fusion methods, consider a thorough ablation study for proposed methods, and enhance the dataset for thorough evaluation with information retrieval metrics like NDCG.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Submission result</figDesc><table><row><cell>Method</cell><cell cols="3">Acc. Recall@100 MRR@100</cell></row><row><cell>TripletLocal</cell><cell>0.0000</cell><cell>0.0248</cell><cell>0.0012</cell></row><row><cell>TripletGlobal</cell><cell>0.0002</cell><cell>0.0238</cell><cell>0.0013</cell></row><row><cell cols="2">Group-Face&amp;Cap 0.0194</cell><cell>0.1322</cell><cell>0.0237</cell></row><row><cell>KG-Fusion</cell><cell>0.0051</cell><cell>0.1667</cell><cell>0.0164</cell></row><row><cell>Ensemble</cell><cell>0.0075</cell><cell>0.2064</cell><cell>0.0222</cell></row><row><cell cols="3">2.4 Image-Text Fusion with Image</cell><cell></cell></row><row><cell cols="4">Captioning and Contextual Embeddings</cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Acknowledgments: Research is supported by Vingroup Innovation Foundation (VINIF) in project code VINIF.2019.DA19.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Bottom-Up and Top-Down Attention for Image Captioning and VQA</title>
		<author>
			<persName><forename type="first">Peter</forename><surname>Anderson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Xiaodong</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Chris</forename><surname>Buehler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Damien</forename><surname>Teney</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mark</forename><surname>Johnson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stephen</forename><surname>Gould</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Lei</forename><surname>Zhang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1707.07998</idno>
		<ptr target="http://arxiv.org/abs/1707.07998" />
		<imprint>
			<date type="published" when="2017">2017. 2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Neural Machine Translation by Jointly Learning to Align and Translate</title>
		<author>
			<persName><forename type="first">Dzmitry</forename><surname>Bahdanau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kyunghyun</forename><surname>Cho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yoshua</forename><surname>Bengio</surname></persName>
		</author>
		<idno>arXiv:cs.CL/1409.0473</idno>
		<imprint>
			<date type="published" when="2016">2016. 2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">Asymmetric Loss For Multi-Label Classification</title>
		<author>
			<persName><forename type="first">Emanuel</forename><surname>Ben-Baruch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tal</forename><surname>Ridnik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nadav</forename><surname>Zamir</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Asaf</forename><surname>Noy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Itamar</forename><surname>Friedman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Matan</forename><surname>Protter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Lihi</forename><surname>Zelnik-Manor</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2009.14119</idno>
		<imprint>
			<date type="published" when="2020">2020. 2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Breaking through the 80% glass ceiling: Raising the state of the art in Word Sense Disambiguation by incorporating knowledge graph information</title>
		<author>
			<persName><forename type="first">Michele</forename><surname>Bevilacqua</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Roberto</forename><surname>Navigli</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</title>
				<meeting>the 58th Annual Meeting of the Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="2854" to="2864" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Freebase: a collaboratively created graph database for structuring human knowledge</title>
		<author>
			<persName><forename type="first">Kurt</forename><surname>Bollacker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Colin</forename><surname>Evans</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Praveen</forename><surname>Paritosh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tim</forename><surname>Sturge</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jamie</forename><surname>Taylor</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2008 ACM SIGMOD international conference on Management of data</title>
				<meeting>the 2008 ACM SIGMOD international conference on Management of data</meeting>
		<imprint>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="1247" to="1250" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<author>
			<persName><forename type="first">Malte</forename><surname>Pietsch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tanay</forename><surname>Soni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Branden</forename><surname>Chan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Timo</forename><surname>Möller</surname></persName>
		</author>
		<ptr target="https://huggingface.co/bert-base-german-cased" />
		<title level="m">Model from</title>
				<imprint>
			<date type="published" when="2020">2020. 2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">German&apos;s Next Language Model</title>
		<author>
			<persName><forename type="first">Branden</forename><surname>Chan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stefan</forename><surname>Schweter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Timo</forename><surname>Möller</surname></persName>
		</author>
		<idno>arXiv:cs.CL/2010.10906</idno>
		<imprint>
			<date type="published" when="2020">2020. 2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Xception: Deep learning with depthwise separable convolutions</title>
		<author>
			<persName><forename type="first">François</forename><surname>Chollet</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE conference on computer vision and pattern recognition</title>
				<meeting>the IEEE conference on computer vision and pattern recognition</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="1251" to="1258" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<ptr target="https://huggingface.co/dbmdz/bert-base-german-uncased" />
		<title level="m">Model from</title>
				<imprint>
			<date type="published" when="2020">2020. 2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</title>
		<author>
			<persName><forename type="first">Jacob</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ming-Wei</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kenton</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kristina</forename><surname>Toutanova</surname></persName>
		</author>
		<idno>arXiv:cs.CL/1810.04805</idno>
		<imprint>
			<date type="published" when="2019">2019. 2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<author>
			<persName><forename type="first">Adam</forename><surname>Geitgey</surname></persName>
		</author>
		<ptr target="https://github.com/ageitgey/face_recognition" />
		<title level="m">Face Recognition</title>
				<imprint>
			<date type="published" when="2018">2018. 2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Identity mappings in deep residual networks</title>
		<author>
			<persName><forename type="first">Kaiming</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Xiangyu</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Shaoqing</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jian</forename><surname>Sun</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">European conference on computer vision</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="630" to="645" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">Deep metric learning using Triplet network</title>
		<author>
			<persName><forename type="first">Elad</forename><surname>Hoffer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nir</forename><surname>Ailon</surname></persName>
		</author>
		<idno>arXiv:cs.LG/1412.6622</idno>
		<imprint>
			<date type="published" when="2018">2018. 2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">A comprehensive survey of deep learning for image captioning</title>
		<author>
			<persName><forename type="first">Md</forename><forename type="middle">Zakir</forename><surname>Hossain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ferdous</forename><surname>Sohel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mohd</forename><forename type="middle">Fairuz</forename><surname>Shiratuddin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hamid</forename><surname>Laga</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Computing Surveys (CSUR)</title>
		<imprint>
			<biblScope unit="volume">51</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="1" to="36" />
			<date type="published" when="2019">2019. 2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Densely connected convolutional networks</title>
		<author>
			<persName><forename type="first">Gao</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Zhuang</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Laurens</forename><surname>Van Der Maaten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kilian</forename><forename type="middle">Q</forename><surname>Weinberger</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE conference on computer vision and pattern recognition</title>
				<meeting>the IEEE conference on computer vision and pattern recognition</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="4700" to="4708" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">LightGBM: A Highly Efficient Gradient Boosting Decision Tree</title>
		<author>
			<persName><forename type="first">Guolin</forename><surname>Ke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Qi</forename><surname>Meng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Thomas</forename><surname>Finley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Taifeng</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Wei</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Weidong</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Qiwei</forename><surname>Ye</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tie-Yan</forename><surname>Liu</surname></persName>
		</author>
		<ptr target="https://proceedings.neurips.cc/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf" />
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<editor>
			<persName><forename type="first">I</forename><surname>Guyon</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">U</forename><forename type="middle">V</forename><surname>Luxburg</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Bengio</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Wallach</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Fergus</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Vishwanathan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Garnett</surname></persName>
		</editor>
		<imprint>
			<publisher>Curran Associates, Inc</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="volume">30</biblScope>
			<biblScope unit="page" from="3146" to="3154" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">News Images in MediaEval</title>
		<author>
			<persName><forename type="first">Benjamin</forename><surname>Kille</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Andreas</forename><surname>Lommatzsch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Özlem</forename><surname>Özgöbek</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the MediaEval 2020 Workshop. Online</title>
				<meeting>of the MediaEval 2020 Workshop. Online</meeting>
		<imprint>
			<date type="published" when="2020">2020. 2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<author>
			<persName><forename type="first">Davis</forename><forename type="middle">E</forename><surname>King</surname></persName>
		</author>
		<ptr target="https://github.com/davisking/dlib-models" />
		<title level="m">dlib-models</title>
				<imprint>
			<date type="published" when="2018">2018. 2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale</title>
		<author>
			<persName><forename type="first">Alina</forename><surname>Kuznetsova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hassan</forename><surname>Rom</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Neil</forename><surname>Alldrin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jasper</forename><surname>Uijlings</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ivan</forename><surname>Krasin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jordi</forename><surname>Pont-Tuset</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Shahab</forename><surname>Kamali</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stefan</forename><surname>Popov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Matteo</forename><surname>Malloci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alexander</forename><surname>Kolesnikov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tom</forename><surname>Duerig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Vittorio</forename><surname>Ferrari</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IJCV</title>
		<imprint>
			<date type="published" when="2020">2020. 2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<title level="m" type="main">An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation</title>
		<author>
			<persName><forename type="first">Jey</forename><forename type="middle">Han</forename><surname>Lau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Timothy</forename><surname>Baldwin</surname></persName>
		</author>
		<idno>arXiv:cs.CL/1607.05368</idno>
		<imprint>
			<date type="published" when="2016">2016. 2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<author>
			<persName><forename type="first">Patrice</forename><surname>Lopez</surname></persName>
		</author>
		<ptr target="https://github.com/kermitt2/entity-fishing" />
		<title level="m">Entity Fishing</title>
				<imprint>
			<date type="published" when="2020">2020. 2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">BabelNet: Building a very large multilingual semantic network</title>
		<author>
			<persName><forename type="first">Roberto</forename><surname>Navigli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Simone</forename><forename type="middle">Paolo</forename><surname>Ponzetto</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 48th annual meeting of the association for computational linguistics</title>
				<meeting>the 48th annual meeting of the association for computational linguistics</meeting>
		<imprint>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="216" to="225" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<title level="m" type="main">The Connection between the Text and Images of News Articles: New Insights for Multimedia Analysis</title>
		<author>
			<persName><forename type="first">Nelleke</forename><surname>Oostdijk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hans</forename><surname>Van Halteren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Erkan</forename><surname>Başar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Martha</forename><forename type="middle">A</forename><surname>Larson</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2020">2020. 2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<author>
			<persName><forename type="first">Tal</forename><surname>Ridnik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hussam</forename><surname>Lawen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Asaf</forename><surname>Noy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Itamar</forename><surname>Friedman</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2003.13630</idno>
		<title level="m">TResNet: High Performance GPU-Dedicated Architecture</title>
				<imprint>
			<date type="published" when="2020">2020. 2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Mobilenetv2: Inverted residuals and linear bottlenecks</title>
		<author>
			<persName><forename type="first">Mark</forename><surname>Sandler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Andrew</forename><surname>Howard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Menglong</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Andrey</forename><surname>Zhmoginov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Liang-Chieh</forename><surname>Chen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE conference on computer vision and pattern recognition</title>
				<meeting>the IEEE conference on computer vision and pattern recognition</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="4510" to="4520" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<monogr>
		<title level="m" type="main">Very deep convolutional networks for large-scale image recognition</title>
		<author>
			<persName><forename type="first">Karen</forename><surname>Simonyan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Andrew</forename><surname>Zisserman</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1409.1556</idno>
		<imprint>
			<date type="published" when="2014">2014. 2014</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b26">
	<monogr>
		<author>
			<persName><forename type="first">Christian</forename><surname>Szegedy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sergey</forename><surname>Ioffe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Vincent</forename><surname>Vanhoucke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alex</forename><surname>Alemi</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1602.07261</idno>
		<title level="m">Inception-v4, inception-resnet and the impact of residual connections on learning</title>
				<imprint>
			<date type="published" when="2016">2016. 2016</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b27">
	<monogr>
		<title level="m" type="main">EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks</title>
		<author>
			<persName><forename type="first">Mingxing</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Quoc</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
		<idno>arXiv:cs.LG/1905.11946</idno>
		<imprint>
			<date type="published" when="2020">2020. 2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<monogr>
		<title level="m" type="main">Show, Attend and Tell: Neural Image Caption Generation with Visual Attention</title>
		<author>
			<persName><forename type="first">Kelvin</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jimmy</forename><surname>Ba</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ryan</forename><surname>Kiros</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kyunghyun</forename><surname>Cho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Aaron</forename><surname>Courville</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ruslan</forename><surname>Salakhutdinov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Richard</forename><surname>Zemel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yoshua</forename><surname>Bengio</surname></persName>
		</author>
		<idno>arXiv:cs.LG/1502.03044</idno>
		<imprint>
			<date type="published" when="2016">2016. 2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">Learning transferable architectures for scalable image recognition</title>
		<author>
			<persName><forename type="first">Barret</forename><surname>Zoph</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Vijay</forename><surname>Vasudevan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jonathon</forename><surname>Shlens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Quoc V</forename><surname>Le</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE conference on computer vision and pattern recognition</title>
				<meeting>the IEEE conference on computer vision and pattern recognition</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="8697" to="8710" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
