<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Multimodal Fusion in NewsImages 2023: Evaluating Translators, Keyphrase Extraction, and CLIP Pre-Training</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Quang-Vinh Dinh</string-name>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tien-Huy Nguyen</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hoang-Long Nguyen-Huu</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thien-Doanh Le</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Huu-Loc Tran</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Quoc-Khanh Le-Tran</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hoang-Bach Ngo</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Minh-Hung An</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>FPT Telecom</institution>
          ,
          <addr-line>Ho Chi Minh City</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>International University</institution>
          ,
          <addr-line>Ho Chi Minh City</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Information Technology</institution>
          ,
          <addr-line>Ho Chi Minh City</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Science</institution>
          ,
          <addr-line>VNU-HCM</addr-line>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Vietnam National University</institution>
          ,
          <addr-line>Ho Chi Minh City</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>Vietnamese German University</institution>
          ,
          <addr-line>Binh Duong</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Matching the most appropriate image to its corresponding article poses a significant challenge in today’s news landscape. This paper explores the intricate challenge of matching headline images to news articles, utilizing the zero-shot capability of CLIP to address the complex relationship between text and both real and AI-generated images in the MediaEval 2023 NewsImages Challenge. Additionally, we analyze the ramifications of diverse translation methodologies on the efficacy of CLIP. Our approach of using keyphrase extraction for the CLIP input demonstrates competitive results across various benchmarks in information extraction and matching.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        In today’s Internet age, online news articles play a crucial role as fundamental sources of
information on current events, employing compelling titles and content segments to engage
and inform readers effectively. Journalists strategically integrate images to enhance content
intuitiveness, enabling a comprehensive understanding of the presented information and
captivating the reader’s attention. The MediaEval Multimedia Evaluation benchmark, with a focus
on the NewsImages task [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], explores the intricate relationship between textual narratives
and visual elements in news articles, contributing significantly to understanding collaborative
dynamics in news discourse. Recent advancements, exemplified by Contrastive Language-Image
Pre-training (CLIP) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], provide a robust foundation for research combining text and real images,
comprehensively exploring their relationship in news articles. We take advantage of CLIP’s
zero-shot capabilities to evaluate experiments on both real and AI-generated images.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2. Related works</title>
      <p>
        Understanding the interaction between text and images in news is crucial for grasping news
content creation. Recent studies challenge the notion of a simple text-image connection,
highlighting the limitations of traditional image captioning models. New dynamic
attention-based models, like those by Messina et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
          ] and Zhang et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
          ], offer adaptability but
increase computational complexity. Nelleke Oostdijk’s analysis in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] highlighted the limitations
of a simplistic correlation between modalities, demonstrating that images possess the capability
to depict entities within text or unrelated visual elements. Research like Lidia Pivovarova’s
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] in the MediaEval 2021 NewsImages task, which integrated knowledge distillation and a
visual topic model, shows that images can represent entities from text or unrelated visuals,
and alignment between text and visual topics is possible. HCMUS [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] achieved competitive
results through advanced text preprocessing and the utilization of the CLIP pre-trained model;
however, this approach also relied on a translator. Our method further investigates the effect
of different translators on performance, using CLIP and keyphrase extraction to predict relevant
text for images, thus deepening the understanding of the complex text-image relationship.
      </p>
    </sec>
    <sec id="sec-4">
      <title>3. Proposed Methods</title>
      <p>The fundamental concept of this architecture is to integrate both text and image inputs
by embedding them into a shared space. For the image input, each image is embedded into a vector
by the CLIP image encoder. These embeddings are then indexed using the Faiss
library (see Figure 1). For the text input, we process the headline and snippet
of the news (including translating the text and optionally extracting keyphrases) before encoding
it into an embedding with the CLIP text encoder. Finally, we identify the 100 most relevant
images using K-nearest-neighbor search with cosine similarity.</p>
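      <p>As a concrete illustration of the retrieval step, the following minimal Python sketch indexes pre-computed CLIP image embeddings with Faiss and retrieves the 100 nearest images by cosine similarity; the function and variable names are illustrative assumptions rather than the exact code of our pipeline.</p>
      <preformat>
# Minimal sketch of the Faiss indexing and K-NN retrieval step.
# Assumes image and text embeddings from the CLIP encoders are already
# available as NumPy arrays; names here are illustrative only.
import numpy as np
import faiss

def build_index(image_embeddings):
    """Index L2-normalized embeddings so inner product equals cosine similarity."""
    embs = np.array(image_embeddings, dtype="float32")
    faiss.normalize_L2(embs)
    index = faiss.IndexFlatIP(embs.shape[1])
    index.add(embs)
    return index

def retrieve_top_k(index, text_embedding, k=100):
    """Return the ids and scores of the k most similar images for one text query."""
    query = np.array(text_embedding, dtype="float32").reshape(1, -1)
    faiss.normalize_L2(query)
    scores, ids = index.search(query, k)
    return ids[0], scores[0]
      </preformat>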
      <sec id="sec-4-1">
        <title>3.1. Translator</title>
        <p>
          The dataset consists of three components, each derived from news content sourced from news
portals, including GDELT1 and GDELT2, and an RT news feed dataset. The articles from RT News
are written in German and paired with their corresponding English texts in the dataset. In
[
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], Google Translate was used as the translation tool; in addition, we use another translation tool,
mBART (multilingual Bidirectional and Auto-Regressive Transformers) [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], to experiment with and
evaluate the impact of different translation methods on overall performance. Through this
experimentation, we gain better insight into each translation tool’s advantages and
disadvantages, allowing us to explore the relationship between the features of images and news.
        </p>
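        <p>For reference, the snippet below sketches how German headlines and snippets could be translated to English with a Hugging Face mBART checkpoint; the specific checkpoint name is an assumption made for illustration, since our setup only fixes the mBART architecture.</p>
        <preformat>
# Hedged sketch: translating German news text to English with an mBART model.
# The "facebook/mbart-large-50-many-to-many-mmt" checkpoint is an assumption
# for illustration; the paper only names the mBART architecture.
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

MODEL_NAME = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = MBart50TokenizerFast.from_pretrained(MODEL_NAME)
model = MBartForConditionalGeneration.from_pretrained(MODEL_NAME)

def translate_de_to_en(text):
    """Translate a German headline or snippet into English."""
    tokenizer.src_lang = "de_DE"
    encoded = tokenizer(text, return_tensors="pt", truncation=True)
    generated = model.generate(
        **encoded,
        forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"],
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
        </preformat>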
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Keyphrase</title>
        <p>
          This section aims to analyze and extract relevant keyphrases from the given inputs. CLIP without
the keyphrase approach shows suboptimal accuracy, attributed to lengthy and noisy headline and
snippet sentences. Since images closely match the content of the headline and snippet, the
keyphrase approach is needed to extract the most crucial entities and enrich the key information
for the image query. To address this problem, our approach uses the KBIR
(Keyphrase Boundary Infilling with Replacement) [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] pre-trained model, designed for NLP tasks,
for effective keyphrase extraction and generation from text, which is crucial for the CLIP model.
        </p>
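        <p>The sketch below shows one plausible way to run KBIR-based keyphrase extraction over a headline and snippet via a token-classification pipeline; the fine-tuned checkpoint name is an assumption for illustration, as the paper only specifies the KBIR pre-trained model.</p>
        <preformat>
# Hedged sketch: extracting keyphrases with a KBIR-based token classifier.
# The "ml6team/keyphrase-extraction-kbir-inspec" checkpoint is an assumed
# fine-tuned variant used here purely for illustration.
from transformers import pipeline

extractor = pipeline(
    "token-classification",
    model="ml6team/keyphrase-extraction-kbir-inspec",
    aggregation_strategy="simple",
)

def extract_keyphrases(headline, snippet):
    """Return unique keyphrases found in the concatenated headline and snippet."""
    results = extractor(headline + ". " + snippet)
    return sorted({r["word"].strip() for r in results})

# The extracted keyphrases can then be appended to the headline and snippet
# text before it is encoded by the CLIP text encoder.
        </preformat>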
      </sec>
      <sec id="sec-4-3">
        <title>3.3. Using CLIP as a Zero-shot retriever</title>
        <p>
          Automatic image captioning and text-image matching have advanced significantly, typically
requiring labelled data and specialized training. CLIP, a pre-trained neural network model,
takes a unique approach by learning joint image and text representations without task-specific
optimization. Its ability to transfer knowledge to other tasks without prior training, along with
a large and varied pre-training dataset, especially with news articles collected on the internet,
makes it a suitable and attractive option for tasks such as NewsImage. In addition, this allows
the model to achieve state-of-the-art performance on tasks it hasn’t been explicitly trained
on before, offering a promising baseline for further research [
          <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13">10, 11, 12, 13</xref>
          ]. This research
investigates the zero-shot performance of CLIP; the advanced ViT-L/14@336px model, the
most potent CLIP variant, is employed for optimal results.
        </p>
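        <p>A minimal sketch of the zero-shot scoring step is given below, assuming the reference OpenAI CLIP package and the ViT-L/14@336px weights; the file path and example text are placeholders.</p>
        <preformat>
# Hedged sketch: scoring one article text against one candidate image with
# the ViT-L/14@336px CLIP model, used purely zero-shot (no fine-tuning).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14@336px", device=device)

def clip_similarity(image_path, article_text):
    """Cosine similarity between a candidate image and the article text."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    tokens = clip.tokenize([article_text], truncate=True).to(device)
    with torch.no_grad():
        image_feat = model.encode_image(image)
        text_feat = model.encode_text(tokens)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    return (text_feat @ image_feat.T).item()
        </preformat>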
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Experimental Results</title>
      <p>The competition task requires participants to predict a sequentially organized list of images
that closely aligns with the accompanying textual article. Evaluation employs the Mean
Reciprocal Rank (MRR) metric and MeanRecall@K scores (K in {5, 10, 50, 100}). Our research
undergoes assessment on three datasets provided by the competition organizers, leading to
distinct experimental methodologies and variations in textual input for CLIP due to inherent
dissimilarities in each dataset. Consequently, our innovative approaches differ for each dataset
under consideration.</p>
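      <p>For clarity, the sketch below computes the two evaluation metrics as we understand them, assuming one ground-truth image per article and a ranked prediction list; it is an illustrative reading of the metrics, not the organizers’ official scoring code.</p>
      <preformat>
# Illustrative implementations of MRR and MeanRecall@K, assuming each article
# has exactly one ground-truth image id and a ranked list of predicted ids.
def mean_reciprocal_rank(ranked_lists, ground_truth):
    total = 0.0
    for preds, gt in zip(ranked_lists, ground_truth):
        if gt in preds:
            total += 1.0 / (preds.index(gt) + 1)  # rank is 1-based
    return total / len(ground_truth)

def mean_recall_at_k(ranked_lists, ground_truth, k):
    hits = sum(1 for preds, gt in zip(ranked_lists, ground_truth) if gt in preds[:k])
    return hits / len(ground_truth)
      </preformat>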
      <p>The experimental findings on the various translation methodologies show relatively
consistent results compared to utilizing the organizers’ translated text. Comparisons between
translation models indicate that Googletrans performs better than mBART on the RT dataset, which
led to the decision to leverage Googletrans for ongoing enhancements. However, based on the
experimental reports across the translation methodologies, we conclude that translator
modules do not greatly affect CLIP’s performance, so future methods should consider
removing the translator module to reduce pipeline complexity.</p>
      <p>In the experiment, incorporating keyphrases into the textual content that previously
consisted of the headline and snippet as input for CLIP yields overall performance improvements,
particularly in the MeanRecall@5 and MeanRecall@10 metrics (increases of 0.00133 and 0.00734,
respectively), as shown in Table 2. The keyphrase’s ability to encapsulate main ideas helps the
model focus more on crucial information and clarify the image query, resulting in commendable
outcomes, particularly in retrieving images within the 5th and 10th ranks.</p>
    </sec>
    <sec id="sec-6">
      <title>5. Conclusion and future work</title>
      <p>This study tackles the demanding text-image matching task in the MediaEval 2023 NewsImages
challenge, achieving notable success using the pre-trained CLIP model’s zero-shot capability.
Our experiments underscore the efficacy of the model architecture and the benefits of
employing a pre-trained model. We experimented with CLIP’s ability on both real and
synthetic images, yielding promising outcomes for real images and proficient performance on
AI-generated images. In addition, we showed that adding a translator did not improve
performance, so we may omit it from the pipeline in the future. Conversely, using keyphrases
showed positive signs of slightly increasing the accuracy of top-5 and top-10 image queries.</p>
      <p>
        Future efforts will concentrate on implementing a more extensive approach, exploring
additional techniques, such as re-ranking strategies [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] or face recognition systems [
        <xref ref-type="bibr" rid="ref16">16</xref>
          ] to enrich crucial information for the image query and to further improve overall performance.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Lommatzsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kille</surname>
          </string-name>
          , Ö. Özgöbek,
          <string-name>
            <given-names>M.</given-names>
            <surname>Elahi</surname>
          </string-name>
          ,
          <string-name>
            <surname>D.-T.</surname>
          </string-name>
          Dang-Nguyen,
          <article-title>News images in mediaeval 2023 (</article-title>
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          , G. Goh,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          , et al.,
          <article-title>Learning transferable visual models from natural language supervision</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>8748</fpage>
          -
          <lpage>8763</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Messina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Falchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Esuli</surname>
          </string-name>
          , G. Amato,
          <article-title>Transformer reasoning network for image-text matching and retrieval</article-title>
          , CoRR abs/
          <year>2004</year>
          .09144 (
          <year>2020</year>
          ). URL: https://arxiv.org/abs/
          <year>2004</year>
          .09144. arXiv:
          <year>2004</year>
          .09144.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Context-aware attention network for image-text retrieval</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>3536</fpage>
          -
          <lpage>3545</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>N.</given-names>
            <surname>Oostdijk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. v.</given-names>
            <surname>Halteren</surname>
          </string-name>
          , E. Basar,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Larson</surname>
          </string-name>
          ,
          <article-title>The connection between the text and images of news articles: New insights for multimedia analysis (</article-title>
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Pivovarova</surname>
          </string-name>
          , E. Zosa,
          <article-title>Visual topic modelling for newsimage task at mediaeval 2021</article-title>
          ,
          <source>in: Working Notes Proceedings of the MediaEval 2021 Workshop</source>
          , MediaEval,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ngô</surname>
          </string-name>
          , T.-D. Le,
          <string-name>
            <given-names>T.</given-names>
            <surname>Huynh</surname>
          </string-name>
          , N.-T. Nguyen,
          <string-name>
            <given-names>H.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tran</surname>
          </string-name>
          , Hcmus at mediaeval 2021:
          <article-title>Fine-tuning clip for automatic news-images re-matching 3181 (</article-title>
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Edunov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghazvininejad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <article-title>Multilingual denoising pre-training for neural machine translation</article-title>
          , CoRR abs/
          <year>2001</year>
          .08210 (
          <year>2020</year>
          ). URL: https://arxiv.org/abs/
          <year>2001</year>
          .08210. arXiv:
          <year>2001</year>
          .08210.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kulkarni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mahata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Arora</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bhowmik</surname>
          </string-name>
          ,
          <article-title>Learning rich representation of keyphrases from text</article-title>
          ,
          <source>CoRR abs/2112</source>
          .08547 (
          <year>2021</year>
          ). URL: https://arxiv.org/abs/2112.08547. arXiv:
          <volume>2112</volume>
          .
          <fpage>08547</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H.-N.</given-names>
            <surname>Vu</surname>
          </string-name>
          , H.
          <string-name>
            <surname>-D. Nguyen</surname>
          </string-name>
          , M.-T. Tran,
          <article-title>Re-matching images and news using clip pretrained model (</article-title>
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , W. Wan,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Clip pre-trained models for cross-modal retrieval in newsimages 2022 (</article-title>
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D.</given-names>
            <surname>Galanopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Mezaris</surname>
          </string-name>
          ,
          <article-title>Cross-modal networks and dual softmax operation for mediaeval newsimages 2022 (</article-title>
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>M.-D.</surname>
            Le-Quynh,
            <given-names>A.-T.</given-names>
          </string-name>
          <string-name>
            <surname>Nguyen</surname>
          </string-name>
          , A.
          <string-name>
            <surname>-T.</surname>
          </string-name>
          Quang-Hoang, V.
          <string-name>
            <surname>-H. Dinh</surname>
            ,
            <given-names>T.-H.</given-names>
          </string-name>
          <string-name>
            <surname>Nguyen</surname>
          </string-name>
          , H.
          <string-name>
            <surname>-B. Ngo</surname>
            ,
            <given-names>M.-H.</given-names>
          </string-name>
          <string-name>
            <surname>An</surname>
          </string-name>
          ,
          <article-title>Enhancing video retrieval with robust clip-based multimodal system</article-title>
          ,
          <source>in: Proceedings of the 12th International Symposium on Information and Communication Technology, SOICT '23</source>
          ,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2023</year>
          , p.
          <fpage>972</fpage>
          -
          <lpage>979</lpage>
          . URL: https://doi.org/10.1145/3628797.3629011. doi:
          <volume>10</volume>
          .1145/3628797.3629011.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>R.</given-names>
            <surname>Rombach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Blattmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lorenz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Esser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ommer</surname>
          </string-name>
          ,
          <article-title>High-resolution image synthesis with latent diffusion models</article-title>
          ,
          <year>2022</year>
          . arXiv:
          <volume>2112</volume>
          .
          <fpage>10752</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <article-title>Neural re-ranking in multi-stage recommender systems: A review</article-title>
          ,
          <year>2022</year>
          . arXiv:
          <volume>2202</volume>
          .
          <fpage>06602</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Dalvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bafna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bagaria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Virnodkar</surname>
          </string-name>
          ,
          <article-title>A survey on face recognition systems</article-title>
          ,
          <year>2022</year>
          . arXiv:
          <volume>2201</volume>
          .
          <fpage>02991</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>