=Paper=
{{Paper
|id=Vol-3658/paper6
|storemode=property
|title=Beyond Keywords: ChatGPT's Semantic Understanding for Enhanced Media Search
|pdfUrl=https://ceur-ws.org/Vol-3658/paper6.pdf
|volume=Vol-3658
|authors=Hoang Chau Truong Vinh,Doan Khai Ta,Duc Duy Nguyen,Le Thanh Nguyen,Quang Vinh Nguyen
|dblpUrl=https://dblp.org/rec/conf/mediaeval/VinhTNNN23
}}
==Beyond Keywords: ChatGPT's Semantic Understanding for Enhanced Media Search==
Beyond Keywords: ChatGPT’s Semantic Understanding for Enhanced Media Search
Hoang-Chau Truong-Vinh1,†, Doan-Khai Ta2, Duc-Duy Nguyen2, Le-Thanh Nguyen3 and Quang-Vinh Nguyen4
1 Vietnamese-German University, Vietnam
2 Hanoi University of Science and Technology, Vietnam
3 University of Information Technology, Vietnam National University Ho Chi Minh City, Vietnam
4 Chonnam National University, Korea
Abstract
In this paper, we present our participation in media content retrieval, in which we retrieve and match images to specific articles, such as news articles. We propose a method that uses prompt engineering techniques and takes advantage of ChatGPT to generate descriptions of potential images for an article, which are then filtered and passed together with the corresponding image into a text-image model. Our experiments demonstrate the efficiency of the proposed framework in enhancing media content retrieval through highly relevant, high-quality data, presenting an effective approach to combining LLMs with media content problems.
1. Introduction
The NewsImages task aims to find images suitable for corresponding articles, as described by Lommatzsch, Kille, Özgöbek, Elahi and Dang-Nguyen [1]. This challenge attracts considerable attention and investigation due to the complexity of the relationship between text and image. The relationship is sometimes direct, where the image explicitly describes the text (recording the event, demonstrating the situation); sometimes indirect, where the image conveys abstract semantics to attract the reader’s attention (the image is not taken at the event described in the text, or the image is a symbolic representation of the text’s main theme); and sometimes the image is generated by AI. Due to the aforementioned difficulties, this work integrates a Large Language Model (LLM), ChatGPT, into the pipeline. Since its first appearance, ChatGPT has shown great potential in suggesting ideas for a given context, which suits the NewsImages retrieval scenario, where ideas for a suitable image are varied and do not appear to follow any rules, limiting the ability of existing methods to handle the problem. By leveraging prompting techniques and incorporating ChatGPT, we provide valuable additional context for the training dataset. Moreover, we adopt the capabilities of the vision-language pretrained BLIP [2] model to explore the complex relationship between text and image. The proposed strategy enhances the efficiency and effectiveness of retrieving relevant media content based on pseudo-labeling and textual descriptions.
MediaEval’23: Multimedia Evaluation Workshop, February 1–2, 2024, Amsterdam, The Netherlands and Online
* Corresponding author.
† These authors contributed equally.
16076@student.vgu.edu.vn (H. Truong-Vinh); tadoankhai@gmail.com (D. Ta); duynd.researchai@gmail.com (D. Nguyen); 19522238@gm.uit.edu.vn (L. Nguyen); vinhbn28@jnu.ac.kr (Q. Nguyen)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org
2. Related Work
Exploring the relation between images and texts remains challenging because of their distinct representations. Recent studies, such as Zhang et al. [3], introduced context-aware attention networks to connect important areas in images with associated semantic words. Liu et al. [4] captured image-sentence level relations rather than focusing exclusively on the object-word level. In the NewsImages task, news articles may describe elements not depicted in the accompanying images, requiring methods that comprehend more complex relationships. Yang et al. [5] utilized the power of the pretrained CLIP model to boost retrieval performance. Liang et al. [6] highlighted the significance of extending context by enriching articles through textual concept expansion, providing potential co-occurrence concepts related to the images. However, we are unaware of any previous works that exploit LLMs to bridge the semantic gap between news articles and their cover images.
3. Approach
3.1. Overview
In this section, we present the overall proposed framework, which takes advantage of ChatGPT and the BLIP [2] model to match news articles with their corresponding images. We utilize ChatGPT to create a new, higher-quality dataset by incorporating a greater amount of highly relevant information than the titles or texts of the existing data, from which we build the model on top of BLIP. The proposed framework is explained in further detail in the following subsections.
Figure 1: Diagram of implementation steps: 1. Extract image captions and text from training URLs using web scraping; 2. Feed the extracted texts into ChatGPT, using prompts to generate the 5 most relevant cover-image descriptions; 3. Compute cosine similarity between paired text and ground-truth image vectors; 4. Filter pairs above the relevance threshold of 0.25; 5. Fine-tune the BLIP model with the new training data.
3.2. Data Collection and Construction
Text Processing. We preprocess the text by translating it into English (Google Translate API), lowercasing, expanding contractions, and removing stop words and punctuation. Additionally, the Ekphrasis library [7] helps correct misspellings and word segmentation issues for cleaner text. Through performance experiments, we determined an optimal text length of 40 for the model input.
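A minimal sketch of this preprocessing step, assuming NLTK stop words and the Ekphrasis TextPreProcessor; the exact Ekphrasis configuration is not specified in the paper, so the options below are illustrative:

```python
import string

from nltk.corpus import stopwords  # assumes the NLTK stopwords corpus is downloaded
from ekphrasis.classes.preprocessor import TextPreProcessor

# Illustrative Ekphrasis pipeline for spelling correction, word segmentation,
# and contraction expansion; the paper does not specify the exact options used.
ekphrasis_processor = TextPreProcessor(
    segmenter="english",
    corrector="english",
    unpack_contractions=True,
    fix_html=True,
)

STOPWORDS = set(stopwords.words("english"))
MAX_LEN = 40  # text length found to work best as model input

def preprocess(text: str) -> str:
    """Lowercase, fix spelling/segmentation, drop stop words and punctuation, truncate."""
    # Translation into English (Google Translate API) is assumed to happen upstream.
    processed = ekphrasis_processor.pre_process_doc(text.lower())
    tokens = processed if isinstance(processed, list) else processed.split()
    tokens = [t for t in tokens if t not in STOPWORDS and t not in string.punctuation]
    return " ".join(tokens[:MAX_LEN])
```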
Text-Image Pair Construction. To obtain strongly aligned text-image pairs for fine-tuning BLIP, we leverage the state-of-the-art ChatGPT agent to automatically generate descriptive captions for images instead of relying on expensive human annotation [8, 9]. Specifically, we use the requests library to access the URLs provided in the training data. We then use the Beautiful Soup library to extract any image captions (if available) and all text from the webpage. The extracted text is provided to ChatGPT with a carefully selected prompt, "Summarize this article and suggest 5 images to use as the cover image, each no more than one sentence", to generate 5 additional image descriptions. For the RT dataset, these image descriptions and the organizers’ images are used to create text-image pairs, resulting in 5 pairs per article, with an additional 1 or 2 pairs formed from image captions when present. For the GDELT dataset, one pair is formed per article using the English title and the image. Keywords extracted from the article text are also paired with the image to form supplementary text-image pairs.
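This collection step can be sketched as follows with the requests, Beautiful Soup, and OpenAI client libraries; the HTML selector, model name, and response parsing are assumptions rather than the exact implementation:

```python
import requests
from bs4 import BeautifulSoup
from openai import OpenAI  # assumes openai>=1.0; the original 2023 work may have used the older API

PROMPT = ("Summarize this article and suggest 5 images to use as the cover image, "
          "each no more than one sentence")

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def scrape_article(url: str) -> tuple[list[str], str]:
    """Return (image captions, full page text) for one training URL."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    captions = [fc.get_text(strip=True) for fc in soup.find_all("figcaption")]  # assumed caption tag
    text = soup.get_text(separator=" ", strip=True)
    return captions, text

def suggest_cover_descriptions(article_text: str) -> list[str]:
    """Ask ChatGPT for 5 one-sentence cover-image descriptions."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model name
        messages=[{"role": "user", "content": f"{PROMPT}\n\n{article_text}"}],
    )
    # One description per line is assumed; the real parsing may differ.
    lines = response.choices[0].message.content.splitlines()
    return [line.strip("-• ").strip() for line in lines if line.strip()][:5]
```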
In our approach, each ChatGPT-generated caption or article title is fed into the BLIP text encoder, and each corresponding image is fed into the BLIP image encoder. After computing the cosine similarity between the textual and visual feature vectors, we filter out pairs with similarity below 0.25. This thresholding balances dataset size against semantic quality. Ultimately, we construct a filtered training dataset with tight image-text semantic alignment for adapting our multimodal model.
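A sketch of this filtering step, using the BLIP feature extractor interface from the LAVIS library as an assumed stand-in for the exact encoders used:

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip_feature_extractor", model_type="base", is_eval=True, device=device
)

THRESHOLD = 0.25  # relevance threshold from the paper

def keep_pair(image_path: str, caption: str) -> bool:
    """Keep a text-image pair only if its cosine similarity exceeds the threshold."""
    image = vis_processors["eval"](Image.open(image_path).convert("RGB")).unsqueeze(0).to(device)
    text = txt_processors["eval"](caption)
    with torch.no_grad():
        img_feat = model.extract_features({"image": image}, mode="image").image_embeds_proj[:, 0, :]
        txt_feat = model.extract_features({"text_input": [text]}, mode="text").text_embeds_proj[:, 0, :]
    sim = torch.nn.functional.cosine_similarity(img_feat, txt_feat).item()
    return sim > THRESHOLD
```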
Model Fine-tuning. Having selected relevant text-image training pairs, we further enhance BLIP’s multimodal representation learning capabilities via model fine-tuning. Previous works [10, 11] have demonstrated that adapting pre-trained models on downstream datasets can better align the embedding space for the target task.
Specifically, we append a classification head atop the dual BLIP encoders to predict matching vs. non-matching pairs based on feature similarity. The fine-tuning process minimizes the binary cross-entropy loss between predicted and ground-truth matching labels. This contrastive learning draws associated modalities closer in the embedding space while separating unrelated pairs. After fine-tuning converges, we evaluate the model on an image-text retrieval task using article titles as queries. Image and text encodings are extracted and ranked by cosine similarity. Top-1 accuracy measures how well the model can retrieve the ground-truth title associated with each image. For fine-tuning, the initial learning rate was set to 1e-5 with a weight decay of 0.05 for regularization. The learning rate was gradually decayed to stabilize convergence. These hyperparameters allow adaptive updates to the pre-trained parameters without completely overwriting them.
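A minimal PyTorch sketch of the matching head and loss described above. For brevity only the head parameters are optimized here (the authors also adapt the BLIP encoders); the feature dimension, optimizer, and scheduler are assumptions, while the learning rate and weight decay follow the paper:

```python
import torch
import torch.nn as nn

class MatchingHead(nn.Module):
    """Binary classifier on top of concatenated BLIP image/text features."""
    def __init__(self, feat_dim: int = 256):  # 256 is an assumed projection size
        super().__init__()
        self.classifier = nn.Linear(2 * feat_dim, 1)

    def forward(self, img_feat: torch.Tensor, txt_feat: torch.Tensor) -> torch.Tensor:
        return self.classifier(torch.cat([img_feat, txt_feat], dim=-1)).squeeze(-1)

head = MatchingHead()
criterion = nn.BCEWithLogitsLoss()
# Hyperparameters from the paper: initial LR 1e-5, weight decay 0.05, decaying LR schedule.
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-5, weight_decay=0.05)  # optimizer choice assumed
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)   # schedule assumed

def train_step(img_feat: torch.Tensor, txt_feat: torch.Tensor, labels: torch.Tensor) -> float:
    """One optimization step on a batch of (image feature, text feature, match label) triples."""
    optimizer.zero_grad()
    logits = head(img_feat, txt_feat)
    loss = criterion(logits, labels.float())  # labels: 1 = matching pair, 0 = non-matching
    loss.backward()
    optimizer.step()
    return loss.item()
```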
4. Results and Analysis
We submitted the following runs for each dataset (GDELT1, GDELT2, RT).
• Run #1: For this result, we utilized the BLIPv1 [2] model pretrained on the COCO dataset to extract embeddings for both article titles and images. We then employed cosine similarity to compute the similarity between images and titles, selecting the top 100 images with the highest similarity (a sketch of this ranking step follows the list).
• Run #2: Similar to Run #1, but using the pretrained BLIPv2 [12] model. The purpose of this experiment was to evaluate which model, BLIPv1 or BLIPv2, performs better on the given data.
• Run #3: Upon observing that the results of BLIPv1 and BLIPv2 did not differ significantly in the first two runs, and considering that BLIPv1 required less training time than BLIPv2, we decided to use the BLIPv1 model for further training. We applied the method described in Section 3.1 to obtain the results for this run.
• Run #4: In the fourth run, we continued to use the BLIPv1 model as in Run #3. For GDELT1, there were no changes in this training session. However, for the RT dataset, besides using ChatGPT to suggest cover-image descriptions, we also prompted it for keywords describing the article content. Other aspects of the data remained unchanged.
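A sketch of the zero-shot ranking used in Runs #1 and #2, assuming title and image embeddings have already been extracted with the pretrained BLIP encoders:

```python
import torch
import torch.nn.functional as F

def rank_top_k(title_emb: torch.Tensor, image_embs: torch.Tensor, k: int = 100) -> torch.Tensor:
    """Return indices of the k candidate images most similar to one article title.

    title_emb:  (d,) embedding of the article title
    image_embs: (N, d) embeddings of all candidate images
    """
    sims = F.cosine_similarity(title_emb.unsqueeze(0), image_embs, dim=-1)  # (N,)
    return sims.topk(min(k, image_embs.size(0))).indices
```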
Table 1: Experimental results comparing 4 runs on 3 datasets (RT, GDELT1, GDELT2)
Data     Method  R@5      R@10     R@50     R@100    MRR
RT       Run #1  0.05467  0.08167  0.19067  0.25833  0.04042
RT       Run #2  0.05300  0.07633  0.17167  0.23867  0.04045
RT       Run #3  0.09067  0.13333  0.27133  0.36467  0.07072
RT       Run #4  0.12067  0.17500  0.34100  0.42700  0.08727
GDELT1   Run #1  0.20867  0.27933  0.47933  0.57867  0.15169
GDELT1   Run #2  0.22200  0.29400  0.48733  0.57200  0.16368
GDELT1   Run #3  0.26733  0.35400  0.58200  0.65933  0.18974
GDELT1   Run #4  0.30467  0.39400  0.61467  0.70200  0.21365
GDELT2   Run #1  0.20933  0.26733  0.47267  0.56133  0.15404
GDELT2   Run #2  0.22000  0.28667  0.46733  0.53533  0.16208
GDELT2   Run #3  0.31533  0.41533  0.63867  0.71467  0.23320
GDELT2   Run #4  0.37067  0.44600  0.66400  0.73800  0.26778
Through our filtering framework, we are able to generate highly precise texts for article cover images during the training process. This method outperforms relying solely on article titles, which often contain inconsistencies and noise. As a result, the quality of our results improves significantly. Run #4 achieved the best outcome, with the GDELT2 dataset performing best: its R@100 score was 0.73800, and there was also a notable improvement for the other datasets in Run #4.
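For reference, the reported Recall@K and MRR values can be computed from ranked candidate lists as in the following sketch (illustrative; the official task evaluation may differ in detail):

```python
def recall_at_k(ranked_lists: list[list[int]], ground_truth: list[int], k: int) -> float:
    """Fraction of queries whose ground-truth item appears in the top-k ranked candidates."""
    hits = sum(gt in ranked[:k] for ranked, gt in zip(ranked_lists, ground_truth))
    return hits / len(ground_truth)

def mean_reciprocal_rank(ranked_lists: list[list[int]], ground_truth: list[int]) -> float:
    """Average of 1/rank of the ground-truth item; counts 0 when it is not retrieved at all."""
    reciprocal_ranks = []
    for ranked, gt in zip(ranked_lists, ground_truth):
        reciprocal_ranks.append(1.0 / (ranked.index(gt) + 1) if gt in ranked else 0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)
```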
5. Discussion and Outlook
Despite inconsistent performance across the three datasets, we have demonstrated the promise of reducing the semantic distance in the NewsImages task by integrating large language models such as ChatGPT into the pipeline. This sheds light on another use case for LLMs: suggesting descriptions for the cover images of news articles based on their content. These descriptions can then be fed into generative models to create more relevant images. In future research, we aim to explore the concept of AI-generated images, where pictures are not captured by humans but produced by machines. The emergence of AI-generated pictures has the potential to threaten media cohesion, spark discussions among publishers and photographers, and facilitate the dissemination of misleading information through fabricated images.
References
[1] A. Lommatzsch, B. Kille, Ö. Özgöbek, M. Elahi, D.-T. Dang-Nguyen, News Images in MediaEval
2023, CEUR Workshop Proceedings, 2024. URL: http://ceur-ws.org/.
[2] J. Li, D. Li, C. Xiong, S. C. H. Hoi, BLIP: Bootstrapping Language-Image Pre-training for Unified
Vision-Language Understanding and Generation, CoRR abs/2201.12086 (2022). URL: https://arxiv.
org/abs/2201.12086. arXiv:2201.12086.
[3] Q. Zhang, Z. Lei, Z. Zhang, S. Z. Li, Context-Aware Attention Network for Image-Text Retrieval, in:
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
2020, pp. 3533–3542. doi:10.1109/CVPR42600.2020.00359.
[4] C. Liu, Z. Mao, T. Zhang, H. Xie, B. Wang, Y. Zhang, Graph Structured Network for Image-Text Match-
ing, CoRR abs/2004.00277 (2020). URL: https://arxiv.org/abs/2004.00277. arXiv:2004.00277.
[5] Z. Yang, S. Yi, W. Wenbo, L. Jing, S. Jiande, CLIP Pre-trained Models for Cross-modal Retrieval
in NewsImages 2022, in: Working Notes Proceedings of the MediaEval 2022 Workshop, CEUR
Workshop Proceedings, 2022. URL: http://ceur-ws.org/.
[6] M. Liang, M. Larson, Textual Concept Expansion for Text-Image Matching within Online News Content, in: Working Notes Proceedings of the MediaEval 2022 Workshop, CEUR Workshop Proceedings, 2022. URL: http://ceur-ws.org/.
[7] C. Baziotis, N. Pelekis, C. Doulkeridis, DataStories at SemEval-2017 task 4: Deep LSTM with attention
for message-level and topic-based sentiment analysis, in: Proceedings of the 11th International
Workshop on Semantic Evaluation (SemEval-2017), Association for Computational Linguistics,
Vancouver, Canada, 2017, pp. 747–754. URL: https://aclanthology.org/S17-2126. doi:10.18653/v1/
S17-2126.
[8] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan,
P. Dollár, C. L. Zitnick, Microsoft COCO: common objects in context, CoRR abs/1405.0312 (2014).
URL: http://arxiv.org/abs/1405.0312. arXiv:1405.0312.
[9] P. Sharma, N. Ding, S. Goodman, R. Soricut, Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Long Papers), 2018, pp. 2556–2565.
[10] S. Gururangan, A. Marasovic, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, N. A. Smith, Don’t
Stop Pretraining: Adapt Language Models to Domains and Tasks, CoRR abs/2004.10964 (2020).
URL: https://arxiv.org/abs/2004.10964. arXiv:2004.10964.
[11] K. Desai, J. Johnson, VirTex: Learning Visual Representations from Textual Annotations, CoRR
abs/2006.06666 (2020). URL: https://arxiv.org/abs/2006.06666. arXiv:2006.06666.
[12] J. Li, D. Li, S. Savarese, S. Hoi, BLIP-2: Bootstrapping Language-Image Pre-training with Frozen
Image Encoders and Large Language Models, 2023. arXiv:2301.12597.