Beyond Keywords: ChatGPT’s Semantic Understanding for Enhanced Media Search

Hoang-Chau Truong-Vinh1,†, Doan-Khai Ta2, Duc-Duy Nguyen2, Le-Thanh Nguyen3 and Quang-Vinh Nguyen4
1 Vietnamese-German University, Vietnam
2 Hanoi University of Science and Technology, Vietnam
3 University of Information Technology, Vietnam National University Ho Chi Minh City, Vietnam
4 Chonnam National University, Korea

MediaEval’23: Multimedia Evaluation Workshop, February 1–2, 2024, Amsterdam, The Netherlands and Online
* Corresponding author. † These authors contributed equally.
16076@student.vgu.edu.vn (H. Truong-Vinh); tadoankhai@gmail.com (D. Ta); duynd.researchai@gmail.com (D. Nguyen); 19522238@gm.uit.edu.vn (L. Nguyen); vinhbn28@jnu.ac.kr (Q. Nguyen)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
In this paper, we present our participation in media content retrieval, in which we retrieve and connect images to specific articles, such as news items. We propose a method that uses prompt-engineering techniques and takes advantage of ChatGPT to generate descriptions of potential cover images for an article; these descriptions are then filtered and passed, together with the corresponding image, into a text-image model. Our experiments demonstrate the efficiency of the proposed framework in enhancing media content retrieval through highly relevant, high-quality data, presenting an effective approach to combining LLMs with media content problems.

1. Introduction
The NewsImages task aims to find images suitable for corresponding articles, as described by Lommatzsch, Kille, Özgöbek, Elahi, and Dang-Nguyen [1]. This challenge attracts considerable attention and investigation because of the complexity of the relationship between text and image. The relationship is sometimes direct: the image explicitly depicts the text (recording the event, demonstrating the situation). It is sometimes indirect: the image conveys abstract semantics to attract the reader’s attention (the image was not taken at the event described in the text, or it is a symbolic representation of the text’s main theme). Sometimes the image is even generated by AI.
Due to these difficulties, this work integrates a Large Language Model (LLM), ChatGPT. Ever since its first appearance, ChatGPT has shown great potential in suggesting ideas for a given context, which suits the NewsImages retrieval scenario, where the ideas behind a chosen cover image vary widely and do not appear to follow any rules, limiting existing methods. By leveraging prompting techniques and incorporating ChatGPT, we provide valuable additional context for the training dataset. Moreover, we adopt the capabilities of the vision-language pretrained BLIP [2] model to explore the complex relationship between text and image. The proposed strategy enhances the efficiency and effectiveness of retrieving relevant media content based on pseudo-labeling and textual descriptions.

2. Related Work
Exploring the relation between images and texts remains challenging because of their distinct representations. Recent studies, such as Zhang et al. [3], introduced context-aware attention networks to connect important areas in images with associated semantic words.
Liu et al. [4] captured relations at the image-sentence level rather than focusing exclusively on the object-word level. In the NewsImages task, news articles may describe elements not depicted in the accompanying images, requiring methods that comprehend more complex relationships. Yang et al. [5] utilized the power of the pretrained CLIP model to boost performance. Liang et al. [6] highlighted the significance of extending context by enriching articles through textual concept expansion, providing potential co-occurrence concepts related to the images. However, we are unaware of any previous work that exploits LLMs to bridge the semantic gap between news articles and their cover images.

3. Approach
3.1. Overview
In this section, we present the overall proposed framework, which takes advantage of ChatGPT and the BLIP [2] model to match news articles with their corresponding images. We use ChatGPT to create a new, high-quality dataset that incorporates considerably more highly relevant information than the titles or texts of the existing data, and on this dataset we build a model based on BLIP. The framework is explained in further detail in the following subsections.

Figure 1: Diagram of implementation steps: 1. Extract image captions and text from training URLs using web scraping; 2. Input the extracted texts into ChatGPT, using prompts to generate the 5 most relevant cover-image descriptions; 3. Compute cosine similarity between paired text and ground-truth image vectors; 4. Filter pairs above the relevance threshold of 0.25; 5. Fine-tune the BLIP model with the new training data.

3.2. Data Collection and Construction
Text Processing. We preprocess the text by translating it into English (Google API), lowercasing, expanding contractions, and removing stop words and punctuation. Additionally, the Ekphrasis library [7] helps correct misspellings and word-segmentation issues for cleaner text. Through performance experiments, we determined an optimal text length of 40 for the model input.
Text-Image Pair Construction. To obtain strongly aligned text-image pairs for fine-tuning BLIP, we leverage the ChatGPT agent to automatically generate descriptive captions for images instead of relying on expensive human annotation [8, 9]. Specifically, we use the requests library to access the URLs provided in the training data. We then use the Beautiful Soup library to extract any image captions (if available) and all text from the webpage. We provide the extracted text to ChatGPT with a carefully selected prompt, "Summarize this article and suggest 5 images to use as the cover image, each no more than one sentence", to generate five additional image descriptions. For the RT dataset, these image descriptions are paired with the organizers’ images, resulting in 5 pairs per article; an additional 1 or 2 pairs are formed from image captions when present. For the GDELT dataset, one pair is formed per article using the English title and the image, and keywords extracted from the article text are paired with the image to form supplementary text-image pairs.
In our approach, each ChatGPT-generated caption or article title is fed into the BLIP text encoder, and the corresponding image is fed into the BLIP image encoder. After computing the cosine similarity between the textual and visual feature vectors, we filter out pairs with similarity below 0.25. This threshold balances dataset size against semantic quality. Ultimately, we constructed a filtered training dataset with tight image-text semantic alignment for adapting our multimodal model.
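To make the pipeline above concrete, the sketch below shows one way the data-construction step can be assembled in Python: scrape the article text, prompt ChatGPT for five one-sentence cover-image descriptions, embed each description and the article’s image with BLIP, and keep only pairs whose cosine similarity exceeds 0.25. This is a minimal illustration rather than our exact implementation; in particular, the OpenAI chat-completions client, the LAVIS blip_feature_extractor interface, the gpt-3.5-turbo model name, and the helper layout are assumptions made for the example.

```python
# Minimal sketch of the data-construction step (not our exact implementation):
# 1) scrape the article page, 2) prompt ChatGPT for five one-sentence cover-image
# descriptions, 3) embed caption and image with BLIP, 4) keep pairs above 0.25 cosine
# similarity. Model names, URL handling, and helper layout are illustrative assumptions.
import requests
import torch
import torch.nn.functional as F
from bs4 import BeautifulSoup
from openai import OpenAI
from PIL import Image
from lavis.models import load_model_and_preprocess  # assumed BLIP interface (LAVIS)

PROMPT = ("Summarize this article and suggest 5 images to use as the cover image, "
          "each no more than one sentence")
THRESHOLD = 0.25
device = "cuda" if torch.cuda.is_available() else "cpu"

# BLIP feature extractor exposing separate text and image encoders.
blip, vis_proc, txt_proc = load_model_and_preprocess(
    name="blip_feature_extractor", model_type="base", is_eval=True, device=device)


def scrape_text(url: str) -> str:
    """Fetch the article page and return all visible text."""
    html = requests.get(url, timeout=10).text
    return BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)


def suggest_captions(article_text: str) -> list[str]:
    """Ask ChatGPT for five one-sentence cover-image descriptions."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model choice, for illustration only
        messages=[{"role": "user", "content": f"{PROMPT}\n\n{article_text}"}],
    )
    lines = [l.strip("-*0123456789. ") for l in reply.choices[0].message.content.splitlines()]
    return [l for l in lines if l][-5:]  # naive parse: keep the last five non-empty lines


def blip_cosine(caption: str, image_path: str) -> float:
    """Cosine similarity between the BLIP text and image embeddings of one pair."""
    image = vis_proc["eval"](Image.open(image_path).convert("RGB")).unsqueeze(0).to(device)
    sample = {"image": image, "text_input": [txt_proc["eval"](caption)]}
    img = blip.extract_features(sample, mode="image").image_embeds_proj[:, 0, :]
    txt = blip.extract_features(sample, mode="text").text_embeds_proj[:, 0, :]
    return F.cosine_similarity(img, txt).item()


def build_pairs(url: str, image_path: str) -> list[tuple[str, str]]:
    """Return (caption, image) training pairs that pass the relevance threshold."""
    captions = suggest_captions(scrape_text(url)[:4000])  # truncate very long pages
    return [(c, image_path) for c in captions if blip_cosine(c, image_path) > THRESHOLD]
```

In our pipeline, the same threshold check is also applied to title-image pairs before they enter the fine-tuning set.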
Model Fine-tuning. Having selected relevant text-image training pairs, we further enhance BLIP’s multimodal representation learning via model fine-tuning. Previous works [10, 11] have demonstrated that adapting pre-trained models on downstream datasets can better align the embedding space with the target task. Specifically, we append a classification head atop the dual BLIP encoders to predict matching vs. non-matching pairs based on feature similarity. Fine-tuning minimizes the binary cross-entropy loss between predicted and ground-truth matching labels. This contrastive learning draws associated modalities closer in the embedding space while separating unrelated pairs. After fine-tuning converges, we evaluate the model on an image-text retrieval task using article titles as queries: image and text encodings are extracted and ranked by cosine similarity, and top-1 accuracy measures how well the model retrieves the ground-truth title associated with each image. For fine-tuning, the initial learning rate was set to 1e-5 with a weight decay of 0.05 for regularization; the learning rate gradually decayed to stabilize convergence. These hyperparameters allow adaptive updates to the pre-trained parameters without completely overwriting them.

4. Results and Analysis
We submitted four runs for each dataset (GDELT1, GDELT2, RT), with the following details; a sketch of how the ranking-based metrics are computed is given after the list.
• Run #1: We used the pretrained BLIPv1 [2] model, pretrained on the COCO dataset, to extract embeddings for both article titles and images. We then computed the cosine similarity between images and titles and selected the top 100 images with the highest similarity.
• Run #2: Similar to Run #1, but using the pretrained BLIPv2 [12] model. The purpose of this experiment was to evaluate which model, BLIPv1 or BLIPv2, performs better on the given data.
• Run #3: Since the results of BLIPv1 and BLIPv2 did not differ significantly in the first two runs, and BLIPv1 required less training time than BLIPv2, we used the BLIPv1 model for further training. We applied the method described in Section 3 to obtain the results for this run.
• Run #4: We continued to use the BLIPv1 model as in Run #3. For GDELT1, nothing changed in this training session. For the RT dataset, however, besides using ChatGPT to suggest cover-image descriptions, we also prompted for keywords describing the article content. Other aspects of the data remained unchanged.
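For reference, the sketch below shows how the ranking-based metrics reported in Table 1 (Recall@K and MRR) can be computed from a matrix of cosine similarities, assuming one ground-truth image per article with the ground truth on the diagonal. The function and variable names are illustrative, not the official task evaluation code.

```python
# Minimal NumPy sketch of the ranking metrics in Table 1 (Recall@K and MRR), assuming
# one ground-truth image per article and a precomputed article-by-image cosine-similarity
# matrix with the ground truth on the diagonal. Illustrative only.
import numpy as np


def evaluate_retrieval(sim: np.ndarray, ks=(5, 10, 50, 100)):
    """sim[i, j] = cosine similarity between article i and candidate image j."""
    n = sim.shape[0]
    order = np.argsort(-sim, axis=1)  # rank candidate images per article, best first
    # 0-based position of the ground-truth image in each article's ranked list.
    ranks = np.array([np.where(order[i] == i)[0][0] for i in range(n)])
    recalls = {f"R@{k}": float(np.mean(ranks < k)) for k in ks}
    mrr = float(np.mean(1.0 / (ranks + 1)))
    return recalls, mrr


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    toy_sim = rng.random((200, 200))  # random similarities, for illustration only
    print(evaluate_retrieval(toy_sim))
```

In our runs, this similarity matrix is built from cosine similarities between BLIP title and image embeddings (pretrained encoders in Runs #1 and #2, fine-tuned encoders in Runs #3 and #4).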
Table 1
Experimental results comparing 4 runs on the 3 datasets RT, GDELT1, GDELT2

Data     Method   R@5      R@10     R@50     R@100    MRR
RT       Run #1   0.05467  0.08167  0.19067  0.25833  0.04042
RT       Run #2   0.05300  0.07633  0.17167  0.23867  0.04045
RT       Run #3   0.09067  0.13333  0.27133  0.36467  0.07072
RT       Run #4   0.12067  0.17500  0.34100  0.42700  0.08727
GDELT1   Run #1   0.20867  0.27933  0.47933  0.57867  0.15169
GDELT1   Run #2   0.22200  0.29400  0.48733  0.57200  0.16368
GDELT1   Run #3   0.26733  0.35400  0.58200  0.65933  0.18974
GDELT1   Run #4   0.30467  0.39400  0.61467  0.70200  0.21365
GDELT2   Run #1   0.20933  0.26733  0.47267  0.56133  0.15404
GDELT2   Run #2   0.22000  0.28667  0.46733  0.53533  0.16208
GDELT2   Run #3   0.31533  0.41533  0.63867  0.71467  0.23320
GDELT2   Run #4   0.37067  0.44600  0.66400  0.73800  0.26778

Using our filtering framework, we generate highly precise texts for article cover images during training. This approach outperforms relying solely on article titles, which are often inconsistent and noisy, and the quality of our results improved significantly as a consequence. Run #4 achieved the best outcome overall, with the GDELT2 dataset performing best: its R@100 score reached 0.73800, and the other datasets also improved notably in Run #4.

5. Discussion and Outlook
Despite inconsistent performance across the three datasets, we have demonstrated the promise of reducing the semantic distance in the NewsImages task by integrating large language models such as ChatGPT into the pipeline. This sheds light on another use case for LLMs: suggesting descriptions for the cover images of news articles based on their content. Such descriptions can then be fed into generative models to create more relevant images. In future research, we aim to explore the concept of AI-generated images, where pictures are not captured by humans but produced by machines. The emergence of AI-generated pictures has the potential to threaten media cohesion, spark discussions among publishers and photographers, and facilitate the dissemination of misleading information through fabricated images.

References
[1] A. Lommatzsch, B. Kille, Ö. Özgöbek, M. Elahi, D.-T. Dang-Nguyen, News Images in MediaEval 2023, CEUR Workshop Proceedings, 2024. URL: http://ceur-ws.org/.
[2] J. Li, D. Li, C. Xiong, S. C. H. Hoi, BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation, CoRR abs/2201.12086 (2022). URL: https://arxiv.org/abs/2201.12086. arXiv:2201.12086.
[3] Q. Zhang, Z. Lei, Z. Zhang, S. Z. Li, Context-Aware Attention Network for Image-Text Retrieval, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 3533–3542. doi:10.1109/CVPR42600.2020.00359.
[4] C. Liu, Z. Mao, T. Zhang, H. Xie, B. Wang, Y. Zhang, Graph Structured Network for Image-Text Matching, CoRR abs/2004.00277 (2020). URL: https://arxiv.org/abs/2004.00277. arXiv:2004.00277.
[5] Z. Yang, S. Yi, W. Wenbo, L. Jing, S. Jiande, CLIP Pre-trained Models for Cross-modal Retrieval in NewsImages 2022, in: Working Notes Proceedings of the MediaEval 2022 Workshop, CEUR Workshop Proceedings, 2022. URL: http://ceur-ws.org/.
[6] M. Liang, M. Larson, Textual Concept Expansion for Text-Image Matching within Online News Content, in: Working Notes Proceedings of the MediaEval 2022 Workshop, CEUR Workshop Proceedings, 2022. URL: http://ceur-ws.org/.
[7] C. Baziotis, N. Pelekis, C. Doulkeridis, DataStories at SemEval-2017 task 4: Deep LSTM with attention for message-level and topic-based sentiment analysis, in: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Association for Computational Linguistics, Vancouver, Canada, 2017, pp. 747–754. URL: https://aclanthology.org/S17-2126. doi:10.18653/v1/S17-2126.
[8] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft COCO: Common Objects in Context, CoRR abs/1405.0312 (2014). URL: http://arxiv.org/abs/1405.0312. arXiv:1405.0312.
[9] P. Sharma, N. Ding, S. Goodman, R. Soricut, Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Long Papers), 2018, pp. 2556–2565.
[10] S. Gururangan, A. Marasovic, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, N. A. Smith, Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks, CoRR abs/2004.10964 (2020). URL: https://arxiv.org/abs/2004.10964. arXiv:2004.10964.
[11] K. Desai, J. Johnson, VirTex: Learning Visual Representations from Textual Annotations, CoRR abs/2006.06666 (2020). URL: https://arxiv.org/abs/2006.06666. arXiv:2006.06666.
[12] J. Li, D. Li, S. Savarese, S. Hoi, BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, 2023. arXiv:2301.12597.