Image Retrieval with Short Text Queries

Ojas Rane1,∗, Conor Nugent2 and Brian Mac Namee1
1 School of Computer Science, University College Dublin, Dublin, Ireland
2 Shutterstock Ireland

AICS'24: 32nd Irish Conference on Artificial Intelligence and Cognitive Science, December 09–10, 2024, Dublin, Ireland
∗ Corresponding author: ojas.rane@ucdconnect.ie (O. Rane); cnugent@shutterstock.com (C. Nugent); brian.macnamee@ucd.ie (B. Mac Namee)

Abstract
The development of multimodal embedding spaces has made semantic image retrieval applications possible. Such embeddings are trained on sentence-length descriptions of images, which provide the model with rich context and help it form strong joint representations. However, this creates a mismatch with real-world usage, as users generally query these systems with just short phrases or keywords, and the performance of models trained on long text can be poor for short text queries. To address this issue, we propose an image retrieval system that works with short text queries but still takes advantage of large models pre-trained on long captions. We use four methods to improve performance: Fine-Tuning, Prompting, Expanding Short Texts Using a Pre-trained GPT Model, and the SILC framework. We conduct our experiments using two pre-trained models, CLIP and SigLIP, to evaluate the effectiveness of these methods in enhancing short text-based image retrieval, and using the Flickr30k dataset, for which short search strings have been generated using a pre-trained BLIP model.

Keywords
image retrieval, machine learning, multi-modal models

1. Introduction

Embedding-based image retrieval is a fundamental computer-vision task that handles two modalities: language (the search terms entered by a user) and vision (the relevant images retrieved). The basic idea is to find relevant images in a large database based on a query text. Embedding-based image retrieval systems, like others in multi-modal machine learning, have greatly benefited from the emergence of pre-trained models. Pre-trained models are trained on generic tasks using large datasets, and the representations they learn can be reused and fine-tuned for downstream tasks. Initially, pre-training was primarily used for text and images independently [1]. This started to change with the emergence of multi-modal pre-trained models like ViLBERT [2], LXMERT [3], and UNITER [4]. These models take an image and associated text as input and pass them through independent image and text encoders. The resulting representations are then fused and passed through another transformer, which generates a joint representation. The CLIP [5] model, pre-trained on a dataset of 400 million image-text pairs, illustrated the usefulness of pre-training at scale.

Many of the pre-trained multi-modal models use datasets such as COCO [6], WIT [7], and Flickr [8], which primarily contain sentence-length text data. The longer text descriptions offer richer context, providing information not just about the objects in the corresponding image, but also about the background and scene details. This additional context helps form stronger joint representations, as the model can more easily match the correct image-text pairs. In real-world image retrieval systems, however, users typically input short, relatively unstructured text queries, often consisting of just a few keywords. Since these models are pre-trained on sentence-length data, they often fail to retrieve relevant images. The primary challenge with short text queries is their lack of context.
When given only a brief description of an image, the model finds it difficult to accurately match the query with the correct image.

To develop an embedding-based image retrieval system that works effectively with short text queries, we have utilized pre-trained CLIP and SigLIP [9] models. These models are chosen for their strong zero-shot and generalization capabilities, which are particularly useful when working with limited textual input. We evaluate four modifications to enhance the effectiveness of image retrieval with short text queries: Fine-Tuning, Prompting, Expanding Short Texts Using a Pre-Trained GPT Model [10], and SILC [11]. These techniques either add more context to the text query before it is processed by the text encoder or adapt the model itself to short inputs, thereby improving the system's ability to match short queries with relevant images. Since no existing image-text dataset contains short text captions, we generated short texts by applying image captioning with a pre-trained BLIP [12] model to the Flickr30k dataset. The experiments described in this paper provide a comprehensive analysis of how image retrieval systems built using CLIP and SigLIP [9] perform with short and long texts, and how the four modifications help bridge the performance gap between short and long text queries in these models.

2. Related Work

The concept of pre-training has long been used in the field of computer vision, for example in pre-trained models such as VGG16 [13] and ResNet [14]. These models showed the effectiveness of transfer learning for visual tasks. It wasn't long before researchers used similar techniques for language [1], followed by the emergence of the transformer architecture for language models [15], and then more powerful pre-trained language models such as BERT [16] and GPT [17]. The success of pre-trained models for both vision and language led to the development of multimodal models that combine image and text data. Multi-modal Vision Language Pre-training (VLP) [18] has been used for tasks including visual question answering, image-text retrieval, and image captioning. These models are first pre-trained on a large dataset of aligned image-text pairs to form a joint representation of image-text data, and then fine-tuned for specific downstream tasks.

VLP models can be divided based on how they combine their image and text features: either early or late fusion. Early fusion involves combining features at the beginning of the pipeline, allowing the model to learn a joint representation from the start. In late fusion, modalities are processed separately before being combined at the decision level. Early fusion models typically follow the BERT architecture, and can be further sub-divided into single-stream and dual-stream architectures. In a single-stream architecture like Pixel-BERT [19], VisualBERT [20], and UNITER [4], both visual and language modalities are processed together through a single transformer.
In contrast, a dual-stream architecture like ViLBERT or LXMERT processes visual and language modalities separately using distinct transformers before combining them. CLIP and ALIGN [21] are examples of late fusion models in vision-language pre-training: these architectures process image and text inputs independently before employing contrastive learning techniques to learn how they can be combined. SigLIP is a model similar to CLIP that has shown better performance with greater computational efficiency.

Embedding-based image retrieval systems can broadly be divided into two categories: non-pretrained models and pre-trained models. Non-pretrained models include SCAN [22] and VSRN [23]. SCAN is an attention-based model which uses Faster R-CNN [24] to align objects and words together. Pre-trained models have evolved over time. Earlier versions were built using architectures such as UNITER, Pixel-BERT, and ViLBERT. Currently, state-of-the-art performance in image retrieval is achieved using more advanced pre-trained models like ALIGN, FILIP [25], Florence [26], and BLIP-2 [27].

3. Modifications for Short Text Image Retrieval

Pre-trained vision-language models like CLIP and SigLIP are trained on long-sentence text captions. These captions provide rich context about the objects and surroundings, facilitating easy association with the corresponding image. In contrast, the short texts used in image retrieval typically comprise an average of 3-5 words, which does not match the texts used in training and leads to poor image retrieval performance. This section describes the four modifications we use to improve the performance of image retrieval systems built using pre-trained multi-modal models.

3.1. Fine-Tuning

Fine-tuning is a technique that involves training a model on a specific dataset or task, which allows it to learn and adjust its parameters. Models like CLIP and SigLIP are trained on long text data, so when given short text inputs they struggle to extract meaningful representations to use in image retrieval. To improve the model's ability to represent both short texts and images effectively, the model is fine-tuned using a dataset of images and short text captions. This fine-tuning process helps enhance the performance of image retrieval systems.

3.2. Expanding Short Texts Using a Pre-Trained GPT Model

GPT-2 [10] is a decoder-only transformer architecture that predicts the next token based on the preceding sequence of tokens. GPT-2 is known for strong performance in zero-shot learning on various NLP tasks, and it can also be fine-tuned to adapt to specific tasks. For our experiment a pre-trained GPT-2 model (openai-community/gpt2) is used to expand short-text data into longer, more contextual text. The GPT model is fine-tuned to take short text captions as input and output expanded text. The resulting expanded captions are expected to produce more meaningful feature vectors, which could potentially enhance performance on various downstream tasks. For fine-tuning we unfroze the last 8 layers of the GPT-2 model to allow weight updates. The loss function used was cross-entropy loss, computed between the model's predictions and the target tokens. AdamW [28] is used as the optimiser. The dataset used is a version of the Flickr30k dataset [8] for which short captions have been generated (as described in Section 4.1). Stopwords are removed from these short captions, and pairs of each caption after and before stopword removal are used to fine-tune the model. Once fine-tuning is complete, input texts (short captions) first go through the fine-tuned GPT-2 model and are transformed into longer texts before going through the CLIP/SigLIP text encoder.
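To make the expansion step concrete, the sketch below shows how such a fine-tuning setup could look with the Hugging Face transformers library. It is a minimal illustration of the approach described above, not the exact training code used in our experiments: the "short = long" prompt format, sequence length, and example pair are assumptions introduced here for clarity, and the loss is computed over the whole sequence for simplicity.

```python
# Minimal sketch (assumed implementation): fine-tune GPT-2 to map a
# stopword-stripped short caption back to its longer original caption.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("openai-community/gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained("openai-community/gpt2")

# Freeze everything, then unfreeze the last 8 transformer blocks (as described above).
for p in model.parameters():
    p.requires_grad = False
for block in model.transformer.h[-8:]:
    for p in block.parameters():
        p.requires_grad = True

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-5, weight_decay=0.01
)

def expansion_loss(short_text, long_text):
    """Cross-entropy language-modelling loss on 'short = long' as one sequence."""
    # The " = " separator is a hypothetical prompt format, not taken from the paper.
    enc = tokenizer(short_text + " = " + long_text, return_tensors="pt",
                    truncation=True, max_length=64)
    out = model(**enc, labels=enc["input_ids"])  # GPT-2 shifts labels internally
    return out.loss

# Training step over (stopword-removed caption, original BLIP caption) pairs.
pairs = [("dog running grass", "a dog running on the grass")]  # illustrative pair
model.train()
for short, long in pairs:
    loss = expansion_loss(short, long)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

At inference time the fine-tuned model would be used generatively (e.g. via model.generate on the short caption plus separator), and the generated expansion passed to the CLIP/SigLIP text encoder.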
3.3. SILC Framework

Vision-language pre-trained models like CLIP and SigLIP have primarily been trained for image-level tasks such as classification and retrieval. These models excel at open vocabulary tasks at the image level, using contrastive learning to match similar image and text embeddings. However, they struggle with pixel-level tasks like segmentation and object detection, which require an understanding of local features and the ability to make predictions at the pixel level. The SILC framework aims to enhance models like CLIP and SigLIP by incorporating capabilities for pixel-level tasks. It combines image-text contrastive pre-training with local-to-global consistency learning via self-distillation. Local-to-global consistency learning ensures that local features from cropped patches are consistent with global features from the entire image. This helps models learn strong visual features that enable better local understanding, which not only improves performance on pixel-level prediction tasks but also benefits tasks like retrieval.

The SILC framework consists of a two-tower transformer model which maintains teacher and student models. The teacher model receives the global view and the student model receives the local view, as shown in Figure 1. The teacher model is an Exponential Moving Average (EMA) of the student model, which gives the student a stable target to learn from rather than one updated directly by gradients. The student's task is to match the teacher's feature embedding using only locally cropped images. SILC utilizes 2 global views, randomly cropped to between 0.4 and 1.0 of the original image, and 8 local views of size 0.05 to 0.4 of the original image.

Figure 1: An illustration of the SILC framework (reproduced from [11])

The SILC framework uses a dual-loss function which combines contrastive loss and self-distillation loss. The first step involves calculating contrastive loss using both global views of the image paired with their associated text; the average loss is taken across views. For the self-distillation component, 16 global-local image pairs are constructed and loss is calculated as the cross-entropy between the probability distributions of the teacher and student models, encouraging consistency between global and local feature representations. The final loss function is the summation of the two components, which maintains strong image-text associations while enhancing local feature understanding.
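The sketch below illustrates this dual objective under simplifying assumptions: the model is assumed to expose CLIP-style encode_image/encode_text methods, and the projection heads, separate temperatures, and teacher-centering details used in self-distillation methods are collapsed into a single softmax with temperature tau. It is an illustrative reading of the loss described above, not the SILC authors' implementation.

```python
# Simplified sketch of a SILC-style objective: contrastive image-text loss on the
# two global crops, plus cross-entropy self-distillation between an EMA teacher
# (global views) and a student (local views). All shapes/temperatures illustrative.
import torch
import torch.nn.functional as F

def silc_loss(student, teacher, texts, global_views, local_views, tau=0.1):
    """global_views: list of 2 image batches; local_views: list of 8 image batches."""
    # 1) CLIP/SigLIP-style contrastive loss, averaged over the two global views.
    txt = F.normalize(student.encode_text(texts), dim=-1)
    contrastive = 0.0
    for g in global_views:
        img = F.normalize(student.encode_image(g), dim=-1)
        logits = img @ txt.t() / tau
        targets = torch.arange(len(logits), device=logits.device)
        contrastive += 0.5 * (F.cross_entropy(logits, targets)
                              + F.cross_entropy(logits.t(), targets))
    contrastive /= len(global_views)

    # 2) Self-distillation: student on local crops matches EMA teacher on global crops.
    with torch.no_grad():
        t_probs = [F.softmax(teacher.encode_image(g) / tau, dim=-1) for g in global_views]
    distill = 0.0
    for l in local_views:
        s_logp = F.log_softmax(student.encode_image(l) / tau, dim=-1)
        for tp in t_probs:                       # 2 global x 8 local = 16 pairs
            distill += -(tp * s_logp).sum(dim=-1).mean()
    distill /= (len(local_views) * len(t_probs))

    return contrastive + distill                 # final loss = sum of both terms

@torch.no_grad()
def ema_update(teacher, student, m=0.996):
    """Teacher parameters are an exponential moving average of the student's."""
    for tp, sp in zip(teacher.parameters(), student.parameters()):
        tp.mul_(m).add_(sp, alpha=1 - m)
```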
3.4. Prompting

Prompt engineering has emerged as an effective way to help Large Language Models produce more accurate and relevant results [29]. This is done by adding relevant background information and instructions to the prompts which the model uses to generate output. CLIP uses prompt engineering to handle unlabeled classes in image classification tasks: it uses prompt templates to bridge the gap between textual and visual information, such as "A photo of a label", where label is replaced by a class from the classification problem. The short texts in the dataset used in our experiments (see Section 4.1) have an average length of just 3 words. They predominantly contain phrases combining adjectives with object classes (e.g., "grass green") or multiple object classes with descriptors (e.g., "two dogs on road"). To improve the context of the short text and bridge the gap between textual and visual data, we wrap the short text data in a prompt template and then pass it through the text encoder.

4. Experimental Setup

This section describes the setup of an experiment designed to evaluate the gap between retrieval performance when long texts and short texts are used, and the effectiveness of the four modifications described in the previous section in reducing this gap. The experiment also compares the effectiveness of systems built using the CLIP and SigLIP models.

4.1. Dataset

Our experiment compares the performance of image retrieval systems using long text captions versus short text captions. This requires a dataset with images and two distinct caption sets: long and short. We selected the Flickr30k dataset, which contains 31,783 images and 158,915 captions (5 captions per image). For our purposes, we chose one caption per image from Flickr30k to serve as the long caption data. As short captions did not exist, we generated them using a pre-trained BLIP model [12]. The captions generated by the BLIP model still exceeded our desired length for short search phrases, so they were further trimmed using stopword removal (which eliminates common words such as articles, prepositions, and conjunctions that typically don't carry significant meaning in search queries). This approach allows us to directly compare the effectiveness of long and short captions for the same set of images.

The long captions are taken directly from the Flickr30k dataset, averaging 13 words in length. They are descriptive and contain multiple objects, scenes, and background details; they don't solely focus on the main elements dominating the image but also include finer details. In contrast, our generated short text captions average just 3 words. Since they are generated using an image captioning technique, they tend to capture the main focus of the image. For instance, a short caption like "soccer goal" describes an image primarily featuring a soccer ball and goal, while its corresponding long caption, "A goalie is catching a soccer ball", adds more detail by specifying the action occurring in the image. Figure 2 illustrates examples of images and their short and long captions.

Figure 2: Sample images from the dataset showing long and short captions associated with images

4.2. Baseline Model

Our experiment uses the CLIP and SigLIP models as baselines. For our CLIP baseline, we employed the CLIP (ViT-B/32) model, which uses a Vision Transformer (ViT) with a 32x32 pixel patch size as its image encoder. The text encoder is a CLIP-based text encoder. This model's architecture consists of 12 transformer layers, 512 embedding dimensions, and 8 attention heads. To adapt it to our dataset, we fine-tuned the last two layers of the CLIP model. For the SigLIP model, we used SigLIP (siglip-base-patch16-256), a base SigLIP model with a 16x16 pixel patch size. It uses a transformer-based text encoder and a Vision Transformer for the image encoder. It has 24 transformer layers, 1024 embedding dimensions, and 16 attention heads. For the SigLIP model we trained all layers (203,202,050 trainable parameters in total).
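As a concrete illustration of this baseline setup, the sketch below loads the two pre-trained checkpoints named in Section 4.4 from the Hugging Face hub, freezes all CLIP parameters except the last two transformer layers of each encoder, and leaves all SigLIP parameters trainable. The layer-selection logic is an assumption about how "the last two layers" is realised; it is not the authors' exact code.

```python
# Sketch (assumed implementation) of the two baselines: CLIP with only the last
# two transformer layers of each encoder trainable, SigLIP fully trainable.
from transformers import AutoModel, AutoProcessor, CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Freeze everything in CLIP ...
for p in clip.parameters():
    p.requires_grad = False
# ... then unfreeze the last two transformer layers of the text and vision encoders.
for layer in (list(clip.text_model.encoder.layers[-2:])
              + list(clip.vision_model.encoder.layers[-2:])):
    for p in layer.parameters():
        p.requires_grad = True

siglip = AutoModel.from_pretrained("google/siglip-base-patch16-224")
siglip_proc = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")
# All SigLIP layers are left trainable (requires_grad is True by default);
# the paper reports 203,202,050 trainable parameters for this configuration.
n_trainable = sum(p.numel() for p in siglip.parameters() if p.requires_grad)
print(n_trainable, "trainable SigLIP parameters")
```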
4.3. Evaluation Metrics

The evaluation metric used for the embedding-based image retrieval system is Recall@k. Recall@k measures a system's ability to retrieve the relevant items in its top k results. In our experiments we measure Recall@1, Recall@5, and Recall@10, which is standard practice. Recall@1 measures a system's ability to retrieve the single most relevant item, while Recall@10 measures its ability to retrieve a relevant item somewhere within the top 10 results.

4.4. Implementation

We utilized the CLIP model with its default architecture. This includes a transformer-based text encoder and a Vision Transformer (ViT) [30] image encoder. Specifically, we used the pre-trained CLIP model "openai/clip-vit-base-patch32" available through the Hugging Face Transformers library [31]. We used pre-trained weights for most of the model, fine-tuning only the last two layers of both encoders. Our data preprocessing involved loading images using the Python Imaging Library (PIL) [32] and converting them to RGB format. For optimization, we used the AdamW optimizer with a learning rate of 1e-5 and a weight decay of 0.01. To address the problem of exploding gradients, we implemented gradient clipping with a maximum norm of 0.5. Throughout our experiments, we used a batch size of 16.

For SigLIP, we used the base SigLIP model, specifically the "google/siglip-base-patch16-224" model available through the Hugging Face Transformers package. The image encoder is a Vision Transformer with a patch size of 16, designed to process input images of 256x256 pixels. The text encoder is the default transformer-based architecture provided by SigLIP. Our data preprocessing pipeline involved loading images using the Python Imaging Library (PIL) and converting them to RGB format. These images were then processed using the AutoImageProcessor specific to the SigLIP model. For text data, we employed the AutoTokenizer for tokenization, with padding set to the maximum length. We used the AdamW optimizer with a learning rate of 1e-5 and a weight decay of 0.01. We kept the batch size at 16, aligning with our earlier experiments.
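Under the hyperparameters just listed, a single CLIP fine-tuning step might look like the following sketch. It relies on the contrastive loss that the Hugging Face CLIPModel computes when return_loss=True; the batch construction is illustrative, and the variable names (clip, clip_proc from the earlier baseline sketch) are assumptions rather than the paper's code.

```python
# Illustrative CLIP fine-tuning step with the Section 4.4 hyperparameters
# (AdamW, lr 1e-5, weight decay 0.01, gradient clipping at 0.5, batch size 16).
import torch
from PIL import Image

optimizer = torch.optim.AdamW(
    [p for p in clip.parameters() if p.requires_grad], lr=1e-5, weight_decay=0.01
)

def training_step(images, short_captions):
    """One contrastive fine-tuning step on a batch of (image, short caption) pairs."""
    batch = clip_proc(text=short_captions, images=images,
                      return_tensors="pt", padding=True)
    out = clip(**batch, return_loss=True)      # symmetric image-text contrastive loss
    out.loss.backward()
    torch.nn.utils.clip_grad_norm_(clip.parameters(), max_norm=0.5)
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()

# Example usage with a batch of 16 RGB PIL images and their short captions:
# images = [Image.open(path).convert("RGB") for path in batch_paths]
# loss = training_step(images, batch_captions)
```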
5. Results & Discussion

Table 1 shows the outcomes of the 14 distinct experiments conducted on the CLIP and SigLIP models. We evaluate each model configuration using three key performance metrics: Recall@1, Recall@5, and Recall@10, and highlight the best performance for each measure. When starting the experiments, it made more sense to use CLIP in the zero-shot setting rather than fine-tuning the model. First, CLIP is an enormous model with a vast number of parameters, making it computationally expensive to train from scratch. Second, CLIP is highly data-sensitive, requiring massive amounts of data to achieve good generalization. Our dataset, consisting of only 30,000 image-text pairs, was far too small to effectively train CLIP. Given these constraints, we decided to start with the zero-shot CLIP model, leveraging its pre-trained capabilities without further adjustments.

Table 1: Recall@k scores for different models with and without modifications applied. The best performance in each column is marked with an asterisk (*).

                                        Long Text                   Short Text
Model                                   R@1     R@5     R@10        R@1     R@5     R@10
CLIP (No Fine-Tuning)                   0.547   0.795   0.885       0.257   0.501   0.611
CLIP (No Fine-Tuning) with Prompting    –       –       –           0.281   0.519   0.609
CLIP (No Fine-Tuning) with GPT-2        –       –       –           0.235   0.465   0.577
CLIP (Fine-Tuned)                       0.639   0.890   0.940       0.331   0.647   0.766
CLIP (Fine-Tuned) with Prompting        –       –       –           0.257   0.509   0.632
CLIP (Fine-Tuned) with GPT-2            –       –       –           0.316   0.613   0.746
CLIP (Fine-Tuned) with SILC             –       –       –           0.319   0.663   0.773
SigLIP (Fine-Tuned)                     0.713*  0.913*  0.961*      0.323   0.659   0.791
SigLIP (Fine-Tuned) with Prompting      –       –       –           0.272   0.548   0.671
SigLIP (Fine-Tuned) with GPT-2          –       –       –           0.362   0.681   0.810*
SigLIP (Fine-Tuned) with SILC           –       –       –           0.367*  0.683*  0.805

At the start we anticipated that CLIP without fine-tuning would perform significantly better with long Flickr captions compared to short ones. This was based on CLIP's pre-training on the WebImageText (WIT) dataset [7], which primarily contains long text captions. As expected, we observed a substantial performance gap when long and short search strings were used. CLIP with long search strings achieved a Recall@1 of 0.5470, Recall@5 of 0.7950, and Recall@10 of 0.8850. These scores align closely with those reported by [33], who achieved a Recall@1 of 0.5496 when testing the same base CLIP model (CLIP-ViT-224/32) on the Flickr30k 1k dataset in a zero-shot setting. To improve the performance on shorter search queries, we decided to use methods that would add more context while keeping the solutions simple and avoiding changes to CLIP's architecture. Ultimately, we focused on two approaches that seemed most practical and aligned with our constraints: standard prompting and expanding short texts using GPT.

5.1. Expanding Queries

When we applied these methods to CLIP in zero-shot settings, neither showed significant improvement over the original model when used with short search strings. The model with prompting achieved a Recall@5 of 0.5190 when short search strings were used, while expanding short search strings with GPT led to a Recall@5 of just 0.4649. These findings suggest that to achieve better representation and improved performance, CLIP needs to be fine-tuned. Fine-tuning allows the CLIP model to adapt to short texts, which are different from the texts it was trained with. For fine-tuning, we determined it wasn't feasible to unfreeze too many layers. Instead, we opted to unfreeze the last two layers of both the text and image encoders in the CLIP model. After applying fine-tuning, there was a significant increase in the performance of the CLIP model when short search strings were used, with the Recall@5 score increasing to 0.6470. Similarly, after fine-tuning, expanding short search strings using the GPT model increased the Recall@5 to 0.6130. Even after fine-tuning, prompting didn't improve performance. The fine-tuned CLIP model showed only a minimal performance increase for long search strings, which underlines how strong a zero-shot learner CLIP is.

5.2. Modifying Embedding Spaces

The last method we tried, which gave the best results, was using the SILC framework. This method gives the CLIP model pixel-level prediction ability, which improves local feature understanding. When the SILC modification was used, the fine-tuned CLIP model achieved a Recall@5 of 0.6630. Based on the impact fine-tuning had on the CLIP model, we used fine-tuning on the SigLIP model in all experiments. When used with long search strings, the fine-tuned SigLIP model achieved a Recall@5 of 0.9130, quite an improvement over the CLIP model.
When short search strings were used, the fine-tuned SigLIP model outperformed the equivalent CLIP models in all cases. Overall, SigLIP with the SILC framework modification achieved the best performance for short search strings.

5.3. Discussion

We can conclude that SigLIP is superior to CLIP, performing better with both short and long search strings. For short search strings specifically, SigLIP performs best when enhanced with either the SILC framework or GPT-based text expansion. These findings suggest that SigLIP, combined with one of these techniques, represents the current best approach for image retrieval systems, particularly those dealing with short text inputs.

A significant limitation of our project was the quality of the short text data used for training and testing the models. This data, generated using the BLIP captioning method, often failed to capture the most relevant aspects of the images. The captions tended to focus on elements occupying the most space in the image, rather than the most important subjects. For instance, in an image with a small object against a large sky background, the caption might simply state "sky is blue", ignoring the main subject. Similarly, when a water body was present in the background with a subject in the foreground, the generated caption might only mention the water, completely omitting the primary subject. Another issue this has caused is that many of the captions are highly repetitive: captions like "sky blue", "group people", and "water calm" are heavily repeated within the dataset. This could have caused problems during the retrieval stage, leading to low Recall@1 scores for all models when short search strings are used.

6. Conclusions & Future Work

In this study, we successfully developed an image retrieval system capable of processing short text queries, addressing a critical gap between the sentence-length data typically used in training and the brief, keyword-based queries common in real-world applications. There was a significant performance gap between long and short text in multi-modal models like CLIP and SigLIP. To narrow this gap, we proposed and evaluated several modifications, with the SILC framework and GPT-based text expansion proving most effective in enhancing short text performance. Surprisingly, simple prompting techniques showed limited benefits. Our experiments consistently demonstrated SigLIP's superiority over CLIP for both long and short text queries.

Our research into methods to improve image retrieval performance with short text queries uncovered several promising approaches. ViSTA [34] and ROSITA [35] utilize scene graphs to enhance contextual information. However, these methods require architectural modifications to CLIP, making them less suitable for our current purposes. We also explored CoCoOP [36], which replaces standard prompts like "a photo of" with a neural network. This network is trained using backpropagation, and the learned weights represent a prompt added before the original text. Unfortunately, this method also necessitates changes to the model's architecture. In the future, we aim to integrate these methods into our models to assess their impact on short text performance. Additionally, we plan to use more accurate short text data that better represents user queries for image retrieval while reducing repetitive values.

Acknowledgments
This work was supported by Science Foundation Ireland under Grant 12/RC/2289_P2.

References
[1] J. Howard, S. Ruder, Universal language model fine-tuning for text classification, arXiv preprint arXiv:1801.06146 (2018).
[2] J. Lu, D. Batra, D. Parikh, S. Lee, Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks, Advances in neural information processing systems 32 (2019).
[3] H. Tan, M. Bansal, Lxmert: Learning cross-modality encoder representations from transformers, arXiv preprint arXiv:1908.07490 (2019).
[4] Y.-C. Chen, L. Li, L. Yu, A. El Kholy, F. Ahmed, Z. Gan, Y. Cheng, J. Liu, Uniter: Universal image-text representation learning, in: European conference on computer vision, Springer, 2020, pp. 104–120.
[5] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning transferable visual models from natural language supervision, in: International conference on machine learning, PMLR, 2021, pp. 8748–8763.
[6] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft coco: Common objects in context, in: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, Springer, 2014, pp. 740–755.
[7] K. Srinivasan, K. Raman, J. Chen, M. Bendersky, M. Najork, Wit: Wikipedia-based image text dataset for multimodal multilingual machine learning, in: Proceedings of the 44th international ACM SIGIR conference on research and development in information retrieval, 2021, pp. 2443–2449.
[8] B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, S. Lazebnik, Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models, in: Proceedings of the IEEE international conference on computer vision, 2015, pp. 2641–2649.
[9] X. Zhai, B. Mustafa, A. Kolesnikov, L. Beyer, Sigmoid loss for language image pre-training, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 11975–11986.
[10] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI blog 1 (2019) 9.
[11] M. F. Naeem, Y. Xian, X. Zhai, L. Hoyer, L. Van Gool, F. Tombari, Silc: Improving vision language pretraining with self-distillation, arXiv preprint arXiv:2310.13355 (2023).
[12] J. Li, D. Li, C. Xiong, S. Hoi, Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation, in: International conference on machine learning, PMLR, 2022, pp. 12888–12900.
[13] K. Simonyan, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556 (2014).
[14] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[15] A. Vaswani, Attention is all you need, Advances in Neural Information Processing Systems (2017).
[16] J. Devlin, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[17] A. Radford, Improving language understanding by generative pre-training (2018).
[18] F.-L. Chen, D.-Z. Zhang, M.-L. Han, X.-Y. Chen, J. Shi, S. Xu, B. Xu, Vlp: A survey on vision-language pre-training, Machine Intelligence Research 20 (2023) 38–56.
[19] Z. Huang, Z. Zeng, B. Liu, D. Fu, J. Fu, Pixel-bert: Aligning image pixels with text by deep multi-modal transformers, arXiv preprint arXiv:2004.00849 (2020).
[20] W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, J. Dai, Vl-bert: Pre-training of generic visual-linguistic representations, arXiv preprint arXiv:1908.08530 (2019).
[21] C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, T. Duerig, Scaling up visual and vision-language representation learning with noisy text supervision, in: International conference on machine learning, PMLR, 2021, pp. 4904–4916.
[22] K.-H. Lee, X. Chen, G. Hua, H. Hu, X. He, Stacked cross attention for image-text matching, in: Proceedings of the European conference on computer vision (ECCV), 2018, pp. 201–216.
[23] K. Li, Y. Zhang, K. Li, Y. Li, Y. Fu, Visual semantic reasoning for image-text matching, in: Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 4654–4662.
[24] S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn: Towards real-time object detection with region proposal networks, IEEE transactions on pattern analysis and machine intelligence 39 (2016) 1137–1149.
[25] L. Yao, R. Huang, L. Hou, G. Lu, M. Niu, H. Xu, X. Liang, Z. Li, X. Jiang, C. Xu, Filip: Fine-grained interactive language-image pre-training, arXiv preprint arXiv:2111.07783 (2021).
[26] L. Yuan, D. Chen, Y.-L. Chen, N. Codella, X. Dai, J. Gao, H. Hu, X. Huang, B. Li, C. Li, et al., Florence: A new foundation model for computer vision, arXiv preprint arXiv:2111.11432 (2021).
[27] J. Li, D. Li, S. Savarese, S. Hoi, Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, in: International conference on machine learning, PMLR, 2023, pp. 19730–19742.
[28] I. Loshchilov, F. Hutter, Decoupled weight decay regularization, arXiv preprint arXiv:1711.05101 (2017).
[29] P. Sahoo, A. K. Singh, S. Saha, V. Jain, S. Mondal, A. Chadha, A systematic survey of prompt engineering in large language models: Techniques and applications, arXiv preprint arXiv:2402.07927 (2024).
[30] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020).
[31] T. Wolf, Huggingface's transformers: State-of-the-art natural language processing, arXiv preprint arXiv:1910.03771 (2019).
[32] A. Clark, Pillow (pil fork) documentation, 2015. URL: https://buildmedia.readthedocs.org/media/pdf/pillow/latest/pillow.pdf.
[33] Z.-Y. Dou, Y. Xu, Z. Gan, J. Wang, S. Wang, L. Wang, C. Zhu, P. Zhang, L. Yuan, N. Peng, et al., An empirical study of training end-to-end vision-and-language transformers, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 18166–18176.
[34] M. Cheng, Y. Sun, L. Wang, X. Zhu, K. Yao, J. Chen, G. Song, J. Han, J. Liu, E. Ding, et al., Vista: Vision and scene text aggregation for cross-modal retrieval, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5184–5193.
[35] Y. Cui, Z. Yu, C. Wang, Z. Zhao, J. Zhang, M. Wang, J. Yu, Rosita: Enhancing vision-and-language semantic alignments via cross- and intra-modal knowledge integration, in: Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 797–806.
[36] K. Zhou, J. Yang, C. C. Loy, Z. Liu, Conditional prompt learning for vision-language models, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 16816–16825.