Multimodal Learning for Image-Text Matching: A BLIP-Based Approach

Dhanya Srinivasan1, Subhashree M1,*, Mirunalini P1 and Jaisakthi S M2

1 Department of Computer Science and Engineering, Sri Sivasubramaniya Nadar College of Engineering, Chennai - 603110, Tamil Nadu, India
2 School of Computer Science & Engineering, Vellore Institute of Technology, Chennai Campus, Chennai - 600127, Tamil Nadu, India

Abstract
This study delves into the domain of multimodal learning, focusing on image-text alignment to discern common olfactory references within multilingual content. The work addresses the Multimodal Understanding of Smells in Texts and Images (MUSTI) task at MediaEval '23 [1], which seeks to bridge the gap between multimedia analysis and multimedia representation. The task is to predict whether a text passage and an image evoke the same smell source or not. Our approach employs the Bootstrapping Language-Image Pre-training (BLIP) model, a Vision-Language Pre-training (VLP) model capable of both vision-language understanding and generation tasks. In particular, we use the BlipForConditionalGeneration variant of BLIP to generate textual descriptions (captions) for the input images. The generated captions are matched against the corresponding text passages using a similarity score, and a binary classification is then performed on this score with a Multinomial Naive Bayes classifier. Our objective is to evaluate the effectiveness of combining image captioning and text classification for this task. We evaluated a base BLIP model and a fine-tuned version of the same model, achieving F1 scores of 48.93% and 55.91%, respectively.

Keywords: Deep learning model, Smells, BLIP

MediaEval'23: Multimedia Evaluation Workshop, February 1–2, 2024, Amsterdam, The Netherlands and Online
* Corresponding author.
† These authors contributed equally.
dhanya2010903@ssn.edu.in (D. Srinivasan); subhashree2010066@ssn.edu.in (S. M); miruna@ssn.edu.in (M. P); jaisakthi.murugaiyan@vit.ac.in (J. S. M)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073

1. Introduction
Exploring olfactory information in images and text is crucial for historical and interdisciplinary research, shedding light on nuanced cultural contexts. Museums and galleries worldwide are pioneering olfactory enrichments for immersive experiences, emphasizing the interdisciplinary potential and the importance of historically accurate olfactory settings. Automating the extraction of olfactory information has received little attention from researchers, since identifying it in texts or images is challenging: linguistic evidence is rare in texts and its representation in images is implicit [2]. Motivated by the profound impact of scent on emotions and memories, the MUSTI challenge at MediaEval '23 explores the olfactory dimension of digital collections. This paper focuses on MUSTI Subtask 1, which furthers the understanding of olfactory references in multilingual texts and images and forges connections between the modalities. The task is a binary classification of whether an image and a text passage evoke the same smell source or not. In this study we evaluate the effectiveness of combining image captioning and text classification for this task.
We assess the performance of the state-of-the-art BLIP model for image captioning, followed by a Multinomial Naive Bayes classifier that predicts the classification labels on the MUSTI challenge test data, and we provide insights into the performance of both the base and fine-tuned versions of this model. Section 2 reviews related work. Section 3 explains our approach in detail. Section 4 reports the results of the models in various configurations. Section 5 concludes the paper with a summary of our evaluation and an outlook.

2. Related Work
The detection of smells, or olfaction, has traditionally been associated with the human senses, but recent interdisciplinary research has extended this concept to image and text analysis. In image analysis, convolutional neural networks (CNNs), as demonstrated by [3], have shown promise in correlating visual patterns with specific smells. On the textual front, Natural Language Processing (NLP) techniques, as explored by [4], utilize word embeddings and semantic analysis to infer olfactory attributes from textual descriptions. A recent trend combines both modalities, as in the work of [2], where a multimodal deep learning architecture jointly analyzes images and textual descriptions for improved olfaction detection. Challenges include the subjective nature of olfactory perception and the need for large-scale annotated datasets, but ongoing research aims to refine multimodal models, explore transfer learning techniques, and address ethical considerations related to olfactory data in image and text. This nascent field holds promise for applications ranging from environmental monitoring to sentiment analysis of product reviews.

A multilingual benchmark annotated with smell-related information, covering six languages, has been made available to the research community together with a discussion of olfactory information extraction [5]. Akdemir et al. [2] evaluate two state-of-the-art models, ViLBERT and mUNITER, on the MUSTI challenge test data and report the performance of both base and fine-tuned versions of these models. The MUSTI 2022 overview [6] presents the task of relating the evocation of smells between texts and images, along with an overview of the participants' models, a performance analysis, and the dataset. Shao et al. [7] proposed an object-detection-based method for matching olfactory information in text and images; however, this approach suffers from data imbalance, since both positive and negative objects are extracted from the images. The ICPR 2022 ODeuropa Challenge [8] focused on recognizing odor-active objects in historical artworks. The winning team used the PP-YOLOE [9] object detector with a CSP-ResNet [10] backbone, experimenting with grayscale image augmentation and style transfer during training, but found a strong object detection model to be the most effective component.

3. Approach
In this study, the research methodology follows a systematic approach to address the complexities inherent in matching language descriptions with visual stimuli in the context of olfactory experiences. The methodology combines image captioning using the BLIP model with text classification using a Naive Bayes classifier. The proposed architecture is shown in Figure 1.

Figure 1: Proposed Architecture for Understanding Smells in Texts and Images

3.1. Data Collection and Preprocessing
The study employs a dataset of image-text pairs sourced from [11]. The dataset includes image filenames, text descriptions, language labels, and, where present, labels of the objects evoking the smell. A metadata file pairing each image filename with its corresponding caption, as given in the dataset, was prepared and stored for fine-tuning the captioning model.
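The pairing of images and captions for fine-tuning can be sketched as follows. This is an illustrative outline only: the file names (musti_train.csv, metadata.csv), column names, and directory layout are assumptions rather than the exact artefacts of our setup, and the processor checkpoint is the publicly available BLIP base model.

```python
# Illustrative sketch (not the exact training script): build the metadata file of
# image filename/caption pairs described in Section 3.1 and wrap it in a PyTorch
# Dataset that the BLIP processor can consume for fine-tuning (Section 3.2).
# File names, column names, and the directory layout are hypothetical.
import pandas as pd
import torch
from PIL import Image
from torch.utils.data import Dataset
from transformers import BlipProcessor

# Assumed CSV export of the MUSTI training pairs with "image" and "text" columns.
pairs = pd.read_csv("musti_train.csv")
metadata = pairs[["image", "text"]].rename(columns={"text": "caption"})
metadata.to_csv("metadata.csv", index=False)  # stored for fine-tuning the captioner


class CaptionDataset(Dataset):
    """Yields processor-encoded (image, caption) pairs for BLIP fine-tuning."""

    def __init__(self, metadata_path: str, image_dir: str, processor: BlipProcessor):
        self.meta = pd.read_csv(metadata_path)
        self.image_dir = image_dir
        self.processor = processor

    def __len__(self) -> int:
        return len(self.meta)

    def __getitem__(self, idx: int) -> dict:
        row = self.meta.iloc[idx]
        image = Image.open(f"{self.image_dir}/{row['image']}").convert("RGB")
        # The processor combines the BLIP image processor and the BERT tokenizer,
        # so a single call resizes the image and WordPiece-tokenizes the caption.
        enc = self.processor(images=image, text=row["caption"],
                             padding="max_length", truncation=True,
                             return_tensors="pt")
        return {k: v.squeeze(0) for k, v in enc.items()}


processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
train_ds = CaptionDataset("metadata.csv", "images", processor)
loader = torch.utils.data.DataLoader(train_ds, batch_size=17, shuffle=True)
```

The DataLoader batch size of 17 mirrors the fine-tuning configuration described in Section 3.2.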
3.2. Image Captioning
To understand smells in texts and images, we propose to use the BLIP model [12], a VLP framework trained with vision-language objectives such as image-text contrastive learning, image-text matching, and image-conditioned language modeling. We used two models: the first is the pre-trained architecture (the baseline model) and the second is a fine-tuned version of the same model with the hyperparameters described below. The baseline model was used directly to generate image captions, and the obtained captions were then used to assess the correlation between the images and the texts. For the fine-tuned model, the processor functions as a wrapper that combines two components, the BERT tokenizer and the BLIP image processor, into a single interface, allowing the model to handle both text and image inputs seamlessly during training and inference. It applies WordPiece tokenization to the text while resizing and preprocessing the raw images into the format required by the model. The baseline model was fine-tuned by loading the images in batches of 17 and training for 20 epochs with the Adam with weight decay optimizer (AdamW) and a learning rate of 5e-5. Short captions are then decoded from the model output for all images retrieved from their URLs. These captions are stored and used for the subsequent classification.

3.3. Text Similarity Classifier
The Multinomial Naive Bayes classifier is chosen for its aptitude at discerning binary relationships. It is trained for 3 epochs on the training dataset, whose texts are transformed with a CountVectorizer. The classifier is then used for binary classification based on the similarity between the original text and the generated caption. By ultimately assigning binary labels using a threshold, we address the task of identifying whether the same smell source is evoked across the two modalities.
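A minimal sketch of this captioning-plus-classification pipeline is given below. It is an interpretation of Sections 3.2 and 3.3, not our exact code: the checkpoint name is the public BLIP base model, the single cosine-similarity feature fed to the Multinomial Naive Bayes classifier is one possible realization of the similarity score, and the 0.5 decision threshold is an assumed value.

```python
# Sketch of the pipeline in Sections 3.2-3.3: BLIP caption generation followed by
# a similarity-based Multinomial Naive Bayes decision. Checkpoint, feature
# encoding, and threshold are illustrative assumptions, not the exact setup.
import torch
from PIL import Image
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.naive_bayes import MultinomialNB
from transformers import BlipForConditionalGeneration, BlipProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base").to(device)
# For the fine-tuned variant, the same model would first be trained for 20 epochs
# with torch.optim.AdamW(model.parameters(), lr=5e-5) on batches of 17 pairs.


def generate_caption(image_path: str) -> str:
    """Generate a short caption for one image with BLIP."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt").to(device)
    output_ids = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(output_ids[0], skip_special_tokens=True)


def similarity(text: str, caption: str) -> float:
    """Bag-of-words cosine similarity between a text passage and a caption."""
    counts = CountVectorizer().fit_transform([text, caption])
    return float(cosine_similarity(counts[0], counts[1])[0, 0])


def train_classifier(texts, image_paths, labels) -> MultinomialNB:
    """Fit Multinomial Naive Bayes on similarity scores (labels: 1 = YES, 0 = NO)."""
    features = [[similarity(t, generate_caption(p))]
                for t, p in zip(texts, image_paths)]
    clf = MultinomialNB()
    clf.fit(features, labels)
    return clf


def predict_pair(clf: MultinomialNB, text: str, image_path: str,
                 threshold: float = 0.5) -> str:
    """Return "YES" if the pair is predicted to evoke the same smell source."""
    score = similarity(text, generate_caption(image_path))
    prob_yes = clf.predict_proba([[score]])[0, 1]
    return "YES" if prob_yes >= threshold else "NO"
```

Because the classifier sees a single non-negative similarity feature, its decision reduces to a learned threshold on the score, which is consistent with the thresholded binary labelling described above.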
4. Results and Analysis
The state-of-the-art BLIP model demonstrated promising results in image captioning, generating textual descriptions for the images. Integrating it with the Multinomial Naive Bayes classifier allowed us to construct a binary image-text similarity classifier. The performance metrics, including accuracy, precision, recall, and F1 score, were computed for both the base and fine-tuned versions of the model. These metrics provide insight into the model's ability to correctly identify positive instances while minimizing false positives and false negatives. Our analysis reveals the efficacy of combining image captioning with text classification for the MUSTI task; the binary classifier helps determine the correlation of smells across the different modalities. The model performs noticeably better when fine-tuned on the MUSTI dataset, as shown in Table 1: it achieves a weighted precision of 67.42% and displays a moderate level of success in the classification task, in contrast to the 62.27% weighted precision of the base model. The corresponding macro F1-scores of 55.91% and 48.93% for the fine-tuned and base models, respectively, further highlight the performance difference. The variation of the metrics between the "YES" and "NO" classes shows the difference in the model's ability to predict positive and negative instances: while the model is fairly successful at identifying negative pairs in both the fine-tuned and base versions, it struggles with positive pairs.

Table 1: Comparison of classification metrics for the English language

                       Fine-tuned model                        Base model
Metric         Precision  Recall  F1-Score  Support    Precision  Recall  F1-Score  Support
NO                0.7895  0.7000    0.7420      150       0.7480  0.6333    0.6859      150
YES               0.3284  0.4400    0.3761       50       0.2466  0.3600    0.2927       50
Accuracy                            0.6350      200                         0.5650      200
Macro Avg         0.5589  0.5700    0.5591      200       0.4973  0.4967    0.4893      200
Weighted Avg      0.6742  0.6350    0.6506      200       0.6227  0.5650    0.5876      200

In conclusion, the fine-tuned model, which reaches a precision of 78.95%, a recall of 70.00%, and an F1-score of 74.20% on the negative class, together with a macro F1-score of 55.91% on the test data, demonstrates its superior performance in the given task.
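For reference, the per-class, macro-averaged, and weighted-averaged figures in Table 1 follow scikit-learn's standard definitions; a minimal sketch of how such a report is produced is shown below, with randomly generated placeholder predictions in place of the real model outputs.

```python
# Sketch: producing Table-1-style metrics with scikit-learn. The predictions
# below are random placeholders, not our model's actual outputs.
import random

from sklearn.metrics import classification_report

random.seed(0)
y_true = ["NO"] * 150 + ["YES"] * 50                        # 200 English test pairs
y_pred = [random.choice(["NO", "YES"]) for _ in y_true]     # placeholder predictions

# Prints per-class precision/recall/F1-score/support, overall accuracy, and the
# macro and weighted averages, i.e. the row structure of Table 1.
print(classification_report(y_true, y_pred, labels=["NO", "YES"], digits=4))
```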
5. Conclusion and Future Directions
In this study, we employed the VLP framework BLIP coupled with a Multinomial Naive Bayes classifier and achieved promising results by fine-tuning on the English portion of the MUSTI data. Challenges persist in automating olfactory information extraction due to limited linguistic evidence in texts and implicit representation in images. While BLIP performs well for English, it falls short for multilingual data, and the Naive Bayes classifier captures semantic similarity but struggles to detect matching olfactory sources. Text tokenization, crucial for semantic understanding, may also lead to information loss. As future work, we would like to further fine-tune the model to improve accuracy on the positive class, and we may also explore advanced multimodal architectures and incorporate additional contextual cues to improve the model's grasp of olfactory references.

References
[1] A. Hürriyetoğlu, I. Novalija, M. Zinnen, V. Christlein, P. Lisena, S. Menini, M. van Erp, R. Troncy, The MUSTI challenge @ MediaEval 2023 - multimodal understanding of smells in texts and images with zero-shot evaluation, in: Working Notes Proceedings of the MediaEval 2023 Workshop, Amsterdam, the Netherlands and Online, 1-2 February 2024, 2023.
[2] K. Akdemir, A. Hürriyetoğlu, R. Troncy, T. Paccosi, S. Menini, M. Zinnen, V. Christlein, Multimodal and multilingual understanding of smells using ViLBERT and mUNITER, in: Proceedings of MediaEval 2022 CEUR Workshop, 2022.
[3] S. Kim, J. Park, J. Bang, H. Lee, Seeing is smelling: Localizing odor-related objects in images, in: Proceedings of the 9th Augmented Human International Conference, 2018, pp. 1–9.
[4] S. Menini, T. Paccosi, S. S. Tekiroğlu, S. Tonelli, Scent mining: Extracting olfactory events, smell sources and qualities, in: S. Degaetano-Ortlieb, A. Kazantseva, N. Reiter, S. Szpakowicz (Eds.), Proceedings of the 7th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, Association for Computational Linguistics, Dubrovnik, Croatia, 2023, pp. 135–140. URL: https://aclanthology.org/2023.latechclfl-1.15. doi:10.18653/v1/2023.latechclfl-1.15.
[5] S. Menini, T. Paccosi, S. Tonelli, M. Van Erp, I. Leemans, P. Lisena, R. Troncy, W. Tullett, A. Hürriyetoğlu, G. Dijkstra, F. Gordijn, E. Jürgens, J. Koopman, A. Ouwerkerk, S. Steen, I. Novalija, J. Brank, D. Mladenic, A. Zidar, A multilingual benchmark to capture olfactory situations over time, in: N. Tahmasebi, S. Montariol, A. Kutuzov, S. Hengchen, H. Dubossarsky, L. Borin (Eds.), Proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change, Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 1–10. URL: https://aclanthology.org/2022.lchange-1.1. doi:10.18653/v1/2022.lchange-1.1.
[6] A. Hürriyetoğlu, T. Paccosi, S. Menini, M. Zinnen, P. Lisena, K. Akdemir, R. Troncy, M. van Erp, MUSTI - multimodal understanding of smells in texts and images at MediaEval 2022, in: S. Hicks, A. G. S. de Herrera, J. Langguth, A. Lommatzsch, S. Andreadis, M. Dao, P. Martin, A. Hürriyetoğlu, V. Thambawita, T. S. Nordmo, R. Vuillemot, M. A. Larson (Eds.), Working Notes Proceedings of the MediaEval 2022 Workshop, Bergen, Norway and Online, 12-13 January 2023, volume 3583 of CEUR Workshop Proceedings, CEUR-WS.org, 2022. URL: https://ceur-ws.org/Vol-3583/paper50.pdf.
[7] Y. Shao, Y. Zhang, W. Wan, J. Li, J. Sun, Multilingual text-image olfactory object matching based on object detection, in: Proceedings of MediaEval 2023 CEUR Workshop, 2023.
[8] M. Zinnen, P. Madhu, R. Kosti, P. Bell, A. Maier, V. Christlein, ODOR: The ICPR2022 ODeuropa challenge on olfactory object recognition, in: 2022 26th International Conference on Pattern Recognition (ICPR), IEEE, 2022, pp. 4989–4994.
[9] X. Long, K. Deng, G. Wang, Y. Zhang, Q. Dang, Y. Gao, H. Shen, J. Ren, S. Han, E. Ding, et al., PP-YOLO: An effective and efficient implementation of object detector, arXiv preprint arXiv:2007.12099 (2020).
[10] C.-Y. Wang, H.-Y. M. Liao, Y.-H. Wu, P.-Y. Chen, J.-W. Hsieh, I.-H. Yeh, CSPNet: A new backbone that can enhance learning capability of CNN, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 390–391.
[11] MediaEval 2023 MUSTI task, https://multimediaeval.github.io/editions/2023/tasks/musti/, 2023.
[12] J. Li, D. Li, C. Xiong, S. Hoi, BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation, in: International Conference on Machine Learning, PMLR, 2022, pp. 12888–12900.