Multimodal and Multilingual Olfactory Matching based on Contrastive Learning

Sergio Esteban-Romero1,*, Iván Martín-Fernández1, Jaime Bellver-Soler1, Manuel Gil-Martín1 and Fernando Fernández-Martínez1

1 Grupo de Tecnología del Habla y Aprendizaje Automático (THAU Group), Information Processing and Telecommunications Center, E.T.S.I. de Telecomunicación, Universidad Politécnica de Madrid (UPM)

MediaEval'23: Multimedia Evaluation Workshop, February 1-2, 2024, Amsterdam, The Netherlands and Online
* Corresponding author.
† These authors contributed equally.
Email: sergio.estebanro@upm.es (S. Esteban-Romero); ivan.martinf@upm.es (I. Martín-Fernández); jaime.bellver@upm.es (J. Bellver-Soler); manuel.gilmartin@upm.es (M. Gil-Martín); fernando.fernandezm@upm.es (F. Fernández-Martínez)
ORCID: 0009-0008-6336-7877 (S. Esteban-Romero); 0009-0004-2769-9752 (I. Martín-Fernández); 0009-0006-7973-4913 (J. Bellver-Soler); 0000-0002-4285-6224 (M. Gil-Martín); 0000-0003-3877-0089 (F. Fernández-Martínez)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

This paper introduces an innovative approach to the multimodal smell identification task, using CLIP-based solutions that employ Vision Transformers (ViT) as image processors together with language-specific text encoders. The proposed method addresses the question of whether image-text pairs convey similar olfactory experiences by aligning them in a shared embedding space. A notable consideration in our study is the challenge posed by class imbalance, where certain olfactory experiences are significantly more represented than others. Hence, this paper describes a supervised methodology for training the CLIP-based model that reinforces positive olfactory relationships while weakening negative ones. Additionally, we have explored different data balancing procedures aimed at preserving the original distribution between languages. One of our proposed approaches demonstrates improved accuracy compared to the top-performing result reported in the 2022 edition of the MUSTI challenge.

1. Introduction and Related Work

In the evolving landscape of artificial intelligence (AI) algorithms, the sensory dimension of smell has been left out of the picture compared to the advances in computer vision, natural language processing, and audio recognition. However, it is possible to identify the olfactory elements present in all of these media sources. The integration of smell into the digital landscape presents a set of challenges, since smell can only be captured indirectly.

Efforts to address these challenges have resulted in initiatives such as the MUSTI task of MediaEval 2022 [1] and MediaEval 2023 [2]. In particular, the task focuses on developing systems that determine whether image-text pairs evoke the same smell, and also on effectively identifying which smell sources are present in the images. The best-performing approaches to the task rely on state-of-the-art models that process images and text separately before performing a visual entailment task, as presented by Akdemir et al. [3]. Another relevant solution comes from Shao et al. [4], where YOLOv5 is used to extract image features that are encoded alongside the textual passage by a multilingual model and finally fed to a classifier.

In this paper, we present a solution based on the creation of language-specific CLIP [5] models using different text encoders for each of the four available languages, and also a combination of them in which a simple neural network is trained to perform a regression task.
Our work is focused on Subtask 1: predicting whether a text passage and an image evoke the same smell source.

2. Approach

Figure 1: Overview of the proposed solutions (training process, classification using a language-specific model, and Combination of Experts).

The original implementation of CLIP [5] is based on the premise that, for a given image-text pair, both text and image encoders ideally produce identical representations within a shared embedding space. However, standard implementations of the framework are unsuitable for our specific scenario, since for negative smell relationships we expect different representations even when the image and the text depict similar overall semantic concepts. To extend the CLIP contrastive approach to encompass negative pairs alongside positive ones, we modify the loss function to accommodate both similar (positive) and dissimilar (negative) examples. Consequently, during training, our loss function aims to bring together the model representations of positive pairs, fostering similar embeddings, while simultaneously driving apart the representations of negative pairs, encouraging dissimilar embeddings. Incorporating both positive and negative pairs allows the model to discern between relevant and irrelevant image-text pairs, thereby improving its capacity to comprehend and differentiate semantic content.

Regarding the models used for training, we use the same ViT checkpoint (1) for the vision part across all languages. As text encoders, we use English MPNet (2), French CamemBERT (3), Italian BERT (4), and German BERT (5). All checkpoints are available on Hugging Face.

(1) https://huggingface.co/google/vit-base-patch16-224-in21k
(2) https://huggingface.co/sentence-transformers/all-mpnet-base-v2
(3) https://huggingface.co/dangvantuan/sentence-camembert-large
(4) https://huggingface.co/dbmdz/bert-base-italian-uncased
(5) https://huggingface.co/bert-base-german-dbmdz-uncased

Since the number of training examples in the original challenge dataset is low and the classes are imbalanced, three different experimental setups, described in Table 1, were considered. With these additional experiments, our aim was to obtain a more general and unbiased model that generalizes adequately while also being potentially more capable of detecting positive cases.

2.1. Training language-specific CLIP models

Every language-specific CLIP model is trained by feeding image-text pairs to their corresponding encoders to obtain textual and visual embeddings. Next, we calculate their cosine similarity and use it as input to a sigmoid function, ensuring that the resulting output matches the format demanded by our task. Finally, our approach adopts the Binary Cross-Entropy (BCE) loss.

Note that the use of different text encoders yields language-dependent visual classifiers. As a result, a projection layer is included after obtaining the embeddings, normalizing their size while effectively reducing the complexity of the problem. In particular, the final shared space is restricted to a size of 256.
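To make this training pipeline more concrete, the following minimal PyTorch sketch illustrates how one language-specific model could be assembled from the checkpoints listed above. It is not the original implementation: the class name, the mean-pooling of the text tokens, and the use of the ViT [CLS] token are illustrative assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, ViTModel

class FClipSketch(nn.Module):
    """Illustrative language-specific model: a ViT image encoder and a text
    encoder, each projected into a shared 256-dimensional embedding space."""

    def __init__(self,
                 text_ckpt="sentence-transformers/all-mpnet-base-v2",
                 vision_ckpt="google/vit-base-patch16-224-in21k",
                 proj_dim=256):
        super().__init__()
        self.text_encoder = AutoModel.from_pretrained(text_ckpt)
        self.image_encoder = ViTModel.from_pretrained(vision_ckpt)
        self.text_proj = nn.Linear(self.text_encoder.config.hidden_size, proj_dim)
        self.image_proj = nn.Linear(self.image_encoder.config.hidden_size, proj_dim)

    def forward(self, pixel_values, input_ids, attention_mask):
        # Mean-pool the token embeddings as the text representation (one possible choice).
        tokens = self.text_encoder(input_ids=input_ids,
                                   attention_mask=attention_mask).last_hidden_state
        mask = attention_mask.unsqueeze(-1).float()
        text_emb = (tokens * mask).sum(dim=1) / mask.sum(dim=1)
        # Use the ViT [CLS] token as the image representation.
        image_emb = self.image_encoder(pixel_values=pixel_values).last_hidden_state[:, 0]
        # Project both modalities into the shared space and L2-normalize.
        text_emb = nn.functional.normalize(self.text_proj(text_emb), dim=-1)
        image_emb = nn.functional.normalize(self.image_proj(image_emb), dim=-1)
        # Cosine similarity of the normalized embeddings, squashed with a sigmoid.
        return torch.sigmoid((text_emb * image_emb).sum(dim=-1))

# Training step: BCE pulls positive pairs together and pushes negative pairs apart.
# criterion = nn.BCELoss()
# loss = criterion(model(pixel_values, input_ids, attention_mask), labels.float())
```

In practice, one such model would be instantiated per language by swapping the text checkpoint for the corresponding CamemBERT, Italian BERT, or German BERT one, keeping the ViT checkpoint fixed.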
Despite utilizing language-specific text encoders, data from all languages is used when training each fine-tuned CLIP model. Although we aimed to develop language-specialized models, we considered it beneficial for each model to also learn from the other languages, both because some words or expressions do not vary much between languages and because only a few samples are available for some cases.

2.2. Combination of Experts (CoE)

Once every pair of language-specific text and image encoders has been trained, we hypothesize that their combination might enhance overall performance. Therefore, to benefit from their distinct expertise, cosine similarities are computed for each image-text pair using the language-specific models of all languages. The obtained values are used as input to train a simple neural network with a regression layer, producing a single output that represents the probability of belonging to the positive class. Finally, a threshold is applied to the model's predictions to obtain the final classification.

3. Results and Discussion

To evaluate our models, we follow a 5-fold cross-validation scheme. In addition, a fixed, reduced test set specific to each of the experimental setups considered was used; a description of these setups is reported in Table 1. The metrics used are macro F1, since it is the metric used in the challenge to evaluate models, and the area under the curve (AUC) of the receiver operating characteristic (ROC) curve, which is used to define the best possible decision threshold in class-imbalance scenarios. To do so, we use the scikit-learn ROC curve implementation: the geometric mean of sensitivity and specificity is calculated for each candidate threshold, and the threshold maximizing it is selected (see the sketch below).

Regarding the experimental setup with the original challenge dataset, corresponding to unbalanced negative (Unbal. Neg.), a noticeable bias can be observed in the mean of the cosine similarities (Mean Cos. Sim.). Consequently, models trained on such data are expected to be biased too. Following the threshold selection procedure defined previously, a value of approximately 0.3 is the best to be used. Applying it in the experiments carried out under our evaluation scheme, the best performance is obtained by the language-specific model using the French text encoder, with a macro F1 of 0.7031. It also achieves 0.6342 on the challenge test dataset. Furthermore, the CoE reports a macro F1 of 0.4648 in Table 2, showing that its performance is considerably lower than that of a single model used independently.

For the balanced setup, considering the mean value of the cosine similarities, we can conclude that our models are now less biased than when all available data are used. Looking again at the results in Table 1 and Table 2, the language-specific French model is the best, obtaining values of 0.7253 and 0.6401, respectively. In particular, the latter surpasses the best result reported by Akdemir et al. [3], which is 0.6176. However, the CoE solution achieves here a macro F1 of 0.4443 on the challenge test dataset. Finally, if we again compute the best possible threshold to be applied over the probabilities produced by the models, a value of 0.5 is obtained.
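The threshold selection procedure described above can be summarized with the following minimal sketch, built on scikit-learn's roc_curve; the function and variable names are illustrative, not taken from the original implementation.

```python
import numpy as np
from sklearn.metrics import roc_curve

def select_threshold(y_true, y_prob):
    """Return the candidate threshold that maximizes the geometric mean of
    sensitivity (TPR) and specificity (1 - FPR) along the ROC curve."""
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)
    gmeans = np.sqrt(tpr * (1.0 - fpr))
    return thresholds[np.argmax(gmeans)]

# Hypothetical usage with validation-fold outputs:
# best_threshold = select_threshold(val_labels, val_probabilities)
# predictions = (test_probabilities >= best_threshold).astype(int)
```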
For the case with class imbalance towards the positive class, denoted Unbal. Pos., the mean value of the cosine similarities suggests that our models are slightly biased. This also highlights that data augmentation procedures might be required to train a model with these specific biases, since we considered that removing even more examples to induce a larger bias would lead to low-performing models. In this case, the best-performing model is obtained using the German text encoder. However, according to Table 2, the CoE evaluated on the challenge test dataset obtains a score of 0.4363. Regarding optimal thresholds, values around 0.55 and 0.6 are obtained in this case, the latter being selected for the results presented.

Table 1: Distribution of examples for each of the experimental setups considered, together with the mean value of the cosine similarity over the image-text pairs of each test set. Additionally, the macro F1 score obtained under the 5-fold cross-validation procedure is reported for each trained model.

Exp. setup    Total   Pos.   Neg.    Test        Mean Cos. Sim.   English   French   German   Italian   CoE
Unbal. Neg.   2,374   593    1,781   356 (15%)   -0.64±0.05       0.6532    0.7031   0.6814   0.6311    0.6226
Balanced      1,218   593    625     122 (10%)   -0.02±0.06       0.6889    0.7253   0.7015   0.6504    0.6490
Unbal. Pos.   994     593    401     146 (15%)   0.26±0.10        0.7043    0.6873   0.7125   0.6542    0.5950

Table 2: Macro F1 results from the runs on the challenge test dataset, using models trained on the unbalanced negative (Unbal. Neg.), unbalanced positive (Unbal. Pos.), and balanced datasets.

Exp. setup    English   French   German   Italian   CoE
Unbal. Neg.   0.6157    0.6342   0.5903   0.5834    0.4648
Balanced      0.5789    0.6401   0.6284   0.5234    0.4443
Unbal. Pos.   0.5279    0.4538   0.6193   0.5399    0.4363

4. Conclusions

In this paper, we propose a method to simultaneously adapt text and image encoders. Once the training process is finished and they are combined, a scent embedding space is obtained in which the individual representations of image-text pairs with olfactory relationships are similar, and different otherwise. For this specific problem, a single pair of text-image encoders achieves better overall performance than the combination of the outputs of all language-specific experts. Moreover, balancing the dataset has proven to be an effective technique for better generalization. Table 1 illustrates how, after removing almost half of the examples from the original dataset to balance it, improvements in the macro F1 score can be observed. This is also highlighted in Table 2, since the best result is obtained with the balanced setup. In particular, the language-specific French model shows the best overall performance, possibly as a result of using a more powerful text encoder than the rest. However, further exploration is required to benefit from the expertise of each language-specific model, as they have only been proven to work well independently.

Acknowledgments

S.E.-R.'s research was supported by the Spanish Ministry of Education (FPI grant PRE2022-105516). This work was funded by Project ASTOUND (101071191 — HORIZON-EIC-2021-PATHFINDERCHALLENGES-01) of the European Commission and by the Spanish Ministry of Science and Innovation through the projects GOMINOLA (PID2020-118112RB-C22) and BeWord (PID2021-126061OB-C43), funded by MCIN/AEI/10.13039/501100011033 and by the European Union "NextGenerationEU/PRTR".

References

[1] A. Hürriyetoglu, T. Paccosi, S. Menini, M. Zinnen, P. Lisena, K. Akdemir, R. Troncy, M. van Erp, MUSTI - multimodal understanding of smells in texts and images at MediaEval 2022, in: S. Hicks, A. G. S. de Herrera, J. Langguth, A. Lommatzsch, S. Andreadis, M. Dao, P. Martin, A. Hürriyetoglu, V. Thambawita, T. S. Nordmo, R. Vuillemot, M. A. Larson (Eds.), Working Notes Proceedings of the MediaEval 2022 Workshop, Bergen, Norway and Online, 12-13 January 2023, volume 3583 of CEUR Workshop Proceedings, CEUR-WS.org, 2022. URL: https://ceur-ws.org/Vol-3583/paper50.pdf.
[2] A. Hürriyetoglu, I. Novalija, M. Zinnen, V. Christlein, P. Lisena, S. Menini, M. van Erp, R. Troncy, The MUSTI challenge @ MediaEval 2023 - multimodal understanding of smells in texts and images with zero-shot evaluation, in: Working Notes Proceedings of the MediaEval 2023 Workshop, Amsterdam, The Netherlands and Online, 1-2 February 2024, 2023.

[3] K. Akdemir, A. Hürriyetoglu, R. Troncy, T. Paccosi, S. Menini, M. Zinnen, V. Christlein, Multimodal and multilingual understanding of smells using vilbert and muniter, in: S. Hicks, A. G. S. de Herrera, J. Langguth, A. Lommatzsch, S. Andreadis, M. Dao, P. Martin, A. Hürriyetoglu, V. Thambawita, T. S. Nordmo, R. Vuillemot, M. A. Larson (Eds.), Working Notes Proceedings of the MediaEval 2022 Workshop, Bergen, Norway and Online, 12-13 January 2023, volume 3583 of CEUR Workshop Proceedings, CEUR-WS.org, 2022. URL: https://ceur-ws.org/Vol-3583/paper36.pdf.

[4] Y. Shao, Y. Zhang, W. Wan, J. Li, J. Sun, Multilingual text-image olfactory object matching based on object detection, in: S. Hicks, A. G. S. de Herrera, J. Langguth, A. Lommatzsch, S. Andreadis, M. Dao, P. Martin, A. Hürriyetoglu, V. Thambawita, T. S. Nordmo, R. Vuillemot, M. A. Larson (Eds.), Working Notes Proceedings of the MediaEval 2022 Workshop, Bergen, Norway and Online, 12-13 January 2023, volume 3583 of CEUR Workshop Proceedings, CEUR-WS.org, 2022. URL: https://ceur-ws.org/Vol-3583/paper15.pdf.

[5] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever, Learning transferable visual models from natural language supervision, CoRR abs/2103.00020 (2021). URL: https://arxiv.org/abs/2103.00020. arXiv:2103.00020.