<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Multimodal and Multilingual Olfactory Matching based on Contrastive Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sergio Esteban-Romero</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Iván Martín-Fernández</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jaime Bellver-Soler</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manuel Gil-Martín</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fernando Fernández-Martínez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Grupo de Tecnología del Habla y Aprendizaje Automático (THAU Group), Information Processing and Telecommunications Center, E.T.S.I. de Telecomunicación, Universidad Politécnica de Madrid</institution>
          ,
          <addr-line>UPM</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>This paper introduces an approach to the multimodal smell identification task that uses CLIP-based solutions, with Vision Transformers (ViT) as image processors and language-specific text encoders. The proposed method addresses the question of whether image-text pairs convey similar olfactory experiences by aligning them in a shared embedding space. A notable consideration in our study is the challenge posed by class imbalance, where certain olfactory experiences are represented far more often than others. Hence, this paper describes a supervised methodology for training the CLIP-based model that reinforces positive olfactory relationships and suppresses negative ones. Additionally, we explore different data balancing procedures aimed at preserving the original distribution between languages. One of our proposed approaches achieves higher accuracy than the top-performing result reported in the 2022 edition of the MUSTI challenge.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Related Work</title>
      <p>In the evolving landscape of artificial intelligence (AI) algorithms, the sensory dimension of
smell has been left out of the picture compared to the advances in computer vision, natural
language processing, and audio recognition. However, it is possible to identify the olfactory
elements present in all of those media sources. Integrating smell into the digital landscape
presents a set of challenges because it can only be captured indirectly.</p>
      <p>
        Efforts in this direction resulted in challenges like the MUSTI task of MediaEval 2022 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and
MediaEval 2023 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In particular, the task focuses on the development of systems that determine whether
image-text pairs evoke the same smell and that effectively identify which smell sources
are present in the images.
      </p>
      <p>
        The best approaches to the task rely on state-of-the-art models that
process images and text separately and then perform a visual entailment task, as presented by
Akdemir et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Another relevant solution comes from Shao et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], where YOLOv5 is used to
extract features that are encoded alongside the textual passage using a multilingual model
and finally fed to a classifier.
      </p>
      <p>
        In this paper, we present a solution based on the creation of language-specific CLIP [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] models
using a different text encoder for each of the four available languages, and also a combination of
them obtained by training a simple neural network to perform a linear regression task. Our work focuses
on Subtask 1, predicting whether a text passage and an image evoke the same smell source.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Approach</title>
      <p>[Figure: overview of the approach. Training process: input data and labels are passed through the text encoder and the image encoder of F-CLIP. Classification using language-specific model: text encoder and image encoder, cosine similarity, sigmoid, YES/NO. Combination of Experts: F-CLIP English, F-CLIP German, F-CLIP Italian and F-CLIP French, sigmoid, YES/NO.]</p>
      <p>
        The original implementation of CLIP [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] is based on the premise that, for a given
image-text pair, both text and image encoders ideally produce identical representations within a
shared embedding space. However, standard implementations of the framework
are unsuitable for our specific scenario, since for negative smell relationships we expect
different representations even when the pair depicts similar overall semantic concepts. To extend
the CLIP contrastive approach to encompass negative pairs alongside positive ones, we modify
the loss function to accommodate both similar (positive) and dissimilar (negative) examples.
Consequently, during training, our loss function aims to bring together model representations of
positive pairs, fostering similar embeddings, while simultaneously driving apart representations
of negative pairs, encouraging dissimilar embeddings. This incorporation of both positive and
negative pairs allows the model to discern between relevant and irrelevant image-text pairs,
thereby improving its capacity to comprehend and differentiate semantic content.
      </p>
      <p>For the vision part, we used the ViT<sup>1</sup> checkpoint for all languages.
The text encoders used are English MPNet<sup>2</sup>, French CamemBERT<sup>3</sup>, Italian
BERT<sup>4</sup> and German BERT<sup>5</sup>. The checkpoints are available on Hugging Face.</p>
      <p>Since the original challenge dataset provides few training examples and is also affected by
class imbalance, three different experimental setups, described in Table 1,
were considered. With these additional experiments, our aim was to obtain a less biased
model that generalizes adequately and is also potentially more capable of
detecting positive cases.</p>
      <sec id="sec-2-1">
        <title>2.1. Training language-specific CLIP models</title>
        <p>Every language-specific CLIP model is trained by feeding image-text pairs to the
corresponding encoders to obtain textual and visual embeddings. Next, we calculate the cosine
similarity between them and use it as input to a sigmoid function, so that the resulting
output matches the binary format demanded by our task. Finally, our approach adopts the Binary
Cross-Entropy (BCE) loss.</p>
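        <p>The chain of cosine similarity, sigmoid and BCE described here can be sketched as follows. This is a minimal illustration in plain Python and not the authors' code: the function names and the toy three-dimensional embeddings are assumptions made for the example, and a real training loop would compute the same quantities on batches with a tensor library.</p>
        <preformat>
```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def bce_loss(text_emb, image_emb, label, eps=1e-12):
    """BCE between sigmoid(cosine similarity) and a 0/1 match label."""
    p = sigmoid(cosine_similarity(text_emb, image_emb))
    p = min(max(p, eps), 1.0 - eps)  # guard the logarithms
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


# Toy embeddings (illustrative values, not real model outputs).
text_emb = [0.2, 0.9, 0.1]
aligned_image = [0.25, 0.8, 0.05]
opposed_image = [-0.2, -0.9, -0.1]

# A positive pair is penalized less when the embeddings are aligned.
loss_pos_aligned = bce_loss(text_emb, aligned_image, label=1)
loss_pos_opposed = bce_loss(text_emb, opposed_image, label=1)
# A negative pair is penalized less when the embeddings point apart.
loss_neg_opposed = bce_loss(text_emb, opposed_image, label=0)
```
</preformat>
        <p>Because the cosine similarity lies in [-1, 1], the sigmoid output in this sketch stays between roughly 0.27 and 0.73. Whether the original implementation rescales the similarity before the sigmoid is not stated in the text.</p>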
        <p>Note that the usage of different text encoders yields language-dependent visual classifiers.
As a result, a projection layer is included after obtaining the embeddings, normalizing the input
size while effectively reducing the complexity of the problem. In particular, the final shared
space is restricted to a size of 256.</p>
        <p>Checkpoint footnotes:
1 https://huggingface.co/google/vit-base-patch16-224-in21k;
2 https://huggingface.co/sentence-transformers/all-mpnet-base-v2;
3 https://huggingface.co/dangvantuan/sentence-camembert-large;
4 https://huggingface.co/dbmdz/bert-base-italian-uncased;
5 https://huggingface.co/bert-base-german-dbmdz-uncased.</p>
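        <p>The role of the projection layer in this section can be illustrated with a small sketch: two linear maps bring text and image embeddings of different sizes into the same 256-dimensional space. The encoder output sizes (1024 and 768), the random weights and the helper names are assumptions for illustration only; the actual sizes depend on the chosen checkpoints, and the real projections are learned during training.</p>
        <preformat>
```python
import math
import random

SHARED_DIM = 256  # size of the shared space stated in the text


def make_projection(in_dim, out_dim, rng):
    """Random weight matrix standing in for a trained linear projection."""
    scale = 1.0 / math.sqrt(in_dim)
    return [[rng.uniform(-scale, scale) for _ in range(in_dim)]
            for _ in range(out_dim)]


def project(weights, embedding):
    """Apply the linear map: one dot product per output dimension."""
    return [sum(w * x for w, x in zip(row, embedding)) for row in weights]


rng = random.Random(0)

# Hypothetical encoder output sizes; the real ones depend on the checkpoints.
text_dim, image_dim = 1024, 768
text_proj = make_projection(text_dim, SHARED_DIM, rng)
image_proj = make_projection(image_dim, SHARED_DIM, rng)

# Random vectors standing in for encoder outputs.
text_emb = [rng.gauss(0.0, 1.0) for _ in range(text_dim)]
image_emb = [rng.gauss(0.0, 1.0) for _ in range(image_dim)]

# After projection both modalities have the same size and can be compared.
text_shared = project(text_proj, text_emb)
image_shared = project(image_proj, image_emb)
```
</preformat>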
        <p>Despite utilizing language-specific text encoders, data from all languages is used when
training each fine-tuned CLIP. Although we aimed to develop language-specialized models, we
thought it would also be beneficial to learn from the other languages, because some words or expressions do
not vary much between languages and because there are few samples for some cases.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Combination of Experts (CoE)</title>
        <p>Once every pair of language-specific text and image encoders has been trained, we
hypothesized that their combination might enhance overall performance. Therefore, to benefit
from their distinct expertise, cosine similarities are computed for each example pair using
the language-specific models from all languages. The values obtained are used as input to train a
simple neural network with a regression layer, producing a single output that represents the
probability of belonging to the positive class. Finally, a threshold is applied to this
probability to obtain the final classification.</p>
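        <p>A minimal sketch of this combination step is given below, assuming the simplest possible network: a single logistic output layer over the four per-language cosine similarities. The weights, the bias, the 0.5 default threshold and the function names are hypothetical; the text does not specify the architecture beyond "a simple neural network with a regression layer", and in practice the parameters would be learned from data.</p>
        <preformat>
```python
import math

LANGUAGES = ("english", "german", "italian", "french")


def coe_probability(similarities, weights, bias):
    """Logistic output layer over the four per-language cosine similarities."""
    z = bias + sum(weights[lang] * similarities[lang] for lang in LANGUAGES)
    return 1.0 / (1.0 + math.exp(-z))


def coe_predict(similarities, weights, bias, threshold=0.5):
    """Threshold the probability to obtain the YES/NO decision."""
    return coe_probability(similarities, weights, bias) >= threshold


# Hypothetical parameters; in practice they would be learned from data.
weights = {"english": 1.0, "german": 0.5, "italian": 0.5, "french": 2.0}
bias = -1.0

# One example pair scored by each language-specific model (toy values).
similarities = {"english": 0.6, "german": 0.4, "italian": 0.3, "french": 0.7}

probability = coe_probability(similarities, weights, bias)
is_match = coe_predict(similarities, weights, bias)
```
</preformat>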
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results and Discussion</title>
      <p>To evaluate our models, we followed a 5-fold cross-validation scheme. In addition, a
fixed, reduced test set specific to each of the experimental setups considered was used. A
description of them is reported in Table 1. The metrics used are macro F1, since it is the one used
in the challenge to evaluate models, and the area under the curve (AUC) in combination
with receiver operating characteristic (ROC) curves to define the best possible threshold in class
imbalance scenarios. To do so, we used the scikit-learn ROC curve implementation to obtain
candidate thresholds. For each candidate, the geometric mean of sensitivity and specificity
is calculated, and the threshold that maximizes it is selected.</p>
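      <p>The threshold selection rule and the macro F1 metric can be sketched as follows. The paper relies on the scikit-learn ROC curve implementation to propose thresholds; to keep the example dependency-free, this sketch instead enumerates the distinct predicted scores as candidates, which is an assumption of the example. The labels, scores and function names are illustrative only.</p>
      <preformat>
```python
import math


def gmean_at(threshold, labels, scores):
    """Geometric mean of sensitivity and specificity at one threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and not s >= threshold)
    positives = sum(labels)
    negatives = len(labels) - positives
    return math.sqrt((tp / positives) * (tn / negatives))


def best_threshold(labels, scores):
    """Pick the candidate threshold that maximizes the geometric mean."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: gmean_at(t, labels, scores))


def macro_f1(labels, predictions):
    """Unweighted mean of the per-class F1 scores for classes 0 and 1."""
    f1_scores = []
    for cls in (0, 1):
        tp = sum(1 for y, p in zip(labels, predictions) if y == cls and p == cls)
        fp = sum(1 for y, p in zip(labels, predictions) if y != cls and p == cls)
        fn = sum(1 for y, p in zip(labels, predictions) if y == cls and p != cls)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy labels and predicted probabilities (illustrative only).
labels = [0, 0, 0, 0, 1, 1, 1, 0]
scores = [0.28, 0.30, 0.35, 0.41, 0.45, 0.52, 0.60, 0.48]

threshold = best_threshold(labels, scores)
predictions = [1 if s >= threshold else 0 for s in scores]
score = macro_f1(labels, predictions)
```
</preformat>
      <p>Maximizing the geometric mean rewards thresholds that keep both sensitivity and specificity high, which is why it is a common choice when one class dominates the data.</p>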
      <p>Regarding the experimental setup with the original challenge dataset, corresponding to
unbalanced negative (Unbal. Neg.), a noticeable bias could be observed in the mean of cosine
similarities (Mean Cos. Sim.). Consequently, models trained using such data are expected to be
biased too. Following the threshold selection procedure defined previously, the best threshold
is approximately 0.3. If we apply it in the experiments carried out under our evaluation scheme,
the best performance is obtained by the language-specific model using the French text encoder,
with a score of 0.7031. It also achieves 0.6342 on the challenge test dataset. Furthermore, the CoE
reports a macro-F1 value of 0.4648 in Table 2, showing that its performance is considerably
lower than that of a single model used independently.</p>
      <p>
        For the balanced setup, considering the mean value of the cosine similarities, we can conclude
that our models are now less biased than when all available data are used. Looking again at
the results in Table 1 and in Table 2, the language-specific French model is the best, obtaining
values of 0.7253 and 0.6401, respectively. In particular, the latter surpasses the best result
reported by Akdemir et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which is 0.6176. However, the CoE solution achieves a
macro-F1 of 0.4443 on the challenge test dataset in this setup. Finally, if we again compute the best possible
threshold to be applied over the probabilities from the models, a value of 0.5 is obtained.
      </p>
      <p>For the case with class imbalance towards the positive class, represented as Unbal. Pos.,
the mean value of cosine similarities suggests that our models are slightly biased. This also
suggests that data augmentation may be required to train a model with this specific
bias, since we expected that removing more examples to induce a larger bias would lead to
low-performing models. In this case, the best-performing model is obtained using the German
text encoder. However, according to Table 2, the CoE evaluated on the challenge test
dataset obtained a score of 0.4363. Regarding optimal thresholds, the values obtained in this case
are around 0.55 and 0.6, the latter being selected for the results presented.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions</title>
      <p>In this paper, we propose a method to simultaneously adapt text and image encoders. When
the training process is finished and they are combined, a scent embedding space is obtained
in which the individual representations of an image-text pair are similar when the pair shares
an olfactory relationship and different otherwise. For this specific problem, a single pair of text and image encoders
achieves better overall performance than the combination of the outputs of all experts
in different languages. Moreover, balancing the dataset has
proven to be an effective technique for better generalization. Table 1 illustrates
that, after removing almost half of the examples from the original dataset to balance it, some
improvements in the F1 score can be observed. This is also highlighted in Table 2, since the best
result is obtained when using the balanced setup. In particular, the language-specific French
model shows better overall performance, possibly as a result of having used a more powerful
text encoder compared to the rest. However, more exploration is required to benefit from the
expertise of each language-specific model, as they have been shown to work well independently.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>S.E.-R.’s research was supported by the Spanish Ministry of Education (FPI grant
PRE2022105516). This work was funded by Project ASTOUND (101071191 —
HORIZON-EIC-2021PATHFINDERCHALLENGES-01) of the European Commission and by the Spanish Ministry of
Science and Innovation through the projects GOMINOLA (PID2020-118112RB-C22) and BeWord
(PID2021-126061OB-C43), funded by MCIN/AEI/ 10.13039/501100011033 and by the European
Union “NextGenerationEU/PRTR”.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hürriyetoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Paccosi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Menini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zinnen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lisena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Akdemir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Troncy</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. van Erp</surname>
          </string-name>
          ,
          <article-title>MUSTI - multimodal understanding of smells in texts and images at mediaeval 2022</article-title>
          , in: S. Hicks, A. G. S. de Herrera, J. Langguth, A. Lommatzsch, S. Andreadis, M. Dao, P. Martin, A. Hürriyetoglu, V. Thambawita, T. S. Nordmo, R. Vuillemot, M. A. Larson (Eds.),
          <source>Working Notes Proceedings of the MediaEval 2022 Workshop</source>
          , Bergen, Norway and Online, 12-13 January 2023, volume
          <volume>3583</volume>
          of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2022</year>
          . URL: https://ceur-ws.org/Vol-3583/paper50.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hürriyetoglu</surname>
          </string-name>
          , I. Novalija,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zinnen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Christlein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lisena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Menini</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. van Erp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Troncy</surname>
          </string-name>
          ,
          <article-title>The MUSTI challenge @ MediaEval 2023 - multimodal understanding of smells in texts and images with zero-shot evaluation</article-title>
          , in:
          <source>Working Notes Proceedings of the MediaEval 2023 Workshop</source>
          , Amsterdam, the Netherlands and Online, 1-2 February 2024,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>Akdemir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hürriyetoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Troncy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Paccosi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Menini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zinnen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Christlein</surname>
          </string-name>
          ,
          <article-title>Multimodal and multilingual understanding of smells using vilbert and muniter</article-title>
          , in: S. Hicks, A. G. S. de Herrera, J. Langguth, A. Lommatzsch, S. Andreadis, M. Dao, P. Martin, A. Hürriyetoglu, V. Thambawita, T. S. Nordmo, R. Vuillemot, M. A. Larson (Eds.),
          <source>Working Notes Proceedings of the MediaEval 2022 Workshop</source>
          , Bergen, Norway and Online, 12-13 January 2023, volume
          <volume>3583</volume>
          of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2022</year>
          . URL: https://ceur-ws.org/Vol-3583/paper36.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , W. Wan,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Multilingual text-image olfactory object matching based on object detection</article-title>
          , in: S. Hicks, A. G. S. de Herrera, J. Langguth, A. Lommatzsch, S. Andreadis, M. Dao, P. Martin, A. Hürriyetoglu, V. Thambawita, T. S. Nordmo, R. Vuillemot, M. A. Larson (Eds.),
          <source>Working Notes Proceedings of the MediaEval 2022 Workshop</source>
          , Bergen, Norway and Online, 12-13 January 2023, volume
          <volume>3583</volume>
          of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2022</year>
          . URL: https://ceur-ws.org/Vol-3583/paper15.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          , G. Goh,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Krueger</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Sutskever</surname>
          </string-name>
          ,
          <article-title>Learning transferable visual models from natural language supervision</article-title>
          ,
          <source>CoRR</source>
          abs/2103.00020 (
          <year>2021</year>
          ). URL: https://arxiv.org/abs/2103.00020. arXiv:2103.00020.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>