<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MUSTI - Multimodal Understanding of Smells in Texts and Images Using CLIP</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>P. Mirunalini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>V. Sanjhay</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>S. Rohitram</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>M. Rohith</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Engineering, Sri Sivasubramaniya Nadar College of Engineering</institution>
          ,
          <addr-line>Chennai - 603110, Tamil Nadu</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This study addresses the Multimodal Understanding of Smells in Texts and Images (MUSTI) task, which explores the relationship between textual descriptions and visual depictions of smells. Smell is an underrepresented dimension in multimedia analysis, and the task examines olfactory references in texts and images across different historical periods and languages. Its primary objective, MUSTI Classification, asks participants to predict whether a given text-image pair shares a common olfactory source, framed as a binary classification problem (1 for a shared source, 0 otherwise). The proposed approach leverages the CLIP model together with BERT-based tokenization to extract features from both modalities and computes the cosine similarity between text and image features based on their scent-evoking content. By benchmarking the resulting similarity scores against a threshold, the model quantifies the degree of correlation between textual and visual representations of smells, enabling a nuanced understanding of their connections. Trained and evaluated on a dataset of image-text pairs, the model reports accuracy scores that provide a foundation for a comprehensive analysis of scent-related associations within multimedia content.</p>
      </abstract>
      <kwd-group>
        <kwd>Deep learning model</kwd>
        <kwd>Smells</kwd>
        <kwd>CLIP</kwd>
        <kwd>BERT</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        In recent years, the study of olfactory dimensions has gained traction, recognizing the
significance of smells in memory and emotions [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Projects like Odeuropa aim to enrich metadata and
develop new interfaces like the "scent wheel" in cultural heritage collections [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Initiatives,
including a diachronic multilingual benchmark and a comprehensive data model, address this
gap by capturing and structuring olfactory information across languages and time [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ][
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Efforts
in creating a multi-lingual taxonomy further enhance computational understanding of sensory
experiences [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Furthermore, the development of a systematic theoretical framework to capture
olfactory information from texts [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] represents a pivotal step toward automated systems for
computational analysis.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Approach</title>
      <p>
        For the Multimodal Understanding of Smells in Texts and Images (MUSTI) task [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ][
        <xref ref-type="bibr" rid="ref7">7</xref>
        ][
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], an
integrated approach leveraging state-of-the-art language and vision models, specifically
employing the Contrastive Language-Image Pretraining (CLIP) model [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] has been devised. The
methodology involves the seamless fusion of textual and visual information to decipher and
correlate olfactory references [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] within textual passages and corresponding images from historical
periods ranging from the 16th to the 20th century. The BERT model was also used for tokenization
of the text provided in the dataset. BERT [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] is often used in natural language processing tasks
like tokenization to break large chunks of text into meaningful units. Its tokenization process
involves splitting text into sub-word units that capture intricate linguistic patterns for better
contextual understanding during model training and inference. The backbone used for vision
feature extraction in CLIP models [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] like openai/clip-vit-base-patch16 is a Vision Transformer
(ViT) architecture. This architecture represents images as sequences of patches and processes
them through a transformer network, enabling the model to capture spatial relationships and
features from the image. A classification threshold of 470 is used: a pair is labelled positive
when the average similarity score between the image and text exceeds this threshold.
      </p>
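      <p>The decision rule described above can be sketched as follows; the vectors here are random stand-ins for CLIP image and text embeddings, and the threshold value is illustrative rather than the 470 used in the experiments.</p>

```python
import random

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors (plain-Python sketch)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b)

def shares_smell_source(image_feat, text_feat, threshold):
    """Binary MUSTI decision: 1 if the pair's similarity reaches the threshold."""
    return 1 if cosine_similarity(image_feat, text_feat) >= threshold else 0

random.seed(0)
image_feat = [random.gauss(0, 1) for _ in range(512)]  # stand-in image embedding
text_feat = [random.gauss(0, 1) for _ in range(512)]   # stand-in text embedding
print(shares_smell_source(image_feat, image_feat, threshold=0.5))  # prints 1
```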
      <sec id="sec-3-1">
        <title>3.1. Data Preprocessing and Feature Extraction</title>
        <p>The MUSTI dataset, a collection of copyright-free texts and images from various repositories
and archives, forms the basis for this task. Annotated with over 80 categories of smell objects
and gestures, the dataset allows participants to develop models aimed at recognizing and linking
olfactory references across languages and modalities.</p>
        <p>Initially, the dataset undergoes preprocessing of the provided textual data, comprising
multilingual passages in English, German, Italian, and French, using a tokenization scheme implemented
through the ’bert-base-multilingual-cased’ tokenizer. To accommodate varying text lengths, the
data is divided into chunks of text that fit within the maximum sequence length compatible with
the CLIP model’s input requirements. Simultaneously, images linked to the textual descriptions
are acquired by parsing the provided URLs. These images undergo resizing and conversion to
numerical arrays for further processing.</p>
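        <p>The chunking step can be sketched as below. A real run would operate on the token ids produced by the ’bert-base-multilingual-cased’ tokenizer; this illustrative version splits a whitespace-tokenized passage, and the default maximum of 77 mirrors CLIP’s text-input limit.</p>

```python
def chunk_tokens(tokens, max_len=77):
    """Split a token sequence into chunks that fit the model's maximum sequence length."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

passage = "A faint smell of incense drifted through the old stone church"
tokens = passage.split()           # stand-in for BERT sub-word tokenization
chunks = chunk_tokens(tokens, max_len=4)
print(len(chunks))                 # 11 tokens in chunks of 4 -> 3 chunks
```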
        <p>To create feature representations conducive to model interpretation, a combination of
pixellevel information from images and token embeddings derived from textual segments are utilized
by our model. Image arrays are flattened, while textual segments are tokenized and flattened,
yielding joint feature representations that capture the essence of both modalities.</p>
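        <p>A minimal sketch of this fusion step, using random arrays in place of a resized image and the token embeddings of one text chunk (the array shapes are illustrative assumptions, not the exact dimensions used in the experiments):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))         # resized image as a numerical array
token_embeddings = rng.random((77, 768))  # token embeddings for one text chunk

# Flatten each modality and concatenate into a single joint feature vector.
joint = np.concatenate([image.ravel(), token_embeddings.ravel()])
print(joint.shape)                        # (224*224*3 + 77*768,) = (209664,)
```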
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Model Utilization: CLIP Integration for Multimodal Understanding</title>
        <p>The proposed work employs the CLIP model which is designed to predict the image and text
pairings. CLIP’s embeddings for images and text share the same space and the CLIP’s layers
leverage a large-scale pretrained model which enables us to encode semantic relationships
between text and images. This allows for cross-modal understanding, aiding in associating
textual descriptions of smells with relevant visual representations, facilitating multimodal
comprehension of olfactory concepts. The CLIP loss aims to maximize the cosine similarity
between the image and text embeddings for the N genuine pairs in the batch while minimizing
the cosine similarity for the N² - N incorrect pairings. This similarity measure serves as an
indicator of the correspondence or shared olfactory source between the textual descriptions
and the accompanying images.</p>
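        <p>The symmetric objective described above can be written out as follows, a sketch using NumPy with random embeddings in place of real CLIP outputs (the temperature value is an assumption):</p>

```python
import numpy as np

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric cross-entropy over the N x N cosine-similarity matrix:
    the N genuine (diagonal) pairs are pulled together while the
    N**2 - N incorrect pairings are pushed apart."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # N x N scaled cosine similarities
    labels = np.arange(len(logits))

    def cross_entropy(m):
        m = m - m.max(axis=1, keepdims=True)  # stabilise the softmax
        log_probs = m - np.log(np.exp(m).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the image-to-text and text-to-image directions.
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2

rng = np.random.default_rng(0)
loss = clip_loss(rng.normal(size=(8, 512)), rng.normal(size=(8, 512)))
print(float(loss))  # positive scalar; smaller when genuine pairs align
```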
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Evaluation and Result Interpretation</title>
        <p>The cosine similarities between text and images were computed and also an average similarity
score across all segmented text chunks associated with each image is derived, providing a
comprehensive assessment of the overall olfactory connection between text and image pairs.
The average similarity scores are benchmarked against a threshold value to determine the
presence or absence of a shared olfactory source between text and image pairs. By assigning
binary labels based on the similarity threshold, the classification task (MUSTI Classification) for
recognizing the co-relation of smells across modalities is facilitated.</p>
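        <p>A sketch of this aggregation step: the per-chunk similarity scores for one text-image pair are averaged, and the average is compared against the threshold (the scores and threshold here are made-up values, not outputs of the actual system).</p>

```python
def classify_pair(chunk_similarities, threshold):
    """Average the per-chunk similarity scores for one text-image pair and
    assign the binary MUSTI label against the threshold."""
    avg = sum(chunk_similarities) / len(chunk_similarities)
    label = "YES" if avg >= threshold else "NO"
    return label, avg

label, avg = classify_pair([0.62, 0.48, 0.55], threshold=0.5)
print(label)  # the three chunk scores average to 0.55, above 0.5 -> YES
```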
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Analysis</title>
      <p>A series of experiments was conducted, iterating over diferent stages of data preprocessing,
feature extraction, and model utilization to achieve comprehensive olfactory understanding
across text and image modalities. The proposed system was evaluated using precision, recall,
and F1-score metrics, along with overall accuracy. The results are depicted in Table 1
below. The accuracy was found to be 61.38%.</p>
      <p>In further attempts, excluding the title when computing the similarity led to a better accuracy
of 63.47%. This could imply that the inclusion of the title caused confusion. The results are
depicted in Table 2.</p>
      <p>[Tables 1 and 2: Precision, Recall, F1-Score, and Support for each class.]</p>
      <sec id="sec-4-1">
        <title>4.1. Interpretation of Results</title>
        <p>The achieved accuracy of 63.47% in predicting the presence (’YES’) or absence (’NO’) of common
smell sources between text passages and images demonstrates a moderate level of success in
our model’s performance. The precision, recall, and F1-scores for each class (’YES’ and ’NO’)
indicate notable differences in the model’s ability to predict positive and negative instances.
The model displays a relatively high precision of 80.75% for identifying cases where there are
no common smell sources, but a recall of 61.54% for that class, indicating that it misses a
considerable number of actual instances where there are no common smell sources. The F1-score
of 60.84% suggests a fair balance between precision and recall across the classes.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion And Outlook</title>
      <p>To refine the model, options include further fine-tuning CLIP and optimizing feature
engineering techniques for better discernment of nuanced textual and visual olfactory relationships.
Augmenting and enriching the dataset with diverse textual genres, images from various
historical periods, and more smell-related annotations can significantly enhance the model’s
comprehension of olfactory references across different contexts and time frames.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lisena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schwabe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>van Erp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Troncy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Tullett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Leemans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Marx</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. C.</given-names>
            <surname>Ehrich</surname>
          </string-name>
          ,
          <article-title>Capturing the semantics of smell: The odeuropa data model for olfactory heritage information</article-title>
          ,
          <source>in: European Semantic Web Conference</source>
          , Springer,
          <year>2022</year>
          , pp.
          <fpage>387</fpage>
          -
          <lpage>405</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S. C.</given-names>
            <surname>Ehrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Verbeek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zinnen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Marx</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bembibre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Leemans</surname>
          </string-name>
          ,
          <article-title>Nose-first. towards an olfactory gaze for digital art history</article-title>
          ,
          <source>in: CEUR Workshop Proceedings</source>
          , volume
          <volume>3064</volume>
          , CEUR Workshop Proceedings,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Menini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Paccosi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tonelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Van Erp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Leemans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lisena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Troncy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Tullett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hürriyetoğlu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Dijkstra</surname>
          </string-name>
          , et al.,
          <article-title>A multilingual benchmark to capture olfactory situations over time</article-title>
          ,
          <source>in: Proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Menini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Paccosi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Tekiroglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tonelli</surname>
          </string-name>
          ,
          <article-title>Building a multilingual taxonomy of olfactory terms with timestamps</article-title>
          ,
          <source>in: Proceedings of the Thirteenth Language Resources and Evaluation Conference, European Language Resources Association</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>4030</fpage>
          -
          <lpage>4039</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Tonelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Menini</surname>
          </string-name>
          ,
          <article-title>Framenet-like annotation of olfactory information in texts</article-title>
          ,
          <source>in: Proceedings of the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage</source>
          ,
          <source>Social Sciences, Humanities and Literature</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>11</fpage>
          -
          <lpage>20</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hürriyetoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Paccosi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Menini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zinnen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lisena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Akdemir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Troncy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>van Erp</surname>
          </string-name>
          ,
          <article-title>MUSTI - multimodal understanding of smells in texts and images at mediaeval 2022</article-title>
          , in:
          <string-name>
            <given-names>S.</given-names>
            <surname>Hicks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G. S.</given-names>
            <surname>de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Langguth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lommatzsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Andreadis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hürriyetoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Thambawita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. S.</given-names>
            <surname>Nordmo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vuillemot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Larson</surname>
          </string-name>
          (Eds.),
          <source>Working Notes Proceedings of the MediaEval 2022 Workshop</source>
          , Bergen, Norway and Online,
          <volume>12</volume>
          -
          <fpage>13</fpage>
          January
          <year>2023</year>
          , volume
          <volume>3583</volume>
          <source>of CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2022</year>
          . URL: https://ceur-ws.org/Vol-3583/paper50.pdf
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>K.</given-names>
            <surname>Akdemir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hürriyetoğlu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Troncy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Paccosi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Menini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zinnen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Christlein</surname>
          </string-name>
          ,
          <article-title>Multimodal and multilingual understanding of smells using vilbert and muniter</article-title>
          ,
          <source>in: Proceedings of MediaEval 2022 CEUR Workshop</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hürriyetoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Novalija</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zinnen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Christlein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lisena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Menini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>van Erp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Troncy</surname>
          </string-name>
          ,
          <article-title>The MUSTI challenge @ MediaEval 2023 - multimodal understanding of smells in texts and images with zero-shot evaluation</article-title>
          ,
          <source>in: Working Notes Proceedings of the MediaEval 2023 Workshop</source>
          , Amsterdam,
          <source>the Netherlands and Online, 1-2 February</source>
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Goh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          , et al.,
          <article-title>Learning transferable visual models from natural language supervision</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>8748</fpage>
          -
          <lpage>8763</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>B.</given-names>
            <surname>Huber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Larsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. N.</given-names>
            <surname>Spengler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Boivin</surname>
          </string-name>
          ,
          <article-title>How to use modern science to reconstruct ancient scents</article-title>
          ,
          <source>Nature Human Behaviour</source>
          <volume>6</volume>
          (
          <year>2022</year>
          )
          <fpage>611</fpage>
          -
          <lpage>614</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>Bert: Pre-training of deep bidirectional transformers for language understanding</article-title>
          , arXiv preprint arXiv:1810.04805 (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>