<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Text-Image Olfactory Matching Method Based on the Distribution of Real-World Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yi Shao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yulong Sun</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wenbo Wan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jing Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jiande Sun</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Shandong Normal University</institution>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The correlation between olfactory information and human memory allows images and texts, relying on their content alone and without stimulating the olfactory cells, to create imaginary olfactory experiences for humans. This means that images and texts may contain equally rich olfactory information, but utilizing this olfactory information is inevitably limited by the distribution characteristics of image and text data, such as language gaps and long-tail distributions. To this end, this paper proposes a method based on object detection that models the similar olfactory information contained in images and texts in the same feature space, bridging the cross-language and cross-modal gaps, and adopts data augmentation and a special sampling strategy to alleviate, respectively, the language imbalance of the text data and the long-tail distribution of image objects.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In this paper, we delve into the MUSTI task of MediaEval 2023 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The MUSTI task is a text-image olfactory understanding challenge that started in 2022 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and existing works [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ] have already demonstrated the value and feasibility of this task. In the MUSTI task of MediaEval 2023,
subtask 1 is to detect whether the image and text in each sample of the development set contain objects
that cause the same olfactory experience. Further, subtask 2 is to point out what these objects
are. Subtask 3 is to perform the above two subtasks on a zero-shot Slovenian dataset.
      </p>
      <p>The texts in the development set are composed of English, French, German, and Italian, but
the proportions of these four languages are uneven (en: 795, fr: 300, de: 480, it: 799). On the
other hand, the images in the development set are mostly European medieval paintings whose
content contains a large number of objects of different classes, such as fruits, animals, portraits,
and jewelry decorations. There is a long-tail distribution phenomenon, which causes the model’s
detection performance for tail classes to drop sharply. Existing image-text retrieval research
rarely involves olfactory information, so we build our model based on the characteristics of the
dataset and combine it with traditional object detection algorithms.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Approach</title>
      <sec id="sec-2-1">
        <title>2.1. Stage 1 for Coarse-Grained</title>
        <p>
          As shown in Figure 1, our proposed method is divided into Stage 1 for coarse-grained matching
and Stage 2 for fine-grained matching. The first stage is used for coarse-grained classification,
that is, to detect whether there are similar olfactory objects in images and texts, using
multilingual BERT (m-BERT) [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] and ViT [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] to extract text features and image features, respectively. For
text features, we input complete sentences and nouns after data augmentation into m-BERT to
obtain a fine-tuned model based on sentence features. The purpose of the data augmentation is
to alleviate the cross-language gap in the dataset. Specifically, we translated the nouns in each
group into the four languages and added superclasses that come from the official data.
After these two text feature selection strategies, the text features are concatenated with the image
features, and the binary cross-entropy loss ℒ of the coarse-grained stage is calculated.
        </p>
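        <p>A minimal sketch of this coarse-grained stage is given below. The Hugging Face checkpoint names, the use of the [CLS] token as the pooled feature, and the two-layer classification head are assumptions for illustration, not necessarily the exact configuration used in our experiments.</p>
        <preformat>
# Sketch of Stage 1 (coarse-grained matching): concatenate the m-BERT sentence
# feature with the ViT image feature and score the pair with binary cross-entropy.
# Checkpoints and the classifier head below are assumptions for illustration.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer, ViTImageProcessor, ViTModel

text_tok = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
text_enc = BertModel.from_pretrained("bert-base-multilingual-cased")
img_proc = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
img_enc = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

classifier = nn.Sequential(nn.Linear(768 + 768, 256), nn.ReLU(), nn.Linear(256, 1))
bce = nn.BCEWithLogitsLoss()

def coarse_grained_loss(sentences, images, labels):
    """labels[i] = 1 if sentences[i] and images[i] share an olfactory object, else 0."""
    t = text_tok(sentences, padding=True, truncation=True, return_tensors="pt")
    text_feat = text_enc(**t).last_hidden_state[:, 0]    # [CLS] sentence feature
    v = img_proc(images, return_tensors="pt")
    img_feat = img_enc(**v).last_hidden_state[:, 0]       # ViT [CLS] token feature
    logits = classifier(torch.cat([text_feat, img_feat], dim=-1)).squeeze(-1)
    return bce(logits, labels.float())
</preformat>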
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Stage 2 for Fine-Grained</title>
        <p>The second stage is used for fine-grained classification, that is, detecting which objects cause
similar olfactory experiences in images and texts. We counted the subtask 2 labels of all samples
and categorized the objects observed across all images into 77 classes for marking bounding boxes
for training YOLOv5. A hyperparameter k was introduced to alleviate the long-tail distribution
problem in these classes: an image is selected when it contains objects whose total occurrence
frequency in all images is less than k. The performance of YOLOv5 trained on training sets
built with different k values is shown in Section 3. The text feature extractor and image
feature extractor trained in the first stage are used as the initial state of the second stage
to extract noun features and image features in each bounding box, respectively. A pair of text
features and image features of the same category is regarded as a positive sample; otherwise
it is regarded as a negative sample. These feature pairs are then fed into the classifier to
calculate the binary cross-entropy loss ℒ of the fine-grained stage.</p>
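        <p>The pair construction of this fine-grained stage can be sketched as follows; the exhaustive pairing of every noun feature with every detected region feature and the simple feature concatenation before the classifier are an illustrative reading of the procedure described above.</p>
        <preformat>
# Sketch of Stage 2 pair construction: every (noun, detected region) pair is
# labelled positive when both carry the same of the 77 object classes, otherwise
# negative, and the classifier is trained with binary cross-entropy.
import itertools
import torch
import torch.nn.functional as F

def build_pairs(noun_feats, noun_cls, region_feats, region_cls):
    """noun_feats: [N, d], region_feats: [M, d]; *_cls: lists of class ids."""
    feats, labels = [], []
    for i, j in itertools.product(range(len(noun_cls)), range(len(region_cls))):
        feats.append(torch.cat([noun_feats[i], region_feats[j]], dim=-1))
        labels.append(1.0 if noun_cls[i] == region_cls[j] else 0.0)
    return torch.stack(feats), torch.tensor(labels)

def fine_grained_loss(classifier, noun_feats, noun_cls, region_feats, region_cls):
    pair_feats, pair_labels = build_pairs(noun_feats, noun_cls, region_feats, region_cls)
    logits = classifier(pair_feats).squeeze(-1)
    return F.binary_cross_entropy_with_logits(logits, pair_labels)
</preformat>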
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results and Analysis</title>
      <sec id="sec-3-1">
        <title>3.1. Preliminary Experimental Works</title>
        <p>Images in the dataset have several obvious content categories: food, flowers, animals (including
prey and herds), water-related scenes (such as the sea, ports, rivers, whales, and fish), jewelry
decoration, personal portraits, crowd gatherings, etc. Most of these categories have a significant
frequency of specific objects and significant olfactory characteristics, so we converted the
images to grayscale or retained the original color content, and used a fine-tuned ResNet50 as a
classifier and the K-means algorithm for clustering, respectively, but all of these performed poorly.
In Stage 1, the text feature extractor of the model uses a variety of official HuggingFace
versions, and the performance comparison is shown in Table 1.</p>
        <p>In Stage 2, we select training samples through k each time and then randomly select samples
to ensure that the ratio of the training set to the validation set is 8:2 and that there is at least one
sample for each category in the validation set. When k=0, the training set is selected completely at
random. The proportion of selected samples in the development set is 50.92% when k=40, while it is
80.63% when k=90. The impact of different values of the hyperparameter k on the performance of
YOLOv5 in Stage 2 is shown in Figure 2. The larger the k value, the more categories can be ensured
to appear stably in the training set. However, a too large k value will also cause some tail
categories to appear too few times in the validation set, increasing the randomness of the validation
results. The overall F1 score of YOLO on the head categories gradually increases as the k value
increases and roughly converges at k=86, but as the k value increases to 90, the F1 score of the
head classes drops significantly. The overall detection F1 score on the tail categories first
increases and then decreases as the k value increases, reaching the overall optimum at k=87. For
extreme tail categories, a smaller k value results in a lower F1 score, while a slightly larger k
value causes YOLO to no longer detect the extreme tail categories at all. Therefore, we finally
selected k=87 to balance the performance of the head and tail categories. In addition to text data,
we also tried data augmentation on the image data in Stage 2. We used Stable Diffusion to try to
generate some medieval-style meals and portraits, but they all have visible differences from the
real medieval paintings in the development set. Besides, it is difficult to establish effective tail
data augmentation based on affine transformations for paintings with varied styles and cluttered
content. For example, an apple may look like a peach after its color is changed, while a candle is
difficult to detect after rotation because the candles in other samples are vertical and the pipes
usually appear as slanted, smoking strips.</p>
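        <p>A simplified sketch of the k-based sampling described above is given below. The per-image annotation format, the reading of the selection criterion (an image is kept if it contains at least one object class occurring fewer than k times), and the split bookkeeping are assumptions; the constraint that every category appears in the validation set is not enforced in this sketch.</p>
        <preformat>
# Sketch of the k-threshold sampling: images that contain an object class occurring
# fewer than k times across the development set are selected first (k=0 disables
# this), then the remaining images are added at random to reach an 8:2 split.
# The "at least one tail object" criterion is one reading of the description above.
import random
from collections import Counter

def select_training_images(annotations, k, train_ratio=0.8, seed=0):
    """annotations: dict mapping image_id to the list of object classes it contains."""
    freq = Counter(c for classes in annotations.values() for c in classes)
    selected = [img for img, classes in annotations.items()
                if k > 0 and any(k > freq[c] for c in classes)]
    rest = [img for img in annotations if img not in selected]
    random.Random(seed).shuffle(rest)
    n_train = int(train_ratio * len(annotations))
    train = (selected + rest)[:n_train]
    val = [img for img in annotations if img not in train]
    return train, val
</preformat>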
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Comparison Experiments</title>
        <p>
          The comparison of the F1 scores on the three subtasks is shown in Table 2. The overall structure
of Shao et al. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] on subtask 1 is consistent with the proposed method, but it does not use the data augmentation
method for cross-language bridging. On the German and French samples, whose numbers
are small, the proposed method shows obvious improvements compared with [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], which proves that
the text data augmentation we use has a strong ability to bridge the language gap.
        </p>
        <p>
          On subtask 2, the proposed method is far inferior to [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] because, unlike [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], it does not treat different
words representing the same category as the same word, but directly performs binary
classification on all pairs of different words and image regions. Our original intention was
to build a model that can directly map the textual word features and image region features to
the same feature space, thus eliminating the process of manually organizing different words
representing the same category. However, based on the performance comparison, the proposed
method significantly degrades performance compared to [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] because it requires
small feature differences between each image region and each text word of the same category.
Compared with the manually compiled list of approximate olfactory nouns in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], the proposed
model obviously saves labor costs, but given the huge drop in performance,
this method is not yet practical.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>We thank the organizers of MediaEval 2023, especially the organizers of MUSTI.
This work was supported in part by the Joint Project for Innovation and Development of
Shandong Natural Science Foundation (ZR2022LZH012) and the Joint Project for Smart Computing
of Shandong Natural Science Foundation (ZR2020LZH015).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hürriyetoglu</surname>
          </string-name>
          , I. Novalija,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zinnen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Christlein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lisena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Menini</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. van Erp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Troncy</surname>
          </string-name>
          , The MUSTI challenge @
          <article-title>MediaEval 2023 - multimodal understanding of smells in texts and images with zero-shot evaluation</article-title>
          ,
          <source>in: Working Notes Proceedings of the MediaEval 2023 Workshop</source>
          , Amsterdam, the Netherlands and Online, 1-2 February 2024,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hürriyetoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Paccosi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Menini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zinnen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lisena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Akdemir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Troncy</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. van Erp</surname>
          </string-name>
          ,
          <article-title>MUSTI - multimodal understanding of smells in texts and images at mediaeval 2022</article-title>
          , in: S. Hicks,
          <string-name>
            <surname>A. G. S. de Herrera</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Langguth</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Lommatzsch</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Andreadis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Dao</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Martin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Hürriyetoglu</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Thambawita</surname>
            ,
            <given-names>T. S.</given-names>
          </string-name>
          <string-name>
            <surname>Nordmo</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Vuillemot</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          <string-name>
            <surname>Larson</surname>
          </string-name>
          (Eds.),
          <source>Working Notes Proceedings of the MediaEval 2022 Workshop</source>
          , Bergen, Norway and Online, 12-13 January 2023, volume
          <volume>3583</volume>
          <source>of CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2022</year>
          . URL: https://ceur-ws.org/Vol-3583/paper50.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>Akdemir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hürriyetoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Troncy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Paccosi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Menini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zinnen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Christlein</surname>
          </string-name>
          ,
          <article-title>Multimodal and multilingual understanding of smells using vilbert and muniter</article-title>
          , in: S. Hicks,
          <string-name>
            <surname>A. G. S. de Herrera</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Langguth</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Lommatzsch</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Andreadis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Dao</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Martin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Hürriyetoglu</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Thambawita</surname>
            ,
            <given-names>T. S.</given-names>
          </string-name>
          <string-name>
            <surname>Nordmo</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Vuillemot</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          <string-name>
            <surname>Larson</surname>
          </string-name>
          (Eds.),
          <source>Working Notes Proceedings of the MediaEval 2022 Workshop</source>
          , Bergen, Norway and Online, 12-13 January 2023, volume
          <volume>3583</volume>
          <source>of CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2022</year>
          . URL: https://ceur-ws.org/Vol-3583/paper36.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , W. Wan,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Multilingual text-image olfactory object matching based on object detection</article-title>
          , in: S. Hicks,
          <string-name>
            <surname>A. G. S. de Herrera</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Langguth</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Lommatzsch</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Andreadis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Dao</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Martin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Hürriyetoglu</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Thambawita</surname>
            ,
            <given-names>T. S.</given-names>
          </string-name>
          <string-name>
            <surname>Nordmo</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Vuillemot</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          <string-name>
            <surname>Larson</surname>
          </string-name>
          (Eds.),
          <source>Working Notes Proceedings of the MediaEval 2022 Workshop</source>
          , Bergen, Norway and Online, 12-13 January 2023, volume
          <volume>3583</volume>
          <source>of CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2022</year>
          . URL: https://ceur-ws.org/Vol-3583/paper15.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: pre-training of deep bidirectional transformers for language understanding</article-title>
          , CoRR abs/1810.04805 (
          <year>2018</year>
          ). URL: http://arxiv.org/abs/1810.04805. arXiv:1810.04805.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
          , G. Heigold,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          , et al.,
          <article-title>An image is worth 16x16 words: Transformers for image recognition at scale</article-title>
          , arXiv preprint arXiv:2010.11929 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>