A Text-Image Olfactory Matching Method Based on the Distribution of Real-World Data

Yi Shao, Yulong Sun, Wenbo Wan, Jing Li* and Jiande Sun*
Shandong Normal University, China
* Corresponding author.
2021020981@stu.sdnu.edu.cn (Y. Shao); 2022020647@stu.sdnu.edu.cn (Y. Sun); wanwenbo@sdnu.edu.cn (W. Wan); lijingjdsun@hotmail.com (J. Li); jiandesun@hotmail.com (J. Sun)

MediaEval'23: Multimedia Evaluation Workshop, February 1–2, 2024, Amsterdam, The Netherlands and Online

Abstract
The correlation between olfactory information and human memory allows images and texts, through their content alone and without any stimulus to the olfactory cells, to evoke imagined olfactory experiences. Images and texts may therefore carry equally rich olfactory information, but exploiting it is inevitably limited by the distribution characteristics of the image and text data, such as language gaps and long-tail distributions. To this end, this paper proposes an object-detection-based method that maps the similar olfactory information contained in images and texts into the same feature space, bridging the cross-language and cross-modal gaps, and adopts data augmentation and a dedicated sampling strategy to alleviate the language imbalance of the text data and the long-tail distribution of the image objects, respectively.

1. Introduction
In this paper, we address the MUSTI task of MediaEval 2023 [1]. MUSTI is a text-image olfactory understanding challenge that started in 2022 [2], and existing works [3, 4] have already demonstrated the value and feasibility of this task. In the MUSTI task of MediaEval 2023, Subtask 1 is to detect whether the image and the text in each sample of the development set contain objects that evoke the same olfactory experience. Subtask 2 goes further and asks which objects these are. Subtask 3 is to perform the two subtasks above on a zero-shot Slovenian dataset.
The texts in the development set are written in English, French, German, and Italian, but the proportions of these four languages are uneven (en: 795, fr: 300, de: 480, it: 799). The images in the development set, on the other hand, are mostly European medieval paintings whose content covers a large number of object classes, such as fruits, animals, portraits, and jewelry decorations. The resulting long-tail distribution causes the model's detection performance on tail classes to drop sharply. Existing image-text retrieval research rarely involves olfactory information, so we build our model on the characteristics of this dataset combined with traditional object detection algorithms.

2. Approach
2.1. Stage 1 for Coarse-Grained Matching
As shown in Figure 1, the proposed method is divided into Stage 1 for coarse-grained matching and Stage 2 for fine-grained matching.

Figure 1: Overview of the two-stage model.

The first stage performs coarse-grained classification, that is, it detects whether there are similar olfactory objects in the images and texts, using multilingual BERT (m-BERT) [5] and ViT [6] to extract text features and image features, respectively.
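As an illustration only (not the authors' released code), the two feature extractors could be instantiated roughly as follows; the specific HuggingFace checkpoint names are our assumption, guided by Table 1 and [6]:

```python
# Minimal sketch of the Stage 1 feature extractors. The checkpoint names are
# assumptions; the paper only specifies m-BERT [5] and ViT [6] as backbones.
import torch
from PIL import Image
from transformers import AutoTokenizer, AutoModel, AutoImageProcessor, ViTModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
text_encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")
image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
image_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

def encode_text(sentence: str) -> torch.Tensor:
    """Return the [CLS] embedding of a (possibly non-English) sentence or noun."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = text_encoder(**inputs)
    return outputs.last_hidden_state[:, 0]  # shape (1, 768)

def encode_image(image: Image.Image) -> torch.Tensor:
    """Return the [CLS] embedding of an image (or of a cropped bounding box)."""
    inputs = image_processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = image_encoder(**inputs)
    return outputs.last_hidden_state[:, 0]  # shape (1, 768)
```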
For text features, we input complete sentences and nouns (after data augmentation) into m-BERT to obtain a model fine-tuned on sentence features. The purpose of the data augmentation is to alleviate the cross-language gap in the dataset: specifically, we translated the nouns in each group into all four languages and added the superclasses provided with the official data. After these two text feature selection strategies, the text features are concatenated with the image features, and the binary cross-entropy loss of the coarse-grained stage, ℒ_CLS, is calculated.

2.2. Stage 2 for Fine-Grained Matching
The second stage performs fine-grained classification, that is, it detects which objects cause similar olfactory experiences in the images and texts. We counted the Subtask 2 labels of all samples and grouped the objects observed across all images into 77 classes, for which bounding boxes were annotated to train YOLOv5. To alleviate the long-tail distribution among these classes, we introduce a hyperparameter k: an image is selected for the training set if it contains an object whose total occurrence frequency across all images is less than k. The performance of YOLOv5 trained on training sets built with different k values is reported in Section 3.
The text feature extractor and image feature extractor trained in the first stage are used as the initial state of the second stage to extract noun features and image features from each bounding box, respectively. A pair consisting of a text feature and an image feature of the same category is regarded as a positive sample; otherwise, it is regarded as a negative sample. These feature pairs are then fed into the classifier to calculate the binary cross-entropy loss ℒ_DTC.

3. Results and Analysis
3.1. Preliminary Experimental Work
Images in the dataset fall into several obvious content categories: food, flowers, animals (including prey and herds), water-related scenes (the sea, ports, rivers, whales, fish, etc.), jewelry decoration, personal portraits, crowd gatherings, etc. Most of these categories contain frequently occurring specific objects with distinct olfactory characteristics, so we converted the images to grayscale or retained the original colors, and tried both a fine-tuned ResNet50 classifier and K-means clustering on them, but all of these attempts performed poorly.
In Stage 1, the text feature extractor of the model was instantiated with several official HuggingFace checkpoints; the performance comparison is shown in Table 1.

Table 1
Comparison of accuracy (Acc.) of different text feature extractors in the Stage 1 similarity classifier

BERT model                   sentence   nouns
bert-multilingual-cased      0.7579     0.7558
bert-multilingual-cased      0.8110     0.8232
bert-multilingual-uncased    0.8063     0.7979

In Stage 2, we first select training samples via the threshold k and then randomly add samples so that the training set : validation set ratio is 8:2 and every category has at least one sample in the validation set. When k=0, the training set is selected completely at random. The selected samples cover 50.92% of the development set when k=40 and 80.63% when k=90. The impact of different values of the hyperparameter k on the performance of YOLOv5 in Stage 2 is shown in Figure 2.

Figure 2: Performance of YOLOv5 trained on training sets divided with different k values.

The larger the k value, the more categories can be guaranteed to appear stably in the training set.
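As a concrete illustration of this sampling strategy, the following sketch shows one possible implementation of the k-threshold selection described above (function and variable names are ours; the additional constraint that every category appears in the validation set is omitted for brevity):

```python
# Hypothetical sketch of the k-threshold sampling for the Stage 2 training set.
# `annotations` maps each image id to the list of class labels of its bounding boxes.
import random
from collections import Counter
from typing import Dict, List, Tuple

def split_by_k(annotations: Dict[str, List[str]], k: int,
               train_ratio: float = 0.8, seed: int = 0) -> Tuple[List[str], List[str]]:
    rng = random.Random(seed)
    # Total occurrence frequency of every class over all images.
    freq = Counter(label for labels in annotations.values() for label in labels)

    # An image is force-selected for training if it contains an object whose
    # total frequency across all images is below k (i.e., a tail object).
    forced = [img for img, labels in annotations.items()
              if any(freq[label] < k for label in labels)]
    remaining = [img for img in annotations if img not in set(forced)]
    rng.shuffle(remaining)

    # Top up with random images until the training set reaches the 8:2 ratio.
    # With k=0 no image is force-selected, so the split is completely random.
    target = int(train_ratio * len(annotations))
    train = forced + remaining[: max(0, target - len(forced))]
    val = [img for img in annotations if img not in set(train)]
    return train, val
```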
However, too large a k value also causes some tail categories to appear too few times in the validation set, which increases the randomness of the validation results. The overall F1-score of YOLO on the head categories gradually increases with k and roughly converges at k=86, but as k increases to 90, the F1-score of the head classes drops significantly. The overall detection F1-score on the tail categories first increases and then decreases as k increases, reaching its optimum at k=87. For the extreme tail categories, a smaller k value yields a lower F1-score, while a slightly larger k value causes YOLO to no longer detect them at all. Therefore, we finally select k=87 to balance the performance of the head and tail categories.
In addition to the text data, we also tried data augmentation on the image data in Stage 2. We used Stable Diffusion to generate some medieval-style meals and portraits, but they all show visible differences from the real medieval paintings in the development set. Moreover, it is difficult to build effective tail-class augmentation based on affine transformations for paintings with such varied styles and cluttered content. For example, an apple may look like a peach after its color is changed, and a rotated candle becomes hard to detect because the candles in other samples are vertical, while pipes usually appear as slanted, smoking shapes.

3.2. Comparison Experiments
The F1-scores on the three subtasks are compared in Table 2.

Table 2
Comparison of F1-score of images and texts in different languages on Stage 2

Subtask     Model                    en       de       fr       it       micro avg.
Subtask 1   dummy baseline [3]       0.4285   0.4289   0.3333   0.4273   0.4075
            mUniter finetuned [3]    0.4473   0.4644   0.3605   0.5020   0.4473
            mUniter-MUSTI [3]        0.6965   0.4579   0.5022   0.6535   0.6011
            mUniter-SNLI-MUSTI [3]   0.7482   0.5014   0.5053   0.6850   0.6176
            Yi et al. [4]            0.7867   0.4568   0.3743   0.7501   0.6033
            ours                     0.7829   0.4845   0.5133   0.7074   0.6198
Subtask 2   Yi et al. [4]            0.7427   0.7276   0.4599   0.7487   0.6708
            ours                     -        -        -        -        0.0572
Subtask 3   ours (Subtask 1)         -        -        -        -        0.3845
            ours (Subtask 2)         -        -        -        -        0.0258

The overall structure of Yi et al. [4] on Subtask 1 is consistent with the proposed method, but [4] does not use data augmentation for cross-language bridging. On the German and French samples, which are few in number, the proposed method clearly improves over [4], which shows that the text data augmentation we use is effective at bridging the language gap.
On Subtask 2, the proposed method is far inferior to [4] because, unlike [4], it does not treat different words representing the same category as the same word, but instead performs binary classification directly over all distinct words and image regions. Our original intention was to build a model that maps textual word features and image-region features directly into the same feature space, thereby eliminating the manual work of grouping different words that represent the same category. However, the performance comparison shows that the proposed method degrades performance significantly compared with [4], because it requires the feature difference between every image region and every text word of the same category to be small. Compared with the manually compiled list of approximate olfactory nouns in [4], the proposed model clearly saves labor, but given the large drop in performance, this approach is not yet viable.
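For reference, a minimal sketch of the pairwise binary classification used in Stage 2 (as described in Section 2.2 and discussed above) might look as follows; the classifier architecture, dimensions, and names are our assumptions, not the authors' implementation:

```python
# Hypothetical sketch of the Stage 2 pair classifier: a noun feature and a
# bounding-box image feature are concatenated and classified as matching
# (same olfactory category) or not, trained with binary cross-entropy (L_DTC).
import torch
import torch.nn as nn

class PairClassifier(nn.Module):
    def __init__(self, text_dim: int = 768, image_dim: int = 768, hidden_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(text_dim + image_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, text_feat: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
        return self.mlp(torch.cat([text_feat, image_feat], dim=-1)).squeeze(-1)

classifier = PairClassifier()
criterion = nn.BCEWithLogitsLoss()

# text_feats / region_feats stand in for features from the Stage 1 encoders
# (noun features and features of YOLOv5 bounding-box crops); labels are 1 for
# same-category pairs and 0 otherwise.
text_feats = torch.randn(16, 768)
region_feats = torch.randn(16, 768)
labels = torch.randint(0, 2, (16,)).float()

logits = classifier(text_feats, region_feats)
loss_dtc = criterion(logits, labels)  # corresponds to the BCE loss L_DTC in Section 2.2
loss_dtc.backward()
```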
Acknowledgments
We thank the organizers of MediaEval 2023, especially the organizers of the MUSTI task. This work was supported in part by the Joint Project for Innovation and Development of Shandong Natural Science Foundation (ZR2022LZH012) and the Joint Project for Smart Computing of Shandong Natural Science Foundation (ZR2020LZH015).

References
[1] A. Hürriyetoglu, I. Novalija, M. Zinnen, V. Christlein, P. Lisena, S. Menini, M. van Erp, R. Troncy, The MUSTI challenge @ MediaEval 2023 - multimodal understanding of smells in texts and images with zero-shot evaluation, in: Working Notes Proceedings of the MediaEval 2023 Workshop, Amsterdam, The Netherlands and Online, 1-2 February 2024, 2023.
[2] A. Hürriyetoglu, T. Paccosi, S. Menini, M. Zinnen, P. Lisena, K. Akdemir, R. Troncy, M. van Erp, MUSTI - multimodal understanding of smells in texts and images at MediaEval 2022, in: S. Hicks, A. G. S. de Herrera, J. Langguth, A. Lommatzsch, S. Andreadis, M. Dao, P. Martin, A. Hürriyetoglu, V. Thambawita, T. S. Nordmo, R. Vuillemot, M. A. Larson (Eds.), Working Notes Proceedings of the MediaEval 2022 Workshop, Bergen, Norway and Online, 12-13 January 2023, volume 3583 of CEUR Workshop Proceedings, CEUR-WS.org, 2022. URL: https://ceur-ws.org/Vol-3583/paper50.pdf.
[3] K. Akdemir, A. Hürriyetoglu, R. Troncy, T. Paccosi, S. Menini, M. Zinnen, V. Christlein, Multimodal and multilingual understanding of smells using vilbert and muniter, in: S. Hicks, A. G. S. de Herrera, J. Langguth, A. Lommatzsch, S. Andreadis, M. Dao, P. Martin, A. Hürriyetoglu, V. Thambawita, T. S. Nordmo, R. Vuillemot, M. A. Larson (Eds.), Working Notes Proceedings of the MediaEval 2022 Workshop, Bergen, Norway and Online, 12-13 January 2023, volume 3583 of CEUR Workshop Proceedings, CEUR-WS.org, 2022. URL: https://ceur-ws.org/Vol-3583/paper36.pdf.
[4] Y. Shao, Y. Zhang, W. Wan, J. Li, J. Sun, Multilingual text-image olfactory object matching based on object detection, in: S. Hicks, A. G. S. de Herrera, J. Langguth, A. Lommatzsch, S. Andreadis, M. Dao, P. Martin, A. Hürriyetoglu, V. Thambawita, T. S. Nordmo, R. Vuillemot, M. A. Larson (Eds.), Working Notes Proceedings of the MediaEval 2022 Workshop, Bergen, Norway and Online, 12-13 January 2023, volume 3583 of CEUR Workshop Proceedings, CEUR-WS.org, 2022. URL: https://ceur-ws.org/Vol-3583/paper15.pdf.
[5] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, CoRR abs/1810.04805 (2018). URL: http://arxiv.org/abs/1810.04805. arXiv:1810.04805.
[6] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020).