<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multimodal and Multilingual Understanding of Smells using VilBERT and mUNITER</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kiymet Akdemir</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ali Hürriyetoğlu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raphaël Troncy</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Teresa Paccosi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Menini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mathias Zinnen</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vincent Christlein</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fondazione Bruno Kessler</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>KNAW Humanities Cluster DHLab</institution>
          ,
          <country country="NL">the Netherlands</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Pattern Recognition Lab, Friedrich-Alexander-Universität</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>EURECOM</institution>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We evaluate state-of-the-art multimodal models to detect common olfactory references in multilingual text and images in the scope of the Multimodal Understanding of Smells in Texts and Images (MUSTI) task at MediaEval 2022. The goal of MUSTI Subtask 1 is to classify pairs of text and image as to whether they refer to the same smell source or not. We approach this task as a Visual Entailment problem and evaluate the performance of the English model ViLBERT and the multilingual model mUNITER on MUSTI Subtask 1. While the base ViLBERT and mUNITER models perform worse than a dummy baseline, fine-tuning these models on the training data improves performance significantly in almost all scenarios. We find that fine-tuning mUNITER with SNLI-VE and MUSTI training data performs better than the other configurations we implemented. Our experiments demonstrate that the task presents some challenges, but it is by no means impossible. Our code is available at https://github.com/Odeuropa/musti-eval-baselines to encourage reproducibility.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Olfactory information is considered difficult to identify in texts or images. This is mainly
due to the relatively rare linguistic evidence documented about its occurrence in texts and its
implicit representation in images. Consequently, automating olfactory information extraction
in text or images has been attracting considerably less attention [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ]. Although novel
approaches for multimodal analysis of texts and images have recently been developed, to the
best of our knowledge, olfactory information has not been the focus of any academic work
in a multimodal setting.
      </p>
      <p>
        The Multimodal Understanding of Smells in Texts and Images (MUSTI) task, organized
in the scope of MediaEval 2022, fills this gap [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The text-image pairs provided by the MUSTI
organizers are multilingual – English, German, French, and Italian – and gathered from historical
data spanning a period between the 17th and 20th centuries.
      </p>
      <p>
        We evaluate the performance of two state-of-the-art models, ViLBERT [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and mUNITER [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
on the MUSTI challenge test data and present the performances of base and fine-tuned versions
of these models.
      </p>
      <p>We detail our method in Section 2. Next, we present the results of the models along various
configurations in Section 3. Finally, a summary of our evaluation and an outlook concludes this
paper in Section 4.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Method</title>
      <p>
        We propose to evaluate the performance of state-of-the-art visio-linguistic models on the MUSTI
data. We use the VOLTA framework (Visiolinguistic Transformer Architectures) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] which
unifies several BERT-based Vision-Language (V&amp;L) Models built on top of ViLBERT-MT (Vision
&amp; Language BERT Multi-Task) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The VOLTA repository contains models pre-trained in their original setups, as described in
their papers, as well as several models pre-trained in a controlled setup. We use
ViLBERT [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], pre-trained in its original setup on the English data, and the multilingual model
mUNITER [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], pre-trained in the controlled setup on all languages (English, Italian, French, and
German) provided in MUSTI.
      </p>
      <p>
        These V&amp;L Models are pre-trained on Conceptual Captions [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] to perform several V&amp;L
tasks such as Visual Question Answering, Visual Entailment, Grounding Referring Expressions,
Caption-Based Image Retrieval, etc. It is a standard approach to fine-tune these models on a
specific task [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        In the Visual Entailment task, the goal is to determine, given an image as a premise and text
as a hypothesis, whether the premise implies the hypothesis [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Models output one of
three labels: entailment, neutral, or contradiction. For MUSTI Subtask 1, to evaluate whether an
image and a text pair refer to the same smell object, we evaluate them on the Visual Entailment
task. Since Subtask 1 is a binary classification problem, we then consider the output as YES if
the model predicts entailment and as NO if it predicts neutral or contradiction.
      </p>
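The label-collapsing step described above can be sketched as follows; the index order of the three entailment classes is an assumption and depends on how the classifier head was set up during fine-tuning.

```python
# Map the three-way Visual Entailment prediction to MUSTI's binary label.
# The index order (entailment=0, neutral=1, contradiction=2) is an
# assumption; the actual order depends on the fine-tuning setup.
ENTAILMENT, NEUTRAL, CONTRADICTION = 0, 1, 2

def entailment_to_binary(logits):
    """Return YES only when entailment is the argmax class."""
    predicted = max(range(len(logits)), key=lambda i: logits[i])
    return "YES" if predicted == ENTAILMENT else "NO"
```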
      <p>
        To extract features of the images, we use Faster R-CNN [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] with a ResNet-101 [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] backbone
that outputs 36 boxes per image following VOLTA. First, we fine-tune ViLBERT and mUNITER
on the Visual Entailment dataset SNLI-VE [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] for 20 epochs with a learning rate of 2e-5 and
batch size 128. Afterward, we train these fine-tuned models for 10 epochs on the MUSTI
training data with a learning rate of 2e-5 and batch size 64, holding out 20% as a validation set.
      </p>
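The region selection step can be sketched as follows; this is a minimal stand-in for the actual Faster R-CNN feature extractor, and only the fixed budget of 36 boxes per image comes from our setup.

```python
def select_regions(boxes, scores, k=36):
    """Keep the k highest-scoring region proposals.

    Following VOLTA, 36 boxes per image are fed to the V&L model.
    `boxes` and `scores` are parallel lists as a detector would return them.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [boxes[i] for i in order[:k]]
```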
      <p>We train ViLBERT only on English train data and the multilingual model mUNITER on the
complete MUSTI training data. The parameter sets that yield the best validation score during
training are used for inference. From the pre-trained models, we obtain the following models:
ifne-tuned on SNLI-VE, fine-tuned on MUSTI, and fine-tuned on MUSTI after the SNLI-VE.</p>
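The two-stage schedule can be summarized in a small sketch: the `STAGES` table mirrors the hyperparameters from the text, while `split_train_val` is a hypothetical helper illustrating the 80/20 hold-out on the MUSTI training data.

```python
import random

# Two-stage fine-tuning schedule (hyperparameters from the text).
STAGES = [
    {"dataset": "SNLI-VE", "epochs": 20, "learning_rate": 2e-5, "batch_size": 128},
    {"dataset": "MUSTI",   "epochs": 10, "learning_rate": 2e-5, "batch_size": 64},
]

def split_train_val(examples, val_fraction=0.2, seed=0):
    """Hold out a fraction of the training data as a validation set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]
```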
      <p>
        We observe that reproducing our experiments may lead to different results, since the task
performance of BERT-based models after fine-tuning heavily depends on the weight-initialization
seed: the minimum and maximum scores can differ by 1 or more points across 10
fine-tuning runs, as reported by Bugliarello et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Furthermore, we fine-tune models using both
SNLI-VE and MUSTI, which may increase the variation in the scores.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <p>
        We present in Table 1 the F1-macro scores when using the ViLBERT model on English data
only. In Table 2, we present the results of the dummy baseline compared to various mUNITER
models, as well as the method proposed by Shao et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. The pre-trained model mUNITER
differs from the dummy baseline by at most 1 point and rarely outputs YES. In
particular, it does not predict YES for any DE data, and it yields only 1, 2, and 5 YES outputs
for EN, FR, and IT, respectively. Fine-tuning the models on SNLI-VE alone does not improve scores
significantly. This result is not surprising, since MUSTI Subtask 1 is not directly a visual
entailment task: the text does not need to describe the image. It is sufficient for the text and
the image to share the same smell object for the pair to be classified as YES.
      </p>
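For reference, F1-macro is the unweighted mean of the per-class F1 scores over the YES and NO labels. A minimal sketch of the metric:

```python
def f1_macro(y_true, y_pred, labels=("YES", "NO")):
    """Macro-averaged F1: unweighted mean of per-class F1 scores."""
    f1s = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)
```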
      <p>
        On the other hand, fine-tuning the pre-trained mUNITER on MUSTI data increases the
number of YES outputs to 55 for EN, 22 for DE, 85 for FR, and 46 for IT, and the scores improve
remarkably. We achieve the best performance when the models are first fine-tuned on SNLI-VE
and then on the MUSTI training data. Thus, we obtained the highest scores with mUNITER fine-tuned on
both SNLI-VE and MUSTI, and with ViLBERT fine-tuned on SNLI-VE and MUSTI. For the EN data,
ViLBERT-SNLI-MUSTI outperforms the mUNITER models and the method proposed by Shao et al.
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Our best multilingual model mUNITER-SNLI-MUSTI outperforms Shao et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] except
for the EN and IT scores, while the overall performances of the two approaches remain close.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>In this paper, we propose an approach to tackle the MUSTI Subtask 1 challenge, namely detecting
whether an image-text pair refers to the same smell object, framed as a Visual Entailment task. In
particular, we experimented with the multimodal models ViLBERT and mUNITER. We
fine-tune the models on SNLI-VE to improve performance on the visual entailment task, and
we observe that training further on the MUSTI training data boosts performance.
ViLBERT-SNLI-MUSTI achieves the highest F1-macro scores on English data, while mUNITER-SNLI-MUSTI
achieves the best multilingual performance.</p>
      <p>
        As future work, we would like to adapt a CLIP [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] model towards the MUSTI task, replacing
the vision backbone with more performant architectures such as Swin [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and trying different
ways to merge visual and textual features, or using more training data for the fine-tuning
step. Last but not least, adapting the base models to historical text and historical
painting processing has the potential to further enhance performance.
      </p>
      <p>This work has been partially supported by the European Union's Horizon 2020 research and
innovation programme within the Odeuropa project (grant agreement No. 101004469).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Menini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Paccosi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tonelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Van Erp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Leemans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lisena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Troncy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Tullett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hürriyetoğlu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Dijkstra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gordijn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Jürgens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Koopman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ouwerkerk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Steen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Novalija</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Brank</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mladenic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zidar</surname>
          </string-name>
          ,
          <article-title>A multilingual benchmark to capture olfactory situations over time</article-title>
          , in: 3rd Workshop on Computational Approaches to Historical Language Change, Association for Computational Linguistics, Dublin, Ireland,
          <year>2022</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          . URL: https://aclanthology.org/2022.lchange-1.1.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zinnen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Madhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kosti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Christlein</surname>
          </string-name>
          ,
          <article-title>ODOR: The ICPR2022 ODeuropa challenge on olfactory object recognition</article-title>
          ,
          <source>in: 26th International Conference on Pattern Recognition (ICPR)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>4989</fpage>
          -
          <lpage>4994</lpage>
          . doi:10.1109/ICPR56361.2022.9956542.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zinnen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Madhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kosti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Christlein</surname>
          </string-name>
          ,
          <article-title>Odeuropa dataset of smell-related objects</article-title>
          ,
          <year>2022</year>
          . URL: https://doi.org/10.5281/zenodo.6367776.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hürriyetoğlu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Paccosi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Menini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zinnen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lisena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Akdemir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Troncy</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. van Erp</surname>
          </string-name>
          ,
          <source>MUSTI - Multimodal Understanding of Smells in Texts and Images at MediaEval</source>
          <year>2022</year>
          , in: MediaEval Benchmarking Initiative for Multimedia Evaluation,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Batra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks</article-title>
          ,
          <source>in: 33rd International Conference on Neural Information Processing Systems</source>
          , Curran Associates Inc.,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Bugliarello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ponti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Reddy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Collier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Elliott</surname>
          </string-name>
          ,
          <article-title>Visually grounded reasoning across languages and cultures</article-title>
          ,
          <source>in: Workshop on ImageNet: Past, Present, and Future</source>
          ,
          <year>2021</year>
          . URL: https://openreview.net/forum?id=-pKZ0OO-L7l.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>E.</given-names>
            <surname>Bugliarello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cotterell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Okazaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Elliott</surname>
          </string-name>
          ,
          <article-title>Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>9</volume>
          (
          <year>2021</year>
          )
          <fpage>978</fpage>
          -
          <lpage>994</lpage>
          . URL: https://doi.org/10.1162/tacl_a_00408.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Goswami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rohrbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>12-in-1: Multi-task vision and language representation learning</article-title>
          ,
          <source>in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>10434</fpage>
          -
          <lpage>10443</lpage>
          . doi:10.1109/CVPR42600.2020.01045.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Goodman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Soricut</surname>
          </string-name>
          ,
          <article-title>Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning</article-title>
          ,
          <source>in: 56th Annual Meeting of the Association for Computational Linguistics (ACL)</source>
          , Association for Computational Linguistics
          , Melbourne, Australia,
          <year>2018</year>
          , pp.
          <fpage>2556</fpage>
          -
          <lpage>2565</lpage>
          . URL: https://aclanthology.org/P18-1238.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>N.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Doran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kadav</surname>
          </string-name>
          ,
          <article-title>Visual entailment task for visually-grounded language learning</article-title>
          , arXiv preprint arXiv:1811.10582,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Faster R-CNN: Towards real-time object detection with region proposal networks</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>28</volume>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Deep residual learning for image recognition</article-title>
          ,
          <source>in: IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Multilingual Text-Image Olfactory Object Matching Based on Object Detection</article-title>
          ,
          <source>in: MediaEval Benchmarking Initiative for Multimedia Evaluation</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Goh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          , et al.,
          <article-title>Learning transferable visual models from natural language supervision</article-title>
          ,
          <source>in: International Conference on Machine Learning (ICML)</source>
          , PMLR,
          <year>2021</year>
          , pp.
          <fpage>8748</fpage>
          -
          <lpage>8763</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <article-title>Swin transformer: Hierarchical vision transformer using shifted windows</article-title>
          ,
          <source>in: IEEE/CVF International Conference on Computer Vision</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>10012</fpage>
          -
          <lpage>10022</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>