<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The MUSTI challenge @ MediaEval 2023 - Multimodal Understanding of Smells in Texts and Images with Zero-shot Evaluation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ali Hürriyetoğlu</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Inna Novalija</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mathias Zinnen</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vincent Christlein</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pasquale Lisena</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Menini</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marieke van Erp</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raphael Troncy</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>EURECOM</institution>
          ,
          <addr-line>Sophia Antipolis</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Fondazione Bruno Kessler</institution>
          ,
          <addr-line>Trento</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Jožef Stefan Institute</institution>
          ,
          <country country="SI">Slovenia</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>KNAW Humanities Cluster</institution>
          ,
          <addr-line>DHLab</addr-line>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>We ran the MUSTI challenge for the second time, following the MUSTI 2022 edition, and extended the evaluation with a zero-shot scenario. This was needed because the first iteration showed considerable room for improvement, and the zero-shot performance of state-of-the-art methods helps us understand what available models can predict in a new language without any training. We used the same data as MUSTI 2022 for training and evaluation in MUSTI 2023. Additionally, we prepared a second evaluation scenario, which we call zero-shot, in Slovenian; this language was not announced to the participants before the evaluation phase started. MUSTI 2023 attracted many teams, and state-of-the-art multimodal systems perform better than the systems proposed in MUSTI 2022.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The manner in which humans engage with smell is a prime example of intangible cultural
heritage: the ways smells are created, the situations in which they are used, and how they are
appreciated are all highly culturally dependent. By engaging with expressions of smell in texts
and images across multiple genres and multiple languages over a longer period of time, we can
gain more insight into how smells have affected human interactions through time.</p>
      <p>
        While smell is of vital importance in our day-to-day lives, little attention has been paid to it
within the natural language processing and computer vision communities. Although there are some
lexicons focused on smell, the Odeuropa text benchmark dataset is the first multilingual,
cross-domain text dataset focused on smell references [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Similarly, for computer vision, no prior
datasets existed until the ODOR challenge dataset was created by members of this task [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In
the Multimodal Understanding of Smells in Texts and Images (MUSTI) challenge, we bring these
modalities together, inviting the research community to explore parallels and complementarities
in the way smells are described and depicted in different modalities.
      </p>
      <p>The MUSTI challenge at MediaEval 2023 aims to collect information about smell from digital
multilingual text and image collections from the 16th to the 20th century. More precisely,
MUSTI studies how different smells are referenced across modalities using a corpus of historical
multilingual texts and images. For example, what smell references can be identified in a text
and what smell sources and/or olfactory gestures can be recognized in an image?</p>
      <p>
        This paper describes the second edition of MUSTI. The first edition in 2022 showed that
achieving a good baseline for the task is feasible: one participant submission validated the
task by obtaining reasonable performance [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. However, there remains significant room for improvement in classification
performance, and the qualitative questions the task raises have not yet been addressed
thoroughly. MUSTI 2023 therefore extends the 2022 protocol by adding
a zero-shot evaluation setting.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Motivation and Background</title>
      <p>To fully make sense of digital (heritage) collections, it is necessary to go beyond an
ocularcentric approach and to engage with their olfactory dimension as well, as smells offer a powerful
and direct entry to our emotions and memories. With the MUSTI task, we aim to accelerate
the understanding of olfactory references in English, Dutch, French, German, Italian, and
Slovene texts and images, as well as the connections between these modalities. As recent and
ongoing exhibitions at the Mauritshuis in The Hague, Netherlands, Museum Ulm in Ulm, Germany,
and the Prado Museum in Madrid, Spain demonstrate, museums and galleries are keen to
enrich museum visits with olfactory components, either for a more immersive experience
or to create a more inclusive experience for differently abled museum visitors such as those
with a visual impairment. Reinterpreting historical scents is attracting attention from various
research disciplines (Huber et al., 2022) and leading to interesting collaborations with perfume
makers: for example, the Scent of the Golden Age candle was developed from a recipe by
Constantijn Huygens in a collaboration between historians and a perfume maker. To ensure that
such enrichments are grounded in historically correct contexts, language and computer vision
technologies can help to find olfactory-relevant examples in digitized historical collections and
related sources.</p>
      <p>With this task, we aim to investigate: i) What does it mean for a text and an image to be
related in terms of smell? ii) Do different text and image genres reference smell differently?
iii) Do different languages reference smell differently? iv) How do references to smell in texts
and images change over time? v) How do relationships between smell references in texts and
images change over time?</p>
    </sec>
    <sec id="sec-3">
      <title>3. Task description</title>
      <p>Smell is an underrepresented dimension of many multimedia analysis and representation tasks.
MUSTI aims to further the understanding of textual descriptions and visual depictions of smells
and smelling in historical texts and images. In this shared task, participants are provided with
multilingual texts (English, Dutch, German, French, Italian, and Slovene) and images, from the
16th to the 20th century, that pertain to smell in different ways. The images and the texts have
been selected because they contain depictions (images) and descriptions (texts) of objects that are
known to reference smell. The goal of the task is to detect depictions (objects such as flowers or
animals in an image) and descriptions (text passages) of objects that are known to evoke smells,
and to connect these smell references across the two modalities.
We formulate the challenge as the following subtasks, which can be tackled independently of
each other:</p>
      <p>Subtask 1: Task participants are invited to develop language and image recognition
technologies to predict whether a text passage and an image contain references to the same smell
source or not. This task can therefore be cast as a binary classification problem.</p>
      <p>Subtask 2: [Optional] The participants are also asked to identify what is (are) the common
smell source(s) between the text passages and the images. The detection of the smell source
includes detecting the object or place that has a specific smell, or that produces an odour (e. g.
plant, animal, perfume, human). In other words, the smell source is the entity or phenomenon
that a perceiver experiences with his or her senses. This sub-task can therefore be cast as a
multi-label classification problem.</p>
      <p>Subtask 3: [Optional] For this subtask we include a new evaluation setting, with test data
consisting of image and text pairs in languages that are not covered by the training data.
The training data is available in English, French, German, and Italian, and the test data covers
these four languages plus two additional languages, Dutch and Slovene. We refer to
this subtask as the zero-shot evaluation setting.</p>
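      <p>To make the binary formulation of Subtask 1 concrete, the sketch below scores a single image-text pair with an off-the-shelf CLIP model and thresholds the similarity. It is only an illustration of the problem framing under stated assumptions, not the official MUSTI baseline: the model name, the example inputs, and the 0.25 decision threshold are placeholders, and an English-only CLIP would have to be replaced by a multilingual variant to cover most of the MUSTI languages.</p>
      <preformat>
# Hedged sketch: frame Subtask 1 as binary classification via CLIP similarity.
# Model choice, example inputs, and the 0.25 threshold are illustrative assumptions.
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image_url = "https://example.org/still_life.jpg"  # placeholder image URL
passage = "A bouquet of white lilies filled the room with a sweet fragrance."

image = Image.open(requests.get(image_url, stream=True, timeout=30).raw)
inputs = processor(text=[passage], images=image, return_tensors="pt",
                   padding=True, truncation=True)

with torch.no_grad():
    outputs = model(**inputs)

# Cosine similarity between the projected text and image embeddings.
similarity = torch.nn.functional.cosine_similarity(
    outputs.text_embeds, outputs.image_embeds).item()
same_smell_source = similarity > 0.25  # illustrative decision threshold
print(similarity, same_smell_source)
      </preformat>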
    </sec>
    <sec id="sec-4">
      <title>4. Target groups and Recruiting participants</title>
      <p>Interest in sensory mining (e. g. the 1st International Workshop on Multisensory
Data and Knowledge (MDK) @ LDK 2021 and the 2nd International Workshop on Multisensory
Data and Knowledge (MDK) @ TheWebConf 2023) and in multimodal information processing
(e. g. the 1st International Workshop on Multimodal Understanding for the Web and Social Media
(MUWS), co-located with The WebConf (WWW) 2022) is growing across different research disciplines.
Although participation was limited in MUSTI 2022, we consider MUSTI 2023 to be an opportunity to get
in early and establish a leading position on this problem. Community outreach
started in 2022 and continued with the execution of a communication plan to enhance the likelihood of
reaching a broad community that could propose solutions to the problem posed in MUSTI 2023.
The computer vision ODOR challenge that we organised as part of ICPR 2022 demonstrates
the research community’s interest in taking on the previously unaddressed topic of smell.
As the task proposers are members of the language technology, computer vision, cultural
heritage, digital humanities, and semantic web communities, they will publicize the task in
their communities via the appropriate mailing lists, social media channels such as Twitter/X
and Mastodon, and via upcoming presentations at the Language Resources and Evaluation
Conference, the Digital Humanities/Artificial Intelligence Seminar, the European Semantic
Web Conference, DHBenelux, The Web Conference, and the Digital Humanities Conference.
Furthermore, the Odeuropa Network (consisting of &gt;150 members), the project mailing list, and
other communication channels have a wide reach. Finally, we collected a list of scholars
and research groups working at the intersection of vision and language processing during the first
edition of MUSTI in 2022. We will expand this list and invite these people to participate in
MUSTI 2023. The MUSTI task also provides an excellent use case for students to hone their
multimodal and creative problem-solving skills. We will therefore also advertise the challenge
at relevant outlets such as the International Semantic Web Summer School and the EURECOM
Machine Learning and Intelligent System (MALIS) course.</p>
      <p>By splitting the task into two stages (first binary classification, then multi-label
classification), we aim to reduce the barrier to participation. Furthermore, the team will make available
baseline smell reference recognition software for texts and images that the participants can
build on.</p>
      <p>Most researchers already have very busy agendas; we therefore aim to make the task attractive to
interested parties by providing tools that make it easier to get started. Furthermore, we will actively
target students and early-career researchers as well as industry to cast a wide net. The potential
application domains of the task help here.</p>
      <p>
        The Odeuropa project has created smell reference benchmark datasets for texts and images
that will be utilised [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ].
      </p>
    </sec>
    <sec id="sec-5">
      <title>5. Data</title>
      <p>The MUSTI 2023 dataset consists of copyright-free texts and partly copyrighted images that can
be downloaded by the participants using the URLs we provide. We offer texts
in English, Dutch, French, German, Italian, and Slovene (zero-shot scenario) that participants
are to match to the images. The texts are selected from open repositories such as Project
Gutenberg, Europeana, the Royal Society Corpus, Deutsches Textarchiv, Gallica, Wikisource, and
Liber Liber. The images are selected from different archives such as RKD, Bildindex der Kunst
und Architektur, Museum Boijmans, the Ashmolean Museum Oxford, and Plateforme Ouverte du
Patrimoine. The images are annotated with 169 categories of smell objects and gestures such as
flowers, food, animals, sniffing, and holding the nose. The object categories are organised in a
two-level taxonomy. The Odeuropa text and image benchmark datasets are available as training
data to the participants. The image dataset consists of 4,696 images with 36,663 associated
object annotations, 600 gesture annotations, and image-level metadata. We also provide the
output of a text processing system we have developed to identify text snippets that contain
smell references. The systems of the participants are evaluated on a held-out dataset of roughly
1,200 images with associated texts in the four languages.</p>
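      <p>Since participants fetch the images themselves from the URLs we distribute, a minimal download script along the following lines may be helpful; the file name and column names used here are assumptions made for illustration, not the actual distribution format.</p>
      <preformat>
# Hedged sketch: download the challenge images from the distributed URL list.
# "musti_image_urls.csv" and its columns ("image_id", "image_url") are assumptions.
import csv
from pathlib import Path

import requests

out_dir = Path("musti_images")
out_dir.mkdir(exist_ok=True)

with open("musti_image_urls.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        target = out_dir / f"{row['image_id']}.jpg"
        if target.exists():
            continue  # skip images that were already downloaded
        response = requests.get(row["image_url"], timeout=30)
        response.raise_for_status()
        target.write_bytes(response.content)
      </preformat>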
      <p>Figure 1 provides an example of mapping images to Slovenian text (text translation: "The
stem is round and smooth, and the leaves are lanceolate and bright green. Lily’s flowers are
large, pure white, and smell very nice. Each flower has six petals, which are curved back at the
top. Lily means purity and innocence."). The Slovenian example presents a description of the
lily flower from the journal "Teacher’s Mate" published in 1862.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Evaluation</title>
      <p>Task runs are evaluated against a gold standard consisting of image-text pairs. For the evaluation,
we use multiple statistics, as each provides a slightly different perspective on the results. The
code and models of the baselines are available at . The subtasks are evaluated using the following
metrics:</p>
      <p>Subtask 1: Predicting whether an image and a text passage evoke the same smell source or
not. This subtask is evaluated using precision, recall, and F1-score. As multiple text passages in
different languages can be linked to the same image, we employ multiple linking scorers such
as CEAF and BLANC to measure the performance across different smell reference chains.</p>
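      <p>As a concrete illustration of the Subtask 1 metrics, the sketch below computes precision, recall, and F1-score for a run; the two-column tab-separated layout (pair identifier, binary label) is an assumption made for illustration rather than the official submission format.</p>
      <preformat>
# Hedged sketch: precision/recall/F1 for a binary Subtask 1 run.
# The TSV layout (pair_id, label per line) is an illustrative assumption.
import csv

from sklearn.metrics import precision_recall_fscore_support


def read_labels(path):
    # Returns {pair_id: 0 or 1} for one tab-separated file.
    with open(path, newline="", encoding="utf-8") as f:
        return {row[0]: int(row[1]) for row in csv.reader(f, delimiter="\t")}


gold = read_labels("gold_pairs.tsv")
pred = read_labels("predicted_pairs.tsv")
pair_ids = sorted(gold)

p, r, f1, _ = precision_recall_fscore_support(
    [gold[i] for i in pair_ids],
    [pred.get(i, 0) for i in pair_ids],  # missing predictions count as negative
    average="binary",
)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
      </preformat>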
      <p>Subtask 2: Identifying the common smell source(s) between the text passages and the images.
For this subtask, precision, recall and F1-score are employed, as well as more fine-grained
evaluation methods such as RUFES, which can accommodate multi-level taxonomies.</p>
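      <p>For the multi-label formulation of Subtask 2, micro-averaged precision, recall, and F1-score can be computed over binarised smell-source sets, as in the sketch below; the label strings and in-memory dictionaries are illustrative assumptions standing in for the gold and predicted run files.</p>
      <preformat>
# Hedged sketch: micro-averaged precision/recall/F1 for multi-label Subtask 2.
# The smell-source labels and the in-memory dictionaries are illustrative assumptions.
from sklearn.metrics import precision_recall_fscore_support
from sklearn.preprocessing import MultiLabelBinarizer

gold = {"pair-1": {"flower", "perfume"}, "pair-2": {"animal"}}
pred = {"pair-1": {"flower"}, "pair-2": {"animal", "food"}}

pair_ids = sorted(gold)
mlb = MultiLabelBinarizer()
mlb.fit([gold[i] | pred.get(i, set()) for i in pair_ids])  # label space = gold + predicted
y_true = mlb.transform([gold[i] for i in pair_ids])
y_pred = mlb.transform([pred.get(i, set()) for i in pair_ids])

p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="micro", zero_division=0)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
      </preformat>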
      <p>Subtask 3: Zero-shot evaluation setting. The evaluation for this subtask is the same as for
subtasks 1 and 2. The only difference is that no training data was provided for this subtask.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Related Work</title>
      <p>
        To the best of our knowledge, the task of predicting whether an image and a text evoke the
same smell has not been tackled prior to the previous MUSTI challenge [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. However, some
closely related tasks concerning text-image alignment are established in the literature: in visual question
answering (VQA), the aim is to develop systems capable of reasoning about visual information
in order to answer textual questions posed to the systems [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Based on existing datasets like
COCO [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] or Visual Genome [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], various datasets and benchmarks have been proposed since
the mid-2010s to train and evaluate VQA algorithms [
        <xref ref-type="bibr" rid="ref10 ref11 ref8 ref9">8, 9, 10, 11</xref>
        ].
      </p>
      <p>
        Another closely related strand of research is vision-language pretraining (VLP) where
multimodal language and vision models are pre-trained on large amounts of image-caption pairs to
learn an embedding space shared between visual and textual embeddings. Models pre-trained
in this manner exhibit strong generalization capabilities when fine-tuned and applied to their
respective downstream tasks. The most influential VLP algorithm is CLIP [], with numerous
applications such as multimodal object detection [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ], image retrieval, artwork classification [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ],
or captioning [
        <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
        ].
      </p>
      <p>
        Even closer to the MUSTI objective is the task of visual entailment (VE), introduced by Xie et
al. [
        <xref ref-type="bibr" rid="ref17 ref18">17, 18</xref>
        ] together with their SNLI-VE dataset, which provides the default benchmark for the
task. Given an image-sentence pair, the aim of VE is to predict whether the image semantically
entails the text. VE algorithms are thus required to develop a semantic understanding of
both images and texts and relate them to each other. Recent algorithms like OFA [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] or PromptTuning [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] achieve accuracies of over 90% on the SNLI-VE benchmark, suggesting that
a more difficult benchmark might be beneficial. Given that in MUSTI, logical entailment is
replaced with smell entailment, the MUSTI objective could be framed as olfactory entailment as
opposed to VE.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Menini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Paccosi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tonelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>van Erp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Leemans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lisena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Troncy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Tullett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hürriyetoğlu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Dijkstra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gordijn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Jürgens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Koopman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ouwerkerk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Steen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Novalija</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Brank</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mladenic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zidar</surname>
          </string-name>
          ,
          <article-title>A multilingual benchmark to capture olfactory situations over time</article-title>
          , in: N.
          <string-name>
            <surname>Tahmasebi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Montariol</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Kutuzov</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Hengchen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Dubossarsky</surname>
          </string-name>
          , L. Borin (Eds.),
          <source>Proceedings of the 3rd Workshop on Computational Approaches</source>
          to Historical Language Change, Association for Computational Linguistics, Dublin, Ireland,
          <year>2022</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          . URL: https://aclanthology.org/2022.lchange-1.1. doi:10.18653/v1/2022.lchange-1.1.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zinnen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Madhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kosti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Christlein</surname>
          </string-name>
          ,
          <article-title>Odor: The icpr2022 odeuropa challenge on olfactory object recognition</article-title>
          ,
          <source>in: 2022 26th International Conference on Pattern Recognition (ICPR)</source>
          , IEEE,
          <year>2022</year>
          , pp.
          <fpage>4989</fpage>
          -
          <lpage>4994</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hürriyetoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Paccosi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Menini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zinnen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lisena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Akdemir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Troncy</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. van Erp</surname>
          </string-name>
          ,
          <article-title>MUSTI - multimodal understanding of smells in texts and images at mediaeval 2022</article-title>
          , in: S. Hicks,
          <string-name>
            <surname>A. G. S. de Herrera</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Langguth</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Lommatzsch</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Andreadis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Dao</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Martin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Hürriyetoglu</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Thambawita</surname>
            ,
            <given-names>T. S.</given-names>
          </string-name>
          <string-name>
            <surname>Nordmo</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Vuillemot</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          <string-name>
            <surname>Larson</surname>
          </string-name>
          (Eds.),
          <source>Working Notes Proceedings of the MediaEval 2022 Workshop</source>
          , Bergen, Norway and Online, 12-13 January 2023, volume
          <volume>3583</volume>
          of
          <source>CEUR Workshop Proceedings</source>
          , CEUR-WS.org,
          <year>2022</year>
          . URL: https://ceur-ws.org/Vol-3583/paper50.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>K.</given-names>
            <surname>Akdemir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hürriyetoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Troncy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Paccosi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Menini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zinnen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Christlein</surname>
          </string-name>
          ,
          <article-title>Multimodal and multilingual understanding of smells using vilbert and muniter</article-title>
          , in: S. Hicks,
          <string-name>
            <surname>A. G. S. de Herrera</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Langguth</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Lommatzsch</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Andreadis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Dao</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Martin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Hürriyetoglu</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Thambawita</surname>
            ,
            <given-names>T. S.</given-names>
          </string-name>
          <string-name>
            <surname>Nordmo</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Vuillemot</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          <string-name>
            <surname>Larson</surname>
          </string-name>
          (Eds.),
          <source>Working Notes Proceedings of the MediaEval 2022 Workshop</source>
          , Bergen, Norway and Online, 12-13 January 2023, volume
          <volume>3583</volume>
          of
          <source>CEUR Workshop Proceedings</source>
          , CEUR-WS.org,
          <year>2022</year>
          . URL: https://ceur-ws.org/Vol-3583/paper36.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Teney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dick</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. Van Den Hengel</surname>
          </string-name>
          ,
          <article-title>Visual question answering: A survey of methods and datasets</article-title>
          ,
          <source>Computer Vision and Image Understanding</source>
          <volume>163</volume>
          (
          <year>2017</year>
          )
          <fpage>21</fpage>
          -
          <lpage>40</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Maire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Belongie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hays</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Perona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ramanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dollár</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Zitnick</surname>
          </string-name>
          ,
          <article-title>Microsoft coco: Common objects in context</article-title>
          , in: Computer Vision-ECCV
          <year>2014</year>
          : 13th European Conference, Zurich, Switzerland, September 6-
          <issue>12</issue>
          ,
          <year>2014</year>
          , Proceedings, Part V 13, Springer,
          <year>2014</year>
          , pp.
          <fpage>740</fpage>
          -
          <lpage>755</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Krishna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Groth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kravitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kalantidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Shamma</surname>
          </string-name>
          , et al.,
          <article-title>Visual genome: Connecting language and vision using crowdsourced dense image annotations</article-title>
          ,
          <source>International journal of computer vision 123</source>
          (
          <year>2017</year>
          )
          <fpage>32</fpage>
          -
          <lpage>73</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Antol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Batra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Zitnick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Parikh</surname>
          </string-name>
          , Vqa:
          <article-title>Visual question answering</article-title>
          ,
          <source>in: Proceedings of the IEEE international conference on computer vision</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>2425</fpage>
          -
          <lpage>2433</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Khot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Summers-Stay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Batra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <article-title>Making the v in vqa matter: Elevating the role of image understanding in visual question answering</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>6904</fpage>
          -
          <lpage>6913</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Groth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fei-Fei</surname>
          </string-name>
          ,
          <article-title>Visual7w: Grounded question answering in images</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>4995</fpage>
          -
          <lpage>5004</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hariharan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. Van Der</given-names>
            <surname>Maaten</surname>
          </string-name>
          , L.
          <string-name>
            <surname>Fei-Fei</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Lawrence Zitnick</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Girshick</surname>
          </string-name>
          ,
          <article-title>Clevr: A diagnostic dataset for compositional language and elementary visual reasoning</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>2901</fpage>
          -
          <lpage>2910</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>L. H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-N.</given-names>
            <surname>Hwang</surname>
          </string-name>
          , et al.,
          <article-title>Grounded language-image pre-training</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>10965</fpage>
          -
          <lpage>10975</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhu</surname>
          </string-name>
          , et al.,
          <article-title>Grounding dino: Marrying dino with grounded pre-training for open-set object detection</article-title>
          ,
          <source>arXiv preprint arXiv:2303.05499</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M. V.</given-names>
            <surname>Conde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Turgutlu</surname>
          </string-name>
          ,
          <article-title>Clip-art: Contrastive pre-training for fine-grained art classification</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>3956</fpage>
          -
          <lpage>3960</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hoi</surname>
          </string-name>
          ,
          <article-title>Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation</article-title>
          ,
          <source>in: International Conference on Machine Learning</source>
          , PMLR,
          <year>2022</year>
          , pp.
          <fpage>12888</fpage>
          -
          <lpage>12900</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name><given-names>J.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Savarese</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Hoi</surname></string-name>,
          <article-title>Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models</article-title>,
          <source>arXiv preprint arXiv:2301.12597</source>
          (<year>2023</year>).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name><given-names>N.</given-names> <surname>Xie</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Lai</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Doran</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Kadav</surname></string-name>,
          <article-title>Visual entailment task for visually-grounded language learning</article-title>,
          <source>arXiv preprint arXiv:1811.10582</source>
          (<year>2018</year>).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name><given-names>N.</given-names> <surname>Xie</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Lai</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Doran</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Kadav</surname></string-name>,
          <article-title>Visual entailment: A novel task for fine-grained image understanding</article-title>,
          <source>arXiv preprint arXiv:1901.06706</source>
          (<year>2019</year>).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name><given-names>P.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Yang</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Men</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Lin</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Bai</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Ma</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Zhou</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Zhou</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Yang</surname></string-name>,
          <article-title>Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework</article-title>,
          <source>in: International Conference on Machine Learning</source>
          , PMLR,
          <year>2022</year>
          , pp.
          <fpage>23318</fpage>
          -
          <lpage>23340</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name><given-names>H.</given-names> <surname>Yang</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Lin</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Yang</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Zhou</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Yang</surname></string-name>,
          <article-title>Prompt tuning for generative multimodal pretrained models</article-title>,
          <source>arXiv preprint arXiv:2208.02532</source>
          (<year>2022</year>).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>