The MUSTI challenge @ MediaEval 2023 - Multimodal
                                Understanding of Smells in Texts and Images with
                                Zero-shot Evaluation
                                Ali Hürriyetoğlu1,* , Inna Novalija2 , Mathias Zinnen3 , Vincent Christlein3 ,
                                Pasquale Lisena4 , Stefano Menini5 , Marieke van Erp1 and Raphael Troncy4
                                1
                                  KNAW Humanities Cluster, DHLab
                                2
                                  Jožef Stefan Institute, Slovenia
                                3
                                  Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg
                                4
                                  EURECOM, Sophia Antipolis, France
                                5
                                  Fondazione Bruno Kessler, Trento, Italy


                                                                         Abstract
                                                                         We ran the MUSTI challenge the second time after the MUSTI 2022 edition by extending the evaluation
                                                                         with a zero-shot evaluation scenario. This was needed as the first iteration showed us there is a lot of room
                                                                         for improvement and zero-shot performance of the state-of-the-art methods is useful in understanding
                                                                         what available models can predict without any training in a new language. We used the same data from
                                                                         MUSTI 2022 for training and evaluation for MUSTI 2023. Additionally, we prepared a second evaluation
                                                                         scenario, which we call zero-shot, in Slovenian, which was not known by the participants before the
                                                                         evaluation phase started. MUSTI 2023 has attracted many teams and state-of-the-art multimodal systems
                                                                         perform better than the systems proposed in MUSTI 2022.


                                1. Introduction
                                The manner in which humans engage with smell is a prime example of intangible cultural
                                heritage: the way smells are created, in what situations they are used, but also how they are
                                appreciated are highly culturally dependent. By engaging with expressions of smells in texts
                                and images across multiple genres and multiple languages over a longer period of time, we can
                                gain more insights into how smells have affected human interactions through time.
                                   While smell is of vital importance in our day-to-day lives, little attention has been paid to it
                                within the natural language processing and computer vision communities. While there are some
                                lexicons focused on smell, the Odeuropa text benchmark dataset is the first multilingual, cross-
                                domain text dataset focused on smell references [1]. Similarly, for computer vision, no prior
                                datasets existed until the ODOR challenge dataset was created by members of this task [2]. In
                                the Multimodal Understanding of Smells in Texts and Images (MUSTI) challenge, we bring these
                                modalities together, inviting the research community to explore parallels and complementarities
                                in the way smells are described and depicted in different modalities.
                                   The MUSTI challenge at MediaEval 2023 aims to collect information about smell from digital
                                multilingual text and image collections between the 16th to 20th centuries. More precisely,

                                MediaEval’23: Multimedia Evaluation Workshop, February 1–2, 2024, Amsterdam, The Netherlands and Online
                                *
                                  Corresponding author.
                                †
                                  These authors contributed equally.
                                $ ali.hurriyetoglu@dh.huc.knaw.nl (A. Hürriyetoğlu); inna.koval@ijs.si (I. Novalija); mathias.zinnen@fau.de
                                (M. Zinnen); vincent.christlein@fau.de (V. Christlein); pasquale.lisena@eurecom.fr (P. Lisena); menini@fbk.eu
                                (S. Menini)
                                                                       © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                                    CEUR
                                    Workshop
                                    Proceedings
                                                  http://ceur-ws.org
                                                  ISSN 1613-0073
                                                                       CEUR Workshop Proceedings (CEUR-WS.org)


CEUR
                  ceur-ws.org
Workshop      ISSN 1613-0073
Proceedings
MUSTI studies how different smells are referenced in modalities using a corpus of historical
multilingual texts and images. For example, what smell references can be identified in a text
and what smell sources and/or olfactory gestures can be recognized in an image?
   This paper is for the second edition of MUSTI. The first edition in 2022 observed that
achieving a good baseline for the task is feasible. One participant submission validated the
task by obtaining reasonable performance [3, 4]. However, there remains significant room for
improvement in terms of classification performance. Furthermore, the quest for insight has not
yet been addressed thoroughly. Additionally, MUSTI 2023 extends the 2022 protocol by adding
a zero-shot evaluation setting.


2. Motivation and Background
To fully make sense of digital (heritage) collections, it is necessary to go beyond an ocular-
centric approach and to engage with their olfactory dimension as well, as these offer a powerful
and direct entry to our emotions and memories. With the MUSTI task, we aim to accelerate
the understanding of olfactory references in English, Dutch, French, German, Italian, and
Slovene texts and images as well as the connections between these modalities. As recent and
ongoing exhibitions at Mauritshuis in The Hague, Netherlands, Museum Ulm in Ulm, Germany,
and the Prado Museum in Madrid, Spain demonstrate, museums and galleries are keen to
enrich museum visits with olfactory components – either for a more immersive experience
or to create a more inclusive experience for differently abled museum visitors such as those
with a visual impairment. Reinterpreting historical scents is attracting attention from various
research disciplines (Huber et al., 2022) and leading to interesting collaborations with perfume
makers, for example, the Scent of the Golden Age candle was developed after a recipe by
Constantijn Huygens in a collaboration between historians and a perfume maker. To ensure that
such enrichments are grounded in historically correct contexts, language and computer vision
technologies can help to find olfactory relevant examples in digitized historical collections and
related sources.
    With this task, we aim to investigate: i) What does it mean for a text and an image to be
related in terms of smell? ii) Do different text and image genres reference smell differently?
iii) Do different languages reference smell differently? iv) How do references to smell in texts
and images change over time? v) How do relationships between smell references in texts and
images change over time?


3. Task description
Smell is an underrepresented dimension of many multimedia analysis and representation tasks.
MUSTI aims to further the understanding of textual descriptions and visual depictions of smells
and smelling in historical texts and images. In this shared task, participants are provided with
multilingual texts (English, Dutch, German, French, Italian, and Slovene) and images, from the
16th to the 20th century, that pertain to smell in different ways. The images and the texts have
been selected because they contain depictions (images) and descriptions (text) of objects that are
known to reference smell. The goal of the task is to detect references to depictions (objects such
as flowers or animals in an image) and descriptions (texts) of objects that are known to evoke
smells in texts and images and to connect these smell references across these two modalities.
We formulate the challenge in the following subtasks that could be tackled independently from
each other:
   Subtask 1: Task participants are invited to develop language and image recognition tech-
nologies to predict whether a text passage and an image contain references to the same smell
source or not. This task can therefore be cast as a binary classification problem.
   Subtask 2: [Optional] The participants are also asked to identify what is (are) the common
smell source(s) between the text passages and the images. The detection of the smell source
includes detecting the object or place that has a specific smell, or that produces an odour (e. g.
plant, animal, perfume, human). In other words, the smell source is the entity or phenomenon
that a perceiver experiences with his or her senses. This sub-task can therefore be cast as a
multi-label classification problem.
   Subtask 3: [Optional] For this subtask we include a new evaluation setting, with test data
that consists of image and text pairs in languages that are not provided in the training setting.
The training data is available in English, French, German, and Italian and the test data is in all
these four languages and two additional languages, which are Dutch and Slovene. We refer to
this subtask as a zero-shot evaluation setting.


4. Target groups and Recruiting participants
Due to the growing interest in sensory mining (e. g. 1st International Workshop on Multisensory
Data and Knowledge (MDK) @ LDK 2021 and 2nd International Workshop on Multisensory
Data and Knowledge (MDK) @ theWebConf 2023) and multimodal information processing
(e. g. 1st International Workshop on Multimodal Understanding for the Web and Social Media
(MUWS), co-located with The WebConf (WWW) 2022 in different research disciplines. Although
participation was limited in MUSTI 2022, we consider MUSTI 2023 to be an opportunity to get
in early and establish a leading position on this problem. Community outreach has already
started in 2022 and with the execution of a communication plan to enhance the likelihood of
reaching a broad community that could propose solutions to the problem we proposed in 2023.
The Computer Vision ODOR challenge that we organised as a part of ICPR2022, demonstrates
the research community’s interest in taking on the previously unaddressed topic of smell.
As the task proposers are members of the language technology, computer vision, cultural
heritage, digital humanities and semantic web communities, they will publicize the task in
their communities via the appropriate mailing lists, social media channels such as Twitter/X
and Mastodon, and via upcoming presentations at the Language Resources and Evaluation
Conference, the Digital Humanities/Artificial Intelligence Seminar, the European Semantic
Web Conference, DHBenelux, The Web Conference, and the Digital Humanities Conference.
Furthermore, the Odeuropa Network (consisting of >150 members), the project mailing list, and
other communication channels have a wide reach. Finally, we have collected a list of scholars
and research groups that work at the intersection of vision and language processing in the first
edition of MUSTI in 2022. We will expand this list and invite these people to participate in
MUSTI 2023. The MUSTI task also provides an excellent use case for students to hone their
multimodal and creative problem-solving skills. We will therefore also advertise the challenge
at relevant outlets such as the International Semantic Web Summer School and the EURECOM
Machine Learning and Intelligent System (MALIS) course.
    By splitting up the task into two stages (first binary classification, then multi-class classifica-
tion) we aim to reduce the barrier to participation. Furthermore, the team will make available
baseline smell reference recognition software for texts and images that the participants can
build on.
    Most researchers have already very busy agendas thus we aim to make the task attractive to
interested parties by providing tools to get going more easily. Furthermore, we will actively
target students and early-career researchers as well as industry to cast a wide net. The potential
application domains of the task help here.
   The Odeuropa project has created smell reference benchmark datasets for texts and images
that will be utilised [1, 2].


5. Data
The MUSTI 2023 dataset consists of copyright-free texts and partly copyrighted images that can
be downloaded and submitted by the participants using the URLs we provide. We offer texts
in English, Dutch, French, German, Italian, and Slovene (zero-shot scenario) that participants
are to match to the images. The texts are selected from open repositories such as Project
Gutenberg, Europeana, Royal Society Corpus, Deutsches Text Arxiv, Gallica, Wikisource and
Liber Liber. The images are selected from different archives such as RKD, Bildindex der Kunst
und Architektur, Museum Boijmans, Ashmolean Museum Oxford, and Plateforme Ouverte du
Patrimoine. The images are annotated with 169 categories of smell objects and gestures such as
flowers, food, animals, sniffing and holding the nose. The object categories are organised in a
two-level taxonomy. The Odeuropa text and image benchmark datasets are available as training
data to the participants. The image dataset consists of 4,696 images with 36,663 associated
object annotations, 600 gesture annotations, and image=level meta-data. We also provide the
output of a text processing system we have developed to identify text snippets that contain
smell references. The systems of the participants are evaluated on a held-out dataset of roughly
1,200 images with associated texts in the four languages.
   Figure 1 provides an example of mapping images with Slovenian text (text translation: "The
stem is round and smooth, and the leaves are lanceolate and bright green. Lily’s flowers are
large, pure white, and smell very nice. Each flower has six petals, which are curved back at the
top. Lily means purity and innocence.") The Slovenian example presents a description of the
Lily flower from the journal "Teacher’s Mate" published in 1862.


6. Evaluation
Task runs are evaluated against a gold standard consisting of image-text pairs. For the evaluation,
we use multiple statistics as each provides a slightly different perspective on the results. The
code and models of the baselines are available at . The subtasks are evaluated using the following
metrics:
   Subtask 1: Predicting whether an image and a text passage evoke the same smell source or
not. This subtask is evaluated using precision, recall and F1 -score. As multiple text passages in
different languages can be linked to the same image, we employ multiple linking scorers such
as CEAF and BLANC to measure the performance across different smell reference chains.
   Subtask 2: Identifying the common smell source(s) between the text passages and the images.
For this subtask, precision, recall and F1 -score are employed, as well as more fine-grained
evaluation methods such as RUFES, which can accommodate multi-level taxonomies.
   Subtask 3: Zero-shot evaluation setting. The evaluation for this subtask is the same as
subtasks 1 and 2. The only difference is that no training data was provided for this subtask.
Figure 1: Example from Slovenian data: image and mapped text snapshot.


7. Related Work
To the best of our knowledge, the task of predicting whether an image and a text evoke the
same smell has not been tackled prior to the previous MUSTI challenge [3]. However, some
closely related tasks about text-image alignment are established in literature: In visual question
answering (VQA), the aim is to develop systems capable of reasoning about visual information
in order to answer textual questions posed to the systems [5]. Based on existing datasets like
COCO [6] or Visual Genome [7], various datasets and benchmarks have been proposed since
the mid-2010s to train and evaluate VQA algorithms [8, 9, 10, 11].
   Another closely related strand of research is vision-language pretraining (VLP) where multi-
modal language and vision models are pre-trained on large amounts of image-caption pairs to
learn an embedding space shared between visual and textual embeddings. Models pre-trained
in this manner exhibit strong generalization capabilities when fine-tuned and applied to their
respective downstream task. The most influental VLP algorithm is CLIP [] with numerous appli-
cations such as multimodal object detection [12, 13], image retrieval, artwork classification [14],
or captioning [15, 16].
   Even closer to the MUSTI objective is the task of visual entailment (VE), introduced by Xie et
al. [17, 18] together with their SNLI-VE dataset which provides the default benchmark for the
task. Given an image-sentence pair, the aim of VE is to predict whether the image semantically
entails the text. VE algorithms are thus required to develop a semantic understanding of
both images and texts and relate them to each other. Recent algorithms like OFA [19] or
PromptTuning [20] achieve accuracies of over 90% at the SNLI-VE benchmark, suggesting that
a more difficult benchmark might be beneficial. Given that in MUSTI, logical entailment is
replaced with smell entailment, the MUSTI objective could be framed as olfactory entailment as
opposed to VE.
References
 [1] S. Menini, T. Paccosi, S. Tonelli, M. Van Erp, I. Leemans, P. Lisena, R. Troncy, W. Tullett, A. Hür-
     riyetoğlu, G. Dijkstra, F. Gordijn, E. Jürgens, J. Koopman, A. Ouwerkerk, S. Steen, I. Noval-
     ija, J. Brank, D. Mladenic, A. Zidar, A multilingual benchmark to capture olfactory situa-
     tions over time, in: N. Tahmasebi, S. Montariol, A. Kutuzov, S. Hengchen, H. Dubossarsky,
     L. Borin (Eds.), Proceedings of the 3rd Workshop on Computational Approaches to Historical
     Language Change, Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 1–10. URL:
     https://aclanthology.org/2022.lchange-1.1. doi:10.18653/v1/2022.lchange-1.1.
 [2] M. Zinnen, P. Madhu, R. Kosti, P. Bell, A. Maier, V. Christlein, Odor: The icpr2022 odeuropa
     challenge on olfactory object recognition, in: 2022 26th International Conference on Pattern
     Recognition (ICPR), IEEE, 2022, pp. 4989–4994.
 [3] A. Hürriyetoglu, T. Paccosi, S. Menini, M. Zinnen, P. Lisena, K. Akdemir, R. Troncy, M. van Erp,
     MUSTI - multimodal understanding of smells in texts and images at mediaeval 2022, in: S. Hicks,
     A. G. S. de Herrera, J. Langguth, A. Lommatzsch, S. Andreadis, M. Dao, P. Martin, A. Hürriyetoglu,
     V. Thambawita, T. S. Nordmo, R. Vuillemot, M. A. Larson (Eds.), Working Notes Proceedings of the
     MediaEval 2022 Workshop, Bergen, Norway and Online, 12-13 January 2023, volume 3583 of CEUR
     Workshop Proceedings, CEUR-WS.org, 2022. URL: https://ceur-ws.org/Vol-3583/paper50.pdf.
 [4] K. Akdemir, A. Hürriyetoglu, R. Troncy, T. Paccosi, S. Menini, M. Zinnen, V. Christlein, Multimodal
     and multilingual understanding of smells using vilbert and muniter, in: S. Hicks, A. G. S. de Herrera,
     J. Langguth, A. Lommatzsch, S. Andreadis, M. Dao, P. Martin, A. Hürriyetoglu, V. Thambawita,
     T. S. Nordmo, R. Vuillemot, M. A. Larson (Eds.), Working Notes Proceedings of the MediaEval
     2022 Workshop, Bergen, Norway and Online, 12-13 January 2023, volume 3583 of CEUR Workshop
     Proceedings, CEUR-WS.org, 2022. URL: https://ceur-ws.org/Vol-3583/paper36.pdf.
 [5] Q. Wu, D. Teney, P. Wang, C. Shen, A. Dick, A. Van Den Hengel, Visual question answering: A
     survey of methods and datasets, Computer Vision and Image Understanding 163 (2017) 21–40.
 [6] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft
     coco: Common objects in context, in: Computer Vision–ECCV 2014: 13th European Conference,
     Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, Springer, 2014, pp. 740–755.
 [7] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A.
     Shamma, et al., Visual genome: Connecting language and vision using crowdsourced dense image
     annotations, International journal of computer vision 123 (2017) 32–73.
 [8] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, D. Parikh, Vqa: Visual question
     answering, in: Proceedings of the IEEE international conference on computer vision, 2015, pp.
     2425–2433.
 [9] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, D. Parikh, Making the v in vqa matter: Elevating the
     role of image understanding in visual question answering, in: Proceedings of the IEEE conference
     on computer vision and pattern recognition, 2017, pp. 6904–6913.
[10] Y. Zhu, O. Groth, M. Bernstein, L. Fei-Fei, Visual7w: Grounded question answering in images,
     in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp.
     4995–5004.
[11] J. Johnson, B. Hariharan, L. Van Der Maaten, L. Fei-Fei, C. Lawrence Zitnick, R. Girshick, Clevr: A
     diagnostic dataset for compositional language and elementary visual reasoning, in: Proceedings of
     the IEEE conference on computer vision and pattern recognition, 2017, pp. 2901–2910.
[12] L. H. Li, P. Zhang, H. Zhang, J. Yang, C. Li, Y. Zhong, L. Wang, L. Yuan, L. Zhang, J.-N. Hwang, et al.,
     Grounded language-image pre-training, in: Proceedings of the IEEE/CVF Conference on Computer
     Vision and Pattern Recognition, 2022, pp. 10965–10975.
[13] S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, C. Li, J. Yang, H. Su, J. Zhu, et al., Grounding
     dino: Marrying dino with grounded pre-training for open-set object detection, arXiv preprint
     arXiv:2303.05499 (2023).
[14] M. V. Conde, K. Turgutlu, Clip-art: Contrastive pre-training for fine-grained art classification, in:
     Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp.
     3956–3960.
[15] J. Li, D. Li, C. Xiong, S. Hoi, Blip: Bootstrapping language-image pre-training for unified vision-
     language understanding and generation, in: International Conference on Machine Learning, PMLR,
     2022, pp. 12888–12900.
[16] J. Li, D. Li, S. Savarese, S. Hoi, Blip-2: Bootstrapping language-image pre-training with frozen
     image encoders and large language models, arXiv preprint arXiv:2301.12597 (2023).
[17] N. Xie, F. Lai, D. Doran, A. Kadav, Visual entailment task for visually-grounded language learning,
     arXiv preprint arXiv:1811.10582 (2018).
[18] N. Xie, F. Lai, D. Doran, A. Kadav, Visual entailment: A novel task for fine-grained image under-
     standing, arXiv preprint arXiv:1901.06706 (2019).
[19] P. Wang, A. Yang, R. Men, J. Lin, S. Bai, Z. Li, J. Ma, C. Zhou, J. Zhou, H. Yang, Ofa: Unifying
     architectures, tasks, and modalities through a simple sequence-to-sequence learning framework,
     in: International Conference on Machine Learning, PMLR, 2022, pp. 23318–23340.
[20] H. Yang, J. Lin, A. Yang, P. Wang, C. Zhou, H. Yang, Prompt tuning for generative multimodal
     pretrained models, arXiv preprint arXiv:2208.02532 (2022).