<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">The MUSTI challenge @ MediaEval 2023 - Multimodal Understanding of Smells in Texts and Images with Zero-shot Evaluation</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Ali</forename><surname>Hürriyetoğlu</surname></persName>
							<email>ali.hurriyetoglu@dh.huc.knaw.nl</email>
							<affiliation key="aff0">
								<orgName type="institution" key="instit1">KNAW Humanities Cluster</orgName>
								<orgName type="institution" key="instit2">DHLab</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Inna</forename><surname>Novalija</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">Jožef Stefan Institute</orgName>
								<address>
									<country key="SI">Slovenia</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Mathias</forename><surname>Zinnen</surname></persName>
							<email>mathias.zinnen@fau.de</email>
							<affiliation key="aff2">
								<orgName type="department">Pattern Recognition Lab</orgName>
								<orgName type="institution">Friedrich-Alexander-Universität Erlangen-Nürnberg</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Vincent</forename><surname>Christlein</surname></persName>
							<email>vincent.christlein@fau.de</email>
							<affiliation key="aff2">
								<orgName type="department">Pattern Recognition Lab</orgName>
								<orgName type="institution">Friedrich-Alexander-Universität Erlangen-Nürnberg</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Pasquale</forename><surname>Lisena</surname></persName>
							<email>pasquale.lisena@eurecom.fr</email>
							<affiliation key="aff3">
								<orgName type="institution">EURECOM</orgName>
								<address>
									<settlement>Sophia Antipolis</settlement>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Stefano</forename><surname>Menini</surname></persName>
							<email>menini@fbk.eu</email>
							<affiliation key="aff4">
								<orgName type="institution">Fondazione Bruno Kessler</orgName>
								<address>
									<settlement>Trento</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Marieke</forename><surname>Van Erp</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution" key="instit1">KNAW Humanities Cluster</orgName>
								<orgName type="institution" key="instit2">DHLab</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Raphael</forename><surname>Troncy</surname></persName>
							<affiliation key="aff3">
								<orgName type="institution">EURECOM</orgName>
								<address>
									<settlement>Sophia Antipolis</settlement>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">The MUSTI challenge @ MediaEval 2023 - Multimodal Understanding of Smells in Texts and Images with Zero-shot Evaluation</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">74DEF95624E422CE8F867B1FFA210C84</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T19:09+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>We ran the MUSTI challenge for the second time, extending the MUSTI 2022 edition with a zero-shot evaluation scenario. This extension was needed because the first iteration showed considerable room for improvement, and the zero-shot performance of state-of-the-art methods helps us understand what available models can predict in a new language without any training. We reused the MUSTI 2022 data for training and evaluation in MUSTI 2023. Additionally, we prepared a second evaluation scenario, which we call zero-shot, in Slovenian; this language was not known to the participants before the evaluation phase started. MUSTI 2023 attracted many teams, and state-of-the-art multimodal systems perform better than the systems proposed in MUSTI 2022.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The manner in which humans engage with smell is a prime example of intangible cultural heritage: the way smells are created, the situations in which they are used, but also how they are appreciated are all highly culturally dependent. By engaging with expressions of smells in texts and images across multiple genres and multiple languages over a longer period of time, we can gain more insight into how smells have affected human interactions through time.</p><p>While smell is of vital importance in our day-to-day lives, little attention has been paid to it within the natural language processing and computer vision communities. While there are some lexicons focused on smell, the Odeuropa text benchmark dataset is the first multilingual, cross-domain text dataset focused on smell references <ref type="bibr" target="#b0">[1]</ref>. Similarly, for computer vision, no prior datasets existed until the ODOR challenge dataset was created by members of this task <ref type="bibr" target="#b1">[2]</ref>. In the Multimodal Understanding of Smells in Texts and Images (MUSTI) challenge, we bring these modalities together, inviting the research community to explore parallels and complementarities in the way smells are described and depicted in different modalities.</p><p>The MUSTI challenge at MediaEval 2023 aims to collect information about smell from digital multilingual text and image collections from the 16th to the 20th century. More precisely, MUSTI studies how different smells are referenced across modalities, using a corpus of historical multilingual texts and images. For example, what smell references can be identified in a text, and what smell sources and/or olfactory gestures can be recognized in an image?</p><p>This paper describes the second edition of MUSTI. The first edition in 2022 showed that achieving a good baseline for the task is feasible. 
One participant submission validated the task by obtaining reasonable performance <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4]</ref>. However, significant room for improvement remains in terms of classification performance. Furthermore, the deeper questions the task raises about cross-modal smell references have not yet been addressed thoroughly. Additionally, MUSTI 2023 extends the 2022 protocol by adding a zero-shot evaluation setting.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Motivation and Background</head><p>To fully make sense of digital (heritage) collections, it is necessary to go beyond an ocularcentric approach and to engage with their olfactory dimension as well, as smells offer a powerful and direct entry to our emotions and memories. With the MUSTI task, we aim to accelerate the understanding of olfactory references in English, Dutch, French, German, Italian, and Slovene texts and images, as well as the connections between these modalities. As recent and ongoing exhibitions at the Mauritshuis in The Hague, Netherlands, Museum Ulm in Ulm, Germany, and the Prado Museum in Madrid, Spain demonstrate, museums and galleries are keen to enrich museum visits with olfactory components, either for a more immersive experience or to create a more inclusive experience for differently abled museum visitors such as those with a visual impairment. Reinterpreting historical scents is attracting attention from various research disciplines (Huber et al., 2022) and leading to interesting collaborations with perfume makers; for example, the Scent of the Golden Age candle was developed after a recipe by Constantijn Huygens in a collaboration between historians and a perfume maker. To ensure that such enrichments are grounded in historically correct contexts, language and computer vision technologies can help to find olfactory-relevant examples in digitized historical collections and related sources.</p><p>With this task, we aim to investigate: i) What does it mean for a text and an image to be related in terms of smell? ii) Do different text and image genres reference smell differently? iii) Do different languages reference smell differently? iv) How do references to smell in texts and images change over time? v) How do relationships between smell references in texts and images change over time?</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Task description</head><p>Smell is an underrepresented dimension of many multimedia analysis and representation tasks. MUSTI aims to further the understanding of textual descriptions and visual depictions of smells and smelling in historical texts and images. In this shared task, participants are provided with multilingual texts (English, Dutch, German, French, Italian, and Slovene) and images, from the 16th to the 20th century, that pertain to smell in different ways. The images and the texts have been selected because they contain depictions (images) and descriptions (text) of objects that are known to reference smell. The goal of the task is to detect depictions (objects such as flowers or animals in an image) and descriptions (in texts) of objects that are known to evoke smells, and to connect these smell references across the two modalities. We formulate the challenge in the following subtasks, which can be tackled independently of each other: Subtask 1: Task participants are invited to develop language and image recognition technologies to predict whether a text passage and an image contain references to the same smell source or not. This task can therefore be cast as a binary classification problem.</p><p>Subtask 2: [Optional] The participants are also asked to identify the common smell source(s) between the text passages and the images. Detecting the smell source means identifying the object or place that has a specific smell, or that produces an odour (e. g. plant, animal, perfume, human). In other words, the smell source is the entity or phenomenon that a perceiver experiences with his or her senses. 
This subtask can therefore be cast as a multi-label classification problem.</p><p>Subtask 3: [Optional] For this subtask we introduce a new evaluation setting, with test data that consists of image and text pairs in languages that are not provided in the training setting. The training data is available in English, French, German, and Italian, and the test data covers these four languages plus two additional languages, Dutch and Slovene. We refer to this subtask as a zero-shot evaluation setting.</p></div>
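Framed as above, Subtask 1 reduces to a binary decision over an image-text pair. A minimal sketch, assuming hypothetical precomputed multimodal embeddings for the text passage and the image (the vectors and the 0.5 threshold are illustrative, not part of the challenge definition):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def same_smell_source(text_emb, image_emb, threshold=0.5):
    """Binary decision for Subtask 1: do the text passage and the image
    reference the same smell source? Here decided by thresholding the
    similarity of hypothetical precomputed embeddings (assumption)."""
    return 1 if cosine(text_emb, image_emb) >= threshold else 0
```

A system trained on the MUSTI data would learn the embeddings and the decision boundary; the sketch only shows the shape of the prediction step.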
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Target groups and Recruiting participants</head><p>Interest in sensory mining (e. g. the 1st International Workshop on Multisensory Data and Knowledge (MDK) @ LDK 2021 and the 2nd International Workshop on Multisensory Data and Knowledge (MDK) @ TheWebConf 2023) and multimodal information processing (e. g. the 1st International Workshop on Multimodal Understanding for the Web and Social Media (MUWS), co-located with The Web Conference (WWW) 2022) is growing in different research disciplines. Although participation was limited in MUSTI 2022, we consider MUSTI 2023 an opportunity to get in early and establish a leading position on this problem. Community outreach already started in 2022, and we executed a communication plan to increase the likelihood of reaching a broad community that could propose solutions to the problem in 2023. The Computer Vision ODOR challenge that we organised as part of ICPR 2022 demonstrates the research community's interest in taking on the previously unaddressed topic of smell. As the task proposers are members of the language technology, computer vision, cultural heritage, digital humanities and semantic web communities, they will publicize the task in their communities via the appropriate mailing lists, social media channels such as Twitter/X and Mastodon, and via upcoming presentations at the Language Resources and Evaluation Conference, the Digital Humanities/Artificial Intelligence Seminar, the European Semantic Web Conference, DHBenelux, The Web Conference, and the Digital Humanities Conference. Furthermore, the Odeuropa Network (consisting of &gt;150 members), the project mailing list, and other communication channels have a wide reach. Finally, we collected a list of scholars and research groups that work at the intersection of vision and language processing in the first edition of MUSTI in 2022. 
We will expand this list and invite these people to participate in MUSTI 2023. The MUSTI task also provides an excellent use case for students to hone their multimodal and creative problem-solving skills. We will therefore also advertise the challenge at relevant outlets such as the International Semantic Web Summer School and the EURECOM Machine Learning and Intelligent Systems (MALIS) course.</p><p>By splitting the task into two stages (first binary classification, then multi-label classification), we aim to reduce the barrier to participation. Furthermore, the team will make available baseline smell reference recognition software for texts and images that the participants can build on.</p><p>Most researchers already have very busy agendas; we therefore aim to make the task attractive to interested parties by providing tools that make it easier to get going. Furthermore, we will actively target students and early-career researchers as well as industry to cast a wide net. The potential application domains of the task help here.</p><p>The Odeuropa project has created smell reference benchmark datasets for texts and images that will be utilised <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Data</head><p>The MUSTI 2023 dataset consists of copyright-free texts and partly copyrighted images, which the participants can download using the URLs we provide. We offer texts in English, Dutch, French, German, Italian, and Slovene (zero-shot scenario) that participants are to match to the images. The texts are selected from open repositories such as Project Gutenberg, Europeana, the Royal Society Corpus, Deutsches Textarchiv, Gallica, Wikisource, and Liber Liber. The images are selected from different archives such as RKD, Bildindex der Kunst und Architektur, Museum Boijmans, the Ashmolean Museum Oxford, and the Plateforme Ouverte du Patrimoine. The images are annotated with 169 categories of smell objects and gestures, such as flowers, food, animals, sniffing, and holding the nose. The object categories are organised in a two-level taxonomy. The Odeuropa text and image benchmark datasets are available as training data to the participants. The image dataset consists of 4,696 images with 36,663 associated object annotations, 600 gesture annotations, and image-level metadata. We also provide the output of a text processing system we have developed to identify text snippets that contain smell references. The systems of the participants are evaluated on a held-out dataset of roughly 1,200 images with associated texts in the four languages. Figure <ref type="figure" target="#fig_0">1</ref> provides an example of mapping an image to Slovenian text (text translation: "The stem is round and smooth, and the leaves are lanceolate and bright green. The lily's flowers are large, pure white, and smell very nice. Each flower has six petals, which are curved back at the top. The lily means purity and innocence."). The Slovenian example is a description of the lily flower from the journal "Teacher's Mate", published in 1862.</p></div>
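The exact file layout distributed to participants is not reproduced here; assuming a simple tab-separated file of image URL, text passage, and binary label (the column names are hypothetical), the image-text pairs could be loaded as:

```python
import csv
import io

def load_pairs(tsv_text):
    """Parse (image_url, text, label) rows from a hypothetical
    tab-separated annotation file; the column names are illustrative."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [(row["image_url"], row["text"], int(row["label"]))
            for row in reader]

sample = ("image_url\ttext\tlabel\n"
          "http://example.org/lily.jpg\tThe flowers smell very nice.\t1\n")
pairs = load_pairs(sample)  # one (url, text, label) tuple
```

Since only URLs are distributed for the partly copyrighted images, participants would fetch each `image_url` themselves before training or inference.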
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Evaluation</head><p>Task runs are evaluated against a gold standard consisting of image-text pairs. For the evaluation, we use multiple statistics, as each provides a slightly different perspective on the results. The code and models of the baselines are available at . The subtasks are evaluated using the following metrics:</p><p>Subtask 1: Predicting whether an image and a text passage evoke the same smell source or not. This subtask is evaluated using precision, recall, and F1-score. As multiple text passages in different languages can be linked to the same image, we employ multiple linking scorers, such as CEAF and BLANC, to measure the performance across different smell reference chains.</p><p>Subtask 2: Identifying the common smell source(s) between the text passages and the images. For this subtask, precision, recall, and F1-score are employed, as well as more fine-grained evaluation methods such as RUFES, which can accommodate multi-level taxonomies.</p><p>Subtask 3: Zero-shot evaluation setting. The evaluation for this subtask is the same as for Subtasks 1 and 2. The only difference is that no training data was provided for this subtask. </p></div>
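For reference, the headline metrics for Subtask 1 can be computed as follows (standard definitions only; this is not the official challenge scorer):

```python
def precision_recall_f1(gold, pred):
    """Precision, recall and F1 for binary labels, matching the
    metric definitions used to score Subtask 1."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# e.g. two correct decisions out of four pairs:
p, r, f = precision_recall_f1([1, 1, 0, 0], [1, 0, 1, 0])  # 0.5, 0.5, 0.5
```

The linking scorers (CEAF, BLANC) and RUFES operate over reference chains and taxonomies rather than flat labels, so they are not reducible to this sketch.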
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Related Work</head><p>To the best of our knowledge, the task of predicting whether an image and a text evoke the same smell had not been tackled prior to the previous MUSTI challenge <ref type="bibr" target="#b2">[3]</ref>. However, some closely related tasks about text-image alignment are established in the literature. In visual question answering (VQA), the aim is to develop systems capable of reasoning about visual information in order to answer textual questions posed to the systems <ref type="bibr" target="#b4">[5]</ref>. Based on existing datasets like COCO <ref type="bibr" target="#b5">[6]</ref> or Visual Genome <ref type="bibr" target="#b6">[7]</ref>, various datasets and benchmarks have been proposed since the mid-2010s to train and evaluate VQA algorithms <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b8">9,</ref><ref type="bibr" target="#b9">10,</ref><ref type="bibr" target="#b10">11]</ref>.</p><p>Another closely related strand of research is vision-language pretraining (VLP), where multimodal language and vision models are pre-trained on large amounts of image-caption pairs to learn an embedding space shared between visual and textual embeddings. Models pre-trained in this manner exhibit strong generalization capabilities when fine-tuned and applied to their respective downstream tasks. The most influential VLP algorithm is CLIP [], with numerous applications such as multimodal object detection <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b12">13]</ref>, image retrieval, artwork classification <ref type="bibr" target="#b13">[14]</ref>, or captioning <ref type="bibr" target="#b14">[15,</ref><ref type="bibr" target="#b15">16]</ref>.</p><p>Even closer to the MUSTI objective is the task of visual entailment (VE), introduced by Xie et al. 
<ref type="bibr" target="#b16">[17,</ref><ref type="bibr" target="#b17">18]</ref> together with their SNLI-VE dataset, which provides the default benchmark for the task. Given an image-sentence pair, the aim of VE is to predict whether the image semantically entails the text. VE algorithms are thus required to develop a semantic understanding of both images and texts and to relate them to each other. Recent algorithms like OFA <ref type="bibr" target="#b18">[19]</ref> or PromptTuning <ref type="bibr" target="#b19">[20]</ref> achieve accuracies of over 90% on the SNLI-VE benchmark, suggesting that a more difficult benchmark might be beneficial. Given that in MUSTI logical entailment is replaced with smell entailment, the MUSTI objective could be framed as olfactory entailment as opposed to VE.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Example from Slovenian data: image and mapped text snapshot.</figDesc><graphic coords="5,89.29,101.03,416.70,244.52" type="bitmap" /></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">A multilingual benchmark to capture olfactory situations over time</title>
		<author>
			<persName><forename type="first">S</forename><surname>Menini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Paccosi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Tonelli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Van Erp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Leemans</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Lisena</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Troncy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Tullett</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Hürriyetoğlu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Dijkstra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Gordijn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Jürgens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Koopman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ouwerkerk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Steen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Novalija</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Brank</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Mladenic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zidar</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2022.lchange-1.1</idno>
		<ptr target="https://aclanthology.org/2022.lchange-1.1" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change</title>
				<editor>
			<persName><forename type="first">N</forename><surname>Tahmasebi</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Montariol</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Kutuzov</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Hengchen</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Dubossarsky</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Borin</surname></persName>
		</editor>
		<meeting>the 3rd Workshop on Computational Approaches to Historical Language Change, Association for Computational Linguistics<address><addrLine>Dublin, Ireland</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="1" to="10" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">ODOR: The ICPR2022 Odeuropa challenge on olfactory object recognition</title>
		<author>
			<persName><forename type="first">M</forename><surname>Zinnen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Madhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Kosti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Maier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Christlein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2022 26th International Conference on Pattern Recognition (ICPR)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="4989" to="4994" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">MUSTI - Multimodal understanding of smells in texts and images at MediaEval</title>
		<author>
			<persName><forename type="first">A</forename><surname>Hürriyetoğlu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Paccosi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Menini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zinnen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Lisena</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Akdemir</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Troncy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Van Erp</surname></persName>
		</author>
		<ptr target="https://ceur-ws.org/Vol-3583/paper50.pdf" />
	</analytic>
	<monogr>
		<title level="m">Working Notes Proceedings of the MediaEval 2022 Workshop</title>
		<title level="s">CEUR Workshop Proceedings</title>
		<editor>
			<persName><forename type="first">S</forename><surname>Hicks</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><forename type="middle">G S</forename><surname>De Herrera</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Langguth</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Lommatzsch</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Andreadis</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Dao</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Martin</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Hürriyetoğlu</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Thambawita</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><forename type="middle">S</forename><surname>Nordmo</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Vuillemot</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Larson</surname></persName>
		</editor>
		<meeting><address><addrLine>Bergen, Norway and Online</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023-01-12">12-13 January 2023</date>
			<biblScope unit="volume">3583</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Multimodal and multilingual understanding of smells using ViLBERT and mUNITER</title>
		<author>
			<persName><forename type="first">K</forename><surname>Akdemir</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Hürriyetoğlu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Troncy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Paccosi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Menini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zinnen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Christlein</surname></persName>
		</author>
		<ptr target="https://ceur-ws.org/Vol-3583/paper36.pdf" />
	</analytic>
	<monogr>
		<title level="m">Working Notes Proceedings of the MediaEval 2022 Workshop</title>
		<title level="s">CEUR Workshop Proceedings</title>
		<editor>
			<persName><forename type="first">S</forename><surname>Hicks</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><forename type="middle">G S</forename><surname>De Herrera</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Langguth</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Lommatzsch</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Andreadis</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Dao</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Martin</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Hürriyetoğlu</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Thambawita</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><forename type="middle">S</forename><surname>Nordmo</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Vuillemot</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Larson</surname></persName>
		</editor>
		<meeting><address><addrLine>Bergen, Norway and Online</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023-01-12">12-13 January 2023</date>
			<biblScope unit="volume">3583</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Visual question answering: A survey of methods and datasets</title>
		<author>
			<persName><forename type="first">Q</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Teney</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Dick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Van Den Hengel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computer Vision and Image Understanding</title>
		<imprint>
			<biblScope unit="volume">163</biblScope>
			<biblScope unit="page" from="21" to="40" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Microsoft COCO: Common objects in context</title>
		<author>
			<persName><forename type="first">T.-Y</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Maire</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Belongie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hays</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Perona</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ramanan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dollár</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L</forename><surname>Zitnick</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Computer Vision-ECCV 2014: 13th European Conference</title>
				<meeting><address><addrLine>Zurich, Switzerland</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2014">September 6-12, 2014. 2014</date>
			<biblScope unit="page" from="740" to="755" />
		</imprint>
	</monogr>
	<note>Proceedings, Part V 13</note>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Visual genome: Connecting language and vision using crowdsourced dense image annotations</title>
		<author>
			<persName><forename type="first">R</forename><surname>Krishna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Groth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Johnson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Hata</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kravitz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Kalantidis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L.-J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">A</forename><surname>Shamma</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Computer Vision</title>
		<imprint>
			<biblScope unit="volume">123</biblScope>
			<biblScope unit="page" from="32" to="73" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">VQA: Visual question answering</title>
		<author>
			<persName><forename type="first">S</forename><surname>Antol</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Agrawal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mitchell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Batra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L</forename><surname>Zitnick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Parikh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE international conference on computer vision</title>
				<meeting>the IEEE international conference on computer vision</meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="2425" to="2433" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Making the V in VQA matter: Elevating the role of image understanding in visual question answering</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Khot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Summers-Stay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Batra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Parikh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE conference on computer vision and pattern recognition</title>
				<meeting>the IEEE conference on computer vision and pattern recognition</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="6904" to="6913" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Visual7W: Grounded question answering in images</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Groth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bernstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Fei-Fei</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE conference on computer vision and pattern recognition</title>
				<meeting>the IEEE conference on computer vision and pattern recognition</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="4995" to="5004" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning</title>
		<author>
			<persName><forename type="first">J</forename><surname>Johnson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Hariharan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Van Der Maaten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Fei-Fei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L</forename><surname>Zitnick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Girshick</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE conference on computer vision and pattern recognition</title>
				<meeting>the IEEE conference on computer vision and pattern recognition</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="2901" to="2910" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Grounded language-image pre-training</title>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">H</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Yuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-N</forename><surname>Hwang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</title>
				<meeting>the IEEE/CVF Conference on Computer Vision and Pattern Recognition</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="10965" to="10975" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zeng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Su</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2303.05499</idno>
		<title level="m">Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">CLIP-Art: Contrastive pre-training for fine-grained art classification</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">V</forename><surname>Conde</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Turgutlu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</title>
				<meeting>the IEEE/CVF Conference on Computer Vision and Pattern Recognition</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="3956" to="3960" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation</title>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Xiong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hoi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
				<meeting><address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="12888" to="12900" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Savarese</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hoi</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2301.12597</idno>
		<title level="m">BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">Visual entailment task for visually-grounded language learning</title>
		<author>
			<persName><forename type="first">N</forename><surname>Xie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Lai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Doran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kadav</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1811.10582</idno>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<author>
			<persName><forename type="first">N</forename><surname>Xie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Lai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Doran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kadav</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1901.06706</idno>
		<title level="m">Visual entailment: A novel task for fine-grained image understanding</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">OFA: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework</title>
		<author>
			<persName><forename type="first">P</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Men</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Yang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
				<meeting><address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="23318" to="23340" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Yang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2208.02532</idno>
		<title level="m">Prompt tuning for generative multimodal pretrained models</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
