<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">AIGeN-Llama: An Adversarial Approach for Instruction Generation in VLN using Llama2 Model</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Niyati</forename><surname>Rawal</surname></persName>
							<email>niyati.rawal@unimore.it</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Modena and Reggio Emilia</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Lorenzo</forename><surname>Baraldi</surname></persName>
							<email>lorenzo.baraldi@unimore.it</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Modena and Reggio Emilia</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Rita</forename><surname>Cucchiara</surname></persName>
							<email>rita.cucchiara@unimore.it</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Modena and Reggio Emilia</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">AIGeN-Llama: An Adversarial Approach for Instruction Generation in VLN using Llama2 Model</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">E70E94BEC80938457BE6D4E6B04E8FF7</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:27+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>vision</term>
					<term>language</term>
					<term>navigation</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Vision-and-Language Navigation (VLN) aims to train a robot to perceive the surrounding environment and follow human instructions. In the context of Digital Libraries, such agents hold transformative potential for assisting users in navigating large, multi-modal repositories and in interpreting and connecting spatial, visual, and textual data. However, training agents to follow human-like instructions in unknown environments remains a significant challenge, largely due to the scarcity of labeled training data. To address this, we propose AIGeN-Llama, an adversarial framework that utilizes Llama2 models for instruction generation. The Llama2 generator synthesizes navigation instructions by processing image sequences, while a Llama2 discriminator judges the authenticity of these instructions against ground-truth data. This adversarial training enhances the realism of the generated instructions. To quantitatively evaluate the proposed model, we use metrics that are commonly employed for image description, namely BLEU, METEOR, ROUGE, CIDEr, and SPICE. In addition, we present qualitative samples that demonstrate the effectiveness of our method. The experiments highlight the flexibility and capability of Llama2 as both a generator and a discriminator, demonstrating its potential to advance embodied VLN tasks.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Vision-and-Language Navigation (VLN) represents a critical frontier in embodied AI, where agents are tasked with navigating unfamiliar environments based on natural language instructions. Beyond its traditional applications in assistive robotics and autonomous systems, VLN holds significant promise for enhancing digital libraries by enabling more intuitive, interactive, and accessible ways of exploring complex, multi-modal repositories. For instance, VLN agents could guide users through immersive virtual archives or assist in retrieving spatially or thematically relevant digital content using conversational queries. Currently, the development of robust VLN agents remains hindered by the scarcity of large-scale, high-quality datasets that pair trajectories with human instructions. This limitation not only affects generalization to unseen environments, a core requirement for real-world deployment, but also constrains the potential integration of VLN technologies into innovative digital library applications.</p><p>Recent studies have shown that augmenting training datasets with synthetic instructions can improve navigation performance <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2,</ref><ref type="bibr" target="#b2">3]</ref>. Despite these advances, generating realistic and contextually grounded instructions remains a challenge. Traditional approaches often rely on architectures, such as GPT-2 and BERT, which may lack the flexibility and expressive power of newer large language models (LLMs). To address this, we introduce AIGeN-Llama, an adversarial framework designed to leverage the advanced generative and discriminative capabilities of Llama2, a state-of-the-art LLM.</p><p>AIGeN-Llama builds on the principles of adversarial learning, employing Llama2 as both the instruction generator and discriminator (see Fig. <ref type="figure" target="#fig_0">1</ref> for an overview). The generator produces detailed navigation instructions based on image trajectories, while the discriminator evaluates the authenticity and alignment of these instructions with ground-truth data. This adversarial interplay pushes the</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Encoder Decoder</head><p>Real / Fake generator to create more realistic and nuanced instructions and also equips the discriminator to refine its ability to distinguish between synthetic and ground-truth instructions.</p><p>The motivation for adopting Llama2 lies in its demonstrated ability to excel in a variety of complex generative and understanding tasks, supported by its large-scale pretraining and fine-tuning on diverse datasets. By integrating Llama2 into an adversarial framework, AIGeN-Llama seeks to overcome the limitations of previous architectures, generating more relevant synthetic instructions. To quantitatively evaluate AIGeN-Llama, we use metrics that are commonly used for image description, namely, BLEU, METEOR, ROUGE, CIDEr and SPICE. In addition, we present some qualitative samples that show the ability of AIGeN-Llama to generate reasonable instructions. Our approach sets a new standard in VLN instruction generation and demonstrates the broader applicability of Llama2 in embodied AI systems.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>The field of Vision-and-Language Navigation (VLN) has seen significant advancements in recent years, driven by innovations in both data augmentation and model architectures. AIGeN-Llama builds upon these developments, addressing challenges in synthetic instruction generation and adversarial learning.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Vision and Language Navigation (VLN)</head><p>Vision-and-Language Navigation (VLN) is a challenging task requiring agents to navigate in 3D environments guided by natural language instructions. The Room-to-Room (R2R) dataset by Anderson et al. <ref type="bibr" target="#b3">[4]</ref> established a benchmark for VLN, pairing navigation trajectories with human-written instructions. While early works on VLN focused on sequence-to-sequence long short-term memory model for action inference, recent works rely on Transformers <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b6">7]</ref>. Graph-based methods where graphs are used to model relations between scene, object and instructions <ref type="bibr" target="#b7">[8]</ref> or the use of topological maps <ref type="bibr" target="#b8">[9]</ref> have also been introduced recently.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Instruction Generation for VLN</head><p>Instruction generation has emerged as a critical task for enhancing VLN datasets. Anderson et al. introduced the Room-to-Room (R2R) dataset, which paired human-authored instructions with trajectories, but highlighted the challenge of scaling such datasets due to the cost of manual annotation <ref type="bibr" target="#b3">[4]</ref>.</p><p>Recent efforts have explored generating synthetic instructions to augment VLN datasets. For instance, Speaker-Follower models <ref type="bibr" target="#b9">[10]</ref> synthesized path descriptions but often produced overly simplistic or repetitive instructions. Other research studies generate instructions by sampling random trajectories, leveraging online rental marketplaces <ref type="bibr" target="#b1">[2]</ref> and large-scale datasets of indoor environments <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b10">11,</ref><ref type="bibr" target="#b2">3]</ref>. These methods emphasize the need for high-quality synthetic data to improve the generalization capabilities of navigation agents.</p><formula xml:id="formula_0">Llama2 Discriminator Llama2 Generator RN-152 !"# $ $%&amp; % BoS Go Real Instruction Generated Instruction Go ℒ ! ℒ " update update … … $%&amp; &amp; … !"# !"# !"# % … … to … … … to the … !"# $ $%&amp; % BoS Go $%&amp; &amp; … !"# !"# !"# % … to … … … …</formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Large Language Models (LLMs) in VLN</head><p>The advent of large-scale pretrained language models, such as GPT and BERT, has had a significant impact on VLN tasks. Recent studies have incorporated GPT-based decoders to generate instructions and BERT-based encoders to contextualize trajectories <ref type="bibr" target="#b2">[3]</ref>. However, these models often lack the versatility and power of newer LLMs, such as Llama2, which excel at capturing long-range dependencies and generating more coherent text. AIGeN-Llama leverages Llama2 for both generative and discriminative roles. Its superior performance in language modeling enables the generation of nuanced and contextually relevant instructions, surpassing prior architectures in quality.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4.">Adversarial Learning</head><p>Adversarial learning, popularized by Generative Adversarial Networks (GANs) <ref type="bibr" target="#b11">[12]</ref>, has been widely adopted to improve synthetic data generation across various domains, including images, text, and audio. In instruction generation, adversarial learning ensures that generated outputs closely mimic human-like text. Works like <ref type="bibr" target="#b12">[13,</ref><ref type="bibr" target="#b13">14]</ref> demonstrated the potential of adversarial training for text generation To overcome the problem of gradient propagation for discrete outputs, techniques like the Gumbel-Softmax trick <ref type="bibr" target="#b13">[14]</ref> were introduced to approximate differentiable sampling. AIGeN-Llama adopts this approach, allowing Llama2 to generate high-quality instructions in an adversarial setting. The discriminator, also powered by Llama2, effectively distinguishes between real and synthetic instructions, pushing the generator toward greater realism and alignment with human-authored data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Method</head><p>AIGeN-Llama is an adversarial framework that leverages Llama2 as both a generator and a discriminator to produce realistic and high-quality navigation instructions for VLN. Unlike previous approaches that rely on GPT-2 and BERT, AIGeN-Llama utilizes Llama2's advanced language capabilities to generate more relevant instructions. See Fig. <ref type="figure" target="#fig_1">2</ref> for the schema of the overall model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Llama2 Generator</head><p>The generator is responsible for creating synthetic instructions based on sequences of images that represent navigation trajectories. It processes the input visual data and sequentially generates tokens, crafting instructions in natural language that guide the agent along the given trajectory.</p><p>The general approach is as follows. First, the images of the trajectory are fed into a pretrained ResNet-152 to extract the visual features. Next, all objects in the last image of the trajectory are detected using Mask2Former <ref type="bibr" target="#b14">[15]</ref> trained on ADE20K. This is essential to enrich the visual representation. The visual features along with the object names are fed into the Llama2 decoder as input. This is followed by the BOS token which is used by the model as an indication to start generating the instruction for the given trajectory. The Llama2 decoder is trained to predict the next token and predicts autoregressively until it reaches the EOS token. Formally,</p><formula xml:id="formula_1">𝑦 = Llama2 (︂[︂ 𝑣 0 , .., 𝑣 𝑡Images , 𝑜 𝑡𝑔𝑡 , 𝑜 0 .., 𝑜 𝑛 , Objects BOS, 𝑖 1 , .., 𝑖 𝑚 , Instruction EOS ]︂)︂<label>(1)</label></formula><p>where (𝑣 0 , ..., 𝑣 𝑡 ) denotes the set of visual features for images of the trajectory, 𝑜 tgt indicates the target object label, (𝑜 0 , ..., 𝑜 𝑛 ) denote the names of the objects in the last image, BOS and EOS are begin of string and end of string tokens respectively. Consequently, (𝑖 1 , ..., 𝑖 𝑚 ) denotes the tokens that correspond to the instruction.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Llama2 Discriminator</head><p>Another Llama2 model that acts as a discriminator evaluates whether the generated instruction matches the visual trajectory and aligns with real human instructions. This component ensures that the generated instructions are realistic and contextually accurate. The purpose of the discriminator is to perform a classification task between real and fake instructions. Here, the ground truth instructions are referred to as real instructions, whereas the instructions generated by the Llama2 decoder are fake. Binary cross-entropy loss is used to minimize the error between the actual output and the generated output (real or fake).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Adversarial Training using Gumbel Softmax</head><p>The generator and discriminator are trained simultaneously in a competitive setup. The generator aims to produce instructions that are indistinguishable from ground-truth human instructions, fooling the discriminator. It minimizes a loss function on the basis of how "realistic" its outputs are judged to be. The discriminator is trained to differentiate between real human-written instructions and synthetic instructions generated by the model. It minimizes a binary cross-entropy loss that measures its ability to correctly classify instructions as real or fake. Gumbel-Softmax is used to make the discrete token generation process differentiable, enabling backpropagation through the generator during adversarial training.</p><p>The generator loss is defined as:</p><formula xml:id="formula_2">ℒ 𝐺 = − log(𝐷(𝐼 𝐺 , 𝑥)),<label>(2)</label></formula><p>where 𝐼 𝐺 ∈ 𝐺(𝑥) is the generated instruction and 𝑥 is the sequence of images belonging to the trajectory. The discriminator loss is:</p><formula xml:id="formula_3">ℒ 𝐷 = − log(1 − 𝐷(𝐼 𝐺 , 𝑥)) − log(𝐷(𝐼 𝑅 , 𝑥)),<label>(3)</label></formula><p>where 𝐼 𝑅 ∈ 𝑅(𝑥) is the ground-truth instruction. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experiments</head><p>We evaluated AIGeN-Llama on a widely used VLN dataset, REVERIE. In REVERIE, navigation sequences are composed of 360°images that are collected at the nodes of navigation graphs in Matterport3D environments <ref type="bibr" target="#b15">[16]</ref>. Each navigation sequence requires agents to identify and interact with specific objects at the target location, adding complexity to the task. Only the frontal view of the 360°images, with a field of view of 60°is considered. For evaluation, we follow the standard split of training, validation seen, and validation unseen environments provided by the datasets. The training of AIGeN-Llama uses a learning rate of 0.2𝑒 − 3 for the generator and 0.2𝑒 − 2 for the discriminator, a batch size of 1, and Adam <ref type="bibr" target="#b16">[17]</ref> as the optimizer. We use a pretrained Llama2 7B chat model for the generator and a pretrained Open Llama 3B model for the discriminator. The visual features used by the model are extracted using ResNet-152. Both the generator and the discriminator are individually trained before training them in an adversarial manner. This is done to ensure that the generator is already able to generate somewhat relevant instructions when trained together with the discriminator in an adversarial manner. Although the batch size is 1, we accumulate the gradients and update the optimizer every 48 steps. During the evaluation, the discriminator of the model is dropped, and the instructions are generated using the trained generator only.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Quantitative Results</head><p>To evaluate the improvements introduced by AIGeN-Llama over its predecessor, AIGeN <ref type="bibr" target="#b2">[3]</ref>, we conduct a detailed comparison of the quality of generated instructions in terms of both descriptive richness and alignment with the input trajectories. The comparison focuses on two key aspects: instruction realism and contextual relevance to visual data. The comparison uses the standard image description metrics <ref type="bibr" target="#b17">[18]</ref>, namely BLEU <ref type="bibr" target="#b18">[19]</ref>, METEOR <ref type="bibr" target="#b19">[20]</ref>, ROUGE <ref type="bibr" target="#b20">[21]</ref>, CIDEr <ref type="bibr" target="#b21">[22]</ref>, and SPICE <ref type="bibr" target="#b22">[23]</ref>. All these metrics are obtained by comparing the predicted instruction with the ground-truth instruction in terms of their n-grams (where an n-gram is a sequence of n consecutive words). While all these metrics are commonly used for evaluating cross-modal description, only CIDEr and SPICE have been specifically designed for this task. The others (BLEU, METEOR, and ROUGE) have indeed been proposed for evaluating translation and summarization. According to recent literature, CIDEr showcases the best alignment with human judgment <ref type="bibr" target="#b21">[22]</ref>. As can be seen in Table <ref type="table" target="#tab_0">1</ref>, the metrics related to ROUGE, CIDEr, and SPICE are considerably higher for AIGeN-Llama than for AIGeN. Although AIGeN-Llama has lower BLEU and ROUGE scores compared to AIGeN, it's important to note that these metrics were originally designed for machine translation, where nearly exact word-for-word matches are expected. Low BLEU and METEOR scores alongside high CIDEr, ROUGE, and SPICE scores suggest that while the generated captions may not match the reference texts in wording or exact phrasing, they are capturing the core semantic content effectively.   correctly. In the third example, 'kitchen' is recognized as a 'dining room' and 'stool' is recognized as a 'chair'. Looking at the last image of the trajectory (c), it is understandable that there is no clear boundary segregating the kitchen and the dining table. Moreover, 'chair' and 'stool' are quite close to each other in terminology, and hence, it is easy to confuse the two.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Qualitative Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusions and Future Works</head><p>In this work, we introduced AIGeN-Llama, a novel adversarial framework for generating high-quality, and realistic instructions in VLN. Using the advanced generative and discriminative capabilities of the Llama2 language model, AIGeN-Llama addresses key limitations of previous works, including excessive reliance on human-annotated data. The adversarial setup, where Llama2 serves as both a generator and a discriminator, enables the generation of synthetic instructions that closely align with human-authored text while maintaining descriptive precision. Our experiments demonstrate that AIGeN-Llama outperforms previous models like AIGeN on multiple evaluation metrics, namely ROUGE, CIDEr, and SPICE. This shows that AIGeN-Llama is capable of capturing the core semantic content effectively. In the future, we would like to test if the AIGeN-Llama helps to improve the navigation performance.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: We present the overview of AIGeN-Llama, an adversarial framework that utilizes Llama2 models for instruction generation. AIGeN-Llama consists of a Llama2 encoder and a Llama2 decoder. Llama2 decoder act as a generator and Llama2 encoder act as a discriminator. Both the generator and the decoder are trained simultaneously to generate instructions corresponding to the given sequence of images.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Schema of the proposed generative-adversarial framework for synthetic instruction generation.The Llama2 decoder acts as a generator while the Llama2 encoder acts as a discriminator. The generator generates fake instructions token-by-token until it reaches the EOS token. The discriminator must detect whether the instructions corresponding to a given sequence of images are real (ground truth) or fake (generated).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Fig. 3</head><label>3</label><figDesc>Fig.3shows three qualitative samples in which the instructions generated by AIGeN-Llama are compared with the ground-truth instructions. All three samples have been taken from the "unseen" validation split of REVERIE, so that AIGeN-Llama has never seen these environments during training. The first two examples (a) and (b) are positive, while the latter is negative. In the first and second examples, both the goal rooms (dining room and living room) and the target objects (plant in both cases) are recognized</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head></head><label></label><figDesc>(a) GT: Go to the dining room on level 1 with round table and center the plant on the table.AIGeN-Llama: Go to the dining room and water the plant.(b) GT: Enter the living room and pick up the potted plant.AIGeN-Llama: Go to the living room and water the plant.(c) GT: Pull out the second stool from the left side in the kitchen.AIGeN-Llama: Go to the dining room and pull out the chair on your left.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Sample image sequences from REVERIE Val Unseen split with corresponding ground-truth instruction and synthetic instructions generated using AIGeN-Llama. The images in each sequence have been reduced to 6 to facilitate the graphical presentation and we only show the frontal image of the panoramic observation at each timestep.</figDesc><graphic coords="6,82.97,234.98,70.63,52.98" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Image description experiments comparison of AIGeN-Llama with AIGeN<ref type="bibr" target="#b2">[3]</ref> </figDesc><table><row><cell></cell><cell></cell><cell></cell><cell>Val Seen</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>Val Unseen</cell><cell></cell><cell></cell></row><row><cell>Model</cell><cell cols="10">BLEU-1 METEOR ROUGE CIDEr SPICE BLEU-1 METEOR ROUGE CIDEr SPICE</cell></row><row><cell>AIGeN</cell><cell>48.4</cell><cell>22.8</cell><cell>46.5</cell><cell>89.0</cell><cell>32.9</cell><cell>42.1</cell><cell>17.9</cell><cell>39.3</cell><cell>48.6</cell><cell>22.8</cell></row><row><cell>AIGeN-Llama</cell><cell>35.6</cell><cell>21.8</cell><cell>53.8</cell><cell>117.6</cell><cell>41.3</cell><cell>26.3</cell><cell>17.1</cell><cell>44.4</cell><cell>81.9</cell><cell>33.2</cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>The authors were supported by Marie Sklodowska-Curie Action Horizon 2020 (Grant agreement No. 955778) for the project "Personalized Robotics as Service Oriented Applications" ("PERSEO") and "Fit for Medical Robotics" ("Fit4MedRob") project, funded by the Italian Ministry of University and Research.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Learning from Unlabeled 3D Environments for Vision-and-Language Navigation</title>
		<author>
			<persName><forename type="first">S</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P.-L</forename><surname>Guhur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Tapaswi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Schmid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Laptev</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the European Conference on Computer Vision</title>
				<meeting>the European Conference on Computer Vision</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Airbert: In-Domain Pretraining for Visionand-Language Navigation</title>
		<author>
			<persName><forename type="first">P.-L</forename><surname>Guhur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Tapaswi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Laptev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Schmid</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</title>
				<meeting>the IEEE/CVF Conference on Computer Vision and Pattern Recognition</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Aigen: An adversarial approach for instruction generation in vln</title>
		<author>
			<persName><forename type="first">N</forename><surname>Rawal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Bigazzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Baraldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Cucchiara</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</title>
				<meeting>the IEEE/CVF Conference on Computer Vision and Pattern Recognition</meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="2070" to="2080" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments</title>
		<author>
			<persName><forename type="first">P</forename><surname>Anderson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Teney</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Bruce</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Johnson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Sünderhauf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Reid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gould</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Van Den Hengel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</title>
				<meeting>the IEEE/CVF Conference on Computer Vision and Pattern Recognition</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Scene-Intuitive Agent for Remote Embodied Visual Grounding</title>
		<author>
			<persName><forename type="first">X</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</title>
				<meeting>the IEEE/CVF Conference on Computer Vision and Pattern Recognition</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Multimodal Attention Networks for Low-Level Vision-and-Language Navigation</title>
		<author>
			<persName><forename type="first">F</forename><surname>Landi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Baraldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Cornia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Corsini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Cucchiara</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computer Vision and Image Understanding</title>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">History Aware Multimodal Transformer for Vision-and-Language Navigation</title>
		<author>
			<persName><forename type="first">S</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P.-L</forename><surname>Guhur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Schmid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Laptev</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Language and Visual Entity Relationship Graph for Agent Navigation</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Hong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Rodriguez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Qi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gould</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation</title>
		<author>
			<persName><forename type="first">S</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P.-L</forename><surname>Guhur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Tapaswi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Schmid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Laptev</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</title>
				<meeting>the IEEE/CVF Conference on Computer Vision and Pattern Recognition</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Speaker-Follower Models for Vision-and-Language Navigation</title>
		<author>
			<persName><forename type="first">D</forename><surname>Fried</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Cirik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rohrbach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Andreas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L.-P</forename><surname>Morency</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Berg-Kirkpatrick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Saenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Klein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Darrell</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">A New Path: Scaling Vision-and-Language Navigation with Synthetic Instructions and Imitation Learning</title>
		<author>
			<persName><forename type="first">A</forename><surname>Kamath</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Anderson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">Y</forename><surname>Koh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ku</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Waters</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Baldridge</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Parekh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</title>
				<meeting>the IEEE/CVF Conference on Computer Vision and Pattern Recognition</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Generative Adversarial Nets</title>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">J</forename><surname>Goodfellow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pouget-Abadie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mirza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Warde-Farley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ozair</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">C</forename><surname>Courville</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bengio</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Towards Diverse and Natural Image Descriptions via a Conditional GAN</title>
		<author>
			<persName><forename type="first">B</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Fidler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Urtasun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Lin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF International Conference on Computer Vision</title>
				<meeting>the IEEE/CVF International Conference on Computer Vision</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training</title>
		<author>
			<persName><forename type="first">R</forename><surname>Shetty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rohrbach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Hendricks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Fritz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Schiele</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF International Conference on Computer Vision</title>
				<meeting>the IEEE/CVF International Conference on Computer Vision</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Masked-attention Mask Transformer for Universal Image Segmentation</title>
		<author>
			<persName><forename type="first">B</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Misra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">G</forename><surname>Schwing</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kirillov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Girdhar</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</title>
				<meeting>the IEEE/CVF Conference on Computer Vision and Pattern Recognition</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Matterport3D: Learning from RGB-D Data in Indoor Environments</title>
		<author>
			<persName><forename type="first">A</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Funkhouser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Halber</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Niessner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Savva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zeng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the International Conference on 3D Vision</title>
				<meeting>the International Conference on 3D Vision</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Adam: A Method for Stochastic Optimization</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">P</forename><surname>Kingma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ba</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the International Conference on Learning Representations</title>
				<meeting>the International Conference on Learning Representations</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">From Show to Tell: A Survey on Deep Learning-based Image Captioning</title>
		<author>
			<persName><forename type="first">M</forename><surname>Stefanini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Cornia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Baraldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Cascianelli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Fiameni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Cucchiara</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Pattern Analysis and Machine Intelligence</title>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">BLEU: a Method for Automatic Evaluation of Machine Translation</title>
		<author>
			<persName><forename type="first">K</forename><surname>Papineni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Roukos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ward</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-J</forename><surname>Zhu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Annual Meeting of the Association for Computational Linguistics</title>
				<meeting>the Annual Meeting of the Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments</title>
		<author>
			<persName><forename type="first">S</forename><surname>Banerjee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lavie</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Annual Meeting of the Association for Computational Linguistics Workshops</title>
				<meeting>the Annual Meeting of the Association for Computational Linguistics Workshops</meeting>
		<imprint>
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">ROUGE: A Package for Automatic Evaluation of Summaries</title>
		<author>
			<persName><forename type="first">C.-Y</forename><surname>Lin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Annual Meeting of the Association for Computational Linguistics Workshops</title>
				<meeting>the Annual Meeting of the Association for Computational Linguistics Workshops</meeting>
		<imprint>
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">CIDEr: Consensus-based Image Description Evaluation</title>
		<author>
			<persName><forename type="first">R</forename><surname>Vedantam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Lawrence Zitnick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Parikh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</title>
				<meeting>the IEEE/CVF Conference on Computer Vision and Pattern Recognition</meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">SPICE: Semantic Propositional Image Caption Evaluation</title>
		<author>
			<persName><forename type="first">P</forename><surname>Anderson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Fernando</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Johnson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gould</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the European Conference on Computer Vision</title>
				<meeting>the European Conference on Computer Vision</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
