<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>IRCDL</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>AIGeN-Llama: An Adversarial Approach for Instruction Generation in VLN using Llama2 Model</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Niyati Rawal</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lorenzo Baraldi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rita Cucchiara</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Modena and Reggio Emilia</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>21</volume>
      <fpage>20</fpage>
      <lpage>21</lpage>
      <abstract>
<p>Vision-and-Language Navigation (VLN) aims to train a robot to perceive the surrounding environment and follow human instructions. In the context of Digital Libraries, such agents hold transformative potential for assisting users in navigating large, multi-modal repositories and in interpreting and connecting spatial, visual, and textual data. However, training agents to follow human-like instructions in unknown environments remains a significant challenge, largely due to the scarcity of labeled training data. To address this, we propose AIGeN-Llama, an adversarial framework that utilizes Llama2 models for instruction generation. The Llama2 generator synthesizes navigation instructions by processing image sequences, while a Llama2 discriminator assesses the authenticity of these instructions against ground-truth data. This adversarial training enhances the realism of the generated instructions. We quantitatively evaluate the proposed model using metrics that are commonly used for image description, namely BLEU, METEOR, ROUGE, CIDEr, and SPICE. In addition, we show some qualitative samples to demonstrate the effectiveness of our method. The experiments highlight the flexibility and capability of Llama2 as both a generator and a discriminator, demonstrating its potential to advance embodied VLN tasks.</p>
      </abstract>
      <kwd-group>
        <kwd>vision</kwd>
        <kwd>language</kwd>
        <kwd>navigation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Vision-and-Language Navigation (VLN) represents a critical frontier in embodied AI, where agents are
tasked with navigating unfamiliar environments based on natural language instructions. Beyond its
traditional applications in assistive robotics and autonomous systems, VLN holds significant promise
for enhancing digital libraries by enabling more intuitive, interactive, and accessible ways of exploring
complex, multi-modal repositories. For instance, VLN agents could guide users through immersive
virtual archives or assist in retrieving spatially or thematically relevant digital content using
conversational queries. Currently, the development of robust VLN agents remains hindered by the scarcity of
large-scale, high-quality datasets that pair trajectories with human instructions. This limitation not only
affects generalization to unseen environments, a core requirement for real-world deployment, but also
constrains the potential integration of VLN technologies into innovative digital library applications.</p>
      <p>
        Recent studies have shown that augmenting training datasets with synthetic instructions can improve
navigation performance [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ]. Despite these advances, generating realistic and contextually grounded
instructions remains a challenge. Traditional approaches often rely on architectures such as GPT-2 and
BERT, which may lack the flexibility and expressive power of newer large language models (LLMs). To
address this, we introduce AIGeN-Llama, an adversarial framework designed to leverage the advanced
generative and discriminative capabilities of Llama2, a state-of-the-art LLM.
      </p>
      <p>AIGeN-Llama builds on the principles of adversarial learning, employing Llama2 as both the
instruction generator and discriminator (see Fig. 1 for an overview). The generator produces detailed
navigation instructions based on image trajectories, while the discriminator evaluates the authenticity
and alignment of these instructions with ground-truth data. This adversarial interplay pushes the
generator to create more realistic and nuanced instructions and also equips the discriminator to refine
its ability to distinguish between synthetic and ground-truth instructions.</p>
      <p>[Figure 1: Overview of AIGeN-Llama. An encoder-decoder generator produces a navigation instruction (e.g., “Go to the bedroom and …”), which the discriminator classifies as real or fake.]</p>
      <p>The motivation for adopting Llama2 lies in its demonstrated ability to excel in a variety of complex
generative and understanding tasks, supported by its large-scale pretraining and fine-tuning on diverse
datasets. By integrating Llama2 into an adversarial framework, AIGeN-Llama seeks to overcome the
limitations of previous architectures, generating more relevant synthetic instructions. To quantitatively
evaluate AIGeN-Llama, we use metrics that are commonly used for image description, namely, BLEU,
METEOR, ROUGE, CIDEr and SPICE. In addition, we present some qualitative samples that show the
ability of AIGeN-Llama to generate reasonable instructions. Our approach sets a new standard in VLN
instruction generation and demonstrates the broader applicability of Llama2 in embodied AI systems.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>The field of Vision-and-Language Navigation (VLN) has seen significant advancements in recent years,
driven by innovations in both data augmentation and model architectures. AIGeN-Llama builds upon
these developments, addressing challenges in synthetic instruction generation and adversarial learning.</p>
      <sec id="sec-2-1">
        <title>2.1. Vision and Language Navigation (VLN)</title>
        <p>
          Vision-and-Language Navigation (VLN) is a challenging task requiring agents to navigate in 3D
environments guided by natural language instructions. The Room-to-Room (R2R) dataset by Anderson et al. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]
established a benchmark for VLN, pairing navigation trajectories with human-written instructions.
While early works on VLN focused on sequence-to-sequence long short-term memory (LSTM) models for action
inference, recent works rely on Transformers [
          <xref ref-type="bibr" rid="ref5 ref6 ref7">5, 6, 7</xref>
          ]. Graph-based methods, in which graphs are used to
model the relations between the scene, objects, and instructions [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], as well as approaches based on topological maps [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], have also
been introduced recently.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Instruction Generation for VLN</title>
        <p>
          Instruction generation has emerged as a critical task for enhancing VLN datasets. Anderson et al.
introduced the Room-to-Room (R2R) dataset, which paired human-authored instructions with trajectories,
but highlighted the challenge of scaling such datasets due to the cost of manual annotation [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
        <p>
          Recent efforts have explored generating synthetic instructions to augment VLN datasets. For instance,
Speaker-Follower models [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] synthesized path descriptions but often produced overly simplistic or
repetitive instructions. Other research studies generate instructions by sampling random trajectories,
leveraging online rental marketplaces [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] and large-scale datasets of indoor environments [
          <xref ref-type="bibr" rid="ref1 ref11 ref3">1, 11, 3</xref>
          ].
        </p>
        <p>[Figure 2: Schema of the overall model. The Llama2 generator produces an instruction (“Go to the …”), and the discriminator compares real and generated instructions; the resulting loss ℒ is used to update the generator.]</p>
        <p>These methods emphasize the need for high-quality synthetic data to improve the generalization
capabilities of navigation agents.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Large Language Models (LLMs) in VLN</title>
        <p>
          The advent of large-scale pretrained language models, such as GPT and BERT, has had a significant
impact on VLN tasks. Recent studies have incorporated GPT-based decoders to generate instructions and
BERT-based encoders to contextualize trajectories [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. However, these models often lack the versatility
and power of newer LLMs, such as Llama2, which excel at capturing long-range dependencies and
generating more coherent text.
        </p>
        <p>AIGeN-Llama leverages Llama2 for both generative and discriminative roles. Its superior
performance in language modeling enables the generation of nuanced and contextually relevant instructions,
surpassing prior architectures in quality.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Adversarial Learning</title>
        <p>
          Adversarial learning, popularized by Generative Adversarial Networks (GANs) [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], has been widely
adopted to improve synthetic data generation across various domains, including images, text, and audio.
In instruction generation, adversarial learning ensures that generated outputs closely mimic human-like
text. Works like [
          <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
          ] demonstrated the potential of adversarial training for text generation. To
overcome the problem of gradient propagation for discrete outputs, techniques like the Gumbel-Softmax
trick [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] were introduced to approximate differentiable sampling. AIGeN-Llama adopts this approach,
allowing Llama2 to generate high-quality instructions in an adversarial setting. The discriminator,
also powered by Llama2, effectively distinguishes between real and synthetic instructions, pushing the
generator toward greater realism and alignment with human-authored data.
        </p>
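        <p>For reference, the Gumbel-Softmax relaxation mentioned above takes the following standard form (a well-known formulation, not specific to this paper). Given token probabilities (π_1, …, π_V) over a vocabulary of size V, a differentiable soft sample y is obtained as</p>
        <p>y_i = exp((log π_i + g_i) / τ) / Σ_j exp((log π_j + g_j) / τ),   g_i = − log(− log(u_i)),   u_i ∼ Uniform(0, 1),</p>
        <p>where τ is a temperature: as τ → 0, y approaches a discrete one-hot sample, while larger values of τ give smoother, easier-to-optimize distributions.</p>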
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <p>AIGeN-Llama is an adversarial framework that leverages Llama2 as both a generator and a discriminator
to produce realistic and high-quality navigation instructions for VLN. Unlike previous approaches that
rely on GPT-2 and BERT, AIGeN-Llama utilizes Llama2’s advanced language capabilities to generate
more relevant instructions. See Fig. 2 for the schema of the overall model.</p>
      <sec id="sec-3-1">
        <title>3.1. Llama2 Generator</title>
        <p>The generator is responsible for creating synthetic instructions based on sequences of images that
represent navigation trajectories. It processes the input visual data and sequentially generates tokens,
crafting instructions in natural language that guide the agent along the given trajectory.</p>
        <p>
          The general approach is as follows. First, the images of the trajectory are fed into a pretrained
ResNet-152 to extract the visual features. Next, all objects in the last image of the trajectory are detected
using Mask2Former [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] trained on ADE20K. This is essential to enrich the visual representation. The
visual features along with the object names are fed into the Llama2 decoder as input. This is followed
by the BOS token, which is used by the model as an indication to start generating the instruction for the
given trajectory. The Llama2 decoder is trained to predict the next token and predicts autoregressively
until it reaches the EOS token. Formally,
        </p>
        <p>(w_1, …, w_T) = Llama2([v_0, …, v_N, tgt, o_0, …, o_M, BOS, w_1, …, w_T, EOS]),   (1)</p>
        <p>where (v_0, …, v_N) denotes the set of visual features for the images of the trajectory, tgt indicates the target
object label, (o_0, …, o_M) denotes the names of the objects in the last image, and BOS and EOS are the begin-of-string
and end-of-string tokens, respectively. Consequently, (w_1, …, w_T) denotes the tokens that
correspond to the instruction.</p>
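        <p>To make the pipeline concrete, the following minimal sketch (our illustration, not the authors' released code) shows how the visual features and object names could be assembled into the decoder input of Eq. (1) with the Hugging Face transformers API; the projection layer and the feature tensor are hypothetical placeholders.</p>
        <preformat>
# Illustrative sketch of Eq. (1): assembling visual features, object names,
# and BOS into the Llama2 decoder input, then decoding until EOS.
# `visual_proj` and `feats` are hypothetical, not from the paper.
import torch
from transformers import AutoTokenizer, LlamaForCausalLM

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

feats = torch.randn(6, 2048)  # ResNet-152 features v_0..v_N, precomputed offline
visual_proj = torch.nn.Linear(2048, model.config.hidden_size)  # to token space

# Target object label and Mask2Former object names for the last image, as text.
prompt_ids = tok("target: plant; objects: table, chair, plant.",
                 return_tensors="pt").input_ids
embed = model.get_input_embeddings()
inputs_embeds = torch.cat(
    [visual_proj(feats).unsqueeze(0),                 # v_0 .. v_N
     embed(prompt_ids),                               # tgt and o_0 .. o_M
     embed(torch.tensor([[tok.bos_token_id]]))],      # BOS starts the instruction
    dim=1)

# Autoregressive generation until EOS, as in Eq. (1).
out = model.generate(inputs_embeds=inputs_embeds, max_new_tokens=40)
print(tok.decode(out[0], skip_special_tokens=True))
        </preformat>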
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Llama2 Discriminator</title>
        <p>Another Llama2 model that acts as a discriminator evaluates whether the generated instruction matches
the visual trajectory and aligns with real human instructions. This component ensures that the generated
instructions are realistic and contextually accurate. The purpose of the discriminator is to perform a
classification task between real and fake instructions. Here, the ground-truth instructions are referred
to as real instructions, whereas the instructions generated by the Llama2 decoder are fake. A binary
cross-entropy loss is used to train the discriminator to correctly classify each instruction as real
or fake.</p>
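        <p>A minimal sketch of this objective (our illustration; disc stands for any model that scores an instruction-trajectory pair, which the experiments instantiate with an Open Llama 3B model):</p>
        <preformat>
# Binary real/fake classification with cross-entropy, as described above.
# `disc`, `real_pair`, and `fake_pair` are placeholders for this sketch.
import torch

bce = torch.nn.BCEWithLogitsLoss()

def discriminator_loss(disc, real_pair, fake_pair):
    real_logit = disc(*real_pair)  # (ground-truth instruction, images)
    fake_logit = disc(*fake_pair)  # (generated instruction, images)
    # Real pairs are labeled 1, generated (fake) pairs are labeled 0.
    return (bce(real_logit, torch.ones_like(real_logit)) +
            bce(fake_logit, torch.zeros_like(fake_logit)))
        </preformat>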
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Adversarial Training using Gumbel Softmax</title>
        <p>The generator and discriminator are trained simultaneously in a competitive setup. The generator aims
to produce instructions that are indistinguishable from ground-truth human instructions, fooling the
discriminator. It minimizes a loss function on the basis of how “realistic” its outputs are judged to be.
The discriminator is trained to diferentiate between real human-written instructions and synthetic
instructions generated by the model. It minimizes a binary cross-entropy loss that measures its ability
to correctly classify instructions as real or fake. Gumbel-Softmax is used to make the discrete token
generation process diferentiable, enabling backpropagation through the generator during adversarial
training.</p>
        <p>The generator loss is defined as:</p>
        <p>ℒ_G = − log(D(x̂, I)),   (2)</p>
        <p>where x̂ ∼ G(I) is the generated instruction and I is the sequence of images belonging to the
trajectory.</p>
        <p>The discriminator loss is:</p>
        <p>ℒ_D = − log(1 − D(x̂, I)) − log(D(x, I)),   (3)</p>
        <p>where x is the ground-truth instruction.</p>
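        <p>The following sketch (our construction under the stated assumptions; disc_embed and disc_score are hypothetical components of the discriminator) illustrates how the Gumbel-Softmax relaxation lets the discriminator's judgment backpropagate into the generator:</p>
        <preformat>
# Generator update of Eq. (2) with Gumbel-Softmax (illustrative sketch).
import torch
import torch.nn.functional as F

def generator_loss(gen_logits, disc_embed, disc_score, images):
    # gen_logits: (T, V) next-token logits from the Llama2 generator.
    # hard=True gives one-hot tokens in the forward pass and soft
    # (straight-through) gradients in the backward pass.
    soft_tokens = F.gumbel_softmax(gen_logits, tau=1.0, hard=True)  # (T, V)
    # Mix the discriminator's embedding rows with the soft weights so the
    # whole generator-to-discriminator path stays differentiable.
    fake_embeds = soft_tokens @ disc_embed.weight                   # (T, H)
    d_fake = disc_score(fake_embeds, images)        # scalar logit D(x_hat, I)
    # Eq. (2): reward the generator when D believes the instruction is real.
    return -torch.log(torch.sigmoid(d_fake) + 1e-8)
        </preformat>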
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>
        We evaluated AIGeN-Llama on a widely used VLN dataset, REVERIE. In REVERIE, navigation sequences
are composed of 360° images that are collected at the nodes of navigation graphs in Matterport3D
environments [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Each navigation sequence requires agents to identify and interact with specific
objects at the target location, adding complexity to the task. Only the frontal view of the 360° images,
with a field of view of 60°, is considered. For evaluation, we follow the standard split of training, validation
seen, and validation unseen environments provided by the dataset. The training of AIGeN-Llama
uses a learning rate of 2e−3 for the generator and 2e−2 for the discriminator, a batch size of
1, and Adam [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] as the optimizer. We use a pretrained Llama2 7B chat model for the generator and
a pretrained Open Llama 3B model for the discriminator. The visual features used by the model are
extracted using ResNet-152. Both the generator and the discriminator are individually trained before
training them in an adversarial manner. This is done to ensure that the generator is already able to
generate somewhat relevant instructions when trained together with the discriminator in an adversarial
manner. Although the batch size is 1, we accumulate the gradients and perform an optimizer update every
48 steps. During the evaluation, the discriminator of the model is dropped, and the instructions are
generated using the trained generator only.
      </p>
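      <p>The accumulation schedule described above amounts to the following loop (a generic sketch; loss_fn, loader, and opt are placeholders, not names from the paper):</p>
      <preformat>
# Gradient accumulation: batch size 1, one optimizer update every 48 samples.
ACCUM_STEPS = 48

def train_epoch(model, loader, loss_fn, opt):
    opt.zero_grad()
    for step, batch in enumerate(loader):
        loss = loss_fn(model, batch) / ACCUM_STEPS  # scale so the sum averages
        loss.backward()                             # gradients add up in .grad
        if (step + 1) % ACCUM_STEPS == 0:
            opt.step()                              # update once per 48 steps
            opt.zero_grad()
      </preformat>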
      <sec id="sec-4-1">
        <title>4.1. Quantitative Results</title>
        <p>
          To evaluate the improvements introduced by AIGeN-Llama over its predecessor, AIGeN [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], we conduct
a detailed comparison of the quality of generated instructions in terms of both descriptive richness
and alignment with the input trajectories. The comparison focuses on two key aspects: instruction
realism and contextual relevance to visual data. The comparison uses the standard image description
metrics [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], namely BLEU [19], METEOR [20], ROUGE [21], CIDEr [22], and SPICE [23]. Most of these
metrics are obtained by comparing the predicted instruction with the ground-truth instruction in terms
of their n-grams (where an n-gram is a sequence of n consecutive words), while SPICE compares their
semantic propositional content. While all these metrics are
commonly used for evaluating cross-modal description, only CIDEr and SPICE have been specifically
designed for this task. The others (BLEU, METEOR, and ROUGE) were originally proposed for
evaluating translation and summarization. According to recent literature, CIDEr showcases the best
alignment with human judgment [22]. As can be seen in Table 1, the ROUGE, CIDEr,
and SPICE scores are considerably higher for AIGeN-Llama than for AIGeN. Although AIGeN-Llama has lower
BLEU and METEOR scores compared to AIGeN, it is important to note that these metrics were originally
designed for machine translation, where nearly exact word-for-word matches are expected. Low BLEU
and METEOR scores alongside high CIDEr, ROUGE, and SPICE scores suggest that while the generated
instructions may not match the reference texts in wording or exact phrasing, they capture the core
semantic content effectively.
        </p>
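        <p>Scores of this kind are commonly computed with the publicly available pycocoevalcap package (shown here as an assumed example, not necessarily the evaluation code used for Table 1; the METEOR and SPICE scorers additionally require a Java runtime):</p>
        <preformat>
# Computing description metrics between generated and ground-truth instructions.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.rouge.rouge import Rouge

# Both dicts map a sample id to a list of tokenized strings.
gts = {"0": ["go to the dining room and water the plant"]}   # ground truth
res = {"0": ["go to the dining room and center the plant"]}  # generated

for name, scorer in [("BLEU", Bleu(4)), ("ROUGE", Rouge()), ("CIDEr", Cider())]:
    score, _ = scorer.compute_score(gts, res)
    print(name, score)  # Bleu(4) returns a list with BLEU-1..BLEU-4
        </preformat>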
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Qualitative Results</title>
        <p>Fig. 3 shows three qualitative samples in which the instructions generated by AIGeN-Llama are compared
with the ground-truth instructions. All three samples have been taken from the “unseen” validation
split of REVERIE, so that AIGeN-Llama has never seen these environments during training. The first
two examples, (a) and (b), are positive, while the third is negative. In the first and second examples, both
the goal rooms (dining room and living room) and the target objects (a plant in both cases) are recognized
correctly. In the third example, ‘kitchen’ is recognized as a ‘dining room’ and ‘stool’ is recognized
as a ‘chair’. Looking at the last image of the trajectory (c), it is understandable that there is no clear
boundary segregating the kitchen and the dining table. Moreover, ‘chair’ and ‘stool’ are quite close to
each other in terminology, and hence, it is easy to confuse the two.</p>
        <p>(a) GT: Go to the dining room on level 1 with round table and center the plant on the table. AIGeN-Llama: Go to the dining room and water the plant.</p>
        <p>(b) GT: Enter the living room and pick up the potted plant. AIGeN-Llama: Go to the living room and water the plant.</p>
        <p>(c) GT: Pull out the second stool from the left side in the kitchen. AIGeN-Llama: Go to the dining room and pull out the chair on your left.</p>
        <p>Figure 3: Sample image sequences from the REVERIE Val Unseen split with the corresponding ground-truth
instruction and the synthetic instruction generated using AIGeN-Llama. The images in each sequence
have been reduced to 6 to facilitate the graphical presentation, and we only show the frontal image of
the panoramic observation at each timestep.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions and Future Works</title>
      <p>In this work, we introduced AIGeN-Llama, a novel adversarial framework for generating high-quality,
and realistic instructions in VLN. Using the advanced generative and discriminative capabilities of
the Llama2 language model, AIGeN-Llama addresses key limitations of previous works, including
excessive reliance on human-annotated data. The adversarial setup, where Llama2 serves as both
a generator and a discriminator, enables the generation of synthetic instructions that closely align
with human-authored text while maintaining descriptive precision. Our experiments demonstrate that
AIGeN-Llama outperforms previous models like AIGeN on multiple evaluation metrics, namely ROUGE,
CIDEr, and SPICE. This shows that AIGeN-Llama is capable of capturing the core semantic content
effectively. In the future, we would like to test whether AIGeN-Llama helps to improve navigation
performance.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>The authors were supported by the Marie Sklodowska-Curie Action Horizon 2020 (Grant agreement No.
955778) for the project “Personalized Robotics as Service Oriented Applications” (“PERSEO”) and by the “Fit for
Medical Robotics” (“Fit4MedRob”) project, funded by the Italian Ministry of University and Research.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.-L.</given-names>
            <surname>Guhur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tapaswi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schmid</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Laptev</surname>
          </string-name>
          ,
          <article-title>Learning from Unlabeled 3D Environments for Vision-and-Language Navigation</article-title>
          ,
          <source>in: Proceedings of the European Conference on Computer Vision</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.-L.</given-names>
            <surname>Guhur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tapaswi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          , I. Laptev,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schmid</surname>
          </string-name>
          ,
          <article-title>Airbert: In-Domain Pretraining for Vision-and-Language Navigation</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF International Conference on Computer Vision</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Rawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bigazzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Baraldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cucchiara</surname>
          </string-name>
          ,
          <article-title>AIGeN: An Adversarial Approach for Instruction Generation in VLN</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>2070</fpage>
          -
          <lpage>2080</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Anderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Teney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bruce</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Sünderhauf</surname>
          </string-name>
          , I. Reid,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gould</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Van Den Hengel</surname>
          </string-name>
          ,
          <article-title>Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>X.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Scene-Intuitive Agent for Remote Embodied Visual Grounding</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F.</given-names>
            <surname>Landi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Baraldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cornia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Corsini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cucchiara</surname>
          </string-name>
          ,
          <article-title>Multimodal Attention Networks for Low-Level Vision-and-Language Navigation</article-title>
          ,
          <source>Computer Vision and Image Understanding</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.-L.</given-names>
            <surname>Guhur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schmid</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Laptev</surname>
          </string-name>
          ,
          <article-title>History Aware Multimodal Transformer for Vision-and-Language Navigation</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rodriguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gould</surname>
          </string-name>
          ,
          <article-title>Language and Visual Entity Relationship Graph for Agent Navigation</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.-L.</given-names>
            <surname>Guhur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tapaswi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schmid</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Laptev</surname>
          </string-name>
          ,
          <article-title>Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D.</given-names>
            <surname>Fried</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Cirik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rohrbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Andreas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-P.</given-names>
            <surname>Morency</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Berg-Kirkpatrick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Saenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Klein</surname>
          </string-name>
          , T. Darrell,
          <article-title>Speaker-Follower Models for Vision-and-Language Navigation</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kamath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Anderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Y.</given-names>
            <surname>Koh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ku</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Waters</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Baldridge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Parekh</surname>
          </string-name>
          ,
          <article-title>A New Path: Scaling Vision-and-Language Navigation with Synthetic Instructions and Imitation Learning</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>I. J.</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pouget-Abadie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mirza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Warde-Farley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ozair</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Courville</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Generative Adversarial Nets</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>B.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Fidler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Urtasun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Towards Diverse and Natural Image Descriptions via a Conditional GAN</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF International Conference on Computer Vision</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>R.</given-names>
            <surname>Shetty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rohrbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. Anne</given-names>
            <surname>Hendricks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fritz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schiele</surname>
          </string-name>
          ,
          <article-title>Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF International Conference on Computer Vision</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>B.</given-names>
            <surname>Cheng</surname>
          </string-name>
          , I. Misra,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Schwing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kirillov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girdhar</surname>
          </string-name>
          ,
          <article-title>Masked-attention Mask Transformer for Universal Image Segmentation</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Funkhouser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Halber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Niessner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Savva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Matterport3D: Learning from RGB-D Data in Indoor Environments</article-title>
          ,
          <source>in: Proceedings of the International Conference on 3D Vision</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ba</surname>
          </string-name>
          ,
          <article-title>Adam: A Method for Stochastic Optimization</article-title>
          ,
          <source>Proceedings of the International Conference on Learning Representations</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>M.</given-names>
            <surname>Stefanini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cornia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Baraldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cascianelli</surname>
          </string-name>
          , G. Fiameni,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cucchiara</surname>
          </string-name>
          , From Show to Tell:
          <article-title>A Survey on Deep Learning-based Image Captioning</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, <article-title>BLEU: a Method for Automatic Evaluation of Machine Translation</article-title>, <source>in: Proceedings of the Annual Meeting of the Association for Computational Linguistics</source>, <year>2002</year>.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] S. Banerjee, A. Lavie, <article-title>METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments</article-title>, <source>in: Proceedings of the Annual Meeting of the Association for Computational Linguistics Workshops</source>, <year>2005</year>.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] C.-Y. Lin, <article-title>ROUGE: A Package for Automatic Evaluation of Summaries</article-title>, <source>in: Proceedings of the Annual Meeting of the Association for Computational Linguistics Workshops</source>, <year>2004</year>.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] R. Vedantam, C. Lawrence Zitnick, D. Parikh, <article-title>CIDEr: Consensus-based Image Description Evaluation</article-title>, <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>, <year>2015</year>.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] P. Anderson, B. Fernando, M. Johnson, S. Gould, <article-title>SPICE: Semantic Propositional Image Caption Evaluation</article-title>, <source>in: Proceedings of the European Conference on Computer Vision</source>, <year>2016</year>.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>