<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Guiding Reinforcement Learning with Selective Vision-Language Model Supervision</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Matteo Merler</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Bonetta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bernardo Magnini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fondazione Bruno Kessler</institution>
          ,
          <addr-line>Trento</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>We propose a framework that augments a model-free Reinforcement Learning (RL) agent with selective guidance from a pre-trained Vision-Language Model (VLM). Our system is designed to assist the RL agent, which starts from scratch and has no prior notion of the environment, by leveraging the VLM's common-sense knowledge to support its decision making. Rather than relying on the VLM at every timestep, the agent monitors its own uncertainty during training and defers to the VLM only when it is unsure about which action to take. Uncertainty is measured using the entropy of the policy distribution, and guidance is triggered when this entropy exceeds a predefined threshold. To reduce computational overhead, we introduce a stochastic gating mechanism that limits the frequency of VLM queries, along with a cache that stores past VLM responses for reuse. Experiments show that our method leads to more stable learning dynamics compared to standard PPO, with reduced variance across runs. In the FrozenLake environment, we observe that VLM guidance is primarily utilized during the early stages of training, gradually diminishing as the agent becomes more confident. This suggests that our selective guidance mechanism can support early exploration without hindering long-term autonomous behavior.</p>
      </abstract>
      <kwd-group>
        <kwd>Reinforcement Learning</kwd>
        <kwd>Vision-Language Models</kwd>
        <kwd>Policy Guidance</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Reinforcement Learning (RL) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] has proven effective for training agents across a wide range of domains,
including games [
        <xref ref-type="bibr" rid="ref2 ref3 ref4 ref5">2, 3, 4, 5</xref>
        ] and robotics [
        <xref ref-type="bibr" rid="ref6 ref7 ref8">6, 7, 8</xref>
        ], among others. At its core, an agent interacts with an
environment and learns how to behave through experience, by maximizing a reward signal observed as
a result of its actions. Typically, model-free RL methods [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ] start learning with no prior information
about the environment; instead, they explore in order to learn the dynamics of the world, before being
able to exploit what they learned by maximizing the expected reward. This process can be particularly
challenging in long-horizon tasks where the reward signal is sparse, leading to sample inefficiency [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        Meanwhile, Large Foundation Models (LFMs) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] such as Large Language Models (LLMs) [
        <xref ref-type="bibr" rid="ref13 ref14 ref15">13, 14, 15</xref>
        ]
and Vision-Language Models (VLMs) [
        <xref ref-type="bibr" rid="ref16 ref17 ref18">16, 17, 18</xref>
        ] demonstrate strong reasoning capabilities and encode
extensive common-sense priors from internet-scale pretraining. These models can be harnessed to guide
RL agents, providing high-level priors that bootstrap behavior and mitigate early-stage exploration
challenges. As such, recent work has explored using LFMs as agents [19, 20, 21], showing that they can
directly generate actions or high-level plans by conditioning on the current state and task specification.
      </p>
      <p>However, these methods often rely on the computationally expensive LFM at every timestep, impeding
the learning of a lightweight, task-specialized policy. To address this, we take inspiration from dual-process
theories of cognition [22], which distinguish between a fast, habitual decision-making system (System
1) and a slow, deliberative reasoning system (System 2). We propose a hybrid RL framework in which a
model-free policy operates as the decision-maker (acting as System 1), while a VLM (acting as System
2) is selectively invoked, when the agent exhibits low confidence, to provide high-level guidance in
ambiguous or unfamiliar states (Figure 1).</p>
      <p>
        We evaluate the approach on the FrozenLake environment from Gymnasium [23] and show that
our VLM guidance approach leads to increased stability compared to a vanilla PPO [
        <xref ref-type="bibr" rid="ref9">9</xref>
          ] baseline, demonstrating
that targeted, uncertainty-driven VLM guidance is a viable strategy for improving learning efficiency in
model-free RL.
      </p>
      <p>[Figure 1: When the policy is confident ("Easy choice!") the agent acts on its own; when uncertain ("I need help!") it queries the VLM, which reasons over the state image ("I am in position (0,0), there are no holes below me so moving down looks pretty safe...") and returns an action (move down).]</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Foundation models have been integrated with RL in various ways to improve generalization, sample
efficiency, and enable instruction-following abilities [24, 25].</p>
      <p>One instance of this is reward shaping: LFMs can generate reward signals directly from textual
instructions or from learned preferences [26, 27], or indirectly by embedding alignment between states
and goals [28, 29]. Other works have proposed generating executable code as reward functions [30, 31]
or training dedicated reward models through supervised or pretraining strategies [32, 33].</p>
      <p>Another active line of research uses foundation models as policy priors or action generators.
Pretrained LFMs can directly output actions based on linguistic or perceptual inputs [34, 19], or produce code
that defines the agent’s behavior [35]. Vision-Language-Action (VLA) models have also been trained
end-to-end to map observations and goals to low-level actions [36, 37, 38, 21]. RL with Foundation
Priors (RLFP) unifies these approaches, showing how foundation models can be used jointly for reward
modeling and policy learning [39].</p>
      <p>Recent approaches like GLAM [40] and TWOSOME [41] fine-tune LLMs directly in interactive
environments using online RL, aligning the model’s prior knowledge with embodied or symbolic
domains. Bonetta et al. [42] demonstrate that integrating a VLM as a policy into PPO training can
significantly improve sample efficiency and performance compared to training from scratch. Similarly,
LM4TEACH [43] distills LLM knowledge into a student policy based on the LLM’s logits for each
action. Zhai et al. [20] show that large VLMs, when fine-tuned with chain-of-thought prompting,
outperform closed-source models on complex multi-step decision tasks. GTR [44] further improves
this by supervising intermediate thoughts to prevent “thought collapse,” ensuring the VLM’s reasoning
process remains consistent throughout learning.</p>
      <p>However, most of these approaches rely on foundation models to drive decision-making at every step.
In contrast, few methods investigate hybrid strategies where the foundation model plays a supportive
role, guiding the policy only when needed. ULTRA [45] identifies critical states from collected experience
offline and proposes improvements only for those. DSADF [46], in contrast, uses a hierarchical pipeline:
a VLM plans subgoals, model-free controllers execute them, and the VLM intervenes if the controller fails.
Han et al. [47] propose a similar dual-level system, using a VLA model to support decision-making. Most
similar to our work, RCMP [48] also introduces a framework where guidance from a pre-trained expert
is provided based on uncertainty, building on previous work in the multi-agent setting [49]. Compared
to these, our method provides an online assistance mechanism: a lightweight CNN policy governs
behavior, but when it expresses uncertainty, a VLM proposes an action which is treated as if the small
model had chosen it. This setup enables faster training by leveraging VLM guidance selectively, while
avoiding the high computational cost and scalability issues of always-on foundation model policies or
hierarchical frameworks, as well as avoiding the need for a domain-specific expert to provide guidance,
which might not always be available.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <sec id="sec-3-1">
        <title>3.1. Problem Setup</title>
        <p>
          Following standard RL conventions, we formulate the problem as a Markov Decision Process (MDP)
defined by the tuple (𝒮, 𝒜, ℛ, 𝒫, γ), where 𝒮 is the state space, 𝒜 is the action space, ℛ : 𝒮 × 𝒜 → ℝ
is the reward function, 𝒫 : 𝒮 × 𝒜 → 𝒮 is the transition function and γ ∈ [0, 1] is the discount factor [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
We focus on deterministic, fully observable environments with a discrete action space 𝒜 in order to
isolate the effect of our method from other sources of variability. In principle, our work can extend to
stochastic, partially observable and continuous settings with minor changes.
        </p>
        <p>In this setting, the agent aims to learn a policy π : 𝒮 → Δ(𝒜)¹ which maximizes the discounted
cumulative expected reward 𝔼π [ ∑ₜ₌₀^∞ γᵗ ℛ(sₜ, aₜ) | s₀ ∼ ρ₀ ], where ρ₀ is the initial state distribution.
We use πθ(a|s) to denote the probability of an action a being sampled from a learned policy πθ(·|s)
parametrized by θ and conditioned on the current state s.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Vision-Language Model Guidance</title>
        <p>
          We introduce an RL framework that augments a model-free policy with selective guidance from a
pretrained VLM. The key idea is to allow the RL agent to act autonomously in states where πθ exhibits high
confidence, while deferring to the VLM in low-confidence states where uncertainty about the optimal
action is high. Specifically, we choose the entropy ℋ[πθ(·|s)] = −∑a∈𝒜 πθ(a|s) log πθ(a|s) as
the measure of uncertainty of the policy (where a higher entropy implies a flatter distribution with
more indecision between actions). With this, we define an entropy threshold hyperparameter τ, above
which the VLM will be asked to provide guidance. Furthermore, we normalize ℋ[πθ(·|s)] to the range
[0, 1] by dividing it by its theoretical maximum log |𝒜|, in order to set τ more easily.
        </p>
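        <p>For instance, with a categorical policy this check can be written as follows (a minimal sketch assuming a PyTorch policy that outputs action logits; names such as <monospace>needs_guidance</monospace> are illustrative):</p>
        <preformat preformat-type="code">import torch
from torch.distributions import Categorical

def needs_guidance(logits: torch.Tensor, tau: float) -> bool:
    """Return True when the policy is uncertain enough to ask the VLM.

    The entropy is normalized by its theoretical maximum log|A|, so tau
    can be chosen in [0, 1] regardless of the action-space size.
    """
    dist = Categorical(logits=logits)
    max_entropy = torch.log(torch.tensor(float(logits.shape[-1])))
    return bool((dist.entropy() / max_entropy).item() > tau)</preformat>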
        <p>When the normalized entropy exceeds τ, the VLM provides guidance for the agent by generating the
action to take, acting as a guidance policy πVLM(·|s)². The model is prompted with an image of the
current state s, a description of the available actions and goal, and instructed to reason with zero-shot
Chain of Thought (CoT) prompting [50, 51] to generate an action a ∼ πVLM(·|s). This replaces the
sampling process in the training loop: a is sampled from πVLM(·|s) instead of πθ(·|s).</p>
        <p>
          Crucially, VLMs are known to hallucinate [52] and may offer inaccurate or misleading guidance,
particularly in domains that diverge from their pre-training distribution (mostly composed of realistic
images), such as synthetic environments or pixel art. However, in our framework the RL agent does not
blindly imitate the VLM. Instead, it treats VLM-suggested actions as exploratory signals: if following
such an action leads to poor rewards, the policy is updated through standard RL mechanisms (e.g., policy
gradient or value function updates) to reduce the likelihood of selecting that action in similar states.
This enables the agent to recover from suboptimal guidance and progressively improve its behavior.
¹Δ(𝒜) represents any probability distribution over 𝒜, formally Δ(𝒜) = { f : 𝒜 → [0, 1] | ∑a∈𝒜 f(a) = 1 }.
²Slightly abusing the notation, as the VLM only generates the best action, rather than a distribution over 𝒜.
[Algorithm 1: Selective Policy Guidance with VLM Supervision]
        </p>
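        <p>One simple way to realize this is to store a VLM-suggested action in the rollout buffer together with the log-probability that the current policy assigns to it, so the update treats the action exactly as if πθ had sampled it. The following sketch assumes a PPO-style buffer; the field names are illustrative rather than CleanRL’s exact ones:</p>
        <preformat preformat-type="code">import torch
from torch.distributions import Categorical

def record_step(buffer: dict, logits: torch.Tensor, action: int, step: int) -> None:
    """Store a (possibly VLM-suggested) action as if the policy had sampled it.

    Logging log pi_theta(a|s) for the chosen action lets the policy-gradient
    update later raise or lower that probability based on observed rewards.
    """
    dist = Categorical(logits=logits)
    action_t = torch.as_tensor(action)
    buffer["actions"][step] = action_t
    buffer["logprobs"][step] = dist.log_prob(action_t)  # log pi_theta(a|s)</preformat>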
        <p>Integrating a VLM into the learning process significantly increases the computational budget³ required
to train the agent, reducing its scalability to more complex tasks which require more training steps.
Even if the VLM is prompted only under high-entropy conditions, the method still incurs substantial
overhead, as it will always be used at the early stages of training (since πθ initially has high entropy
across most states). To mitigate this, we introduce a caching mechanism 𝒞 : 𝒮 → 𝒜 that stores the
VLM’s responses, allowing the agent to reuse previous guidance when revisiting the same state.</p>
        <p>
          However, this in turn will force the agent to always follow the same strategy in cached states,
restricting the potential for exploration and potentially resulting in sub-optimal behavior. To address
this issue, we formulate a new hyperparameter ε ∈ [0, 1] which toggles when VLM guidance is invoked,
analogous to the ε in an ε-greedy policy (i.e., VLM guidance will be used if ℋ[πθ(·|s)] &gt; τ ∧ u &lt; ε,
where u ∼ 𝒰(0, 1) and 𝒰 is the uniform distribution). The final action selection process for our method
is formalized in Algorithm 1 and summarized in Figure 2.
        </p>
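        <p>As a concrete illustration, the following Python sketch combines the entropy test, the stochastic gate ε, and the cache into the action selection of Algorithm 1. It assumes a PyTorch categorical policy and a hypothetical <monospace>query_vlm</monospace> helper that prompts the VLM with the state image and parses the chosen action from its response; it is a sketch, not the exact training code:</p>
        <preformat preformat-type="code">import math
import random
from torch.distributions import Categorical

def select_action(logits, state_key, state_image, cache, query_vlm, tau, eps):
    """Selective policy guidance (sketch of Algorithm 1).

    tau: normalized-entropy threshold; eps: probability of invoking the VLM
    when entropy is high; cache: dict mapping hashable state keys to past
    VLM suggestions; query_vlm: hypothetical callable returning an action index.
    """
    dist = Categorical(logits=logits)
    normalized_entropy = dist.entropy().item() / math.log(logits.shape[-1])
    if normalized_entropy > tau and random.random() &lt; eps:
        if state_key not in cache:                     # query only on a cache miss
            cache[state_key] = query_vlm(state_image)  # slow, deliberative System 2
        return cache[state_key]                        # reuse cached guidance
    return int(dist.sample())                          # fast, autonomous System 1</preformat>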
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation</title>
      <sec id="sec-4-1">
        <title>4.1. Experimental Setup</title>
        <p>
          We now describe the experimental setup used to evaluate our method, including the underlying RL
algorithm, environment, and evaluation metrics. Detailed hyperparameters are reported in Appendix B.
RL Algorithm. As our method only affects the action sampling process, it can be integrated into any
model-free RL algorithm with the aim of improving its sample efficiency and/or stability. For this work,
we choose the popular Proximal Policy Optimization (PPO) [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] method as a backbone. We follow the
implementation of PPO provided by CleanRL [53], a high-quality codebase offering reproducible and
well-tested RL baselines. The agent uses a Convolutional Neural Network (CNN) for both the actor and
critic networks, enabling it to process visual state inputs directly. The action selection logic is modified
by inserting our selective guidance policy in place of the default sampling routine, while the rest of
the algorithm remains unchanged³.
³For example, Zhai et al. [20], who use a VLM at every step, report a budget of 30 hours for 15k training steps.
        </p>
        <p>[Figure 3: (a) FrozenLake learning curves (w/ guidance (ours), w/o guidance, VLM policy); (b) VLM guidance calls over time; (c) running average of cache size. Horizontal axes: Global Step (×10³).]</p>
        <p>
          Our VLM of choice is the Gemma3 model [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] in its 4B-parameter
variant, which provides a convenient tradeoff between speed and performance. The prompt used for
VLM guidance is reported in Appendix A. All experiments use a training budget of 500k environment
steps. We performed each experiment on a single NVIDIA A100 64GB GPU with both the RL agent
and the VLM loaded on the same device. We employ the vLLM library [54] to accelerate VLM inference.
Environment. We evaluate the method on a deterministic version of the FrozenLake grid-world
environment from Gymnasium [23]. This task requires the agent to navigate a grid (using the cardinal
directions as actions) and avoid holes in order to reach the goal, represented as a gift box (as seen in
Figure 2). Notably, the agent only receives a positive reward of 1 when reaching the gift box, and 0 in
any other case (including falling into a hole), making the signal very sparse. To further challenge the
agent, we choose the larger "8x8" grid map, which is kept fixed across episodes. A state s is given to
both the PPO agent and the VLM as an image.
        </p>
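        <p>For reference, this configuration corresponds to the following Gymnasium setup (a sketch using the standard API; the exact wrapper stack used in our experiments may differ):</p>
        <preformat preformat-type="code">import gymnasium as gym

# Deterministic 8x8 FrozenLake; render_mode="rgb_array" exposes the pixel-art
# frame that is fed to both the CNN policy and the VLM.
env = gym.make("FrozenLake-v1", map_name="8x8", is_slippery=False,
               render_mode="rgb_array")
obs, info = env.reset(seed=0)
frame = env.render()  # (H, W, 3) uint8 image of the current state</preformat>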
        <p>Evaluation Metrics. We assess performance using multiple metrics. First, we measure the agent’s
average episodic return over 10 evaluation episodes, reported across 5 random seeds. In addition, we
visualize evaluation curves by plotting this average episodic return every 5000 training steps. To assess
the VLM’s usefulness and cache usage, we track the number of guidance calls (i.e., how many times the
VLM was queried) and the cache size over training time-steps.</p>
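        <p>A minimal evaluation loop consistent with these metrics might look as follows (a sketch; <monospace>greedy_action</monospace> stands in for the trained policy’s action choice and is illustrative):</p>
        <preformat preformat-type="code">import numpy as np

def evaluate(env, greedy_action, episodes: int = 10) -> float:
    """Average episodic return over a fixed number of evaluation episodes."""
    returns = []
    for ep in range(episodes):
        obs, info = env.reset(seed=ep)
        done, total = False, 0.0
        while not done:
            obs, reward, terminated, truncated, info = env.step(greedy_action(obs))
            total += float(reward)
            done = terminated or truncated
        returns.append(total)
    return float(np.mean(returns))</preformat>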
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Results</title>
        <p>We compare three agents:
• Our method w/ guidance, which uses selective VLM guidance during training.
• A PPO baseline w/o guidance, trained with the same CNN backbone but without any VLM
involvement.
• A VLM policy baseline, which queries the VLM directly at inference time to choose actions,
without any RL (and without caching).</p>
        <p>Both learning-based agents can converge early in training. However, the vanilla PPO
baseline exhibits high variance across seeds: while some runs converge quickly due to lucky initial
exploration, others remain stuck with near-zero returns throughout the entire training process, making
the baseline less reliable and predictable.</p>
        <p>The VLM policy baseline consistently fails to solve the environment. This demonstrates that querying
Gemma3 directly, without any task-specific learning or adaptation, is insufficient for meaningful
decision-making in this setting. The VLM does not solve the long-horizon task, but can instead be
useful to guide individual short-term actions, with RL learning not to repeat incorrect guidance.</p>
        <p>In contrast, our method consistently converges to optimal performance (average return ∼ 1.0) across
all seeds, with significantly lower variance. Although the average sample efficiency is only slightly
better due to the task’s simplicity, our method is markedly more stable. The VLM guidance helps the
agent recover from unproductive early rollouts and reliably explore successful strategies.</p>
        <p>To better understand the role of VLM guidance during training, Figure 3b reports the number of VLM
queries issued over time. As expected, most calls occur during the early high-entropy phase of learning.
Around the 18k step mark, the number of guidance calls drops to zero, indicating that the agent has
learned to act confidently without external help. Figure 3c tracks the evolution of the cache used to
store guidance responses, which grows rapidly during early training, peaking around 17.5k steps with
around 120 unique states⁴. After that, most information needs are met through cache hits rather than
new VLM queries, reducing the computational overhead of prompting.</p>
        <p>Together, these results provide a proof-of-concept for the adaptive behavior enabled by our guidance
method. VLM queries are used when uncertainty is high, then progressively replaced by internalized
behavior and cached knowledge. The resulting overhead is minimal: training with guidance takes about
90 minutes, compared to ∼ 85 minutes without it. This small cost is outweighed by the substantial
improvements in stability and robustness, especially in long-horizon environments with sparse rewards.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>We introduced a RL framework that leverages selective guidance from a pre-trained VLM to support
early-stage exploration. By monitoring the agent’s own uncertainty and triggering VLM guidance
only when necessary, our method enables more stable and consistent learning compared to a vanilla
PPO baseline, while keeping the reliance on external supervision limited in both time and scope. Our
experiments on the FrozenLake environment show that the proposed strategy is particularly effective
during the early phases of training, when the policy entropy is high and exploration is most challenging,
with the cache reducing the computational burden by promoting reuse of past VLM completions. Overall,
our results suggest that selectively integrating VLM supervision, rather than relying on it continuously,
can be a practical and efficient way to improve the learning dynamics in a model-free RL agent.</p>
      <sec id="sec-5-1">
        <title>5.1. Limitations and Future Work</title>
        <p>While these initial results are promising, further work is needed to assess the method’s generality and
effectiveness. In FrozenLake, successful PPO runs already achieved near-optimal performance quickly
(as shown by the shaded area in Figure 3a), limiting observable gains. More complex environments,
such as ones involving grounded language understanding [55], may better showcase the benefits of
selective VLM guidance in enhancing exploration and learning speed.</p>
        <p>
          Future work should also test robustness across RL algorithms and in more challenging settings,
including continuous, stochastic, and partially observable environments. This will require improving
the measure of uncertainty (e.g., by using the standard deviation from an ensemble of value heads [48])
and improving the way guidance is incorporated into the algorithm, taking inspiration from imitation
learning literature [56]. The process through which guidance is discarded (with the parameter ε)
can also be improved, for instance by implementing some version of the UCB formula [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] or similar
count-based methods [49].
        </p>
        <p>Key directions include an ablation of the guidance hyperparameters τ and ε to clarify their influence
on learning, and improvements to the caching strategy (e.g., storing diverse completions) to avoid
premature convergence. Our results also highlight the limits of small-scale VLMs like Gemma3, which
was unable to reach the goal on its own (Figure 3a), even in a common environment which the model
has likely seen during pre-training. Since the VLM’s output quality is central to our approach, scaling to
better models will be important. Promising avenues are distilling knowledge from a larger VLM through
supervised fine-tuning before training with RL, or incorporating the experience collected online by the
RL algorithm, similar to RL4VLM [20], to improve the quality of guidance during training.
⁴Note that, while the 8x8 grid-world comprises only 64 unique states, the direction the agent is facing is also represented
visually (even if irrelevant in practice), bringing the theoretical maximum to 64 · 4 = 256 unique states.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>A. Prompts</title>
      <p>VLM Guidance Prompt
&lt;system&gt;
You are a smart agent acting in the "FrozenLake" environment from the Gymnasium library. You will
receive observations from the environment and must decide which action to take next.
You should think about the answer step by step inside the &lt;think&gt; tag, and then provide the action inside
the &lt;action&gt; tag. All text outside the &lt;think&gt; and &lt;action&gt; tags will be ignored.
&lt;/system&gt;
&lt;user&gt;
The goal is to navigate across a frozen grid to reach the goal without falling into holes.
## Observation space
You are presented with the image of the environment in the style of pixel art.</p>
      <p>Each cell may be safe (white with snow), a hole (bright blue ice hole), the start (a stool) or the goal (a
present box). The agent is represented by an elf character.
## Action space
The action space includes 4 discrete actions:
− 0: Move left
− 1: Move down
− 2: Move right
− 3: Move up
Based on the current visual information, which action should the agent take next?
## Observation:
{image}
&lt;/user&gt;</p>
    </sec>
    <sec id="sec-7">
      <title>B. Experiment Hyperparameters</title>
      <p>This Appendix reports the hyperparameters used for PPO training in our experiments on the
FrozenLake-v1 environment. We also report the configurations for the CNN policy and for the
VLM.</p>
      <sec id="sec-7-1">
        <title>Parameter</title>
        <p>Environment ID
Map size
Slippery
Observation type
Image stack size
Number of parallel environments
Episode length (steps)
Total timesteps</p>
      </sec>
      <sec id="sec-7-2">
        <title>Value</title>
        <p>FrozenLake-v1
8× 8
false
RGB image
1
4
128
500,000
Learning rate 2.5e-4
Batch size 512
Minibatch size 128
Update epochs 4
Number of minibatches 4
Discount factor ( ) 0.99
GAE lambda 0.95
Entropy coeficient 0.01
Value function coeficient 0.5
Clip coeficient 0.2
Clip value loss true
Normalize advantages true
Anneal learning rate true
Max gradient norm 0.5
Evaluation frequency Every 1,000 steps
Evaluation episodes 10</p>
      </sec>
      <sec id="sec-7-3">
        <title>Parameter</title>
        <p>Agent type
Convolutional channels
Kernel size
Stride
Padding
Conv head activation
Residual activation
Conv head output size
Actor standard deviation
Critic standard deviation</p>
      </sec>
      <sec id="sec-7-4">
        <title>Value</title>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>C. Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT to check grammar and spelling,
paraphrase and reword, and generate the literature review (particularly to help discover less well-known
related work). After using this tool, the authors reviewed and edited the content as needed and take full
responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgments</title>
      <p>This work was conducted during the period in which Giovanni Bonetta and Bernardo Magnini were
supported by the PNRR MUR project PE0000013-FAIR (Spoke 2).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Barto</surname>
          </string-name>
          , et al.,
          <article-title>Reinforcement learning: An introduction</article-title>
          , MIT press Cambridge,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>V.</given-names>
            <surname>Mnih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Rusu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Veness</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Bellemare</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Graves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Riedmiller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Fidjeland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ostrovski</surname>
          </string-name>
          , et al.,
          <article-title>Human-level control through deep reinforcement learning</article-title>
          ,
          <source>Nature</source>
          <volume>518</volume>
          (
          <year>2015</year>
          )
          <fpage>529</fpage>
          -
          <lpage>533</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Maddison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Guez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sifre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Van Den Driessche</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schrittwieser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Antonoglou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Panneershelvam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lanctot</surname>
          </string-name>
          , et al.,
          <article-title>Mastering the game of go with deep neural networks and tree search</article-title>
          ,
          <source>Nature</source>
          <volume>529</volume>
          (
          <year>2016</year>
          )
          <fpage>484</fpage>
          -
          <lpage>489</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          , I. Babuschkin,
          <string-name>
            <given-names>W. M.</given-names>
            <surname>Czarnecki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mathieu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dudzik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. H.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Powell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ewalds</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Georgiev</surname>
          </string-name>
          , et al.,
          <article-title>Grandmaster level in starcraft ii using multi-agent reinforcement learning</article-title>
          ,
          <source>Nature</source>
          <volume>575</volume>
          (
          <year>2019</year>
          )
          <fpage>350</fpage>
          -
          <lpage>354</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hafner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pasukonis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ba</surname>
          </string-name>
          , T. Lillicrap,
          <article-title>Mastering diverse control tasks through world models</article-title>
          ,
          <source>Nature</source>
          <volume>640</volume>
          (
          <year>2025</year>
          )
          <fpage>647</fpage>
          -
          <lpage>653</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Levine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Finn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Darrell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Abbeel</surname>
          </string-name>
          ,
          <article-title>End-to-end training of deep visuomotor policies</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>17</volume>
          (
          <year>2016</year>
          )
          <fpage>1</fpage>
          -
          <lpage>40</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gu</surname>
          </string-name>
          , E. Holly,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lillicrap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Levine</surname>
          </string-name>
          ,
          <article-title>Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates</article-title>
          ,
          <source>in: 2017 IEEE international conference on robotics and automation (ICRA)</source>
          , IEEE,
          <year>2017</year>
          , pp.
          <fpage>3389</fpage>
          -
          <lpage>3396</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>I.</given-names>
            <surname>Akkaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Andrychowicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chociej</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Litwin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>McGrew</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Petron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Paino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Plappert</surname>
          </string-name>
          , G. Powell,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ribas</surname>
          </string-name>
          , et al.,
          <article-title>Solving rubik's cube with a robot hand</article-title>
          ,
          <source>arXiv preprint arXiv:1910.07113</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Schulman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wolski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Klimov</surname>
          </string-name>
          ,
          <article-title>Proximal policy optimization algorithms</article-title>
          ,
          <source>arXiv preprint arXiv:1707.06347</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T.</given-names>
            <surname>Haarnoja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Abbeel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Levine</surname>
          </string-name>
          ,
          <article-title>Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>1861</fpage>
          -
          <lpage>1870</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ecoffet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Huizinga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. O.</given-names>
            <surname>Stanley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clune</surname>
          </string-name>
          , First return, then explore,
          <source>Nature</source>
          <volume>590</volume>
          (
          <year>2021</year>
          )
          <fpage>580</fpage>
          -
          <lpage>586</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>R.</given-names>
            <surname>Bommasani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Hudson</surname>
          </string-name>
          , E. Adeli,
          <string-name>
            <given-names>R.</given-names>
            <surname>Altman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Arora</surname>
          </string-name>
          , S. von Arx,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bohg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bosselut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Brunskill</surname>
          </string-name>
          , et al.,
          <article-title>On the opportunities and risks of foundation models</article-title>
          ,
          <source>arXiv preprint arXiv:2108.07258</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Achiam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Adler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Akkaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. L.</given-names>
            <surname>Aleman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Altenschmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Altman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Anadkat</surname>
          </string-name>
          , et al.,
          <source>Gpt-4 technical report, arXiv preprint arXiv:2303.08774</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Grattafiori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dubey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jauhri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pandey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kadian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Al-Dahle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Letman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mathur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Schelten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaughan</surname>
          </string-name>
          , et al.,
          <source>The llama 3 herd of models, arXiv preprint arXiv:2407.21783</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lv</surname>
          </string-name>
          , et al.,
          <source>Qwen3 technical report, arXiv preprint arXiv:2505.09388</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , K. Zhang, P. Zhang,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          , et al.,
          <article-title>Llava-onevision: Easy visual task transfer</article-title>
          ,
          <source>arXiv preprint arXiv:2408.03326</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Dang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          , et al.,
          <source>Qwen2.5-VL technical report, arXiv preprint arXiv:2502.13923</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>G.</given-names>
            <surname>Team</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kamath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ferret</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pathak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Vieillard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Merhej</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Perrin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Matejovicova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rivière</surname>
          </string-name>
          , et al.,
          <source>Gemma 3 technical report, arXiv preprint arXiv:2503.19786</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] M. Ahn, B. Ichter, A. Brohan, Y. Chebotar, C. Finn, K. Hausman, et al., Do as i can, not as i say: Grounding language in robotic affordances, in: 6th Annual Conference on Robot Learning, 2022.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] Y. Zhai, H. Bai, Z. Lin, J. Pan, S. Tong, Y. Zhou, A. Suhr, S. Xie, Y. LeCun, Y. Ma, S. Levine, Finetuning large vision-language models as decision-making agents via reinforcement learning, in: The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] G. R. Team, S. Abeyruwan, J. Ainslie, J.-B. Alayrac, M. G. Arenas, T. Armstrong, A. Balakrishna, R. Baruch, M. Bauza, M. Blokzijl, et al., Gemini robotics: Bringing ai into the physical world, arXiv preprint arXiv:2503.20020 (2025).</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] D. Kahneman, Thinking, Fast and Slow, Allen Lane, 2011.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] M. Towers, A. Kwiatkowski, J. Terry, J. U. Balis, G. De Cola, T. Deleu, M. Goulão, A. Kallinteris, M. Krimmel, A. KG, et al., Gymnasium: A standard interface for reinforcement learning environments, arXiv preprint arXiv:2407.17032 (2024).</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] Y. Cao, H. Zhao, Y. Cheng, T. Shu, Y. Chen, G. Liu, G. Liang, J. Zhao, J. Yan, Y. Li, Survey on large language model-enhanced reinforcement learning: Concept, taxonomy, and methods, IEEE Transactions on Neural Networks and Learning Systems (2024).</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] S. Schoepp, M. Jafaripour, Y. Cao, T. Yang, F. Abdollahi, S. Golestan, Z. Sufiyan, O. R. Zaiane, M. E. Taylor, The evolving landscape of LLM- and VLM-integrated reinforcement learning, arXiv preprint arXiv:2502.15214 (2025).</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] M. Kwon, S. M. Xie, K. Bullard, D. Sadigh, Reward design with language models, in: The Eleventh International Conference on Learning Representations, 2023.</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[27] Y. Du, O. Watkins, Z. Wang, C. Colas, T. Darrell, P. Abbeel, A. Gupta, J. Andreas, Guiding pretraining in reinforcement learning with large language models, in: International Conference on Machine Learning, PMLR, 2023, pp. 8657–8677.</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>[28] M. Klissarov, P. D’Oro, S. Sodhani, R. Raileanu, P.-L. Bacon, P. Vincent, A. Zhang, M. Henaff, Motif: Intrinsic motivation from artificial intelligence feedback, in: NeurIPS 2023 Foundation Models for Decision Making Workshop, 2023.</mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>[29] J. Rocamonde, V. Montesinos, E. Nava, E. Perez, D. Lindner, Vision-language models are zero-shot reward models for reinforcement learning, in: The Twelfth International Conference on Learning Representations, 2024.</mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>[30] Y. J. Ma, W. Liang, G. Wang, D.-A. Huang, O. Bastani, D. Jayaraman, Y. Zhu, L. Fan, A. Anandkumar, Eureka: Human-level reward design via coding large language models, arXiv preprint arXiv:2310.12931 (2023).</mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>[31] D. Venuto, S. N. Islam, M. Klissarov, D. Precup, S. Yang, A. Anand, Code as reward: Empowering reinforcement learning with vlms, in: Proceedings of the 41st International Conference on Machine Learning, ICML’24, JMLR.org, 2024.</mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>[32] Y. J. Ma, S. Sodhani, D. Jayaraman, O. Bastani, V. Kumar, A. Zhang, VIP: Towards universal visual reward and representation via value-implicit pre-training, in: NeurIPS 2022 Foundation Models for Decision Making Workshop, 2022.</mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>[33] Y. J. Ma, V. Kumar, A. Zhang, O. Bastani, D. Jayaraman, LIV: Language-image representations and rewards for robotic control, in: Workshop on Reincarnating Reinforcement Learning at ICLR 2023, 2023.</mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>[34] W. Huang, F. Xia, T. Xiao, H. Chan, J. Liang, P. Florence, A. Zeng, J. Tompson, I. Mordatch, Y. Chebotar, P. Sermanet, T. Jackson, N. Brown, L. Luu, S. Levine, K. Hausman, B. Ichter, Inner monologue: Embodied reasoning through planning with language models, in: 6th Annual Conference on Robot Learning, 2022.</mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>[35] J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman, B. Ichter, P. Florence, A. Zeng, Code as policies: Language model programs for embodied control, in: 2023 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2023, pp. 9493–9500.</mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>[36] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu, et al., Rt-1: Robotics transformer for real-world control at scale, arXiv preprint arXiv:2212.06817 (2022).</mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>[37] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn, et al., Rt-2: Vision-language-action models transfer web knowledge to robotic control, arXiv preprint arXiv:2307.15818 (2023).</mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>[38] M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, et al., Openvla: An open-source vision-language-action model, arXiv preprint arXiv:2406.09246 (2024).</mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>[39] W. Ye, Y. Zhang, H. Weng, X. Gu, S. Wang, T. Zhang, M. Wang, P. Abbeel, Y. Gao, Reinforcement learning with foundation priors: Let embodied agent efficiently learn on its own, in: 8th Annual Conference on Robot Learning, 2024.</mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>[40] T. Carta, C. Romac, T. Wolf, S. Lamprier, O. Sigaud, P.-Y. Oudeyer, Grounding large language models in interactive environments with online reinforcement learning, in: Proceedings of the 40th International Conference on Machine Learning, ICML’23, JMLR.org, 2023.</mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>[41] W. Tan, W. Zhang, S. Liu, L. Zheng, X. Wang, B. An, True knowledge comes from practice: Aligning large language models with embodied environments via reinforcement learning, in: The Twelfth International Conference on Learning Representations, 2024.</mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>[42] G. Bonetta, D. Zago, R. Cancelliere, M. Polato, B. Magnini, Vision language models as policy learners in reinforcement learning environments, in: ESANN, 2024.</mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>[43] Z. Zhou, B. Hu, C. Zhao, P. Zhang, B. Liu, Large language model as a policy teacher for training reinforcement learning agents, IJCAI ’24, 2024.</mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>[44] T. Wei, Y. Yang, J. Xing, Y. Shi, Z. Lu, D. Ye, Gtr: Guided thought reinforcement prevents thought collapse in rl-based vlm agent training, arXiv preprint arXiv:2503.08525 (2025).</mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>[45] H. Tan, H. Yan, Y. Yang, Llm-guided reinforcement learning: Addressing training bottlenecks through policy modulation, arXiv preprint arXiv:2505.20671 (2025).</mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>[46] A. Z. Dou, D. Cui, J. Yan, W. Wang, B. Chen, H. Wang, Z. Xie, S. Zhang, Dsadf: Thinking fast and slow for decision making, arXiv preprint arXiv:2505.08189 (2025).</mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>[47] B. Han, J. Kim, J. Jang, A dual process vla: Efficient robotic manipulation leveraging vlm, arXiv preprint arXiv:2410.15549 (2024).</mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>[48] F. L. Da Silva, P. Hernandez-Leal, B. Kartal, M. E. Taylor, Uncertainty-aware action advising for deep reinforcement learning agents, Proceedings of the AAAI Conference on Artificial Intelligence 34 (2020) 5792–5799.</mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>[49] F. L. Da Silva, R. Glatt, A. H. R. Costa, Simultaneously learning and advising in multiagent reinforcement learning, in: Proceedings of the 16th Conference on Autonomous Agents and Multiagent Systems, 2017, pp. 1100–1108.</mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>[50] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al., Chain-of-thought prompting elicits reasoning in large language models, Advances in Neural Information Processing Systems 35 (2022) 24824–24837.</mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>[51] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, Y. Iwasawa, Large language models are zero-shot reasoners, Advances in Neural Information Processing Systems 35 (2022) 22199–22213.</mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>[52] H. Liu, W. Xue, Y. Chen, D. Chen, X. Zhao, K. Wang, L. Hou, R. Li, W. Peng, A survey on hallucination in large vision-language models, arXiv preprint arXiv:2402.00253 (2024).</mixed-citation>
      </ref>
      <ref id="ref53">
        <mixed-citation>[53] S. Huang, R. F. J. Dossa, C. Ye, J. Braga, D. Chakraborty, K. Mehta, J. G. Araújo, Cleanrl: High-quality single-file implementations of deep reinforcement learning algorithms, Journal of Machine Learning Research 23 (2022) 1–18.</mixed-citation>
      </ref>
      <ref id="ref54">
        <mixed-citation>[54] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, I. Stoica, Efficient memory management for large language model serving with pagedattention, in: Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.</mixed-citation>
      </ref>
      <ref id="ref55">
        <mixed-citation>[55] D. Paglieri, B. Cupiał, S. Coward, U. Piterbarg, M. Wolczyk, A. Khan, E. Pignatelli, Ł. Kuciński, L. Pinto, R. Fergus, J. N. Foerster, J. Parker-Holder, T. Rocktäschel, BALROG: Benchmarking agentic LLM and VLM reasoning on games, in: The Thirteenth International Conference on Learning Representations, 2025.</mixed-citation>
      </ref>
      <ref id="ref56">
        <mixed-citation>[56] S. Ross, G. Gordon, D. Bagnell, A reduction of imitation learning and structured prediction to no-regret online learning, in: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, 2011, pp. 627–635.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>