<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
<article-title>all: natural language to bind communication, perception and action</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
<string-name>Simone Colombani</string-name>
<email>simone.colombani@studenti.unimi.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dimitri Ognibene</string-name>
          <email>dimitri.ognibene@unimib.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Boccignone</string-name>
          <email>giuseppe.boccignone@unimi.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Oversonic Robotics</institution>
          ,
          <addr-line>Carate Brianza</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
<institution>University of Milano-Bicocca</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
<institution>University of Milan</institution>
          ,
          <addr-line>Milan</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In recent years, research in the area of human-robot interaction has focused on developing robots capable of understanding complex human instructions and performing tasks in dynamic and diverse environments. These systems have a wide range of applications, from personal assistance to industrial robotics, emphasizing the importance of robots interacting flexibly, naturally and safely with humans. This paper presents an advanced architecture for robotic action planning that integrates communication, perception, and planning with Large Language Models (LLMs). Our system is designed to translate commands expressed in natural language into executable robot actions, incorporating environmental information and dynamically updating plans based on real-time feedback. The Planner Module is the core of the system, where LLMs embedded in a modified ReAct framework are employed to interpret and carry out user commands like "Go to the kitchen and pick up the blue bottle on the table". By leveraging their extensive pre-trained knowledge, LLMs can effectively process user requests without the need to introduce new knowledge on the changing environment. The modified ReAct framework further enhances the execution space by providing real-time environmental perception and the outcomes of physical actions. By combining robust and dynamic semantic map representations as graphs with control components and failure explanations, this architecture enhances a robot's adaptability, task execution efficiency, and seamless collaboration with human users in shared and dynamic environments. Through the integration of continuous feedback loops with the environment, the system can dynamically adjust the plan to accommodate unexpected changes, optimizing the robot's ability to perform tasks. Using a dataset of previous experience, it is possible to provide detailed feedback about a failure and to update the LLM's context for the next iteration with suggestions on how to recover. The system has been implemented on RoBee, the cognitive humanoid robot developed by Oversonic Robotics, showcasing its adaptability and potential for integration across diverse environments. By leveraging LLMs and semantic mapping, the architecture enables RoBee to navigate and respond to real-time changes.</p>
        <p>AI4CC-IPS-RCRA-SPIRIT 2024: International Workshop on Artificial Intelligence for Climate Change, Italian Workshop on Planning and Scheduling, RCRA Workshop on Experimental evaluation of algorithms for solving problems with combinatorial explosion, and SPIRIT Workshop on Strategies, Prediction, Interaction, and Reasoning in Italy. November 25-28th, 2024, Bolzano, Italy [1].</p>
      </abstract>
      <kwd-group>
        <kwd>Human-Robot interaction</kwd>
        <kwd>Robot task planning</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Automated planning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The integration of LLMs in robotic systems has opened new avenues for autonomous task planning
and execution [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ]. These models demonstrate exceptional natural language understanding and
commonsense reasoning capabilities, enhancing a robot's ability to comprehend contexts and execute
commands [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. However, LLMs are not able to plan autonomously; they need to be integrated into
architectures that enable them to understand the environment, the robot's capabilities, and its state [6]. This
research aims to empower robots to comprehend user requests and autonomously generate actionable
plans in diverse environments.
      </p>
      <p>
        However, the efficacy of these plans relies on the robot's understanding of its operating environment
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. To bridge this gap, our work employs scene graphs [8] as a semantic mapping tool, offering a
structured representation of spatial and semantic information within a scene.
      </p>
      <p>
        In our approach, we leverage LLMs through in-context learning [9], which enables the models to learn and adapt
based on the information provided in the context. Our work implements a modified version of the
ReAct [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] framework that expands the context of LLMs with environmental information and execution
feedback, allowing the model to plan and execute skills [11], translating them into physical actions.
      </p>
      <p>
        Motivation. The primary focus of our work is to enable robots to interact flexibly and robustly in
dynamic and diverse environments with limited human intervention. Traditional robotic systems usually
rely on static, pre-programmed instructions or closed-world predefined knowledge and settings, limiting
their adaptability to dynamic environments. Interacting with humans in daily tasks within complex
environments disrupts these assumptions. LLMs and VLMs can provide open-domain knowledge to
represent novel conditions without human intervention. However, these models are not informed
about the specific robot, task, and setting at hand, which define what information is relevant
and necessary to find and reason about [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. An excessive level of detail may lead to impractical
computational requirements and response times. Discarding crucial information, spatial or semantic,
may lead to repeated failures due to the introduced non-managed partial observability [13]. Finding
the relevant information may be too slow [14]. LLMs can still produce outputs that are logically
inconsistent or impractical [15], especially if they are not integrated into systems that allow them to
adapt to changes in the environment and the physical capabilities of the robots. During task execution,
robots may encounter unexpected situations, such as unanticipated obstacles, sensor errors, or changes
in the environment that were not accounted for in the initial plan. Such scenarios necessitate robust
error-handling mechanisms and adaptive planning strategies that enable the system to reassess and
modify its actions in real time [16]. By introducing execution control and failure management into
the planning process at different levels, as well as retrieval of previous successful plans, we propose a
solution to enhance the robustness and flexibility of LLM-based robotic systems. This approach ensures
that the robot can effectively perceive changes in the environment and the failures that may arise from
them, allowing it to adapt its strategies in response to new challenges.
      </p>
      <p>
        Proposed approach. Our system addresses the challenges of dynamic environments through a
real-time Perception module and a Planner module that integrates execution control and failure management.
It comprises a Controller that monitors the execution of tasks and detects errors, while the Explainer
analyzes failures and suggests adjustments based on past experiences. This feedback loop enables
adaptive re-planning, allowing the system to modify its actions as needed. Specifically, we propose
the use of the ReAct [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] framework, expanding its operational space with skills, the physical actions
of the robot. By leveraging LLMs for natural language understanding and a perception system, the
architecture supports autonomous task execution in dynamic scenarios.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>
        A substantial body of literature explores the utilization of LLMs for robotic task planning [4, 5].
      </p>
      <p>
        LLM for robot planning. Recent works highlight the potential of Large Language Models (LLMs)
in robotic planning [
        <xref ref-type="bibr" rid="ref17 ref18 ref19">17, 18, 19</xref>
        ]. DEPS [20] introduces an iterative planning approach for agents in
open-world environments, such as Minecraft. It utilizes LLMs to analyze errors during execution and
refine plans, improving both reasoning and goal selection processes. However, this approach has been
primarily developed and tested in virtual environments, with notable differences in comparison to
real-world settings due to the dynamic and unpredictable nature of physical environments. Additionally,
DEPS does not leverage previous issues and solutions but relies solely on feedback from humans and
vision-language models (VLMs).
      </p>
      <p>
        Scene graph as environmental representation. The use of scene graphs [21] as a means to
represent the robot's environment has gained traction. [22] employs 3D scene graphs to represent
environments and uses LLMs to generate Planning Domain Definition Language (PDDL) files. This
method decomposes long-term goals into natural language instructions and enhances computational
efficiency by addressing sub-goals. However, it lacks a mechanism for replanning based on feedback
during execution, which could limit its adaptability in dynamic scenarios. SayPlan [23] integrates
semantic search with scene graphs and path planning to aid robots in navigating complex environments
through natural language. By combining these techniques, SayPlan simulates various scenarios to
refine task sequences, which helps improve overall task performance in complex environments.
      </p>
      <p>
        Replanning. Replanning enables long-term autonomous task execution in robotics [24]. DROC
[25] empowers robots to process natural language corrections and generalize that information to new
tasks. It introduces a mechanism to distinguish between high-level and low-level errors, allowing
more flexible plan corrections. However, DROC does not address the types of failures that may occur
during plan execution, focusing instead on high-level corrections provided by users. [26] supports
autonomous long-term task execution by integrating LLMs for planning and VLMs for feedback. This
approach adapts to changes in the environment through a structured component system that verifies
and corrects plans as needed. Yet, the feedback is limited to what is visible to the robot's camera,
potentially overlooking other significant environmental changes.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Architecture</title>
      <sec id="sec-3-1">
        <title>Our system is based on two components:</title>
        <p>• Perception Module: it is responsible for sensing and interpreting the environment. It builds
and mantains a semantic map in the form of a directed graph that integrates both geometric and
semantic information.
• Planner Module: it takes the information provided by the Perception Module to formulate plans
and actions that allow the robot to perform specific tasks.</p>
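      <p>As a hedged illustration of the Perception Module's output (not the authors' actual schema), the following Python sketch encodes a semantic map as a directed graph whose nodes carry both semantic labels and poses; the node kinds, relation names, and fields are illustrative assumptions.</p>
      <preformat>
# A minimal sketch of a semantic map as a directed graph. Node kinds,
# relation names ("contains", "on"), and pose fields are assumptions for
# illustration, not the system's actual data model.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str                        # e.g. "kitchen", "bottle"
    kind: str                        # semantic label: "room" | "furniture" | "object"
    pose: tuple = (0.0, 0.0, 0.0)    # geometric information (x, y, z)

@dataclass
class SemanticMap:
    nodes: dict = field(default_factory=dict)   # name -> Node
    edges: list = field(default_factory=list)   # (src, relation, dst) triples

    def add(self, node):
        self.nodes[node.name] = node

    def relate(self, src, relation, dst):
        self.edges.append((src, relation, dst))

    def objects_in(self, room):
        # Follow "contains" edges out of a room node.
        return [d for s, r, d in self.edges if s == room and r == "contains"]

m = SemanticMap()
m.add(Node("kitchen", "room"))
m.add(Node("table_2", "furniture", (1.2, 0.4, 0.0)))
m.add(Node("bottle", "object", (1.25, 0.42, 0.9)))
m.relate("kitchen", "contains", "table_2")
m.relate("kitchen", "contains", "bottle")
m.relate("bottle", "on", "table_2")
print(m.objects_in("kitchen"))  # ['table_2', 'bottle']
      </preformat>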
      <p>Figure 1 shows how these components interact to allow the robot to understand its environment
and act accordingly to satisfy user requests. The Perception module uses data provided by the robot's
sensors to supply the semantic map to the Planner module, which in turn processes it to generate
specific action plans. In what follows we address the Planner Module in detail, while details on the
Perception Module will be provided in a separate article.</p>
      <sec id="sec-3-1">
        <title>3.1. Planner module</title>
        <p>The architecture of the Planner module is designed to translate user requests, expressed in natural
language, into specific actions executable by a robot. This module is responsible for understanding
instructions, planning appropriate actions, and managing the execution of those actions in a dynamic
environment. The Planner module is composed of five sub-modules:
• Task Planner: translates user requests, expressed in natural language, into a sequence of
high-level skills.
• Skill Planner: translates high-level skills into specific, low-level executable commands.
• Executor: executes the low-level actions generated by the Skill Planner.
• Controller: monitors the execution of actions and manages any errors or unexpected events
during the process.
• Explainer: interprets the causes of execution failures by analyzing data received from the
Controller and provides suggestions to the Task Planner on how to adjust the plan.</p>
        <p>The architecture of the Planner module is shown in Figure 2. The main component of the system is
the Task Planner, which receives the user's request and translates it into a list of high-level "skills"
that represent the robot's capabilities. These skills include actions such as "PICK" (grasp an object),
"PLACE" (place an object), and "GOTO" (move to a position).</p>
        <sec id="sec-3-1-1">
          <title>3.1.1. Task Planner</title>
          <p>The decision-making process of the Task Planner is driven by a policy, which is implemented as an
LLM. A policy is a strategy or rule that defines how actions are selected based on the current state or
context [27].</p>
          <p>The Task Planner is implemented using the ReAct framework [10], which alternates between reasoning
and action phases during the process. In the reasoning phase, the Task Planner can access various
"perception" actions to gather information from the environment, such as the semantic map and the
current state of the robot, and can execute one or more "skill" actions to perform physical actions.
The classical idea of ReAct is to augment the agent's action space to Â = A ∪ L, where L is the
space of language-based reasoning actions. An action â_t ∈ L, referred to as a "thought" or reasoning
trace, does not directly affect the external environment but instead updates the current context,
c_{t+1} = (c_t, â_t), by adding useful information to support future decision-making [10]. In the classical idea
there could be various types of useful thoughts, such as decomposing task goals and creating action
plans, injecting commonsense knowledge relevant to task solving, extracting important parts from
observations, tracking progress and transitioning action plans, handling exceptions and adjusting action
plans, and so on, but always without modifying the physical environment, only embedding it within
the context. Interestingly, this approach mixes reasoning and action in a flexible manner. In the future,
we will analyse the potential of this approach also connecting it to the planning-to-plan [28, 29] and
meta-reasoning [30, 31, 32] concepts.</p>
          <p>In our work, we augment the agent's action space [33] with two types of actions:
• A skill action a_t ∈ A_skill, which involves physically interacting with the environment, such as
manipulating objects or navigating. The result of a skill action provides new feedback that
updates the current context.
• A perception action a_t ∈ A_perception, which involves accessing information from the environment,
such as querying the semantic map or sensors, and integrating that information into the context.</p>
          <p>The augmented action space is defined as Â = A_skill ∪ A_perception ∪ L, where L is the space of
language-based reasoning actions. Thus, the LLM serves as the policy π that selects different types of
actions from the augmented action space, dynamically adapting the current context c_t used to plan
based on real-time information and feedback.</p>
        <sec id="sec-3-3-1">
          <title>Formal Description:</title>
          <p>The Task Planner's policy, represented by the LLM, can be formalized as a
function that maps the current context c_t to an action â_t from the augmented action space Â:
π : C → Â, with π(c_t) = â_t, where:
• C is the set of all possible contexts.
• Â is the augmented action space Â = A_skill ∪ A_perception ∪ L.
• c_t ∈ C represents the current context at time t, which includes the state of the robot, the environment,
and any past actions or thoughts.
• â_t ∈ Â is the action chosen by the policy, which can be a skill action a_t ∈ A_skill, a perception action
a_t ∈ A_perception, or a reasoning trace â_t ∈ L.</p>
          <p>The context c_t is updated based on the chosen action:
• If â_t ∈ L (a reasoning action), the context updates to c_{t+1} = (c_t, â_t).
This represents the thought process, where reasoning contributes new information without
affecting the external environment.
• If â_t ∈ A_perception (a perception action), the result of querying the environment updates the
context: c_{t+1} = (c_t, o_perception(â_t)).
Here, o_perception represents the function that gathers information and modifies the context based
on the perception action's outcome.
• If â_t ∈ A_skill (a skill action), the robot interacts with the environment, and the context updates
based on feedback from the physical action: c_{t+1} = (c_t, o_skill(â_t)),
where o_skill is the function that captures the result of executing a physical skill, such as
manipulating an object or moving to a location.</p>
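          <p>The three update rules above can be made concrete in a few lines. The sketch below is a minimal Python rendering of the context update, with the perception and skill functions stubbed out; the tuple-based context encoding is our own assumption.</p>
          <preformat>
# Hedged sketch of the context-update rules: the action kinds mirror the
# formalism (reasoning in L, perception in A_perception, skill in A_skill).
def update_context(context, action, perceive, execute_skill):
    """Return c_{t+1} given c_t and the chosen action (kind, payload)."""
    kind, payload = action
    if kind == "reasoning":     # â_t in L: the thought itself extends the context
        return context + [("thought", payload)]
    if kind == "perception":    # â_t in A_perception: append o_perception(â_t)
        return context + [("observation", perceive(payload))]
    if kind == "skill":         # â_t in A_skill: append o_skill(â_t)
        return context + [("feedback", execute_skill(payload))]
    raise ValueError("unknown action type: " + kind)

# Toy stand-ins for the real perception system and robot:
perceive = lambda query: {"bottle": "on table_2"}
execute_skill = lambda skill: "ok"

c = [("user", "pick up the bottle")]
c = update_context(c, ("reasoning", "I should locate the bottle first"),
                   perceive, execute_skill)
c = update_context(c, ("perception", "GetObjectInRoom(kitchen)"),
                   perceive, execute_skill)
print(c[-1])  # ('observation', {'bottle': 'on table_2'})
          </preformat>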
        </sec>
        <sec id="sec-3-3-2">
          <title>3.1.2. Skill Planner</title>
          <p>Once a high-level request for the execution of a skill is made, the Skill Planner is responsible for
translating the high-level skills, provided by the Task Planner, into sequences of low-level commands
executable by the robot. While the Task Planner focuses on understanding natural language and creating
a general plan, the Skill Planner deals with the specific details of how each skill should be executed,
considering the robot’s state and the environment.</p>
          <p>Let a skill be represented in the following general form, defined by the Task Planner with a specific
syntax:
SKILL_NAME(param_1, param_2, ..., param_N)
Where:
• SKILL_NAME is the name of the skill to be executed (e.g., PICK, PLACE, GOTO).
• param_1, param_2, ..., param_N are parameters for the skill, such as the object to manipulate
or the destination to navigate to.</p>
          <p>Using a strict syntax ensures that the Skill Planner can correctly interpret the high-level commands
without ambiguity. For instance, a natural language command like "Move near the table and grab the
bottle" would lack precision. The Skill Planner needs concrete parameters for the robot to act effectively.</p>
          <p>Skill Planner workflow: The Skill Planner operates by performing three functions:</p>
          <p>1. Precondition Verification: Before translating a skill into low-level commands, the Skill Planner
verifies that the necessary preconditions for execution are met. Let s_t represent the current state of the
robot and the environment at time t, and P(skill, s_t) denote a function, defined for every skill, that evaluates the
preconditions for that skill. The precondition check can be expressed as:
P(skill, s_t) = 1 if all preconditions are met, 0 otherwise.
For example, before executing the PICK skill, the following checks may be performed:
• The object is visible to the robot.
• The object is reachable by the robotic arm.
• The robotic arm is free.
If any of these conditions are not met (P(skill, s_t) = 0), the Skill Planner reports a failure to the Task
Planner (a minimal sketch of such a check follows).</p>
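          <p>Below is a hedged Python sketch of the precondition check P(skill, s_t), assuming a dictionary-based robot state; the field names (object_visible, object_reachable, arm_free) are illustrative, taken from the PICK checks listed above.</p>
          <preformat>
# Hedged sketch of precondition verification; state fields are assumptions.
def pick_preconditions(state):
    """P(PICK, s_t): True iff all preconditions for PICK hold."""
    return (state.get("object_visible", False)
            and state.get("object_reachable", False)
            and state.get("arm_free", False))

PRECONDITIONS = {"PICK": pick_preconditions}

def check(skill, state):
    # Skills without a modeled precondition function pass by default.
    p = PRECONDITIONS.get(skill, lambda s: True)
    return p(state)

state = {"object_visible": True, "object_reachable": False, "arm_free": True}
if not check("PICK", state):
    print("failure reported to the Task Planner: preconditions not met")
          </preformat>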
          <p>2. Target nodes extraction: Based on the parameters of the skill, the Skill Planner extracts the
target nodes from the semantic map ℳ, which contains geometric and semantic information about
the environment. Every node provides geometric information, such as the object's position, and relevant
context, which is then used to generate low-level commands.</p>
          <p>3. Generation of Low-Level Commands: When P(skill, s_t) = 1, the Skill Planner translates the skill
into a sequence of low-level commands to control the robot's behavior. In this system, we represent skill
decomposition into commands as Hierarchical Task Networks (HTNs) that contain low-level commands
executable by the robot. Let T(skill, node, s_t) denote the function that translates the given skill into
low-level commands based on the target nodes extracted from the semantic map and the current state.
The output is a sequence of pre-modeled commands parameterized with the information of the robot
state and the target nodes, {c_1, c_2, ..., c_N}, where each command c_i directs specific components of
the robot. Our implementation uses HTNs solely for the breakdown of skills into commands, without
using advanced features like re-planning or error recovery at the command level. In this case, if
any command fails, the entire skill fails, with no attempt at re-planning at the Skill Planner level. The
process can be represented as:
{c_1, c_2, ..., c_N} = T(skill, node, s_t)</p>
          <p>The Skill Planner is designed to be flexible and extendable. The skill functions P and T can be
adapted or extended to accommodate new skills, hardware, or environments. A minimal sketch of
such a decomposition follows.</p>
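          <p>As a hedged illustration, the decomposition function T(skill, node, s_t) can be sketched as a lookup from skill names to pre-modeled command sequences. The phase names follow the PICK example in Section 4; the command encoding is an assumption for illustration.</p>
          <preformat>
# Hedged sketch of T(skill, node, s_t): skill -> ordered low-level commands.
def decompose(skill, node, state):
    """Return the command sequence {c_1, ..., c_N} for a skill."""
    if skill == "PICK":
        target = node["pose"]                 # geometric info from the map node
        arm = state.get("arm", "right")
        return [
            ("approach", {"arm": arm, "pose": target, "gripper": "open"}),
            ("grasp",    {"arm": arm, "gripper": "close", "verify": True}),
            ("lift",     {"arm": arm, "dz": 0.10}),
        ]
    raise NotImplementedError("no decomposition modeled for " + skill)

cmds = decompose("PICK", {"pose": (1.25, 0.42, 0.9)}, {"arm": "right"})
for c in cmds:
    print(c)
          </preformat>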
        <sec id="sec-3-4-1">
          <title>3.1.3. Executor</title>
          <p>The Executor is responsible for directly interacting with the robot’s hardware to execute the commands
provided by the Skill Planner. It translates the low-level commands into physical actions by controlling
various hardware elements such as motors, robotic arm grippers, and other actuators required for task
execution.</p>
          <p>Let the set of low-level commands generated by the Skill Planner be represented as above, i.e.,
{c_1, c_2, ..., c_N} = T(skill, node, s_t), where T(skill, node, s_t) defines the sequence of commands
based on the skill, the target node, and the current state of the robot and the environment.</p>
          <p>The Executor is tasked with executing these commands on the physical robot. Let the state of the
robot at time t be denoted by h_t, and the function that maps a low-level command c_i to an effect on the
robot's state be denoted as E(c_i, h_t). The execution of a command at time t can be described as:
h_{t+1} = E(c_i, h_t)
where h_{t+1} is the updated state after executing the command c_i. This process is repeated for each
command in the sequence {c_1, c_2, ..., c_N} until the entire skill is executed.</p>
          <p>Executor workflow:
• Command reception: The Executor receives a set of low-level commands {c_1, c_2, ..., c_N}
from the Skill Planner. Each command specifies a concrete action to be performed by the robot's
hardware components.
• Hardware interaction: For each command c_i, the Executor interacts with the robot's hardware,
adjusting the motors, grippers, and other actuators. This interaction can be represented by the
function E(c_i, h_t) that determines the effect of a command on the robot's state h_t.
• Command execution: The Executor executes each command c_i in the sequence, ensuring that
the robot's state transitions from h_t to h_{t+1}. Formally:
h_{t+1} = E(c_i, h_t), ∀ i = 1, 2, ..., N
After executing all commands, the robot reaches the final state h_{t+N}, corresponding to the
completion of the skill.
• Real-time feedback: During execution, the robot provides feedback on its current state. Let
f_t denote the feedback at time t, and f_{t+1} = F(c_i, h_t) be the updated feedback after executing
command c_i, where F is the feedback function. If unexpected feedback f_{t+1} is received, the Executor can trigger
adjustments to the plan or inform the Skill Planner of a potential issue.</p>
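          <p>The workflow above amounts to a fold of E over the command sequence, with a feedback check after every step. Below is a minimal sketch under that reading; the stub functions stand in for the robot's actual hardware interface, which this sketch does not claim to reproduce.</p>
          <preformat>
# Hedged sketch of the Executor loop: h_{t+1} = E(c_i, h_t) for each command,
# with feedback surfaced to the Skill Planner on anomaly.
def run_commands(commands, state, apply_command, read_feedback):
    """Run {c_1..c_N}; return (final_state, None) or (state, failed_command)."""
    for cmd in commands:
        state = apply_command(cmd, state)   # h_{t+1} = E(c_i, h_t)
        feedback = read_feedback(state)     # f_{t+1}
        if feedback != "ok":                # unexpected feedback
            return state, cmd               # report a potential issue
    return state, None

state, failed = run_commands(
    [("approach", {}), ("grasp", {}), ("lift", {})],
    state={"holding": None},
    apply_command=lambda c, h: dict(h, last=c[0]),  # stub hardware effect
    read_feedback=lambda h: "ok",                   # stub feedback channel
)
print("skill completed" if failed is None else "failed at " + failed[0])
          </preformat>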
          <p>Different robots may use different communication protocols and hardware configurations. Therefore,
the Executor must be adapted for each specific robot system, ensuring that it correctly interacts with
the robot's hardware.</p>
        </sec>
        <sec id="sec-3-4-2">
          <title>3.1.4. Controller</title>
          <p>The Controller is responsible for monitoring the robot’s status and the environment during command
execution, ensuring that they are carried out as planned. After each command is executed, the Executor
sends feedback indicating either success or failure. If a failure occurs, it results in the failure of the
entire skill. Upon the completion of all commands, a success feedback will indicate the successful
execution of the skill.</p>
          <p>Denote by f_t the feedback from the Executor at time t. The Controller processes f_t to determine the outcome
of the executed skills. The feedback can be classified into two categories: success and failure.</p>
          <p>Feedback processing:
• Success: If the feedback f_t indicates successful execution of a command and it is the last
command to execute, the skill is considered successfully completed, and the Controller sends a positive
acknowledgment to the Task Planner to continue the planning process. However, if the feedback
indicates success but the command is not the last one, the Controller waits for the execution of
the next command. This can be represented as:
if f_t = Success ⟹ Task Planner continues
• Failure: If a failure occurs during the execution of any command, the planned skill fails and the
Controller generates a failure message m_t that includes the reason for the failure. This message
is sent to the Explainer. Let e_t represent the specific error detected at time t. The failure message
can be represented as:
m_t = Failure(e_t)
where e_t can include various error reasons such as obstacles detected, non-executable trajectories,
or environmental changes.</p>
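          <p>A compact sketch of the Controller's decision rule, directly mirroring the two cases above; the string-based feedback encoding is an assumption for illustration.</p>
          <preformat>
# Hedged sketch of feedback processing: success on the last command completes
# the skill; any failure produces m_t = Failure(e_t) for the Explainer.
def process_feedback(feedback, is_last, error=""):
    if feedback == "success":
        if is_last:
            return ("ack", None)       # skill done: Task Planner continues
        return ("wait", None)          # await the next command's feedback
    return ("failure", "Failure(" + error + ")")  # forwarded to the Explainer

print(process_feedback("success", is_last=True))
print(process_feedback("failure", is_last=False,
                       error="non-executable trajectory"))
          </preformat>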
          <p>The Controller's operation is highly dependent on the specific robot system in use, as it relies on the
characteristics of the robot and the employed software system. In a ROS environment, for example, the
Controller interacts with ROS nodes that control the robot's hardware. In our work, RoBee, described
in Section 5, has a system that allows feedback to be obtained on the execution of commands.</p>
        <sec id="sec-3-5-1">
          <title>3.1.5. Explainer</title>
          <p>The Explainer component plays a critical role in enhancing the planning process by providing insights
to the Task Planner when failures occur during the execution phase. After receiving the failure reason,
the Explainer searches a dataset D for previous instances of similar failures. This dataset comprises
records of failures associated with specific skills and user requests. Let D_f denote the subset of the
dataset containing records of failures and solutions related to the same skill and error message. The
dataset has been manually built based on previous experiences, desired behaviors, and expected failures.</p>
          <p>The search can be expressed as:
D_f = {(s_i, r_i, e_i) ∈ D | s_i = skill_name, e_i = e_t, r_i ∼ user_request}
where:
• s_i is the skill being executed (e.g., PICK).
• r_i represents the specific user request associated with the failure.
• e_i is the failure reason provided by the Controller.
• r_i ∼ user_request indicates that the user request in the dataset is similar to the current user
request.</p>
          <p>Rather than searching for an exact match to the user's request, the Explainer assesses the similarity of
the user's request r_i to the instances in the dataset linked to the suggestion, using cosine similarity in
our approach [34]. This method enables the system to identify the most relevant past instances, even
when the user's requests are not identical.</p>
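          <p>The retrieval step can be sketched as a filter on skill and error followed by a cosine-similarity ranking over request embeddings. The toy embed() below is a placeholder for whatever sentence encoder the system actually uses; the dataset fields are illustrative assumptions.</p>
          <preformat>
# Hedged sketch of the Explainer's retrieval: filter D by skill and error,
# then rank candidates by cosine similarity of the user requests.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def embed(text):
    # Placeholder embedding (letter counts); a real system would call a
    # sentence-embedding model here.
    v = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha() and 0 ... 0:
            pass
    return v

def suggest(dataset, skill, error, user_request):
    candidates = [r for r in dataset
                  if r["skill"] == skill and r["error"] == error]
    if not candidates:
        return None
    q = embed(user_request)
    best = max(candidates, key=lambda r: cosine(q, embed(r["request"])))
    return best["suggestion"]

dataset = [{"skill": "PICK", "error": "object too far",
            "request": "grab the bottle on the table",
            "suggestion": "Use the GOTO skill to move near the object to pick"}]
print(suggest(dataset, "PICK", "object too far", "pick up the bottle"))
          </preformat>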
          <p>Once relevant instances are identified, the Explainer analyzes these cases to generate a suggestion
for the Task Planner on how to proceed. The suggestion is structured as follows:
g_t = Suggest(D_f)</p>
          <p>For instance, if the Controller reports the failure reason:
e_t = "Cannot execute the approach movement for the PICK skill, object too far"
the Explainer analyzes this failure and may find a previous instance where the robot successfully
resolved a similar issue. It could recommend a command to the Task Planner:
g_t = "Use the GOTO skill to move near the object to pick"
This suggestion enables the Task Planner to adjust its strategy effectively, moving the robot closer to
the object before attempting the PICK action again.</p>
          <p>The suggestions provided by the Explainer can be tailored to accommodate specific behaviors of the
robot. This adaptability can be achieved by modifying the parameters of the data utilized to generate
the suggestions. Thus, the Explainer enhances the resilience of the system, facilitating its ability to
adapt to changing conditions and recover from errors efficiently.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Behavior example of the system</title>
      <p>To illustrate the proposed system's behavior in a dynamic environment, consider a scenario where the
robot is tasked with picking up a bottle from a table in a room and placing it in a designated area in
another room. In this example, the system must react to unexpected changes, such as the bottle being
moved by an external agent during the task.</p>
      <p>Example. User Request: The user provides the instruction: "Go to the table in the kitchen, pick
up the bottle, and place it on the table in the bedroom." Perception Module: The robot senses the
environment and generates a semantic map, which includes the pose and types of objects. Task
Planner: The request is interpreted, and the Task Planner begins the planning procedure. It starts by
calling perception actions to gather information about the environment and the robot's state. Using
the results of these perception actions, the Task Planner determines the next steps in its execution
by calling skill actions and incorporating the outcomes of each skill into the planning process. The LLM
interprets the request and reasons as follows.</p>
      <p>User Request: ”Go to the table in the kitchen, pick up the bottle, and place it on the table in the
bedroom.”
Robot State: The robot is currently in the bedroom and has both the right and left arms empty.
Thought: The task is to fetch the bottle from the kitchen and place it on the table in the bedroom.
I need to know all the rooms where the robot can go to.</p>
      <p>Perception action: GetMapRooms()
Observation: The robot identifies the rooms: [kitchen, bedroom].</p>
      <p>Thought: Now, I will navigate to the kitchen to find the bottle.</p>
      <p>Skill action: GOTO(kitchen)
Observation: The robot moves to the kitchen.</p>
      <p>Thought: I should identify the specific location of the bottle within the kitchen.</p>
      <p>Perception action: GetObjectInRoom(kitchen)
Observation: The robot finds the following objects in the kitchen: [sink on the table_1,
stove on the table_1, bottle, table_1, table_2].</p>
      <p>Thought: I will now move to the location of the bottle.</p>
      <p>Skill action: GOTO(bottle)
Observation: The robot moves to the bottle’s location in the kitchen.</p>
      <p>Thought: I will pick up the bottle now.</p>
      <p>Skill action: PICK(bottle, right)
Observation: The robot can't see the bottle to pick; find a bottle in the semantic map and try
again to pick it.</p>
      <p>Thought: The bottle is not visible where expected, I need to check the location of the bottle
again.</p>
      <p>Perception action: GetObjectInRoom(kitchen)
Observation: The robot finds the following objects in the kitchen: [sink on the table_1,
stove on the table_1, bottle on the table_2, table_1, table_2 ].</p>
      <p>Thought: I will go to the table to pick up the bottle instead.</p>
      <p>Skill action: GOTO(table_2)
Observation: The robot moves to the table.</p>
      <p>Thought: I will pick up the bottle from the table now.</p>
      <p>Skill action: PICK(bottle, right)
Observation: The robot successfully picks up the bottle.</p>
      <p>Thought: Next, I need to navigate to the bedroom to place the bottle.</p>
      <p>Skill action: GOTO(bedroom)
Observation: The robot moves to the bedroom.</p>
      <p>Thought: I need to locate the table in the bedroom where I will place the bottle.</p>
      <p>Perception action: GetObjectInRoom(bedroom)
Observation: The robot sees: [table, bed, lamp on the table] in the bedroom.</p>
      <p>Thought: I will approach the table to place the bottle.</p>
      <p>Skill action: GOTO(table)
Observation: The robot approaches the table.</p>
      <p>Thought: Finally, I will place the bottle on the table.</p>
      <p>Skill action: PLACE(bottle, right)
Observation: The robot places the bottle on the table successfully.</p>
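      <p>The trace above assumes a small registry of callable perception and skill actions that the Task Planner can invoke by name. The sketch below shows one plausible shape for that interface; the return values are stubs, not RoBee's real API.</p>
      <preformat>
# Hedged sketch of the action interface behind the trace: each perception or
# skill action is a named callable returning an observation.
ACTIONS = {
    "GetMapRooms":     lambda: ["kitchen", "bedroom"],           # perception
    "GetObjectInRoom": lambda room: {"bottle": "on table_2"},    # perception
    "GOTO":            lambda target: "moved to " + target,      # skill
    "PICK":            lambda obj, arm: "picked " + obj,         # skill
    "PLACE":           lambda obj, arm: "placed " + obj,         # skill
}

def dispatch(name, *args):
    """Map one LLM-emitted action line, e.g. GOTO(kitchen), to an observation."""
    return ACTIONS[name](*args)

print(dispatch("GetMapRooms"))               # ['kitchen', 'bedroom']
print(dispatch("GOTO", "kitchen"))           # moved to kitchen
print(dispatch("PICK", "bottle", "right"))   # picked bottle
      </preformat>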
      <p>Skill Planner: For each high-level skill, once the preconditions are met, the Skill Planner translates
the chosen skill into a sequence of low-level commands, such as motor movements for navigation,
arm articulation for picking, and placing actions. For example, once the PICK(bottle, right) skill is
planned, it can be translated and divided into the following phases with their relative commands:
• Approach: The robot arm moves towards the object's position, making any necessary adjustments
to align correctly, and opens the gripper.
• Grasp: The robot activates the gripping mechanisms to seize the object. This phase includes
closing the gripper and verifying the grasp.</p>
      <p>• Lifting: The robot lifts the object from the surface it is on.</p>
      <p>Execution: The Executor begins executing the planned skill, which is composed of a sequence of
commands from the Skill Planner. The Executor follows the ordered steps to achieve the goal. For example,
with the skill PICK(bottle, right), the Executor receives the list of commands and executes:
• Execute approach: The robot arm moves towards the object's position and opens the gripper.
• Execute grasp: This phase includes closing the gripper and verifying the grasp.</p>
      <p>• Execute lifting: The robot lifts the object from the surface it is on.</p>
      <p>Thus, when an unexpected event occurs, such as the bottle being moved or becoming unreachable, the Executor
may raise a failure message.</p>
      <p>Controller and Explainer interaction:
• The Controller detects that the object is no longer in the expected location and sends a failure
message to the Explainer.
• The Explainer analyzes the failure, referencing previous instances where objects were moved
unexpectedly. It suggests that the Task Planner re-analyse the semantic map and update the object's
location.</p>
      <p>Re-planning: Based on the suggestion, the Task Planner issues a new plan:
• Execute GOTO(table) to go near the identified bottle.
• After locating the bottle on the table, the robot updates its actions and proceeds to execute the
remaining tasks.</p>
      <p>This example demonstrates how the system adapts in real-time, allowing for continuous task execution
even in dynamic and unpredictable environments.</p>
      <p>
        Planning algorithm. We now formalize this process in the form of an adaptive planning algorithm.
In this algorithm, the LLM used is a generalist model such as Llama 3 70B Instruct [35], whose behavior
we influence through in-context learning [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <sec id="sec-4-1">
        <title>3: procedure Planning(,  )</title>
        <p>0 ← InitializeLLMContex(t, , 
while not goal achieveddo
 ←
if  =</p>
        <p>TaskPlanne(r,  0)
”Skill” then
 )
▷
▷</p>
        <p>▷ Get first skill</p>
      </sec>
      <sec id="sec-4-2">
        <title>Execute commands</title>
        <p>▷ Detect failure
Generate suggestion
 ←
 ←
if  =
else
end if
else
end if
Executor()</p>
      </sec>
      <sec id="sec-4-3">
        <title>Falsethen</title>
        <p>←
  ← Explainer( )</p>
        <p>Controlle(r  )
  ← Skill succesfully executed
4:
5:
6:
7:
8:
9:
10:
11:
12:
13:
14:
15:
16:
17:
18:
19:
20:
21:</p>
        <p>SkillPlanne(r, 
)</p>
        <p>▷ Translate skill into low-level commands
  ← CallPerceptionActio(n)
▷ Reading semantic map from Perception Module</p>
        <p>▷ Update context
▷ Get next skill based on updated context
 ←
end while
 +1 ← UpdateContext(  )</p>
        <p>TaskPlanne(r,  +1 )
22: end procedure</p>
        <p>This algorithm shows the adaptive behavior of the system by incorporating feedback loops that
facilitate real-time re-planning. By alternating between action and reasoning phases, the robot can
continuously adapt to changes, ensuring task success even in unpredictable environments.</p>
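      <p>For concreteness, the following is a hedged Python rendering of the procedure above, with every sub-module injected as a stub so the control flow (plan, execute, control, explain, re-plan) can run end to end; the function names mirror the pseudocode, but the signatures are our own assumptions.</p>
      <preformat>
# Hedged sketch of the planning loop; all sub-modules are injected stubs.
def planning(request, robot_state, task_planner, skill_planner, executor,
             controller, explainer, perception, goal_achieved):
    context = [("user", request), ("state", robot_state)]   # c_0
    action = task_planner(request, context)                 # get first step
    while not goal_achieved(context):
        kind, payload = action
        if kind == "skill":
            commands = skill_planner(payload, robot_state)  # low-level commands
            executor(commands)                              # execute commands
            ok, error = controller(payload)                 # detect failure
            if not ok:
                context.append(("suggestion", explainer(error)))
            else:
                context.append(("feedback", payload + " successfully executed"))
        else:
            context.append(("observation", perception(payload)))
        action = task_planner(request, context)  # next step on updated context
    return context

# One-iteration demo with trivial stubs:
trace = planning(
    "pick up the bottle", "arms free",
    task_planner=lambda r, c: ("skill", "PICK(bottle, right)"),
    skill_planner=lambda s, st: [("approach",), ("grasp",), ("lift",)],
    executor=lambda cmds: None,
    controller=lambda s: (True, None),
    explainer=lambda e: "",
    perception=lambda q: {},
    goal_achieved=lambda c: any(k == "feedback" for k, _ in c),
)
print(trace[-1])  # ('feedback', 'PICK(bottle, right) successfully executed')
      </preformat>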
    </sec>
    <sec id="sec-5">
      <title>5. Robot Hardware</title>
      <p>The system was implemented using RoBee, a cognitive humanoid robot developed by Oversonic Robotics.
RoBee measures 160 cm in height and weighs 60 kg. It has 32 degrees of freedom, enabling highly
flexible movement. The robot is equipped with multiple sensors, including cameras, microphones, and
force sensors.</p>
      <p>The cameras provide real-time visual data, supporting navigation and object recognition tasks. The
microphones facilitate audio input, enabling speech recognition and interaction through natural
language processing. The force sensors are used for handling objects, allowing RoBee to adjust grip force
based on the characteristics of the item being manipulated, enhancing precision and safety during
interactions.</p>
      <p>RoBee’s mechanical structure includes two arms capable of bimanual manipulation, each capable of
handling objects weighing up to 5 kg. The system includes a torso and leg system designed for
balance and mobility. RoBee is equipped with LIDAR sensors for real-time environment mapping and
obstacle detection. These LIDAR sensors enable the robot to navigate autonomously through complex
environments, ensuring safe operation in shared spaces. The combination of autonomous navigation
technologies and LIDAR-based detection enhances the ability of RoBee to move eficiently and avoid
collisions in dynamic industrial environments.</p>
      <p>In addition to its physical capabilities, RoBee integrates with cloud-based systems, allowing for remote
monitoring, task scheduling, and data analytics.</p>
      <p>The Planner-module takes into account RoBee’s embodiment, ensuring that the system is aligned with
the robot’s capabilities such as its degrees of freedom, sensor suite, and ability to perform manipulation
and navigation.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Preliminary results</title>
      <p>Preliminary experiments were conducted in a simulated environment replicating two main rooms: a
kitchen and a bedroom, as illustrated in Figure 6.</p>
      <p>During the experiments, three types of requests were tested, each varying in complexity:
• Simple requests: direct commands that involve only one skill. For example, "Pick up the bottle
in front of you", where the task planner needs only to identify the parameters and activate the
appropriate skill.
• Moderately complex requests: tasks that require the robot to perform multiple skills in
sequence, as explicitly described in the command. An example is "Go to the kitchen, pick up the
bottle, and bring it to the table in the bedroom", which involves multiple skills. These tasks require
a higher level of complexity, with planning across several steps and handling potential failures.
• Complex requests: such as "I'm thirsty, can you help me?", which were more open-ended and
required the robot to interpret the task and break it down into multiple steps.</p>
      <p>The results in Table 1 show that the system performed well with simple requests, followed by
moderately complex ones. However, the success rate for complex requests was significantly lower,
with only 25% of the tasks completed correctly. This lower performance was attributed to the system's
difficulty in understanding and managing ambiguous or under-specified instructions.</p>
      <p>It is important to note that these are preliminary results, and further analysis is ongoing. A thorough
evaluation of the data is currently underway, including a comparison with the state of the art in robot
task execution and natural language understanding. This will allow for a deeper understanding of the
system’s strengths and areas for improvement.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <table>
          <thead>
            <tr>
              <th>Request type</th>
              <th>Number of attempts</th>
              <th>Success rate</th>
            </tr>
          </thead>
          <tbody>
            <tr><td>Simple requests</td><td>30</td><td>90%</td></tr>
            <tr><td>Moderately complex requests</td><td>20</td><td>75%</td></tr>
            <tr><td>Complex requests</td><td>10</td><td>25%</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions</title>
      <p>The proposed planning system exhibits notable strengths, particularly its adaptability and its seamless
integration with the robot's diverse set of skills for executing complex tasks. The system's core advantage lies in its ability
to interpret user commands through natural language processing, converting them into high-level
actions that are further refined into low-level, executable tasks. By integrating real-time environmental
feedback from the Perception Module through an extended version of the ReAct framework, the system can
dynamically adjust to unexpected situations, such as obstacles or execution failures. This adaptability
is supported by an architecture where the Task Planner, Skill Planner, Controller, and Explainer
components work in harmony to ensure smooth task execution even in changing environments.
One of the system's key strengths is its ability to manage error recovery through feedback loops,
allowing the robot to adapt quickly to failures during task execution. The Explainer module provides on-the-fly
suggestions to modify the plan based on past errors, enhancing the system's reliability. The use of
semantic maps and scene graphs provides the robot with a structured understanding of its environment,
ensuring that actions are contextually accurate and responsive to real-world conditions.
The integration of LLMs, perceptual feedback, and flexible task planning mechanisms makes the system
highly versatile for complex, dynamic environments. Its implementation on RoBee, the humanoid robot
developed by Oversonic Robotics, has demonstrated its practical potential, positioning it as a valuable
tool for applications requiring advanced human-robot interaction and adaptability in unpredictable
settings.</p>
      <p>
        In the future, besides extending the low-level skill set available, we will investigate the possibility of
autonomously expanding the Explainer dataset, as well as providing similar information directly to the Task
Planner, increasing flexibility and reliability and reducing the number of re-planning events. We will
also study the capability of the system to proactively acquire information about the environment [14] and
human partners, both through sensors [
        <xref ref-type="bibr" rid="ref36">36</xref>
        ] and communication strategies, leveraging the potential for
proactive information-gathering behaviours of LLMs [
        <xref ref-type="bibr" rid="ref37 ref38 ref39">37, 38, 39</xref>
        ]. Moreover, it will be crucial to assess
the reliability of the system both at the planning level and at the communication level, considering
the introduction of embodiment and environment, while the limitations in the pragmatic understanding of
LLMs are still to be understood [
        <xref ref-type="bibr" rid="ref39 ref40 ref41">39, 40, 41</xref>
        ].
      </p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>Special thanks to Oversonic Robotics for enabling the implementation of this project using their
humanoid robot, RoBee.</p>
    </sec>
    <sec id="sec-refs">
      <title>References</title>
      <p>[16] manipulation tasks, in: 2022 IEEE 27th International Conference on Emerging Technologies and Factory Automation (ETFA), IEEE, 2022, pp. 1–4.</p>
      <p>[17] J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman, B. Ichter, P. Florence, A. Zeng, Code as policies: Language model programs for embodied control, in: 2023 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2023, pp. 9493–9500.</p>
      <p>[18] B. Liu, Y. Jiang, X. Zhang, Q. Liu, S. Zhang, J. Biswas, P. Stone, LLM+P: Empowering large language models with optimal planning proficiency, arXiv e-prints (2023) arXiv–2304.</p>
      <p>[19] C. H. Song, J. Wu, C. Washington, B. M. Sadler, W.-L. Chao, Y. Su, LLM-Planner: Few-shot grounded planning for embodied agents with large language models, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 2998–3009.</p>
      <p>[20] Z. Wang, S. Cai, G. Chen, A. Liu, X. Ma, Y. Liang, Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents, arXiv e-prints (2023) arXiv–2302.</p>
      <p>[21] I. Armeni, Z.-Y. He, J. Gwak, A. R. Zamir, M. Fischer, J. Malik, S. Savarese, 3D scene graph: A structure for unified semantics, 3D space, and camera, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 5664–5673.</p>
      <p>[22] Y. Liu, L. Palmieri, S. Koch, I. Georgievski, M. Aiello, Delta: Decomposed efficient long-term robot task planning using large language models, arXiv e-prints (2024) arXiv–2404.</p>
      <p>[23] K. Rana, J. Haviland, S. Garg, J. Abou-Chakra, I. Reid, N. Suenderhauf, SayPlan: Grounding large language models using 3D scene graphs for scalable robot task planning, in: 7th Annual Conference on Robot Learning, 2023.</p>
      <p>[24] M. Cashmore, A. Coles, B. Cserna, E. Karpas, D. Magazzeni, W. Ruml, Replanning for situated robots, in: Proceedings of the International Conference on Automated Planning and Scheduling, volume 29, 2019, pp. 665–673.</p>
      <p>[25] L. Zha, Y. Cui, L.-H. Lin, M. Kwon, M. G. Arenas, A. Zeng, F. Xia, D. Sadigh, Distilling and retrieving generalizable knowledge for robot manipulation via language corrections, in: 2024 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2024, pp. 15172–15179.</p>
      <p>[26] M. Skreta, Z. Zhou, J. L. Yuan, K. Darvish, A. Aspuru-Guzik, A. Garg, Replan: Robotic replanning with perception and language models, arXiv e-prints (2024) arXiv–2401.</p>
      <p>[27] H. Geffner, Non-classical planning with a classical planner: The power of transformations, in: European Workshop on Logics in Artificial Intelligence, Springer, 2014, pp. 33–47.</p>
      <p>[28] D. Ognibene, G. Pezzulo, H. Dindo, Resources allocation in a bayesian, schema-based model of distributed action control, in: NIPS-Workshop on Probabilistic Approaches for Robotics and Control, 2009.</p>
      <p>[29] M. Ho, D. Abel, J. Cohen, M. Littman, T. Griffiths, People do not just plan, they plan to plan, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 2020, pp. 1300–1307.</p>
      <p>[30] S. Russell, E. Wefald, Principles of metareasoning, Artificial Intelligence 49 (1991) 361–395.</p>
      <p>[31] S. Zilberstein, S. J. Russell, Anytime sensing, planning and action: A practical model for robot control, in: IJCAI, volume 93, 1993, pp. 1402–1407.</p>
      <p>[32] R. Ackerman, V. A. Thompson, Meta-reasoning: Monitoring and control of thinking and reasoning, Trends in Cognitive Sciences 21 (2017) 607–617.</p>
      <p>[33] S. J. Russell, P. Norvig, Artificial intelligence: a modern approach, Pearson, 2016.</p>
      <p>[34] F. Rahutomo, T. Kitasuka, M. Aritsugi, et al., Semantic cosine similarity, in: The 7th International Student Conference on Advanced Science and Technology ICAST, volume 4, University of Seoul, South Korea, 2012, p. 1.</p>
      <p>[35] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al., The Llama 3 herd of models, arXiv e-prints (2024).</p>
      <p>[36] D. Ognibene, Y. Demiris, Towards active event recognition, in: IJCAI, 2013, pp. 2495–2501.</p>
      <p>[37] S. Patania, E. Masiero, L. Brini, G. Donabauer, U. Kruschwitz, V. Piskovskyi, D. Ognibene, Large language models as an active bayesian filter: information acquisition and integration, in: Proceedings of the 28th Workshop on the Semantics and Pragmatics of Dialogue - Full Papers, SEMDIAL, Trento, Italy, 2024. URL: http://semdial.org/anthology/Z24-Patania_semdial_0006.pdf.</p>
    </sec>
    <sec id="sec-9">
      <title>8. Online Resources</title>
      <p>More information about RoBee and Oversonic Robotics is available:
• RoBee,
• Oversonic Robotics</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Aineto</surname>
          </string-name>
          , R. De Benedictis,
          <string-name>
            <given-names>M.</given-names>
            <surname>Maratea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mittelmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Monaco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Scala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Serafini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Serina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Spegni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tosello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Umbrico</surname>
          </string-name>
          , M. Vallati (Eds.),
          <source>Proceedings of the International Workshop on Artificial Intelligence for Climate Change, the Italian workshop on Planning and Scheduling</source>
          , the RCRA Workshop on
          <article-title>Experimental evaluation of algorithms for solving problems with combinatorial explosion, and</article-title>
          the Workshop on Strategies, Prediction, Interaction, and
          <article-title>Reasoning in Italy (AI4CC-IPS-RCRA-SPIRIT 2024), co-located with 23rd International Conference of the Italian Association for Artificial Intelligence</article-title>
          (AIxIA
          <year>2024</year>
          ), CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Driess</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Sajjadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lynch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chowdhery</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ichter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wahid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tompson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Vuong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yu</surname>
          </string-name>
          , et al.,
          <article-title>Palm-e: an embodied multimodal language model</article-title>
          ,
          <source>in: Proceedings of the 40th International Conference on Machine Learning</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>8469</fpage>
          -
          <lpage>8488</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ahn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Brohan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chebotar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Cortes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>David</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Finn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Gopalakrishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hausman</surname>
          </string-name>
          , et al.,
          <article-title>Do as I can, not as I say: Grounding language in robotic affordances</article-title>
          , arXiv e-prints (
          <year>2022</year>
          ) arXiv-
          <fpage>2204</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hu</surname>
          </string-name>
          , C. Ma, Y. Liu,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          , et al.,
          <article-title>Large language models for robotics: Opportunities, challenges, and perspectives</article-title>
          ,
          <source>arXiv preprint arXiv:2401.04334</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Gan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Large language models for robotics: A survey</article-title>
          ,
          <source>arXiv e-prints</source>
          (
          <year>2023</year>
          ) arXiv-
          <fpage>2311</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kambhampati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Valmeekam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Guan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Stechly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Verma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bhambri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Saldyt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Murthy</surname>
          </string-name>
          ,
          <article-title>Llms can't plan, but can help planning in llm-modulo frameworks</article-title>
          ,
          <source>arXiv preprint arXiv:2402.01817</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Tellex</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Gopalan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kress-Gazit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Matuszek</surname>
          </string-name>
          ,
          <article-title>Robots that use language</article-title>
          ,
          <source>Annual Review of Control, Robotics, and Autonomous Systems</source>
          <volume>3</volume>
          (
          <year>2020</year>
          )
          <fpage>25</fpage>
          -
          <lpage>55</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Miao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A. A.</given-names>
            <surname>Shah</surname>
          </string-name>
          , et al.,
          <article-title>Scene graph generation: A comprehensive survey</article-title>
          ,
          <source>arXiv e-prints</source>
          (
          <year>2022</year>
          ) arXiv-
          <fpage>2201</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Sui</surname>
          </string-name>
          ,
          <article-title>A survey on in-context learning</article-title>
          ,
          <source>arXiv preprint arXiv:2301.00234</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Shafran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Narasimhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <article-title>React: Synergizing reasoning and acting in language models</article-title>
          ,
          <source>in: International Conference on Learning Representations (ICLR)</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>L.</given-names>
            <surname>Heuss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gebauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Reinhart</surname>
          </string-name>
          ,
          <article-title>Concept for the automated adaption of abstract planning domains for specific application cases in skills-based industrial robotics</article-title>
          ,
          <source>Journal of Intelligent Manufacturing</source>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Shanahan</surname>
          </string-name>
          ,
          <article-title>The frame problem</article-title>
          ,
          <source>Encyclopedia of Cognitive Science</source>
          (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>L. P.</given-names>
            <surname>Kaelbling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Littman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Cassandra</surname>
          </string-name>
          ,
          <article-title>Planning and acting in partially observable stochastic domains</article-title>
          ,
          <source>Artificial Intelligence</source>
          <volume>101</volume>
          (
          <year>1998</year>
          )
          <fpage>99</fpage>
          -
          <lpage>134</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D.</given-names>
            <surname>Ognibene</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Baldassarre</surname>
          </string-name>
          ,
          <article-title>Ecological active vision: four bioinspired principles to integrate bottom-up and adaptive top-down attention tested with a simple camera-arm robot</article-title>
          ,
          <source>IEEE Transactions on Autonomous Mental Development</source>
          <volume>7</volume>
          (
          <year>2014</year>
          )
          <fpage>3</fpage>
          -
          <lpage>25</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Frieske</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ishii</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. J.</given-names>
            <surname>Bang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Madotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fung</surname>
          </string-name>
          ,
          <article-title>Survey of hallucination in natural language generation</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>55</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>O.</given-names>
            <surname>Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rosell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Diab</surname>
          </string-name>
          ,
          <article-title>Reasoning and state monitoring for the robust execution of robotic</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>A. Z.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dixit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bodrova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Takayama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Varley</surname>
          </string-name>
          , et al.,
          <article-title>Robots that ask for help: Uncertainty alignment for large language model planners</article-title>
          ,
          <source>Proceedings of Machine Learning Research</source>
          <volume>229</volume>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>B.</given-names>
            <surname>Magnini</surname>
          </string-name>
          ,
          <article-title>Toward collaborative llms: Investigating proactivity in task-oriented dialogues</article-title>
          ,
          <source>in: Proceedings of the 28th Workshop on the Semantics and Pragmatics of Dialogue - Invited Talks</source>
          ,
          SEMDIAL, Trento, Italy,
          <year>2024</year>
          . URL: http://semdial.org/anthology/Z24-Magnini_semdial_0003a.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>A.</given-names>
            <surname>Martinenghi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Koyuturk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Amenta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ruskov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Donabauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Kruschwitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ognibene</surname>
          </string-name>
          ,
          <article-title>Von neumidas: Enhanced annotation schema for human-llm interactions combining midas with von neumann inspired semantics</article-title>
          ,
          <source>in: Proceedings of the 28th Workshop on the Semantics and Pragmatics of Dialogue - Poster Abstracts</source>
          ,
          SEMDIAL, Trento, Italy,
          <year>2024</year>
          . URL: http://semdial.org/anthology/Z24-Martinenghi_semdial_0045.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>A.</given-names>
            <surname>Martinenghi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Donabauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Amenta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bursic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Giudici</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Kruschwitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Garzotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ognibene</surname>
          </string-name>
          ,
          <article-title>Llms of catan: Exploring pragmatic capabilities of generative chatbots through prediction and classification of dialogue acts in boardgames' multi-party dialogues</article-title>
          ,
          <source>in: Proceedings of the 10th Workshop on Games and Natural Language Processing @ LREC-COLING 2024</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>107</fpage>
          -
          <lpage>118</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>