=Paper=
{{Paper
|id=Vol-3794/paper03
|storemode=property
|title=End-User Personalisation of Humanoid Robot Behaviour Through Vocal Interaction
|pdfUrl=https://ceur-ws.org/Vol-3794/paper3.pdf
|volume=Vol-3794
|authors=Simone Gallo,Giacomo Vaiani,Fabio Paternò
|dblpUrl=https://dblp.org/rec/conf/rfh/GalloVP24
}}
==End-User Personalisation of Humanoid Robot Behaviour Through Vocal Interaction==
End-User Personalisation of Humanoid Robot Behaviour Through Vocal Interaction

Simone Gallo1,2,∗,†, Giacomo Vaiani1,2,† and Fabio Paternò1

1 CNR – ISTI, Via G. Moruzzi 1, 56127 Pisa, Italy
2 University of Pisa – Computer Science Dept., Largo B. Pontecorvo 3, 56127 Pisa, Italy
Abstract
This study explores the integration of Large Language Models with social robots to facilitate End-User Development through natural
language interactions. The paper presents a prototype system embodied in a Pepper robot that allows non-expert users to customise
robot behaviours by defining personalisation rules via vocal commands. This system employs trigger-action programming, enabling
users to create automations based on specific triggers and actions without requiring in-depth technical knowledge. Through an example
scenario, we show how users can program the robot by employing voice commands to execute actions when an event occurs. The
created automations can also involve available IoT objects. The study investigates the potential of natural language interaction to
improve the usability and flexibility of robot programming, offering new possibilities for personalised interactions in various settings.
Keywords
End-User Development, Human-Robot Interaction, Smart Spaces, CEUR-WS
Workshop Robots for Humans 2024, Advanced Visual Interfaces, Arenzano, June 3rd, 2024
∗ Corresponding author.
† These authors contributed equally.
Email: simone.gallo@isti.cnr.it (S. Gallo); giacomo.vaiani@phd.unipi.it (G. Vaiani); fabio.paterno@isti.cnr.it (F. Paternò)
ORCID: 0000-0003-3412-1639 (S. Gallo); 0009-0002-0910-2284 (G. Vaiani); 0000-0001-8355-6909 (F. Paternò)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

1. Introduction

Over recent years, technological advancements have resulted in the development of increasingly sophisticated robots that are more closely integrated into human daily activities. This evolution is especially noticeable in the realm of social robots. These robots are designed to interact with humans in various social contexts, assisting with a range of tasks, including children's language education [1], older adults' cognitive training [2], and smart device management [3]. Moreover, robots could offer advantages over traditional voice assistants, particularly in routine activity detection and support for individuals with functional limitations [4][5][6]. Several end-user development tools have been introduced to facilitate user engagement and customisation of these robots' behaviour. These tools utilise various paradigms, such as block-based [7] and natural language programming [8], enabling users to compose personalisation rules (for instance, having the robot say something when a person is in front of it or perform specific actions based on vocal commands).

Recent advancements in artificial intelligence, particularly with Large Language Models (LLMs), have the potential to enhance robots' communicative and operational capabilities. This enables interactions that can resemble human-like conversations and dynamically adapt to end-user requests [9]. Within this context, trigger-action programming emerges as an effective approach to End-User Development (EUD) in robotic systems. It allows users to define robot behaviours in response to specific events or conditions, offering a user-friendly way to customise robot functionalities without the need for deep technical knowledge.

This paper presents a prototype of a conversational agent embodied in a Pepper robot that utilises an LLM to assist non-expert users in creating personalisation rules in trigger-action format. Through vocal conversations, users can naturally express their preferences for how they want the robot to act when a specific event occurs. These events can be triggered by interacting with the robot itself (e.g., when the robot recognises a person), or by the surrounding smart environment (e.g., the change of temperature in a room, opening or closing windows/doors). In the following sections, we first introduce related work in the field of trigger-action programming for robot end-user development, and then we delve into the architecture of this prototype and an application scenario. Finally, we discuss the next steps for this work.

2. Related Work

Various research studies have explored the possibilities of end-user programming of robot behaviour using the trigger-action paradigm, employing diverse interaction modalities and approaches. Leonardi et al. [10] exploited a graphical web-based wizard interface to enable users without programming skills to define personalisation rules by specifying events and/or conditions (triggers) that, once met, initiate the execution of defined actions. The tool allows the user to select triggers and actions related to smart devices (e.g., the motion detected by a sensor, turning on smart bulbs) and a Pepper robot (e.g., a touch on the robot's head). Thus, it was possible to create automations, such as having Pepper say "Hey, how are you?" when someone entering the room is detected, by combining Internet of Things devices with Pepper. In the present study, we propose the development of automations through direct interaction with the robotic system, as opposed to the utilisation of a separate web-based wizard.

Another contribution by Porfirio et al. [11] presents Tabula, a multimodal end-user development system for programming service robots for personal use in domestic and workplace environments. In this case, the system enables users without programming skills to script tasks, defining humanoid robots' behaviour (a Pepper one) using trigger-action programming and combining natural language with sketches on a visual interface to define the automation. In particular, users can utilise natural language commands (via voice) to define triggers and actions. The resulting automation is visualised on a two-dimensional map, displaying the current environment (e.g., the user's house) and the defined path or actions of the robot. This setup allows users to modify or refine the automation or to implement more complex logic that is difficult to express verbally. The implemented prototype encompasses a set of five actions (e.g., moving to a position, saying something) and two events (e.g., a person approaching or speaking to the robot), enabling the creation of automations like "when the user arrives home, the robot goes to the entrance". Although users have considered the approach promising, the system faces challenges in processing natural language input due to difficulties in understanding complex or ambiguous commands, which leads to errors in automation and user frustration.
Finally, a recent work proposed by Karli et al. [12] developed a system integrating ChatGPT to enable end-users to define robot programs (e.g., for defining the movement of a robotic arm) using natural language instructions. The system interface presents a chat from which the user sends the inputs, a console showing the generated code, and a view of the robot simulator. Beginning with the description of the desired robot behaviour, the user engages in a collaborative process that iteratively defines and debugs the specifications with the system to address the request. While the natural language interaction is effective, the study emphasises several critical points regarding the use of LLMs in this context. It highlights the necessity of enhancing the reliability of LLM-generated code through accurate code verification processes, crafting more effective prompts, and adjusting prompts dynamically to better fit the context.

In general, LLM approaches open new possibilities for applications across a broad range of settings in end-user development, highlighting the potential to enhance the usability and flexibility of robot programming.
3. Our Approach

In this proposal, we introduce the combined use of Pepper with an LLM agent, aiming to define the robot's behaviour through specific trigger-action personalisation rules (also called automations) expressed by voice. By integrating an LLM as a natural language processing module, users can communicate with Pepper in a more intuitive and conversational way. This enhancement enables Pepper to process complex commands and questions, significantly improving its usability and interactive capabilities. Furthermore, this system design lets users create automations verbally, eliminating the need for any programming skills. Figure 1 illustrates the architecture of the designed system.

Figure 1: System Architecture

Specifically, this prototype aims to create automations that include triggers and actions related to both a smart environment (e.g., a smart home) and the robot. Through these automations, it is possible to define events, conditions, and actions related to both sensors and smart objects (e.g., motion sensors, smart light bulbs, smart thermostats) and Pepper (e.g., recognise a person, display something on its tablet, say something). In this way, the robot becomes part of a smart ecosystem in which it can perform actions in response to triggers related to the environment or be itself the trigger of events for the execution of actions by smart objects. On a technical level, Pepper can be considered a system entity at the same level as sensors and smart objects and can thus be integrated into control systems of smart environments. The automations created are then executed through an Automation Manager (we use Home Assistant, https://www.home-assistant.io/, as Automation Manager because it is open-source, robust, and widely used with an active community). In particular, we consider the following triggers and actions involving Pepper:

• Triggers: chest button press, head touch, hand touch (both right and left hand), face recognition, emotion recognition, and speech recognition.
• Actions: speak, hand movement, run animation, position change, video camera activation, LEDs state change, show something on display.

When the user speaks, the robot utilises the Google Speech API to identify the spoken sentence. The system then makes an API call to the Dialogue Manager to process the user message. The API response, which is delivered using Pepper's voice, contains the response message for the user. Once the automation is complete, it is saved in a database and executed through the Automation Manager.

More generally, our approach opens the possibility of creating automations that involve robots and surrounding connected sensors and objects. Indeed, the created automations can involve both triggers and actions associated with the robot (e.g., when Pepper recognises a specific person, it says "Hello [name]" and does a greeting animation), but it is also possible to have triggers activated by external sensors or objects and actions executed by the robot. Vice versa, triggers can be generated by the robot, with actions executed by surrounding objects (e.g., when Pepper detects a negative emotion on the user's face, soft lights turn on, and relaxing music plays).
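To make the trigger-action structure described above more concrete, the following is a minimal sketch of how one such automation could be represented before being handed to the Automation Manager and stored. The dictionary layout loosely follows Home Assistant's trigger/condition/action convention; the event type, entity identifiers, and the save_automation helper are hypothetical and only for illustration, not the prototype's actual schema.

```python
# Illustrative sketch only: a trigger-action rule in which Pepper detecting a
# negative emotion turns on soft lights and plays relaxing music.
# Event types, entity IDs and the helper below are hypothetical.
import json
import sqlite3

automation = {
    "alias": "Comfort the user when a negative emotion is detected",
    "trigger": [
        # Event fired by the robot when its emotion recognition detects a negative emotion
        {"platform": "event", "event_type": "pepper_emotion_detected",
         "event_data": {"emotion": "negative"}},
    ],
    "condition": [],
    "action": [
        # Actions executed by the surrounding smart objects
        {"service": "light.turn_on", "target": {"entity_id": "light.living_room"},
         "data": {"brightness_pct": 30}},
        {"service": "media_player.play_media",
         "target": {"entity_id": "media_player.living_room"},
         "data": {"media_content_type": "music", "media_content_id": "relaxing_playlist"}},
    ],
}

def save_automation(rule: dict, db_path: str = "automations.db") -> int:
    """Persist the rule as JSON and return its database ID (sketch)."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS automations (id INTEGER PRIMARY KEY, rule TEXT)")
        cur = conn.execute("INSERT INTO automations (rule) VALUES (?)",
                           (json.dumps(rule),))
        return cur.lastrowid

if __name__ == "__main__":
    print("Saved automation with ID", save_automation(automation))
```

The same structure accommodates the other combinations mentioned above (robot-only rules, environment-triggered robot actions, or robot-triggered actions on smart objects) simply by varying which entities appear in the trigger and action lists.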
3.1. Interacting with Pepper

The application integrates voice and text messaging functionalities, as well as message display editing. The vocal interaction approach utilises Google Cloud's Text-to-Speech (TTS) and Speech-to-Text (STT) APIs. This approach offers a comprehensive solution for adding vocal input and output capabilities, thereby enhancing the user experience by making interactions more natural and engaging.

Users can activate voice recognition in the robot by pressing a designated button on the interface or touching a part of the robot's body, such as the hand or head. This service converts vocal input into text. Initially, the service starts recording audio via the robot's microphone, visually notifying the user of the recording status with a change in button colour to indicate when the recording is active. Next, the service sends the audio stream to the Google STT service. The speech service promptly notifies the robot application when receiving the input converted into text format, allowing for almost instantaneous processing of the transcribed message. The obtained text can be used for various purposes within the robot application, such as displaying it for the user, sending it to other backend services for further responses or actions, or triggering specific commands based on keywords.

The robot application's audio output is generated by a TTS functionality that handles authentication to the Google Cloud service, manages synthesis requests, and ultimately plays the resulting audio using a media player. The system saves the synthesised audio to a local file, readying it for playback. This conversion occurs after the message has been processed and sent to the list of displayed messages, ensuring that the user can both see and hear the bot's response. A useful feature introduced is the toggle functionality that allows users to choose between viewing all messages exchanged in the chat and viewing only the last two messages (the application's default view). The toggle is triggered by a method that flips a Boolean value, based on which the adapter decides which view to adopt. This mechanism helps keep the user interface tidy, showing only a part of the messages during vocal interaction but allowing users to view the entire conversation, if desired. This decision was made under the assumption that during a real-time vocal conversation, users do not always need a textual representation of the entire chat.

Finally, the interface also features a reset function, allowing users to activate a series of actions through a dedicated button to stop any ongoing activity, such as vocal recording or the playback of animations and vocal synthesis. Moreover, this functionality serves to clear the user interface by deleting the message history and returning any input fields or selections to their original state. This reset feature ensures a smooth and intuitive user experience by allowing users to easily reset the application without the need to navigate through complex menus or restart the app entirely.
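The voice pipeline just described rests on Google Cloud's Speech-to-Text and Text-to-Speech services. The snippet below is a minimal sketch of those two calls using the official google-cloud-speech and google-cloud-texttospeech Python clients; the audio encoding, sample rate, language code, and file path are illustrative assumptions rather than the prototype's actual configuration.

```python
# Minimal sketch of the STT/TTS round trip described in Section 3.1.
# Audio format, language code and file paths are illustrative assumptions.
from google.cloud import speech, texttospeech

def transcribe(audio_bytes: bytes) -> str:
    """Send recorded microphone audio to Google STT and return the transcript."""
    client = speech.SpeechClient()
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
    )
    audio = speech.RecognitionAudio(content=audio_bytes)
    response = client.recognize(config=config, audio=audio)
    # Concatenate the best hypothesis of each recognised segment
    return " ".join(r.alternatives[0].transcript for r in response.results)

def synthesise(text: str, out_path: str = "reply.mp3") -> str:
    """Convert the bot's textual reply into an audio file ready for playback."""
    client = texttospeech.TextToSpeechClient()
    response = client.synthesize_speech(
        input=texttospeech.SynthesisInput(text=text),
        voice=texttospeech.VoiceSelectionParams(language_code="en-US"),
        audio_config=texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.MP3),
    )
    with open(out_path, "wb") as f:
        f.write(response.audio_content)
    return out_path
```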
3.2. Dialogue Manager

The Dialogue Manager serves as the core component for processing natural language inputs and managing conversational flow. When a user engages with the robot, Pepper transcribes the user's speech into text and then sends it to the Dialogue Manager via an HTTP request. Upon receiving the text, the Dialogue Manager (a Flask Python server) forwards the message to the GPT-4 model through the OpenAI API. Guided by the instructions in the defined prompt, the model determines the next step in the conversation. It can either execute a function to perform a specific task or directly generate a textual response to the user's query.

The constructed prompt begins with a description of the role the model is expected to adopt (e.g., "You are Pepper, a humanoid robot..."), which is succeeded by an explanation of the task (e.g., "Your task is to help users create automations...") and some general guidelines to be adhered to when interacting with the user (e.g., "call the user by name" or "keep the response short and simple..."). Then, the prompt introduces the functions the model can use, along with instructions on when and how to use them (e.g., "always use the verify_automation function before saving the automation"). The function calling functionality, provided by the OpenAI API, enables the LLM to interface with external resources and tools. This is possible by supplying the model with a set of function descriptions and the required parameters for their execution. In particular, we include a function for retrieving the list of possible triggers and actions for defining an automation, a function to verify the correctness of the defined automation, and a function for saving the created automation in a database. When a function is invoked, its output is fed back into the model. This becomes the basis for GPT-4 to generate an appropriate response. For example, the "save automation" function provides an output message containing the unique automation ID as well as a confirmation message if the operation was successful, or an error message otherwise. The model utilises this output to generate the user's response. This response is then dispatched to Pepper, closing the loop of the initial HTTP request. Pepper then verbalises the response, providing the user with an audible answer.

Example Scenario. Let's consider a usage scenario in which the user talks with Pepper and defines one automation by saying something like: "When I come back home, if I'm in a bad mood, say something comforting". Consequently, Pepper will send this sentence to the Dialogue Manager, which retrieves the list of possible triggers and actions and proposes an initial automation to the user: "We can detect when you are home using the location of your smartphone, and I can detect your mood by your facial expressions. If you are sad or angry, I can put on relaxing music and say something comforting like...".

At this point, the user can continue talking with Pepper to refine the proposed automation. Once the user is satisfied with the defined triggers and actions, the Dialogue Manager uses the "verify automation" function to check that the automation contains only available triggers and actions. If an automation is correctly verified, Pepper asks the user if the created automation can be saved. After saving, a confirmation message summarises the created automation along with the assigned ID in the database. Once the automation is saved, the Automation Manager module executes the defined actions when the chosen event is triggered and any defined conditions are met.
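As a rough illustration of the flow described in this section, the sketch below shows a Flask endpoint that forwards Pepper's transcribed message to a GPT-4 model with function calling enabled, runs the requested functions, and returns the final reply for Pepper to verbalise. It assumes the openai Python SDK; the /message endpoint name, the tool schemas, and the verify_automation/save_automation bodies are simplified placeholders, and the prototype's actual prompt and function set are richer than shown here.

```python
# Simplified sketch of the Dialogue Manager loop (Flask + OpenAI function calling).
# Endpoint name, prompt wording, tool schemas and helper bodies are illustrative.
import json
from flask import Flask, jsonify, request
from openai import OpenAI

app = Flask(__name__)
client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are Pepper, a humanoid robot. Your task is to help users create "
    "automations made of triggers and actions. Keep responses short and simple. "
    "Always use the verify_automation function before saving the automation."
)

TOOLS = [
    {"type": "function", "function": {
        "name": "verify_automation",
        "description": "Check that the automation only uses available triggers and actions.",
        "parameters": {"type": "object",
                       "properties": {"automation": {"type": "object"}},
                       "required": ["automation"]}}},
    {"type": "function", "function": {
        "name": "save_automation",
        "description": "Persist a verified automation and return its database ID.",
        "parameters": {"type": "object",
                       "properties": {"automation": {"type": "object"}},
                       "required": ["automation"]}}},
]

def verify_automation(automation: dict) -> dict:
    # Placeholder: a real check would compare against the catalogue of triggers/actions.
    return {"valid": True}

def save_automation(automation: dict) -> dict:
    # Placeholder: a real implementation would write to the database / Automation Manager.
    return {"id": 42, "status": "saved"}

LOCAL_FUNCTIONS = {"verify_automation": verify_automation,
                   "save_automation": save_automation}

@app.post("/message")  # hypothetical endpoint called by the robot application
def message():
    messages = [{"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": request.json["text"]}]
    while True:
        reply = client.chat.completions.create(
            model="gpt-4", messages=messages, tools=TOOLS).choices[0].message
        if not reply.tool_calls:
            # Final answer: sent back to Pepper, which verbalises it via TTS
            return jsonify({"response": reply.content})
        messages.append(reply)
        for call in reply.tool_calls:
            result = LOCAL_FUNCTIONS[call.function.name](
                **json.loads(call.function.arguments))
            # Feed each function's output back into the model
            messages.append({"role": "tool", "tool_call_id": call.id,
                             "content": json.dumps(result)})
```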
4. Conclusions and Future Work

This research explores the integration of Large Language Models into the domain of End-User Development for social robots, focusing on enabling users to customise robot behaviours through intuitive, natural language vocal interactions. By implementing a prototype conversational agent embodied in a Pepper robot, we facilitate non-expert users in creating automations for personalising the robot's behaviour based on events in a smart environment (e.g., presence detection in a room), or on events on the robot itself (e.g., human face recognition). This approach leverages trigger-action programming, presenting a user-friendly method for customising robot functionalities without requiring technical knowledge. Our proposal contributes to the field by illustrating the practical application of LLMs in enhancing robot usability and flexibility, suggesting a promising avenue for future research and development in social robotics and user-centric automation. For future work, we plan to initially conduct user tests in a controlled environment (e.g., a laboratory setting) to evaluate the strengths and weaknesses of our solution in comparison with existing tools based on visual interfaces. User tests in real-world scenarios will follow, addressing the need for realistic, extended evaluations in robotics. Given the focus on the End-User Development approach, it is important to test with users without programming skills and home automation experience.

References

[1] J. de Wit, T. Schodde, B. Willemsen, K. Bergmann, M. de Haas, S. Kopp, E. Krahmer, P. Vogt, The Effect of a Robot's Gestures and Adaptive Tutoring on Children's Acquisition of Second Language Vocabularies, in: Proceedings of the 2018 ACM/IEEE International Conference on Human-Robot Interaction, HRI '18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 50–58. URL: https://dl.acm.org/doi/10.1145/3171221.3171277. doi:10.1145/3171221.3171277.

[2] M. Manca, F. Paternò, C. Santoro, E. Zedda, C. Braschi, R. Franco, A. Sale, The impact of serious games with humanoid robots on mild cognitive impairment older adults, International Journal of Human-Computer Studies 145 (2021) 102509. URL: https://www.sciencedirect.com/science/article/pii/S1071581920301117. doi:10.1016/j.ijhcs.2020.102509.

[3] H.-D. Bui, N. Y. Chong, An Integrated Approach to Human-Robot-Smart Environment Interaction Interface for Ambient Assisted Living, in: 2018 IEEE Workshop on Advanced Robotics and its Social Impacts (ARSO), 2018, pp. 32–37. URL: https://ieeexplore.ieee.org/document/8625821. doi:10.1109/ARSO.2018.8625821. ISSN: 2162-7576.

[4] N. Ramoly, A. Bouzeghoub, B. Finance, A Framework for Service Robots in Smart Home: An Efficient Solution for Domestic Healthcare, IRBM 39 (2018) 413–420. URL: https://www.sciencedirect.com/science/article/pii/S1959031818302793. doi:10.1016/j.irbm.2018.10.010.

[5] E. Toscano, M. Spitale, F. Garzotto, Socially Assistive Robots in Smart Homes: Design Factors that Influence the User Perception, in: 2022 17th ACM/IEEE International Conference on Human-Robot Interaction (HRI), 2022, pp. 1075–1079. URL: https://ieeexplore.ieee.org/document/9889467. doi:10.1109/HRI53351.2022.9889467.

[6] G. Wilson, C. Pereyda, N. Raghunath, G. de la Cruz, S. Goel, S. Nesaei, B. Minor, M. Schmitter-Edgecombe, M. E. Taylor, D. J. Cook, Robot-enabled support of daily activities in smart home environments, Cognitive Systems Research 54 (2019) 258–272. URL: https://www.sciencedirect.com/science/article/pii/S1389041718302651. doi:10.1016/j.cogsys.2018.10.032.

[7] Towards a Modular and Distributed End-User Development Framework for Human-Robot Interaction, IEEE Xplore. URL: https://ieeexplore.ieee.org/document/9323043.

[8] S. Beschi, D. Fogli, F. Tampalini, CAPIRCI: A Multi-modal System for Collaborative Robot Programming, in: End-User Development, Lecture Notes in Computer Science, volume 11553, Springer International Publishing, Cham, 2019, pp. 51–66. URL: http://link.springer.com/10.1007/978-3-030-24781-2_4. doi:10.1007/978-3-030-24781-2_4.

[9] S. Vemprala, R. Bonatti, A. Bucker, A. Kapoor, ChatGPT for Robotics: Design Principles and Model Abilities, 2023. arXiv:2306.17582 [cs]. URL: http://arxiv.org/abs/2306.17582. doi:10.48550/arXiv.2306.17582.

[10] N. Leonardi, M. Manca, F. Paternò, C. Santoro, Trigger-Action Programming for Personalising Humanoid Robot Behaviour, in: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 1–13. URL: https://dl.acm.org/doi/10.1145/3290605.3300675. doi:10.1145/3290605.3300675.

[11] D. Porfirio, L. Stegner, M. Cakmak, A. Sauppé, A. Albarghouthi, B. Mutlu, Sketching Robot Programs On the Fly, in: Proceedings of the 2023 ACM/IEEE International Conference on Human-Robot Interaction, HRI '23, Association for Computing Machinery, New York, NY, USA, 2023, pp. 584–593. URL: https://dl.acm.org/doi/10.1145/3568162.3576991. doi:10.1145/3568162.3576991.

[12] U. B. Karli, J.-T. Chen, V. N. Antony, C.-M. Huang, Alchemist: LLM-Aided End-User Development of Robot Applications, in: Proceedings of the 2024 ACM/IEEE International Conference on Human-Robot Interaction, HRI '24, Association for Computing Machinery, New York, NY, USA, 2024, pp. 361–370. URL: https://dl.acm.org/doi/10.1145/3610977.3634969. doi:10.1145/3610977.3634969.