=Paper=
{{Paper
|id=Vol-3794/paper03
|storemode=property
|title=End-User Personalisation of Humanoid Robot Behaviour Through Vocal Interaction
|pdfUrl=https://ceur-ws.org/Vol-3794/paper3.pdf
|volume=Vol-3794
|authors=Simone Gallo,Giacomo Vaiani,Fabio Paternò
|dblpUrl=https://dblp.org/rec/conf/rfh/GalloVP24
}}
==End-User Personalisation of Humanoid Robot Behaviour Through Vocal Interaction==
End-User Personalisation of Humanoid Robot Behaviour Through Vocal Interaction

Simone Gallo1,2,∗,†, Giacomo Vaiani1,2,† and Fabio Paternò1

1 CNR – ISTI, Via G. Moruzzi 1, 56127 Pisa, Italy
2 University of Pisa – Computer Science Dept., Largo B. Pontecorvo 3, 56127 Pisa, Italy
Abstract
This study explores the integration of Large Language Models with social robots to facilitate End-User Development through natural
language interactions. The paper presents a prototype system embodied in a Pepper robot that allows non-expert users to customise
robot behaviours by defining personalisation rules via vocal commands. This system employs trigger-action programming, enabling
users to create automations based on specific triggers and actions without requiring in-depth technical knowledge. Through an example
scenario, we show how users can program the robot by employing voice commands to execute actions when an event occurs. The
created automations can also involve available IoT objects. The study investigates the potential of natural language interaction to
improve the usability and flexibility of robot programming, offering new possibilities for personalised interactions in various settings.
Keywords
End-User Development, Human-Robot Interaction, Smart Spaces, CEUR-WS
Workshop Robots for Humans 2024, Advanced Visual Interfaces, Arenzano, June 3rd, 2024
∗ Corresponding author.
† These authors contributed equally.
Email: simone.gallo@isti.cnr.it (S. Gallo); giacomo.vaiani@phd.unipi.it (G. Vaiani); fabio.paterno@isti.cnr.it (F. Paternò)
ORCID: 0000-0003-3412-1639 (S. Gallo); 0009-0002-0910-2284 (G. Vaiani); 0000-0001-8355-6909 (F. Paternò)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

1. Introduction

Over recent years, technological advancements have resulted in the development of increasingly sophisticated robots that are more closely integrated into human daily activities. This evolution is especially noticeable in the realm of social robots. These robots are designed to interact with humans in various social contexts, assisting with a range of tasks, including children's language education [1], older adults' cognitive training [2], and smart device management [3]. Moreover, robots could offer advantages over traditional voice assistants, particularly in routine activity detection and support for individuals with functional limitations [4][5][6]. Several end-user development tools have been introduced to facilitate user engagement and customisation of these robots' behaviour. These tools utilise various paradigms, such as block-based [7] and natural language programming [8], enabling users to compose personalisation rules (for instance, having the robot say something when a person is in front of it or perform specific actions based on vocal commands).

Recent advancements in artificial intelligence, particularly with Large Language Models (LLMs), have the potential to enhance robots' communicative and operational capabilities. This enables interactions that can resemble human-like conversations and dynamically adapt to end-user requests [9]. Within this context, trigger-action programming emerges as an effective approach to End-User Development (EUD) in robotic systems. It allows users to define robot behaviours in response to specific events or conditions, offering a user-friendly way to customise robot functionalities without the need for deep technical knowledge.

This paper presents a prototype of a conversational agent embodied in a Pepper robot that utilises an LLM to assist non-expert users in creating personalisation rules in trigger-action format. Through vocal conversations, users can naturally express their preferences for how they want the robot to act when a specific event occurs. These events can be triggered by interacting with the robot itself (e.g., when the robot recognises a person), or by the surrounding smart environment (e.g., the change of temperature in a room, opening or closing windows/doors). In the following sections, we first introduce related work in the field of trigger-action programming for robot end-user development, and then we delve into the architecture of this prototype and an application scenario. Finally, we discuss the next steps for this work.

2. Related Work

Various research studies have explored the possibilities of end-user programming of robot behaviour using the trigger-action paradigm, employing diverse interaction modalities and approaches. Leonardi et al. [10] exploited a graphical web-based wizard interface to enable users without programming skills to define personalisation rules by specifying events and/or conditions (triggers) that, once met, initiate the execution of defined actions. The tool allows the user to select triggers and actions related to smart devices (e.g., the motion detected by a sensor, turning on smart bulbs) and a Pepper robot (e.g., a touch on the robot's head). Thus, it was possible to create automations, such as having Pepper say "Hey, how are you?" when someone entering the room is detected, by combining Internet of Things devices with Pepper. In the present study, we propose the development of automations through direct interaction with the robotic system, as opposed to the utilisation of a separate web-based wizard.

Another contribution by Porfirio et al. [11] presents Tabula, a multimodal end-user development system for programming service robots for personal use in domestic and workplace environments. In this case, the system enables users without programming skills to script tasks, defining humanoid robots' behaviour (a Pepper one) using trigger-action programming and combining natural language with sketches on a visual interface to define the automation. In particular, users can utilise natural language commands (via voice) to define triggers and actions. The resulting automation is visualised on a two-dimensional map, displaying the current environment (e.g., the user's house) and the defined path or actions of the robot. This setup allows users to modify or refine the automation or to implement more complex logic that is difficult to express verbally. The implemented prototype encompasses a set of five actions (e.g., moving to a position, saying something) and two events (e.g., a person approaching or speaking to the robot), enabling the creation of automations like "when the user arrives home, the robot goes to the entrance". Although users have considered the approach promising, the system faces challenges in processing natural language input due to difficulties in understanding complex or ambiguous commands, which leads to errors in automation and user frustration.
Finally, a recent work proposed by Karli et al. [12] developed a system integrating ChatGPT to enable end-users to define robot programs (e.g., for defining the movement of a robotic arm) using natural language instructions. The system interface presents a chat from which the user sends the inputs, a console showing the generated code, and a view of the robot simulator. Beginning with the description of the desired robot behaviour, the user engages in a collaborative process that iteratively defines and debugs the specifications with the system to address the request. While the natural language interaction is effective, the study emphasises several critical points regarding the use of LLMs in this context. It highlights the necessity of enhancing the reliability of LLM-generated code through accurate code verification processes, crafting more effective prompts, and adjusting prompts dynamically to better fit the context.

In general, LLM approaches open new possibilities for applications across a broad range of settings in end-user development, highlighting the potential to enhance the usability and flexibility of robot programming.
3. Our Approach

In this proposal, we introduce the combined use of Pepper with an LLM agent, aiming to define the robot's behaviour through specific trigger-action personalisation rules (also called automations) expressed by voice. By integrating an LLM as a natural language processing module, users can communicate with Pepper in a more intuitive and conversational way. This enhancement enables Pepper to process complex commands and questions, significantly improving its usability and interactive capabilities. Furthermore, this system design lets users create automations verbally, eliminating the need for any programming skills. Figure 1 illustrates the architecture of the designed system.

Figure 1: System Architecture

Specifically, this prototype aims to create automations that include triggers and actions related to both a smart environment (e.g., a smart home) and the robot. Through these automations, it is possible to define events, conditions, and actions related to both sensors and smart objects (e.g., motion sensors, smart light bulbs, smart thermostats) and Pepper (e.g., recognise a person, display something on its tablet, say something). In this way, the robot becomes part of a smart ecosystem in which it can perform actions in response to triggers related to the environment or be itself the trigger of events for the execution of actions by smart objects. On a technical level, Pepper can be considered a system entity at the same level as sensors and smart objects and can thus be integrated into control systems of smart environments. The automations created are then executed through an Automation Manager (we use Home Assistant, https://www.home-assistant.io/, as Automation Manager because it is open-source, robust, and widely used with an active community). In particular, we consider the following triggers and actions involving Pepper:

• Triggers: chest button press, head touch, hand touch (both right and left hand), face recognition, emotion recognition, and speech recognition.
• Actions: speak, hand movement, run animation, position change, video camera activation, LEDs state change, show something on display.

When the user speaks, the robot utilises the Google Speech API to identify the spoken sentence. The system then makes an API call to the Dialogue Manager to process the user message. The API response, which is delivered using Pepper's voice, contains the response message for the user. Once the automation is complete, it is saved in a database and executed through the Automation Manager.

More generally, our approach opens the possibility of creating automations that involve robots and surrounding connected sensors and objects. Indeed, the created automations can involve both triggers and actions associated with the robot (e.g., when Pepper recognises a specific person, it says "Hello [name]" and does a greeting animation), but it is also possible to have triggers activated by external sensors or objects and actions executed by the robot. Vice versa, triggers can be generated by the robot, with actions executed by surrounding objects (e.g., when Pepper detects a negative emotion on the user's face, soft lights turn on, and relaxing music plays).
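To make the trigger-action structure described above more concrete, the following is a minimal sketch of how one such automation could be represented before being handed to the Automation Manager and stored. The dictionary layout loosely follows Home Assistant's trigger/condition/action convention; the event type, entity identifiers, and the save_automation helper are hypothetical and only for illustration, not the prototype's actual schema.

```python
# Illustrative sketch only: a trigger-action rule in which Pepper detecting a
# negative emotion turns on soft lights and plays relaxing music.
# Event types, entity IDs and the helper below are hypothetical.
import json
import sqlite3

automation = {
    "alias": "Comfort the user when a negative emotion is detected",
    "trigger": [
        # Event fired by the robot when its emotion recognition detects a negative emotion
        {"platform": "event", "event_type": "pepper_emotion_detected",
         "event_data": {"emotion": "negative"}},
    ],
    "condition": [],
    "action": [
        # Actions executed by the surrounding smart objects
        {"service": "light.turn_on", "target": {"entity_id": "light.living_room"},
         "data": {"brightness_pct": 30}},
        {"service": "media_player.play_media",
         "target": {"entity_id": "media_player.living_room"},
         "data": {"media_content_type": "music", "media_content_id": "relaxing_playlist"}},
    ],
}

def save_automation(rule: dict, db_path: str = "automations.db") -> int:
    """Persist the rule as JSON and return its database ID (sketch)."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS automations (id INTEGER PRIMARY KEY, rule TEXT)")
        cur = conn.execute("INSERT INTO automations (rule) VALUES (?)",
                           (json.dumps(rule),))
        return cur.lastrowid

if __name__ == "__main__":
    print("Saved automation with ID", save_automation(automation))
```

The same structure accommodates the other combinations mentioned above (robot-only rules, environment-triggered robot actions, or robot-triggered actions on smart objects) simply by varying which entities appear in the trigger and action lists.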
3.1. Interacting with Pepper

The application integrates voice and text messaging functionalities, as well as message display editing. The vocal interaction approach utilises Google Cloud's Text-to-Speech (TTS) and Speech-to-Text (STT) APIs. This approach offers a comprehensive solution for adding vocal input and output capabilities, thereby enhancing the user experience by making interactions more natural and engaging.

Users can activate voice recognition in the robot by pressing a designated button on the interface or touching a part of the robot's body, such as the hand or head. This service converts vocal input into text. Initially, the service starts recording audio via the robot's microphone, visually notifying the user of the recording status with a change in button colour to indicate when the recording is active. Next, the service sends the audio stream to the Google STT service. The speech service promptly notifies the robot application when receiving the input converted into text format, allowing for almost instantaneous processing of the transcribed message. The obtained text can be used for various purposes within the robot application, such as displaying it for the user, sending it to other backend services for further responses or actions, or triggering specific commands based on keywords.

The robot application's audio output is generated by a TTS functionality that handles authentication to the Google Cloud service, manages synthesis requests, and ultimately plays the resulting audio using a media player. The system saves the synthesised audio to a local file, readying it for playback. This conversion occurs after the message has been processed and sent to the list of displayed messages, ensuring that the user can both see and hear the bot's response. A useful feature introduced is the toggle functionality that allows users to choose between viewing all messages exchanged in the chat and viewing only the last two messages (the application's default view). The toggle is triggered by a method that flips a Boolean value, based on which the adapter decides which view to adopt. This mechanism helps keep the user interface tidy, showing only a part of the messages during vocal interaction but allowing users to view the entire conversation, if desired. This decision was made under the assumption that during a real-time vocal conversation, users do not always need a textual representation of the entire chat.

Finally, the interface also features a reset function, allowing users to activate a series of actions through a dedicated button to stop any ongoing activity, such as vocal recording or the playback of animations and vocal synthesis. Moreover, this functionality serves to clear the user interface by deleting the message history and returning any input fields or selections to their original state. This reset feature ensures a smooth and intuitive user experience by allowing users to easily reset the application without the need to navigate through complex menus or restart the app entirely.
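The voice pipeline just described rests on Google Cloud's Speech-to-Text and Text-to-Speech services. The snippet below is a minimal sketch of those two calls using the official google-cloud-speech and google-cloud-texttospeech Python clients; the audio encoding, sample rate, language code, and file path are illustrative assumptions rather than the prototype's actual configuration.

```python
# Minimal sketch of the STT/TTS round trip described in Section 3.1.
# Audio format, language code and file paths are illustrative assumptions.
from google.cloud import speech, texttospeech

def transcribe(audio_bytes: bytes) -> str:
    """Send recorded microphone audio to Google STT and return the transcript."""
    client = speech.SpeechClient()
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
    )
    audio = speech.RecognitionAudio(content=audio_bytes)
    response = client.recognize(config=config, audio=audio)
    # Concatenate the best hypothesis of each recognised segment
    return " ".join(r.alternatives[0].transcript for r in response.results)

def synthesise(text: str, out_path: str = "reply.mp3") -> str:
    """Convert the bot's textual reply into an audio file ready for playback."""
    client = texttospeech.TextToSpeechClient()
    response = client.synthesize_speech(
        input=texttospeech.SynthesisInput(text=text),
        voice=texttospeech.VoiceSelectionParams(language_code="en-US"),
        audio_config=texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.MP3),
    )
    with open(out_path, "wb") as f:
        f.write(response.audio_content)
    return out_path
```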
3.2. Dialogue Manager

The Dialogue Manager serves as the core component for processing natural language inputs and managing conversational flow. When a user engages with the robot, Pepper transcribes the user's speech into text and then sends it to the Dialogue Manager via an HTTP request. Upon receiving the text, the Dialogue Manager (a Flask Python server) forwards the message to the GPT-4 model through the OpenAI API. Guided by the instructions in the defined prompt, the model determines the next step in the conversation. It can either execute a function to perform a specific task or directly generate a textual response to the user's query.

The constructed prompt begins with a description of the role the model is expected to adopt (e.g., "You are Pepper, a humanoid robot..."), which is succeeded by an explanation of the task (e.g., "Your task is to help users create automations...") and some general guidelines to be adhered to when interacting with the user (e.g., "call the user by name" or "keep the response short and simple..."). Then, the prompt introduces the functions the model can use, along with instructions on when and how to use them (e.g., "always use the verify_automation function before saving the automation"). The function calling functionality, provided by the OpenAI API, enables the LLM to interface with external resources and tools. This is possible by supplying the model with a set of function descriptions and the required parameters for their execution. In particular, we include a function for retrieving the list of possible triggers and actions for defining an automation, a function to verify the correctness of the defined automation, and a function for saving the created automation in a database. When a function is invoked, its output is fed back into the model. This becomes the basis for GPT-4 to generate an appropriate response. For example, the "save automation" function provides an output message containing the unique automation ID as well as a confirmation message if the operation was successful, or an error message otherwise. The model utilises this output to generate the user's response. This response is then dispatched to Pepper, closing the loop of the initial HTTP request. Pepper then verbalises the response, providing the user with an audible answer.

Example Scenario. Let's consider a usage scenario in which the user talks with Pepper and defines one automation by saying something like: "When I come back home, if I'm in a bad mood, say something comforting". Consequently, Pepper will send this sentence to the Dialogue Manager, which retrieves the list of possible triggers and actions and proposes an initial automation to the user: "We can detect when you are home using the location of your smartphone, and I can detect your mood by your facial expressions. If you are sad or angry, I can put on relaxing music and say something comforting like...".

At this point, the user can continue talking with Pepper to refine the proposed automation. Once the user is satisfied with the defined triggers and actions, the Dialogue Manager uses the "verify automation" function to check that the automation contains only available triggers and actions. If an automation is correctly verified, Pepper asks the user if the created automation can be saved. After saving, a confirmation message summarises the created automation along with the assigned ID in the database. Once the automation is saved, the Automation Manager module executes the defined actions when the chosen event is triggered and any defined conditions are met.
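As a rough illustration of the flow described in this section, the sketch below shows a Flask endpoint that forwards Pepper's transcribed message to a GPT-4 model with function calling enabled, runs the requested functions, and returns the final reply for Pepper to verbalise. It assumes the openai Python SDK; the /message endpoint name, the tool schemas, and the verify_automation/save_automation bodies are simplified placeholders, and the prototype's actual prompt and function set are richer than shown here.

```python
# Simplified sketch of the Dialogue Manager loop (Flask + OpenAI function calling).
# Endpoint name, prompt wording, tool schemas and helper bodies are illustrative.
import json
from flask import Flask, jsonify, request
from openai import OpenAI

app = Flask(__name__)
client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are Pepper, a humanoid robot. Your task is to help users create "
    "automations made of triggers and actions. Keep responses short and simple. "
    "Always use the verify_automation function before saving the automation."
)

TOOLS = [
    {"type": "function", "function": {
        "name": "verify_automation",
        "description": "Check that the automation only uses available triggers and actions.",
        "parameters": {"type": "object",
                       "properties": {"automation": {"type": "object"}},
                       "required": ["automation"]}}},
    {"type": "function", "function": {
        "name": "save_automation",
        "description": "Persist a verified automation and return its database ID.",
        "parameters": {"type": "object",
                       "properties": {"automation": {"type": "object"}},
                       "required": ["automation"]}}},
]

def verify_automation(automation: dict) -> dict:
    # Placeholder: a real check would compare against the catalogue of triggers/actions.
    return {"valid": True}

def save_automation(automation: dict) -> dict:
    # Placeholder: a real implementation would write to the database / Automation Manager.
    return {"id": 42, "status": "saved"}

LOCAL_FUNCTIONS = {"verify_automation": verify_automation,
                   "save_automation": save_automation}

@app.post("/message")  # hypothetical endpoint called by the robot application
def message():
    messages = [{"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": request.json["text"]}]
    while True:
        reply = client.chat.completions.create(
            model="gpt-4", messages=messages, tools=TOOLS).choices[0].message
        if not reply.tool_calls:
            # Final answer: sent back to Pepper, which verbalises it via TTS
            return jsonify({"response": reply.content})
        messages.append(reply)
        for call in reply.tool_calls:
            result = LOCAL_FUNCTIONS[call.function.name](
                **json.loads(call.function.arguments))
            # Feed each function's output back into the model
            messages.append({"role": "tool", "tool_call_id": call.id,
                             "content": json.dumps(result)})
```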
4. Conclusions and Future Work

This research explores the integration of Large Language Models into the domain of End-User Development for social robots, focusing on enabling users to customise robot behaviours through intuitive, natural language vocal interactions. By implementing a prototype conversational agent embodied in a Pepper robot, we facilitate non-expert users in creating automations for personalising the robot's behaviour based on events in a smart environment (e.g., presence detection in a room), or on events on the robot itself (e.g., human face recognition). This approach leverages trigger-action programming, presenting a user-friendly method for customising robot functionalities without requiring technical knowledge. Our proposal contributes to the field by illustrating the practical application of LLMs in enhancing robot usability and flexibility, suggesting a promising avenue for future research and development in social robotics and user-centric automation. For future work, we plan to initially conduct user tests in a controlled environment (e.g., a laboratory setting) to evaluate the strengths and weaknesses of our solution in comparison with existing tools based on visual interfaces. User tests in real-world scenarios will follow, addressing the need for realistic, extended evaluations in robotics. Given the focus on the End-User Development approach, it is important to test with users without programming skills and home automation experience.

References

[1] J. de Wit, T. Schodde, B. Willemsen, K. Bergmann, M. de Haas, S. Kopp, E. Krahmer, P. Vogt, The Effect of a Robot's Gestures and Adaptive Tutoring on Children's Acquisition of Second Language Vocabularies, in: Proceedings of the 2018 ACM/IEEE International Conference on Human-Robot Interaction, HRI '18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 50–58. URL: https://dl.acm.org/doi/10.1145/3171221.3171277. doi:10.1145/3171221.3171277.

[2] M. Manca, F. Paternò, C. Santoro, E. Zedda, C. Braschi, R. Franco, A. Sale, The impact of serious games with humanoid robots on mild cognitive impairment older adults, International Journal of Human-Computer Studies 145 (2021) 102509. URL: https://www.sciencedirect.com/science/article/pii/S1071581920301117. doi:10.1016/j.ijhcs.2020.102509.

[3] H.-D. Bui, N. Y. Chong, An Integrated Approach to Human-Robot-Smart Environment Interaction Interface for Ambient Assisted Living, in: 2018 IEEE Workshop on Advanced Robotics and its Social Impacts (ARSO), 2018, pp. 32–37. URL: https://ieeexplore.ieee.org/document/8625821. doi:10.1109/ARSO.2018.8625821. ISSN: 2162-7576.

[4] N. Ramoly, A. Bouzeghoub, B. Finance, A Framework for Service Robots in Smart Home: An Efficient Solution for Domestic Healthcare, IRBM 39 (2018) 413–420. URL: https://www.sciencedirect.com/science/article/pii/S1959031818302793. doi:10.1016/j.irbm.2018.10.010.

[5] E. Toscano, M. Spitale, F. Garzotto, Socially Assistive Robots in Smart Homes: Design Factors that Influence the User Perception, in: 2022 17th ACM/IEEE International Conference on Human-Robot Interaction (HRI), 2022, pp. 1075–1079. URL: https://ieeexplore.ieee.org/document/9889467. doi:10.1109/HRI53351.2022.9889467.

[6] G. Wilson, C. Pereyda, N. Raghunath, G. de la Cruz, S. Goel, S. Nesaei, B. Minor, M. Schmitter-Edgecombe, M. E. Taylor, D. J. Cook, Robot-enabled support of daily activities in smart home environments, Cognitive Systems Research 54 (2019) 258–272. URL: https://www.sciencedirect.com/science/article/pii/S1389041718302651. doi:10.1016/j.cogsys.2018.10.032.

[7] Towards a Modular and Distributed End-User Development Framework for Human-Robot Interaction, IEEE Xplore. URL: https://ieeexplore.ieee.org/document/9323043.

[8] S. Beschi, D. Fogli, F. Tampalini, CAPIRCI: A Multi-modal System for Collaborative Robot Programming, in: End-User Development, Lecture Notes in Computer Science, volume 11553, Springer International Publishing, Cham, 2019, pp. 51–66. URL: http://link.springer.com/10.1007/978-3-030-24781-2_4. doi:10.1007/978-3-030-24781-2_4.

[9] S. Vemprala, R. Bonatti, A. Bucker, A. Kapoor, ChatGPT for Robotics: Design Principles and Model Abilities, 2023. arXiv:2306.17582 [cs]. URL: http://arxiv.org/abs/2306.17582. doi:10.48550/arXiv.2306.17582.

[10] N. Leonardi, M. Manca, F. Paternò, C. Santoro, Trigger-Action Programming for Personalising Humanoid Robot Behaviour, in: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 1–13. URL: https://dl.acm.org/doi/10.1145/3290605.3300675. doi:10.1145/3290605.3300675.

[11] D. Porfirio, L. Stegner, M. Cakmak, A. Sauppé, A. Albarghouthi, B. Mutlu, Sketching Robot Programs On the Fly, in: Proceedings of the 2023 ACM/IEEE International Conference on Human-Robot Interaction, HRI '23, Association for Computing Machinery, New York, NY, USA, 2023, pp. 584–593. URL: https://dl.acm.org/doi/10.1145/3568162.3576991. doi:10.1145/3568162.3576991.

[12] U. B. Karli, J.-T. Chen, V. N. Antony, C.-M. Huang, Alchemist: LLM-Aided End-User Development of Robot Applications, in: Proceedings of the 2024 ACM/IEEE International Conference on Human-Robot Interaction, HRI '24, Association for Computing Machinery, New York, NY, USA, 2024, pp. 361–370. URL: https://dl.acm.org/doi/10.1145/3610977.3634969. doi:10.1145/3610977.3634969.