End-User Personalisation of Humanoid Robot Behaviour Through Vocal Interaction

Simone Gallo1,2,∗,†, Giacomo Vaiani1,2,† and Fabio Paternò1

1 CNR – ISTI, Via G. Moruzzi 1, 56127 Pisa, Italy
2 University of Pisa – Computer Science Dept., Largo B. Pontecorvo 3, 56127 Pisa, Italy

Abstract
This study explores the integration of Large Language Models with social robots to facilitate End-User Development through natural language interactions. The paper presents a prototype system embodied in a Pepper robot that allows non-expert users to customise robot behaviours by defining personalisation rules via vocal commands. The system employs trigger-action programming, enabling users to create automations based on specific triggers and actions without requiring in-depth technical knowledge. Through an example scenario, we show how users can program the robot by employing voice commands to execute actions when an event occurs. The created automations can also involve available IoT objects. The study investigates the potential of natural language interaction to improve the usability and flexibility of robot programming, offering new possibilities for personalised interactions in various settings.

Keywords
End-User Development, Human-Robot Interaction, Smart Spaces, CEUR-WS

Workshop Robots for Humans 2024, Advanced Visual Interfaces, Arenzano, June 3rd, 2024
∗ Corresponding author.
† These authors contributed equally.
simone.gallo@isti.cnr.it (S. Gallo); giacomo.vaiani@phd.unipi.it (G. Vaiani); fabio.paterno@isti.cnr.it (F. Paternò)
ORCID: 0000-0003-3412-1639 (S. Gallo); 0009-0002-0910-2284 (G. Vaiani); 0000-0001-8355-6909 (F. Paternò)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Over recent years, technological advancements have resulted in the development of increasingly sophisticated robots that are more closely integrated into human daily activities. This evolution is especially noticeable in the realm of social robots. These robots are designed to interact with humans in various social contexts, assisting with a range of tasks, including children's language education [1], older adults' cognitive training [2], and smart device management [3]. Moreover, robots could offer advantages over traditional voice assistants, particularly in routine activity detection and in support for individuals with functional limitations [4][5][6]. Several end-user development tools have been introduced to facilitate user engagement and customisation of these robots' behaviour. These tools adopt various paradigms, such as block-based [7] and natural language programming [8], enabling users to compose personalisation rules (for instance, having the robot say something when a person is in front of it, or perform specific actions based on vocal commands).

Recent advancements in artificial intelligence, particularly with Large Language Models (LLMs), have the potential to enhance robots' communicative and operational capabilities, enabling interactions that resemble human-like conversations and dynamically adapt to end-user requests [9]. Within this context, trigger-action programming emerges as an effective approach to End-User Development (EUD) in robotic systems. It allows users to define robot behaviours in response to specific events or conditions, offering a user-friendly way to customise robot functionalities without the need for deep technical knowledge.

This paper presents a prototype of a conversational agent embodied in a Pepper robot that utilises an LLM to assist non-expert users in creating personalisation rules in trigger-action format. Through vocal conversations, users can naturally express their preferences for how they want the robot to act when a specific event occurs. These events can be triggered by interacting with the robot itself (e.g., when the robot recognises a person) or by the surrounding smart environment (e.g., a change of temperature in a room, the opening or closing of windows/doors).

In the following sections, we first introduce related work in the field of trigger-action programming for robot end-user development, and then we delve into the architecture of the prototype and an application scenario. Finally, we discuss the next steps for this work.

2. Related Work

Various research studies have explored the possibilities of end-user programming of robot behaviour using the trigger-action paradigm, employing diverse interaction modalities and approaches. Leonardi et al. [10] exploited a graphical web-based wizard interface to enable users without programming skills to define personalisation rules by specifying events and/or conditions (triggers) that, once met, initiate the execution of defined actions. The tool allows the user to select triggers and actions related to smart devices (e.g., the motion detected by a sensor, turning on smart bulbs) and a Pepper robot (e.g., a touch on the robot's head). Thus, it was possible to create automations, such as having Pepper say "Hey, how are you?" when someone entering the room is detected, by combining Internet of Things devices with Pepper. In the present study, we propose the development of automations through direct interaction with the robotic system, as opposed to the utilisation of a separate web-based wizard.

Another contribution by Porfirio et al. [11] presents Tabula, a multimodal end-user development system for programming service robots for personal use in domestic and workplace environments. In this case, the system enables users without programming skills to script tasks, defining a humanoid robot's behaviour (a Pepper one) using trigger-action programming and combining natural language with sketches on a visual interface to define the automation. In particular, users can utilise natural language commands (via voice) to define triggers and actions. The resulting automation is visualised on a two-dimensional map, displaying the current environment (e.g., the user's house) and the defined path or actions of the robot. This setup allows users to modify or refine the automation, or to implement more complex logic that is difficult to express verbally. The implemented prototype encompasses a set of five actions (e.g., moving to a position, saying something) and two events (e.g., a person approaching or speaking to the robot), enabling the creation of automations like "when the user arrives home, the robot goes to the entrance". Although users considered the approach promising, the system faces challenges in processing natural language input due to difficulties in understanding complex or ambiguous commands, which leads to errors in automations and user frustration.

Finally, a recent work by Karli et al. [12] developed a system integrating ChatGPT to enable end-users to define robot programs (e.g., for defining the movement of a robotic arm) using natural language instructions. The system interface presents a chat from which the user sends the inputs, a console showing the generated code, and a view of the robot simulator. Beginning with a description of the desired robot behaviour, the user engages in a collaborative process that iteratively defines and debugs the specifications with the system to address the request. While the natural language interaction is effective, the study emphasises several critical points regarding the use of LLMs in this context. It highlights the necessity of enhancing the reliability of LLM-generated code through accurate code verification processes, crafting more effective prompts, and adjusting prompts dynamically to better fit the context.
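The trigger-action paradigm surveyed above can be illustrated with a minimal sketch. All names here (the `Rule` class, the event names, the `dispatch` helper) are hypothetical and for illustration only; they are not taken from any of the cited systems.

```python
from typing import Callable

# Minimal sketch of the trigger-action paradigm: a rule fires its action
# when its trigger event occurs and its optional condition holds.
class Rule:
    def __init__(self, trigger: str, action: Callable[[], str],
                 condition: Callable[[dict], bool] = lambda ctx: True):
        self.trigger = trigger      # the event that fires the rule
        self.action = action        # what the robot (or a smart object) does
        self.condition = condition  # optional extra check on the context

def dispatch(event: str, context: dict, rules: list) -> list:
    """Run the action of every rule whose trigger matches the event
    and whose condition holds for the current context."""
    return [r.action() for r in rules if r.trigger == event and r.condition(context)]

# "When someone entering the room is detected, Pepper greets them."
rules = [Rule("person_detected", lambda: "Hey, how are you?")]
greetings = dispatch("person_detected", {}, rules)  # fires the greeting
```

The point of the paradigm is exactly this separation: end users only pick a trigger and an action, while the rule engine takes care of matching events and executing the corresponding behaviour.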
In general, LLM approaches open new possibilities for applications across a broad range of settings in end-user development, highlighting the potential to enhance the usability and flexibility of robot programming.

3. Our Approach

In this proposal, we introduce the combined use of Pepper with an LLM agent, aiming to define the robot's behaviour through specific trigger-action personalisation rules (also called automations) expressed by voice. By integrating an LLM as a natural language processing module, users can communicate with Pepper in a more intuitive and conversational way. This enhancement enables Pepper to process complex commands and questions, significantly improving its usability and interactive capabilities. Furthermore, this system design lets users create automations verbally, eliminating the need for any programming skills. Figure 1 illustrates the architecture of the designed system.

Figure 1: System Architecture

Specifically, this prototype aims to create automations that include triggers and actions related to both a smart environment (e.g., a smart home) and the robot. Through these automations, it is possible to define events, conditions, and actions related to both sensors and smart objects (e.g., motion sensors, smart light bulbs, smart thermostats) and Pepper (e.g., recognise a person, display something on its tablet, say something). In this way, the robot becomes part of a smart ecosystem in which it can perform actions in response to triggers related to the environment, or be itself the trigger of events for the execution of actions by smart objects. On a technical level, Pepper can be considered a system entity at the same level as sensors and smart objects, and can thus be integrated into the control systems of smart environments. The created automations are then executed through an Automation Manager (we use Home Assistant1 as Automation Manager because it is open-source, robust, and widely used, with an active community). In particular, we consider the following triggers and actions involving Pepper:

• Triggers: chest button press, head touch, hand touch (both right and left hand), face recognition, emotion recognition, and speech recognition.
• Actions: speak, hand movement, run animation, position change, video camera activation, LEDs state change, show something on display.

When the user speaks, the robot utilises the Google Speech API to identify the spoken sentence. The system then makes an API call to the Dialogue Manager to process the user message. The API response, which is delivered using Pepper's voice, contains the response message for the user. Once the automation is complete, it is saved in a database and executed through the Automation Manager.

More generally, our approach opens the possibility of creating automations that involve robots and the surrounding connected sensors and objects. Indeed, the created automations can involve both triggers and actions associated with the robot (e.g., when Pepper recognises a specific person, it says "Hello [name]" and performs a greeting animation), but it is also possible to have triggers activated by external sensors or objects and actions executed by the robot. Vice versa, triggers can be generated by the robot, with actions executed by surrounding objects (e.g., when Pepper detects a negative emotion on the user's face, soft lights turn on and relaxing music plays).

1 https://www.home-assistant.io/
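To make the role of the Automation Manager concrete, the following sketch shows how a verified rule might be mapped onto an automation definition. The trigger/condition/action layout mirrors the general shape of Home Assistant automations, but the rule format, event names, and the `rule_to_automation` helper are illustrative assumptions, not the actual code of our prototype.

```python
# Illustrative sketch (not the prototype's actual code): turning a verified
# trigger-action rule into a Home Assistant-style automation definition.
def rule_to_automation(rule: dict) -> dict:
    """Map a rule like {'name': ..., 'when': ..., 'then': ...} onto the
    trigger/condition/action layout used for automations."""
    return {
        "alias": rule.get("name", "pepper_automation"),
        "trigger": [{"platform": "event", "event_type": rule["when"]}],
        "condition": [],  # optional conditions would go here
        "action": [{"event": "pepper_do", "event_data": {"command": rule["then"]}}],
    }

rule = {"name": "greet_known_person",
        "when": "pepper_face_recognised",   # trigger raised by the robot
        "then": "say_hello_and_wave"}       # action executed by the robot
automation = rule_to_automation(rule)
```

Treating Pepper as an event source and an event consumer, as in this sketch, is what lets the robot sit at the same level as any other smart object in the Automation Manager.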
3.1. Interacting with Pepper

The application integrates voice and text messaging functionalities, as well as message display editing. The vocal interaction approach utilises Google Cloud's Text-to-Speech (TTS) and Speech-to-Text (STT) APIs. This approach offers a comprehensive solution for adding vocal input and output capabilities, thereby enhancing the user experience by making interactions more natural and engaging.

Users can activate voice recognition in the robot by pressing a designated button on the interface or touching a part of the robot's body, such as the hand or head. This service converts vocal input into text. Initially, the service starts recording audio via the robot's microphone, visually notifying the user of the recording status with a change in button colour that indicates when the recording is active. Next, the service sends the audio stream to the Google STT service. The speech service promptly notifies the robot application when it receives the input converted into text format, allowing almost instantaneous processing of the transcribed message. The obtained text can be used for various purposes within the robot application, such as displaying it to the user, sending it to other backend services for further responses or actions, or triggering specific commands based on keywords.

The robot application's audio output is generated by a TTS functionality that handles authentication to the Google Cloud service, manages synthesis requests, and ultimately plays the resulting audio using a media player. The system saves the synthesised audio to a local file, readying it for playback. This conversion occurs after the message has been processed and added to the list of displayed messages, ensuring that the user can both see and hear the bot's response. A useful feature is the toggle functionality that allows users to choose between viewing all messages exchanged in the chat and viewing only the last two messages (the application's default view). The toggle is triggered by a method that flips a Boolean value, based on which the adapter decides which view to adopt. This mechanism helps keep the user interface tidy, showing only part of the messages during vocal interaction while allowing users to view the entire conversation, if desired. This decision was made under the assumption that, during a real-time vocal conversation, users do not always need a textual representation of the entire chat.

Finally, the interface also features a reset function, allowing users to activate, through a dedicated button, a series of actions that stop any ongoing activity, such as vocal recording or the playback of animations and vocal synthesis. Moreover, this functionality serves to clear the user interface by deleting the message history and returning any input fields or selections to their original state. This reset feature ensures a smooth and intuitive user experience by allowing users to easily reset the application without the need to navigate through complex menus or restart the app entirely.

3.2. Dialogue Manager

The Dialogue Manager serves as the core component for processing natural language inputs and managing the conversational flow. When a user engages with the robot, Pepper transcribes the user's speech into text and then sends it to the Dialogue Manager via an HTTP request. Upon receiving the text, the Dialogue Manager (a Flask Python server) forwards the message to the GPT-4 model through the OpenAI API. Guided by the instructions in the defined prompt, the model determines the next step in the conversation. It can either execute a function to perform a specific task or directly generate a textual response to the user's query.

The constructed prompt begins with a description of the role the model is expected to adopt (e.g., "You are Pepper, a humanoid robot..."), which is followed by an explanation of the task (e.g., "Your task is to help users create automations...") and some general guidelines to be adhered to when interacting with the user (e.g., "call the user by name" or "keep the response short and simple…"). Then, the prompt introduces the functions the model can use, along with instructions on when and how to use them (e.g., "always use the verify_automation function before saving the automation"). The function calling functionality, provided by the OpenAI API, enables the LLM to interface with external resources and tools. This is made possible by supplying the model with a set of function descriptions and the parameters required for their execution. In particular, we include a function for retrieving the list of possible triggers and actions for defining an automation, a function to verify the correctness of the defined automation, and a function for saving the created automation in a database.

When a function is invoked, its output is fed back into the model and becomes the basis for GPT-4 to generate an appropriate response. For example, the "save automation" function provides an output message containing the unique automation ID, as well as a confirmation message if the operation was successful or an error message otherwise. The model utilises this output to generate the response for the user. This response is then dispatched to Pepper, closing the loop of the initial HTTP request. Pepper then verbalises the response, providing the user with an audible answer.

Example Scenario. Let's consider a usage scenario in which the user talks with Pepper and defines an automation by saying something like: "When I come back home, if I'm in a bad mood, say something comforting". Consequently, Pepper will send this sentence to the Dialogue Manager, which retrieves the list of possible triggers and actions and proposes an initial automation to the user: "We can detect when you are home using the location of your smartphone, and I can detect your mood by your facial expressions. If you are sad or angry, I can put on relaxing music and say something comforting like...".

At this point, the user can continue talking with Pepper to refine the proposed automation. Once the user is satisfied with the defined triggers and actions, the Dialogue Manager uses the "verify automation" function to check that the automation contains only available triggers and actions. If the automation is correctly verified, Pepper asks the user whether it can be saved. After saving, a confirmation message summarises the created automation along with the ID assigned in the database. Once the automation is saved, the Automation Manager module executes the defined actions when the chosen event is triggered and the defined conditions, if any, are met.
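The function-calling loop described above can be sketched as a simplified local simulation. The three function names follow those mentioned in the text (retrieve, verify, save), and the trigger and action lists come from Section 3, but the signatures, the in-memory "database", and the `handle_tool_call` dispatcher are illustrative assumptions; in the real system, the choice of which function to call and with which arguments comes from the GPT-4 model through the OpenAI function-calling API.

```python
import json

# Capabilities available for automations (lists taken from Section 3).
TRIGGERS = ["chest button press", "head touch", "hand touch", "face recognition",
            "emotion recognition", "speech recognition"]
ACTIONS = ["speak", "hand movement", "run animation", "position change",
           "video camera activation", "LEDs state change", "show something on display"]
DB = {}  # stand-in for the automations database

def get_triggers_and_actions() -> dict:
    """Return the list of possible triggers and actions."""
    return {"triggers": TRIGGERS, "actions": ACTIONS}

def verify_automation(automation: dict) -> dict:
    """Check that the automation only uses available triggers and actions."""
    ok = automation.get("trigger") in TRIGGERS and automation.get("action") in ACTIONS
    return {"valid": ok}

def save_automation(automation: dict) -> dict:
    """Store the automation and return its unique id and a status message."""
    automation_id = len(DB) + 1
    DB[automation_id] = automation
    return {"status": "saved", "id": automation_id}

def handle_tool_call(name: str, arguments: str) -> str:
    """Execute a requested function; its JSON output is fed back to the model."""
    tools = {"get_triggers_and_actions": get_triggers_and_actions,
             "verify_automation": verify_automation,
             "save_automation": save_automation}
    args = json.loads(arguments) if arguments else {}
    return json.dumps(tools[name](**args))

# Hand-executing the sequence the prompt prescribes: verify before saving.
automation = {"trigger": "emotion recognition", "action": "speak"}
check = json.loads(handle_tool_call("verify_automation", json.dumps({"automation": automation})))
if check["valid"]:
    result = json.loads(handle_tool_call("save_automation", json.dumps({"automation": automation})))
```

Feeding each function's JSON output back to the model, as simulated here, is what lets GPT-4 confirm the save (or report an error) to the user in natural language.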
4. Conclusions and Future Work

This research explores the integration of Large Language Models into the domain of End-User Development for social robots, focusing on enabling users to customise robot behaviours through intuitive, natural language vocal interactions. By implementing a prototype conversational agent embodied in a Pepper robot, we enable non-expert users to create automations for personalising the robot's behaviour based on events in a smart environment (e.g., presence detection in a room) or on events involving the robot itself (e.g., human face recognition). This approach leverages trigger-action programming, presenting a user-friendly method for customising robot functionalities without requiring technical knowledge. Our proposal contributes to the field by illustrating the practical application of LLMs in enhancing robot usability and flexibility, suggesting a promising avenue for future research and development in social robotics and user-centric automation.

For future work, we plan to initially conduct user tests in a controlled environment (e.g., a laboratory setting) to evaluate the strengths and weaknesses of our solution in comparison with existing tools based on visual interfaces. User tests in real-world scenarios will follow, addressing the need for realistic, extended evaluations in robotics. Given the focus on the End-User Development approach, it is important to test with users without programming skills and home automation experience.

References

[1] J. de Wit, T. Schodde, B. Willemsen, K. Bergmann, M. de Haas, S. Kopp, E. Krahmer, P. Vogt, The Effect of a Robot's Gestures and Adaptive Tutoring on Children's Acquisition of Second Language Vocabularies, in: Proceedings of the 2018 ACM/IEEE International Conference on Human-Robot Interaction, HRI '18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 50–58. doi:10.1145/3171221.3171277.
[2] M. Manca, F. Paternò, C. Santoro, E. Zedda, C. Braschi, R. Franco, A. Sale, The impact of serious games with humanoid robots on mild cognitive impairment older adults, International Journal of Human-Computer Studies 145 (2021) 102509. doi:10.1016/j.ijhcs.2020.102509.
[3] H.-D. Bui, N. Y. Chong, An Integrated Approach to Human-Robot-Smart Environment Interaction Interface for Ambient Assisted Living, in: 2018 IEEE Workshop on Advanced Robotics and its Social Impacts (ARSO), 2018, pp. 32–37. doi:10.1109/ARSO.2018.8625821.
[4] N. Ramoly, A. Bouzeghoub, B. Finance, A Framework for Service Robots in Smart Home: An Efficient Solution for Domestic Healthcare, IRBM 39 (2018) 413–420. doi:10.1016/j.irbm.2018.10.010.
[5] E. Toscano, M. Spitale, F. Garzotto, Socially Assistive Robots in Smart Homes: Design Factors that Influence the User Perception, in: 2022 17th ACM/IEEE International Conference on Human-Robot Interaction (HRI), 2022, pp. 1075–1079. doi:10.1109/HRI53351.2022.9889467.
[6] G. Wilson, C. Pereyda, N. Raghunath, G. de la Cruz, S. Goel, S. Nesaei, B. Minor, M. Schmitter-Edgecombe, M. E. Taylor, D. J. Cook, Robot-enabled support of daily activities in smart home environments, Cognitive Systems Research 54 (2019) 258–272. doi:10.1016/j.cogsys.2018.10.032.
[7] Towards a Modular and Distributed End-User Development Framework for Human-Robot Interaction. URL: https://ieeexplore.ieee.org/document/9323043.
[8] S. Beschi, D. Fogli, F. Tampalini, CAPIRCI: A Multi-modal System for Collaborative Robot Programming, in: End-User Development, Lecture Notes in Computer Science, vol. 11553, Springer International Publishing, Cham, 2019, pp. 51–66. doi:10.1007/978-3-030-24781-2_4.
[9] S. Vemprala, R. Bonatti, A. Bucker, A. Kapoor, ChatGPT for Robotics: Design Principles and Model Abilities, 2023. arXiv:2306.17582. doi:10.48550/arXiv.2306.17582.
[10] N. Leonardi, M. Manca, F. Paternò, C. Santoro, Trigger-Action Programming for Personalising Humanoid Robot Behaviour, in: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 1–13. doi:10.1145/3290605.3300675.
[11] D. Porfirio, L. Stegner, M. Cakmak, A. Sauppé, A. Albarghouthi, B. Mutlu, Sketching Robot Programs On the Fly, in: Proceedings of the 2023 ACM/IEEE International Conference on Human-Robot Interaction, HRI '23, Association for Computing Machinery, New York, NY, USA, 2023, pp. 584–593. doi:10.1145/3568162.3576991.
[12] U. B. Karli, J.-T. Chen, V. N. Antony, C.-M. Huang, Alchemist: LLM-Aided End-User Development of Robot Applications, in: Proceedings of the 2024 ACM/IEEE International Conference on Human-Robot Interaction, HRI '24, Association for Computing Machinery, New York, NY, USA, 2024, pp. 361–370. doi:10.1145/3610977.3634969.