Dialog Management for a Social Assistive Robot in the Domain of Elderly Care

Berardina De Carolis, Giampaolo Flace, Angela La Forgia, Nicola Macchiarulo and Giovanni Melone
Department of Computer Science, University of Bari, Italy
Exprivia S.p.A., Molfetta, Italy

Abstract

The world population is aging, and providing appropriate care for elderly people as their health and independent functioning decline is a major concern of the aged care industry. Socially Assistive Robot technology could assume an important role in health and social care to meet this growing demand. In this paper, we present an architecture that supports believable conversation capabilities of a social robot in the context of daily task assistance to elderly people. The conversational system is based on a BDI architecture that mixes deliberative and reactive reasoning to determine, at each step, which goals are valid and, consequently, which action to perform to reach a goal according to the current dialog context, including the emotional state of the user. The architecture has been preliminarily tested on a low-cost robot, Ohmni in this case, usually dedicated to telepresence, and the results of this evaluation show that, from the dialog management point of view, it is robust enough to handle dialogs typical of the elderly care domain.

Keywords
Social Robot, Conversational System, Belief Desire Intention

1. Introduction

In today's society, the geriatric population is steadily increasing and requires more and more assistance and monitoring [1]. These needs are sometimes not fully met due to the shortage of elderly care personnel. In this context, one of the most ambitious challenges is to introduce Socially Assistive Robots (SAR) that provide assistance to the elderly by relying on social interaction [2].
The ability to communicate using natural language, supported by high-level dialog management, is therefore a fundamental requirement for a social robot, since spoken dialogue is generally considered the most natural modality for human-robot interaction [3]. Research in the field of Human-Robot Interaction (HRI) is increasingly focusing on robots equipped with intelligent conversational abilities, developing dialogue systems able to deal with real-world scenarios and to support specific tasks, especially in social contexts. Simulating human-to-human communication to enhance and ease human-to-machine communication is still a challenging task, especially when the goal is to enable a natural, adaptive, context- and affect-aware interaction.

Italian Workshop on Artificial Intelligence for an Ageing Society (AIxAS 2021), November 29th, 2021. © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

To this aim, we developed the architecture of a Conversational System (CS) able to support believable conversations in the context of daily task assistance to elderly people. In this work, a general architecture to implement a CS is presented, with the aim of supporting the conversational capabilities of social robots, even low-cost ones like the robot used for the prototype presented in this paper: Ohmni, a low-cost domestic robot with the Android operating system, developed by OhmniLabs and usually dedicated to telepresence. It does not have a humanoid appearance: it consists of a simple metal rod that supports a microphone, a speaker, a webcam, and a touchscreen, mounted on a mobile base that allows the robot to move.
Such rudimentary features, and the lack of conversational capabilities or SDK functions to implement them, make Ohmni a perfect candidate for the developed conversational system. The proposed architecture is based on the Beliefs, Desires, Intentions (BDI) model. As explained in [4], BDI architectures are primarily focused on practical reasoning, i.e., the process of mixing deliberative and reactive reasoning in order to determine, at each step, which goals are valid and, consequently, which action to perform to reach a goal according to the current state of the world. Mixing reactive and proactive approaches enables the management of coherent conversations while remaining responsive to unexpected user inputs. The dialog module then takes into account the agent's knowledge (the beliefs) and selects at each step the best goal to be fulfilled through the activation of an intention that the robot commits to. Intentions are then satisfied through the execution of plans that may include dialog acts and service activations. The model integrates a deliberative and a reactive component, since goals are triggered, revised dynamically, and satisfied by triggering appropriate plans, so that an appropriate response can be generated and returned to the user.

During the interaction between users and the robot, perceptions, as is common for conversational interfaces, are captured by analysing the user's speech during the dialogs. The voice captured through the robot's microphone is used both for Speech-To-Text (STT) conversion and for emotion analysis. The converted text is processed with Natural Language Processing (NLP) techniques in order to extract the user intention (intent), significant entities, and sentiment polarity. The voice signal is then analysed with dedicated software to recognize emotions.
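The perception-to-action cycle described above can be sketched as a minimal BDI-style loop, assuming a simple dictionary of beliefs and condition/plan pairs; the class, belief keys, and plan names below are illustrative, not the system's actual API.

```python
# A minimal, illustrative BDI-style deliberation cycle: beliefs are
# revised from perceptions, then the first feasible plan becomes the
# intention the robot commits to for the current dialog turn.

class BDIDialogManager:
    """Revises beliefs from perceptions and commits to a feasible plan."""

    def __init__(self):
        self.beliefs = {}   # perceptions and background knowledge
        self.plans = []     # (condition over beliefs, plan name)

    def add_plan(self, condition, plan_name):
        self.plans.append((condition, plan_name))

    def perceive(self, **perceptions):
        # Belief revision: newly extracted perceptions overwrite old ones.
        self.beliefs.update(perceptions)

    def deliberate(self):
        # The first plan whose condition holds becomes the intention.
        for condition, plan_name in self.plans:
            if condition(self.beliefs):
                return plan_name
        return "fallback_plan"  # no feasible plan: generic response

manager = BDIDialogManager()
manager.add_plan(lambda b: b.get("intent") == "CheckUp", "daily_checkup_plan")
manager.add_plan(lambda b: b.get("emotion") == "sadness", "comfort_plan")
manager.perceive(intent="CheckUp", sentiment="neutral", emotion="happiness")
chosen = manager.deliberate()
```

The mix of deliberation (plan conditions over the belief base) and reactivity (beliefs overwritten on every perception) is what lets the same loop both pursue a goal and react to an unexpected emotional state.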
This information, also called extracted perceptions, along with background knowledge and the active goal, is interpreted as beliefs and used by a dialogue management component to trigger goals that are satisfied by pre-defined plans. These plans are generally short, as they have to adapt to the unpredictable interaction with the user, who, after all, can say anything. This architecture has been fully implemented and tested on the Ohmni robotic platform in the elderly care interaction scenario realized for the SI-Robotics project.

The remainder of the paper details this conversational architecture. The next section introduces the Italian project SI-Robotics. Section 3 provides an overview of the architecture as well as the dialog system that we have developed for our robot, going into detail about its components. Section 4 provides a practical application of the conversational system in the context of elderly assistance. Section 5, finally, summarises our main contributions and restates the key challenges that human-robot interaction brings to Artificial Intelligence.

2. SI-Robotics

SI-Robotics is an Italian project that aims to design and develop innovative collaborative robotics solutions to support human beings in health and care services, in home, residential, and hospital environments. The project aims to produce advanced models of interaction designed to motivate active aging. These solutions are easily adaptable and help elderly people in daily activities, anticipating their needs and offering teleassistance, monitoring, and coaching services. To meet these needs, SI-Robotics proposes the realization of a cognitive agent able to interface with humans, IoT devices, social robots, and services present in the cloud [5].
The development of the SI-Robotics project has been entrusted to Exprivia, which collaborates with sixteen other partners to design the system architecture, software services, and AI-based functionalities, also taking care of their integration.

3. An Overview of the Conversational System

The Conversational System (CS) allows the social robot, Ohmni in this case, to operate as a virtual assistant, providing it with both reactive and deliberative capabilities based on the architecture illustrated in Figure 1. It is composed of two main components:
• The Conversational User Interface (CUI), developed as a web application running on the robot's tablet, which enables the interaction between the senior and Ohmni;
• The Conversational Agent (CA), which, based on the input received from the CUI, interprets it and produces an answer, along with a possible action to be performed.

Figure 1: Conversational system architecture in the form of components and message flow.

Figure 1 shows at a high level the interaction among these components. In particular, the touchscreen on the Ohmni robot supports the interaction through the CUI, which has been developed as a web application due to the constraints of the Ohmni management system [6]. It is displayed in a special Android app that implements a WebView. The CUI allows not only displaying multimedia content but also managing the speech-based interaction with the user. The workflow of the interaction process is described in detail in the following section. The robot communicates with a cloud platform to receive events, such as the start of a daily check-up or a reminder task. The events are read by the CUI in order to notify the CA, and a specific behavior is activated.
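A minimal sketch of how such a cloud-platform event (here, a medicine reminder) could be routed to the behavior that handles it; the function names, event types, and response fields below are illustrative assumptions, not the project's actual API.

```python
# Illustrative event dispatch: a JSON event from the cloud platform is
# routed to the behavior registered for its type, which builds the
# structured response the CUI can render and read aloud.

import json

def handle_cloud_event(event_json, behaviors):
    """Dispatch a platform event to the behavior registered for its type."""
    event = json.loads(event_json)
    behavior = behaviors.get(event["type"])
    if behavior is None:
        # Unknown event: reply with a generic message and no action.
        return {"text": "Evento non gestito.", "suggestions": [], "action": None}
    return behavior(event.get("payload", {}))

# Behavior activated when the medicine reminder event arrives.
behaviors = {
    "medicine_reminder": lambda payload: {
        "text": f"È ora di prendere {payload['medicine']}.",
        "suggestions": ["Fatto", "Ricordamelo dopo"],
        "action": "show_reminder",
    },
}

response = handle_cloud_event(
    '{"type": "medicine_reminder", "payload": {"medicine": "Aspirina"}}',
    behaviors,
)
```

Registering behaviors per event type keeps the CUI-side logic thin: adding a new assistive event only requires adding a new entry to the dispatch table.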
Once the user's voice is recorded or an event is triggered, the CUI sends a request to the web API of the CA in order to obtain a response in JSON format containing the following information: the textual response to be played back to the user, one or more suggestions about what the robot expects the user to respond, an HTML content to be rendered, and a possible action to be performed. To provide a more general view of the components and subcomponents that make up the CS, Figure 2 shows the component diagram.

Figure 2: Component diagram of the conversational system.

3.1. The Conversational User Interface

For reasons strictly related to the integration with the Ohmni robot, the CUI, shown in Figure 3, is implemented as a web application using Javascript as the programming language. Its features include:
• Recording the user's voice while talking;
• Communicating with the bot by submitting the recorded audio or a captured photo;
• Displaying the content of the CA response message (composed of a textual answer, one or more suggestions, and possibly a multimedia content);
• Synthesizing speech to reproduce the textual response received from the bot;
• Visualizing the webcam video stream in a dedicated frame;
• Capturing a photo using the robot's webcam;
• Communicating with the cloud platform to receive events;
• Executing specific actions.

Figure 3: Conversational user interface implemented in the SI-ROBOTICS project.

In particular, the HarkJS library is used for voice detection, while the Ohmni Standalone library provided by OhmniLabs is used for speech synthesis.

3.2.
The Conversational Agent

The CA implements the following functionalities:
• Converting the received audio to text (Speech-to-Text) using the Google Speech service;
• Recognizing a person's face and, from it, his/her gender;
• Sending communication messages to the cloud platform;
• Storing and retrieving user data from the database;
• Extracting perceptions from text and audio;
• Managing the conversational flow through a dialog management function.

The last two aspects are explored in more detail in the following sections.

3.2.1. Extraction of Perceptions

The following perceptions are extracted during the dialog:
• User intention: the goal or activity the user wishes to accomplish;
• Sentiment polarity: positive, negative, or neutral;
• Entities: useful pieces of information that can be found in a sentence;
• Emotion: happiness, fear, anger, disgust, sadness, or surprise.

Figure 4: Extraction of perceptions with a practical example.

As shown in Figure 4, the first three perceptions are extracted from the converted text, while the emotion is detected directly from the analysis of the recorded audio using Vokaturi [7]. To limit as much as possible the use of paid cognitive services in the cloud, ad-hoc predictive models based on deep learning were trained for intent, entity, and sentiment recognition. In particular, intent and sentiment are predicted using Google BERT [8], while entities are extracted using a mixed approach based on Spacy [9], regular expressions, and text search that also considers lemmas. Finally, the Vokaturi library is used for emotion recognition. It should be noted that the use of BERT for intent recognition and sentiment analysis is justified by the fact that this deep learning model adopts a bidirectional architecture of transformers. Thanks to the multi-head self-attention mechanism, BERT reads the entire word sequence at once during term processing, and this enables learning contextual relationships between words (i.e.
it can take into account the context of the sentence). In addition, it is one of the first models to adopt the transfer learning technique within NLP. The pre-training phase involves unsupervised learning, training the network on generic tasks, while in the fine-tuning phase the network is reused and adapted to perform a more specific task with a supervised approach. The downstream task used for both intent recognition and sentiment analysis during the fine-tuning process was sequence classification. The advantages of transfer learning are a reduced training time, improved predictions, and the ability to obtain good results with a relatively small dataset.

3.2.2. Dialogue Management

Once the perceptions are obtained, a dialogue mechanism inspired by the BDI model is used to obtain the response. In our approach, the conversational capabilities and dialog management of an agent are implemented through the BDI agent model [10], which has been used successfully in a range of applications. This model allows for a mix of reactive behaviour and goal-directed reasoning, supporting different means of achieving a goal depending on the context and other factors [11]. This mixed reactive/proactive model enables the management of coherent conversational activity while remaining responsive to unexpected user input and aware of changes in the conversation context. BDI plans provide knowledge of how to perform different types of conversational activities, while appropriate Knowledge Bases (KBs) contain information about the entities associated with those activities. BDI agent-based approaches to dialogue management have been previously proposed (e.g., [12, 13]); however, these have typically targeted task-oriented conversations (e.g., accessing email or managing an appointment). In our approach, we use the BDI framework to provide variability in the way a goal is progressively achieved, as well as in the conversational content [14].
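The transfer-learning setup described above for intent recognition (a pre-trained encoder reused with a small, task-specific classification head fine-tuned under supervision) can be illustrated with a toy stand-in; the keyword features and hand-set weights below replace BERT's contextual embeddings purely for the sake of a self-contained example and are not the system's actual model.

```python
# Toy stand-in for the BERT + sequence classification setup: a fixed
# "pre-trained" encoder produces features, and only the small head on
# top is adapted to the intent task. Keywords and weights are invented
# for illustration; the real system uses fine-tuned BERT.

INTENTS = ["CheckUp", "RemindMed", "HealthState", "BotAbility"]
KEYWORDS = ["check", "pill", "fever", "able"]

def pretrained_encoder(sentence):
    # Frozen encoder: maps a sentence to a fixed-size feature vector
    # (here, crude keyword counts; BERT would give contextual vectors).
    words = sentence.lower().split()
    return [sum(w.startswith(k) for w in words) for k in KEYWORDS]

def classify(features, head_weights):
    # Trainable head: one linear score per intent, argmax decision.
    scores = [sum(w * f for w, f in zip(row, features)) for row in head_weights]
    return INTENTS[scores.index(max(scores))]

# A hand-set head stands in for the weights learned during supervised
# fine-tuning on a (relatively small) labeled intent dataset.
head = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
predicted = classify(pretrained_encoder("please remind me the pills"), head)
```

The split mirrors the advantage noted above: the expensive part (the encoder) is pre-trained once on generic data, so only the small head has to be learned from the domain-specific examples.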
The goals of the CA are activated on the basis of its beliefs, which are eventually revised according to changes in its perceptions. Goals are fulfilled through the selection of different plans whose execution is, in turn, adapted to different contexts (e.g., recognized emotions, time of the day), thus producing behaviors that are appropriate to the user's situation. A goal is implemented as a function parameterized with the converted text, the extracted perceptions, and a reference to the CA instance. The plans associated with this function are created by placing logical conditions on the values of the perceptions, especially on the intent. Goals are managed using a stack, and the function associated with the goal at the top of the stack is called to obtain the response. This mechanism needs at least one goal function to work; for this reason, a General goal has been implemented and placed by default as the first element of the stack. When there are no feasible plans (i.e., no condition in the goal function is satisfied), the CA first responds in a generic way, notifying the user that it has not understood and inviting the user to repeat. If this happens a second time, it formulates the previous question again. As the dialog evolves, goals are pushed onto the stack and the user's request is managed by the corresponding goal function. Finally, to prevent the robot from responding when it is not needed, communication with the CA is initially disabled and is re-enabled when the user utters one of the following wake-words: robot or ohmni.

4. An Example in the Context of Elderly Assistance

The conversational capabilities of the Ohmni robot have been finalized to support the following tasks:
• Registration of the senior's profile;
• Communication of relevant symptoms and general conditions;
• Telepresence;
• Information provision (e.g., weather);
• Medicine reminders;
• Daily check-up.
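The goal-stack mechanism of Section 3.2.2 can be sketched for two of the tasks above; the goal functions, plan conditions, and replies below are illustrative assumptions, not the deployed code.

```python
# Illustrative goal stack: the General goal sits at the bottom as the
# default handler; goal functions inspect perceptions (notably the
# intent) to select a plan, push sub-dialog goals, and pop them when
# the sub-dialog completes.

def general_goal(text, perceptions, agent):
    # Default goal at the bottom of the stack: routes on the intent.
    intent = perceptions.get("intent")
    if intent == "CheckUp":
        agent["stack"].append(checkup_goal)  # push a sub-dialog goal
        return "Va bene, iniziamo il check-up. Come ti senti oggi?"
    if intent == "BotAbility":
        return "Posso ricordarti le medicine e fare il check-up giornaliero."
    # No feasible plan: generic recovery, invite the user to repeat.
    return "Non ho capito, puoi ripetere?"

def checkup_goal(text, perceptions, agent):
    if perceptions.get("intent") == "HealthState":
        agent["stack"].pop()  # sub-dialog completed, back to General
        return "Grazie, ho registrato le tue condizioni."
    return "Dimmi pure come ti senti."

agent = {"stack": [general_goal]}  # the General goal is always present

def respond(text, perceptions):
    # The goal on top of the stack handles the current user turn.
    return agent["stack"][-1](text, perceptions, agent)

r1 = respond("facciamo il check-up quotidiano", {"intent": "CheckUp"})
r2 = respond("mi sembra di avere la febbre", {"intent": "HealthState"})
```

Because each turn is handled by whichever goal is on top, the robot can pursue a multi-turn task like the check-up while the General goal remains available as a fallback once the sub-dialog is popped.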
To this aim, we trained our system to recognize 10 main intents and 38 smalltalks [15]. Examples of intents that can be recognized during the conversation are the following:
• CheckUp (e.g., "let's make the daily check-up");
• RemindMed (e.g., "please remind me the pills I have to take today");
• HealthState (e.g., "I feel like I have a fever");
• BotAbility (e.g., "what are you able to do?").

As far as entities are concerned, so far we are able to recognize the following 8 categories of entities:

Entity Name    Approach   Description
SYMPTOM        Spacy      Symptom or illness (e.g. fever, stomach ache)
LOC            Spacy      Country, city, or place (e.g. Mola di Bari, Molfetta)
EMOTIVE        Search     Emotional state of the user (e.g. happy, sad)
MEASUREMENT    Search     Vital parameters (e.g. pulse, blood pressure)
DAY            Search     Day of the week (e.g. Monday, Saturday)
TEMPORAL       Search     Time indicator (e.g. today, tomorrow)
ROLE           Search     Role held (e.g. doctor, nephew)
NUM            Regex      Numbers (e.g. 82, 180)

Note that while the LOC entity is recognized by the basic Italian model provided by Spacy, the SYMPTOM entity was trained manually. In addition, regular expressions could be used to capture the NUM entity, since the Google STT tool automatically converts the numbers spoken by the user into numeric format.

To clarify how the system is expected to dialog with seniors in the context of assisted daily life, the example in Figure 5 describes a conversation that could be carried out by the system. In Figure 6, instead, screenshots of the CUI during the dialogue between the elder and the robot are shown.

Figure 5: Example of a conversation between an elderly person and the robot. The robot receives the medicine reminder event from the cloud platform and starts the conversation.

Figure 6: Examples of CUI responses during dialogue.
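The Search and Regex passes from the entity table above can be sketched as follows, limited to the NUM, DAY, and MEASUREMENT entities (the spaCy-based SYMPTOM and LOC entities are omitted); the gazetteer contents are small illustrative samples, not the system's full lexicons.

```python
# Sketch of the mixed search/regex entity extraction: a regex pass
# captures numbers (the STT engine already renders spoken numbers as
# digits), and a search pass looks tokens up in per-entity gazetteers.

import re

GAZETTEERS = {
    "DAY": {"lunedì", "martedì", "mercoledì", "giovedì",
            "venerdì", "sabato", "domenica"},
    "MEASUREMENT": {"pressione", "battito", "temperatura"},
}

def extract_entities(text):
    entities = []
    # Regex pass for NUM, allowing a decimal part (e.g. "36,5").
    for match in re.finditer(r"\d+(?:[.,]\d+)?", text):
        entities.append(("NUM", match.group()))
    # Search pass: look each lower-cased token up in the gazetteers.
    for token in re.findall(r"\w+", text.lower()):
        for name, lexicon in GAZETTEERS.items():
            if token in lexicon:
                entities.append((name, token))
    return entities

ents = extract_entities("Sabato la pressione era 120")
```

A production version would also match lemmas rather than surface forms, as the paper notes; plain set membership is used here to keep the sketch self-contained.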
As shown in Figure 7, the testing of the conversational solution was conducted both from PCs via browser and on the Ohmni robot at the Exprivia premises. COVID-19 restrictions on access to ALHs (Assisted Living Houses) prevented testing from being conducted with elderly people.

Figure 7: Testing of the conversational system from the browser and on the Ohmni robot.

5. Conclusions and Future Work

The work described in this paper is a preliminary step towards a general architecture of a Social Assistive Robot for elderly people living in Assisted Living Houses. In this phase of our research, we developed a conversational system based on a BDI architecture and integrated it into the Ohmni robot, a low-cost product designed only for telepresence tasks, in order to endow it with conversational capabilities. The prototype obtained so far gives satisfactory results, but it needs improvement, especially in voice detection. Future developments will focus on the following points:
• Adding new assistive features by introducing more intents and plans;
• Adopting smart microphones capable of filtering noise to improve speech recognition;
• Evaluating the conversational system by carrying out usability and technology acceptance tests with real users (i.e., elderly people in ALHs);
• Developing the services in such a way that the conversational system can also be used on other platforms.

Acknowledgments

This research was supported by the "SocIal ROBOTics for active and healthy ageing" (SI-ROBOTICS) project, funded by the Italian "Ministero dell'Istruzione, dell'Università e della Ricerca" under the framework "PON Ricerca e Innovazione 2014-2020", Grant Agreement ARS01 01120.

References

[1] Elderly population across EU regions, 2020. URL: https://ec.europa.eu/eurostat/web/products-eurostat-news/-/DDN-20200402-1.
[2] D. Feil-Seifer, M. Matarić, Defining socially assistive robotics, 2005, pp. 465–468. doi:10.1109/ICORR.2005.1501143.
[3] T.
Fong, I. Nourbakhsh, K. Dautenhahn, A survey of socially interactive robots, Robotics and Autonomous Systems 42 (2003) 143–166. doi:10.1016/S0921-8890(02)00372-X.
[4] A. M. Andrew, Reasoning about Rational Agents, by Michael Wooldridge, MIT Press, Cambridge, Mass., 2000, xi+227 pp., ISBN 0-262-23213 (hardback £23.50), Robotica 19 (2001) 459–462. doi:10.1017/S0263574701213496.
[5] SI-Robotics, 2021. URL: https://www.exprivia.it/it/show-info-event-full.php?id_event=6009.
[6] WebAPI - Ohmni Developer Manual, 2021. URL: https://docs.ohmnilabs.com/webapi.
[7] J. Garcia-Garcia, V. Penichet, M. Lozano, Emotion detection: a technology review, 2017, pp. 1–8. doi:10.1145/3123818.3123852.
[8] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, arXiv (2018). URL: https://arxiv.org/abs/1810.04805v2. arXiv:1810.04805.
[9] M. Honnibal, I. Montani, S. Van Landeghem, A. Boyd, spaCy: Industrial-strength Natural Language Processing in Python, 2020. URL: https://doi.org/10.5281/zenodo.1212303. doi:10.5281/zenodo.1212303.
[10] A. S. Rao, M. Georgeff, BDI Agents: From Theory to Practice, in: Proceedings of the First International Conference on Multi-Agent Systems (ICMAS-95), 1995. URL: https://www.semanticscholar.org/paper/BDI-Agents%3A-From-Theory-to-Practice-Rao-Georgeff/8bb51f40236fd06406f22b31fcacb381539c3bf9.
[11] L. Padgham, M. Winikoff, Developing Intelligent Agent Systems: A Practical Guide, Wiley, Hoboken, NJ, USA, 2004.
[12] A. Nguyen, W. Wobcke, An agent-based approach to dialogue management in personal assistants, in: Proceedings of the International Conference on Intelligent User Interfaces (IUI), 2005, pp. 137–144. doi:10.1145/1040830.1040865.
[13] J. van Oijen, W. A. van Doesburg, F. Dignum, Goal-Based Communication Using BDI Agents as Virtual Humans in Training: An Ontology Driven Dialogue System, 2010, pp. 38–52. doi:10.1007/978-3-642-18181-8_3.
[14] W. Wong, J. Thangarajah, L. Padgham, Flexible conversation management using a BDI agent approach, volume 7502, 2012, pp. 464–470. doi:10.1007/978-3-642-33197-8_48.
[15] Microsoft, Bot Framework Tools, 2021. URL: https://github.com/microsoft/botframework-cli/blob/main/packages/qnamaker/docs/chit-chat-dataset.md, [Online; accessed 11 Nov. 2021].