CORK: A COnversational agent framewoRK exploiting both rational and emotional intelligence

Fabio Catania (fabio.catania@polimi.it), Micol Spitale (micol.spitale@polimi.it), Davide Fisicaro (davide.fisicaro@polimi.it), Franca Garzotto (franca.garzotto@polimi.it)
Politecnico di Milano, Milano, Italy

ABSTRACT
This paper proposes CORK, a modular framework to facilitate and accelerate the realization and maintenance of intelligent Conversational Agents with both rational and emotional capabilities. A smart CA can be integrated in any digital device and aims at interpreting the user's natural language input and at responding to it consistently with its semantics, the context, the user's perceived emotional state, and her/his profile. CORK's strength is its ability to separate the content of the speech from the pure conversational scheme and rules: it allows developers to define, at design time, some general communicative patterns that are filled at run time with the right content depending on the context. At first this can be tiresome and time-consuming, but it then yields high scalability and low-cost modification and maintenance. The framework is potentially valid in general; so far it has been applied to develop conversational tools for people with neurodevelopmental disorders.

CCS CONCEPTS
• Human-centered computing → HCI theory, concepts and models; Empirical studies in HCI.

KEYWORDS
Conversational technology; Affective computing; Conversational Agent Framework

ACM Reference Format:
Fabio Catania, Micol Spitale, Davide Fisicaro, and Franca Garzotto. 2019. CORK: A COnversational agent framewoRK exploiting both rational and emotional intelligence. In Joint Proceedings of the ACM IUI 2019 Workshops, Los Angeles, USA, March 20, 2019, 8 pages.

IUI Workshops'19, March 20, 2019, Los Angeles, USA. Copyright © 2019 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.

1 INTRODUCTION
A Conversational Agent (CA), or dialogue system, is a software program able to interact through natural language. Voice-based CAs and chatbots are progressively becoming embedded in people's homes and daily routines: the best known are Apple's Siri, Amazon's Alexa (over 8 million users in January 2017), Google Assistant, and IBM's conversational services.
In ordinary human-human interaction, non-verbal information accounts for about 93% of message perception [7], and its relevance has recently been investigated intensively in human-computer interfaces as well. Affective computing is the interdisciplinary research field concerned with systems and devices that can recognize, interpret, process, and simulate human affects. It is currently one of the most active research topics and is attracting increasing attention. According to BusinessWire [5], the global affective computing market was valued at USD 16.17 billion in 2017 and is expected to reach USD 88.69 billion by 2023.
Emotional Conversational Agents are still in their infancy, and no widely adopted architectural framework has been proposed to facilitate their implementation and fair comparison. Designers of conversational technology know how tiring and time-consuming it is to create and maintain a dialog system: one has to detect and anticipate the countless ways a user may phrase an input to the system and couple each of them with the best output. What is more, modifying the content of the output means operating directly on the conversation. The ideal Conversational Agent should be able to adapt to different situations and scenarios and to be easily correctable in terms of content.
In this respect, this work contributes to advancing the state of the art: we propose CORK, a modular framework to facilitate and accelerate the realization and maintenance of intelligent, system-initiative Conversational Agents with both rational and emotional capabilities. A central and innovative challenge faced by CORK is the separation of the content of the speech from its pure conversational scheme and rules. This makes it possible to define at design time some general communicative patterns to be filled at run time with different contents depending on the context.
The framework is potentially valid in general; so far it has been applied to develop conversational tools for people with neurodevelopmental disorders (NDD). In particular, we investigate the use of Emoty, a spoken Conversational Agent built with our framework, to mitigate these persons' difficulty in recognizing and expressing emotions - a problem clinically referred to as Alexithymia.
2 STATE OF THE ART
ELIZA [39], one of the earliest CAs, was able to create the illusion that the system was actually listening to the user without any built-in framework for contextualizing events, relying on a simple pattern matching and substitution methodology.
As technology has evolved, retrieval-based conversational agents have become the most widely used. They take as input an assertion and a context (the conversation so far); then they use heuristic functions to pick an appropriate response from a predefined repository. The heuristic may be a complex ensemble of Machine Learning (ML) classifiers or a simpler rule-based expression match. In this way the set of possible responses is fixed, but at least it is grammatically correct. The chatbots built with Dialogflow [12], wit.ai [11], Alexa [2], Watson [19] and the Azure bot service [24] are all examples of rationally intelligent, retrieval-based conversational agents. The main disadvantage of this model is that a potentially large number of rules and patterns is required. Rule elicitation is undoubtedly a time-consuming, high-maintenance task. Moreover, modifying one rule or introducing a new one into the script invariably has an impact on the remaining rules [26].
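As an illustration of the rule-based flavour of this approach, the following minimal sketch (ours, not drawn from any of the cited platforms) matches regular-expression patterns against the user's utterance and picks a canned response from a fixed repository; the patterns and responses are invented for the example.

```python
import re

# Illustrative rule repository: (pattern, response) pairs.
# A real retrieval-based agent would hold many such rules, or rank
# candidate responses with ML classifiers instead of regexes.
RULES = [
    (re.compile(r"\bhello\b|\bhi\b", re.I), "Hello! How are you feeling today?"),
    (re.compile(r"\bsad\b|\bunhappy\b", re.I), "I am sorry to hear that. Do you want to talk about it?"),
    (re.compile(r"\bbye\b", re.I), "Goodbye, see you next time!"),
]

def respond(utterance: str) -> str:
    """Pick the first response whose pattern matches the user's utterance."""
    for pattern, response in RULES:
        if pattern.search(utterance):
            return response
    # The response set is fixed: anything unmatched gets a generic fallback.
    return "Sorry, I did not understand. Could you rephrase?"

if __name__ == "__main__":
    print(respond("hi there"))       # greeting rule fires
    print(respond("I feel so sad"))  # empathetic rule fires
```

The sketch also makes the maintenance problem visible: every new behaviour means a new hand-written rule, and each rule can interfere with the ones already in place.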
On the other hand, generative models do not rely on predefined responses but generate new answers from scratch using ML techniques. Unfortunately, they do not work well yet: they tend to make grammatical mistakes and to produce irrelevant, generic or inconsistent responses. In addition, they need a huge amount of training data and are very hard to optimize. Generative models remain an active area of research [22], [33].
Much research is also focused on pursuing socially and emotionally aware CAs [4], [27]. Human-computer interaction can be more effective when it takes the emotional state of the user into account. Cassell et al. [6] argue that embodied CAs should now be capable of developing deeper relationships that make their collaboration with humans productive and satisfying over long periods of time. Unfortunately, the basic versions of both retrieval-based and generative models do not take any emotional aspect of the conversation into account. We studied the architecture of several affective conversational agents, e.g. [16], [34], and found that all of them use an extension of the retrieval-based model that exploits the result of an emotional analysis as a criterion for choosing the final answer. We were surprised to note that the literature offers no generally used, shared framework for the realization of emotionally aware dialog systems.

3 THEORETICAL FUNDAMENTALS
According to the American theorist and university professor David Berlo [10], there are four main factors in the human communication process:
• the Source, who is the sender of the message;
• the Message, which includes the content of the communication, its structure, the context and the form in which the message is sent;
• the Channel, which is the medium used to send the message (looking, listening, tasting, touching and smelling);
• the Receiver, who is the person who gets the message and tries to understand what the sender wants to convey in order to respond accordingly.

Figure 1: Berlo's communication model

In this paper, we promote the application of these concepts to conversational technologies and argue that the basis of Berlo's theory can be adapted to human-computer natural language interaction, where the roles of sender and receiver are played alternately by a human being and an intelligent machine.

Figure 2: Adaptation of Berlo's model to Conversational Technology

The exploited channel depends on the nature of the conversational agent. Chatbots use written text as a conduit, while spoken dialogue systems use the verbal channel and have two essential components: a speech recognizer and a text-to-speech module. In addition, Conversational Agents can take advantage of the visual channel thanks to a camera, can be embodied by avatars and robots, and can exploit the physical channel to understand the user and communicate with her/him.
Consider a system-initiative Conversational Agent, that is, a dialog system that needs to completely control the flow of the conversation, asking the user a series of questions and ignoring (or misinterpreting) anything the user says that is not a direct answer to the system's question. In this setting, every time the system transmits a message to the user, it also attaches the context the message refers to. When the user answers, the message and the same context are sent back to the system. At this point, the dialog system decodes and elaborates the user's request (message plus context) and replies consistently.
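To make this exchange concrete, the sketch below models the envelope that travels back and forth at each turn: a message plus the conversational context it refers to. The dataclass and its field names are our own illustrative assumptions, not part of Berlo's model or of any specific implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Envelope:
    """What sender and receiver exchange at every turn: content plus context."""
    message: str                                   # the content of the communication
    context: dict = field(default_factory=dict)    # e.g. current question, dialog state

# System-initiative turn: the system asks a question and attaches the context.
system_turn = Envelope(
    message="How do you feel today?",
    context={"expected_intent": "report_emotion", "session": 42},
)

# The user's answer travels back with the same context, so the system
# knows which question the reply refers to when decoding it.
user_turn = Envelope(message="I feel happy", context=system_turn.context)
```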
4 THE FRAMEWORK
In order to hold a conversation, a dialog system must be able to carry out three main phases:
• decoding and understanding the natural language input thanks to the input recognizer;
• executing a consistent task and/or generating a logical output;
• rendering the output.
In designing this framework, we had in mind a Spoken Conversational Agent with emotional sensitivity. We propose a client-server software architecture exploiting the Model-View-Controller (MVC) architectural pattern. As is well known, MVC divides the software application into three interconnected logical tiers, so that the internal representation of information is separated from how the information is processed and from the ways it is presented to the user. This choice lends flexibility, robustness and scalability to the system. In addition, it makes it possible to easily integrate the realized CAs into digital devices such as tablets or smartphones, or to embed them in everyday physical objects (e.g., toys, home equipment). Each tier is described as follows:
• Client tier: it is the topmost level of the application and is designed to be as thin as possible. Its only goal is to manage the user's inputs and outputs during the whole conversational session. This means that it records the user while speaking, sends the recording to the server, waits for an answer and produces an artificial human speech as output.
• Logic tier: it processes the input message coming from the client tier with the help of external services and interacts with the persistence tier to select the best possible answer for the user.
• Persistence tier: it is in charge of storing and retrieving information from the database (i.e., data about the users, the preset output templates, the conversation contents, ...).
For the sake of reusability and maintainability, we propose a modular software architecture with independent blocks, such that each block contains everything necessary to execute exactly one aspect of the main elaboration. The modules constituting the early version of our architecture are the following:
• the recorder;
• the engine;
• the speech-to-text module;
• the NLP unit and intent detector;
• the sentiment analysis module;
• the emotional analysis module;
• the topic analysis module;
• the voice analysis module;
• the profiling module;
• the output creation module;
• the text-to-speech module.
Other modules executing additional elaborations can be integrated at a later stage.

Figure 3: Functional view of the modules composing CORK

Below is a detailed description of each component.

4.1 The recorder
This module lies on the client and, as its name suggests, is responsible for recording the user when she/he speaks and tries to interact with the system. Beyond that, it is in charge of sending the recorded audio file and the conversational context it refers to to the server, in particular to the engine module. The main challenge faced by the recorder module is identifying the instants at which to start and stop the recording (which ideally correspond to the moments when the user starts and finishes speaking). Optionally, advanced versions of this component can run an early processing of the audio data to remove background noise and perform speaker diarisation, that is, the process of partitioning an input audio stream into homogeneous segments according to speaker identity.

4.2 The engine
The logic and control of the whole system lie in the engine module. At every conversational step, it receives the message and the related context from the client tier; depending on the exploited channel, the message can be an audio recording, a written text, a picture or a tap. The engine then interprets and elaborates the message by calling the other components one by one in order to generate the best answer for the user; the processing may vary with respect to the context and the nature of the input message. For example, some modules may be excluded from the process and some elaborations may be parallelized. A basic elaboration flow is represented in Figure 4. Finally, the generated output message and context are sent back to the client tier.

Figure 4: The message processing flow
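A minimal sketch of how such an engine might orchestrate the modules at each conversational step is shown below. The module interfaces (`transcribe`, `detect_intent`, and so on) are hypothetical placeholders rather than the actual CORK code, and error handling, profiling, topic analysis and parallelization are omitted for brevity.

```python
from typing import Callable, Tuple

def handle_turn(
    audio: bytes,
    context: dict,
    transcribe: Callable[[bytes], str],
    detect_intent: Callable[[str, dict], str],
    analyse_sentiment: Callable[[str], str],
    analyse_text_emotion: Callable[[str], dict],
    analyse_voice_emotion: Callable[[bytes], str],
    create_output: Callable[..., Tuple[str, dict]],
    synthesize: Callable[[str], bytes],
) -> Tuple[bytes, dict]:
    """One conversational step: from the recorded audio to the synthesized answer.

    Every callable stands for one CORK module; depending on the context and
    on the nature of the input, some of them may be skipped or parallelized.
    """
    text = transcribe(audio)                      # speech-to-text module
    intent = detect_intent(text, context)         # NLP unit and intent detector
    sentiment = analyse_sentiment(text)           # sentiment analysis module
    text_emotion = analyse_text_emotion(text)     # emotional analysis module
    voice_emotion = analyse_voice_emotion(audio)  # voice analysis module

    # Output creation: select a template and fill it with content (Section 4.10).
    answer, new_context = create_output(intent, sentiment, text_emotion,
                                        voice_emotion, context)

    # The synthesized speech and the updated context go back to the client tier.
    return synthesize(answer), new_context
```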
4.3 The speech-to-text module
The speech-to-text module gets as input a prerecorded audio file containing spoken words. It applies deep learning and neural network algorithms to recognize the speech, transcribe it into text and return it as a string. Some speech recognition systems require "training", in which an individual speaker reads text into the system; the system then analyzes the person's specific voice and uses it to fine-tune the recognition of that person's speech, resulting in increased accuracy. Especially because of the large amount of data they have at their disposal to train their models, the cutting-edge speech-to-text services are Google Cloud Speech-to-Text [13], IBM Watson Speech to Text [18] and Microsoft Azure Speech to Text [25]. They work for a wide range of languages and are easily embeddable via API.

4.4 The NLP unit and intent detector
The Natural Language Processing unit and intent detector is one of the most relevant modules of the whole system. It receives as input a string of text representing the transcription of the user's speech and the context it refers to. By exploiting domain knowledge and natural language comprehension capabilities, it analyzes, understands and returns the user's intent. Intents are links between what the user says and how the bot interprets it. Contexts are useful for differentiating requests that might have different meanings depending on previous requests. The platforms Dialogflow [12], wit.ai [11], Alexa [2], Watson [19] and the Azure bot service [24] use an intent-based approach: they elaborate a reaction to the user's intention detected by the NLP unit. Again, the large amount of data these platforms have at their disposal to train their models makes their NLP units the state-of-the-art technologies for detecting the user's intention at every conversation step.

4.5 The sentiment analysis module
Sentiment analysis is about determining whether a given text expresses positivity, neutrality or negativity. This module gets as input the user's speech and returns its polarity. The most advanced APIs on the market are Google Cloud Natural Language [15], Azure Text Analytics [23], Repustate [31], Text Processing [29] and TextRazor [35].

4.6 The emotional analysis module
Emotion analysis extracts the feelings expressed by a text using Natural Language Processing. Feelings are, in this case, the Big Six emotions (joy, sadness, fear, anger, surprise, disgust) and other cognitive states such as confidence, uncertainty and attention. This module gets as input the transcription of the user's speech. After analyzing it, the service returns a dictionary that maps the emotions and cognitive states to the probability that the author is expressing them. The most advanced APIs of this kind are IBM Tone Analyzer [17], Indico.io [21], Qemotion [30] and TheySay [36], and they all work with the English language.
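As an illustration, the dictionary returned by such a service and a simple way for the engine to consume it might look as follows; the probability values and the confidence threshold are made-up examples, not the output or the policy of any specific API.

```python
# Hypothetical output of a text-based emotion analysis service:
# probabilities for the Big Six emotions (values are made up).
analysis = {
    "joy": 0.08,
    "sadness": 0.71,
    "fear": 0.05,
    "anger": 0.09,
    "surprise": 0.04,
    "disgust": 0.03,
}

# A simple consumption policy: keep the most likely emotion only if the
# service is confident enough, otherwise fall back to "neutrality"
# (the 0.5 threshold is an arbitrary illustrative choice).
emotion, probability = max(analysis.items(), key=lambda item: item[1])
detected = emotion if probability >= 0.5 else "neutrality"
print(detected)  # -> "sadness"
```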
4.7 The topic analysis module
The topic analysis module uses machine learning and natural language processing to discover the abstract topics that occur in a written text. Topic modeling is a frequently used text-mining tool for discovering hidden semantic structures in a text body. Intuitively, given that a written text is about a particular topic, one would expect particular words to appear in it more or less frequently. At the time of writing, Google Cloud Natural Language [15], TextRazor [35] and MeaningCloud Topic Extraction [8] are three relevant and advanced APIs providing this kind of service.

4.8 The profiling module
The profiling module customizes the contents and the style of the conversation for each individual user. More in detail, it exploits relevant information about the user (personal details, preferences, needs, ...) to influence the best output to provide to the user and the best way to deliver it. To do so, it interacts with the persistence tier: it is in charge of storing this information and retrieving it when necessary. Relevant information can be obtained during previous conversations, providing the dialog system with episodic memory. Alternatively, it can be received through other channels, such as social networks, pre-configuration by the user, or third-party configuration.

4.9 The voice analysis module
This module is in charge of recognizing the emotion in a speaker's voice. It gets as input an audio recording containing human speech and returns the emotion perceived by analyzing the pitch and intonation of the voice. Emotional speech analysis based on the harmonic features of the audio is a relatively young research field, and there is much room for improvement. The most relevant APIs providing this kind of service are Good Vibrations [37], Affectiva [1] and Vokaturi [38].

4.10 The output creation module
This innovative module permits the customization of the output to be delivered to the user, in order to obtain a user-centered, content-driven conversational agent. In a first stage, it picks the best-fitting answer from a table of predefined templates according to:
• the state of the conversation,
• the detected intention of the user,
• the specific user,
• the sentiment detected from the semantics,
• the emotion recognized both from the text and from the pitch of the voice of the user's message.
In a second stage, it fills the selected template with the right content chosen from the contents table according to the result of the topic analysis and, again, to the user's intention. The final generated output is returned to the engine module. This kind of table-driven architecture permits easy updating and modification of the contents without touching the conversational structure.
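A minimal sketch of this two-stage, table-driven mechanism is shown below; the tables, their keys and the selection logic are invented for illustration and are far simpler than what a real deployment would require.

```python
# Stage 1 table: communicative templates, indexed here by (intent, emotion).
# Keeping the wording separate from the contents is what lets designers
# change what is said without touching the conversational structure.
TEMPLATES = {
    ("ask_story", "sadness"): "I am here with you. Would you like to hear a story about {topic}?",
    ("ask_story", "joy"):     "Great! Shall we read a story about {topic} together?",
}

# Stage 2 table: contents used to fill the templates, indexed by topic.
CONTENTS = {
    "animals": "a little elephant",
    "school":  "the first day of school",
}

def create_output(intent: str, emotion: str, topic: str) -> str:
    """Pick the best-fitting template, then fill it with the right content."""
    template = TEMPLATES.get((intent, emotion),
                             "I did not quite get that. Can you say it again?")
    return template.format(topic=CONTENTS.get(topic, "something you like"))

print(create_output("ask_story", "sadness", "animals"))
```

Updating what the agent can talk about then amounts to editing the contents table, while the communicative patterns stay untouched.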
4.11 The text-to-speech module
The text-to-speech module is responsible for human voice synthesis, that is, the artificial production of human speech. Compared to recorded human speech, the advantage of a synthesized voice is that its content can change and be customized at runtime. By playing with the voice features, the text-to-speech module may express emotions, moods and cognitive states. The quality of a speech synthesizer is judged by its similarity to the human voice and by its ability to be understood clearly. As for the speech-to-text module, some speech synthesizers permit "training", in which an individual speaker reads text into the system; the system then analyzes the person's specific voice and uses it to fine-tune the reproduction of that person's typical sounds. Many text-to-speech APIs are available on the market, such as Google Cloud Text-to-Speech [14], Amazon Polly [3], IBM Watson Text to Speech [20] and ResponsiveVoice [32]. They all provide female and male voices speaking different languages with different accents.

5 DEVELOPED CONVERSATIONAL AGENTS
The CORK framework is potentially valid in general. So far, it has been applied to develop conversational tools for people with neurodevelopmental disorders (NDD). NDD is a group of conditions characterized by severe deficits in the cognitive, emotional and motor areas that produce impairments of social functioning. Any kind of NDD is a chronic condition, but early and focused interventions are thought to mitigate its effects. As an experiment, we investigated the use of conversational technologies as a tool to lead persons with NDD to enhance their emotion expression and communication capabilities and to become more socially integrated.

5.1 Ele
As a first part of our work, we developed Ele (see Figure 5), an embodied Wizard-of-Oz dialog system with the appearance of an elephant. This robot is meant to be a conversational companion that speaks through the live voice of a remote human operator and can engage the user in dialogues and storytelling, enriching the communication with the toy's body movement effects. In return, the user can interact with the social robot verbally, by making facial expressions and by touch. This system has been used to detect some relevant conversational patterns for people with NDD: for up to one month we observed, with the help of an expert, the weekly sessions of a group of 11 adults with NDD aged 25 to 43 interacting with Ele. From this experience and from three weekly meeting sessions with two psychological specialists, it emerged that certain ways of speaking (with many repetitions, reassurances and continuous reinforcements) are the most effective for people with NDD.

Figure 5: Ele

5.2 Emoty
Once we had detected a considerable number of conversational patterns for people with neurodevelopmental disorders, we designed a tool for people affected by a specific disturbance called Alexithymia, that is, the inability to identify and express emotions. In particular, we developed Emoty, a dialog system playing the role of emotional facilitator and trainer by exploiting emotion detection capabilities on both text and audio data. Emoty proposes to the user some emotion expression tasks in the form of a game, organized in increasingly difficult levels. For example, the user may be asked to read an assigned sentence trying to express a given emotion with her/his tone of voice.

Figure 6: Emoty

The project has been designed in close collaboration with caregivers and psychologists, who actively participated in the following phases:
(1) eliciting the key requirements,
(2) evaluating iterative prototypes,
(3) performing an exploratory evaluation.
In addition to the verbal channel, the application exploits the visual one as a support to the player (see Figure 6). Specifically, the use of emojis facilitates the conceptualization of the emotions, and the emotion-color combination works as a reinforcement of the spoken feedback given by the system. This kind of matching is very common in the literature, and we decided to follow Plutchik's theory [28]. In his model, joy is represented as yellow, fear as dark green, surprise as cyan, sadness as blue, and anger as red. We associated neutrality with light grey, which is also used as the background color of the system.
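For reference, the emotion-to-color mapping described above can be written down as a small configuration table; the hex values are our own approximations of the named colors.

```python
# Emotion-to-color mapping following Plutchik's model, as used by Emoty.
# Hex values are approximations of the colors named in the text.
EMOTION_COLORS = {
    "joy":        "#FFD700",  # yellow
    "fear":       "#006400",  # dark green
    "surprise":   "#00FFFF",  # cyan
    "sadness":    "#0000FF",  # blue
    "anger":      "#FF0000",  # red
    "neutrality": "#D3D3D3",  # light grey, also the background color
}
```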
5.2.1 Architectural overview. The system follows the guidance of CORK, the conversational agent framework. It has been realized as a web application, because web apps are accessible to everybody via a browser on everyday devices and do not need any preliminary configuration. They permit both vocal and visual interaction with the user thanks to the screen, the microphone and the speakers of the executing device.
The engine module of the entire system lies in the Cloud and is accessible through serverless functions triggered via HTTPS requests. This guarantees fair pricing, a safe execution environment, and high scalability and availability. For every conversation step, the engine remotely calls Google Cloud Speech-to-Text to get a transcription of the user's speech and then Dialogflow to detect the user's intention according to the context. The semantic analysis of the input contents is delegated by our system to Indico.io. This external service returns a dictionary that maps anger, fear, joy, sadness and surprise to the probability with which the author is expressing each emotion. At the time of development, there was no emotion analysis API working with the Italian language, which forced us to translate our Italian inputs into English and then let them be processed by the emotion recognizer. The automatic translation may imply a loss of information to be taken into account; however, after a preliminary evaluation, the results are satisfactory.
The emotion recognition from the harmonic features of the audio is performed by an original machine learning model developed entirely by us. The process is organized in two subsequent steps: (1) feature extraction and (2) classification into emotions. The former extracts temporal and spectral characteristics of the audio signal, whereas the classification has been implemented using a supervised learning approach based on deep neural networks. In particular, a wide and deep (convolutional) neural network has been built to exploit the principle of temporal locality across audio signal partitions and increase the discriminatory strength of the model. In order to properly train the model, an open-source, free Italian dataset called Emovo [9] has been used. It is a corpus of recordings by 6 actors playing 14 different sentences while simulating the five emotion categories.
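The paper does not report the exact feature set or network layout, so the following sketch only illustrates the two-step structure, using librosa for MFCC-based feature extraction and a small Keras 1-D convolutional network as a stand-in classifier; the feature choice, network shape and all parameters are our assumptions, not the authors' actual "wide and deep" model.

```python
import numpy as np
import librosa
import tensorflow as tf

EMOTIONS = ["joy", "sadness", "fear", "anger", "surprise"]  # five categories, as in Emovo

def extract_features(path: str, n_mfcc: int = 13, frames: int = 128) -> np.ndarray:
    """Step 1: temporal/spectral characteristics (here, MFCCs padded to a fixed length)."""
    signal, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, time)
    mfcc = librosa.util.fix_length(mfcc, size=frames, axis=1)    # pad/trim the time axis
    return mfcc.T                                                # (frames, n_mfcc)

def build_classifier(frames: int = 128, n_mfcc: int = 13) -> tf.keras.Model:
    """Step 2: a small 1-D convolutional classifier over audio-signal partitions."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(frames, n_mfcc)),
        tf.keras.layers.Conv1D(32, kernel_size=5, activation="relu"),
        tf.keras.layers.GlobalMaxPooling1D(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(len(EMOTIONS), activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```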
The messages by the system are generated by the output creation module, which fills the conversational templates identified with Ele with the right contents depending on the context and on the interacting user. Information about the users is entered before the first session by the supporting caregiver: the name, the age and some parameters indicating the user's ability to express and recognize emotions. Finally, the human voice synthesis is performed by calling the Google Cloud Text-to-Speech API.

Figure 7: A user playing with the application and a drawing made by a user as a side activity

5.2.2 Exploratory study. Once ready, Emoty was field tested by people with NDD of mild or moderate severity, ranging from 16 to 60 years old. Our study is planned as a six-month evaluation organized in weekly sessions of 10-15 minutes integrated in the daily activities at a specialized care center. The experimental sessions focus on both:
• the improvement in performance (i.e., the completion of the activities) of users across sessions over time;
• the usability of the application.
So far, only three weeks of experimentation have been conducted. Already after this short time, according to the attending therapist, some users tend to be more eager to interact with our device than to speak to other people. We know that an evaluation based on three sessions is hardly statistically significant; therefore, for now, the caregivers' opinions are the most valuable feedback we can collect: they observed a growing awareness of feelings by the participants. The performance of three users across the three sessions is reported in Figure 8. These encouraging early results are a clue that the detected conversational patterns fit the target users' needs and that the conversational agent framework was well designed for developing complex, emotionally sensitive dialog systems (at least for this application domain).

Figure 8: Users' progress across the sessions playing with Emoty

6 CONCLUSION
This work proposes a solution to a common need we observed in the literature: the necessity of a structured, tested and simplified way to develop a conversational agent. In this paper we described CORK, a modular framework to facilitate and accelerate the realization and maintenance of intelligent Conversational Agents with both rational and emotional capabilities. Our framework has a client-server architecture exploiting the Model-View-Controller (MVC) pattern. In this way, it will be possible to easily integrate the realized CAs into digital devices such as tablets or smartphones, or to embed them in everyday physical objects (e.g., toys, home equipment).
In addition, the system is organized in independent, reusable software modules controlled by a centralized engine, such that each of them deals with a single functionality in the whole conversational process. The proposed components are the recorder, the engine, the speech-to-text module, the NLP unit and intent detector, the sentiment analysis module, the emotional analysis module, the topic analysis module, the voice analysis module, the profiling module, the output creation module and the text-to-speech module. Other modules executing additional elaborations can be integrated at will.
A central and innovative challenge faced by CORK is the separation of the content of the speech from its pure conversational scheme and rules. This makes it possible to define at design time some general communicative patterns to be filled at run time with different contents depending on the context.
To test the effectiveness of the proposed framework, we designed and developed, in cooperation with psychologists and therapists, an emotional facilitator and trainer called Emoty for individuals with neurodevelopmental disorders, as a conversational tool supporting regular interventions. Although Emoty is still a prototype, the initial results of our empirical study indicate that it can be easily used by therapists and persons with NDD and has some potential to mitigate Alexithymia and its effects. After only three weeks of practice with the emotional trainer, the caregivers observed a growing awareness by the users of their own feelings. These encouraging early results are a clue that the chosen conversational patterns fit the target users' needs and that the conversational agent framework was well designed for developing complex, emotionally sensitive dialog systems, both for this application domain and in general.
The natural follow-up of this project will be the development of a user-friendly platform, inspired by IBM Watson Assistant and Dialogflow, that permits the development of content-driven, emotionally sensitive Conversational Agents without any programming. In addition, we will try to improve the effectiveness of the framework by testing it with new dialog systems in new application domains. In parallel, we will work to improve the accuracy of the emotion recognition Machine Learning (ML) model. To do so, we will create and tag a larger, proprietary emotional dataset collecting speech-based conversations and dialogues with the collaboration of local theatre companies and acting schools.

REFERENCES
[1] Affectiva. 2018. Affectiva. https://www.affectiva.com
[2] Amazon. 2018. Alexa. https://developer.amazon.com/it/alexa
[3] Amazon. 2018. Amazon Polly. https://aws.amazon.com/polly/?nc1=f_ls
[4] Christian Becker, Stefan Kopp, and Ipke Wachsmuth. 2007. Why emotions should be integrated into conversational agents. Conversational informatics: an engineering approach (2007), 49–68.
[5] BusinessWire. 2018. Global Affective Computing Market 2018-2023. https://bit.ly/2Qv5kKX
[6] Justine Cassell, Alastair J Gill, and Paul A Tepper. 2007. Coordination in conversation and rapport. In Proceedings of the workshop on Embodied Language Processing. Association for Computational Linguistics, 41–50.
[7] Claude C Chibelushi and Fabrice Bourel. 2003. Facial expression recognition: A brief tutorial overview. CVonline: On-Line Compendium of Computer Vision 9 (2003).
[8] MeaningCloud. 2018. MeaningCloud Topic Extraction. https://www.meaningcloud.com/developer/topics-extraction
[9] Giovanni Costantini, Iacopo Iaderola, Andrea Paoloni, and Massimiliano Todisco. 2014. EMOVO corpus: an Italian emotional speech database. In International Conference on Language Resources and Evaluation (LREC 2014). European Language Resources Association (ELRA), 3501–3504.
[10] Richard S Croft. 2004. Communication theory. Eastern Oregon University, La Grande, OR (2004).
[11] Facebook. 2018. wit.ai. https://wit.ai
[12] Google. 2018. Dialogflow. https://dialogflow.com
[13] Google. 2018. Google Cloud Speech-to-Text. https://cloud.google.com/speech-to-text
[14] Google. 2018. Google Cloud Text-to-Speech. https://cloud.google.com/text-to-speech
[15] Google. 2018. Google Cloud Natural Language. https://cloud.google.com/natural-language
[16] Shang Guo, Jonathan Lenchner, Jonathan Connell, Mishal Dholakia, and Hidemasa Muta. 2017. Conversational bootstrapping and other tricks of a concierge robot. In Proceedings of the 2017 ACM/IEEE International Conference on Human-Robot Interaction. ACM, 73–81.
[17] IBM. 2018. IBM Tone Analyzer. https://www.ibm.com/watson/services/tone-analyzer
[18] IBM. 2018. IBM Watson Speech to Text. https://www.ibm.com/watson/services/speech-to-text
[19] IBM. 2018. Watson. https://www.ibm.com/watson
[20] IBM. 2018. Watson Text to Speech. https://www.ibm.com/watson/services/text-to-speech
[21] Indico.io. 2018. Indico.io. https://indico.io
[22] Jiwei Li, Michel Galley, Chris Brockett, Georgios P Spithourakis, Jianfeng Gao, and Bill Dolan. 2016. A persona-based neural conversation model. arXiv preprint arXiv:1603.06155 (2016).
[23] Microsoft. 2018. Azure Text Analytics. https://azure.microsoft.com/en-us/services/cognitive-services/text-analytics
[24] Microsoft. 2018. Microsoft Azure Bot Service. https://azure.microsoft.com/en-us/services/bot-service
[25] Microsoft. 2018. Microsoft Azure Speech to Text. https://azure.microsoft.com/en-us/services/cognitive-services/speech-to-text
[26] Karen O'Shea, Zuhair Bandar, and Keeley Crockett. 2010. A conversational agent framework using semantic analysis. International Journal of Intelligent Computing Research (IJICR) 1, 1/2 (2010).
[27] Rosalind Wright Picard et al. 1995. Affective computing. (1995).
[28] Robert Plutchik. 2001. The nature of emotions: Human emotions have deep evolutionary roots, a fact that may explain their complexity and provide tools for clinical practice. American Scientist 89, 4 (2001), 344–350.
[29] Text Processing. 2018. Text Processing. http://text-processing.com
[30] Qemotion. 2018. Qemotion. https://www.qemotion.com
[31] Repustate. 2018. Repustate. https://www.repustate.com
[32] ResponsiveVoice. 2018. ResponsiveVoice. https://responsivevoice.org
[33] Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C Courville, and Joelle Pineau. 2016. Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models. In AAAI, Vol. 16. 3776–3784.
[34] Heung-Yeung Shum, Xiao-dong He, and Di Li. 2018. From Eliza to XiaoIce: challenges and opportunities with social chatbots. Frontiers of Information Technology & Electronic Engineering 19, 1 (2018), 10–26.
[35] TextRazor. 2018. TextRazor. https://www.textrazor.com
[36] TheySay. 2018. TheySay. http://www.theysay.io
[37] Good Vibrations. 2018. Good Vibrations. https://www.good-vibrations.it
[38] Vokaturi. 2018. Vokaturi. https://vokaturi.com
[39] Joseph Weizenbaum. 1966. ELIZA - a computer program for the study of natural language communication between man and machine. Commun. ACM 9, 1 (1966), 36–45.