CORK: A COnversational agent framewoRK exploiting both rational and emotional intelligence

Fabio Catania (fabio.catania@polimi.it), Micol Spitale (micol.spitale@polimi.it), Davide Fisicaro (davide.fisicaro@polimi.it), Franca Garzotto (franca.garzotto@polimi.it)
Politecnico di Milano, Milano, Italy

ABSTRACT
This paper proposes CORK, a modular framework to facilitate and accelerate the realization and maintenance of intelligent Conversational Agents with both rational and emotional capabilities. A smart CA can be integrated in any digital device and aims at interpreting the user's natural language input and at responding to it consistently with its semantics, the context, the user's perceived emotional state, and her/his profile. CORK's strength is its ability to separate the content of the speech from the pure conversational scheme and rules: it allows developers to define, at design time, some general communicative patterns that are filled at run time with the right content depending on the context. At first this can be tiresome and time-consuming, but it then yields high scalability and low-cost modification and maintenance. The framework is potentially valid in general; so far it has been applied to develop conversational tools for people with neurodevelopmental disorders.

CCS CONCEPTS
• Human-centered computing → HCI theory, concepts and models; Empirical studies in HCI.

KEYWORDS
Conversational technology; Affective computing; Conversational Agent Framework

ACM Reference Format:
Fabio Catania, Micol Spitale, Davide Fisicaro, and Franca Garzotto. 2019. CORK: A COnversational agent framewoRK exploiting both rational and emotional intelligence. In Joint Proceedings of the ACM IUI 2019 Workshops, Los Angeles, USA, March 20, 2019, 8 pages.

IUI Workshops'19, March 20, 2019, Los Angeles, USA. Copyright © 2019 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.

1 INTRODUCTION
A Conversational Agent (CA), or dialogue system, is a software program able to interact through natural language. Voice-based CAs and chatbots are progressively becoming embedded in people's homes and daily routines: the best known are Apple's Siri, Amazon's Alexa (over 8 million users in January 2017), Google Assistant, and IBM's conversational services.
In ordinary human-human interaction, non-verbal information accounts for about 93% of message perception [7], and its relevance has recently been investigated intensively in human-computer interfaces as well. Affective computing is the interdisciplinary research field concerned with systems and devices that can recognize, interpret, process, and simulate human affects. It is currently one of the most active research topics and is attracting increasing attention. According to BusinessWire [5], the global affective computing market was valued at USD 16.17 billion in 2017 and is expected to reach USD 88.69 billion by 2023.
Emotional Conversational Agents are still in their infancy, and no widely adopted architectural framework has been proposed to facilitate their implementation and fair comparison. Designers of conversational technology know how tiring and time-consuming it is to create and maintain a dialog system: one has to detect and anticipate the countless ways a user may phrase an input to the system and couple each of them with the best output. What is more, modifying the content of the output means operating directly on the conversation. The ideal Conversational Agent should be able to adapt to different situations and scenarios and to be easily correctable in terms of content.
In this respect, this work contributes to advancing the state of the art: we propose CORK, a modular framework to facilitate and accelerate the realization and maintenance of intelligent, system-initiative Conversational Agents with both rational and emotional capabilities. A central and innovative challenge faced by CORK is the separation of the content of the speech from its pure conversational scheme and rules. This makes it possible to define at design time some general communicative patterns to be filled at run time with different contents depending on the context.
The framework is potentially valid in general; so far it has been applied to develop conversational tools for people with neurodevelopmental disorders (NDD). In particular, we investigate the use of Emoty, a spoken Conversational Agent built with our framework, to mitigate these persons' difficulty in recognizing and expressing emotions - a problem clinically referred to as Alexithymia.
2 STATE OF THE ART
ELIZA [39], one of the earliest CAs, was able to create the illusion that the system was actually listening to the user without any built-in framework for contextualizing events, relying on a simple pattern matching and substitution methodology.
As technology has evolved, retrieval-based conversational agents have become the most widely used. They take as input an assertion and a context (the conversation so far); then they use heuristic functions to pick an appropriate response from a predefined repository. The heuristic may be a complex ensemble of Machine Learning (ML) classifiers or a simpler rule-based expression match. In this way the set of possible responses is fixed, but at least it is grammatically correct. The chatbots built with Dialogflow [12], wit.ai [11], Alexa [2], Watson [19] and the Azure bot service [24] are all examples of rationally intelligent, retrieval-based conversational agents. The main disadvantage of this model is that a potentially large number of rules and patterns is required. Rule elicitation is undoubtedly a time-consuming, high-maintenance task. Moreover, modifying one rule or introducing a new one into the script invariably has an impact on the remaining rules [26].
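As an illustration of the rule-based flavour of this approach, the following minimal sketch (ours, not drawn from any of the cited platforms) matches regular-expression patterns against the user's utterance and picks a canned response from a fixed repository; the patterns and responses are invented for the example.

```python
import re

# Illustrative rule repository: (pattern, response) pairs.
# A real retrieval-based agent would hold many such rules, or rank
# candidate responses with ML classifiers instead of regexes.
RULES = [
    (re.compile(r"\bhello\b|\bhi\b", re.I), "Hello! How are you feeling today?"),
    (re.compile(r"\bsad\b|\bunhappy\b", re.I), "I am sorry to hear that. Do you want to talk about it?"),
    (re.compile(r"\bbye\b", re.I), "Goodbye, see you next time!"),
]

def respond(utterance: str) -> str:
    """Pick the first response whose pattern matches the user's utterance."""
    for pattern, response in RULES:
        if pattern.search(utterance):
            return response
    # The response set is fixed: anything unmatched gets a generic fallback.
    return "Sorry, I did not understand. Could you rephrase?"

if __name__ == "__main__":
    print(respond("hi there"))       # greeting rule fires
    print(respond("I feel so sad"))  # empathetic rule fires
```

The sketch also makes the maintenance problem visible: every new behaviour means a new hand-written rule, and each rule can interfere with the ones already in place.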
On the other hand, generative models do not rely on predefined responses but generate new answers from scratch using ML techniques. Unfortunately, they do not work well yet: they tend to make grammatical mistakes and to produce irrelevant, generic or inconsistent responses. In addition, they need a huge amount of training data and are very hard to optimize. Generative models remain an active area of research [22], [33].
Much research is also focused on pursuing socially and emotionally aware CAs [4], [27]. Human-computer interaction can be more effective when it takes the emotional state of the user into account. Cassell et al. [6] argue that embodied CAs should now be capable of developing deeper relationships that make their collaboration with humans productive and satisfying over long periods of time. Unfortunately, the basic versions of both retrieval-based and generative models do not take any emotional aspect of the conversation into account. We studied the architecture of several affective conversational agents, e.g. [16], [34], and found that all of them use an extension of the retrieval-based model that exploits the result of an emotional analysis as a criterion for choosing the final answer. We were surprised to note that the literature offers no generally used, shared framework for the realization of emotionally aware dialog systems.

3 THEORETICAL FUNDAMENTALS
According to the American theorist and university professor David Berlo [10], there are four main factors in the human communication process:
• the Source, who is the sender of the message;
• the Message, which includes the content of the communication, its structure, the context and the form in which the message is sent;
• the Channel, which is the medium used to send the message (looking, listening, tasting, touching and smelling);
• the Receiver, who is the person who gets the message and tries to understand what the sender wants to convey in order to respond accordingly.

Figure 1: Berlo's communication model

In this paper, we promote the application of these concepts to conversational technologies and argue that the basis of Berlo's theory can be adapted to human-computer natural language interaction, where the roles of sender and receiver are played alternately by a human being and an intelligent machine.

Figure 2: Adaptation of Berlo's model to Conversational Technology

The exploited channel depends on the nature of the conversational agent. Chatbots use written text as a conduit, while spoken dialogue systems use the verbal channel and have two essential components: a speech recognizer and a text-to-speech module. In addition, Conversational Agents can take advantage of the visual channel thanks to a camera, can be embodied by avatars and robots, and can exploit the physical channel to understand the user and communicate with her/him.
Consider a system-initiative Conversational Agent, that is, a dialog system that needs to completely control the flow of the conversation, asking the user a series of questions and ignoring (or misinterpreting) anything the user says that is not a direct answer to the system's question. In this setting, every time the system transmits a message to the user, it also attaches the context the message refers to. When the user answers, the message and the same context are sent back to the system. At this point, the dialog system decodes and elaborates the user's request (message plus context) and replies consistently.
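To make this exchange concrete, the sketch below models the envelope that travels back and forth at each turn: a message plus the conversational context it refers to. The dataclass and its field names are our own illustrative assumptions, not part of Berlo's model or of any specific implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Envelope:
    """What sender and receiver exchange at every turn: content plus context."""
    message: str                                   # the content of the communication
    context: dict = field(default_factory=dict)    # e.g. current question, dialog state

# System-initiative turn: the system asks a question and attaches the context.
system_turn = Envelope(
    message="How do you feel today?",
    context={"expected_intent": "report_emotion", "session": 42},
)

# The user's answer travels back with the same context, so the system
# knows which question the reply refers to when decoding it.
user_turn = Envelope(message="I feel happy", context=system_turn.context)
```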
4 THE FRAMEWORK
In order to hold a conversation, a dialog system must be able to carry out three main phases:
• decoding and understanding the natural language input thanks to the input recognizer;
• executing a consistent task and/or generating a logical output;
• rendering the output.
In designing this framework, we had in mind a Spoken Conversational Agent with emotional sensitivity. We propose a client-server software architecture exploiting the Model-View-Controller (MVC) architectural pattern. As is well known, MVC divides the software application into three interconnected logical tiers, so that the internal representation of information is separated from how the information is processed and from the ways it is presented to the user. This choice lends flexibility, robustness and scalability to the system. In addition, it makes it possible to easily integrate the realized CAs into digital devices such as tablets or smartphones, or to embed them in everyday physical objects (e.g., toys, home equipment). Each tier is described as follows:
• Client tier: it is the topmost level of the application and is designed to be as thin as possible. Its only goal is to manage the user's inputs and outputs during the whole conversational session. This means that it records the user while speaking, sends the recording to the server, waits for an answer and produces an artificial human speech as output.
• Logic tier: it processes the input message coming from the client tier with the help of external services and interacts with the persistence tier to select the best possible answer for the user.
• Persistence tier: it is in charge of storing and retrieving information from the database (i.e., data about the users, the preset output templates, the conversation contents, ...).
For the sake of reusability and maintainability, we propose a modular software architecture with independent blocks, such that each block contains everything necessary to execute exactly one aspect of the main elaboration. The modules constituting the early version of our architecture are the following:
• the recorder;
• the engine;
• the speech-to-text module;
• the NLP unit and intent detector;
• the sentiment analysis module;
• the emotional analysis module;
• the topic analysis module;
• the voice analysis module;
• the profiling module;
• the output creation module;
• the text-to-speech module.
Other modules executing additional elaborations can be integrated at a later stage.

Figure 3: Functional view of the modules composing CORK

Below is a detailed description of each component.

4.1 The recorder
This module lies on the client and, as its name suggests, is responsible for recording the user when she/he speaks and tries to interact with the system. Beyond that, it is in charge of sending the recorded audio file and the conversational context it refers to to the server, in particular to the engine module. The main challenge faced by the recorder module is identifying the instants at which to start and stop the recording (which ideally correspond to the moments when the user starts and finishes speaking). Optionally, advanced versions of this component can run an early processing of the audio data to remove background noise and perform speaker diarisation, that is, the process of partitioning an input audio stream into homogeneous segments according to speaker identity.

4.2 The engine
The logic and control of the whole system lie in the engine module. At every conversational step, it receives the message and the related context from the client tier; depending on the exploited channel, the message can be an audio recording, a written text, a picture or a tap. The engine then interprets and elaborates the message by calling the other components one by one in order to generate the best answer for the user; the processing may vary with respect to the context and the nature of the input message. For example, some modules may be excluded from the process and some elaborations may be parallelized. A basic elaboration flow is represented in Figure 4. Finally, the generated output message and context are sent back to the client tier.

Figure 4: The message processing flow
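A minimal sketch of how such an engine might orchestrate the modules at each conversational step is shown below. The module interfaces (`transcribe`, `detect_intent`, and so on) are hypothetical placeholders rather than the actual CORK code, and error handling, profiling, topic analysis and parallelization are omitted for brevity.

```python
from typing import Callable, Tuple

def handle_turn(
    audio: bytes,
    context: dict,
    transcribe: Callable[[bytes], str],
    detect_intent: Callable[[str, dict], str],
    analyse_sentiment: Callable[[str], str],
    analyse_text_emotion: Callable[[str], dict],
    analyse_voice_emotion: Callable[[bytes], str],
    create_output: Callable[..., Tuple[str, dict]],
    synthesize: Callable[[str], bytes],
) -> Tuple[bytes, dict]:
    """One conversational step: from the recorded audio to the synthesized answer.

    Every callable stands for one CORK module; depending on the context and
    on the nature of the input, some of them may be skipped or parallelized.
    """
    text = transcribe(audio)                      # speech-to-text module
    intent = detect_intent(text, context)         # NLP unit and intent detector
    sentiment = analyse_sentiment(text)           # sentiment analysis module
    text_emotion = analyse_text_emotion(text)     # emotional analysis module
    voice_emotion = analyse_voice_emotion(audio)  # voice analysis module

    # Output creation: select a template and fill it with content (Section 4.10).
    answer, new_context = create_output(intent, sentiment, text_emotion,
                                        voice_emotion, context)

    # The synthesized speech and the updated context go back to the client tier.
    return synthesize(answer), new_context
```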
4.3 The speech-to-text module
The speech-to-text module gets as input a prerecorded audio file containing spoken words. It applies deep learning and neural network algorithms to recognize the speech, transcribe it into text and return it as a string. Some speech recognition systems require "training", in which an individual speaker reads text into the system; the system then analyzes the person's specific voice and uses it to fine-tune the recognition of that person's speech, resulting in increased accuracy. Especially because of the large amount of data they have at their disposal to train their models, the cutting-edge speech-to-text services are Google Cloud Speech-to-Text [13], IBM Watson Speech to Text [18] and Microsoft Azure Speech to Text [25]. They work for a wide range of languages and are easily embeddable via API.

4.4 The NLP unit and intent detector
The Natural Language Processing unit and intent detector is one of the most relevant modules of the whole system. It receives as input a string of text representing the transcription of the user's speech and the context it refers to. By exploiting domain knowledge and natural language comprehension capabilities, it analyzes, understands and returns the user's intent. Intents are links between what the user says and how the bot interprets it. Contexts are useful for differentiating requests that might have different meanings depending on previous requests. The platforms Dialogflow [12], wit.ai [11], Alexa [2], Watson [19] and the Azure bot service [24] use an intent-based approach: they elaborate a reaction to the user's intention detected by the NLP unit. Again, the large amount of data these platforms have at their disposal to train their models makes their NLP units the state-of-the-art technologies for detecting the user's intention at every conversation step.

4.5 The sentiment analysis module
Sentiment analysis is about determining whether a given text expresses positivity, neutrality or negativity. This module gets as input the user's speech and returns its polarity. The most advanced APIs on the market are Google Cloud Natural Language [15], Azure Text Analytics [23], Repustate [31], Text Processing [29] and TextRazor [35].

4.6 The emotional analysis module
Emotion analysis extracts the feelings expressed by a text using Natural Language Processing. Feelings are, in this case, the Big Six emotions (joy, sadness, fear, anger, surprise, disgust) and other cognitive states such as confidence, uncertainty and attention. This module gets as input the transcription of the user's speech. After analyzing it, the service returns a dictionary that maps the emotions and cognitive states to the probability that the author is expressing them. The most advanced APIs of this kind are IBM Tone Analyzer [17], Indico.io [21], Qemotion [30] and TheySay [36], and they all work with the English language.
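As an illustration, the dictionary returned by such a service and a simple way for the engine to consume it might look as follows; the probability values and the confidence threshold are made-up examples, not the output or the policy of any specific API.

```python
# Hypothetical output of a text-based emotion analysis service:
# probabilities for the Big Six emotions (values are made up).
analysis = {
    "joy": 0.08,
    "sadness": 0.71,
    "fear": 0.05,
    "anger": 0.09,
    "surprise": 0.04,
    "disgust": 0.03,
}

# A simple consumption policy: keep the most likely emotion only if the
# service is confident enough, otherwise fall back to "neutrality"
# (the 0.5 threshold is an arbitrary illustrative choice).
emotion, probability = max(analysis.items(), key=lambda item: item[1])
detected = emotion if probability >= 0.5 else "neutrality"
print(detected)  # -> "sadness"
```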
4.7 The topic analysis module
The topic analysis module uses machine learning and natural language processing to discover the abstract topics that occur in a written text. Topic modeling is a frequently used text-mining tool for discovering hidden semantic structures in a text body. Intuitively, given that a written text is about a particular topic, one would expect particular words to appear in it more or less frequently. At the time of writing, Google Cloud Natural Language [15], TextRazor [35] and MeaningCloud Topic Extraction [8] are three relevant and advanced APIs providing this kind of service.

4.8 The profiling module
The profiling module customizes the contents and the style of the conversation for each individual user. More in detail, it exploits relevant information about the user (personal details, preferences, needs, ...) to influence the best output to provide to the user and the best way to deliver it. To do so, it interacts with the persistence tier: it is in charge of storing this information and retrieving it when necessary. Relevant information can be obtained during previous conversations, providing the dialog system with episodic memory. Alternatively, it can be received through other channels, such as social networks, pre-configuration by the user, or third-party configuration.

4.9 The voice analysis module
This module is in charge of recognizing the emotion in a speaker's voice. It gets as input an audio recording containing human speech and returns the emotion perceived by analyzing the pitch and intonation of the voice. Emotional speech analysis based on the harmonic features of the audio is a relatively young research field, and there is much room for improvement. The most relevant APIs providing this kind of service are Good Vibrations [37], Affectiva [1] and Vokaturi [38].

4.10 The output creation module
This innovative module permits the customization of the output to be delivered to the user, in order to obtain a user-centered, content-driven conversational agent. In a first stage, it picks the best-fitting answer from a table of predefined templates according to:
• the state of the conversation,
• the detected intention of the user,
• the specific user,
• the sentiment detected from the semantics,
• the emotion recognized both from the text and from the pitch of the voice of the user's message.
In a second stage, it fills the selected template with the right content chosen from the contents table according to the result of the topic analysis and, again, to the user's intention. The final generated output is returned to the engine module. This kind of table-driven architecture permits easy updating and modification of the contents without touching the conversational structure.
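A minimal sketch of this two-stage, table-driven mechanism is shown below; the tables, their keys and the selection logic are invented for illustration and are far simpler than what a real deployment would require.

```python
# Stage 1 table: communicative templates, indexed here by (intent, emotion).
# Keeping the wording separate from the contents is what lets designers
# change what is said without touching the conversational structure.
TEMPLATES = {
    ("ask_story", "sadness"): "I am here with you. Would you like to hear a story about {topic}?",
    ("ask_story", "joy"):     "Great! Shall we read a story about {topic} together?",
}

# Stage 2 table: contents used to fill the templates, indexed by topic.
CONTENTS = {
    "animals": "a little elephant",
    "school":  "the first day of school",
}

def create_output(intent: str, emotion: str, topic: str) -> str:
    """Pick the best-fitting template, then fill it with the right content."""
    template = TEMPLATES.get((intent, emotion),
                             "I did not quite get that. Can you say it again?")
    return template.format(topic=CONTENTS.get(topic, "something you like"))

print(create_output("ask_story", "sadness", "animals"))
```

Updating what the agent can talk about then amounts to editing the contents table, while the communicative patterns stay untouched.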
4.11 The text-to-speech module
The text-to-speech module is responsible for human voice synthesis, that is, the artificial production of human speech. Compared to recorded human speech, the advantage of a synthesized voice is that its content can change and be customized at runtime. By playing with the voice features, the text-to-speech module may express emotions, moods and cognitive states. The quality of a speech synthesizer is judged by its similarity to the human voice and by its ability to be understood clearly. As for the speech-to-text module, some speech synthesizers permit "training", in which an individual speaker reads text into the system; the system then analyzes the person's specific voice and uses it to fine-tune the reproduction of that person's typical sounds. Many text-to-speech APIs are available on the market, such as Google Cloud Text-to-Speech [14], Amazon Polly [3], IBM Watson Text to Speech [20] and ResponsiveVoice [32]. They all provide female and male voices speaking different languages with different accents.

5 DEVELOPED CONVERSATIONAL AGENTS
The CORK framework is potentially valid in general. So far, it has been applied to develop conversational tools for people with neurodevelopmental disorders (NDD). NDD is a group of conditions characterized by severe deficits in the cognitive, emotional and motor areas that produce impairments of social functioning. Any kind of NDD is a chronic condition, but early and focused interventions are thought to mitigate its effects. As an experiment, we investigated the use of conversational technologies as a tool to lead persons with NDD to enhance their emotion expression and communication capabilities and to become more socially integrated.

5.1 Ele
As a first part of our work, we developed Ele (see Figure 5), an embodied Wizard-of-Oz dialog system with the appearance of an elephant. This robot is meant to be a conversational companion that speaks through the live voice of a remote human operator and can engage the user in dialogues and storytelling, enriching the communication with the toy's body movement effects. In return, the user can interact with the social robot verbally, by making facial expressions and by touch. This system has been used to detect some relevant conversational patterns for people with NDD: for up to one month we observed, with the help of an expert, the weekly sessions of a group of 11 adults with NDD aged 25 to 43 interacting with Ele. From this experience and from three weekly meeting sessions with two psychological specialists, it emerged that certain ways of speaking (with many repetitions, reassurances and continuous reinforcements) are the most effective for people with NDD.

Figure 5: Ele

5.2 Emoty
Once we had detected a considerable number of conversational patterns for people with neurodevelopmental disorders, we designed a tool for people affected by a specific disturbance called Alexithymia, that is, the inability to identify and express emotions. In particular, we developed Emoty, a dialog system playing the role of emotional facilitator and trainer by exploiting emotion detection capabilities on both text and audio data. Emoty proposes to the user some emotion expression tasks in the form of a game, organized in increasingly difficult levels. For example, the user may be asked to read an assigned sentence trying to express a given emotion with her/his tone of voice.

Figure 6: Emoty

The project has been designed in close collaboration with caregivers and psychologists, who actively participated in the following phases:
(1) eliciting the key requirements,
(2) evaluating iterative prototypes,
(3) performing an exploratory evaluation.
In addition to the verbal channel, the application exploits the visual one as a support to the player (see Figure 6). Specifically, the use of emojis facilitates the conceptualization of the emotions, and the emotion-color combination works as a reinforcement of the spoken feedback given by the system. This kind of matching is very common in the literature, and we decided to follow Plutchik's theory [28]. In his model, joy is represented as yellow, fear as dark green, surprise as cyan, sadness as blue, and anger as red. We associated neutrality with light grey, which is also used as the background color of the system.
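For reference, the emotion-to-color mapping described above can be written down as a small configuration table; the hex values are our own approximations of the named colors.

```python
# Emotion-to-color mapping following Plutchik's model, as used by Emoty.
# Hex values are approximations of the colors named in the text.
EMOTION_COLORS = {
    "joy":        "#FFD700",  # yellow
    "fear":       "#006400",  # dark green
    "surprise":   "#00FFFF",  # cyan
    "sadness":    "#0000FF",  # blue
    "anger":      "#FF0000",  # red
    "neutrality": "#D3D3D3",  # light grey, also the background color
}
```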
5.2.1 Architectural overview. The system follows the guidance of CORK, the conversational agent framework. It has been realized as a web application, because web apps are accessible to everybody via a browser on everyday devices and do not need any preliminary configuration. They permit both vocal and visual interaction with the user thanks to the screen, the microphone and the speakers of the executing device.
The engine module of the entire system lies in the Cloud and is accessible through serverless functions triggered via HTTPS requests. This guarantees fair pricing, a safe execution environment, and high scalability and availability. For every conversation step, the engine remotely calls Google Cloud Speech-to-Text to get a transcription of the user's speech and then Dialogflow to detect the user's intention according to the context. The semantic analysis of the input contents is delegated by our system to Indico.io. This external service returns a dictionary that maps anger, fear, joy, sadness and surprise to the probability with which the author is expressing each emotion. At the time of development, there was no emotion analysis API working with the Italian language, which forced us to translate our Italian inputs into English and then let them be processed by the emotion recognizer. The automatic translation may imply a loss of information to be taken into account; however, after a preliminary evaluation, the results are satisfactory.
The emotion recognition from the harmonic features of the audio is performed by an original machine learning model developed entirely by us. The process is organized in two subsequent steps: (1) feature extraction and (2) classification into emotions. The former extracts temporal and spectral characteristics of the audio signal, whereas the classification has been implemented using a supervised learning approach based on deep neural networks. In particular, a wide and deep (convolutional) neural network has been built to exploit the principle of temporal locality across audio signal partitions and increase the discriminatory strength of the model. In order to properly train the model, an open-source, free Italian dataset called Emovo [9] has been used. It is a corpus of recordings by 6 actors playing 14 different sentences while simulating the five emotion categories.
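The paper does not report the exact feature set or network layout, so the following sketch only illustrates the two-step structure, using librosa for MFCC-based feature extraction and a small Keras 1-D convolutional network as a stand-in classifier; the feature choice, network shape and all parameters are our assumptions, not the authors' actual "wide and deep" model.

```python
import numpy as np
import librosa
import tensorflow as tf

EMOTIONS = ["joy", "sadness", "fear", "anger", "surprise"]  # five categories, as in Emovo

def extract_features(path: str, n_mfcc: int = 13, frames: int = 128) -> np.ndarray:
    """Step 1: temporal/spectral characteristics (here, MFCCs padded to a fixed length)."""
    signal, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, time)
    mfcc = librosa.util.fix_length(mfcc, size=frames, axis=1)    # pad/trim the time axis
    return mfcc.T                                                # (frames, n_mfcc)

def build_classifier(frames: int = 128, n_mfcc: int = 13) -> tf.keras.Model:
    """Step 2: a small 1-D convolutional classifier over audio-signal partitions."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(frames, n_mfcc)),
        tf.keras.layers.Conv1D(32, kernel_size=5, activation="relu"),
        tf.keras.layers.GlobalMaxPooling1D(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(len(EMOTIONS), activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```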
The messages by the system are generated by the output creation module, which fills the conversational templates identified with Ele with the right contents depending on the context and on the interacting user. Information about the users is entered before the first session by the supporting caregiver: the name, the age and some parameters indicating the user's ability to express and recognize emotions. Finally, the human voice synthesis is performed by calling the Google Cloud Text-to-Speech API.

Figure 7: A user playing with the application and a drawing made by a user as a side activity

5.2.2 Exploratory study. Once ready, Emoty was field tested by people with NDD of mild or moderate severity, ranging from 16 to 60 years old. Our study is planned as a six-month evaluation organized in weekly sessions of 10-15 minutes integrated in the daily activities at a specialized care center. The experimental sessions focus on both:
• the improvement in performance (i.e., the completion of the activities) of users across sessions over time;
• the usability of the application.
So far, only three weeks of experimentation have been conducted. Already after this short time, according to the attending therapist, some users tend to be more eager to interact with our device than to speak to other people. We know that an evaluation based on three sessions is hardly statistically significant; therefore, for now, the caregivers' opinions are the most valuable feedback we can collect: they observed a growing awareness of feelings by the participants. The performance of three users across the three sessions is reported in Figure 8. These encouraging early results are a clue that the detected conversational patterns fit the target users' needs and that the conversational agent framework was well designed for developing complex, emotionally sensitive dialog systems (at least for this application domain).

Figure 8: Users' progress across the sessions playing with Emoty

6 CONCLUSION
This work proposes a solution to a common need we observed in the literature: the necessity of a structured, tested and simplified way to develop a conversational agent. In this paper we described CORK, a modular framework to facilitate and accelerate the realization and maintenance of intelligent Conversational Agents with both rational and emotional capabilities. Our framework has a client-server architecture exploiting the Model-View-Controller (MVC) pattern. In this way, it will be possible to easily integrate the realized CAs into digital devices such as tablets or smartphones, or to embed them in everyday physical objects (e.g., toys, home equipment).
In addition, the system is organized in independent, reusable software modules controlled by a centralized engine, such that each of them deals with a single functionality in the whole conversational process. The proposed components are the recorder, the engine, the speech-to-text module, the NLP unit and intent detector, the sentiment analysis module, the emotional analysis module, the topic analysis module, the voice analysis module, the profiling module, the output creation module and the text-to-speech module. Other modules executing additional elaborations can be integrated at will.
A central and innovative challenge faced by CORK is the separation of the content of the speech from its pure conversational scheme and rules. This makes it possible to define at design time some general communicative patterns to be filled at run time with different contents depending on the context.
To test the effectiveness of the proposed framework, we designed and developed, in cooperation with psychologists and therapists, an emotional facilitator and trainer called Emoty for individuals with neurodevelopmental disorders, as a conversational tool supporting regular interventions. Although Emoty is still a prototype, the initial results of our empirical study indicate that it can be easily used by therapists and persons with NDD and has some potential to mitigate Alexithymia and its effects. After only three weeks of practice with the emotional trainer, the caregivers observed a growing awareness by the users of their own feelings. These encouraging early results are a clue that the chosen conversational patterns fit the target users' needs and that the conversational agent framework was well designed for developing complex, emotionally sensitive dialog systems, both for this application domain and in general.
The natural follow-up of this project will be the development of a user-friendly platform, inspired by IBM Watson Assistant and Dialogflow, that permits the development of content-driven, emotionally sensitive Conversational Agents without any programming. In addition, we will try to improve the effectiveness of the framework by testing it with new dialog systems in new application domains. In parallel, we will work to improve the accuracy of the emotion recognition Machine Learning (ML) model. To do so, we will create and tag a larger, proprietary emotional dataset collecting speech-based conversations and dialogues with the collaboration of local theatre companies and acting schools.

REFERENCES
[1] Affectiva. 2018. Affectiva. https://www.affectiva.com
[2] Amazon. 2018. Alexa. https://developer.amazon.com/it/alexa
[3] Amazon. 2018. Amazon Polly. https://aws.amazon.com/polly/?nc1=f_ls
[4] Christian Becker, Stefan Kopp, and Ipke Wachsmuth. 2007. Why emotions should be integrated into conversational agents. Conversational informatics: an engineering approach (2007), 49–68.
[5] BusinessWire. 2018. Global Affective Computing Market 2018-2023. https://bit.ly/2Qv5kKX
[6] Justine Cassell, Alastair J Gill, and Paul A Tepper. 2007. Coordination in conversation and rapport. In Proceedings of the workshop on Embodied Language Processing. Association for Computational Linguistics, 41–50.
[7] Claude C Chibelushi and Fabrice Bourel. 2003. Facial expression recognition: A brief tutorial overview. CVonline: On-Line Compendium of Computer Vision 9 (2003).
[8] MeaningCloud. 2018. MeaningCloud Topic Extraction. https://www.meaningcloud.com/developer/topics-extraction
[9] Giovanni Costantini, Iacopo Iaderola, Andrea Paoloni, and Massimiliano Todisco. 2014. EMOVO corpus: an Italian emotional speech database. In International Conference on Language Resources and Evaluation (LREC 2014). European Language Resources Association (ELRA), 3501–3504.
[10] Richard S Croft. 2004. Communication theory. Eastern Oregon University, La Grande, OR (2004).
[11] Facebook. 2018. wit.ai. https://wit.ai
[12] Google. 2018. Dialogflow. https://dialogflow.com
[13] Google. 2018. Google Cloud Speech-to-Text. https://cloud.google.com/speech-to-text
[14] Google. 2018. Google Cloud Text-to-Speech. https://cloud.google.com/text-to-speech
[15] Google. 2018. Google Cloud Natural Language. https://cloud.google.com/natural-language
[16] Shang Guo, Jonathan Lenchner, Jonathan Connell, Mishal Dholakia, and Hidemasa Muta. 2017. Conversational bootstrapping and other tricks of a concierge robot. In Proceedings of the 2017 ACM/IEEE International Conference on Human-Robot Interaction. ACM, 73–81.
[17] IBM. 2018. IBM Tone Analyzer. https://www.ibm.com/watson/services/tone-analyzer
[18] IBM. 2018. IBM Watson Speech to Text. https://www.ibm.com/watson/services/speech-to-text
[19] IBM. 2018. Watson. https://www.ibm.com/watson
[20] IBM. 2018. Watson Text to Speech. https://www.ibm.com/watson/services/text-to-speech
[21] Indico.io. 2018. Indico.io. https://indico.io
[22] Jiwei Li, Michel Galley, Chris Brockett, Georgios P Spithourakis, Jianfeng Gao, and Bill Dolan. 2016. A persona-based neural conversation model. arXiv preprint arXiv:1603.06155 (2016).
[23] Microsoft. 2018. Azure Text Analytics. https://azure.microsoft.com/en-us/services/cognitive-services/text-analytics
[24] Microsoft. 2018. Microsoft Azure Bot Service. https://azure.microsoft.com/en-us/services/bot-service
[25] Microsoft. 2018. Microsoft Azure Speech to Text. https://azure.microsoft.com/en-us/services/cognitive-services/speech-to-text
[26] Karen O'Shea, Zuhair Bandar, and Keeley Crockett. 2010. A conversational agent framework using semantic analysis. International Journal of Intelligent Computing Research (IJICR) 1, 1/2 (2010).
[27] Rosalind Wright Picard et al. 1995. Affective computing. (1995).
[28] Robert Plutchik. 2001. The nature of emotions: Human emotions have deep evolutionary roots, a fact that may explain their complexity and provide tools for clinical practice. American Scientist 89, 4 (2001), 344–350.
[29] Text Processing. 2018. Text Processing. http://text-processing.com
[30] Qemotion. 2018. Qemotion. https://www.qemotion.com
[31] Repustate. 2018. Repustate. https://www.repustate.com
[32] ResponsiveVoice. 2018. ResponsiveVoice. https://responsivevoice.org
[33] Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C Courville, and Joelle Pineau. 2016. Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models. In AAAI, Vol. 16. 3776–3784.
[34] Heung-Yeung Shum, Xiao-dong He, and Di Li. 2018. From Eliza to XiaoIce: challenges and opportunities with social chatbots. Frontiers of Information Technology & Electronic Engineering 19, 1 (2018), 10–26.
[35] TextRazor. 2018. TextRazor. https://www.textrazor.com
[36] TheySay. 2018. TheySay. http://www.theysay.io
[37] Good Vibrations. 2018. Good Vibrations. https://www.good-vibrations.it
[38] Vokaturi. 2018. Vokaturi. https://vokaturi.com
[39] Joseph Weizenbaum. 1966. ELIZA - a computer program for the study of natural language communication between man and machine. Commun. ACM 9, 1 (1966), 36–45.