<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>IUI Workshops’19, March 20, 2019, Los Angeles, USA</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>CORK: A COnversational agent framewoRK exploiting both rational and emotional intelligence</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fabio Catania</string-name>
          <email>fabio.catania@polimi.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Davide Fisicaro</string-name>
          <email>davide.fisicaro@polimi.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Micol Spitale</string-name>
          <email>micol.spitale@polimi.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Franca Garzotto</string-name>
          <email>franca.garzotto@polimi.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>Politecnico di Milano</institution>
          ,
          <addr-line>Milano</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <volume>20</volume>
      <issue>2019</issue>
      <abstract>
        <p>This paper proposes CORK, a modular framework that facilitates and accelerates the realization and maintenance of intelligent Conversational Agents (CAs) with both rational and emotional capabilities. A smart CA can be integrated in any digital device and aims at interpreting the user's natural language input and at responding consistently with its semantics, the context, the user's perceived emotional state and her/his profile. CORK's strength is its ability to separate the content of the speech from the pure conversational scheme and rules: it lets developers identify at design time some general communicative patterns to be filled at run-time with the right content depending on the context. At first this can be tiresome and time-consuming, but it then yields high scalability and low-cost modification and maintenance. This framework is potentially valid in general and for now it has been effectively applied to develop conversational tools for people with neurodevelopmental disorders.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS CONCEPTS</title>
      <p>• Human-centered computing → HCI theory, concepts and
models; Empirical studies in HCI.</p>
    </sec>
    <sec id="sec-2">
      <title>INTRODUCTION</title>
      <p>A Conversational Agent (CA), or dialogue system, is a software
program able to interact through natural language. Voice-based CAs
and chatbots are progressively becoming embedded
into people's homes and daily routines: the best known are Apple's
Siri, Amazon's Alexa (over 8 million users in January 2017), Google
Assistant and IBM's conversational services.</p>
      <p>
        In typical human-human interaction, non-verbal information accounts
for about 93% of message perception [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and its
relevance has recently been widely investigated in human-computer
interfaces, too. Affective computing is the interdisciplinary research
field concerned with systems and devices that can recognize, interpret,
process and simulate human affects. It is currently one of the most
active research topics and is attracting increasingly intense
attention. According to BusinessWire [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], the global Affective
Computing market was valued at USD 16.17 billion in 2017 and is
expected to reach USD 88.69 billion by 2023.
Emotional Conversational Agents are at a very early stage, and
no generally adopted architectural framework has yet been proposed
to facilitate their implementation and fair comparison.
Designers of conversational technology know how tiring and
time-consuming it is to create and maintain a dialog system:
it is necessary to anticipate the countless ways that
a user may send input to the system and to couple them with the
best outputs. What is more, modifying the content of the
output implies operating directly on the conversation. The ideal
Conversational Agent should be able to adapt to different situations
and scenarios and to be easily correctable in terms of content.
In this respect, this work provides a clear contribution to advancing
the state of the art: we propose CORK, a modular framework to
facilitate and accelerate the realization and maintenance of
intelligent, system-initiative Conversational Agents with both rational
and emotional capabilities. A major, innovative challenge faced
by CORK is the separation of the content of the speech from the pure
conversational scheme and rules. This makes it possible to identify at design
time some general communicative patterns to be filled at run-time with
different contents depending on the context.
      </p>
      <p>This framework is potentially valid in general and for now it has
been effectively applied to develop conversational tools for people
with neurodevelopmental disorders (NDD). In particular, we
investigate the use of Emoty, a spoken Conversational Agent realized
with our framework, to mitigate these persons' difficulty
in recognizing and expressing emotions, a
problem clinically referred to as Alexithymia.</p>
    </sec>
    <sec id="sec-3">
      <title>STATE OF THE ART</title>
      <p>
        ELIZA [
        <xref ref-type="bibr" rid="ref39">39</xref>
        ], one of the earliest CAs, was capable of
creating the illusion that the system was actually listening to the
user, without even using any built-in framework for
contextualizing events, by means of a simple pattern matching and substitution
methodology.
      </p>
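      <p>The pattern matching and substitution methodology can be illustrated with a minimal, self-contained sketch; the rules below are invented for illustration and are not ELIZA's actual script:</p>

```python
import re

# Illustrative ELIZA-style rules: a pattern plus a response template.
# The captured fragment of the user's input is echoed back.
RULES = [
    (re.compile(r"i need (.*)", re.I), "Why do you need {0}?"),
    (re.compile(r"i am (.*)", re.I), "How long have you been {0}?"),
    (re.compile(r".*", re.I), "Please tell me more."),  # catch-all rule
]

def eliza_reply(utterance: str) -> str:
    """Return the response of the first matching rule, substituting
    the captured fragment into the template."""
    for pattern, template in RULES:
        match = pattern.match(utterance.strip())
        if match:
            groups = match.groups()
            return template.format(*groups) if groups else template
    return "Please tell me more."

print(eliza_reply("I need a holiday"))  # Why do you need a holiday?
```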
      <p>
        As technology has evolved, retrieval-based conversational agents
have become the most widely used. They get as input an assertion
and a context (the conversation up to that point); then, they use some
heuristic function to pick an appropriate response from a
predefined repository. The heuristic could be either a complex group
of Machine Learning (ML) classifiers or a simpler rule-based
expression match. This way, the set of possible responses is fixed, but at
least it is grammatically correct. The chatbots realized with the help
of Dialogflow [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], wit.ai [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], Alexa [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], Watson [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] and Azure
bot service [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] are all examples of rationally intelligent
retrieval-based conversational agents. The main disadvantage of this model
is that a potentially large number of rules and patterns is
required. Rule elicitation is undoubtedly a time-consuming, high-maintenance
task. Secondly, modifying one rule or introducing a
new one into the script invariably has an impact on the remaining
rules [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ].
      </p>
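      <p>A minimal sketch may clarify retrieval-based selection with a rule-based heuristic; the repository, trigger keywords and responses below are invented for illustration:</p>

```python
import re

# A minimal retrieval-based heuristic: score every canned response by
# how many of its trigger keywords appear in the user input plus the
# context, then return the best match. All rules here are invented.
REPOSITORY = [
    ({"hours", "open", "close"}, "We are open from 9am to 6pm."),
    ({"price", "cost", "much"}, "A ticket costs 10 euros."),
]
FALLBACK = "Sorry, I did not understand."

def select_response(utterance: str, context: str = "") -> str:
    # Tokenize input and context together, ignoring punctuation.
    tokens = set(re.findall(r"\w+", (utterance + " " + context).lower()))
    scored = [(len(triggers & tokens), answer) for triggers, answer in REPOSITORY]
    score, answer = max(scored)
    return answer if score > 0 else FALLBACK
```

Note how the set of possible responses is fixed in the repository: grammaticality is guaranteed, but every new behavior requires a new rule.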
      <p>
        On the other hand, generative models do not rely on pre-defined
responses but generate new answers from scratch using ML
techniques. Unfortunately, they do not yet work well,
because they tend to make grammatical mistakes
and to produce irrelevant, generic or inconsistent responses. In
addition, they need a huge amount of training data and are very
hard to optimize. Generative models represent an active area of
research [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ]. Much research is being focused on pursuing
socially and emotionally aware CAs [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ].
      </p>
      <p>
        Human-computer interaction can be more effective when it takes into
account the emotional state of the user. Cassell et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] argue that
embodied CAs should now be capable of developing deeper
relationships that make their collaboration with humans productive
and satisfying over long periods of time. Unfortunately, the basic
versions of both retrieval-based and generative models do not take
into account any emotional aspect of the conversation.
We studied the architecture of some affective conversational agents,
e.g. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ], and found that all of them use an upgrade of
the retrieval-based model that exploits the result of emotional analysis
as a criterion for choosing the final answers. We were surprised to
note that in the literature there is no generally used, shared framework
for the realization of emotionally aware dialog systems.
      </p>
    </sec>
    <sec id="sec-4">
      <title>THEORETICAL FUNDAMENTALS</title>
      <p>
        According to the American theorist and university professor David
Berlo [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], there are four main factors in the human
communication process:
• the Source, who is the sender of the message;
• the Message, which includes the content of the communication,
its structure, the context and the form in which the message is sent;
• the Channel, which is the medium used to send the message
(looking, listening, tasting, touching and smelling);
• the Receiver, who is the person who gets the message and
tries to understand what the sender wants to convey in order
to respond accordingly.
      </p>
      <p>In this paper, we promote the application of these concepts to
conversational technologies, and we assert that the basis of Berlo's
theory can be adapted to human-computer natural language
interaction, where the roles of the sender and the receiver are played
alternately by a human being and an intelligent machine.
Consider a system-initiative Conversational Agent, that is,
a dialog system that needs to completely control the flow of the
conversation, asking the user a series of questions and ignoring
(or misinterpreting) anything the user says that is not a direct
answer to the system's question. In this setting, every time the system
transmits a message to the user, it also attaches the context it refers
to. When the user answers, the message and the same context are
sent back to the system. At this point, the dialog system decodes
and elaborates the user's request (message plus context) and replies
consistently.</p>
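      <p>The message-plus-context exchange described above can be sketched as follows; the question script and context names are hypothetical:</p>

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Message:
    text: str
    context: str  # identifies which question this exchange refers to

# Illustrative script: each context names the question asked, and the
# table below tells the system which question comes next.
QUESTIONS = {"ask_name": "What is your name?", "ask_age": "How old are you?"}
NEXT_CONTEXT = {"ask_name": "ask_age", "ask_age": None}

def handle_user_reply(reply: Message) -> Optional[Message]:
    """Decode the user's request (message plus context) and produce the
    next system turn; None means the scripted dialog is over."""
    nxt = NEXT_CONTEXT.get(reply.context)
    if nxt is None:
        return None
    return Message(QUESTIONS[nxt], nxt)
```

Because the context travels with every message, the system never has to guess which of its own questions the user is answering.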
      <p>The exploited channel depends on the nature of the
conversational agent. Chatbots use written text as a conduit, while spoken
dialogue systems use the verbal channel and have two essential
components: a speech recognizer and a text-to-speech module. In
addition, Conversational Agents can take advantage of the visual
channel thanks to the use of a camera, and can be embodied by
avatars and robots that exploit the physical channel to understand
and communicate with the user.</p>
    </sec>
    <sec id="sec-5">
      <title>THE FRAMEWORK</title>
      <p>In order to hold a conversation, a dialog system must be able to
carry out three main phases:
• the decoding and understanding of the natural language
input thanks to the input recognizer;
• the execution of a consistent task and/or the generation of a
logical output;
• the rendering of the output.</p>
      <p>In designing this framework, we target a Spoken Conversational
Agent with emotional sensitivity. We propose a client-server
software architecture exploiting the Model-View-Controller (MVC)
architectural pattern. As is well known, it divides the software
application into three interconnected logical tiers, so that it separates
the internal representations of information from how information
is elaborated and from the ways it is presented to the user.
This choice lends flexibility, robustness and scalability to the
system. In addition, this way it will be possible to easily integrate
the realized CAs in digital devices such as tablets or smartphones,
or to embed them in everyday physical objects (e.g. toys, home
equipment).</p>
      <p>Each tier is described as follows:
• Client Tier: the topmost level of the application, kept
as thin as possible. Its only goal is to manage
the user's inputs and outputs during the whole
conversational session. This means that it records the user
while speaking, sends the audio to the server, waits for an answer
and produces an artificial human speech as output;
• Logic Tier: it processes the input message coming from the
client tier with the help of external services and interacts
with the persistence tier to select the best possible answer
for the user;
• Persistence Tier: it is in charge of storing and retrieving
information from the database (i.e., data about the users, the
preset output templates, the conversation contents, ...).</p>
      <p>For higher reusability and maintainability, we propose
a modular software architecture with independent blocks, such that
each contains everything necessary to execute only one aspect of
the main elaboration. The modules constituting the early version
of our architecture are the following:
• the recorder;
• the engine;
• the speech-to-text module;
• the NLP unit and intent detector;
• the sentiment analysis module;
• the emotional analysis module;
• the topic analysis module;
• the voice analysis module;
• the profiling module;
• the output creation module;
• the text-to-speech module.
Other modules executing additional elaborations can be
integrated at a later stage.</p>
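      <p>The interplay between the engine and the modules can be sketched as a pipeline of interchangeable functions; the stand-in module implementations below are purely illustrative, not CORK's actual code:</p>

```python
# Each module is a function from the evolving analysis state (a dict)
# to an updated state; the engine calls them in order. Real modules
# would call external services; these bodies are toy stand-ins.
def speech_to_text(state):
    state["text"] = state["audio"].strip()  # pretend transcription
    return state

def intent_detector(state):
    state["intent"] = "greet" if "hello" in state["text"].lower() else "unknown"
    return state

def output_creation(state):
    state["reply"] = "Hello!" if state["intent"] == "greet" else "Sorry?"
    return state

PIPELINE = [speech_to_text, intent_detector, output_creation]

def engine(audio: str, context: str) -> dict:
    state = {"audio": audio, "context": context}
    for module in PIPELINE:      # depending on the context, some modules
        state = module(state)    # could be skipped or run in parallel
    return state
```

Swapping, removing or adding a module only changes the pipeline list, which is the point of the modular design.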
      <p>Below is a detailed description of each component.</p>
    </sec>
    <sec id="sec-5a">
      <title>The recorder</title>
      <p>This module lies on the client and, as its name suggests, is
responsible for recording the user when she/he speaks and tries to interact
with the system. Beyond that, it is in charge of sending to the server,
in particular to the engine module, the recorded audio file and the
conversational context it refers to. The main challenge faced by
the recorder module is the identification of the instants when
to start and to stop the recording (which ideally correspond to
the moments when the user starts and finishes speaking). Optionally,
cutting-edge versions of this software component can run an early
processing of the audio data to clean it from background noise and perform
speaker diarisation, that is, the process of partitioning an
input audio stream into homogeneous segments according to
speaker identity.</p>
    </sec>
    <sec id="sec-5b">
      <title>The engine</title>
      <p>The logic and the control of the whole system lie in the engine
module. At every conversational step, it receives the message and the
related context from the client tier; depending on the exploited
channel, the message can be an audio recording, a written text, a
picture or a tap. Then the engine interprets and elaborates
the message by calling the other components one by one in order
to generate the best answer for the user; the processing may vary
with the context and the nature of the input message.
For example, some modules may be excluded from the process and
some elaborations may be parallelized. A basic elaboration flow is
represented in Figure 4. Finally, the generated output message and
context are sent back to the client tier.</p>
    </sec>
    <sec id="sec-5c">
      <title>The speech to text module</title>
      <p>
        The speech to text module gets as input a prerecorded audio file
containing spoken words. It applies advanced deep-learning
algorithms to recognize the speech, transcribe it into text and
return it as a string. Some speech
recognition systems require "training", where an individual speaker
reads text into the system. In this way, the system then analyzes
the person's specific voice and uses it to fine-tune the recognition
of that person's speech, resulting in increased accuracy. Especially
because of the large amount of data they have at their disposal
to train their models, the cutting-edge speech to text services are
Google Cloud Speech-to-Text [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], IBM Watson Speech to Text [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]
and Microsoft Azure Speech to Text [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]. They work for a wide
range of languages and are easily embeddable via API.
      </p>
    </sec>
    <sec id="sec-6">
      <title>The NLP unit and intent detector</title>
      <p>
        The Natural Language Processing Unit and Intent Detector is one of
the most relevant modules of the whole system. It receives as input
a string of text representing the transcription of the user’s speech
and the context it refers to. By exploiting domain knowledge and
natural language comprehension capabilities, it analyzes,
understands and returns the user’s intent. Intents are links between what
the user says and its interpretation by the bot. Contexts are useful
for differentiating requests that might have different meanings
depending on previous requests. The platforms Dialogflow [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ],
wit.ai [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], Alexa [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], Watson [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] and Azure bot service [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] use
an intent-based approach: they elaborate a reaction to the user's
intention detected by the NLP unit. Again, the large amount of data
they have at their disposal to train the models makes their NLP units
the state-of-the-art technologies for detecting the user's intention
at every conversation step.
      </p>
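      <p>A toy illustration of context-dependent intent detection; keyword rules stand in for the ML-based NLP units named above, and the intent names are invented:</p>

```python
# Intents link what the user says to an interpretation; the context
# disambiguates identical phrasings ("yes" means different things
# after different questions). Rules below are illustrative only.
INTENT_RULES = {
    ("booking", "yes"): "confirm_booking",
    ("cancellation", "yes"): "confirm_cancellation",
}

def detect_intent(utterance: str, context: str) -> str:
    text = utterance.strip().lower()
    for (ctx, keyword), intent in INTENT_RULES.items():
        if ctx == context and keyword in text:
            return intent
    return "fallback"
```

The same utterance maps to different intents depending on the context attached to it, which is exactly why the context is carried along with every message.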
    </sec>
    <sec id="sec-7">
      <title>The sentiment analysis module</title>
      <p>
        Sentiment analysis is the task of determining whether a given text
expresses positivity, neutrality or negativity. This module gets as
input the user's speech and returns its polarity. The
most advanced APIs on the market are Google Cloud
Natural Language [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], Azure Text Analytics [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], Repustate [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ], Text
processing [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ] and TextRazor [
        <xref ref-type="bibr" rid="ref35">35</xref>
        ].
      </p>
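      <p>As a minimal stand-in for these services, a lexicon-based sketch maps the balance of positive and negative words to a polarity; the word lists are illustrative:</p>

```python
# Count positive and negative cue words and map the balance to one of
# the three polarity classes returned by this module.
POSITIVE = {"good", "great", "happy", "love"}
NEGATIVE = {"bad", "sad", "hate", "awful"}

def polarity(text: str) -> str:
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```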
    </sec>
    <sec id="sec-8">
      <title>The emotional analysis module</title>
      <p>
        Emotion analysis extracts the feelings expressed by a text
using Natural Language Processing. Feelings are in this case the
Big Six emotions (Joy, Sadness, Fear, Anger, Surprise, Disgust) and
other cognitive states such as confidence, uncertainty and attention.
This module gets as input the transcription of the speech by the
user. After its examination, the service returns a dictionary that
maps the emotions and cognitive states to the probability that
the author is expressing them. The most advanced APIs of this kind
are IBM Tone Analyzer [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], Indico.io [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], QoEmotion [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ] and
TheySay [
        <xref ref-type="bibr" rid="ref36">36</xref>
        ] and they all work with the English language.
      </p>
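      <p>The shape of this module's output, a dictionary mapping each emotion to a probability, can be sketched with pseudo-probabilities derived from keyword counts; the keywords and scoring are illustrative and not how the cited services work:</p>

```python
# Derive a normalized emotion->probability dictionary from keyword
# matches; with no cue found, fall back to a uniform distribution.
EMOTION_KEYWORDS = {
    "joy": {"happy", "glad", "wonderful"},
    "sadness": {"sad", "cry", "miss"},
    "fear": {"afraid", "scared", "worry"},
    "anger": {"angry", "hate", "furious"},
    "surprise": {"wow", "unexpected", "suddenly"},
    "disgust": {"gross", "disgusting", "awful"},
}

def emotion_scores(text: str) -> dict:
    words = set(text.lower().split())
    counts = {emo: len(kws & words) for emo, kws in EMOTION_KEYWORDS.items()}
    total = sum(counts.values())
    if total == 0:
        return {emo: 1 / len(counts) for emo in counts}
    return {emo: c / total for emo, c in counts.items()}
```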
    </sec>
    <sec id="sec-9">
      <title>The topic analysis module</title>
      <p>
        The topic analysis module uses machine learning and natural
language processing to discover the abstract topics that occur in a
written work. Topic modeling is a frequently used text-mining tool
for discovery of hidden semantic structures in a text body.
Intuitively, given that a written text is about a particular topic, one
would expect particular words to appear in the text more or less
frequently. At the moment of writing, Google Cloud Natural
Language [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], TextRazor [
        <xref ref-type="bibr" rid="ref35">35</xref>
        ] and Meaning Cloud Topic Extraction [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
are three relevant and advanced APIs providing this kind of service.
      </p>
    </sec>
    <sec id="sec-10">
      <title>The profiling module</title>
      <p>The profiling module customizes the contents and the style of the
conversation for every single user. More in detail, it
exploits relevant info about the user (generalities, preferences, needs,
...) to determine the best output to provide to the user and the
best way to deliver it. To do so, it interacts with the persistence tier: it
is in charge of storing the info and of retrieving it when necessary.
Relevant info can be obtained during previous conversations,
providing the dialog system with episodic memory. Alternatively, it
can be received through other channels, such as social networks,
pre-configuration by the user, or third-party configuration.</p>
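      <p>A minimal sketch of the profiling module's store-and-personalize behavior, with a plain dict standing in for the persistence tier:</p>

```python
# Store per-user info and use it to adapt an output template.
# The storage dict stands in for the database in the persistence tier.
class ProfilingModule:
    def __init__(self):
        self._store = {}

    def update(self, user_id: str, **info):
        """Merge new info (e.g. from a past conversation or a social
        network) into the user's profile."""
        self._store.setdefault(user_id, {}).update(info)

    def personalize(self, user_id: str, template: str) -> str:
        """Fill a template with profile data, with a neutral fallback
        for unknown users."""
        profile = self._store.get(user_id, {})
        return template.format(name=profile.get("name", "there"))
```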
    </sec>
    <sec id="sec-11">
      <title>The voice analysis module</title>
      <p>
        This module is in charge of understanding the emotion in a speaker's
voice. It gets as input an audio recording containing human speech
and returns the emotion perceived by analyzing the pitch and
intonation of the voice. Emotional speech analysis starting from
the harmonic features of the audio is a relatively new research field
with much room for improvement. The most relevant APIs
providing this kind of service are Good Vibrations [
        <xref ref-type="bibr" rid="ref37">37</xref>
        ], Affectiva
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and Vokaturi [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ].
      </p>
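      <p>Two simple prosodic descriptors often used as a starting point in this field, short-time energy and zero-crossing rate (a rough pitch correlate), can be computed as follows on a synthetic tone:</p>

```python
import math

def energy(samples):
    """Average signal power, a loudness correlate."""
    return sum(s * s for s in samples) / len(samples)

def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ; for a
    pure tone this grows with the frequency (pitch)."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(samples) - 1)

# One second of a 440 Hz tone sampled at 16 kHz: it crosses zero
# twice per cycle, so the rate is close to 2 * 440 / 16000.
sr = 16000
tone = [math.sin(2 * math.pi * 440 * n / sr) for n in range(sr)]
```

A real module would feed many such features into a trained classifier; this only illustrates the kind of harmonic features involved.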
    </sec>
    <sec id="sec-12">
      <title>The output creation module</title>
      <p>This innovative module permits the customization of the output to
be delivered to the user in order to obtain a user-centered, content-driven
conversational agent. In a first stage, it picks the best
fitting answer from a table of anticipated templates according to
• the state of the conversation,
• the detected intention of the user,
• the current user,
• the sentiment detected from the semantics,
• the emotion recognized both from the text and from the
pitch of the voice of the user's message.</p>
      <p>In a second stage, it fills the selected template with the right
content chosen from the contents table according to the result
of the topic analysis and, again, to the user’s intention. The final
generated output is returned to the engine module. This kind of
table-driven architecture permits the easy update and modification
of the contents without even touching the conversational structure.
</p>
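      <p>The two-stage, table-driven mechanism can be sketched as follows; the template and content tables and their keys are invented for illustration:</p>

```python
# Stage 1: pick a template keyed by (conversation state, intent,
# detected emotion). Stage 2: fill it with content keyed by (topic,
# intent). Both tables are illustrative placeholders.
TEMPLATES = {
    ("quiz", "answer", "joy"): "Great job! Now, {content}",
    ("quiz", "answer", "sadness"): "Don't worry, let's retry: {content}",
}
CONTENTS = {
    ("emotions", "answer"): "try to say this sentence with a happy voice.",
}

def create_output(state, intent, emotion, topic):
    template = TEMPLATES.get((state, intent, emotion), "Let's continue. {content}")
    content = CONTENTS.get((topic, intent), "let's move on.")
    return template.format(content=content)
```

Because the conversational scheme lives in the templates table and the content lives in a separate table, either can be updated without touching the other.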
    </sec>
    <sec id="sec-13">
      <title>The text to speech module</title>
      <p>
        The text to speech module is responsible for the human voice
synthesis, that is, the artificial production of human speech. Compared
to recorded human speech, the advantage of synthesized voice is
that its content can change and be customized at runtime. By
adjusting the voice features, the text to speech module can
express emotions, moods and cognitive states. The quality of a
speech synthesizer is judged by its similarity to the human voice
and by its ability to be understood clearly. As for the speech to text
module, some speech synthesizers permit "training", where an
individual speaker reads text into the system; the system then
analyzes the person's specific voice and uses it to fine-tune
the reproduction of that person's typical sounds. There exist many
text to speech APIs available on the market, such as Google Cloud
Text to Speech [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], Amazon Polly [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], IBM Watson Text to Speech
[
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] and ResponsiveVoice [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ]. They all provide female and male
voices speaking different languages and with different accents.
      </p>
    </sec>
    <sec id="sec-14">
      <title>DEVELOPED CONVERSATIONAL AGENTS</title>
      <p>The CORK framework is potentially valid in general. For now, it
has been applied to develop conversational tools for people with
neurodevelopmental disorders (NDD). NDD is a group of conditions
characterized by severe deficits in the cognitive, emotional
and motor areas that produce impairments of social functioning.
Any kind of NDD is a chronic condition, but early and focused
interventions are thought to mitigate its effects. As an experiment, we
investigated the use of conversational technologies as a tool to lead
persons with NDD to enhance their emotion expression
and communication capabilities and to become more socially integrated.
</p>
      <p>Ele. As a first part of our work, we developed Ele (see Figure 5), an
embodied Wizard-of-Oz dialog system with the aspect of an elephant.
This robot is intended to be a conversational companion that speaks
through the live voice of a remote human operator and can engage
the user in dialogues and storytelling, enriching the
communication with the toy's body movement effects. In return, the user can
interact with the social robot verbally, through facial expressions
and by touch. This system has been used to detect some relevant
conversational patterns for people with NDD: for up to one month
we observed, with the help of an expert, the weekly sessions of a
group of 11 NDD adults aged 25 to 43 interacting with Ele. From
this experience and three weekly meeting sessions with two
psychological specialists, it turned out that some ways of speaking (with
many repetitions, reassurances and continuous reinforcements)
are the most effective for people with NDD.</p>
      <p>Emoty. Once we had detected a considerable number of conversational
patterns for people with neurodevelopmental disorders, we designed a tool
for people affected by a specific disorder called Alexithymia, that
is, the inability to identify and express emotions. In particular, we
developed Emoty, a Dialog System playing the role of emotional
facilitator and trainer by exploiting emotion detection capabilities
from both text and audio data. Emoty proposes to the user some
emotion expression tasks in the form of a game, organized in
increasingly difficult levels. For example, the user could be asked to
read an assigned sentence trying to express a given emotion with
her/his tone of voice.</p>
      <p>The project has been designed in close collaboration with
caregivers and psychologists who actively participated in the following
phases:
(1) eliciting the key requirements,
(2) evaluating iterative prototypes,
(3) performing an exploratory evaluation.</p>
      <p>
        In addition to the verbal channel, the application exploits the visual
one as a support to the player (see Figure 6). Specifically, the use
of emojis facilitates the conceptualization of the emotions, and the
emotion-color combination works as a reinforcement to the spoken
feedback by the system. This kind of matching is very common in
the literature, and we decided to follow Plutchik's theory [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ]. In his
model, joy is represented as yellow, fear is dark green, surprise is
cyan, sadness is blue, and anger is red. We associated neutrality with
light grey, used as background for the system as well.
      </p>
      <p>Architectural overview. The system follows the guidance of
CORK, the conversational agent framework. It has been realized as
a web application because web apps are accessible to everybody via
browser on everyday devices and do not need any
preliminary configuration. They permit both vocal and visual interaction
with the user thanks to the screen, the microphone and the speakers
of the executing device.</p>
      <p>The engine module of the entire system lies on the Cloud and is
accessible through serverless functions to be triggered via HTTPS
requests. This guarantees fair pricing, a safe execution environment
and high-level scalability and availability.</p>
      <p>For every conversation step, the engine remotely calls Google Cloud
Speech-to-Text to get a transcription of the user’s speech and then
Dialogflow to detect the user’s intention according to the context.
The semantic analysis of the input contents is delegated by our
system to indico.io. This external service returns a dictionary that
maps anger, fear, joy, sadness and surprise to the probability with
which the author is expressing each emotion. At the time of
development, no emotion analysis API worked with the Italian
language, which constrained us to translate our Italian inputs into
English and then have them processed by the emotion
recognizer. The automatic translation may imply a loss of information;
however, after a preliminary evaluation, the results
are satisfactory.</p>
      <p>The emotion recognition from the harmonic features of the audio
is performed by an original machine learning model developed entirely by us.
The process is organized in two subsequent steps:
(1) feature extraction,
(2) classification into emotions.</p>
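      <p>The two steps can be sketched end to end with a tiny feature vector and a nearest-centroid rule standing in for the deep neural network; the centroid values are made up for illustration:</p>

```python
# Step 1: extract a tiny feature vector (mean amplitude, zero-crossing
# rate) from the signal. Step 2: classify it with a nearest-centroid
# rule. The centroids are invented; the paper's model is a wide and
# deep convolutional neural network trained on Emovo.
def features(samples):
    mean_abs = sum(abs(s) for s in samples) / len(samples)
    zcr = sum(
        (a >= 0) != (b >= 0) for a, b in zip(samples, samples[1:])
    ) / (len(samples) - 1)
    return (mean_abs, zcr)

CENTROIDS = {"anger": (0.8, 0.30), "sadness": (0.2, 0.05), "joy": (0.6, 0.20)}

def classify(samples):
    f = features(samples)
    return min(
        CENTROIDS,
        key=lambda emo: sum((a - b) ** 2 for a, b in zip(f, CENTROIDS[emo])),
    )
```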
      <p>
        The former extracts temporal and spectral characteristics of the
audio signal; whereas, the classification has been implemented
using a supervised learning approach based on deep neural networks.
In particular, a wide and deep (convolutional) neural network has
been built to exploit the principle of temporal locality across audio
signal partitions and increase the discriminatory strength of the
model. In order to properly train the model, an open source and
free Italian dataset called Emovo [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] has been used. It is a corpus
of recordings by 6 actors playing 14 different sentences simulating
the five emotion categories.
      </p>
      <p>The messages by the system are generated by the output creation
module that fills the conversational templates detected with Ele
with the right contents depending on the context and on the
interacting user. Information about the users is inserted before the
first session by the supporting caregiver: the name, the age, and
some parameters indicating the user’s ability to express and
recognize emotions.</p>
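      <p>A possible shape for this template-filling step is sketched below. The field names and templates are purely illustrative, assuming the profile data entered by the caregiver is stored as a simple dictionary.</p>

```python
# Illustrative sketch of the output creation step: a conversational
# template is filled at run time with profile data entered by the
# caregiver before the first session. Field names are hypothetical.
profile = {"name": "Anna", "age": 16,
           "expression_ability": "medium", "recognition_ability": "low"}

def fill_template(template, profile, **context):
    """Fill a conversational template with profile and context values."""
    return template.format(**profile, **context)

greeting = fill_template("Ciao {name}! Come ti senti oggi?", profile)
prompt = fill_template("Prova a dire una frase con {emotion}.",
                       profile, emotion="gioia")
print(greeting)  # Ciao Anna! Come ti senti oggi?
```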
      <p>Finally, the human voice synthesis is done by calling the Google
Cloud Text-to-Speech API.
5.2.2 Exploratory study. Once ready, Emoty was field tested
by people with NDD of mild or moderate severity, ranging from
16 to 60 years old. Our study was planned as a six-month
evaluation organized in weekly scheduled sessions of 10-15
minutes integrated into daily activities at a specialized care center.
Experimental sessions focus on both
• the improvement in performance (i.e., the completion of
the activities) of users across sessions over time;
• the usability of the application.</p>
      <p>For now, only three weeks of experimentation have been
conducted. Even after this short time, according to the attending
therapist, some users tend to be more eager to interact with our
device than to speak to other people. We know that an
evaluation based on three sessions is hardly statistically significant;
therefore, for now, the caregivers’ opinions are the most valuable
feedback we can collect: they observed a growing awareness of
feelings by the participants. Nevertheless, the graphical
representation of the performance of three users across the three
sessions is reported in Figure 8.</p>
      <p>These encouraging early results suggest that the detected
conversational patterns fit the target users’ needs and that the
conversational agent framework was well designed to develop
complex emotionally sensitive dialog systems (at least for this
application domain).</p>
    </sec>
    <sec id="sec-15">
      <title>CONCLUSION</title>
      <p>This work proposes a solution to a common need we observed in
the literature: the necessity of a structured, tested, and simplified
way to develop a conversational agent.</p>
      <p>In this paper we described CORK, a modular framework to
facilitate and accelerate the realization and maintenance of
intelligent Conversational Agents with both rational and emotional
capabilities.
Our framework has a client-server architecture exploiting the
Model-View-Controller (MVC) pattern. This way, the realized CAs
can easily be integrated into digital devices such as tablets or
smartphones, or embedded in everyday physical objects (e.g., toys,
home equipment). In addition, the system is organized into
independent, reusable software modules controlled by a centralized
engine, such that each of them deals with a single functionality in
the whole conversational process. The proposed components are
the recorder, the engine, the speech-to-text module, the NLP unit
and intent detector, the sentiment analysis module, the emotional
analysis module, the topic analysis module, the profiling module,
the output creation module, and the text-to-speech module. Other
modules executing additional elaborations can be integrated at will.
A big and innovative challenge faced by CORK is the separation of
the content of the speech from its pure conversational scheme and
rules. This permits detecting at design time some general
communicative patterns to be filled at run time with different
contents depending on the context.</p>
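      <p>The modular organization described above can be sketched as a centralized engine dispatching a shared payload through an ordered list of modules. This is an illustration of the design principle, not CORK's actual API; the module names and payload shape are assumptions.</p>

```python
# Illustrative sketch of a centralized engine coordinating independent,
# reusable modules, each handling one step of the conversational process.
class Engine:
    def __init__(self):
        self.modules = []  # ordered processing pipeline

    def register(self, name, fn):
        self.modules.append((name, fn))

    def run(self, payload):
        # Each module reads the shared payload dict and returns an
        # enriched copy for the next module in the pipeline.
        for name, fn in self.modules:
            payload = fn(payload)
        return payload

engine = Engine()
engine.register("speech_to_text", lambda p: {**p, "text": "hello"})
engine.register("intent", lambda p: {**p, "intent": "greeting"})
engine.register("output", lambda p: {**p, "reply": f"Hi! ({p['intent']})"})

result = engine.run({"audio": b"..."})
print(result["reply"])  # Hi! (greeting)
```

Because each module only touches the shared payload, additional elaborations (e.g., a sentiment module) can be registered without changing the others.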
      <p>To test the effectiveness of the proposed framework, we designed
and developed, in cooperation with psychologists and therapists,
an emotional facilitator and trainer called Emoty for individuals
with neurodevelopmental disorders, as a conversational supporting
tool for regular interventions. Although Emoty is still a prototype,
the initial results of our empirical study indicate that Emoty can
be easily used by therapists and persons with NDD and has some
potential to mitigate Alexithymia and its effects. After only three
weeks of practice with the emotional trainer, the caregivers
observed a growing awareness by the users of their own feelings.
These encouraging early results suggest that the chosen
conversational patterns fit the target users’ needs and that the
conversational agent framework was well designed to develop
complex emotionally sensitive dialog systems, for this application
domain and possibly beyond.</p>
      <p>The natural follow-up of this project will be the development of a
user-friendly platform, inspired by IBM Watson Assistant and
Dialogflow, that permits the development of content-driven
emotionally sensitive Conversational Agents without any
programming. In addition, we will try to improve the effectiveness of the
framework by testing it with new dialog systems in new application
domains. In parallel, we will work to improve the accuracy of the
emotion recognition Machine Learning (ML) model. To do so, we
will create and tag a larger, proprietary emotional dataset of
speech-based conversations and dialogues with the collaboration of
local theatre companies and acting schools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Affectiva</surname>
          </string-name>
          .
          <year>2018</year>
          . Affectiva. https://www.affectiva.com
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Amazon</surname>
          </string-name>
          .
          <year>2018</year>
          . Alexa. https://developer.amazon.com/it/alexa
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Amazon</surname>
          </string-name>
          .
          <year>2018</year>
          . Amazon Polly. https://aws.amazon.com/polly/?nc1=f_ls
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Christian</given-names>
            <surname>Becker</surname>
          </string-name>
          , Stefan Kopp, and
          <string-name>
            <given-names>Ipke</given-names>
            <surname>Wachsmuth</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Why emotions should be integrated into conversational agents</article-title>
          .
          <source>Conversational informatics: an engineering approach</source>
          (
          <year>2007</year>
          ),
          <fpage>49</fpage>
          -
          <lpage>68</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>BusinessWire</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Global Affective Computing Market 2018-2023</article-title>
          . https://bit.ly/2Qv5kKX
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Justine</given-names>
            <surname>Cassell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Alastair J</given-names>
            <surname>Gill</surname>
          </string-name>
          , and Paul A Tepper.
          <year>2007</year>
          .
          <article-title>Coordination in conversation and rapport</article-title>
          .
          <source>In Proceedings of the workshop on Embodied Language Processing. Association for Computational Linguistics</source>
          ,
          <fpage>41</fpage>
          -
          <lpage>50</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Claude C</given-names>
            <surname>Chibelushi</surname>
          </string-name>
          and
          <string-name>
            <given-names>Fabrice</given-names>
            <surname>Bourel</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Facial expression recognition: A brief tutorial overview</article-title>
          .
          <source>CVonline: On-Line Compendium of Computer Vision</source>
          <volume>9</volume>
          (
          <year>2003</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>MeaningCloud</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>MeaningCloud Topic Extraction</article-title>
          . https://www.meaningcloud.com/developer/topics-extraction
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Giovanni</given-names>
            <surname>Costantini</surname>
          </string-name>
          , Iacopo Iaderola, Andrea Paoloni, and
          <string-name>
            <given-names>Massimiliano</given-names>
            <surname>Todisco</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Emovo corpus: an italian emotional speech database</article-title>
          .
          <source>In International Conference on Language Resources and Evaluation (LREC 2014)</source>
          .
          <source>European Language Resources Association (ELRA)</source>
          ,
          <fpage>3501</fpage>
          -
          <lpage>3504</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Richard S</given-names>
            <surname>Croft</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>Communication theory</article-title>
          . Eastern Oregon University, La Grande, OR (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Facebook</surname>
          </string-name>
          .
          <year>2018</year>
          . wit.ai. https://wit.ai
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Google</surname>
          </string-name>
          .
          <year>2018</year>
          . Dialogflow. https://dialogflow.com
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Google</surname>
          </string-name>
          .
          <year>2018</year>
          .
          Google Cloud Speech-to-Text. https://cloud.google.com/speech-to-text
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Google</surname>
          </string-name>
          .
          <year>2018</year>
          .
          Google Cloud Text-To-Speech. https://cloud.google.com/text-to-speech
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Google</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Google Cloud - Natural Language</article-title>
          . https://cloud.google.com/natural-language
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Shang</given-names>
            <surname>Guo</surname>
          </string-name>
          , Jonathan Lenchner,
          <string-name>
            <given-names>Jonathan</given-names>
            <surname>Connell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mishal</given-names>
            <surname>Dholakia</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Hidemasa</given-names>
            <surname>Muta</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Conversational bootstrapping and other tricks of a concierge robot</article-title>
          .
          <source>In Proceedings of the 2017 ACM/IEEE International Conference on Human-Robot Interaction. ACM</source>
          ,
          <fpage>73</fpage>
          -
          <lpage>81</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>IBM</surname>
          </string-name>
          .
          <year>2018</year>
          . IBM Tone Analyzer. https://www.ibm.com/watson/services/tone-analyzer
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>IBM</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>IBM Watson Speech to Text</article-title>
          . https://www.ibm.com/watson/services/speech-to-text
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>IBM</surname>
          </string-name>
          .
          <year>2018</year>
          . Watson. https://www.ibm.com/watson
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>IBM</surname>
          </string-name>
          .
          <year>2018</year>
          . Watson Text to Speech. https://www.ibm.com/watson/services/text-to-speech
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Indico.io</surname>
          </string-name>
          .
          <year>2018</year>
          . Indico.io. https://indico.io
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Jiwei</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michel</given-names>
            <surname>Galley</surname>
          </string-name>
          , Chris Brockett, Georgios P Spithourakis,
          <string-name>
            <given-names>Jianfeng</given-names>
            <surname>Gao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Bill</given-names>
            <surname>Dolan</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>A persona-based neural conversation model</article-title>
          .
          <source>arXiv preprint arXiv:1603.06155</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <surname>Microsoft</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Azure - Text Analytics</article-title>
          . https://azure.microsoft.com/en-us/services/cognitive-services/text-analytics
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <surname>Microsoft</surname>
          </string-name>
          .
          <year>2018</year>
          . Microsoft Azure. https://azure.microsoft.com/en-us/services/bot-service
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <surname>Microsoft</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Microsoft Azure Speech to Text</article-title>
          . https://azure.microsoft.com/en-us/services/cognitive-services/speech-to-text
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Karen</given-names>
            <surname>O'Shea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Zuhair</given-names>
            <surname>Bandar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Keeley</given-names>
            <surname>Crockett</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>A conversational agent framework using semantic analysis</article-title>
          .
          <source>International Journal of Intelligent Computing Research (IJICR)</source>
          <volume>1</volume>
          ,
          <issue>1/2</issue>
          (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Rosalind W</given-names>
            <surname>Picard</surname>
          </string-name>
          et al.
          <year>1995</year>
          .
          <article-title>Affective computing</article-title>
          . (
          <year>1995</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Robert</given-names>
            <surname>Plutchik</surname>
          </string-name>
          .
          <year>2001</year>
          .
          <article-title>The nature of emotions: Human emotions have deep evolutionary roots, a fact that may explain their complexity and provide tools for clinical practice</article-title>
          .
          <source>American Scientist</source>
          <volume>89</volume>
          ,
          <issue>4</issue>
          (
          <year>2001</year>
          ),
          <fpage>344</fpage>
          -
          <lpage>350</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <surname>Text Processing</surname>
          </string-name>
          <year>2018</year>
          .
          <article-title>Text processing</article-title>
          . http://text-processing.com
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <surname>Qemotion</surname>
          </string-name>
          .
          <year>2018</year>
          . Qemotion. https://www.qemotion.com
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <surname>Repustate</surname>
          </string-name>
          .
          <year>2018</year>
          . Repustate. https://www.repustate.com
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <surname>ResponsiveVoice</surname>
          </string-name>
          .
          <year>2018</year>
          . ResponsiveVoice. https://responsivevoice.org
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>Iulian Vlad</given-names>
            <surname>Serban</surname>
          </string-name>
          , Alessandro Sordoni, Yoshua Bengio, Aaron C Courville, and
          <string-name>
            <given-names>Joelle</given-names>
            <surname>Pineau</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models</article-title>
          .
          <source>In AAAI</source>
          , Vol.
          <volume>16</volume>
          .
          <fpage>3776</fpage>
          -
          <lpage>3784</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>Heung-Yeung</given-names>
            <surname>Shum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Xiao-dong</given-names>
            <surname>He</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Di</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>From Eliza to XiaoIce: challenges and opportunities with social chatbots</article-title>
          .
          <source>Frontiers of Information Technology &amp; Electronic Engineering</source>
          <volume>19</volume>
          ,
          <issue>1</issue>
          (
          <year>2018</year>
          ),
          <fpage>10</fpage>
          -
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <surname>TextRazor</surname>
          </string-name>
          .
          <year>2018</year>
          . TextRazor. https://www.textrazor.com
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <surname>TheySay</surname>
          </string-name>
          .
          <year>2018</year>
          . TheySay. http://www.theysay.io
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <surname>Good Vibrations</surname>
          </string-name>
          .
          <year>2018</year>
          . Good Vibrations. https://www.good-vibrations.it
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <surname>Vokaturi</surname>
          </string-name>
          .
          <year>2018</year>
          . Vokaturi. https://vokaturi.com
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>Joseph</given-names>
            <surname>Weizenbaum</surname>
          </string-name>
          .
          <year>1966</year>
          .
          <article-title>ELIZA - a computer program for the study of natural language communication between man and machine</article-title>
          .
          <source>Commun. ACM 9</source>
          ,
          <issue>1</issue>
          (
          <year>1966</year>
          ),
          <fpage>36</fpage>
          -
          <lpage>45</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>