<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>A Cloud Architecture for Emotion Recognition Based on the Appraisal Theory</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marco Demutti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vincenzo D'Amato</string-name>
          <email>vincenzostefano.damato@edu.unige.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carmine Tommaso Recchiuto</string-name>
          <email>carmine.recchiuto@dibris.unige.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Oneto</string-name>
          <email>luca.oneto@unige.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonio Sgorbissa</string-name>
          <email>antonio.sgorbissa@unige.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DIBRIS, Università di Genova</institution>
          ,
          <addr-line>Via all'Opera Pia 13, 16145, Genova</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <abstract>
        <p>Designing robots with the ability to infer a person's emotional state represents one of the major challenges in social robotics. This work proposes a cloud system for online human emotion recognition in spontaneous human-robot verbal interaction, structured as a set of REST API endpoints. Based on the appraisal theory of emotion, the system acquires data about the person's expected appraisal of a given situation, depending on their needs and goals, and combines it with sensory data, such as the person's facial expressions, head and gaze angles, and their distance from the robot. The whole set of data is used to infer the person's emotional state during the interaction through a Random Forest classifier, trained for binary classification (i.e., positive vs. negative emotions). Results confirmed that using both data sources improved performance in both the K-fold and the Leave One Person Out scenarios.</p>
      </abstract>
      <kwd-group>
        <kwd>Human-robot interaction</kwd>
        <kwd>social robotics</kwd>
        <kwd>cloud robotics</kwd>
        <kwd>REST API</kwd>
        <kwd>emotion recognition</kwd>
        <kwd>appraisal theory</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Providing a natural, genuine, and effective human-robot interaction (HRI) represents one of
the major and most fascinating challenges in Social Robotics [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The most crucial skill that confers
naturalness on interactions between humans is our ability to infer the emotional states of others
based on non-verbal signals, such as facial expressions, voice, body posture, and movement [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
This ability allows us to adjust our social behaviors and communication patterns to optimize the
interaction. A social robot with the same ability would reliably adapt to changes in its partners’
behavior and earn their trust during the interaction.
      </p>
      <p>
        The complexity and variability of emotions make emotion recognition a challenging task,
especially when performed in a natural and spontaneous HRI context, where conditions may
diverge from the controlled environment where most experiments are carried out [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>The vast literature on emotion recognition covers (i) the problem of emotion classification,
discussing how emotions should be represented, and (ii) the choice of the most informative
non-verbal signals for the robot to acquire and interpret. In other words, the two main research
topics establish the output and the input of the classification process, respectively.</p>
      <p>
        Narrowing it down to social robotics, most previous studies performed emotion recognition
by combining multiple sensory modalities, such as facial expression, body posture, and speech,
feeding them to black-box models to predict an emotion from a list of possible labels [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6</xref>
        ].
      </p>
      <p>
        This approach only considers emotional expressions, as people reveal them to the outside
world, voluntarily or not [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. However, emotional expressions do not necessarily reflect the
person’s emotional state [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], which can be expressed in different ways, or even hidden, depending on several
individual factors. For this reason, these techniques lead to good classification
performance in acted situations but are less suitable in actual HRI scenarios.
      </p>
      <p>This work proposes a novel emotion recognition framework to assess the person’s emotional
state during a dyadic autonomous HRI.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Emotion Recognition Through Cognitive Appraisal</title>
      <p>
        The proposed emotion recognition framework is based on an implementation of the appraisal
theory of emotion [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. According to the theory, emotions result from a two-stage individual
evaluation (the so-called appraisal) of a person’s situation. In the primary appraisal, the person
evaluates the situation in terms of its relevance to their needs and congruence with their goals.
The secondary appraisal mainly concerns the person’s possibilities to cope with the situation.
      </p>
      <p>
        The appraisal theory of emotion has frequently been employed to address the dual problem of
generating and expressing emotions in artificial agents, rather than recognizing them in
individuals [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]: Kismet [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] represents one of the
most significant studies in the field. However, the fact that the theory arose to predict human
emotions supports extending its principles to emotion recognition.
      </p>
      <p>
        In a previous study [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], we developed and used the emotion recognition framework to collect
data from participants who had a spontaneous and autonomous verbal interaction with the
humanoid robot Pepper, programmed to elicit different emotions at various moments of the
conversation. Figure 1 shows the experimental setup (a video showing participants during the
experiment can be found at https://youtu.be/73ecZZWgG0k).
      </p>
      <p>Throughout the interaction, we combined information about the person's appraisal with
state-of-the-art sensory data. More specifically, we trained a Random Forest classifier using two
sources of data:
• Sensory data, which consisted of the user's facial expressions, head and gaze angles, and
the distance from the camera.
• Appraisal data, which encoded information about the person's needs and goals and
how coherent they were with what the robot said and did. For example, appraisal data
considered when the person decided to change the topic of conversation (which may
indicate that the topic was not suitable for them, thus conflicting with their needs and
goals) or how well the robot was able to perform the activity that the person had requested.</p>
      <p>Binary classification (i.e., positive vs. negative emotions) results showed that using both data
sources led to a performance improvement compared to using sensory data only: the balanced
accuracy passed from (64.85 ± 2.30)% to (66.44 ± 0.55)% in the K-fold scenario and from
(59.71 ± 1.33)% to (62.43 ± 0.66)% in the Leave One Person Out scenario.</p>
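      <p>
        For reference, the following sketch illustrates how such a classifier could be trained and
evaluated. It is not the exact pipeline of [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]; the feature arrays, file names, and hyperparameters are placeholders.
      </p>
      <preformat>
# Minimal sketch (assumed data layout): a Random Forest trained on the concatenation of
# sensory and appraisal features, evaluated with K-fold and Leave One Person Out splits
# using balanced accuracy.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold, LeaveOneGroupOut

# Hypothetical pre-extracted features and labels:
# X_sensory   - facial expressions, head/gaze angles, distance from the camera
# X_appraisal - e.g. topic-change and task-success indicators from the dialogue
# y           - binary labels (0 = negative, 1 = positive emotion)
# groups      - participant identifiers, used for the Leave One Person Out split
X_sensory = np.load("sensory.npy")
X_appraisal = np.load("appraisal.npy")
y = np.load("labels.npy")
groups = np.load("participants.npy")

X = np.hstack([X_sensory, X_appraisal])  # combine both data sources
clf = RandomForestClassifier(n_estimators=100, random_state=0)

kfold_scores = cross_val_score(
    clf, X, y, scoring="balanced_accuracy",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))
lopo_scores = cross_val_score(
    clf, X, y, groups=groups, scoring="balanced_accuracy", cv=LeaveOneGroupOut())
print("K-fold:", kfold_scores.mean(), "LOPO:", lopo_scores.mean())
      </preformat>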
    </sec>
    <sec id="sec-3">
      <title>3. Cloud Architecture For Emotion Recognition</title>
      <p>
        Given these preliminary results, the current work proposes a cloud architecture for the online
implementation of the system. The overall framework results from integrating an Emotion
Recognition service with the preexisting CAIR verbal interaction system [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. The CAIR system
can manage a knowledge-based autonomous interaction by accepting commands to execute
actions and conversing with the person about various topics. Such integration is possible due
to the client-server architecture and the use of services implemented as REST APIs [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], which
grant flexibility, scalability, portability, and independence. In the same way, further services
may easily be added in the future. The system can be used by most devices with Internet
connectivity that can acquire input through a microphone and provide output through a
screen or speaker, combined with a camera (and possibly other sensors) acquiring data from
the environment. Figure 2 shows the overall architecture of the system.
      </p>
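      <p>
        Purely as an illustration of the REST-based exchange, a “dialogue” request and its response
might look as follows; the field names are assumptions and do not document the actual API.
      </p>
      <preformat>
# Illustrative payloads only; field names are assumptions, not the real interface.
dialogue_request = {
    "type": "dialogue",
    "sentence": "I would like to talk about music",  # user's transcribed speech
    "state": {"...": "client state, see Section 3.1.1"},
}
dialogue_response = {
    "reply": "What kind of music do you like?",      # sentence spoken by the robot
    "task": None,                                    # optional action to execute
    "state": {"emotion": "positive", "...": "updated client state"},
}
      </preformat>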
      <sec id="sec-3-1">
        <title>3.1. Server</title>
        <p>The server is composed of three web services, implemented in Python using the Flask-RESTful
framework (https://flask-restful.readthedocs.io/en/latest/): i) the Hub service, which handles the
requests from the client, ii) the Dialogue service, which manages the interaction with the user, and
iii) the Emotion Recognition service, which provides the user's emotional state during the interaction.</p>
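        <p>
          A minimal Flask-RESTful skeleton of the three services is sketched below; resource names,
routes, and payloads are illustrative assumptions rather than the deployed interface.
        </p>
        <preformat>
# Minimal Flask-RESTful skeleton (assumed routes and payloads).
from flask import Flask, request
from flask_restful import Api, Resource

app = Flask(__name__)
api = Api(app)

class Hub(Resource):
    def post(self):
        # Entry point for "dialogue" and "acquisition" requests from clients.
        payload = request.get_json()
        # ... forward to the Dialogue / Emotion Recognition services ...
        return {"state": payload.get("state", {})}, 200

class Dialogue(Resource):
    def post(self):
        # Returns the next verbal reply (and possibly a task) for the client.
        return {"reply": "...", "task": None}, 200

class EmotionRecognition(Resource):
    def post(self):
        # Returns the current emotion label, or starts/stops an acquisition.
        return {"emotion": "positive"}, 200

api.add_resource(Hub, "/hub")
api.add_resource(Dialogue, "/dialogue")
api.add_resource(EmotionRecognition, "/emotion")

if __name__ == "__main__":
    app.run()
        </preformat>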
        <sec id="sec-3-1-1">
          <title>Dialogue service</title>
          <p>Dialogue
request
Dialogue
response
Dialogue
request
Dialogue
response
(+ emotion)
Acquisition Acquisition
request response
Sensory data
Sensory data
Appraisal data
Acquisition</p>
          <p>request
Acquisition
response
Emotion request
Emotion response</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>Sensory data processing</title>
          <p>Start Stop</p>
        </sec>
        <sec id="sec-3-1-3">
          <title>Emotion</title>
        </sec>
        <sec id="sec-3-1-4">
          <title>Recognition service</title>
          <p>Raw
sensory
data</p>
        </sec>
        <sec id="sec-3-1-5">
          <title>Sensors Cloud</title>
          <p>Raw
sensory
data</p>
        </sec>
        <sec id="sec-3-1-6">
          <title>Sensory device</title>
          <p>Appraisal data
DB</p>
        </sec>
        <sec id="sec-3-1-7">
          <title>Hub service</title>
          <p>Client</p>
        </sec>
        <sec id="sec-3-1-8">
          <title>Client device</title>
          <p>3.1.1. Hub Service
The Hub service handles all the requests from the client. At the first request, the Hub associates
the new client with an initial state, which will be stored on the client side and included in all
upcoming requests. The client state is composed of the emotional state of the user and pieces of
dialogue information, e.g., the current topic of conversation, the type of sentence chosen by
the system, and the moments when the person and the robot started and finished speaking.
Throughout the interaction, the requests from the client can be of two types: i) the “dialogue”
request, aimed at advancing the interaction with the user, and ii) the “acquisition” request,
which starts or stops the acquisition from one or multiple sensors.</p>
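          <p>
            As an illustration, the initial client state created by the Hub might be structured as
follows; the field names are hypothetical and only mirror the description above.
          </p>
          <preformat>
# Hypothetical client state, created by the Hub at the first request and echoed back
# by the client in every subsequent request.
initial_state = {
    "emotion": None,                  # latest emotion label (positive / negative)
    "dialogue": {
        "topic": "greetings",         # current topic of conversation
        "sentence_type": "question",  # type of sentence chosen by the system
        "user_speech": {"start": None, "end": None},   # when the person spoke
        "robot_speech": {"start": None, "end": None},  # when the robot spoke
    },
}
          </preformat>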
          <p>In case of a dialogue request, the Hub forwards it to the Dialogue service, which provides
the next step of the interaction (more details in Section 3.1.2). Then, an “emotion” request,
containing the client state, allows the Hub to obtain the user’s emotional state from the Emotion
Recognition service.</p>
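          <p>
            A possible sketch of this flow, with hypothetical service URLs and payload fields, is the
following.
          </p>
          <preformat>
# Sketch of how the Hub could serve a dialogue request: forward it to the Dialogue
# service, then query the Emotion Recognition service with the client state.
# URLs and payload fields are illustrative assumptions.
import requests

def handle_dialogue_request(sentence, state):
    reply = requests.post("http://localhost:5001/dialogue",
                          json={"sentence": sentence, "state": state}).json()
    emotion = requests.post("http://localhost:5002/emotion",
                            json={"state": state}).json()
    state["emotion"] = emotion.get("emotion")           # update the client state
    return {"reply": reply.get("reply"), "state": state}
          </preformat>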
          <p>
            In case of an acquisition request, the Hub forwards it to the Emotion Recognition service
(more details in Section 3.1.3), which starts or stops the acquisition from the corresponding
sensor.
          </p>
        </sec>
        <sec id="sec-3-1-2">
          <title>3.1.2. Dialogue Service</title>
          <p>
            The Dialogue service is mainly responsible for managing the interaction with the user. It
recognizes the person's intention to discuss a specific topic or to ask the agent to execute a
task. More in detail, after processing the sentence pronounced by the person (contained in the
dialogue request), it obtains the verbal reply and possibly the task to execute by exploiting the
Ontology [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ], which contains all concepts and sentences used in the interaction. In addition, the
service extracts appraisal data from the user's sentence and stores them in the SQLite database.
          </p>
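          <p>
            A minimal sketch of such storage, with hypothetical table and column names, could be the
following.
          </p>
          <preformat>
# Sketch of how the Dialogue service might store appraisal data in SQLite after
# processing a user sentence. Table and column names are assumptions.
import sqlite3
import time

def store_appraisal(db_path, client_id, topic_changed, task_success):
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS appraisal "
                "(client_id TEXT, timestamp REAL, topic_changed INTEGER, task_success REAL)")
    con.execute("INSERT INTO appraisal VALUES (?, ?, ?, ?)",
                (client_id, time.time(), int(topic_changed), task_success))
    con.commit()
    con.close()
          </preformat>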
        </sec>
        <sec id="sec-3-1-3">
          <title>3.1.3. Emotion Recognition Service</title>
          <p>
            The Emotion Recognition service exploits the pre-trained Random Forest classifier to assess
the user's emotional state. It handles two types of requests from the Hub service, namely the
acquisition and the emotion requests. Upon an acquisition request, the Emotion Recognition
service starts or stops one or multiple “Sensory data processing” tasks, which process data
coming from sensors through the cloud. Sensory data are continuously stored in the SQLite
database during the acquisition. When the Hub sends an “emotion” request, the Emotion
Recognition service retrieves the two categories of classifier inputs from the database,
namely sensory and appraisal data (explained in Section 2). The emotion label is then returned
to the Hub in the client state.
          </p>
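          <p>
            A sketch of how the emotion request could be served, assuming hypothetical table layouts
and a serialized model file, is shown below.
          </p>
          <preformat>
# Sketch of the emotion-request path: retrieve the latest sensory and appraisal
# features from the database and feed them to the pre-trained Random Forest.
# Queries, feature layout, and the model file are assumptions.
import sqlite3
import joblib
import numpy as np

model = joblib.load("random_forest.joblib")  # pre-trained binary classifier

def predict_emotion(db_path, client_id):
    con = sqlite3.connect(db_path)
    sensory = con.execute(
        "SELECT * FROM sensory WHERE client_id = ? ORDER BY timestamp DESC LIMIT 1",
        (client_id,)).fetchone()
    appraisal = con.execute(
        "SELECT * FROM appraisal WHERE client_id = ? ORDER BY timestamp DESC LIMIT 1",
        (client_id,)).fetchone()
    con.close()
    # Drop the client_id and timestamp columns, keep only the feature values.
    features = np.array(sensory[2:] + appraisal[2:]).reshape(1, -1)
    return "positive" if model.predict(features)[0] == 1 else "negative"
          </preformat>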
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Client</title>
        <p>For the purposes of this study, the client is the robot used for the interaction. However,
the client may also be a computer or, in general, most devices with Internet connectivity that can
acquire input through a microphone and provide output through a screen or speaker.</p>
        <p>Each client is associated with a state, initialized at the first request to the Hub service, and
then stored locally. The state is updated from the client side to contain helpful information,
such as when the person and the robot started and finished speaking. The Dialogue and the
Emotion Recognition services then use these pieces of information to provide the response to
upcoming requests.</p>
        <p>At the beginning of the interaction, the client also makes a request to the Hub service to
start the acquisition from one or multiple sensors. For example, the request may include the IP
address of the camera’s video stream.</p>
        <p>Throughout the interaction, once the client has obtained the verbal reply and possibly the
task to execute from the server, it interacts with the user and acquires their reply. The interaction
ends when explicitly requested by the user.</p>
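        <p>
          A possible client-side loop, assuming hypothetical listen() and say() helpers wrapping the
robot's speech-to-text and text-to-speech, is sketched below; the endpoint and payload fields are
assumptions.
        </p>
        <preformat>
# Sketch of a client loop: listen, send a dialogue request to the Hub with the locally
# stored state, speak the reply, and repeat until the user asks to stop.
import requests

HUB_URL = "http://cloud.example.org/hub"  # hypothetical endpoint

def interaction_loop(listen, say, state):
    while True:
        sentence = listen()                     # speech-to-text on the client
        response = requests.post(HUB_URL,
                                 json={"sentence": sentence, "state": state}).json()
        state = response["state"]               # updated state, stored locally
        say(response["reply"])                  # text-to-speech on the client
        if response.get("end_of_interaction"):  # hypothetical flag: user asked to stop
            break
        </preformat>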
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Sensory Devices</title>
        <p>Sensory devices acquire data and stream them to the cloud. Although the post-processing
algorithm has been designed to extract data from camera video streams, the system may be
used with other types of sensors (such as a smartwatch, or a microphone for speech emotion
recognition).</p>
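        <p>
          A generic sensory device could stream its samples to the cloud along these lines; the
endpoint, payload, and sampling period are assumptions.
        </p>
        <preformat>
# Sketch of a sensory device streaming raw samples to the cloud, where the
# "Sensory data processing" tasks consume them.
import time
import requests

def stream_sensor(read_sample, device_id, url="http://cloud.example.org/raw", period=0.1):
    while True:
        sample = read_sample()  # e.g. an encoded camera frame or a smartwatch reading
        requests.post(url, json={"device": device_id, "t": time.time(), "data": sample})
        time.sleep(period)
        </preformat>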
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>N.</given-names>
            <surname>Mavridis</surname>
          </string-name>
          ,
          <article-title>A review of verbal and non-verbal human-robot interactive communication</article-title>
          ,
          <source>Robotics and Autonomous Systems</source>
          <volume>63</volume>
          (
          <year>2015</year>
          )
          <fpage>22</fpage>
          -
          <lpage>35</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Spezialetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Placidi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rossi</surname>
          </string-name>
          ,
          <article-title>Emotion recognition for human-robot interaction: Recent advances and future perspectives</article-title>
          ,
          <source>Frontiers in Robotics and AI</source>
          <volume>7</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>Castellano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Leite</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pereira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Martinho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Paiva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>McOwan</surname>
          </string-name>
          ,
          <article-title>Affect recognition for interactive companions: Challenges and design in real world scenarios</article-title>
          ,
          <source>Journal on Multimodal User Interfaces</source>
          <volume>3</volume>
          (
          <year>2009</year>
          )
          <fpage>89</fpage>
          -
          <lpage>98</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>She</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hirota</surname>
          </string-name>
          ,
          <article-title>Softmax regression based deep sparse autoencoder network for facial emotion recognition in human-robot interaction</article-title>
          ,
          <source>Information Sciences</source>
          <volume>428</volume>
          (
          <year>2018</year>
          )
          <fpage>49</fpage>
          -
          <lpage>61</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hirota</surname>
          </string-name>
          ,
          <article-title>Weight-adapted convolution neural network for facial expression recognition in human-robot interaction</article-title>
          ,
          <source>IEEE Transactions on Systems, Man, and Cybernetics: Systems</source>
          <volume>51</volume>
          (
          <year>2021</year>
          )
          <fpage>1473</fpage>
          -
          <lpage>1484</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C. C.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Mower</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Busso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narayanan</surname>
          </string-name>
          ,
          <article-title>Emotion recognition using a hierarchical binary decision tree approach</article-title>
          ,
          <source>Speech Communication</source>
          <volume>53</volume>
          (
          <year>2011</year>
          )
          <fpage>1162</fpage>
          -
          <lpage>1171</lpage>
          . Sensing Emotion and Affect - Facing Realism in Speech Processing.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mortillaro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Meuleman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Scherer</surname>
          </string-name>
          ,
          <article-title>Advocating a componential appraisal model to guide emotion recognition</article-title>
          ,
          <source>International Journal of Synthetic Emotions</source>
          <volume>3</volume>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R. W.</given-names>
            <surname>Picard</surname>
          </string-name>
          ,
          <source>Affective Computing</source>
          , MIT Press,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Lazarus</surname>
          </string-name>
          ,
          <source>Emotion and Adaptation</source>
          , Oxford University Press,
          <year>1991</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Kowalczuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Czubenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Merta</surname>
          </string-name>
          ,
          <article-title>Interpretation and modeling of emotions in the management of autonomous robots using a control paradigm based on a scheduling variable</article-title>
          ,
          <source>Engineering Applications of Artificial Intelligence</source>
          <volume>91</volume>
          (
          <year>2020</year>
          )
          <fpage>103562</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C.</given-names>
            <surname>Breazeal</surname>
          </string-name>
          ,
          <article-title>Emotion and sociable humanoid robots</article-title>
          ,
          <source>International Journal of Human-Computer Studies</source>
          <volume>59</volume>
          (
          <year>2003</year>
          )
          <fpage>119</fpage>
          -
          <lpage>155</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Demutti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>D'Amato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Recchiuto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Oneto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sgorbissa</surname>
          </string-name>
          ,
          <article-title>Assessing emotions in human-robot interaction based on the appraisal theory</article-title>
          ,
          <source>in: 2022 31st IEEE International Conference on Robot and Human Interactive Communication (RO-MAN)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>1435</fpage>
          -
          <lpage>1442</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>L.</given-names>
            <surname>Grassi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. T.</given-names>
            <surname>Recchiuto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sgorbissa</surname>
          </string-name>
          ,
          <article-title>Sustainable verbal and non-verbal human-robot interaction through cloud services</article-title>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Masse</surname>
          </string-name>
          ,
          <source>REST API Design Rulebook: Designing Consistent RESTful Web Service Interfaces</source>
          , O'Reilly Media, Inc.,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>