<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multimodal interaction with emotional feedback</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Francesco Cutugno</string-name>
          <email>cutugno@unina.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonio Origlia</string-name>
          <email>antonio.origlia@unina.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Rinaldi</string-name>
          <email>rober.rinaldi@studenti.unina.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>LUSI-lab, Department of Physics, University of Naples "Federico II"</institution>
          ,
          <addr-line>Naples</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we extend a multimodal framework based on speech and gestures to include emotional information by means of anger detection. In recent years multimodal interaction has become of great interest thanks to the increasing availability of mobile devices allowing a number of different interaction modalities. Taking intelligent decisions is a complex task for automated systems, as multimodality requires procedures to integrate different events to be interpreted as a single intention of the user, and it must take into account that different kinds of information can come from a single channel, as in the case of speech, which conveys a user's intentions through both syntax and prosody.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>State of the art</title>
      <p>
        Multimodal interface systems were introduced for the first time in the system
presented in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], where graphical objects were created and moved on a screen
using voice recognition and finger pointing. In [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] a set of theoretical guidelines
was defined, named the CARE properties (Complementarity, Assignment,
Redundancy, Equivalence). These properties establish which modes of
interaction between users and systems can be implemented and, at the same time, help
to formalize relationships among different modalities. The increasing amount of
research and practical applications of multimodal interaction systems recently
led to the definition of the Synchronized Multimodal User Interaction
Modeling Language (SMUIML) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]: a formal way of representing multimodal
interactions. While the possibility of implementing multimodal information access
systems has been explored since mobile phones started to offer internet-based
services [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], with the widespread adoption of touch screens on mobile
devices, mobile broadband and fast speech recognition, interfaces supporting truly
multimodal commands are now available to everyday users. An example is the
Speak4it local search application [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], where users can issue mobile search queries with multimodal commands
combining speech and gestures. The great interest
raised by the possibilities offered by this kind of system, not only in a mobile
environment, soon highlighted the need to formalize the requirements an
automated interactive system needs to fulfill to be considered multimodal. This
problem was addressed by the W3C, which has established a set of requirements,
concerning both interaction design [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and system architecture [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], formalized
as properties and theoretical standards for multimodal architectures.
      </p>
      <p>
        Concerning the use of anger detectors in IVRs, in previous studies [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ,
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]
systems have usually been trained on acted-emotion corpora before being deployed
on IVR platforms. An exception to this trend is represented by [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], in which a
corpus of telephone calls collected from a troubleshooting call-center database
was used. In that study, the impact of emotions was shown to be minimal with
respect to the use of log-files, as the authors observed a uniform distribution of
negative emotions over successful and unsuccessful calls. This, however, may be
a characteristic of the employed corpus, in which people having problems with a
High Speed Internet provider were calling, and is therefore significantly different
from the situation our system deals with, as our target consists of users of a bus
stop information service.
      </p>
    </sec>
    <sec id="sec-3">
      <title>System architecture</title>
      <p>
        In this paper we extend a pre-existing multimodal framework, running on
Android OS and based on speech and gestures, to include emotional information by
means of a user emotional attitude detector. We merge these concepts in a case
study previously presented in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], in which a querying system for bus stops in the
city of Naples was implemented. Users can query the system by speaking and
drawing on the touch screen, producing requests for bus stops in a given area on
the map. In a typical use case the user asks: "Please show me the bus stops of
the C6 line in this area", drawing a circle on a map on the screen while speaking.
      </p>
      <p>The user can draw lines and circles on a map, aiming to select a precise
geographic area of interest concerning public transportation. In addition, the user
can hold her finger for a few seconds on a precise point on the map in order to
select a small rectangular default area with the same purpose. At
the same time, speech integrates the touch gesture to complete the command.
This way, users can ask for a particular bus line or timetable (using speech) in
a given geographic area (using touch), as shown in Figure 1.</p>
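      <p>To make the modality split concrete, the following minimal sketch shows one way
such a combined command could be represented; the class and field names are
hypothetical illustrations, not the framework's actual API.</p>
      <preformat>
# Hypothetical sketch: a multimodal command pairing a spoken query
# with a touch-selected geographic area (all names are illustrative).
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class TouchArea:
    kind: str                          # "circle", "line" or "default_rect"
    points: List[Tuple[float, float]]  # coordinates of the gesture

@dataclass
class MultimodalCommand:
    bus_line: Optional[str]   # e.g. "C6", filled from the ASR transcript
    want_timetable: bool      # True if the user asked for a timetable
    area: Optional[TouchArea] # filled from the touch/geographical module

    def is_complete(self) -> bool:
        # A command is executable only when speech and gesture both arrived.
        return self.bus_line is not None and self.area is not None
      </preformat>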
      <p>
        For details concerning the general architecture, we refer the reader to [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
In the present system, the audio signal is treated as a twin input. The first
copy is connected to the linguistic content itself, obtained by means of an ASR
process and a subsequent string-parsing process that generates a Command
table; this table is structurally incomplete, as more data are needed to fill in
the missing geographical information completing the user request. The second
copy goes to an emotional attitude classifier (details are presented in the next
section) returning the anger level characterizing the utterance produced by the user.
      </p>
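      <p>A minimal sketch of this twin routing of the audio signal is given below; the
pipeline functions are passed in as parameters because they stand for the actual
ASR, parser and classifier components, whose names we do not specify here.</p>
      <preformat>
# Hypothetical sketch: the same audio buffer feeds two pipelines,
# one linguistic (ASR + parsing) and one paralinguistic (anger detection).
def process_audio(audio_buffer, run_asr, parse_transcript, classify_anger):
    # Pipeline 1: linguistic content -> structurally incomplete Command
    # table (the geographical data from the touch channel is still missing).
    transcript = run_asr(audio_buffer)
    command = parse_transcript(transcript)

    # Pipeline 2: emotional attitude -> anger level of the utterance.
    anger_level = classify_anger(audio_buffer)

    return command, anger_level
      </preformat>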
      <p>The semantic interpreter collects the inputs from the parsed ASR output and from the
touch/geographical modules and attempts an answer using the freely available
Drools (http://www.jboss.org/drools) rule engine, while anger detection is used
to launch backup strategies if the transaction does not succeed and the user is
unsatisfied with the service, as shown in Figure 2.</p>
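      <p>The decision step can be summarized by the following sketch; we use plain
Python conditionals purely for illustration, whereas the actual system expresses
the fusion logic as Drools rules, and the threshold and handler names are
assumptions of ours.</p>
      <preformat>
# Hypothetical sketch of the semantic interpreter's decision step:
# fuse parsed speech with the touch area, and fall back to a backup
# strategy when the transaction fails and the user sounds angry.
ANGER_THRESHOLD = 0.5  # illustrative value, not taken from the paper

def interpret(command, touch_area, anger_level):
    command.area = touch_area
    if command.is_complete():
        return execute_query(command)             # placeholder query call
    if anger_level > ANGER_THRESHOLD:
        return launch_backup_strategy(command)    # e.g. guided re-prompt
    return ask_for_missing_information(command)   # normal clarification turn
      </preformat>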
      <p>Fig. 2. (a) Places of interest found by combining speech and gestures.
(b) Backup strategy for unrecognized commands with angry users.</p>
    </sec>
    <sec id="sec-4">
      <title>Emotion recognition module</title>
      <p>
        Automatic emotion recognition is a research topic that has been gaining
attention in recent years because of the additional information it brings into
automatic systems about the users' state of mind. While there are a number
of applications and representations of emotions in the literature, one that has
found application in IVR systems is anger detection. Capturing a negative state
of the speaker during the interaction is information that has been exploited in
the past, for example, in automated call centers to forward the call to a human
agent. Anger detection is usually based on the response given by an automatic
classifier on the basis of acoustic features extracted from a received utterance.
Feature extraction and classification methods for emotions are active research
areas: in this work, we use a syllable-based feature extraction method and a
Support Vector Machine (SVM) to perform the automatic classification of an
utterance into two classes: Neutral and Angry. The anger detection module is
trained on a subpart of the motion corpus [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] containing 400 angry and neutral
speech recordings in Italian, German, French and English.
      </p>
      <p>
        First, the recorded utterance is segmented into syllables. This is done by
applying the automatic segmentation algorithm presented in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Next, data
are extracted from syllable nuclei, estimated by the -3 dB band of the energy
peak associated with each automatically detected syllable. Syllable nuclei, being
stable spectral areas containing vowel sounds, carry more reliable information
about the distribution of energy among the frequencies as intended
by the speaker. Specific spectral measurements like the spectral centroid,
moreover, are meaningful only inside syllable nuclei. To improve the reliability of the
extracted measures, only syllable nuclei at least 80 ms long were analyzed. An
example of automatic syllable nuclei detection is shown in Figure 3.
      </p>
      <p>[Figure 3 appears here: spectrogram (frequency, Hz) and intensity contour (dB)
over time (s); I/O marks delimit the automatically detected syllable nuclei.]</p>
      <p>From each nucleus we extract the following features: mean pitch (perceived
fundamental frequency), spectral centroid (mean of the frequencies in the
spectrum weighted by their magnitude) and energy.</p>
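      <p>The per-nucleus extraction can be sketched as follows; the sketch assumes
that per-frame pitch and energy values and an averaged spectrum have already been
computed for the nucleus with any standard analysis tool, and the function name is
our own illustration.</p>
      <preformat>
# Hypothetical sketch: features extracted from one syllable nucleus.
import numpy as np

def nucleus_features(pitch_hz, magnitudes, freqs_hz, energy):
    """pitch_hz, energy: per-frame arrays over the nucleus;
    magnitudes, freqs_hz: averaged spectrum of the nucleus."""
    mean_pitch = float(np.mean(pitch_hz))
    # Spectral centroid: mean of the frequencies weighted by magnitude.
    centroid = float(np.sum(freqs_hz * magnitudes) / np.sum(magnitudes))
    mean_energy = float(np.mean(energy))
    return mean_pitch, centroid, mean_energy
      </preformat>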
      <p>To produce the final feature set, global statistics were computed over the
feature vectors extracted from each syllable. Mean and standard deviation were
included for each feature, while the maximum value was introduced for energy
only. An SVM was trained and tested on the features extracted from the motion
corpus. The F-measure obtained in a 10-fold cross-validation test was 90.5%.</p>
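      <p>A compact sketch of the utterance-level statistics and of the classifier
evaluation is given below; it uses scikit-learn for the SVM and cross-validation,
which is our illustrative choice, not necessarily the toolkit used in the study.</p>
      <preformat>
# Hypothetical sketch: utterance-level feature vector and SVM evaluation.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def utterance_vector(nuclei):
    """nuclei: array of shape (n_nuclei, 3) with
    [mean_pitch, spectral_centroid, mean_energy] per nucleus."""
    nuclei = np.asarray(nuclei)
    feats = []
    for col in range(3):  # mean and std for each of the three features
        feats += [nuclei[:, col].mean(), nuclei[:, col].std()]
    feats.append(nuclei[:, 2].max())  # maximum introduced for energy only
    return np.array(feats)

# With X holding one vector per utterance and y the labels
# (0 = Neutral, 1 = Angry), a 10-fold test would read:
# scores = cross_val_score(SVC(), X, y, cv=10, scoring="f1")
      </preformat>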
    </sec>
    <sec id="sec-5">
      <title>Discussion</title>
      <p>
        The proposed system is presently still under development, so its usability has not
yet been completely assessed. The multimodal interaction front-end presented in
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], here integrated with the anger detection module, will be tested in the near
future in order to validate both the accuracy of the approach in real conditions
of use and the user acceptability and satisfaction. This will be done by means
of both an objective and a subjective analysis. The former evaluation will be
based on a background software module able to produce log-files containing
all the details of the interaction session (time of interaction, number of touches
on the pad, length of the speech utterance, etc.). In an evaluation release of the
application, the user will be asked a posteriori to verify (see the sketch of a log
record below):
- whether the ASR worked properly;
- whether the request was correctly recognized and executed.
      </p>
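      <p>As an illustration of the kind of record such a logging module could produce
(the field names are assumptions of ours, not the actual log format):</p>
      <preformat>
# Hypothetical sketch: one interaction-session log record.
session_log = {
    "interaction_time_s": 12.4,   # duration of the whole session
    "touch_count": 3,             # number of touches on the pad
    "utterance_length_s": 2.1,    # length of the speech utterance
    "asr_ok": True,               # user-verified: ASR worked properly
    "request_ok": False,          # user-verified: request recognized/executed
}
      </preformat>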
      <p>The analysis of the data collected in this way will be put in relation with
data coming from a subjective investigation based on a questionnaire proposed
to a further set of users (different from those involved in the former analysis) in
order to estimate the subjective acceptability of, and the degree of satisfaction
with, the proposed application.</p>
      <p>As for the data on which the Support Vector Machine classifier
is trained, while we are currently using a corpus of acted emotions, we plan to use
the recordings coming from the tests the system will undergo. We expect this to
improve performance, as the system will be retrained to work in final deployment
conditions. The classifier will therefore be adapted to real-life conditions both in
terms of spontaneous emotional display and in terms of recording environment,
as new recordings will include telephone-microphone quality and background
noise.</p>
      <p>
        Differently from what is stated in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], where the telephonic domain and the
nature of the interaction did not encourage the introduction of an anger
detection system in order to reduce the number of hang-ups during dialogues, we
believe that the mobile device domain will benefit from the addition of an
emotional state recognizer. In the case of apps for mobile devices, requirements
are different from those observed during telephonic dialogues and, provided that
the Human-Computer Interface is well designed and correctly engineered, the user
is not really expected to close the app before obtaining the required
service. In this view, anger detection must be seen as a further effort made by
the designer to convince users not to give up and close the app before reaching
their goals.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Conclusions</title>
      <p>We have presented a framework to design and implement multimodal
interfaces with relatively little effort. As far as we know, anger detection and,
in general, emotional feedback had not been taken into account in mobile
applications before. The case study we presented shows a mobile application
integrating speech recognition, anger detection and gesture analysis to implement
a bus stop querying system. A basic release of the presented system,
without speech and multimodal support, is presently available on the Google Market
(https://play.google.com/store/apps/details?id=it.unina.lab.citybusnapoli); it has
received excellent user reviews and more than 2600 downloads (April 2012), which we
consider a very effective usability test. The multimodal version without emotive
feedback is also being tested for usability by means of a subjective procedure, and we
are now undergoing formal testing of the complete system in order to verify its
usability and its stability.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgements</title>
      <p>We would like to thank Vincenzo Galata for providing the speech recordings from
the as yet unpublished multilingual emotional speech corpus motion we used in
our experiments. We would also like to thank Antonio Caso for assisting during
the extension of the original framework to include the emotional module.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Ang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dhillon</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krupski</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shriberg</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Stolcke</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Prosody-based automatic detection of annoyance and frustration in human-computer dialog</article-title>
          .
          <source>In: Proc. of ICSLP</source>
          . pp.
          <year>2037</year>
          –
          <year>2040</year>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bodell</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dahl</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kliche</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Larson</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tumuluri</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yudkowsky</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Selvaraj</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Porter</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raggett</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raman</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wahbe</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Multimodal architectures and interfaces (</article-title>
          <year>2011</year>
          ), http://www.w3.org/TR/mmi-arch/
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bolt</surname>
          </string-name>
          , R.A.:
          <article-title>"Put-that-there": Voice and gesture at the graphics interface</article-title>
          .
          <source>SIGGRAPH Comput. Graph</source>
          .
          <volume>14</volume>
          (
          <issue>3</issue>
          ),
          <volume>262</volume>
          –
          <fpage>270</fpage>
          (
          <year>1980</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Burkhardt</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polzehl</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stegmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Metze</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huber</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Detecting real life anger</article-title>
          .
          <source>In: Proc. of ICASSP</source>
          . pp.
          <volume>4761</volume>
          –
          <issue>4764</issue>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Coutaz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nigay</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salber</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blandford</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>May</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Young</surname>
            ,
            <given-names>R.M.</given-names>
          </string-name>
          :
          <article-title>Four easy pieces for assessing the usability of multimodal interaction: the care properties</article-title>
          .
          <source>In: Proc. of INTERACT</source>
          . pp.
          <volume>115</volume>
          –
          <issue>120</issue>
          (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Cutugno</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leano</surname>
            ,
            <given-names>V.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mignini</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rinaldi</surname>
          </string-name>
          , R.:
          <article-title>Multimodal framework for mobile interaction</article-title>
          .
          <source>In: Proc. of AVI</source>
          . pp.
          <volume>197</volume>
          –
          <issue>203</issue>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Dumas</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lalanne</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ingold</surname>
          </string-name>
          , R.:
          <article-title>Description languages for multimodal interaction: A set of guidelines and its illustration with SMUIML</article-title>
          .
          <source>Journal on Multimodal User Interfaces</source>
          <volume>3</volume>
          (
          <issue>3</issue>
          ),
          <volume>237</volume>
          –
          <fpage>247</fpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Ehlen</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Johnston</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Multimodal local search in speak4it</article-title>
          .
          <source>In: Proc. of IUI</source>
          . pp.
          <volume>435</volume>
          –
          <issue>436</issue>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Galata</surname>
          </string-name>
          , V.:
          <article-title>Production and perception of vocal emotions: a cross-linguistic and cross-cultural study</article-title>
          ,
          <source>PhD Thesis</source>
          - University of Calabria, Italy
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Herm</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmitt</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liscombe</surname>
          </string-name>
          , J.:
          <article-title>When calls go wrong: How to detect problematic calls based on log- les and emotions</article-title>
          .
          <source>In: Proc. of Interspeech</source>
          . pp.
          <volume>463</volume>
          –
          <issue>466</issue>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Larson</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raman</surname>
            ,
            <given-names>T.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raggett</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bodell</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Johnston</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Waters</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>W3C multimodal interaction framework (</article-title>
          <year>2003</year>
          ), http://www.w3.org/TR/mmi-framework/
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Petrillo</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cutugno</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>A syllable segmentation algorithm for english and italian</article-title>
          .
          <source>In: Proc. of Eurospeech</source>
          . pp.
          <volume>2913</volume>
          –
          <issue>2916</issue>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Petrushin</surname>
          </string-name>
          , V.:
          <article-title>Emotion in speech: Recognition and application to call centers</article-title>
          .
          <source>In: Proc. of ANNE [Online]</source>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Polzehl</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmitt</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Metze</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Approaching multilingual emotion recognition from speech - on language dependency of acoustic/prosodic features for anger detection</article-title>
          .
          <source>In: Proc. of Speech Prosody</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Yacoub</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Simske</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Burns</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Recognition of emotions in interactive voice response systems</article-title>
          .
          <source>In: Proc. of Eurospeech</source>
          . pp.
          <fpage>729</fpage>
          –
          <lpage>732</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Zaykovskiy</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmitt</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lutz</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>New use of mobile phones: towards multimodal information access systems</article-title>
          .
          <source>In: Proc. of IE</source>
          . pp.
          <fpage>255</fpage>
          –
          <lpage>259</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>