<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Audio-based Emotion Recognition for Advanced Automatic Retrieval in Judicial Domain</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>F. Archetti</string-name>
          <email>archetti@milanoricerche.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>G. Arosio</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>E. Fersini</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>E. Messina</string-name>
          <email>messinag@disco.unimib.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Consorzio Milano Ricerche</institution>
          ,
          <addr-line>Via Cicognara 7 - 20129 Milano</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>DISCO, Universita degli Studi di Milano-Bicocca</institution>
          ,
          <addr-line>Viale Sarca, 336 - 20126 Milano</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Thanks to recent progress in the management of judicial proceedings, especially the introduction of audio/video recording systems, semantic retrieval has now become a realistic key challenge. In this context an emotion recognition engine, through the analysis of the vocal signatures of the actors involved in judicial proceedings, could provide useful annotations for the semantic retrieval of multimedia clips. With respect to the generation of semantic emotional tags in the judicial domain, two main contributions are given: (1) the construction of an Italian emotional database for the annotation of Italian proceedings; (2) the investigation of a hierarchical classification system, based on a risk minimization method, able to recognize emotional states from vocal signatures. In order to estimate the degree of affection we compared the proposed classification method with traditional ones, highlighting in terms of classification accuracy the improvements given by a hierarchical learning approach.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>The IT infrastructure introduced into judicial environments, with particular
attention to audio/video recording systems in courtrooms, has had a great impact
on the work of legal actors. All the recorded events that occur during a trial
are available for subsequent consultation. However, despite the huge quantity of
information in multimedia form captured during trials, the current content
retrieval process is based on manual consultation of the entire multimedia tracks
or, in the best case, on an automatic retrieval service based on textual user
queries, with no possibility to search for specific semantic concepts.
Innovative features that will impact the current consultation processes are
introduced by the JUMAS project: the fusion of semantic annotations of different data
streams to deliver a more effective automatic retrieval system. A synthetic
representation of the JUMAS components is depicted in Figure 1. Consorzio Milano
Ricerche and Milano-Bicocca University will address three main topics in JUMAS:
(1) semantic annotation of the audio stream, (2) automatic template filling
of judicial transcripts and (3) multimedia summarization of audio/video judicial
proceedings. In this paper we deal with the semantic annotation of the audio signals
that characterize each trial.</p>
      <p>
        Emotional states associated with the actors involved in courtroom debates
represent one of the semantic concepts that can be extracted from multimedia
sources, indexed and subsequently retrieved for consultation purposes. It is
useful to stress the main difference between our method and the one at the base
of Layered Voice Analysis (LVA) systems: while the main objective of LVA is to
empower security officers and law enforcement agencies to discriminate between
"normal stress" and stress induced by deception during investigative phases, in
JUMAS the aim is to create semantic annotations of emotional states in order
to allow "emotion-based" retrieval of multimedia clips of judicial proceedings.
Despite the progress in understanding the mechanisms of emotions in human speech
from a psychological point of view, progress in the design and development of
automatic emotion recognition systems for practical applications is still in its
infancy, especially in judicial contexts. This limited progress is due to several
reasons: (1) the representation of the vocal signal with a set of numerical features
able to achieve reliable recognition; (2) the identification of those emotional states
that derive from a composition of other emotions (for example the "remorse" emotion
is a combination of "sadness" and "disgust"); (3) the presence of inter-speaker
differences such as variation in language and culture; (4) noisy environments; (5)
interaction among speakers; (6) the quality of the emotional database used for
learning, and its resemblance to real-world uttered emotions. A general emotion
recognition process can be described by four main phases: dataset construction,
attribute extraction, feature selection/generation and inference model learning.
The first phase deals with the collection of a corpus of voice signals uttered
by different speakers and representative of several emotional states. Once the
database is created, the feature extraction step is performed in order to map
the vocal signals into descriptive attributes collected in a series of numerical
vectors. From these attributes, through a feature selection/construction phase, a
feature set able to better discriminate emotional states is derived. These features
are used in the final step to create a classification model able to infer the emotional
states of unlabelled speakers. The literature can be classified according to these
four main phases. Concerning the dataset construction step,
several benchmarks in different languages have been collected. Among others we
can find the Serbian [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], German [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and Polish [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] emotional corpora. Considering
the attribute extraction phase, two of the most comprehensive studies ([
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]) were aimed at discovering the attribute sets that correlate best with
a given collection of emotional states. Their results highlighted that
F0 and spectral information have a high impact in automatic emotion recognition
systems. With respect to the feature selection step, there exist a great
number of approaches aimed at identifying the most discriminative characteristics
for a set of emotional states [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Concerning the final step, related to the induction of inference models
able to recognize the emotional states of unlabelled speakers, classification
algorithms have been widely investigated. The most extensive
comparisons between several classification algorithms are reported in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
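The four phases above can be summarized in a small pipeline sketch. The concrete components used here (univariate feature selection feeding an RBF-kernel SVM) are illustrative assumptions, not the paper's specific choices:

```python
# A sketch of the four-phase emotion recognition process. Phases 1-2
# (corpus construction and attribute extraction) are assumed to yield
# X (one numeric vector per utterance) and y (one emotion label per
# utterance); phases 3-4 are modelled as feature selection followed by
# inference-model learning.
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

model = Pipeline([
    # Phase 3: keep the k attributes that best discriminate the classes
    ("select", SelectKBest(f_classif, k=50)),
    # Phase 4: learn a classifier over the selected features
    ("svm", SVC(kernel="rbf")),
])
# Usage: model.fit(X_train, y_train); model.predict(X_test)
```

Fitting the selector and the classifier inside one pipeline keeps the feature selection confined to the training folds during evaluation.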
      <p>In this paper, we address the problem of finding the model that, with respect
to the characteristics of courtroom debates, is able to produce the optimal recognition
performance. The outline of the paper is the following. In Section 2 we present
two emotional corpora: a well-known benchmark for the German language is
introduced, while a new benchmark is proposed for the Italian language. In
Section 3 the extraction of vocal signatures from uttered emotional sentences is
described. In Section 4 traditional inference models and the proposed
Multi-Layer Support Vector Machines approach, with their respective experimental
results, are described. Finally, in Section 5 conclusions are drawn.</p>
    </sec>
    <sec id="sec-2">
      <title>Emotion Corpus</title>
      <p>The performance of an automatic emotion recognition system strictly depends
on the quality of the database used for inducing an inference model. Since an
emotion recognition engine must be "trained" on a set of samples, i.e. it needs to
estimate model parameters through a set of emotionally labelled sentences, there
are three ways of obtaining an emotional corpus:
1. recording by professional actors: the actors identify themselves with a specific
situation before acting a given "emotional" sentence;
2. Wizard-of-Oz (WOZ): a system interacts with the actors and guides them
into a specific emotional state that is subsequently recorded;
3. recording of real-world human emotions: the "emotional" sentences are
gathered by recording real-life situations.</p>
      <p>
        In order to compare the performance of learning algorithms with the state of the
art, we chose from the literature one of the most widely used emotional corpora,
known as the Berlin Database of Emotional Speech or Emo-DB. This corpus is
composed of a set of wave files (531 samples) that represent different emotional
states: neutral, anger, fear, joy, sadness, disgust and boredom. For a more
detailed description refer to [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. A further benchmark, built at the University of
Milano-Bicocca, is presented in the next subsection.
      </p>
      <sec id="sec-2-1">
        <title>Italian Emotional DB</title>
        <p>
          As pointed out in Section 1, emotion recognition can be strongly influenced by
several factors, in particular by language and culture. For this reason, we
decided that it would be useful to adopt an Italian corpus in order to investigate
Italian emotional behaviors. Since at the time of writing there is no Italian
benchmark, we decided to manually collect a set of audio files. Due to the difficulty
of finding actors available to record acted sentences, and the even greater difficulty
of obtaining recordings of real-world situations, we collected audio files from
movies and TV series dubbed by Italian professional actors. Differently from
other databases used in emotion recognition, in which the number of speakers
varies from 5 to 10, as in [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] and [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], our database construction is aimed at
creating a generic corpus: 40 movies and TV series are taken into account and, for
each of them, sentences acted by different actors are collected. Thus the number
of speakers is relatively high, making the system as independent of the speaker
as possible. The Italian Emotional Corpus, named ITA-DB, is composed of
391 balanced samples of the emotional states relevant to Italian judicial
proceedings: anger, fear, joy, sadness and neutral. This subset of emotions was
chosen in order to model the most interesting emotional states, from the judicial
actors' point of view, that can occur during Italian courtroom debates.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Extraction of vocal signatures</title>
      <p>Although there is not yet general agreement on which features are the most
representative, the most widely used are prosodic features, like the fundamental
frequency (also known as F0) and the formant frequencies (F1, F2, F3), energy-related
features and Mel Frequency Cepstral Coefficients (MFCC). Fundamental and formant
frequencies refer to the frequency of vocal cord vibration, labelling the human
vocal tone in a quite unambiguous way; energy refers to the intensity of the vocal
signal, and Mel Frequency Cepstral Coefficients concern the spectrum of the audio
signal. Duration, rate and pause related features are also used, as well as different
types of voice quality features. In our work, for each audio file, an attribute
extraction process was performed. Initially the audio signal was sampled and split
into 10 ms frames, and for each of these frames 8 basic features were extracted. We
calculated prosodic features such as F0, F1, F2, F3, intensity-related features like
the energy and its high- and low-passed versions, and a spectral analysis made up of
the first 10 MFCC coefficients normalized by the Euclidean norm. After this first
step an 8-feature vector was obtained for each frame. In order to extract the
necessary features from this information, we considered for each attribute 3 time
series, i.e. the series itself, the series of its maxima and the series of its minima,
and we computed a set of statistical indices.</p>
      <p>In particular, for each series that describes one of the attributes over the N
frames, we computed 10 statistics: minimum, maximum, range (difference
between min and max), mean, median, first quartile, third quartile, interquartile
range, variance and mean of the absolute value of the local derivative. At the
end of this feature extraction process, each vocal signal is represented in a
feature space of 240 components (8 × 3 × 10). The entire feature extraction
process is depicted in Figure 2.</p>
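The statistics above can be sketched as follows. This is a minimal illustration assuming the per-frame attribute values are already available as an array of shape (n_frames, 8); the frame-level extraction itself (F0, formants, energy variants, MFCC norm) would be done with a speech-processing toolkit and is not shown:

```python
import numpy as np
from scipy.signal import argrelextrema

def series_statistics(s):
    """The 10 statistics computed for one series: min, max, range,
    mean, median, 1st/3rd quartile, interquartile range, variance,
    and mean absolute value of the local derivative."""
    q1, med, q3 = np.percentile(s, [25, 50, 75])
    return [s.min(), s.max(), s.max() - s.min(), s.mean(), med,
            q1, q3, q3 - q1, s.var(), np.abs(np.diff(s)).mean()]

def vocal_signature(frames):
    """frames: array of shape (n_frames, 8), one row per 10 ms frame.
    Returns the 240-dimensional vector (8 attributes x 3 series x 10 stats)."""
    features = []
    for j in range(frames.shape[1]):
        s = frames[:, j]
        # The 3 series per attribute: the series itself, its local
        # maxima and its local minima
        maxima = s[argrelextrema(s, np.greater)[0]]
        minima = s[argrelextrema(s, np.less)[0]]
        for series in (s, maxima, minima):
            if len(series) < 2:   # very short series: fall back to the raw series
                series = s
            features.extend(series_statistics(series))
    return np.array(features)
```

The fallback for short extrema series is an assumption added for robustness; the paper does not specify how degenerate cases are handled.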
    </sec>
    <sec id="sec-4">
      <title>Emotional State Inference Models</title>
      <p>The feature extraction phase, which creates a feature vector for each audio file,
allows us to consider emotion recognition as a generic machine learning problem.
The learning algorithms investigated in the following subsections can
be divided into Flat and Multi-Layer classification.</p>
      <sec id="sec-4-1">
        <title>Flat Classification</title>
        <p>The experimental investigation shows that the machine learning algorithm that
performs best is the one based on Support Vector Machines. It is interesting
to note that some similar emotions (similar in terms of vocal parameters), like
anger/joy, neutral/boredom and neutral/sadness, do not allow the classifier to
distinguish between them (see Emo-DB in Figure 3(c) and ITA-DB in Figure
3(d)). Another interesting remark, highlighted in Figure 3(b), concerns the
investigation of male and female emotion classification performed by two
distinct SVMs: learning gender-dependent models produces better performance than
a single model. This is because some features used to discriminate emotional states
are gender-dependent; the fundamental frequency F0 is one of them: women
usually have higher F0 values than men because of the different size of the
vocal tract, in particular the larynx. Starting from these conclusions, we defined a
multi-layer model based on the optimal learner, i.e. Support Vector Machines.</p>
<p>
As highlighted in the previous sections, inference models are influenced by
language, gender and "similar" emotional states. For these reasons we propose a
Multi-Layer Support Vector Machines approach that tries to overcome the
mentioned limitations. At the first layer a Gender Recognizer model is trained to
determine the gender of the speaker, distinguishing "male" speakers from
"female" ones. In order to avoid overlapping with other emotional states, at the
second layer gender-dependent models are trained. In particular, a Male Emotion
Detector and a Female Emotion Detector are induced to produce a binary
classification that discriminates the "excited" emotional states from the "not excited"
ones (i.e. the neutral emotion). The last layer of the hierarchical classification
process is aimed at recognizing the different emotional states using Male Emotion
Recognizer and Female Emotion Recognizer models, where only "excited"
sentences are used to train the models for discriminating the remaining emotional
states. A synthetic representation of the Multi-Layer Support Vector Machines is
depicted in Figure 4.</p>
<p>Since in this case too all the models embedded in the
hierarchy are based on Support Vector Machines, we experimentally estimated
the optimal parameter combination. The performance obtained by the
Multi-Layer Support Vector Machines is then compared with that provided by
the traditional "Flat" Support Vector Machines for both Emo-DB and ITA-DB.
The comparison reported in Figure 5(a) highlights the improvement, in terms
of the number of instances correctly classified, obtained by the Multi-Layer Support
Vector Machines with respect to the traditional model. Figure 5(b) shows the
classification performance of each intermediate layer of the hierarchy. This was
done to understand how the error rate is distributed among the different
classifiers of the hierarchy. As we go down the hierarchy layers the performance
gets worse, and in the last layer it suffers a remarkable reduction. This is because
the classifiers have different targets: in the root and in the first level, learning is
simplified by using only two classes, "male" and "female" for the root and "excited"
and "not excited" for the first-layer classifiers; in the last layer a more complex
discrimination is required: 6 emotions for Emo-DB and 4 for the ITA Emotional
DB. A further motivation, related to the decreasing number of instances used to
estimate the models in the lower layers, could explain the performance reduction. In
fact, while the Gender Recognizer can learn on the entire dataset, learning of the Male
and Female Emotion Detectors is performed on two subsets of the whole dataset:
the first model is trained using only male instances and the second one using
only female samples. The same happens for the last layer, i.e. the
Male Emotion Recognizer and Female Emotion Recognizer, which are induced
using "excited" male and "excited" female samples, respectively.
</p>
<p>In this paper the problem of producing semantic annotations for multimedia
recordings of judicial proceedings has been addressed. In particular, two main
contributions are given: the construction of an Italian emotional database for the
annotation of Italian proceedings, and the investigation of a multi-layer classification
system able to recognize emotional states from the vocal signal. The proposed model
outperforms traditional classification algorithms in terms of instances correctly
classified. In our investigation the evolution of a speaker's emotions is not considered.
We believe that taking into account the dynamics of the emotional process could
improve recognition performance. A further development will regard the fusion of
different information sources in order to produce a more accurate prediction.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgment</title>
      <p>This work has been supported by the European Community FP-7 under the
JUMAS Project (ref.: 214306).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>A.</given-names>
            <surname>Batliner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Fischer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Huber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Spilker</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Nöth</surname>
          </string-name>
          .
          <article-title>How to find trouble in communication</article-title>
          .
          <source>Speech Commun</source>
          .,
          <volume>40</volume>
          (
          <issue>1-2</issue>
          ):
          <fpage>117</fpage>
          -
          <lpage>143</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>F.</given-names>
            <surname>Burkhardt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Paeschke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rolfes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Sendlmeier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Weiss</surname>
          </string-name>
          .
          <article-title>A database of german emotional speech</article-title>
          .
          <source>In Interspeech</source>
          <year>2005</year>
          , pages
          <fpage>1517</fpage>
          -
          <lpage>1520</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>J.</given-names>
            <surname>Cichosz</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Slot</surname>
          </string-name>
          .
          <article-title>Application of selected speech-signal characteristics to emotion recognition in Polish language</article-title>
          .
          <source>In Proc. of the 5th International Conf.on signals and electronic systems</source>
          , pages
          <fpage>409</fpage>
          -
          <lpage>412</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>V.</given-names>
            <surname>Petrushin</surname>
          </string-name>
          .
          <article-title>Emotion recognition in speech signal: Experimental study, development, and application</article-title>
          .
          <source>In Proc. Sixth International Conf.on Spoken Language Processing (ICSLP</source>
          <year>2000</year>
          ), pages
          <fpage>222</fpage>
          -
          <lpage>225</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>S.</given-names>
            <surname>Jovicic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Kasic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dordevic</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Rajkovic</surname>
          </string-name>
          .
          <article-title>Serbian emotional speech database: design, processing and evaluation</article-title>
          .
          <source>In Proc. of the 9th Conf. on Speech and Computer.</source>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>B.</given-names>
            <surname>Schuller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Arsic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wallhoff</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Rigoll</surname>
          </string-name>
          .
          <article-title>Emotion recognition in the noise applying large acoustic feature sets</article-title>
          .
          <source>In Speech Prosody</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. Bjorn Schuller, Stephan Reiter, and
          <string-name>
            <given-names>Gerhard</given-names>
            <surname>Rigoll</surname>
          </string-name>
          .
          <article-title>Evolutionary feature generation in speech emotion recognition</article-title>
          .
          <source>In Proceeding of the 2005 IEEE International Conf.on Multimedia and Expo</source>
          , pages
          <fpage>5</fpage>
          -
          <lpage>8</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>M.H.</given-names>
            <surname>Sedaaghi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kotropoulos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Ververidis</surname>
          </string-name>
          .
          <article-title>Using adaptive genetic algorithms to improve speech emotion recognition</article-title>
          .
          <source>In Proc. of 9th Multimedia Signal Processing Workshop</source>
          , pages
          <fpage>461</fpage>
          -
          <lpage>464</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>P.-Y.</given-names>
            <surname>Oudeyer</surname>
          </string-name>
          .
          <article-title>Novel useful features and algorithms for the recognition of emotions in speech</article-title>
          .
          <source>In Proc. of the 1st International Conf.on Speech Prosody</source>
          , pages
          <fpage>547</fpage>
          -
          <lpage>550</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>P.-Y.</given-names>
            <surname>Oudeyer</surname>
          </string-name>
          .
          <article-title>The production and recognition of emotions in speech: features and algorithms</article-title>
          .
          <source>Int. J. Hum.-Comput. Stud.</source>
          ,
          <volume>59</volume>
          (
          <issue>1-2</issue>
          ):
          <fpage>157</fpage>
          -
          <lpage>183</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>