<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Sentiment Analysis from Utterances</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jiří Kožusznik</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Petr Pulc</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Holeňa</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Information Technology, Czech Technical University</institution>
          ,
          <addr-line>Thákurova 7, Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Computer Science, Czech Academy of Sciences</institution>
          ,
          <addr-line>Pod vodárenskou věží 2, Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <volume>2203</volume>
      <fpage>92</fpage>
      <lpage>99</lpage>
      <abstract>
        <p>The recognition of emotional states in speech is starting to play an increasingly important role. However, it is a complicated process, which heavily relies on the extraction and selection of utterance features related to the emotional state of the speaker. In the reported research, MPEG-7 low level audio descriptors [10] serve as features for the recognition of emotional categories. To this end, a methodology combining MPEG-7 with several important kinds of classifiers is elaborated.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The recognition of emotional states in speech is expected to play an increasingly important role in applications such as media retrieval systems, car management systems, call center applications, personal assistants and the like. In many languages it is common that the meaning of spoken words changes depending on the speaker's emotions, and consequently the emotional information is important in order to understand the intended meaning. Emotional speech recognition is a complicated process. Its performance heavily relies on the extraction and selection of features related to the emotional state of the speaker in the audio signal of an utterance.</p>
      <p>In the reported work in progress, we use MPEG-7 low
level audio descriptors [10] as features for the
recognition of emotional categories. To this end, we elaborate a
methodology combining MPEG-7 with several important
kinds of classifiers. For most of them, the methodology
has already been implemented and tested with the publicly
available Berlin Database of Emotional Speech [1].</p>
      <p>In the next section, the task of sentiment analysis from utterances is briefly sketched. Section 3 recalls the necessary background concerning MPEG-7 audio descriptors and the considered classification methods. In Section 4, the principles of the proposed approach are explained. Finally, Section 5 presents results of experimental testing and comparison of the already implemented classifiers on the publicly available Berlin database of emotional speech.</p>
      <p>Due to the importance of recognizing emotional states in speech, research into sentiment analysis from utterances has been emerging during recent years. We are aware of three publications reporting research with the same database of emotional utterances as we used, the Berlin Database of Emotional Speech. Let us recall each of them.</p>
      <p>The research most similar to ours has been reported in [12], where the authors also used MPEG-7 descriptors for sentiment analysis from utterances. However, they used only scalar MPEG-7 descriptors, or scalars derived from time-series descriptors using the software tools Sound Description Toolbox [13] and MPEG-7 Audio Reference Software Toolkit [2], whereas we are also implementing a long short-term memory network that will use the time series directly. They also used only one classifier in their experiments, a combination of a radial basis function network and a support vector machine.</p>
      <p>In [11], emotions are recognized using pitch and prosody features, which are mostly in the time domain. In that paper, too, the authors used only one classifier in their experiments, this time a support vector machine (SVM).</p>
      <p>The authors of [16] proposed a set of 68 new features, including some based on harmonic frequencies or on the Zipf distribution, for better speech emotion recognition. This set of features is used in a multi-stage classification. When performing the sentiment analysis of the Berlin Database, the classification of utterances into the considered emotional categories was preceded by a gender classification of the speakers, and the gender of the speaker was subsequently used as an additional feature for the classification of the utterances.</p>
    </sec>
    <sec id="sec-2">
      <title>MPEG-7 Audio Descriptors</title>
      <p>MPEG-7 is a standard for low-level description of audio
signals, describing a signal by means of the following
groups of descriptors [10]:
1. Basic: Audio Power (AP), Audio Waveform (AWF).</p>
      <p>Temporally sampled scalar values for general use,
applicable to all kinds of signals. The AP describes the
temporally-smoothed instantaneous power of samples in the frame; in other words, it is a measure of signal content as a function of time and
offers a quick summary of a signal in conjunction
with other basic spectral descriptors. The AWF
describes audio waveform envelope (minimum and
maximum), typically for display purposes.
2. Basic Spectral: Audio Spectrum Envelope (ASE),
Audio Spectrum Centroid (ASC), Audio Spectrum
Spread (ASS), Audio Spectrum Flatness (ASF).
All share a common basis, all deriving from the short
term audio signal spectrum (analysis of frequency
over time). They are all based on the ASE Descriptor,
which is a logarithmic-frequency spectrum. This
descriptor provides a compact description of the signal
spectral content and represents the similar
approximation of logarithmic response of the human ear. The
ASC descriptor is an indicator as to whether the
spectral content of a signal is dominated by high or low
frequencies. The ASC Descriptor could be
considered as an approximation of perceptual sharpness of
the signal. The ASS descriptor indicates whether the
signal content, as it is represented by the power
spectrum, is concentrated around its centroid or spread
out over a wider range of the spectrum. This gives
a measure which allows the distinction of noise-like
sounds from tonal sounds. The ASF describes the
flatness properties of the spectrum of an audio signal
for each of a number of frequency bands.
3. Basic Signal Parameters: Audio Fundamental
Frequency (AFF) and Audio Harmonicity (AH).</p>
      <p>The signal parameters constitute a simple
parametric description of the audio signal. This group
includes the computation of an estimate for the
fundamental frequency (F0) of the audio signal. The
AFF descriptor provides estimates of the
fundamental frequency in segments in which the audio signal
is assumed to be periodic. The AH represents the
harmonicity of a signal, allowing distinction between
sounds with a harmonic spectrum (e.g., musical tones
or voiced speech e.g., vowels), sounds with an
inharmonic spectrum (e.g., bell-like sounds) and sounds
with a non-harmonic spectrum (e.g., noise, unvoiced
speech).
4. Temporal Timbral: Log Attack Time (LAT),
Temporal Centroid (TC).</p>
      <p>Timbre refers to features that allow one to distinguish
two sounds that are equal in pitch, loudness and
subjective duration. These descriptors are taking into
account several perceptual dimensions at the same
time in a complex way. Temporal Timbral descriptors
describe the signal power function over time. The
power function is estimated as a local mean square
value of the signal amplitude value within a running
window. The LAT descriptor characterizes the
”attack” of a sound, the time it takes for the signal to
rise from silence to its maximum amplitude. This
feature signifies the difference between a sudden and a
smooth sound. The TC descriptor computes a
timebased centroid as the time average over the energy
envelope of the signal.
5. Timbral Spectral: Harmonic Spectral Centroid (HSC), Harmonic Spectral Deviation (HSD), Harmonic Spectral Spread (HSS), Harmonic Spectral Variation (HSV) and Spectral Centroid.
These are spectral features extracted in a
linear-frequency space. The HSC descriptor is defined
as the average, over the signal duration, of the
amplitude-weighted mean of the frequency of the
bins (the harmonic peaks of the spectrum) in the
linear power spectrum. It has a high correlation with
the perceptual feature of ”sharpness” of a sound. The
HSD descriptor measures the spectral deviation of the
harmonic peaks from the global envelope. The HSS
descriptor measures the amplitude-weighted standard
deviation (Root Mean Square) of the harmonic peaks
of the spectrum, normalized by the HSC. The HSV
descriptor is the normalized correlation between the
amplitude of the harmonic peaks between two
subsequent time-slices of the signal.
6. Spectral Basis: Audio Spectrum Basis (ASB) and Audio Spectrum Projection (ASP).</p>
      <sec id="sec-2-1">
        <title>Tools for Working with MPEG-7 Descriptors</title>
        <p>We utilized the Sound Description Toolbox [13] and
MPEG-7 Audio Analyzer - Low Level Descriptors
Extractor [15] for our experiments. Both of them extract a
number of MPEG-7 standard descriptors, both scalar ones and
time series. In addition, the SDT also calculates
perceptual features such as Mel Frequency Cepstral Coefficients,
Specific Loudness and Sensation Coefficients. From these descriptors, the SDT calculates means, covariances, means of first-order differences and covariances of first-order differences. The total number of features provided by this toolbox is 187.</p>
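        <p>As an illustration of how frame-level descriptors of this kind can be computed and then reduced to scalar features, the following NumPy sketch derives per-frame spectral centroid and spread, analogous in spirit to ASC and ASS. The frame length, hop size, window and aggregation are assumptions made for the example; this is not the authors' toolchain or the MPEG-7 reference software.</p>
        <preformat>
import numpy as np

def frame_signal(x, frame_len=1024, hop=512):
    """Split a mono signal into overlapping frames (a simplifying assumption)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def spectral_centroid_spread(x, sr, frame_len=1024, hop=512):
    """Per-frame spectral centroid and spread, analogous in spirit to ASC/ASS."""
    frames = frame_signal(x, frame_len, hop) * np.hanning(frame_len)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2           # power spectrum
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
    total = power.sum(axis=1) + 1e-12
    centroid = (power * freqs).sum(axis=1) / total              # centre of gravity
    spread = np.sqrt((power * (freqs - centroid[:, None]) ** 2).sum(axis=1) / total)
    return centroid, spread

def aggregate(series):
    """Reduce a descriptor time series to scalars: its mean and the mean of its
    first-order differences, in the spirit of the SDT-style aggregation."""
    return np.array([series.mean(), np.diff(series).mean()])

# Usage on one second of synthetic audio standing in for an utterance:
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
utterance = np.sin(2 * np.pi * 220 * t) + 0.1 * np.random.randn(sr)
asc, ass = spectral_centroid_spread(utterance, sr)
features = np.concatenate([aggregate(asc), aggregate(ass)])
print(features.shape)   # (4,) scalar features derived from the two time series
        </preformat>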
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Employed Classification Methods</title>
      <p>We have elaborated our approach to sentiment analysis
from utterances for six classification methods: k
nearest neighbors, support vector machines, multilayer
perceptrons, classification trees, random forests [7] and long
short-term memory (LSTM) network [5, 6, 8]. The first
five of them have already been implemented and tested (cf.
Section 5), the last and most advanced one is still being
implemented.
</p>
      <sec id="sec-3-0">
        <title>k Nearest Neighbours (kNN)</title>
        <p>A very traditional way of classifying a new feature vector x ∈ X, if a sequence of training data (x_1, c_1), . . . , (x_p, c_p) is available, is the nearest neighbour method: take the x_j that is closest to x among x_1, . . . , x_p, and assign to x the class assigned to x_j, i.e., c_j.</p>
        <p>
          A straightforward generalization of the nearest neighbour method is to take among x_1, . . . , x_p not one, but k feature vectors x_{j_1}, . . . , x_{j_k} closest to x. Then x is assigned the class c ∈ C fulfilling
|{i, 1 ≤ i ≤ k | c_{j_i} = c}| = max_{c′∈C} |{i, 1 ≤ i ≤ k | c_{j_i} = c′}|.   (1)
This method is called, expectedly, k nearest neighbours, or k-NN for short.</p>
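        <p>A minimal NumPy sketch of the k-NN rule (1). The Euclidean distance and the resolution of ties by argmax are assumptions of this illustration, not prescribed by the method.</p>
        <preformat>
import numpy as np

def knn_predict(x, X_train, c_train, k=9):
    """Assign to x the class most frequent among its k nearest training vectors."""
    dists = np.linalg.norm(X_train - x, axis=1)           # distances to all x_1..x_p
    nearest = np.argsort(dists)[:k]                        # indices j_1..j_k
    classes, counts = np.unique(c_train[nearest], return_counts=True)
    return classes[np.argmax(counts)]                      # majority class, cf. (1)

# Usage with random placeholder data (187-dimensional features, 7 emotion labels):
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 187))
c_train = rng.integers(0, 7, size=500)
print(knn_predict(rng.normal(size=187), X_train, c_train, k=9))
        </preformat>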
      </sec>
      <sec id="sec-3-1">
        <title>Support Vector Machines (SVM)</title>
        <p>Support vector machines are classifiers for two classes. The method attempts to derive from the training data (x_1, c_1), . . . , (x_p, c_p) the best possible generalization to unseen feature vectors.</p>
        <p>If both classes, more precisely their intersections with the set {x_1, . . . , x_p} of training inputs, are linearly separable in the space of feature vectors, the method constructs two parallel hyperplanes H_+ = {x ∈ R^n | x⊤w + b_+ = 0}, H_− = {x ∈ R^n | x⊤w + b_− = 0} such that the training data fulfil
c_k = 1 if x_k⊤w + b_+ ≥ 0,  c_k = −1 if x_k⊤w + b_− ≤ 0,  k = 1, . . . , p,   (2)
H_+ ∩ {x_1, . . . , x_p} ≠ ∅,  H_− ∩ {x_1, . . . , x_p} ≠ ∅.   (3)</p>
        <p>The hyperplanes H_+ and H_− are called support hyperplanes. Their common normal vector w and intercepts b_+, b_− are obtained through solving the following constrained optimization task: maximize with respect to w, b_+, b_− the distance
d(H_+, H_−) = (b_+ − b_−) / ‖w‖   (4)
on condition that the p inequalities (2) hold.</p>
        <p>The distance (4) is commonly called margin. The solution to this optimization task coincides with the saddle point (w*, b_+*, b_−*, α_1*, . . . , α_p*) of the Lagrange function
L(w, b_+, b_−, α_1, . . . , α_p) = ‖w‖² + Σ_{k=1}^{p} α_k ((b_+ − b_−)/2 − c_k x_k⊤w),   (5)
where α_1, . . . , α_p ≥ 0 are Lagrange coefficients of the optimization task. Once the saddle point (w*, b_+*, b_−*, α_1*, . . . , α_p*) is found, the classifier is defined by
φ(x) = 1 if Σ_{x_k∈S} α_k* c_k x⊤x_k + b* ≥ 0,  φ(x) = −1 if Σ_{x_k∈S} α_k* c_k x⊤x_k + b* &lt; 0,   (6)
where b* = (b_+* + b_−*)/2 and
S = {x_k | α_k* &gt; 0}.   (7)</p>
        <p>
          Due to the Karush-Kuhn-Tucker (KKT) conditions,
α_k* ((b_+* − b_−*)/2 − c_k x_k⊤w*) = 0,  k = 1, . . . , p,   (8)
all feature vectors from the set S lie on one of the support hyperplanes (3). Therefore, they are called support vectors. This name, together with the observation that they completely determine the classifier defined in (6), explains why such a classifier is called a support vector machine.
        </p>
        <p>
          If the intersections of both classes with the training inputs are not linearly separable, an SVM is constructed similarly, but instead of the set of possible feature vectors, now the set of functions
κ(·, x) for all possible feature vectors x   (9)
is considered, where κ is a kernel, i.e., a mapping on pairs of feature vectors that is symmetric and such that for any k ∈ N and any sequence of different feature vectors x_1, . . . , x_k, the matrix
        </p>
        <p>
G_κ(x_1, . . . , x_k) = [κ(x_1, x_1) … κ(x_1, x_k); … ; κ(x_k, x_1) … κ(x_k, x_k)],   (10)
which is called the Gram matrix of x_1, . . . , x_k, is positive semidefinite, i.e.,
(∀y ∈ R^k) y⊤ G_κ(x_1, . . . , x_k) y ≥ 0.   (11)
        </p>
        <p>The most commonly used kinds of kernels are the Gaussian kernel with a parameter ς &gt; 0,
(∀x, x′ ∈ R^n) κ(x, x′) = exp(−(1/ς) ‖x − x′‖²),   (12)
and the polynomial kernel with parameters d ∈ N and c ≥ 0,
(∀x, x′ ∈ R^n) κ(x, x′) = (x⊤x′ + c)^d.   (13)</p>
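        <p>For illustration, the two kernels (12) and (13) and the Gram matrix (10) can be written down directly; checking that the eigenvalues of the Gram matrix are non-negative is a numerical way of verifying positive semidefiniteness (11). The parameter values below are arbitrary choices for the example.</p>
        <preformat>
import numpy as np

def gaussian_kernel(x, x2, sigma=1.0):
    """Gaussian kernel (12) with parameter sigma > 0."""
    return np.exp(-np.sum((x - x2) ** 2) / sigma)

def polynomial_kernel(x, x2, d=3, c=1.0):
    """Polynomial kernel (13) with degree d and offset c >= 0."""
    return (np.dot(x, x2) + c) ** d

def gram_matrix(kernel, X):
    """Gram matrix (10) of the feature vectors stored in the rows of X."""
    return np.array([[kernel(a, b) for b in X] for a in X])

X = np.random.default_rng(1).normal(size=(5, 187))
G = gram_matrix(gaussian_kernel, X)
print(np.all(np.linalg.eigvalsh(G) >= -1e-10))   # positive semidefinite, cf. (11)
        </preformat>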
        <p>
          It is known [14] that, due to the properties of kernels, if the joint distribution of a sequence of different feature vectors x_1, . . . , x_k is continuous, then almost surely any proper subset of the set of functions {κ(·, x_1), . . . , κ(·, x_k)} is, in the space of all functions (9), linearly separable from its complement.
        </p>
        <p>
          However, the feature vectors x and x_k cannot simply be replaced by the corresponding functions κ(·, x) and κ(·, x_k) in the definition (6) of an SVM classifier, because a transpose x⊤ exists for a finite-dimensional vector, but not for an infinite-dimensional function. Fortunately, the transpose occurs in (6) only as a part of the scalar product x⊤x_k, and a scalar product can also be defined on the space of all functions (9). Namely, the function that assigns to a pair of functions (κ(·, x), κ(·, x′)) the value κ(x, x′) has the properties of a scalar product. Using this scalar product in (6), we obtain the following definition of an SVM classifier for linearly non-separable classes:
φ(x) = 1 if Σ_{x_k∈S} α_k* c_k κ(x, x_k) + b* ≥ 0,  φ(x) = −1 if Σ_{x_k∈S} α_k* c_k κ(x, x_k) + b* &lt; 0.   (14)
        </p>
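        <p>A short sketch of one-versus-rest SVMs with a Gaussian kernel, using scikit-learn instead of solving the optimization task (5) directly. The library, its default regularization and the placeholder data are assumptions of this example, not the toolchain used for the reported experiments.</p>
        <preformat>
import numpy as np
from sklearn.svm import SVC

# Toy data standing in for the 187-dimensional utterance features and 7 emotions.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(300, 187)), rng.integers(0, 7, size=300)

# One binary SVM per emotion, each trained to separate that emotion from the rest;
# the Gaussian (RBF) kernel corresponds to (12).
classifiers = {}
for emotion in np.unique(y):
    clf = SVC(kernel="rbf", gamma="scale")
    clf.fit(X, (y == emotion).astype(int))
    classifiers[emotion] = clf

# Decision values of the classifier (14); the emotion with the largest value wins.
x_new = rng.normal(size=(1, 187))
scores = {e: clf.decision_function(x_new)[0] for e, clf in classifiers.items()}
print(max(scores, key=scores.get))
        </preformat>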
      </sec>
      <sec id="sec-3-2">
        <title>Multilayer Perceptrons (MLP)</title>
        <p>
          A multilayer perceptron is a mapping φ of feature vectors to classes with which a directed graph G_φ = (V, E) is associated. Due to the inspiration from biological neural networks, the vertices of G_φ are called neurons and its edges are called connections. In addition, G_φ is required to have a layered structure, which means that the set V of neurons can be decomposed into L + 1 mutually disjoint layers, V = V_0 ∪ V_1 ∪ · · · ∪ V_L, L ≥ 2, such that
(∀(u, v) ∈ E) u ∈ V_i, i = 0, . . . , L − 1 &amp; v ∉ V_i ⇒ v ∈ V_{i+1}.   (15)
The layer I = V_0 is called the input layer of the MLP, the layer O = V_L its output layer, and the layers H_1 = V_1, . . . , H_{L−1} = V_{L−1} its hidden layers.
        </p>
        <p>The purpose of the graph Gφ associated with the
mapping φ is to define a decomposition of φ into simple
mappings assigned to hidden and output neurons and to
connections between neurons (input neurons normally only
accept the components of the input, and no mappings are
assigned to them). Inspired by biological terminology,
mappings assigned to neurons are called somatic, those
assigned to connections are called synaptic.</p>
        <p>To each connection (u, v) ∈ E , the multiplication by a
weight w_{(u,v)} is assigned as a synaptic mapping:
(∀ξ ∈ R) f_{(u,v)}(ξ) = w_{(u,v)} ξ.   (16)</p>
        <p>To each hidden neuron v ∈ H_i, the following somatic mapping is assigned:
(∀ξ ∈ R^{|in(v)|}) f_v(ξ) = ϕ(Σ_{u∈in(v)} [ξ]_u + b_v),   (17)
where [ξ]_u for u ∈ in(v) denotes the component of ξ that is the output of the synaptic mapping f_{(u,v)} assigned to the connection (u, v), in(v) = {u ∈ V | (u, v) ∈ E} is the input set of v, and ϕ : R → R is called the activation function. As activation functions, sigmoidal functions are typically used in applications, i.e., functions that are non-decreasing, piecewise continuous, and such that
−∞ &lt; lim_{t→−∞} ϕ(t) &lt; lim_{t→∞} ϕ(t) &lt; ∞.   (18)</p>
        <p>The activation functions most frequently encountered in MLPs are:
• the logistic function, (∀t ∈ R) ϕ(t) = 1 / (1 + e^{−t});   (19)
• the hyperbolic tangent, (∀t ∈ R) ϕ(t) = tanh t = (e^t − e^{−t}) / (e^t + e^{−t}).   (20)</p>
        <p>To an output neuron v ∈ O, also a somatic mapping of the
kind (17) with the activation functions (19) or (20) can be
assigned. If this is the case, then the class c predicted for a feature vector x is obtained as c = arg max_i (φ(x))_i, where (φ(x))_i denotes the i-th component of φ(x). Alternatively, the activation function assigned to an output neuron can be the step function, aka the Heaviside function,
ϕ(t) = 0 if t &lt; 0,  ϕ(t) = 1 if t ≥ 0.   (21)</p>
        <p>In that case, the value (φ(x))_c already directly indicates whether x belongs to the class c.</p>
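        <p>A minimal NumPy forward pass through an MLP of the 187-70-7 kind described above, with logistic activations (19) in the hidden layer and the arg max rule at the output. The random weights are placeholders that would normally be obtained by training.</p>
        <preformat>
import numpy as np

def logistic(t):
    """Logistic activation function (19)."""
    return 1.0 / (1.0 + np.exp(-t))

def mlp_predict(x, W1, b1, W2, b2):
    """Forward pass: synaptic mappings are weight multiplications (16); somatic
    mappings sum the incoming values, add a bias and apply the activation (17)."""
    hidden = logistic(W1 @ x + b1)        # hidden layer, 70 neurons
    output = logistic(W2 @ hidden + b2)   # output layer, 7 neurons
    return int(np.argmax(output))         # predicted class c = arg max_i (phi(x))_i

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(70, 187)), np.zeros(70)
W2, b2 = rng.normal(size=(7, 70)), np.zeros(7)
print(mlp_predict(rng.normal(size=187), W1, b1, W2, b2))
        </preformat>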
      </sec>
      <sec id="sec-3-3">
        <title>Classification Trees (CT)</title>
        <p>A classifier φ : X → C = {c1, . . . , cm} is called binary
classification tree, if there is a binary tree Tφ = (Vφ , Eφ )
with vertices Vφ and edges Eφ such that:
(i) Vφ = {v1, . . . , vL, . . . , v2L−1}, where L ≥ 2, v1 is the
root of Tφ , v1, . . . , vL−1 are its forks and vL, . . . , v2L−1
are its leaves.
(ii) If the children of a fork v ∈ {v1, . . . , vL−1} are vL ∈ Vφ
(left child) and vR ∈ Vφ (right child) and if v = vi, vL =
v j, vR = vk, then i &lt; j &lt; k.
(iii) To each fork v ∈ {v1, . . . , vL−1}, a predicate ϕv of
some formal logic is assigned, evaluated on features
of the input vectors x ∈ X .
(iv) To each leaf v ∈ {vL, . . . , v2L−1}, a class cv ∈ C is
assigned.
(v) For each input x ∈ X , the predicate ϕv1 assigned to
the root is evaluated.
(vi) If for a fork v ∈ {v1, . . . , vL−1}, the predicate ϕv
evaluates true, then φ (x) = cvL in case vL is already a leaf,
and the predicate ϕvL is evaluated in case vL is still a
fork.
(vii) If for a fork v ∈ {v1, . . . , vL−1}, the predicate ϕv
evaluates false, then φ (x) = cvR in case vR is already a
leaf, and the predicate ϕvR is evaluated in case vR is
still a fork.
</p>
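        <p>The following sketch traverses a small hand-written binary classification tree according to rules (v)-(vii). The features, thresholds and class names are invented purely for illustration.</p>
        <preformat>
# Each fork is (predicate, left_child, right_child); each leaf is a class label.
# Predicates are evaluated on the feature vector x, as in (iii).
tree = (lambda x: x[0] > 0.5,                          # predicate of the root fork
        (lambda x: x[1] > 0.0, "anger", "neutral"),    # left subtree
        "sadness")                                     # right child is a leaf

def classify(node, x):
    """Follow the true branch (vi) or the false branch (vii) until a leaf."""
    if isinstance(node, str):        # leaf: return its assigned class (iv)
        return node
    predicate, left, right = node
    return classify(left if predicate(x) else right, x)

print(classify(tree, [0.7, -0.2]))   # root true, inner fork false -> "neutral"
        </preformat>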
      </sec>
      <sec id="sec-3-4">
        <title>Random Forests (RF)</title>
        <p>Random Forests are ensembles of classifiers in which the
individual members are classification trees. They are
constructed by bagging, i.e., bootstrap aggregation of
individual trees, which consists in training each member of the
ensemble with another set of training data, sampled
randomly with replacement from the original training pairs
(x1, c1), . . . , (xp, cp). Typical sizes of random forests
encountered in applications are dozens to thousands of trees.
Subsequently, when new subjects are input to the forest,
each tree classifies them separately, according to the leaves
at which they end, and the final classification by the
forest is obtained by means of an aggregation function. The
usual aggregation function of random forests is majority
voting, or some of its fuzzy generalizations.</p>
        <p>According to which kind of randomness is involved in
the construction of the ensemble, two broad groups of
random forests can be differentiated:
1. Random forests grown in the full input space. Each
tree is trained using all considered input features.
Consequently, any feature has to be taken into
account when looking for the split condition assigned
to an inner node of the tree. However, features
actually occurring in the split conditions can be different
from tree to tree, as a consequence of the fact that
each tree is trained with another set of training data.
For the same reason, even if a particular feature
occurs in split conditions of two different trees, those
conditions can be assigned to nodes at different
levels of the tree.</p>
        <p>A great advantage of this kind of random forests is
that each tree is trained using all the information
available in its set of training data. Its main
disadvantage is high computational complexity. In addition, if several variables, or even only one, are very noisy, that noise nonetheless gets incorporated into all trees in the forest. Because of those disadvantages, random
forests are grown in the complete input space
primarily if its dimension is not high and no input feature is
substantially noisier than the remaining ones.
2. Random forests grown in subspaces of the input
space. Each tree is trained using only a randomly
chosen fraction of features, typically a small one.
This means that a tree t is actually trained with
projections of the training data into a low-dimensional
space spanned by some randomly selected
dimensions i_{t,1} ≤ · · · ≤ i_{t,d_t} ∈ {1, . . . , d}, where d is the dimension of the input space, and d_t is typically much smaller than d. Using only a subset of features not only makes forest training much faster, but also makes it possible to eliminate noise originating from only a few features. The price paid for both these advantages is that training makes use of only a part of the information available in the training data.</p>
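        <p>A sketch of bagging with decision trees and majority voting. The bootstrap sampling and the optional per-tree random subspace follow the two variants described above; the scikit-learn tree learner and the subspace size of 20 features are assumptions of this illustration.</p>
        <preformat>
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_bagged_forest(X, y, n_trees=50, n_features=None, seed=0):
    """Train each tree on a bootstrap sample; optionally grow it in a random
    subspace spanned by n_features randomly chosen input dimensions."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    forest = []
    for _ in range(n_trees):
        rows = rng.integers(0, len(X), size=len(X))      # sampling with replacement
        cols = (np.arange(d) if n_features is None
                else np.sort(rng.choice(d, size=n_features, replace=False)))
        tree = DecisionTreeClassifier(random_state=0)
        tree.fit(X[np.ix_(rows, cols)], y[rows])
        forest.append((tree, cols))
    return forest

def forest_predict(forest, x):
    """Aggregate the individual trees' votes by majority voting."""
    votes = [int(tree.predict(x[cols].reshape(1, -1))[0]) for tree, cols in forest]
    return max(set(votes), key=votes.count)

rng = np.random.default_rng(1)
X, y = rng.normal(size=(200, 187)), rng.integers(0, 7, size=200)
forest = fit_bagged_forest(X, y, n_trees=50, n_features=20)   # subspace variant
print(forest_predict(forest, rng.normal(size=187)))
        </preformat>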
      </sec>
      <sec id="sec-3-5">
        <title>Long Short-Term Memory (LSTM)</title>
        <p>An LSTM network is used for classification of sequences
of feature vectors, or equivalently, multidimensional time
series with discrete time. Alternatively, it can also be employed to obtain sequences of such classifications, i.e., in situations when the neural network input is a sequence of feature vectors and its output is a sequence of classes. Differently from most other commonly encountered kinds of artificial neural networks, an LSTM layer connects not simple neurons, but units with their own inner structure. Several variants of an LSTM have been proposed (e.g., [5, 6]), all of which include at least the four kinds of units described below. Each of them has certain properties of usual ANN neurons; in particular, the values assigned to them depend, apart from a bias, on values assigned to the unit input at the same time step and on values assigned to the unit output at the previous time step. Hence, an LSTM layer is a recurrent network.
(i) Memory cells can store values, aka cell states, for an
arbitrary time. They have no activation function, thus
their output is actually a biased linear combination of
unit inputs and of the values from the previous time
step coming through recurrent connections.
(ii) Input gate controls the extent to which values from
the previous unit or from the preceding layer
influence the value stored in the memory cell. It has a
sigmoidal activation function, which is applied to a
biased linear combination of unit inputs and of
values from the previous time step, though the bias and
synaptic weights of the input and recurrent
connections are specific and in general different from the
bias and synaptic weights of the memory cell.
(iii) Forget gate controls the extent to which the memory
cell state is suppressed. It again has a sigmoidal
activation function, which is applied to a specific biased
linear combination of unit inputs and of values from
the previous time step.
(iv) Output gate controls the extent to which the memory
cell state influences the unit output. Also this gate
has a sigmoidal activation function, which is applied
to a specific biased linear combination of unit inputs
and of values from the previous time step, and
subsequently composed either directly with the cell state or
with its sigmoidal transformation, using another
sigmoid than is used by the gates.
</p>
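        <p>A single time step of one LSTM unit, written out in NumPy to make the roles of the memory cell and the three gates explicit. The weight shapes and the use of tanh for the cell input and for the output transformation are one common choice and an assumption of this sketch, not the only possibility.</p>
        <preformat>
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One time step: W, U, b hold separate parameters for the input gate (i),
    forget gate (f), output gate (o) and the memory cell input (g)."""
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # input gate
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])   # forget gate
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])   # output gate
    g = np.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])   # cell input
    c_t = f * c_prev + i * g                               # new cell state
    h_t = o * np.tanh(c_t)                                 # unit output
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_units = 187, 32
W = {k: rng.normal(size=(n_units, n_in)) for k in "ifog"}
U = {k: rng.normal(size=(n_units, n_units)) for k in "ifog"}
b = {k: np.zeros(n_units) for k in "ifog"}
h, c = np.zeros(n_units), np.zeros(n_units)
for x_t in rng.normal(size=(10, n_in)):     # a sequence of 10 feature vectors
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape)
        </preformat>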
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experimental Testing</title>
      <sec id="sec-4-1">
        <title>Berlin Database of Emotional Speech</title>
        <p>For the evaluation of already implemented classifiers, we
used the publicly available dataset ”EmoDB”, aka Berlin
database of emotional speech. It consists of 535 emotional
utterances in 7 emotional categories, namely anger,
boredom, disgust, fear, happiness, sadness and neutral. These
utterances are sentences read by 10 professional actors, 5
males and 5 females [1], which were recorded in an
anechoic chamber under supervision by linguists and
psychologists. The actors were advised to read these
predefined sentences in the targeted emotional categories, but
the sentences do not contain any emotional bias. A human
perception test was conducted with 20 persons, different
from the speakers, in order to evaluate the quality of the
recorded data with respect to recognisability and
naturalness of presented emotion. This evaluation yielded a mean
accuracy of 86% over all emotional categories.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Experimental Settings</title>
        <p>As input features, the outputs from the Sound Description
Toolbox were used. Consequently, the input dimension
was 187. The already implemented classifiers were
compared by means of a 10-fold cross-validation, using the
following settings for each of them:
• For the k nearest neighbors classification, the value
k = 9 was chosen by a grid method from ⟨1, 80⟩. This
classifier was applied to data normalized to zero mean
and unit variance.
• Support vector machines are constructed for each of
the 7 considered emotions, to classify between that
emotion and all the remaining ones. They employ
auto-scaled Gaussian kernels and do not use slack
variables.
• The MLP has 1 hidden layer with 70 neurons. Hence,
taking into account the input dimension and the
number of classes, the overall architecture of the MLP is
187-70-7.
• Classification trees are restricted to have at most 23
leaves. This upper limit was chosen by a grid method
from ⟨1, 50⟩, taking into account the way
classification trees are grown in their Matlab
implementation.
• Random forests consist of 50 classification trees,
each of them taking over the above restriction. The
number of trees was selected by a grid method from
{10, 20, . . . , 100}.</p>
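        <p>The settings above translate into the following scikit-learn sketch of the 10-fold cross-validation. The reported experiments use a different implementation, so the scikit-learn estimators, their remaining defaults and the placeholder data are only stand-ins with the corresponding hyperparameters filled in.</p>
        <preformat>
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

classifiers = {
    # kNN on features normalized to zero mean and unit variance, k = 9
    "kNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=9)),
    # one-vs-rest Gaussian-kernel SVMs with automatic kernel scaling
    "SVM": SVC(kernel="rbf", gamma="scale", decision_function_shape="ovr"),
    # 187-70-7 multilayer perceptron
    "MLP": MLPClassifier(hidden_layer_sizes=(70,), max_iter=1000),
    # classification tree with at most 23 leaves
    "CT": DecisionTreeClassifier(max_leaf_nodes=23),
    # random forest of 50 such trees
    "RF": RandomForestClassifier(n_estimators=50, max_leaf_nodes=23),
}

# X: 535 x 187 feature matrix, y: emotion labels (placeholders here).
rng = np.random.default_rng(0)
X, y = rng.normal(size=(535, 187)), rng.integers(0, 7, size=535)
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
        </preformat>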
      </sec>
      <sec id="sec-4-3">
        <title>Comparison of Already Implemented Classifiers</title>
        <p>First, we compared the already implemented classifiers on
the whole Berlin database of emotional speech, with
respect to accuracy and area under the ROC curve (area
under curve, AUC). Since a ROC curve makes sense only
for a binary classifier, we computed areas under 7
separate curves corresponding to classifiers classifying always one emotion against the rest. The results are presented in Table 1 and in Figure 1. They clearly show SVM as the most promising classifier. It has the highest accuracy, as well as the highest AUC for the binary classifiers corresponding to 5 of the 7 emotions.</p>
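        <p>The per-emotion AUC values can be obtained by treating each emotion as the positive class of a binary problem. In the sketch below, out-of-fold decision-function values of a one-vs-rest SVM are scored with scikit-learn; the library and the placeholder data are assumptions of the illustration.</p>
        <preformat>
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = rng.normal(size=(535, 187)), rng.integers(0, 7, size=535)   # placeholders

# Out-of-fold decision values for each of the 7 one-vs-rest classifiers.
scores = cross_val_predict(SVC(kernel="rbf", gamma="scale"), X, y,
                           cv=10, method="decision_function")
for emotion in range(7):
    auc = roc_auc_score((y == emotion).astype(int), scores[:, emotion])
    print(f"emotion {emotion}: AUC = {auc:.3f}")
        </preformat>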
        <p>Then we compared the classifiers separately on the
utterances of each of the 10 speakers who created the
database. The results are summarized in Table 2 for
accuracy and Table 3 for AUC averaged over all 7
emotions. They indicate a great difference between most of
the compared classifiers. This is confirmed by the
Friedman test of the hypotheses that all classifiers have equal
accuracy and equal average AUC. The Friedman test
rejected both hypotheses with a high significance: With
the Holm correction for simultaneously tested
hypotheses [9], the achieved significance level (aka p-value) was
4 · 10−6. For both hypotheses, posthoc tests according to
[3, 4] were performed, testing equal accuracy and equal
average AUC between individual pairs of classifiers. For
the family-wise significance level of 5%, they reveal the following Holm-corrected significant differences between individual pairs of classifiers: (SVM, DT) and (MLP, DT), both for accuracy and for averaged AUC, and in addition (kNN, SVM) and (SVM, RF) for accuracy.</p>
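        <p>For illustration, the Friedman test over per-speaker results and a Holm correction of pairwise p-values can be sketched as follows. This uses scipy and statsmodels as a simplified stand-in for the post hoc procedure of [3, 4], and the numbers are random placeholders, not the reported results.</p>
        <preformat>
import numpy as np
from itertools import combinations
from scipy.stats import friedmanchisquare, wilcoxon
from statsmodels.stats.multitest import multipletests

# acc[c][s]: accuracy of classifier c on the utterances of speaker s (placeholders).
rng = np.random.default_rng(0)
acc = {name: rng.uniform(0.4, 0.9, size=10)
       for name in ["kNN", "SVM", "MLP", "CT", "RF"]}

# Friedman test of the hypothesis that all classifiers have equal accuracy.
stat, p = friedmanchisquare(*acc.values())
print(f"Friedman: p = {p:.2g}")

# Pairwise comparisons with Holm-corrected p-values (a simplified post hoc step).
pairs = list(combinations(acc, 2))
raw_p = [wilcoxon(acc[a], acc[b]).pvalue for a, b in pairs]
reject, corrected, _, _ = multipletests(raw_p, alpha=0.05, method="holm")
for (a, b), pc, rej in zip(pairs, corrected, reject):
    print(f"{a} vs {b}: corrected p = {pc:.3f}, significant = {rej}")
        </preformat>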
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>The presented work in progress investigated the
possibilities of analysing emotions in utterances based on MPEG-7 features. So far, we have implemented only five classification methods that do not use time-series features, but only 187 scalar features, namely the k nearest neighbours classifier, support vector machines, multilayer perceptrons, decision trees and random forests. The obtained results indicate that especially support vector machines and multilayer perceptrons are quite successful for this task. Statistical testing confirms significant differences between these two kinds of classifiers on the one hand, and decision trees and random forests on the other hand.</p>
      <p>The next step in this ongoing research is to implement
the long short-term memory neural network, recalled in
Subsection 4.6, because it can work not only with scalar
features but also with features represented as time series.</p>
      <sec id="sec-5-1">
        <title>Acknowledgement</title>
        <p>The research reported in this paper has been supported by
the Czech Science Foundation (GAČR) grant 18-18080S.</p>
        <p>Figure 1: ROC curves for all emotions on the whole Berlin database.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F.</given-names>
            <surname>Burkhardt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Paeschke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rolfes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Sendlmeier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Weiss</surname>
          </string-name>
          .
          <article-title>A database of german emotional speech</article-title>
          .
          <source>In Interspeech</source>
          , pages
          <fpage>1517</fpage>
          -
          <lpage>1520</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Casey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>De Cheveigne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gardner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jackson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Peeters</surname>
          </string-name>
          .
          <article-title>MPEG-7 multimedia software resources</article-title>
          . http://mpeg7.doc.gold.ac.uk/,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Demšar</surname>
          </string-name>
          .
          <article-title>Statistical comparisons of classifiers over multiple data sets</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>7</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>30</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Garcia</surname>
          </string-name>
          and
          <string-name>
            <given-names>F.</given-names>
            <surname>Herrera</surname>
          </string-name>
          .
          <article-title>An extension on “Statistical Comparisons of Classifiers over Multiple Data Sets” for all pairwise comparisons</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>9</volume>
          :
          <fpage>2677</fpage>
          -
          <lpage>2694</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F.A.</given-names>
            <surname>Gers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Cummins</surname>
          </string-name>
          .
          <article-title>Learning to forget: Continual prediction with LSTM</article-title>
          .
          <source>In 9th International Conference on Artificial Neural Networks: ICANN '99</source>
          , pages
          <fpage>850</fpage>
          -
          <lpage>855</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Graves</surname>
          </string-name>
          .
          <article-title>Supervised Sequence Labelling with Recurrent Neural Networks</article-title>
          .
          <source>PhD thesis</source>
          , TU München,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Hastie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tibshirani</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Friedman</surname>
          </string-name>
          .
          <source>The Elements of Statistical Learning, 2nd Edition</source>
          . Springer,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          .
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural Computation</source>
          ,
          <volume>9</volume>
          :
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Holm</surname>
          </string-name>
          .
          <article-title>A simple sequentially rejective multiple test procedure</article-title>
          .
          <source>Scandinavian Journal of Statistics</source>
          ,
          <volume>6</volume>
          :
          <fpage>65</fpage>
          -
          <lpage>70</lpage>
          ,
          <year>1979</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H.G.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Moreau</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Sikora</surname>
          </string-name>
          .
          <article-title>MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval</article-title>
          . John Wiley and Sons, New York,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Lalitha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Madhavan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bhusan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Saketh</surname>
          </string-name>
          .
          <article-title>Speech emotion recognition</article-title>
          .
          <source>In International Conference on Advances in Electronics</source>
          , pages
          <fpage>92</fpage>
          -
          <lpage>95</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.S.</given-names>
            <surname>Lampropoulos</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.A.</given-names>
            <surname>Tsihrintzis</surname>
          </string-name>
          .
          <article-title>Evaluation of MPEG-7 descriptors for speech emotional recognition</article-title>
          .
          <source>In Eighth International Conference on Intelligent Information Hiding and Multimedia Signal Processing</source>
          , pages
          <fpage>98</fpage>
          -
          <lpage>101</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rauber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lidy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Frank</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Benetos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Zenz</surname>
          </string-name>
          , G. Bertini,
          <string-name>
            <given-names>T.</given-names>
            <surname>Virtanen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.T.</given-names>
            <surname>Cemgil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Godsill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Peeling</surname>
          </string-name>
          , E. Peisyer,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Laprie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sloin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Alfandary</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Burshtein</surname>
          </string-name>
          .
          <article-title>MUSCLE network of excellence: Multimedia understanding through semantics, computation and learning</article-title>
          .
          <source>Technical report</source>
          , TU Vienna, Information and Software Engineering Group,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>B.</given-names>
            <surname>Schölkopf</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.J.</given-names>
            <surname>Smola</surname>
          </string-name>
          .
          <article-title>Learning with Kernels</article-title>
          . MIT Press, Cambridge,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>T.</given-names>
            <surname>Sikora</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.G.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Moreau</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Amjad</surname>
          </string-name>
          .
          <article-title>MPEG-7-based audio annotation for the archival of digital video</article-title>
          . http://mpeg7lld.nue.tu-berlin.de/,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Dellandrea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dou</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <article-title>Multi-stage classification of emotional speech motivated by a dimensional emotion model</article-title>
          .
          <source>Multimedia Tools and Applications</source>
          ,
          <volume>46</volume>
          :
          <fpage>119</fpage>
          -
          <lpage>145</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>