=Paper=
{{Paper
|id=Vol-2203/92
|storemode=property
|title=Sentiment Analysis from Utterances
|pdfUrl=https://ceur-ws.org/Vol-2203/92.pdf
|volume=Vol-2203
|authors=Jiri Kozusznik,Petr Pulc,Martin Holena
|dblpUrl=https://dblp.org/rec/conf/itat/KozusznikPH18
}}
==Sentiment Analysis from Utterances==
S. Krajči (ed.): ITAT 2018 Proceedings, pp. 92–99
CEUR Workshop Proceedings Vol. 2203, ISSN 1613-0073, © 2018 Jiří Kožusznik, Petr Pulc, and Martin Holeňa
Sentiment Analysis from Utterances
Jiří Kožusznik1, Petr Pulc1,2, Martin Holeňa2
1 Faculty of Information Technology, Czech Technical University, Thákurova 7, Prague, Czech Republic
2 Institute of Computer Science, Czech Academy of Sciences, Pod vodárenskou věží 2, Prague, Czech Republic
Abstract: The recognition of emotional states in speech is starting to play an increasingly important role. However, it is a complicated process, which heavily relies on the extraction and selection of utterance features related to the emotional state of the speaker. In the reported research, MPEG-7 low-level audio descriptors [10] serve as features for the recognition of emotional categories. To this end, a methodology combining MPEG-7 with several important kinds of classifiers is elaborated.

1 Introduction

The recognition of emotional states in speech is expected to play an increasingly important role in applications such as media retrieval systems, car management systems, call center applications, personal assistants and the like. In many languages it is common that the meaning of spoken words changes depending on the speaker's emotions, and consequently the emotional information is important in order to understand the intended meaning. Emotional speech recognition is a complicated process. Its performance heavily relies on the extraction and selection of features related to the emotional state of the speaker in the audio signal of an utterance. For most of them, the methodology has already been implemented, and they have been experimentally tested and compared on the Berlin database of emotional speech.

In the reported work in progress, we use MPEG-7 low-level audio descriptors [10] as features for the recognition of emotional categories. To this end, we elaborate a methodology combining MPEG-7 with several important kinds of classifiers. For most of them, the methodology has already been implemented and tested with the publicly available Berlin Database of Emotional Speech [1].

In the next section, the task of sentiment analysis from utterances is briefly sketched. Section 3 recalls the necessary background concerning MPEG-7 audio descriptors and the considered classification methods. In Section 4, the principles of the proposed approach are explained. Finally, Section 5 presents results of experimental testing and comparison of the already implemented classifiers on the publicly available Berlin database of emotional speech.

2 Sentiment Analysis from Utterances

Due to the importance of recognizing emotional states in speech, research into sentiment analysis from utterances has been emerging during recent years. We are aware of three publications reporting research with the same database of emotional utterances as used in our research: the Berlin Database of Emotional Speech. Let us recall each of them.

The research most similar to ours has been reported in [12], where the authors also used MPEG-7 descriptors for sentiment analysis from utterances. However, they used only scalar MPEG-7 descriptors, or scalars derived from time-series descriptors using the software tools Sound Description Toolbox [13] and MPEG-7 Audio Reference Software Toolkit [2], whereas we are implementing also a long short-term memory network that will use the time series directly. They also used only one classifier in their experiments, a combination of a radial basis function network and a support vector machine.

In [11], emotions are recognized using pitch and prosody features, which are mostly in the time domain. Also in that paper, only one classifier was used in the experiments, this time a support vector machine (SVM).

The authors of [16] proposed a set of 68 new features, some of them based on harmonic frequencies or on the Zipf distribution, for better speech emotion recognition. This set of features is used in a multi-stage classification. When performing the sentiment analysis of the Berlin Database, the classification of utterances into the considered emotional categories was preceded by a gender classification of the speakers, and the gender of the speaker was subsequently used as an additional feature for the classification of the utterances.

3 MPEG-7 Audio Descriptors

MPEG-7 is a standard for low-level description of audio signals, describing a signal by means of the following groups of descriptors [10]:

1. Basic: Audio Power (AP), Audio Waveform (AWF). Temporally sampled scalar values for general use, applicable to all kinds of signals. The AP describes the temporally-smoothed instantaneous power of the samples in the frame; in other words, it is a temporal measure of signal content as a function of time and offers a quick summary of a signal in conjunction with other basic spectral descriptors. The AWF describes the audio waveform envelope (minimum and maximum), typically for display purposes.
2. Basic Spectral: Audio Spectrum Envelope (ASE), Audio Spectrum Centroid (ASC), Audio Spectrum Spread (ASS), Audio Spectrum Flatness (ASF). All share a common basis, deriving from the short-term audio signal spectrum (analysis of frequency over time). They are all based on the ASE descriptor, which is a logarithmic-frequency spectrum. This descriptor provides a compact description of the signal spectral content and represents a similar approximation of the logarithmic response of the human ear. The ASE descriptor is an indicator as to whether the spectral content of a signal is dominated by high or low frequencies. The ASC descriptor can be considered an approximation of the perceptual sharpness of the signal. The ASS descriptor indicates whether the signal content, as represented by the power spectrum, is concentrated around its centroid or spread out over a wider range of the spectrum. This gives a measure which allows the distinction of noise-like sounds from tonal sounds. The ASF describes the flatness properties of the spectrum of an audio signal for each of a number of frequency bands.

3. Basic Signal Parameters: Audio Fundamental Frequency (AFF) and Audio Harmonicity (AH). The signal parameters constitute a simple parametric description of the audio signal. This group includes the computation of an estimate of the fundamental frequency (F0) of the audio signal. The AFF descriptor provides estimates of the fundamental frequency in segments in which the audio signal is assumed to be periodic. The AH represents the harmonicity of a signal, allowing distinction between sounds with a harmonic spectrum (e.g., musical tones or voiced speech, such as vowels), sounds with an inharmonic spectrum (e.g., bell-like sounds) and sounds with a non-harmonic spectrum (e.g., noise, unvoiced speech).

4. Temporal Timbral: Log Attack Time (LAT), Temporal Centroid (TC). Timbre refers to features that allow one to distinguish two sounds that are equal in pitch, loudness and subjective duration. These descriptors take into account several perceptual dimensions at the same time in a complex way. Temporal Timbral descriptors describe the signal power function over time. The power function is estimated as a local mean square value of the signal amplitude within a running window. The LAT descriptor characterizes the "attack" of a sound, the time it takes for the signal to rise from silence to its maximum amplitude. This feature signifies the difference between a sudden and a smooth sound. The TC descriptor computes a time-based centroid as the time average over the energy envelope of the signal.

5. Timbral Spectral: Harmonic Spectral Centroid (HSC), Harmonic Spectral Deviation (HSD), Harmonic Spectral Spread (HSS), Harmonic Spectral Variation (HSV) and Spectral Centroid. These are spectral features extracted in a linear-frequency space. The HSC descriptor is defined as the average, over the signal duration, of the amplitude-weighted mean of the frequencies of the bins (the harmonic peaks of the spectrum) in the linear power spectrum. It has a high correlation with the perceptual feature of "sharpness" of a sound. The HSD descriptor measures the spectral deviation of the harmonic peaks from the global envelope. The HSS descriptor measures the amplitude-weighted standard deviation (root mean square) of the harmonic peaks of the spectrum, normalized by the HSC. The HSV descriptor is the normalized correlation between the amplitudes of the harmonic peaks in two subsequent time-slices of the signal.

6. Spectral Basis, which consists of Audio Spectrum Basis (ASB) and Audio Spectrum Projection (ASP).

3.1 Tools for Working with MPEG-7 Descriptors

We utilized the Sound Description Toolbox [13] and the MPEG-7 Audio Analyzer - Low Level Descriptors Extractor [15] for our experiments. Both of them extract a number of MPEG-7 standard descriptors, both scalar ones and time series. In addition, the SDT also calculates perceptual features such as Mel frequency cepstral coefficients, specific loudness and sensation coefficients. From these descriptors, it calculates means, covariances, means of first-order differences and covariances of first-order differences. The total number of features provided by this toolbox is 187.

4 Employed Classification Methods

We have elaborated our approach to sentiment analysis from utterances for six classification methods: k nearest neighbours, support vector machines, multilayer perceptrons, classification trees, random forests [7] and a long short-term memory (LSTM) network [5, 6, 8]. The first five of them have already been implemented and tested (cf. Section 5); the last and most advanced one is still being implemented.

4.1 k Nearest Neighbours (kNN)

A very traditional way of classifying a new feature vector x ∈ X if a sequence of training data (x1, c1), ..., (xp, cp) is available is the nearest neighbour method: take the xj that is closest to x among x1, ..., xp, and assign to x the class assigned to xj, i.e., cj.

A straightforward generalization of the nearest neighbour method is to take among x1, ..., xp not one, but the k feature vectors xj1, ..., xjk closest to x.
Then x is assigned the class c ∈ C fulfilling

|{i, 1 ≤ i ≤ k | cji = c}| = max_{c'∈C} |{i, 1 ≤ i ≤ k | cji = c'}|.   (1)

This method is called, expectedly, k nearest neighbours, or k-NN for short.

4.2 Support Vector Machines (SVM)

Support vector machines are classifiers into two classes. This method attempts to derive from the training data (x1, c1), ..., (xp, cp) the best possible generalization to unseen feature vectors.

If both classes, more precisely their intersections with the set {x1, ..., xp} of training inputs, are linearly separable in the space of feature vectors, the method constructs two parallel hyperplanes H+ = {x ∈ Rn | x⊤w + b+ = 0}, H− = {x ∈ Rn | x⊤w + b− = 0} such that the training data fulfil

ck = 1 if xk⊤w + b+ ≥ 0,  ck = −1 if xk⊤w + b− ≤ 0,   k = 1, ..., p,   (2)

H+ ∩ {x1, ..., xp} ≠ ∅,  H− ∩ {x1, ..., xp} ≠ ∅.   (3)

The hyperplanes H+ and H− are called support hyperplanes. Their common normal vector w and intercepts b+, b− are obtained through solving the following constrained optimization task: maximize, with respect to w, b+, b−, the distance

d(H+, H−) = (b+ − b−) / ‖w‖   (4)

on condition that the p inequalities (2) hold.

The distance (4) is commonly called margin. The solution of this optimization task coincides with the saddle point (w*, b+*, b−*, α1*, ..., αp*) of the Lagrange function

L(w, b+, b−, α1, ..., αp) = ‖w‖² + Σ_{k=1}^{p} αk((b+ − b−)/2 − ck xk⊤w),   (5)

where α1, ..., αp ≥ 0 are the Lagrange multipliers of the optimization task. Once the saddle point (w*, b+*, b−*, α1*, ..., αp*) is found, the classifier is defined by

φ(x) = 1 if Σ_{xk∈S} αk* ck x⊤xk + b* ≥ 0,  φ(x) = −1 if Σ_{xk∈S} αk* ck x⊤xk + b* < 0,   (6)

where b* = (b+* + b−*)/2 and

S = {xk | αk* > 0}.   (7)

Due to the Karush-Kuhn-Tucker (KKT) conditions,

αk*((b+* − b−*)/2 − ck xk⊤w*) = 0,   k = 1, ..., p,   (8)

all feature vectors from the set S lie on some of the support hyperplanes (3). Therefore, they are called support vectors. This name, together with the observation that they completely determine the classifier defined in (6), explains why such a classifier is called a support vector machine.

If the intersections of both classes with the training inputs are not linearly separable, an SVM is constructed similarly, but instead of the set of possible feature vectors, now the set of functions

κ(·, x) for all possible feature vectors x   (9)

is considered, where κ is a kernel, i.e., a mapping on pairs of feature vectors that is symmetric and such that for any k ∈ N and any sequence of different feature vectors x1, ..., xk, the matrix

Gκ(x1, ..., xk) = (κ(xi, xj))_{i,j=1,...,k},   (10)

which is called the Gram matrix of x1, ..., xk, is positive semidefinite, i.e.,

(∀y ∈ Rk) y⊤Gκ(x1, ..., xk)y ≥ 0.   (11)

The most commonly used kinds of kernels are the Gaussian kernel with a parameter ς > 0,

(∀x, x' ∈ Rn) κ(x, x') = exp(−‖x − x'‖² / ς),   (12)

and the polynomial kernel with parameters d ∈ N and c ≥ 0,

(∀x, x' ∈ Rn) κ(x, x') = (x⊤x' + c)^d.   (13)

It is known [14] that, due to the properties of kernels, if the joint distribution of a sequence of different feature vectors x1, ..., xk is continuous, then almost surely any proper subset of the set of functions {κ(·, x1), ..., κ(·, xk)} is linearly separable from its complement in the space of all functions (9).

However, the feature vectors x and xk cannot be simply replaced by the corresponding functions κ(·, x) and κ(·, xk) in the definition (6) of an SVM classifier, because a transpose x⊤ exists for a finite-dimensional vector, but not for an infinite-dimensional function. Fortunately, the transpose occurs in (6) only as a part of the scalar product x⊤xk. And a scalar product can be defined also on the space of all functions (9). Namely, the properties of a scalar product are possessed by the function that assigns to the pair of functions (κ(·, x), κ(·, x')) the value κ(x, x'). Using this scalar product in (6), we obtain the following definition of an SVM classifier for linearly non-separable classes:

φ(x) = 1 if Σ_{xk∈S} αk* ck κ(x, xk) + b* ≥ 0,  φ(x) = −1 if Σ_{xk∈S} αk* ck κ(x, xk) + b* < 0.   (14)
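The decision rule (14) with the Gaussian kernel (12) can be sketched as follows. The support vectors, multipliers αk* and intercept b* below are made-up values for illustration only, not the solution of an actual optimization task:

```python
from math import exp

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel (12): kappa(x, y) = exp(-||x - y||^2 / sigma)."""
    return exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / sigma)

def svm_classify(x, support, b, kernel=gaussian_kernel):
    """Decision rule (14): sign of the sum over support vectors of
    alpha_k * c_k * kappa(x, x_k), plus the intercept b."""
    s = sum(alpha * c * kernel(x, xk) for xk, c, alpha in support) + b
    return 1 if s >= 0 else -1

# Made-up support vector triples (x_k, c_k, alpha_k) and intercept,
# for illustration only; not the result of real training.
support = [((0.0, 0.0), 1, 1.0), ((2.0, 2.0), -1, 1.0)]
print(svm_classify((0.1, 0.1), support, b=0.0))  # 1
print(svm_classify((1.9, 2.1), support, b=0.0))  # -1
```

Replacing `gaussian_kernel` by the polynomial kernel (13) changes only the `kernel` argument, not the decision rule.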
4.3 Multilayer Perceptrons (MLP)

A multilayer perceptron is a mapping φ of feature vectors to classes with which a directed graph Gφ = (V, E) is associated. Due to the inspiration from biological neural networks, the vertices of Gφ are called neurons and its edges are called connections. In addition, Gφ is required to have a layered structure, which means that the set V of neurons can be decomposed into L + 1 mutually disjoint layers, V = V0 ∪ V1 ∪ ... ∪ VL, L ≥ 2, such that

(∀(u, v) ∈ E) u ∈ Vi, i = 0, ..., L − 1 & v ∉ Vi ⇒ v ∈ Vi+1.   (15)

The layer I = V0 is called the input layer of the MLP, the layer O = VL its output layer, and the layers H1 = V1, ..., HL−1 = VL−1 its hidden layers.

The purpose of the graph Gφ associated with the mapping φ is to define a decomposition of φ into simple mappings assigned to hidden and output neurons and to connections between neurons (input neurons normally only accept the components of the input, and no mappings are assigned to them). Inspired by biological terminology, mappings assigned to neurons are called somatic, those assigned to connections are called synaptic.

To each connection (u, v) ∈ E, multiplication by a weight w(u,v) is assigned as a synaptic mapping:

(∀ξ ∈ R) f(u,v)(ξ) = w(u,v)ξ.   (16)

To each hidden neuron v ∈ Hi, the following somatic mapping is assigned:

(∀ξ ∈ R^|in(v)|) fv(ξ) = ϕ(Σ_{u∈in(v)} [ξ]u + bv),   (17)

where [ξ]u for u ∈ in(v) denotes the component of ξ that is the output of the synaptic mapping f(u,v) assigned to the connection (u, v), in(v) = {u ∈ V | (u, v) ∈ E} is the input set of v, and ϕ: R → R is called the activation function. As the activation functions, typically sigmoidal functions are used in applications, i.e., functions that are non-decreasing, piecewise continuous, and such that

−∞ < lim_{t→−∞} ϕ(t) < lim_{t→∞} ϕ(t) < ∞.   (18)

The activation functions most frequently encountered in MLPs are:

• the logistic function,

(∀t ∈ R) ϕ(t) = 1 / (1 + e^{−t});   (19)

• the hyperbolic tangent,

ϕ(t) = tanh t = (e^t − e^{−t}) / (e^t + e^{−t}).   (20)

To an output neuron v ∈ O, also a somatic mapping of the kind (17) with the activation functions (19) or (20) can be assigned. If that is the case, then the class c predicted for a feature vector x is obtained as c = arg max_i (φ(x))_i, where (φ(x))_i denotes the i-th component of φ(x). Alternatively, the activation function assigned to an output neuron can be the step function, aka Heaviside function,

ϕ(t) = 0 if t < 0,  ϕ(t) = 1 if t ≥ 0.   (21)

In that case, the value (φ(x))_c already directly indicates whether x belongs to the class c.

4.4 Classification Trees (CT)

A classifier φ: X → C = {c1, ..., cm} is called a binary classification tree if there is a binary tree Tφ = (Vφ, Eφ) with vertices Vφ and edges Eφ such that:

(i) Vφ = {v1, ..., vL, ..., v2L−1}, where L ≥ 2, v1 is the root of Tφ, v1, ..., vL−1 are its forks and vL, ..., v2L−1 are its leaves.
(ii) If the children of a fork v ∈ {v1, ..., vL−1} are vL ∈ Vφ (left child) and vR ∈ Vφ (right child) and if v = vi, vL = vj, vR = vk, then i < j < k.
(iii) To each fork v ∈ {v1, ..., vL−1}, a predicate ϕv of some formal logic is assigned, evaluated on features of the input vectors x ∈ X.
(iv) To each leaf v ∈ {vL, ..., v2L−1}, a class cv ∈ C is assigned.
(v) For each input x ∈ X, the predicate ϕv1 assigned to the root is evaluated.
(vi) If for a fork v ∈ {v1, ..., vL−1} the predicate ϕv evaluates to true, then φ(x) = cvL in case vL is already a leaf, and the predicate ϕvL is evaluated in case vL is still a fork.
(vii) If for a fork v ∈ {v1, ..., vL−1} the predicate ϕv evaluates to false, then φ(x) = cvR in case vR is already a leaf, and the predicate ϕvR is evaluated in case vR is still a fork.

4.5 Random Forests (RF)

Random forests are ensembles of classifiers in which the individual members are classification trees. They are constructed by bagging, i.e., bootstrap aggregation of individual trees, which consists in training each member of the ensemble with a different set of training data, sampled randomly with replacement from the original training pairs (x1, c1), ..., (xp, cp). Typical sizes of random forests encountered in applications are dozens to thousands of trees. Subsequently, when new subjects are input to the forest, each tree classifies them separately, according to the leaves at which they end, and the final classification by the forest is obtained by means of an aggregation function. The usual aggregation function of random forests is majority voting, or some of its fuzzy generalizations.
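The construction just described, bagging with majority voting, can be sketched as follows. The base learner here is a hypothetical one-feature decision stump, used only to keep the example short; a real random forest grows full classification trees of the kind defined in Subsection 4.4:

```python
import random
from collections import Counter

def bagged_ensemble(pairs, train_member, n_members=25, seed=0):
    """Train each ensemble member on a bootstrap resample of the
    training pairs (sampling with replacement, i.e., bagging)."""
    rng = random.Random(seed)
    return [train_member([rng.choice(pairs) for _ in pairs])
            for _ in range(n_members)]

def forest_classify(ensemble, x):
    """Aggregate the members' predictions by majority voting."""
    return Counter(member(x) for member in ensemble).most_common(1)[0][0]

def train_stump(sample):
    """Stand-in base learner: a one-feature threshold 'stump' that
    predicts the majority class on each side of the sample mean."""
    thr = sum(x[0] for x, _ in sample) / len(sample)
    left = Counter(c for x, c in sample if x[0] <= thr)
    right = Counter(c for x, c in sample if x[0] > thr)
    lc = left.most_common(1)[0][0] if left else right.most_common(1)[0][0]
    rc = right.most_common(1)[0][0] if right else lc
    return lambda x: lc if x[0] <= thr else rc

# Toy 1-D data: two well-separated classes.
data = [((0.0,), "a"), ((0.5,), "a"), ((1.0,), "a"),
        ((4.0,), "b"), ((4.5,), "b"), ((5.0,), "b")]
forest = bagged_ensemble(data, train_stump)
print(forest_classify(forest, (0.2,)))  # a
print(forest_classify(forest, (4.8,)))  # b
```

Because each member sees a different bootstrap sample, the members disagree on some inputs, and the majority vote smooths those disagreements out.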
According to which kind of randomness is involved in the construction of the ensemble, two broad groups of random forests can be differentiated:

1. Random forests grown in the full input space. Each tree is trained using all considered input features. Consequently, any feature has to be taken into account when looking for the split condition assigned to an inner node of the tree. However, the features actually occurring in the split conditions can differ from tree to tree, as a consequence of the fact that each tree is trained with a different set of training data. For the same reason, even if a particular feature occurs in the split conditions of two different trees, those conditions can be assigned to nodes at different levels of the tree. A great advantage of this kind of random forests is that each tree is trained using all the information available in its set of training data. Its main disadvantage is high computational complexity. In addition, if several or even only one variable are very noisy, that noise nonetheless gets incorporated into all trees in the forest. Because of those disadvantages, random forests are grown in the complete input space primarily if its dimension is not high and no input feature is substantially noisier than the remaining ones.

2. Random forests grown in subspaces of the input space. Each tree is trained using only a randomly chosen fraction of the features, typically a small one. This means that a tree t is actually trained with projections of the training data into a low-dimensional space spanned by some randomly selected dimensions i_{t,1} ≤ ... ≤ i_{t,dt} ∈ {1, ..., d}, where d is the dimension of the input space, and dt is typically much smaller than d. Using only a subset of features not only makes forest training much faster, but also allows eliminating noise originating from only several features. The price paid for both these advantages is that training makes use of only a part of the information available in the training data.

4.6 Long Short-Term Memory (LSTM)

An LSTM network is used for the classification of sequences of feature vectors, or equivalently, of multidimensional time series with discrete time. Alternatively, it can also be employed to obtain sequences of such classifications, i.e., in situations when the neural network input is a sequence of feature vectors and its output is a sequence of classes. Differently from most other commonly encountered kinds of artificial neural networks, an LSTM layer connects not simple neurons, but units with their own inner structure. Several variants of an LSTM have been proposed (e.g., [5, 6]); all of them include at least the four kinds of units described below. Each of them has certain properties of usual ANN neurons; in particular, the values assigned to them depend, apart from a bias, on values assigned to the unit input at the same time step and on values assigned to the unit output at the previous time step. Hence, an LSTM network layer is a recurrent network.

(i) Memory cells can store values, aka cell states, for an arbitrary time. They have no activation function, thus their output is actually a biased linear combination of unit inputs and of the values from the previous time step coming through recurrent connections.
(ii) The input gate controls the extent to which values from the previous unit or from the preceding layer influence the value stored in the memory cell. It has a sigmoidal activation function, which is applied to a biased linear combination of unit inputs and of values from the previous time step, though the bias and synaptic weights of the input and recurrent connections are specific and in general different from the bias and synaptic weights of the memory cell.
(iii) The forget gate controls the extent to which the memory cell state is suppressed. It again has a sigmoidal activation function, which is applied to a specific biased linear combination of unit inputs and of values from the previous time step.
(iv) The output gate controls the extent to which the memory cell state influences the unit output. Also this gate has a sigmoidal activation function, which is applied to a specific biased linear combination of unit inputs and of values from the previous time step, and subsequently composed either directly with the cell state or with its sigmoidal transformation, using another sigmoid than is used by the gates.

5 Experimental Testing

5.1 Berlin Database of Emotional Speech

For the evaluation of the already implemented classifiers, we used the publicly available dataset "EmoDB", aka the Berlin database of emotional speech. It consists of 535 emotional utterances in 7 emotional categories, namely anger, boredom, disgust, fear, happiness, sadness and neutral. These utterances are sentences read by 10 professional actors, 5 males and 5 females [1], which were recorded in an anechoic chamber under supervision by linguists and psychologists. The actors were advised to read these predefined sentences in the targeted emotional categories, but the sentences do not contain any emotional bias. A human perception test was conducted with 20 persons, different from the speakers, in order to evaluate the quality of the recorded data with respect to the recognisability and naturalness of the presented emotion. This evaluation yielded a mean accuracy of 86% over all emotional categories.

5.2 Experimental Settings

As input features, the outputs from the Sound Description Toolbox were used.
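Recall from Subsection 3.1 that these outputs are scalar statistics of descriptor time series: means, covariances, and the same for first-order differences. For a single descriptor series, such statistics might be computed as in this sketch (toy series, plain Python; the exact set of statistics the toolbox emits is described in Subsection 3.1, not reproduced here):

```python
def summary_features(series):
    """Scalar statistics of one descriptor time series: mean and
    variance of the values and of their first-order differences."""
    def mean(v):
        return sum(v) / len(v)
    def var(v):
        m = mean(v)
        return sum((x - m) ** 2 for x in v) / len(v)
    diffs = [b - a for a, b in zip(series, series[1:])]
    return [mean(series), var(series), mean(diffs), var(diffs)]

# A linearly rising toy series has constant first-order differences,
# hence zero variance of the differences.
print(summary_features([1.0, 2.0, 3.0, 4.0]))  # [2.5, 1.25, 1.0, 0.0]
```

Concatenating such statistics over all extracted descriptors yields a fixed-length feature vector regardless of the utterance duration.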
Consequently, the input dimension was 187. The already implemented classifiers were compared by means of a 10-fold cross-validation, using the following settings for each of them:

• For the k nearest neighbours classification, the value k = 9 was chosen by a grid method from ⟨1, 80⟩. This classifier was applied to data normalized to zero mean and unit variance.

• Support vector machines are constructed for each of the 7 considered emotions, to classify between that emotion and all the remaining ones. They employ auto-scaled Gaussian kernels and do not use slack variables.

• The MLP has 1 hidden layer with 70 neurons. Hence, taking into account the input dimension and the number of classes, the overall architecture of the MLP is 187-70-7.

• Classification trees are restricted to have at most 23 leaves. This upper limit was chosen by a grid method from ⟨1, 50⟩, taking into account the way classification trees are grown in their Matlab implementation.

• Random forests consist of 50 classification trees, each of them taking over the above restriction. The number of trees was selected by a grid method from 10, 20, ..., 100.

5.3 Comparison of Already Implemented Classifiers

First, we compared the already implemented classifiers on the whole Berlin database of emotional speech, with respect to accuracy and area under the ROC curve (area under curve, AUC). Since a ROC curve makes sense only for a binary classifier, we computed areas under 7 separate curves corresponding to classifiers classifying always 1 emotion against the rest. The results are presented in Table 1 and in Figure 1. They clearly show SVM as the most promising classifier. It has the highest accuracy, and also the highest AUC for the binary classifiers corresponding to 5 of the 7 emotions.

Table 1: Accuracy and area under curve (AUC) of the implemented classifiers on the whole Berlin database of emotional speech. AUC is measured for binary classification of each of the considered 7 emotions against the rest.

Classifier  Accuracy  AUC: emotion against the rest
                      Anger  Boredom  Disgust
kNN         0.73      0.956  0.933    0.901
SVM         0.93      0.979  0.973    0.966
MLP         0.78      0.977  0.969    0.964
DT          0.59      0.871  0.836    0.772
RF          0.71      0.962  0.949    0.920

Classifier  AUC: emotion against the rest
            Fear   Happiness  Neutral  Sadness
kNN         0.902  0.856      0.962    0.995
SVM         0.983  0.904      0.974    0.997
MLP         0.969  0.933      0.983    0.996
DT          0.782  0.683      0.855    0.865
RF          0.921  0.882      0.972    0.992

Then we compared the classifiers separately on the utterances of each of the 10 speakers who created the database. The results are summarized in Table 2 for accuracy and in Table 3 for AUC averaged over all 7 emotions. They indicate a great difference between most of the compared classifiers. This is confirmed by the Friedman test of the hypotheses that all classifiers have equal accuracy and equal average AUC. The Friedman test rejected both hypotheses with a high significance: with the Holm correction for simultaneously tested hypotheses [9], the achieved significance level (aka p-value) was 4 · 10⁻⁶. For both hypotheses, post-hoc tests according to [3, 4] were performed, testing equal accuracy and equal average AUC between individual pairs of classifiers. For the family-wise significance level 5%, they reveal the following Holm-corrected significant differences between individual pairs of classifiers: both for accuracy and averaged AUC, (SVM,DT) and (MLP,DT), and in addition (kNN,SVM) and (SVM,RF) for accuracy.

Table 2: Comparison between pairs of implemented classifiers with respect to accuracy, based on 10 independent parts of the Berlin database of emotional speech corresponding to 10 different speakers. The result in a cell of the table indicates on how many parts the accuracy of the row classifier was higher : on how many parts the accuracy of the column classifier was higher. A result in bold indicates that, after the Friedman test rejected the hypothesis of equal accuracy of all classifiers, the post-hoc test according to [3, 4] rejects the hypothesis of equal accuracy of the particular row and column classifiers. All simultaneously tested hypotheses were corrected in accordance with Holm [9].

classifier  kNN      SVM   MLP      DT    RF
kNN         -        0:10  3.5:6.5  9:1   5:5
SVM         10:0     -     10:0     10:0  10:0
MLP         6.5:3.5  0:10  -        10:0  7:3
DT          1:9      0:10  0:10     -     0:10
RF          5:5      0:10  3:7      10:0  -
Table 3: Comparison between pairs of implemented classifiers with respect to the AUC averaged over all 7 emotions, based on 10 independent parts of the Berlin database of emotional speech corresponding to 10 different speakers. The result in a cell of the table indicates on how many parts the AUC of the row classifier was higher : on how many parts the AUC of the column classifier was higher. A result in bold indicates that, after the Friedman test rejected the hypothesis of equal AUC of all classifiers, the post-hoc test according to [3, 4] rejects the hypothesis of equal AUC of the particular row and column classifiers. All simultaneously tested hypotheses were corrected in accordance with Holm [9].

classifier  kNN   SVM   MLP   DT    RF
kNN         -     2:8   0:10  10:0  4:6
SVM         8:2   -     5:5   10:0  9:1
MLP         10:0  5:5   -     10:0  9:1
DT          0:10  0:10  0:10  -     0:10
RF          6:4   1:9   1:9   10:0  -

6 Conclusion

The presented work in progress investigated the possibilities to analyse emotions in utterances based on MPEG-7 features. So far, we implemented only five classification methods not using time-series features, but only the 187 scalar features, namely the k nearest neighbours classifier, support vector machines, multilayer perceptrons, decision trees and random forests. The obtained results indicate that especially support vector machines and multilayer perceptrons are quite successful for this task. Statistical testing confirms significant differences between these two kinds of classifiers on the one hand, and decision trees and random forests on the other hand.

The next step in this ongoing research is to implement the long short-term memory neural network, recalled in Subsection 4.6, because it can work not only with scalar features but also with features represented by time series.

Acknowledgement

The research reported in this paper has been supported by the Czech Science Foundation (GAČR) grant 18-18080S.

References

[1] F. Burkhardt, A. Paeschke, M. Rolfes, W. Sendlmeier, and B. Weiss. A database of German emotional speech. In Interspeech, pages 1517–1520, 2005.
[2] M. Casey, A. De Cheveigne, P. Gardner, M. Jackson, and G. Peeters. MPEG-7 multimedia software resources. http://mpeg7.doc.gold.ac.uk/, 2001.
[3] J. Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30, 2006.
[4] S. Garcia and F. Herrera. An extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all pairwise comparisons. Journal of Machine Learning Research, 9:2677–2694, 2008.
[5] F.A. Gers, J. Schmidhuber, and F. Cummins. Learning to forget: Continual prediction with LSTM. In 9th International Conference on Artificial Neural Networks: ICANN '99, pages 850–855, 1999.
[6] A. Graves. Supervised Sequence Labelling with Recurrent Neural Networks. PhD thesis, TU München, 2008.
[7] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning, 2nd Edition. Springer, 2008.
[8] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9:1735–1780, 1997.
[9] S. Holm. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6:65–70, 1979.
[10] H.G. Kim, N. Moreau, and T. Sikora. MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval. John Wiley and Sons, New York, 2005.
[11] S. Lalitha, A. Madhavan, B. Bhusan, and S. Saketh. Speech emotion recognition. In International Conference on Advances in Electronics, pages 92–95, 2014.
[12] A.S. Lampropoulos and G.A. Tsihrintzis. Evaluation of MPEG-7 descriptors for speech emotional recognition. In Eighth International Conference on Intelligent Information Hiding and Multimedia Signal Processing, pages 98–101, 2012.
[13] A. Rauber, T. Lidy, J. Frank, E. Benetos, V. Zenz, G. Bertini, T. Virtanen, A.T. Cemgil, S. Godsill, D. Clark, P. Peeling, E. Peisyer, Y. Laprie, A. Sloin, A. Alfandary, and D. Burshtein. MUSCLE network of excellence: Multimedia understanding through semantics, computation and learning. Technical report, TU Vienna, Information and Software Engineering Group, 2004.
[14] B. Schölkopf and A.J. Smola. Learning with Kernels. MIT Press, Cambridge, 2002.
[15] T. Sikora, H.G. Kim, N. Moreau, and S. Amjad. MPEG-7-based audio annotation for the archival of digital video. http://mpeg7lld.nue.tu-berlin.de/, 2003.
[16] Z. Xiao, E. Dellandrea, W. Dou, and L. Chen. Multi-stage classification of emotional speech motivated by a dimensional emotion model. Multimedia Tools and Applications, 46:119–145, 2010.
Figure 1: ROC curve for all emotions on the whole Berlin database