<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Sentiment Analysis from Utterances</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Jiří</forename><surname>Kožusznik</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Faculty of Information Technology</orgName>
								<orgName type="institution">Czech Technical University</orgName>
								<address>
									<addrLine>Thákurova 7</addrLine>
									<settlement>Prague</settlement>
									<country key="CZ">Czech Republic</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Petr</forename><surname>Pulc</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Faculty of Information Technology</orgName>
								<orgName type="institution">Czech Technical University</orgName>
								<address>
									<addrLine>Thákurova 7</addrLine>
									<settlement>Prague</settlement>
									<country key="CZ">Czech Republic</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department">Institute of Computer Science</orgName>
								<orgName type="institution">Czech Academy of Sciences</orgName>
								<address>
									<addrLine>Pod vodárenskou věží 2</addrLine>
									<settlement>Prague</settlement>
									<country key="CZ">Czech Republic</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Martin</forename><surname>Holeňa</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">Institute of Computer Science</orgName>
								<orgName type="institution">Czech Academy of Sciences</orgName>
								<address>
									<addrLine>Pod vodárenskou věží 2</addrLine>
									<settlement>Prague</settlement>
									<country key="CZ">Czech Republic</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Sentiment Analysis from Utterances</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">C2882E2BE8ACB4DBFDF9B92231C26B29</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T09:11+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The recognition of emotional states in speech is starting to play an increasingly important role. However, it is a complicated process, which heavily relies on the extraction and selection of utterance features related to the emotional state of the speaker. In the reported research, MPEG-7 low-level audio descriptors [10] serve as features for the recognition of emotional categories. To this end, a methodology combining MPEG-7 with several important kinds of classifiers is elaborated.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>The recognition of emotional states in speech is expected to play an increasingly important role in applications such as media retrieval systems, car management systems, call center applications, personal assistants and the like. In many languages, the meaning of spoken words commonly changes depending on the speaker's emotions, and consequently the emotional information is important for understanding the intended meaning. Emotional speech recognition is a complicated process. Its performance heavily relies on the extraction and selection of features related to the emotional state of the speaker in the audio signal of an utterance.</p><p>In the reported work in progress, we use MPEG-7 low-level audio descriptors <ref type="bibr" target="#b9">[10]</ref> as features for the recognition of emotional categories. To this end, we elaborate a methodology combining MPEG-7 with several important kinds of classifiers. For most of them, the methodology has already been implemented and tested with the publicly available Berlin Database of Emotional Speech <ref type="bibr" target="#b0">[1]</ref>.</p><p>In the next section, the task of sentiment analysis from utterances is briefly sketched. Section 3 recalls the necessary background concerning MPEG-7 audio descriptors, and Section 4 describes the employed classification methods. Finally, Section 5 presents results of experimental testing and comparison of the already implemented classifiers on the publicly available Berlin database of emotional speech.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Sentiment Analysis from Utterances</head><p>Due to the importance of recognizing emotional states in speech, research into sentiment analysis from utterances has been emerging during recent years. We are aware of three publications reporting research with the same database of emotional utterances as we used, the Berlin Database of Emotional Speech. Let us recall each of them.</p><p>The research most similar to ours has been reported in <ref type="bibr" target="#b11">[12]</ref>, where the authors also used MPEG-7 descriptors for sentiment analysis from utterances. However, they used only scalar MPEG-7 descriptors, or scalars derived from time-series descriptors using the software tools Sound Description Toolbox <ref type="bibr" target="#b12">[13]</ref> and MPEG-7 Audio Reference Software Toolkit <ref type="bibr" target="#b1">[2]</ref>, whereas we are also implementing a long short-term memory network that will use the time series directly. They also used only one classifier in their experiments, a combination of a radial basis function network and a support vector machine.</p><p>In <ref type="bibr" target="#b10">[11]</ref>, emotions are recognized using pitch and prosody features, which are mostly in the time domain. Also in that paper, the authors used only one classifier in their experiments, this time a support vector machine (SVM).</p><p>The authors of <ref type="bibr" target="#b15">[16]</ref> proposed a set of 68 new features, some of them based on harmonic frequencies or on the Zipf distribution, for better speech emotion recognition. This set of features is used in a multi-stage classification. When performing the sentiment analysis of the Berlin Database, the utterance classification into the considered emotional categories was preceded by a gender classification of the speakers, and the gender of the speaker was subsequently used as an additional feature for the classification of the utterances.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">MPEG-7 Audio Descriptors</head><p>MPEG-7 is a standard for low-level description of audio signals, describing a signal by means of the following groups of descriptors <ref type="bibr" target="#b9">[10]</ref>:</p><p>1. Basic: Audio Power (AP), Audio Waveform (AWF). These are temporally sampled scalar values for general use, applicable to all kinds of signals. The AP describes the temporally-smoothed instantaneous power of samples in the frame; in other words, it is a temporal measure of signal content as a function of time and offers a quick summary of a signal in conjunction with other basic spectral descriptors. The AWF describes the audio waveform envelope (minimum and maximum), typically for display purposes.</p><p>2. Basic Spectral: Audio Spectrum Envelope (ASE), Audio Spectrum Centroid (ASC), Audio Spectrum Spread (ASS), Audio Spectrum Flatness (ASF). All share a common basis, deriving from the short-term audio signal spectrum (analysis of frequency over time). They are all based on the ASE descriptor, which is a logarithmic-frequency spectrum. This descriptor provides a compact description of the signal spectral content and approximates the logarithmic frequency response of the human ear. The ASE descriptor indicates whether the spectral content of a signal is dominated by high or low frequencies. The ASC descriptor can be considered an approximation of the perceptual sharpness of the signal. The ASS descriptor indicates whether the signal content, as represented by the power spectrum, is concentrated around its centroid or spread out over a wider range of the spectrum. This gives a measure which allows the distinction of noise-like sounds from tonal sounds. The ASF describes the flatness properties of the spectrum of an audio signal for each of a number of frequency bands.</p><p>3. Basic Signal Parameters: Audio Fundamental Frequency (AFF) and Audio Harmonicity (AH). The signal parameters constitute a simple parametric description of the audio signal. This group includes the computation of an estimate of the fundamental frequency (F0) of the audio signal. The AFF descriptor provides estimates of the fundamental frequency in segments in which the audio signal is assumed to be periodic. The AH represents the harmonicity of a signal, allowing distinction between sounds with a harmonic spectrum (e.g., musical tones or voiced speech such as vowels), sounds with an inharmonic spectrum (e.g., bell-like sounds) and sounds with a non-harmonic spectrum (e.g., noise, unvoiced speech).</p><p>4. Temporal Timbral: Log Attack Time (LAT), Temporal Centroid (TC). Timbre refers to features that allow one to distinguish two sounds that are equal in pitch, loudness and subjective duration. These descriptors take into account several perceptual dimensions at the same time in a complex way. Temporal Timbral descriptors describe the signal power function over time. The power function is estimated as a local mean square value of the signal amplitude within a running window. The LAT descriptor characterizes the "attack" of a sound, the time it takes for the signal to rise from silence to its maximum amplitude. This feature signifies the difference between a sudden and a smooth sound. The TC descriptor computes a time-based centroid as the time average over the energy envelope of the signal.</p><p>5. Timbral Spectral: Harmonic Spectral Centroid (HSC), Harmonic Spectral Deviation (HSD), Harmonic Spectral Spread (HSS), Harmonic Spectral Variation (HSV) and Spectral Centroid. These are spectral features extracted in a linear-frequency space. The HSC descriptor is defined as the average, over the signal duration, of the amplitude-weighted mean of the frequencies of the bins (the harmonic peaks of the spectrum) in the linear power spectrum. It has a high correlation with the perceptual feature of "sharpness" of a sound. The HSD descriptor measures the spectral deviation of the harmonic peaks from the global envelope. The HSS descriptor measures the amplitude-weighted standard deviation (root mean square) of the harmonic peaks of the spectrum, normalized by the HSC. The HSV descriptor is the normalized correlation between the amplitudes of the harmonic peaks in two subsequent time slices of the signal.</p><p>6. Spectral Basis: Audio Spectrum Basis (ASB) and Audio Spectrum Projection (ASP). The ASB descriptor is a series of basis functions derived from the singular value decomposition of a normalized power spectrum, and the ASP descriptor is the projection of the spectrum onto that low-dimensional basis.</p></div>
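<div xmlns="http://www.tei-c.org/ns/1.0"><p>For illustration, the quantities behind the ASC, ASS and ASF descriptors for a single signal frame can be sketched in a few lines of numpy. This is a simplification for illustration only: the MPEG-7 standard prescribes specific window sizes and logarithmic frequency bands, which are omitted here, and the function and variable names are ours, not part of the standard.</p><code lang="python">
import numpy as np

def basic_spectral_descriptors(frame, sample_rate):
    """Illustrative centroid/spread/flatness of one audio frame.

    Simplified sketch of the quantities behind ASC, ASS and ASF;
    the normative MPEG-7 extraction uses logarithmic frequency bands.
    """
    spectrum = np.abs(np.fft.rfft(frame)) ** 2                # power spectrum
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)  # bin frequencies
    power = spectrum.sum() + 1e-12
    centroid = (freqs * spectrum).sum() / power               # cf. ASC
    spread = np.sqrt(((freqs - centroid) ** 2 * spectrum).sum() / power)  # cf. ASS
    # flatness: ratio of geometric to arithmetic mean of the spectrum, cf. ASF
    flatness = np.exp(np.mean(np.log(spectrum + 1e-12))) / np.mean(spectrum + 1e-12)
    return centroid, spread, flatness
</code></div>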
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Tools for Working with MPEG-7 Descriptors</head><p>We utilized the Sound Description Toolbox (SDT) <ref type="bibr" target="#b12">[13]</ref> and the MPEG-7 Audio Analyzer -Low Level Descriptors Extractor <ref type="bibr" target="#b14">[15]</ref> for our experiments. Both of them extract a number of MPEG-7 standard descriptors, both scalar ones and time series. In addition, the SDT also calculates perceptual features such as Mel Frequency Cepstral Coefficients, Specific Loudness and Sensation Coefficients. From these descriptors, it calculates means, covariances, means of first-order differences and covariances of first-order differences. The total number of features provided by this toolbox is 187.</p></div>
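<div xmlns="http://www.tei-c.org/ns/1.0"><p>How fixed-length features can arise from variable-length descriptor time series is sketched below, following the aggregation just described (means, covariances, and the same statistics of first-order differences). The exact feature set and ordering used by the Sound Description Toolbox are assumptions here; the sketch only illustrates the principle.</p><code lang="python">
import numpy as np

def summarize_descriptors(series_list):
    """Turn a list of per-descriptor time series (all of the same length)
    into one fixed-size feature vector, as sketched in Subsection 3.1."""
    X = np.vstack(series_list).T               # frames x descriptors
    dX = np.diff(X, axis=0)                    # first-order differences
    upper = np.triu_indices(X.shape[1])        # covariances without duplicates
    feats = [X.mean(axis=0),
             np.cov(X, rowvar=False)[upper],
             dX.mean(axis=0),
             np.cov(dX, rowvar=False)[upper]]
    return np.concatenate(feats)
</code></div>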
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Employed Classification Methods</head><p>We have elaborated our approach to sentiment analysis from utterances for six classification methods: k nearest neighbours, support vector machines, multilayer perceptrons, classification trees, random forests <ref type="bibr" target="#b6">[7]</ref> and long short-term memory (LSTM) networks <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b7">8]</ref>. The first five of them have already been implemented and tested (cf. Section 5); the last and most advanced one is still being implemented.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">k Nearest Neighbours (kNN)</head><p>A very traditional way of classifying a new feature vector x ∈ X , if a sequence of training data (x_1, c_1), . . . , (x_p, c_p) is available, is the nearest neighbour method: take the x_j that is closest to x among x_1, . . . , x_p, and assign to x the class assigned to x_j, i.e., c_j. A straightforward generalization of the nearest neighbour method is to take among x_1, . . . , x_p not one, but k feature vectors x_{j_1}, . . . , x_{j_k} closest to x. Then x is assigned the class c ∈ C fulfilling</p><formula xml:id="formula_0">|\{i,\ 1 \le i \le k \mid c_{j_i} = c\}| = \max_{c' \in C} |\{i,\ 1 \le i \le k \mid c_{j_i} = c'\}|.<label>(1)</label></formula><p>This method is called, expectedly, k nearest neighbours, or k-NN for short.</p></div>
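<div xmlns="http://www.tei-c.org/ns/1.0"><p>A minimal sketch of rule (1) in Python follows; the names are illustrative and ties between classes are broken arbitrarily, as (1) itself permits.</p><code lang="python">
import numpy as np
from collections import Counter

def knn_classify(x, X_train, c_train, k=9):
    """k nearest neighbours according to rule (1): among the k training
    vectors closest to x, return the most frequent class."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]               # indices j_1, ..., j_k
    return Counter(c_train[i] for i in nearest).most_common(1)[0][0]
</code></div>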
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Support Vector Machines (SVM)</head><p>Support vector machines are classifiers into two classes. This method attempts to derive from the training data (x_1, c_1), . . . , (x_p, c_p) the best possible generalization to unseen feature vectors.</p><p>If both classes, more precisely their intersections with the set {x_1, . . . , x_p} of training inputs, are linearly separable in the space of feature vectors, the method constructs two parallel hyperplanes</p><formula xml:id="formula_2">H_+ = \{x \in \mathbb{R}^n \mid x^\top w + b_+ = 0\}, \quad H_- = \{x \in \mathbb{R}^n \mid x^\top w + b_- = 0\}</formula><p>such that the training data fulfil</p><formula xml:id="formula_2b">c_k = 1 \text{ if } x_k^\top w + b_+ \ge 0, \qquad c_k = -1 \text{ if } x_k^\top w + b_- \le 0, \qquad k = 1, \dots, p,<label>(2)</label></formula><formula xml:id="formula_3">H_+ \cap \{x_1, \dots, x_p\} \ne \emptyset, \quad H_- \cap \{x_1, \dots, x_p\} \ne \emptyset.<label>(3)</label></formula><p>The hyperplanes H_+ and H_- are called support hyperplanes. Their common normal vector w and intercepts b_+, b_- are obtained through solving the following constrained optimization task: maximize with respect to w, b_+, b_- the distance</p><formula xml:id="formula_4">d(H_+, H_-) = \frac{b_+ - b_-}{\|w\|}<label>(4)</label></formula><p>on condition that the p inequalities (2) hold. The distance (<ref type="formula" target="#formula_4">4</ref>) is commonly called margin. The solution to this optimization task coincides with the saddle point (w^*, b_+^*, b_-^*, \alpha_1^*, \dots, \alpha_p^*) of the Lagrange function</p><formula xml:id="formula_5">L(w, b_+, b_-, \alpha_1, \dots, \alpha_p) = \|w\|^2 + \sum_{k=1}^{p} \alpha_k \Bigl( \frac{b_+ - b_-}{2} - c_k x_k^\top w \Bigr),<label>(5)</label></formula><p>where \alpha_1, \dots, \alpha_p \ge 0 are Lagrange coefficients of the optimization task.</p><p>Once the saddle point (w^*, b_+^*, b_-^*, \alpha_1^*, \dots, \alpha_p^*) is found, the classifier is defined by</p><formula xml:id="formula_6">\phi(x) = 1 \text{ if } \sum_{x_k \in S} \alpha_k^* c_k x^\top x_k + b^* \ge 0, \qquad \phi(x) = -1 \text{ if } \sum_{x_k \in S} \alpha_k^* c_k x^\top x_k + b^* &lt; 0,<label>(6)</label></formula><p>where</p><formula xml:id="formula_7">b^* = \frac{1}{2}(b_+^* + b_-^*) \quad \text{and} \quad S = \{x_k \mid \alpha_k^* > 0\}.<label>(7)</label></formula><p>Due to the Karush-Kuhn-Tucker (KKT) conditions,</p><formula xml:id="formula_9">\alpha_k^* \Bigl( \frac{b_+^* - b_-^*}{2} - c_k x_k^\top w^* \Bigr) = 0, \quad k = 1, \dots, p,<label>(8)</label></formula><p>all feature vectors from the set S lie on one of the support hyperplanes (3). Therefore, they are called support vectors. This name, together with the observation that they completely determine the classifier defined in (<ref type="formula" target="#formula_6">6</ref>), explains why such a classifier is called a support vector machine.</p><p>If the intersections of both classes with the training inputs are not linearly separable, an SVM is constructed similarly, but instead of the set of possible feature vectors, now the set of functions</p><formula xml:id="formula_10">\kappa(\cdot, x) \text{ for all possible feature vectors } x<label>(9)</label></formula><p>is considered, where \kappa is a kernel, i.e., a mapping on pairs of feature vectors that is symmetric and such that for any k ∈ N and any sequence of different feature vectors x_1, . . . , x_k, the matrix</p><formula xml:id="formula_11">G_\kappa(x_1, \dots, x_k) = \bigl( \kappa(x_i, x_j) \bigr)_{i,j=1}^{k},<label>(10)</label></formula><p>which is called the Gram matrix of x_1, . . . , x_k, is positive semidefinite, i.e.,</p><formula xml:id="formula_13">(\forall y \in \mathbb{R}^k)\ y^\top G_\kappa(x_1, \dots, x_k)\, y \ge 0.<label>(11)</label></formula><p>The most commonly used kinds of kernels are the Gaussian kernel with a parameter \varsigma > 0,</p><formula xml:id="formula_14">(\forall x, x' \in \mathbb{R}^n)\ \kappa(x, x') = \exp\Bigl( -\frac{1}{\varsigma} \|x - x'\|^2 \Bigr),<label>(12)</label></formula><p>and the polynomial kernel with parameters d ∈ N and c \ge 0,</p><formula xml:id="formula_15">(\forall x, x' \in \mathbb{R}^n)\ \kappa(x, x') = (x^\top x' + c)^d.<label>(13)</label></formula><p>It is known <ref type="bibr" target="#b13">[14]</ref> that, due to the properties of kernels, if the joint distribution of a sequence of different feature vectors x_1, . . . , x_k is continuous, then almost surely any proper subset of the set of functions {\kappa(\cdot, x_1), . . . , \kappa(\cdot, x_k)} is linearly separable from its complement in the space of all functions (9).</p><p>However, the feature vectors x and x_k cannot simply be replaced by the corresponding functions \kappa(\cdot, x) and \kappa(\cdot, x_k) in the definition (6) of an SVM classifier, because a transpose x^\top exists for a finite-dimensional vector, but not for an infinite-dimensional function. Fortunately, the transpose occurs in (6) only as a part of the scalar product x^\top x_k. And a scalar product can also be defined on the space of all functions (9): namely, the function that assigns to a pair of functions (\kappa(\cdot, x), \kappa(\cdot, x')) the value \kappa(x, x') has the properties of a scalar product. Using this scalar product in (6), we obtain the following definition of an SVM classifier for linearly non-separable classes:</p><formula xml:id="formula_16">\phi(x) = 1 \text{ if } \sum_{x_k \in S} \alpha_k^* c_k \kappa(x, x_k) + b^* \ge 0, \qquad \phi(x) = -1 \text{ if } \sum_{x_k \in S} \alpha_k^* c_k \kappa(x, x_k) + b^* &lt; 0.<label>(14)</label></formula></div>
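<div xmlns="http://www.tei-c.org/ns/1.0"><p>A minimal sketch of the classifier (14) follows. It assumes the multipliers \alpha_k^*, the labels c_k, the support vectors and the intercept b^* have already been computed; solving the underlying optimization task (e.g., with a quadratic programming solver) is deliberately omitted, and all names are illustrative.</p><code lang="python">
import numpy as np

def gaussian_kernel(x, z, varsigma=1.0):
    """Gaussian kernel (12) with parameter varsigma."""
    return np.exp(-np.sum((x - z) ** 2) / varsigma)

def svm_decision(x, support_vectors, alphas, labels, b, kernel=gaussian_kernel):
    """Classifier (14): sign of the kernel expansion over the support set S."""
    s = sum(a * c * kernel(x, xk)
            for a, c, xk in zip(alphas, labels, support_vectors)) + b
    return 1 if s >= 0 else -1
</code></div>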
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Multilayer Perceptrons (MLP)</head><p>A multilayer perceptron is a mapping \phi of feature vectors to classes with which a directed graph G_\phi = (V, E) is associated. Due to the inspiration from biological neural networks, the vertices of G_\phi are called neurons and its edges are called connections. In addition, G_\phi is required to have a layered structure, which means that the set V of neurons can be decomposed into L + 1 mutually disjoint layers,</p><formula xml:id="formula_18">V = V_0 \cup V_1 \cup \dots \cup V_L, \quad L \ge 2, \quad \text{such that} \quad (\forall (u, v) \in E)\ u \in V_i,\ i = 0, \dots, L-1 \Rightarrow v \in V_{i+1}.<label>(15)</label></formula><p>The layer I = V_0 is called the input layer of the MLP, the layer O = V_L its output layer, and the layers H_1 = V_1, . . . , H_{L-1} = V_{L-1} its hidden layers.</p><p>The purpose of the graph G_\phi associated with the mapping \phi is to define a decomposition of \phi into simple mappings assigned to hidden and output neurons and to connections between neurons (input neurons normally only accept the components of the input, and no mappings are assigned to them). Inspired by biological terminology, mappings assigned to neurons are called somatic, those assigned to connections are called synaptic.</p><p>To each connection (u, v) ∈ E, the multiplication by a weight w_{(u,v)} is assigned as a synaptic mapping:</p><formula xml:id="formula_20">(\forall \xi \in \mathbb{R})\ f_{(u,v)}(\xi) = w_{(u,v)} \xi.<label>(16)</label></formula><p>To each hidden neuron v ∈ H_i, the following somatic mapping is assigned:</p><formula xml:id="formula_21">(\forall \xi \in \mathbb{R}^{|in(v)|})\ f_v(\xi) = \varphi\Bigl( \sum_{u \in in(v)} [\xi]_u + b_v \Bigr),<label>(17)</label></formula><p>where [\xi]_u for u ∈ in(v) denotes the component of \xi that is the output of the synaptic mapping f_{(u,v)} assigned to the connection (u, v), in(v) = {u ∈ V | (u, v) ∈ E} is the input set of v, and \varphi : R → R is called the activation function.</p><p>As activation functions, typically sigmoidal functions are used in applications, i.e., functions that are non-decreasing, piecewise continuous, and such that</p><formula xml:id="formula_22">-\infty &lt; \lim_{t \to -\infty} \varphi(t) &lt; \lim_{t \to \infty} \varphi(t) &lt; \infty.<label>(18)</label></formula><p>The activation functions most frequently encountered in MLPs are:</p><p>• the logistic function,</p><formula xml:id="formula_24">(\forall t \in \mathbb{R})\ \varphi(t) = \frac{1}{1 + e^{-t}};<label>(19)</label></formula><p>• the hyperbolic tangent,</p><formula xml:id="formula_25">\varphi(t) = \tanh t = \frac{e^t - e^{-t}}{e^t + e^{-t}}.<label>(20)</label></formula><p>To an output neuron v ∈ O, a somatic mapping of the kind (17) with the activation function (<ref type="formula" target="#formula_24">19</ref>) or (20) can also be assigned. If this is the case, then the class c predicted for a feature vector x is obtained as c = arg max_i (\phi(x))_i, where (\phi(x))_i denotes the i-th component of \phi(x). Alternatively, the activation function assigned to an output neuron can be the step function, aka the Heaviside function,</p><formula xml:id="formula_26">\varphi(t) = 0 \text{ if } t &lt; 0, \qquad \varphi(t) = 1 \text{ if } t \ge 0.<label>(21)</label></formula><p>In that case, the value (\phi(x))_c already directly indicates whether x belongs to the class c.</p></div>
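<div xmlns="http://www.tei-c.org/ns/1.0"><p>The forward computation defined by (16), (17) and (19) can be sketched compactly as follows; the weight matrices collect the per-connection weights w_{(u,v)}, and all names are illustrative.</p><code lang="python">
import numpy as np

def logistic(t):
    """Logistic activation function (19)."""
    return 1.0 / (1.0 + np.exp(-t))

def mlp_forward(x, weights, biases):
    """Forward pass of an MLP: each layer applies the somatic mapping (17)
    to the weighted outputs (16) of the previous layer."""
    a = x
    for W, b in zip(weights, biases):
        a = logistic(W @ a + b)
    return a

# With sigmoidal outputs, the predicted class is the largest component:
# c = int(np.argmax(mlp_forward(x, weights, biases)))
</code></div>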
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Classification Trees (CT)</head><p>A classifier \phi : X → C = {c_1, . . . , c_m} is called a binary classification tree if there is a binary tree T_\phi = (V_\phi, E_\phi) with vertices V_\phi and edges E_\phi such that:</p><p>(i) V_\phi = {v_1, . . . , v_L, . . . , v_{2L-1}}, where L \ge 2, v_1 is the root of T_\phi, v_1, . . . , v_{L-1} are its forks and v_L, . . . , v_{2L-1} are its leaves.</p><p>(ii) If the children of a fork v ∈ {v_1, . . . , v_{L-1}} are v_ℓ ∈ V_\phi (left child) and v_r ∈ V_\phi (right child), and if v = v_i, v_ℓ = v_j, v_r = v_k, then i &lt; j &lt; k.</p><p>(iii) To each fork v ∈ {v_1, . . . , v_{L-1}}, a predicate \varphi_v of some formal logic is assigned, evaluated on features of the input vectors x ∈ X .</p><p>(iv) To each leaf v ∈ {v_L, . . . , v_{2L-1}}, a class c_v ∈ C is assigned.</p><p>(v) For each input x ∈ X , the predicate \varphi_{v_1} assigned to the root is evaluated.</p><p>(vi) If for a fork v ∈ {v_1, . . . , v_{L-1}} the predicate \varphi_v evaluates to true, then \phi(x) = c_{v_ℓ} in case the left child v_ℓ is already a leaf, and the predicate \varphi_{v_ℓ} is evaluated in case v_ℓ is still a fork.</p><p>(vii) If for a fork v ∈ {v_1, . . . , v_{L-1}} the predicate \varphi_v evaluates to false, then \phi(x) = c_{v_r} in case the right child v_r is already a leaf, and the predicate \varphi_{v_r} is evaluated in case v_r is still a fork.</p></div>
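<div xmlns="http://www.tei-c.org/ns/1.0"><p>The evaluation procedure (v)-(vii) amounts to a simple traversal, sketched below with an illustrative dict encoding of forks and leaves (not a representation used by any particular library).</p><code lang="python">
def tree_classify(x, node):
    """Evaluate a binary classification tree as in (v)-(vii). A fork is a
    dict {'predicate': lambda x: x[3] >= 0.5, 'left': ..., 'right': ...};
    a leaf is a dict {'cls': 'anger'} holding its assigned class."""
    while 'cls' not in node:
        # true goes to the left child, false to the right child
        node = node['left'] if node['predicate'](x) else node['right']
    return node['cls']
</code></div>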
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5">Random Forests (RF)</head><p>Random forests are ensembles of classifiers in which the individual members are classification trees. They are constructed by bagging, i.e., bootstrap aggregation of individual trees, which consists in training each member of the ensemble with a different set of training data, sampled randomly with replacement from the original training pairs (x_1, c_1), . . . , (x_p, c_p). Typical sizes of random forests encountered in applications are dozens to thousands of trees. Subsequently, when new subjects are input to the forest, each tree classifies them separately, according to the leaves at which they end, and the final classification by the forest is obtained by means of an aggregation function. The usual aggregation function of random forests is majority voting, or one of its fuzzy generalizations.</p><p>According to the kind of randomness involved in the construction of the ensemble, two broad groups of random forests can be differentiated:</p><p>1. Random forests grown in the full input space. Each tree is trained using all considered input features. Consequently, any feature has to be taken into account when looking for the split condition assigned to an inner node of the tree. However, the features actually occurring in the split conditions can differ from tree to tree, as a consequence of the fact that each tree is trained with a different set of training data. For the same reason, even if a particular feature occurs in the split conditions of two different trees, those conditions can be assigned to nodes at different levels of the tree.</p><p>A great advantage of this kind of random forest is that each tree is trained using all the information available in its set of training data. Its main disadvantage is high computational complexity. In addition, if several or even only one variable are very noisy, that noise nonetheless gets incorporated into all trees in the forest. Because of those disadvantages, random forests are grown in the complete input space primarily if its dimension is not high and no input feature is substantially noisier than the remaining ones.</p><p>2. Random forests grown in subspaces of the input space. Each tree is trained using only a randomly chosen fraction of the features, typically a small one. This means that a tree t is actually trained with projections of the training data into a low-dimensional space spanned by some randomly selected dimensions i_{t,1} ≤ · · · ≤ i_{t,d_t} ∈ {1, . . . , d}, where d is the dimension of the input space and d_t is typically much smaller than d. Using only a subset of features not only makes forest training much faster, but also makes it possible to eliminate noise originating from only several features. The price paid for both these advantages is that training makes use of only a part of the information available in the training data.</p></div>
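<div xmlns="http://www.tei-c.org/ns/1.0"><p>Both construction variants can be sketched on top of scikit-learn trees as follows; this is our own illustration (X being a p × d numpy array and c an array of class labels), not the Matlab implementation used in the experiments. Passing n_feats selects the subspace variant, omitting it gives the full-input-space variant.</p><code lang="python">
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_forest(X, c, n_trees=50, n_feats=None, rng=None):
    """Bagging sketch: each tree sees a bootstrap sample of the rows and,
    optionally, a random subspace of the feature columns (group 2 above)."""
    rng = rng or np.random.default_rng(0)
    p, d = X.shape
    forest = []
    for _ in range(n_trees):
        rows = rng.integers(0, p, size=p)          # sampling with replacement
        cols = (np.arange(d) if n_feats is None
                else rng.choice(d, size=n_feats, replace=False))
        tree = DecisionTreeClassifier().fit(X[np.ix_(rows, cols)], c[rows])
        forest.append((tree, cols))
    return forest

def forest_classify(x, forest):
    """Aggregation by majority voting over the individual trees."""
    votes = [tree.predict(x[cols].reshape(1, -1))[0] for tree, cols in forest]
    return max(set(votes), key=votes.count)
</code></div>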
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.6">Long Short-Term Memory (LSTM)</head><p>An LSTM network is used for the classification of sequences of feature vectors, or equivalently, multidimensional time series with discrete time. Alternatively, it can also be employed to obtain sequences of such classifications, i.e., in situations where the neural network input is a sequence of feature vectors and its output is a sequence of classes. Differently from most other commonly encountered kinds of artificial neural networks, an LSTM layer connects not simple neurons, but units with their own inner structure. Several variants of LSTM have been proposed (e.g., <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6]</ref>); all of them include at least the following four kinds of units described below. Each of them has certain properties of usual ANN neurons; in particular, the values assigned to them depend, apart from a bias, on values assigned to the unit input at the same time step and on values assigned to the unit output at the previous time step. Hence, a network with LSTM layers is a recurrent network.</p><p>(i) Memory cells can store values, aka cell states, for an arbitrary time. They have no activation function, thus their output is actually a biased linear combination of unit inputs and of the values from the previous time step coming through recurrent connections.</p><p>(ii) The input gate controls the extent to which values from the previous unit or from the preceding layer influence the value stored in the memory cell. It has a sigmoidal activation function, which is applied to a biased linear combination of unit inputs and of values from the previous time step, though the bias and synaptic weights of the input and recurrent connections are specific and in general different from the bias and synaptic weights of the memory cell.</p><p>(iii) The forget gate controls the extent to which the memory cell state is suppressed. It again has a sigmoidal activation function, which is applied to a specific biased linear combination of unit inputs and of values from the previous time step.</p><p>(iv) The output gate controls the extent to which the memory cell state influences the unit output. This gate also has a sigmoidal activation function, which is applied to a specific biased linear combination of unit inputs and of values from the previous time step, and subsequently composed either directly with the cell state or with its sigmoidal transformation, using another sigmoid than is used by the gates.</p></div>
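<div xmlns="http://www.tei-c.org/ns/1.0"><p>One time step of an LSTM unit, following (i)-(iv) above, can be sketched as follows. The parameterization (one input weight matrix W, one recurrent weight matrix U and one bias b per component) is illustrative; published variants differ in such details, e.g., in applying a sigmoidal transformation to the cell state before the output gate.</p><code lang="python">
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def lstm_step(x_t, h_prev, c_prev, P):
    """One time step of an LSTM unit; P is a dict of weights and biases."""
    i = sigmoid(P['Wi'] @ x_t + P['Ui'] @ h_prev + P['bi'])  # (ii) input gate
    f = sigmoid(P['Wf'] @ x_t + P['Uf'] @ h_prev + P['bf'])  # (iii) forget gate
    o = sigmoid(P['Wo'] @ x_t + P['Uo'] @ h_prev + P['bo'])  # (iv) output gate
    # (i) memory cell input: biased linear combination, no activation function
    g = P['Wc'] @ x_t + P['Uc'] @ h_prev + P['bc']
    c_t = f * c_prev + i * g     # suppressed old state plus gated new input
    h_t = o * c_t                # output gate composed directly with the state
    return h_t, c_t
</code></div>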
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Experimental Testing</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Berlin Database of Emotional Speech</head><p>For the evaluation of the already implemented classifiers, we used the publicly available dataset "EmoDB", aka the Berlin database of emotional speech. It consists of 535 emotional utterances in 7 emotional categories, namely anger, boredom, disgust, fear, happiness, sadness and neutral. These utterances are sentences read by 10 professional actors, 5 male and 5 female <ref type="bibr" target="#b0">[1]</ref>, which were recorded in an anechoic chamber under the supervision of linguists and psychologists. The actors were advised to read these predefined sentences in the targeted emotional categories, but the sentences do not contain any emotional bias. A human perception test was conducted with 20 persons, different from the speakers, in order to evaluate the quality of the recorded data with respect to the recognisability and naturalness of the presented emotion. This evaluation yielded a mean accuracy of 86% over all emotional categories.</p></div>
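<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a practical aside, the distributed EmoDB recordings encode the speaker, text and emotion in the file name (e.g., 03a01Fa.wav), with the emotion letter derived from the German emotion word; this naming scheme is an assumption about the downloaded data, not something stated in the paper. A hedged loading sketch:</p><code lang="python">
# Assumed EmoDB file-name layout: positions 0-1 speaker id, 2-4 text code,
# 5 emotion letter (German initial), 6 version letter.
EMOTIONS = {'W': 'anger', 'L': 'boredom', 'E': 'disgust', 'A': 'fear',
            'F': 'happiness', 'T': 'sadness', 'N': 'neutral'}

def parse_emodb_name(filename):
    """Return (speaker_id, emotion) for one EmoDB recording, e.g.
    parse_emodb_name('03a01Fa.wav') == ('03', 'happiness')."""
    stem = filename.rsplit('.', 1)[0]
    return stem[:2], EMOTIONS[stem[5]]
</code></div>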
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Experimental Settings</head><p>As input features, the outputs from the Sound Description Toolbox were used. Consequently, the input dimension was 187. The already implemented classifiers were compared by means of a 10-fold cross-validation, using the following settings for each of them:</p><p>• For the k nearest neighbours classification, the value k = 9 was chosen by a grid method from {1, . . . , 80}. This classifier was applied to data normalized to zero mean and unit variance.</p><p>• Support vector machines were constructed for each of the 7 considered emotions, to classify between that emotion and all the remaining ones. They employ auto-scaled Gaussian kernels and do not use slack variables.</p><p>• The MLP has 1 hidden layer with 70 neurons. Hence, taking into account the input dimension and the number of classes, the overall architecture of the MLP is 187-70-7.</p><p>• Classification trees are restricted to have at most 23 leaves. This upper limit was chosen by a grid method from {1, . . . , 50}, taking into account the way classification trees are grown in their Matlab implementation.</p><p>• Random forests consist of 50 classification trees, each of them subject to the above restriction on the number of leaves. The number of trees was selected by a grid method from {10, 20, . . . , 100}.</p></div>
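<div xmlns="http://www.tei-c.org/ns/1.0"><p>For readers who want to approximate these settings outside the original Matlab environment, a scikit-learn sketch could look as follows. The mapping of "auto-scaled Gaussian kernels" to gamma='scale', of "no slack variables" to a very large C, and the logistic MLP activation are our assumptions, not statements from the experiments.</p><code lang="python">
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

classifiers = {
    # kNN on data normalized to zero mean and unit variance
    'kNN': make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=9)),
    # one SVM per emotion against the rest; large C approximates a hard margin
    'SVM': OneVsRestClassifier(SVC(kernel='rbf', gamma='scale', C=1e6)),
    # architecture 187-70-7 (input and output sizes are inferred from data)
    'MLP': MLPClassifier(hidden_layer_sizes=(70,), activation='logistic'),
    'DT': DecisionTreeClassifier(max_leaf_nodes=23),
    'RF': RandomForestClassifier(n_estimators=50, max_leaf_nodes=23),
}

# X: 535 x 187 feature matrix, y: emotion labels (cf. Subsection 5.1)
# for name, clf in classifiers.items():
#     print(name, cross_val_score(clf, X, y, cv=10).mean())
</code></div>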
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Comparison of Already Implemented Classifiers</head><p>First, we compared the already implemented classifiers on the whole Berlin database of emotional speech, with respect to accuracy and area under the ROC curve (area under curve, AUC). Since a ROC curve makes sense only for a binary classifier, we computed the areas under 7 separate curves corresponding to classifiers always classifying 1 emotion against the rest. The results are presented in Table 1 and in Figure <ref type="figure" target="#fig_4">1</ref>. They clearly show SVM to be the most promising classifier. It has the highest accuracy, and also the highest AUC for the binary classifiers corresponding to 5 of the 7 emotions.</p><p>Then we compared the classifiers separately on the utterances of each of the 10 speakers who created the database. The results are summarized in Table <ref type="table" target="#tab_1">2</ref> for accuracy and in Table <ref type="table">3</ref> for the AUC averaged over all 7 emotions. They indicate a great difference between most of the compared classifiers. This is confirmed by the Friedman test of the hypotheses that all classifiers have equal accuracy and equal average AUC. The Friedman test rejected both hypotheses with a high significance: with the Holm correction for simultaneously tested hypotheses <ref type="bibr" target="#b8">[9]</ref>, the achieved significance level (aka p-value) was 4 · 10^-6. For both hypotheses, post-hoc tests according to <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4]</ref> were performed, testing equal accuracy and equal average AUC between individual pairs of classifiers. At the family-wise significance level 5%, they reveal the following Holm-corrected significant differences between individual pairs of classifiers: (SVM,DT) and (MLP,DT) both for accuracy and averaged AUC, and in addition (kNN,SVM) and (SVM,RF) for accuracy.</p></div>
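<div xmlns="http://www.tei-c.org/ns/1.0"><p>This statistical protocol can be sketched in a few lines, combining scipy's Friedman test with Holm's step-down procedure <ref type="bibr" target="#b8">[9]</ref>; the per-speaker accuracy arrays acc[...] are placeholders for the values behind Tables 2 and 3.</p><code lang="python">
import numpy as np
from scipy.stats import friedmanchisquare

def holm_correction(p_values, alpha=0.05):
    """Holm's sequentially rejective procedure [9]: sort the p-values and
    compare the i-th smallest against alpha / (m - i + 1), stopping at the
    first non-rejection. Returns a boolean rejection mask."""
    p_values = np.asarray(p_values)
    order = np.argsort(p_values)
    m = len(p_values)
    rejected = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(order):
        if p_values[idx] > alpha / (m - rank):
            break
        rejected[idx] = True
    return rejected

# acc[c] = accuracies of classifier c on the 10 speaker-specific parts
# stat, p = friedmanchisquare(acc['kNN'], acc['SVM'], acc['MLP'],
#                             acc['DT'], acc['RF'])
</code></div>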
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion</head><p>The presented work in progress investigated the possibilities of analysing emotions in utterances based on MPEG-7 features. So far, we have implemented only five classification methods, not using time-series features but only the 187 scalar features: the k nearest neighbours classifier, support vector machines, multilayer perceptrons, decision trees and random forests. The obtained results indicate that especially support vector machines and multilayer perceptrons are quite successful for this task. Statistical testing confirms significant differences between these two kinds of classifiers on the one hand, and decision trees and random forests on the other hand. The next step in this ongoing research is to implement the long short-term memory neural network, recalled in Subsection 4.6, because it can work not only with scalar features but also with features represented by time series.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3 :</head><label>3</label><figDesc>Comparison between pairs of implemented classifiers with respect to the AUC averaged over all 7 emotions, based on 10 independent parts of the Berlin database of emotional speech corresponding to 10 different speakers. The result in a cell of the table indicates on how many parts the AUC of the row classifier was higher : on how many parts the AUC of the column classifier was higher. A result in bold indicates that after the Friedman test rejected the hypothesis of equal AUC of all classifiers, the post-hoc test according to <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4]</ref> rejects the hypothesis of equal AUC of the particular row and column classifiers. All simultaneously tested hypotheses were corrected in accordance with Holm <ref type="bibr" target="#b8">[9]</ref>.</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: ROC curve for all emotions on the whole Berlin database</figDesc><graphic coords="8,72.99,84.34,238.01,178.51" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Accuracy and area under curve (AUC) of the implemented classifiers on the whole Berlin database of emotional speech. AUC is measured for binary classification of each of the considered 7 emotions against the rest</figDesc><table><row><cell>Classifier</cell><cell>Accuracy</cell><cell>Anger</cell><cell>Boredom</cell><cell>Disgust</cell><cell>Fear</cell><cell>Happiness</cell><cell>Neutral</cell><cell>Sadness</cell></row><row><cell>kNN</cell><cell>0.73</cell><cell>0.956</cell><cell>0.933</cell><cell>0.901</cell><cell>0.902</cell><cell>0.856</cell><cell>0.962</cell><cell>0.995</cell></row><row><cell>SVM</cell><cell>0.93</cell><cell>0.979</cell><cell>0.973</cell><cell>0.966</cell><cell>0.983</cell><cell>0.904</cell><cell>0.974</cell><cell>0.997</cell></row><row><cell>MLP</cell><cell>0.78</cell><cell>0.977</cell><cell>0.969</cell><cell>0.964</cell><cell>0.969</cell><cell>0.933</cell><cell>0.983</cell><cell>0.996</cell></row><row><cell>DT</cell><cell>0.59</cell><cell>0.871</cell><cell>0.836</cell><cell>0.772</cell><cell>0.782</cell><cell>0.683</cell><cell>0.855</cell><cell>0.865</cell></row><row><cell>RF</cell><cell>0.71</cell><cell>0.962</cell><cell>0.949</cell><cell>0.920</cell><cell>0.921</cell><cell>0.882</cell><cell>0.972</cell><cell>0.992</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 :</head><label>2</label><figDesc>Comparison between pairs of implemented classifiers with respect to accuracy, based on 10 independent parts of the Berlin database of emotional speech corresponding to 10 different speakers. The result in a cell of the table indicates on how many parts the accuracy of the row classifier was higher : on how many parts the accuracy of the column classifier was higher. A result in bold indicates that after the Friedman test rejected the hypothesis of equal accuracy of all classifiers, the post-hoc test according to <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4]</ref> rejects the hypothesis of equal accuracy of the particular row and column classifiers. All simultaneously tested hypotheses were corrected in accordance with Holm <ref type="bibr" target="#b8">[9]</ref>.</figDesc><table /></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgement</head><p>The research reported in this paper has been supported by the Czech Science Foundation (GA ČR) grant 18-18080S.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">A database of german emotional speech</title>
		<author>
			<persName><forename type="first">F</forename><surname>Burkhardt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Paeschke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rolfes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Sendlmeier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Weiss</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Interspeech</title>
				<imprint>
			<date type="published" when="2005">2005</date>
			<biblScope unit="page" from="1517" to="1520" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Casey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>De Cheveigne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Gardner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Jackson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Peeters</surname></persName>
		</author>
		<ptr target="http://mpeg7.doc.gold.ac.uk/" />
		<title level="m">MPEG-7 multimedia software resources</title>
				<imprint>
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Statistical comparisons of classifiers over multiple data sets</title>
		<author>
			<persName><forename type="first">J</forename><surname>Demšar</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="page" from="1" to="30" />
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">An extension on &quot;Statistical Comparisons of Classifiers over Multiple Data Sets&quot; for all pairwise comparisons</title>
		<author>
			<persName><forename type="first">S</forename><surname>Garcia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Herrera</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page" from="2677" to="2694" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Learning to forget: Continual prediction with LSTM</title>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">A</forename><surname>Gers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schmidhuber</surname></persName>
		</author>
		<author>
<persName><forename type="first">F</forename><surname>Cummins</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">9th International Conference on Artificial Neural Networks: ICANN &apos;99</title>
				<imprint>
			<date type="published" when="1999">1999</date>
			<biblScope unit="page" from="850" to="855" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Supervised Sequence Labelling with Recurrent Neural Networks</title>
		<author>
			<persName><forename type="first">A</forename><surname>Graves</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2008">2008</date>
		</imprint>
		<respStmt>
			<orgName>TU München</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">PhD thesis</note>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">The Elements of Statistical Learning</title>
		<author>
			<persName><forename type="first">T</forename><surname>Hastie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Tibshirani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Friedman</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2008">2008</date>
			<publisher>Springer</publisher>
		</imprint>
	</monogr>
	<note>2nd Edition</note>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Long short-term memory</title>
		<author>
			<persName><forename type="first">S</forename><surname>Hochreiter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schmidhuber</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Neural Computation</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page" from="1735" to="1780" />
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">A simple sequentially rejective multiple test procedure</title>
		<author>
			<persName><forename type="first">S</forename><surname>Holm</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Scandinavian Journal of Statistics</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="page" from="65" to="70" />
			<date type="published" when="1979">1979</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval</title>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">G</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Moreau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Sikora</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2005">2005</date>
			<publisher>John Wiley and Sons</publisher>
			<pubPlace>New York</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Speech emotion recognition</title>
		<author>
			<persName><forename type="first">S</forename><surname>Lalitha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Madhavan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Bhusan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Saketh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Advances in Electronics</title>
				<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="92" to="95" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Evaluation of MPEG-7 descriptors for speech emotional recognition</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">S</forename><surname>Lampropoulos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">A</forename><surname>Tsihrintzis</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Eighth International Conference on Intelligent Information Hiding and Multimedia Signal Processing</title>
				<imprint>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="98" to="101" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">MUSCLE network of excellence: Multimedia understanding through semantics, computation and learning</title>
		<author>
			<persName><forename type="first">A</forename><surname>Rauber</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lidy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Frank</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Benetos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Zenz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Bertini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Virtanen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">T</forename><surname>Cemgil</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Godsill</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Peeling</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Peisyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Laprie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sloin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Alfandary</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Burshtein</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2004">2004</date>
		</imprint>
		<respStmt>
			<orgName>TU Vienna, Information and Software Engineering Group</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Technical report</note>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">Learning with Kernels</title>
		<author>
			<persName><forename type="first">B</forename><surname>Schölkopf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">J</forename><surname>Smola</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2002">2002</date>
			<publisher>MIT Press</publisher>
			<pubPlace>Cambridge</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><surname>Sikora</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">G</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Moreau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Amjad</surname></persName>
		</author>
		<ptr target="http://mpeg7lld.nue.tu-berlin.de/" />
		<title level="m">MPEG-7-based audio annotation for the archival of digital video</title>
				<imprint>
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Multi-stage classification of emotional speech motivated by a dimensional emotion model</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Xiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Dellandrea</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Dou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Chen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Multimedia Tools and Applicaions</title>
		<imprint>
			<biblScope unit="volume">46</biblScope>
			<biblScope unit="page" from="119" to="145" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
