S. Krajči (ed.): ITAT 2018 Proceedings, pp. 92–99, CEUR Workshop Proceedings Vol. 2203, ISSN 1613-0073, © 2018 Jiří Kožusznik, Petr Pulc, and Martin Holeňa

Sentiment Analysis from Utterances

Jiří Kožusznik1, Petr Pulc1,2, Martin Holeňa2

1 Faculty of Information Technology, Czech Technical University, Thákurova 7, Prague, Czech Republic
2 Institute of Computer Science, Czech Academy of Sciences, Pod vodárenskou věží 2, Prague, Czech Republic

Abstract: The recognition of emotional states in speech is starting to play an increasingly important role. However, it is a complicated process, which heavily relies on the extraction and selection of utterance features related to the emotional state of the speaker. In the reported research, MPEG-7 low-level audio descriptors [10] serve as features for the recognition of emotional categories. To this end, a methodology combining MPEG-7 with several important kinds of classifiers is elaborated.

1 Introduction

The recognition of emotional states in speech is expected to play an increasingly important role in applications such as media retrieval systems, car management systems, call center applications, personal assistants and the like. In many languages it is common that the meaning of spoken words changes depending on the speaker's emotions; consequently, the emotional information is important in order to understand the intended meaning. Emotional speech recognition is a complicated process. Its performance heavily relies on the extraction and selection of features related to the emotional state of the speaker in the audio signal of an utterance.

In the reported work in progress, we use MPEG-7 low-level audio descriptors [10] as features for the recognition of emotional categories. To this end, we elaborate a methodology combining MPEG-7 with several important kinds of classifiers. For most of them, the methodology has already been implemented and experimentally tested with the publicly available Berlin Database of Emotional Speech [1].

In the next section, the task of sentiment analysis from utterances is briefly sketched. Section 3 recalls the necessary background concerning MPEG-7 audio descriptors and the considered classification methods. In Section 4, the principles of the proposed approach are explained. Finally, Section 5 presents results of experimental testing and comparison of the already implemented classifiers on the publicly available Berlin database of emotional speech.

2 Sentiment Analysis from Utterances

Due to the importance of recognizing emotional states in speech, research into sentiment analysis from utterances has been emerging during recent years. We are aware of 3 publications reporting research with the same database of emotional utterances as used in our research – the Berlin Database of Emotional Speech. Let us recall each of them.

The research most similar to ours has been reported in [12], where the authors also used MPEG-7 descriptors for sentiment analysis from utterances. However, they used only scalar MPEG-7 descriptors or scalars derived from time-series descriptors using the software tools Sound Description Toolbox [13] and MPEG-7 Audio Reference Software Toolkit [2], whereas we are also implementing a long short-term memory network that will use the time series directly. They also used only one classifier in their experiments, a combination of a radial basis function network and a support vector machine.

In [11], emotions are recognized using pitch and prosody features, which are mostly in the time domain. Also in that paper, only one classifier was used in the experiments, this time a support vector machine (SVM).

The authors of [16] proposed a set of 68 new features, such as some based on harmonic frequencies or on the Zipf distribution, for better speech emotion recognition. This set of features is used in a multi-stage classification. When performing the sentiment analysis of the Berlin Database, the classification of utterances into the considered emotional categories was preceded by a gender classification of the speakers, and the gender of the speaker was subsequently used as an additional feature for the classification of the utterances.

3 MPEG-7 Audio Descriptors
MPEG-7 is a standard for the low-level description of audio signals, describing a signal by means of the following groups of descriptors [10]:

1. Basic: Audio Power (AP), Audio Waveform (AWF). Temporally sampled scalar values for general use, applicable to all kinds of signals. The AP describes the temporally-smoothed instantaneous power of samples in the frame; in other words, it is a temporal measure of signal content as a function of time and offers a quick summary of a signal in conjunction with other basic spectral descriptors. The AWF describes the audio waveform envelope (minimum and maximum), typically for display purposes.

2. Basic Spectral: Audio Spectrum Envelope (ASE), Audio Spectrum Centroid (ASC), Audio Spectrum Spread (ASS), Audio Spectrum Flatness (ASF). All share a common basis, deriving from the short-term audio signal spectrum (analysis of frequency over time). They are all based on the ASE descriptor, which is a logarithmic-frequency spectrum. This descriptor provides a compact description of the signal spectral content and represents a similar approximation of the logarithmic response of the human ear. The ASE descriptor is an indicator as to whether the spectral content of a signal is dominated by high or low frequencies. The ASC descriptor can be considered as an approximation of the perceptual sharpness of the signal. The ASS descriptor indicates whether the signal content, as represented by the power spectrum, is concentrated around its centroid or spread out over a wider range of the spectrum. This gives a measure which allows the distinction of noise-like sounds from tonal sounds. The ASF describes the flatness properties of the spectrum of an audio signal for each of a number of frequency bands.

3. Basic Signal Parameters: Audio Fundamental Frequency (AFF) and Audio Harmonicity (AH). The signal parameters constitute a simple parametric description of the audio signal. This group includes the computation of an estimate of the fundamental frequency (F0) of the audio signal. The AFF descriptor provides estimates of the fundamental frequency in segments in which the audio signal is assumed to be periodic. The AH represents the harmonicity of a signal, allowing distinction between sounds with a harmonic spectrum (e.g., musical tones or voiced speech such as vowels), sounds with an inharmonic spectrum (e.g., bell-like sounds) and sounds with a non-harmonic spectrum (e.g., noise or unvoiced speech).

4. Temporal Timbral: Log Attack Time (LAT), Temporal Centroid (TC). Timbre refers to features that allow one to distinguish two sounds that are equal in pitch, loudness and subjective duration. These descriptors take into account several perceptual dimensions at the same time in a complex way. Temporal Timbral descriptors describe the signal power function over time. The power function is estimated as a local mean square value of the signal amplitude within a running window. The LAT descriptor characterizes the "attack" of a sound, the time it takes for the signal to rise from silence to its maximum amplitude. This feature signifies the difference between a sudden and a smooth sound. The TC descriptor computes a time-based centroid as the time average over the energy envelope of the signal.

5. Timbral Spectral: Harmonic Spectral Centroid (HSC), Harmonic Spectral Deviation (HSD), Harmonic Spectral Spread (HSS), Harmonic Spectral Variation (HSV) and Spectral Centroid. These are spectral features extracted in a linear-frequency space. The HSC descriptor is defined as the average, over the signal duration, of the amplitude-weighted mean of the frequency of the bins (the harmonic peaks of the spectrum) in the linear power spectrum. It has a high correlation with the perceptual feature of "sharpness" of a sound. The HSD descriptor measures the spectral deviation of the harmonic peaks from the global envelope. The HSS descriptor measures the amplitude-weighted standard deviation (root mean square) of the harmonic peaks of the spectrum, normalized by the HSC. The HSV descriptor is the normalized correlation between the amplitudes of the harmonic peaks in two subsequent time slices of the signal.

6. Spectral Basis: Audio Spectrum Basis (ASB) and Audio Spectrum Projection (ASP).

3.1 Tools for Working with MPEG-7 Descriptors

We utilized the Sound Description Toolbox [13] and the MPEG-7 Audio Analyzer – Low Level Descriptors Extractor [15] for our experiments. Both of them extract a number of MPEG-7 standard descriptors, both scalar ones and time series. In addition, the SDT also calculates perceptual features such as Mel Frequency Cepstral Coefficients, Specific Loudness and Sensation Coefficients. From these descriptors, it calculates means, covariances, means of first-order differences and covariances of first-order differences. The total number of features provided by this toolbox is 187.
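As a rough illustration of how such scalar features can be distilled from descriptor time series, the following Python sketch computes two simplified per-frame descriptors (loosely analogous to the MPEG-7 spectral centroid and flatness) and collapses them into means, covariances and first-order-difference statistics in the spirit of the SDT. The framing parameters and descriptor formulas are our own illustrative assumptions, not the actual implementation of either toolbox.

```python
import numpy as np

def frame_signal(x, frame_len=1024, hop=512):
    """Split a mono signal into overlapping frames."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n)])

def descriptor_time_series(x, sr):
    """Per-frame spectral centroid and flatness (simplified analogues
    of the MPEG-7 ASC and ASF descriptors)."""
    frames = frame_signal(x) * np.hanning(1024)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2 + 1e-12
    freqs = np.fft.rfftfreq(1024, d=1.0 / sr)
    centroid = (power * freqs).sum(axis=1) / power.sum(axis=1)
    flatness = np.exp(np.mean(np.log(power), axis=1)) / power.mean(axis=1)
    return np.stack([centroid, flatness], axis=1)  # shape (n_frames, 2)

def scalar_features(ts):
    """Collapse a descriptor time series into scalars the way the SDT
    does: means, covariances, and the same for first-order differences."""
    d = np.diff(ts, axis=0)
    return np.concatenate([ts.mean(axis=0), np.cov(ts.T).ravel(),
                           d.mean(axis=0), np.cov(d.T).ravel()])

# A synthetic one-second "utterance": a 220 Hz tone plus noise.
sr = 16000
t = np.arange(sr) / sr
utterance = np.sin(2 * np.pi * 220 * t) + 0.1 * np.random.randn(sr)
feats = scalar_features(descriptor_time_series(utterance, sr))
```

With two descriptors this yields 2 + 4 + 2 + 4 = 12 scalar features; applying the same recipe to the full set of SDT descriptors is what produces the 187-dimensional feature vectors used below.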
4 Employed Classification Methods

We have elaborated our approach to sentiment analysis from utterances for six classification methods: k nearest neighbours, support vector machines, multilayer perceptrons, classification trees, random forests [7] and long short-term memory (LSTM) networks [5, 6, 8]. The first five of them have already been implemented and tested (cf. Section 5); the last and most advanced one is still being implemented.

4.1 k Nearest Neighbours (kNN)

A very traditional way of classifying a new feature vector x ∈ X, if a sequence of training data (x_1, c_1), ..., (x_p, c_p) is available, is the nearest neighbour method: take the x_j that is closest to x among x_1, ..., x_p, and assign to x the class assigned to x_j, i.e., c_j.

A straightforward generalization of the nearest neighbour method is to take among x_1, ..., x_p not one, but the k feature vectors x_{j_1}, ..., x_{j_k} closest to x. Then x is assigned the class c ∈ C fulfilling

  |\{i,\ 1 \le i \le k \mid c_{j_i} = c\}| = \max_{c' \in C} |\{i,\ 1 \le i \le k \mid c_{j_i} = c'\}|.  (1)

This method is called, expectedly, k nearest neighbours, or k-NN for short.

4.2 Support Vector Machines (SVM)

Support vector machines are classifiers into two classes. This method attempts to derive from the training data (x_1, c_1), ..., (x_p, c_p) the best possible generalization to unseen feature vectors.

If both classes, more precisely their intersections with the set {x_1, ..., x_p} of training inputs, are linearly separable in the space of feature vectors, the method constructs two parallel hyperplanes H_+ = \{x \in \mathbb{R}^n \mid x^\top w + b_+ = 0\}, H_- = \{x \in \mathbb{R}^n \mid x^\top w + b_- = 0\} such that the training data fulfil

  c_k = \begin{cases} 1 & \text{if } x_k^\top w + b_+ \ge 0, \\ -1 & \text{if } x_k^\top w + b_- \le 0, \end{cases} \qquad k = 1, \dots, p,  (2)

  H_+ \cap \{x_1, \dots, x_p\} \ne \emptyset, \quad H_- \cap \{x_1, \dots, x_p\} \ne \emptyset.  (3)

The hyperplanes H_+ and H_- are called support hyperplanes. Their common normal vector w and intercepts b_+, b_- are obtained by solving the following constrained optimization task: maximize with respect to w, b_+, b_- the distance

  d(H_+, H_-) = \frac{b_+ - b_-}{\|w\|}  (4)

on condition that the p inequalities (2) hold. The distance (4) is commonly called the margin. The solution to this optimization task coincides with the saddle point (w^*, b_+^*, b_-^*, \alpha_1^*, \dots, \alpha_p^*) of the Lagrange function

  L(w, b_+, b_-, \alpha_1, \dots, \alpha_p) = \|w\|^2 + \sum_{k=1}^{p} \alpha_k \left( \frac{b_+ - b_-}{2} - c_k x_k^\top w \right),  (5)

where \alpha_1, \dots, \alpha_p \ge 0 are the Lagrange coefficients of the optimization task. Once the saddle point (w^*, b_+^*, b_-^*, \alpha_1^*, \dots, \alpha_p^*) is found, the classifier is defined by

  \phi(x) = \begin{cases} 1 & \text{if } \sum_{x_k \in S} \alpha_k^* c_k x^\top x_k + b^* \ge 0, \\ -1 & \text{if } \sum_{x_k \in S} \alpha_k^* c_k x^\top x_k + b^* < 0, \end{cases}  (6)

where b^* = \frac{1}{2}(b_+^* + b_-^*) and

  S = \{x_k \mid \alpha_k^* > 0\}.  (7)

Due to the Karush-Kuhn-Tucker (KKT) conditions,

  \alpha_k^* \left( \frac{b_+^* - b_-^*}{2} - c_k x_k^\top w^* \right) = 0, \qquad k = 1, \dots, p,  (8)

all feature vectors from the set S lie on one of the support hyperplanes (3). Therefore, they are called support vectors. This name, together with the observation that they completely determine the classifier defined in (6), explains why such a classifier is called a support vector machine.

If the intersections of both classes with the training inputs are not linearly separable, an SVM is constructed similarly, but instead of the set of possible feature vectors, the set of functions

  \kappa(\cdot, x) \quad \text{for all possible feature vectors } x  (9)

is now considered, where \kappa is a kernel, i.e., a mapping on pairs of feature vectors that is symmetric and such that for any k \in \mathbb{N} and any sequence of different feature vectors x_1, \dots, x_k, the matrix

  G_\kappa(x_1, \dots, x_k) = \begin{pmatrix} \kappa(x_1, x_1) & \dots & \kappa(x_1, x_k) \\ \vdots & & \vdots \\ \kappa(x_k, x_1) & \dots & \kappa(x_k, x_k) \end{pmatrix},  (10)

which is called the Gram matrix of x_1, \dots, x_k, is positive semidefinite, i.e.,

  (\forall y \in \mathbb{R}^k)\ y^\top G_\kappa(x_1, \dots, x_k)\, y \ge 0.  (11)

The most commonly used kinds of kernels are the Gaussian kernel with a parameter \varsigma > 0,

  (\forall x, x' \in \mathbb{R}^n)\ \kappa(x, x') = \exp\left( -\frac{1}{\varsigma} \|x - x'\|^2 \right),  (12)

and the polynomial kernel with parameters d \in \mathbb{N} and c \ge 0,

  (\forall x, x' \in \mathbb{R}^n)\ \kappa(x, x') = (x^\top x' + c)^d.  (13)

It is known [14] that, due to the properties of kernels, if the joint distribution of a sequence of different feature vectors x_1, \dots, x_k is continuous, then almost surely any proper subset of the set of functions \{\kappa(\cdot, x_1), \dots, \kappa(\cdot, x_k)\} is linearly separable from its complement in the space of all functions (9).

However, the feature vectors x and x_k cannot simply be replaced by the corresponding functions \kappa(\cdot, x) and \kappa(\cdot, x_k) in the definition (6) of an SVM classifier, because a transpose x^\top exists for a finite-dimensional vector, but not for an infinite-dimensional function. Fortunately, the transpose occurs in (6) only as a part of the scalar product x^\top x_k, and a scalar product can also be defined on the space of all functions (9). Namely, the properties of a scalar product are possessed by the function that assigns to the pair of functions (\kappa(\cdot, x), \kappa(\cdot, x')) the value \kappa(x, x'). Using this scalar product in (6), we obtain the following definition of an SVM classifier for linearly non-separable classes:

  \phi(x) = \begin{cases} 1 & \text{if } \sum_{x_k \in S} \alpha_k^* c_k \kappa(x, x_k) + b^* \ge 0, \\ -1 & \text{if } \sum_{x_k \in S} \alpha_k^* c_k \kappa(x, x_k) + b^* < 0. \end{cases}  (14)
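To make the kernel formulation concrete, the following sketch implements the Gaussian kernel (12), the Gram matrix (10), and the decision rule (14) in Python. The multipliers α and intercept b are illustrative placeholders, since solving the underlying optimization task is beyond this snippet.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel (12); sigma plays the role of the parameter ς."""
    return np.exp(-np.sum((x - y) ** 2) / sigma)

def gram_matrix(X, kernel):
    """Gram matrix (10) of the rows of X."""
    return np.array([[kernel(a, b) for b in X] for a in X])

def svm_decision(x, support, alpha, c, b, kernel):
    """Decision rule (14): sign of sum_k alpha_k c_k kappa(x, x_k) + b."""
    s = sum(a * ck * kernel(x, xk) for a, ck, xk in zip(alpha, c, support))
    return 1 if s + b >= 0 else -1

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
G = gram_matrix(X, gaussian_kernel)
# Positive semidefiniteness (11): no eigenvalue below zero (up to rounding).
psd = np.linalg.eigvalsh(G).min() >= -1e-8

# Illustrative multipliers and class labels (placeholders, not optimized):
alpha = np.ones(5)
c = np.array([1, -1, 1, -1, 1])
label = svm_decision(rng.normal(size=3), X, alpha, c, 0.0, gaussian_kernel)
```

Note that only the training vectors with α_k > 0 (the support vectors of (7)) contribute to the sum, which is why a trained SVM can discard the rest of the training data.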
4.3 Multilayer Perceptrons (MLP)

A multilayer perceptron is a mapping \phi of feature vectors to classes with which a directed graph G_\phi = (V, E) is associated. Due to the inspiration from biological neural networks, the vertices of G_\phi are called neurons and its edges are called connections. In addition, G_\phi is required to have a layered structure, which means that the set V of neurons can be decomposed into L + 1 mutually disjoint layers, V = V_0 \cup V_1 \cup \dots \cup V_L, L \ge 2, such that

  (\forall (u, v) \in E)\ u \in V_i,\ i = 0, \dots, L-1\ \&\ v \notin V_i \implies v \in V_{i+1}.  (15)

The layer I = V_0 is called the input layer of the MLP, the layer O = V_L its output layer, and the layers H_1 = V_1, \dots, H_{L-1} = V_{L-1} its hidden layers.

The purpose of the graph G_\phi associated with the mapping \phi is to define a decomposition of \phi into simple mappings assigned to hidden and output neurons and to connections between neurons (input neurons normally only accept the components of the input, and no mappings are assigned to them). Inspired by biological terminology, mappings assigned to neurons are called somatic, those assigned to connections are called synaptic.

To each connection (u, v) \in E, multiplication by a weight w_{(u,v)} is assigned as a synaptic mapping:

  (\forall \xi \in \mathbb{R})\ f_{(u,v)}(\xi) = w_{(u,v)} \xi.  (16)

To each hidden neuron v \in H_i, the following somatic mapping is assigned:

  (\forall \xi \in \mathbb{R}^{|in(v)|})\ f_v(\xi) = \varphi\Big( \sum_{u \in in(v)} [\xi]_u + b_v \Big),  (17)

where [\xi]_u for u \in in(v) denotes the component of \xi that is the output of the synaptic mapping f_{(u,v)} assigned to the connection (u, v), in(v) = \{u \in V \mid (u, v) \in E\} is the input set of v, and \varphi : \mathbb{R} \to \mathbb{R} is called the activation function. As activation functions, in applications typically sigmoidal functions are used, i.e., functions that are non-decreasing, piecewise continuous, and such that

  -\infty < \lim_{t \to -\infty} \varphi(t) < \lim_{t \to \infty} \varphi(t) < \infty.  (18)

The activation functions most frequently encountered in MLPs are:

• the logistic function,

  (\forall t \in \mathbb{R})\ \varphi(t) = \frac{1}{1 + e^{-t}};  (19)

• the hyperbolic tangent,

  \varphi(t) = \tanh t = \frac{e^t - e^{-t}}{e^t + e^{-t}}.  (20)

To an output neuron v \in O, a somatic mapping of the kind (17) with the activation functions (19) or (20) can also be assigned. If that is the case, then the class c predicted for a feature vector x is obtained as c = \arg\max_i (\phi(x))_i, where (\phi(x))_i denotes the i-th component of \phi(x). Alternatively, the activation function assigned to an output neuron can be the step function, aka the Heaviside function,

  \varphi(t) = \begin{cases} 0 & \text{if } t < 0, \\ 1 & \text{if } t \ge 0. \end{cases}  (21)

In that case, the value (\phi(x))_c already directly indicates whether x belongs to the class c.

4.4 Classification Trees (CT)

A classifier \phi : X \to C = \{c_1, \dots, c_m\} is called a binary classification tree if there is a binary tree T_\phi = (V_\phi, E_\phi) with vertices V_\phi and edges E_\phi such that:

(i) V_\phi = \{v_1, \dots, v_L, \dots, v_{2L-1}\}, where L \ge 2, v_1 is the root of T_\phi, v_1, \dots, v_{L-1} are its forks and v_L, \dots, v_{2L-1} are its leaves.

(ii) If the children of a fork v \in \{v_1, \dots, v_{L-1}\} are v_L \in V_\phi (left child) and v_R \in V_\phi (right child) and if v = v_i, v_L = v_j, v_R = v_k, then i < j < k.

(iii) To each fork v \in \{v_1, \dots, v_{L-1}\}, a predicate \varphi_v of some formal logic is assigned, evaluated on features of the input vectors x \in X.

(iv) To each leaf v \in \{v_L, \dots, v_{2L-1}\}, a class c_v \in C is assigned.

(v) For each input x \in X, the predicate \varphi_{v_1} assigned to the root is evaluated.

(vi) If for a fork v \in \{v_1, \dots, v_{L-1}\} the predicate \varphi_v evaluates to true, then \phi(x) = c_{v_L} in case v_L is already a leaf, and the predicate \varphi_{v_L} is evaluated in case v_L is still a fork.

(vii) If for a fork v \in \{v_1, \dots, v_{L-1}\} the predicate \varphi_v evaluates to false, then \phi(x) = c_{v_R} in case v_R is already a leaf, and the predicate \varphi_{v_R} is evaluated in case v_R is still a fork.

4.5 Random Forests (RF)

Random forests are ensembles of classifiers in which the individual members are classification trees. They are constructed by bagging, i.e., bootstrap aggregation of individual trees, which consists in training each member of the ensemble with a different set of training data, sampled randomly with replacement from the original training pairs (x_1, c_1), \dots, (x_p, c_p). Typical sizes of random forests encountered in applications are dozens to thousands of trees. Subsequently, when new subjects are input to the forest, each tree classifies them separately, according to the leaves at which they end, and the final classification by the forest is obtained by means of an aggregation function. The usual aggregation function of random forests is majority voting, or some of its fuzzy generalizations.
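A minimal sketch of bagging with majority voting follows; to keep it short, one-split decision stumps stand in for full classification trees (our own simplification, not the Matlab implementation used in the experiments).

```python
import numpy as np

class Stump:
    """A one-split classification tree: threshold on a single feature."""
    def fit(self, X, y):
        best_err = np.inf
        for f in range(X.shape[1]):
            for t in X[:, f]:
                mask = X[:, f] <= t
                if mask.all() or not mask.any():
                    continue  # the split must separate something
                cl = np.bincount(y[mask]).argmax()   # majority class left
                cr = np.bincount(y[~mask]).argmax()  # majority class right
                err = (y[mask] != cl).sum() + (y[~mask] != cr).sum()
                if err < best_err:
                    best_err = err
                    self.f, self.t, self.cl, self.cr = f, t, cl, cr
        return self

    def predict(self, X):
        return np.where(X[:, self.f] <= self.t, self.cl, self.cr)

def bagged_forest(X, y, n_trees=25, seed=0):
    """Bootstrap aggregation: each tree sees a resampled training set."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))  # with replacement
        trees.append(Stump().fit(X[idx], y[idx]))
    return trees

def forest_predict(trees, X):
    """Majority voting over the individual trees."""
    votes = np.stack([t.predict(X) for t in trees])
    return np.array([np.bincount(col).argmax() for col in votes.T])

# Toy two-class data: two well-separated Gaussian blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, size=(20, 2)), rng.normal(2, 1, size=(20, 2))])
y = np.array([0] * 20 + [1] * 20)
trees = bagged_forest(X, y, n_trees=15)
acc = (forest_predict(trees, X) == y).mean()
```

Even though each bootstrap replica yields a slightly different tree, the vote of the ensemble is more stable than any single member, which is the point of bagging.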
According to which kind of randomness is involved in the construction of the ensemble, two broad groups of random forests can be differentiated:

1. Random forests grown in the full input space. Each tree is trained using all considered input features. Consequently, any feature has to be taken into account when looking for the split condition assigned to an inner node of the tree. However, the features actually occurring in the split conditions can be different from tree to tree, as a consequence of the fact that each tree is trained with a different set of training data. For the same reason, even if a particular feature occurs in split conditions of two different trees, those conditions can be assigned to nodes at different levels of the tree. A great advantage of this kind of random forests is that each tree is trained using all the information available in its set of training data. Its main disadvantage is high computational complexity. In addition, if several or even only one variable are very noisy, that noise nonetheless gets incorporated into all trees in the forest. Because of those disadvantages, random forests are grown in the complete input space primarily if its dimension is not high and no input feature is substantially noisier than the remaining ones.

2. Random forests grown in subspaces of the input space. Each tree is trained using only a randomly chosen fraction of the features, typically a small one. This means that a tree t is actually trained with projections of the training data into a low-dimensional space spanned by some randomly selected dimensions i_{t,1} \le \dots \le i_{t,d_t} \in \{1, \dots, d\}, where d is the dimension of the input space and d_t is typically much smaller than d. Using only a subset of features not only makes forest training much faster, but also allows eliminating noise originating from only several features. The price paid for both these advantages is that training makes use of only a part of the information available in the training data.

4.6 Long Short-Term Memory (LSTM)

An LSTM network is used for the classification of sequences of feature vectors, or equivalently, multidimensional time series with discrete time. Alternatively, it can also be employed to obtain sequences of such classifications, i.e., in situations when the neural network input is a sequence of feature vectors and its output is a sequence of classes. Differently from most other commonly encountered kinds of artificial neural networks, an LSTM layer connects not simple neurons, but units with their own inner structure. Several variants of an LSTM have been proposed (e.g., [5, 6]); all of them include at least the following four kinds of units, described below. Each of them has certain properties of usual ANN neurons; in particular, the values assigned to them depend, apart from a bias, on values assigned to the unit input at the same time step and on values assigned to the unit output at the previous time step. Hence, an LSTM network layer is a recurrent network.

(i) Memory cells can store values, aka cell states, for an arbitrary time. They have no activation function, thus their output is actually a biased linear combination of unit inputs and of the values from the previous time step coming through recurrent connections.

(ii) The input gate controls the extent to which values from the previous unit or from the preceding layer influence the value stored in the memory cell. It has a sigmoidal activation function, which is applied to a biased linear combination of unit inputs and of values from the previous time step, though the bias and synaptic weights of the input and recurrent connections are specific and in general different from the bias and synaptic weights of the memory cell.

(iii) The forget gate controls the extent to which the memory cell state is suppressed. It again has a sigmoidal activation function, which is applied to a specific biased linear combination of unit inputs and of values from the previous time step.

(iv) The output gate controls the extent to which the memory cell state influences the unit output. Also this gate has a sigmoidal activation function, which is applied to a specific biased linear combination of unit inputs and of values from the previous time step, and subsequently composed either directly with the cell state or with its sigmoidal transformation, using another sigmoid than is used by the gates.

5 Experimental Testing

5.1 Berlin Database of Emotional Speech

For the evaluation of the already implemented classifiers, we used the publicly available dataset "EmoDB", aka the Berlin database of emotional speech. It consists of 535 emotional utterances in 7 emotional categories, namely anger, boredom, disgust, fear, happiness, sadness and neutral. These utterances are sentences read by 10 professional actors, 5 males and 5 females [1], which were recorded in an anechoic chamber under the supervision of linguists and psychologists. The actors were advised to read these predefined sentences in the targeted emotional categories, but the sentences do not contain any emotional bias. A human perception test was conducted with 20 persons, different from the speakers, in order to evaluate the quality of the recorded data with respect to the recognisability and naturalness of the presented emotion. This evaluation yielded a mean accuracy of 86% over all emotional categories.

5.2 Experimental Settings
As input features, the outputs from the Sound Description Toolbox were used. Consequently, the input dimension was 187. The already implemented classifiers were compared by means of a 10-fold cross-validation, using the following settings for each of them:

• For the k nearest neighbours classification, the value k = 9 was chosen by a grid method from ⟨1, 80⟩. This classifier was applied to data normalized to zero mean and unit variance.

• Support vector machines are constructed for each of the 7 considered emotions, to classify between that emotion and all the remaining ones. They employ auto-scaled Gaussian kernels and do not use slack variables.

• The MLP has 1 hidden layer with 70 neurons. Hence, taking into account the input dimension and the number of classes, the overall architecture of the MLP is 187-70-7.

• Classification trees are restricted to have at most 23 leaves. This upper limit was chosen by a grid method from ⟨1, 50⟩, taking into account the way classification trees are grown in their Matlab implementation.

• Random forests consist of 50 classification trees, each of them taking over the above restriction. The number of trees was selected by a grid method from 10, 20, ..., 100.

5.3 Comparison of Already Implemented Classifiers

First, we compared the already implemented classifiers on the whole Berlin database of emotional speech, with respect to accuracy and area under the ROC curve (area under curve, AUC). Since a ROC curve makes sense only for a binary classifier, we computed the areas under 7 separate curves corresponding to classifiers classifying always 1 emotion against the rest. The results are presented in Table 1 and in Figure 1. They clearly show SVM as the most promising classifier. It has the highest accuracy, and also the highest AUC for the binary classifiers corresponding to 5 of the 7 emotions.

Table 1: Accuracy and area under curve (AUC) of the implemented classifiers on the whole Berlin database of emotional speech. AUC is measured for binary classification of each of the considered 7 emotions against the rest.

  Classifier  Accuracy  Anger  Boredom  Disgust  Fear   Happiness  Neutral  Sadness
  kNN         0.73      0.956  0.933    0.901    0.902  0.856      0.962    0.995
  SVM         0.93      0.979  0.973    0.966    0.983  0.904      0.974    0.997
  MLP         0.78      0.977  0.969    0.964    0.969  0.933      0.983    0.996
  DT          0.59      0.871  0.836    0.772    0.782  0.683      0.855    0.865
  RF          0.71      0.962  0.949    0.920    0.921  0.882      0.972    0.992

Then we compared the classifiers separately on the utterances of each of the 10 speakers who created the database. The results are summarized in Table 2 for accuracy and in Table 3 for AUC averaged over all 7 emotions.

Table 2: Comparison between pairs of implemented classifiers with respect to accuracy, based on 10 independent parts of the Berlin database of emotional speech corresponding to 10 different speakers. The result in a cell of the table indicates on how many parts the accuracy of the row classifier was higher : on how many parts the accuracy of the column classifier was higher. A result in bold indicates that after the Friedman test rejected the hypothesis of equal accuracy of all classifiers, the post-hoc test according to [3, 4] rejects the hypothesis of equal accuracy of the particular row and column classifiers. All simultaneously tested hypotheses were corrected in accordance with Holm [9].

  classifier  kNN      SVM   MLP      DT    RF
  kNN         –        0:10  3.5:6.5  9:1   5:5
  SVM         10:0     –     10:0     10:0  10:0
  MLP         6.5:3.5  0:10  –        10:0  7:3
  DT          1:9      0:10  0:10     –     0:10
  RF          5:5      0:10  3:7      10:0  –
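The Friedman statistic underlying this pairwise comparison can be sketched as follows; the per-speaker scores below are made-up illustrative numbers, not the measured results of the paper.

```python
import numpy as np

def avg_ranks(row):
    """Ranks 1..k within one dataset; ties receive their average rank."""
    order = np.argsort(row)
    ranks = np.empty(len(row))
    ranks[order] = np.arange(1, len(row) + 1)
    for v in np.unique(row):
        ranks[row == v] = ranks[row == v].mean()
    return ranks

def friedman_statistic(scores):
    """Friedman chi-square statistic for an N x k table of scores
    (N datasets as rows, k classifiers as columns), cf. [3]."""
    N, k = scores.shape
    ranks = np.apply_along_axis(avg_ranks, 1, scores)
    mean_ranks = ranks.mean(axis=0)
    return 12 * N / (k * (k + 1)) * np.sum((mean_ranks - (k + 1) / 2) ** 2)

# Hypothetical per-speaker accuracies of 3 classifiers on 10 speakers
# (illustrative values only): the middle classifier always wins.
scores = np.array([[0.70, 0.92, 0.60]] * 10) + 0.01 * np.arange(10)[:, None]
chi2 = friedman_statistic(scores)  # large value => reject equal performance
```

A large statistic (compared against a chi-square distribution with k − 1 degrees of freedom) rejects the hypothesis of equal performance, after which post-hoc pairwise tests with the Holm correction [3, 4, 9] localize the differing pairs, as in Tables 2 and 3.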
These results indicate a great difference between most of the compared classifiers. This is confirmed by the Friedman test of the hypotheses that all classifiers have equal accuracy and equal average AUC. The Friedman test rejected both hypotheses with a high significance: with the Holm correction for simultaneously tested hypotheses [9], the achieved significance level (aka p-value) was 4 · 10⁻⁶. For both hypotheses, post-hoc tests according to [3, 4] were performed, testing equal accuracy and equal average AUC between individual pairs of classifiers. At the family-wise significance level of 5%, they reveal the following Holm-corrected significant differences between individual pairs of classifiers: both for accuracy and averaged AUC, (SVM, DT) and (MLP, DT); and in addition, (kNN, SVM) and (SVM, RF) for accuracy.

Table 3: Comparison between pairs of implemented classifiers with respect to the AUC averaged over all 7 emotions, based on 10 independent parts of the Berlin database of emotional speech corresponding to 10 different speakers. The result in a cell of the table indicates on how many parts the AUC of the row classifier was higher : on how many parts the AUC of the column classifier was higher. A result in bold indicates that after the Friedman test rejected the hypothesis of equal AUC of all classifiers, the post-hoc test according to [3, 4] rejects the hypothesis of equal AUC of the particular row and column classifiers. All simultaneously tested hypotheses were corrected in accordance with Holm [9].

  classifier  kNN   SVM   MLP   DT    RF
  kNN         –     2:8   0:10  10:0  4:6
  SVM         8:2   –     5:5   10:0  9:1
  MLP         10:0  5:5   –     10:0  9:1
  DT          0:10  0:10  0:10  –     0:10
  RF          6:4   1:9   1:9   10:0  –

6 Conclusion

The presented work in progress investigated the possibilities of analysing emotions in utterances based on MPEG-7 features. So far, we have implemented only five classification methods not using time-series features, but only 187 scalar features, namely the k nearest neighbours classifier, support vector machines, multilayer perceptrons, decision trees and random forests. The obtained results indicate that especially support vector machines and multilayer perceptrons are quite successful for this task. Statistical testing confirms significant differences between these two kinds of classifiers on the one hand, and decision trees and random forests on the other hand.

The next step in this ongoing research is to implement the long short-term memory neural network, recalled in Subsection 4.6, because it can work not only with scalar features but also with features represented as time series.

Acknowledgement

The research reported in this paper has been supported by the Czech Science Foundation (GAČR) grant 18-18080S.

References

[1] F. Burkhardt, A. Paeschke, M. Rolfes, W. Sendlmeier, and B. Weiss. A database of German emotional speech. In Interspeech, pages 1517–1520, 2005.
[2] M. Casey, A. De Cheveigne, P. Gardner, M. Jackson, and G. Peeters. MPEG-7 multimedia software resources. http://mpeg7.doc.gold.ac.uk/, 2001.
[3] J. Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30, 2006.
[4] S. Garcia and F. Herrera. An extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all pairwise comparisons. Journal of Machine Learning Research, 9:2677–2694, 2008.
[5] F.A. Gers, J. Schmidhuber, and J. Cummins. Learning to forget: Continual prediction with LSTM. In 9th International Conference on Artificial Neural Networks: ICANN '99, pages 850–855, 1999.
[6] A. Graves. Supervised Sequence Labelling with Recurrent Neural Networks. PhD thesis, TU München, 2008.
[7] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning, 2nd Edition. Springer, 2008.
[8] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9:1735–1780, 1997.
[9] S. Holm. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6:65–70, 1979.
[10] H.G. Kim, N. Moreau, and T. Sikora. MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval. John Wiley and Sons, New York, 2005.
[11] S. Lalitha, A. Madhavan, B. Bhusan, and S. Saketh. Speech emotion recognition. In International Conference on Advances in Electronics, pages 92–95, 2014.
[12] A.S. Lampropoulos and G.A. Tsihrintzis. Evaluation of MPEG-7 descriptors for speech emotional recognition. In Eighth International Conference on Intelligent Information Hiding and Multimedia Signal Processing, pages 98–101, 2012.
[13] A. Rauber, T. Lidy, J. Frank, E. Benetos, V. Zenz, G. Bertini, T. Virtanen, A.T. Cemgil, S. Godsill, D. Clark, P. Peeling, E. Peisyer, Y. Laprie, A. Sloin, A. Alfandary, and D. Burshtein. MUSCLE network of excellence: Multimedia understanding through semantics, computation and learning. Technical report, TU Vienna, Information and Software Engineering Group, 2004.
[14] B. Schölkopf and A.J. Smola. Learning with Kernels. MIT Press, Cambridge, 2002.
[15] T. Sikora, H.G. Kim, N. Moreau, and S. Amjad. MPEG-7-based audio annotation for the archival of digital video. http://mpeg7lld.nue.tu-berlin.de/, 2003.
[16] Z. Xiao, E. Dellandrea, W. Dou, and L. Chen. Multi-stage classification of emotional speech motivated by a dimensional emotion model. Multimedia Tools and Applications, 46:119–145, 2010.

Figure 1: ROC curves for all emotions on the whole Berlin database.