<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Sentiment Analysis from Utterances</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Jiří</forename><surname>Kožusznik</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Faculty of Information Technology</orgName>
								<orgName type="institution">Czech Technical University</orgName>
								<address>
									<addrLine>Thákurova 7</addrLine>
									<settlement>Prague</settlement>
									<country key="CZ">Czech Republic</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Petr</forename><surname>Pulc</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Faculty of Information Technology</orgName>
								<orgName type="institution">Czech Technical University</orgName>
								<address>
									<addrLine>Thákurova 7</addrLine>
									<settlement>Prague</settlement>
									<country key="CZ">Czech Republic</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department">Institute of Computer Science</orgName>
								<orgName type="institution">Czech Academy of Sciences</orgName>
								<address>
									<addrLine>Pod vodárenskou věží 2</addrLine>
									<settlement>Prague</settlement>
									<country key="CZ">Czech Republic</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Martin</forename><surname>Holeňa</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">Institute of Computer Science</orgName>
								<orgName type="institution">Czech Academy of Sciences</orgName>
								<address>
									<addrLine>Pod vodárenskou věží 2</addrLine>
									<settlement>Prague</settlement>
									<country key="CZ">Czech Republic</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Sentiment Analysis from Utterances</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">C2882E2BE8ACB4DBFDF9B92231C26B29</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T09:11+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The recognition of emotional states in speech is starting to play an increasingly important role. However, it is a complicated process, which heavily relies on the extraction and selection of utterance features related to the emotional state of the speaker. In the reported research, MPEG-7 low-level audio descriptors [10] serve as features for the recognition of emotional categories. To this end, a methodology combining MPEG-7 with several important kinds of classifiers is elaborated.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>The recognition of emotional states in speech is expected to play an increasingly important role in applications such as media retrieval systems, car management systems, call center applications, personal assistants and the like. In many languages, the meaning of spoken words commonly changes depending on the speaker's emotions, and consequently the emotional information is important for understanding the intended meaning. Emotional speech recognition is a complicated process. Its performance heavily relies on the extraction and selection of features related to the emotional state of the speaker in the audio signal of an utterance.</p><p>In the reported work in progress, we use MPEG-7 low-level audio descriptors <ref type="bibr" target="#b9">[10]</ref> as features for the recognition of emotional categories. To this end, we elaborate a methodology combining MPEG-7 with several important kinds of classifiers. For most of them, the methodology has already been implemented and tested with the publicly available Berlin Database of Emotional Speech <ref type="bibr" target="#b0">[1]</ref>.</p><p>In the next section, the task of sentiment analysis from utterances is briefly sketched. Section 3 recalls the necessary background concerning MPEG-7 audio descriptors, and Section 4 describes the employed classification methods. Finally, Section 5 presents results of experimental testing and comparison of the already implemented classifiers on the publicly available Berlin database of emotional speech.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Sentiment Analysis from Utterances</head><p>Due to the importance of recognizing emotional states in speech, research into sentiment analysis from utterances has been emerging during recent years. We are aware of three publications reporting research with the same database of emotional utterances as we used, the Berlin Database of Emotional Speech. Let us recall each of them.</p><p>The research most similar to ours has been reported in <ref type="bibr" target="#b11">[12]</ref>, where the authors also used MPEG-7 descriptors for sentiment analysis from utterances. However, they used only scalar MPEG-7 descriptors, or scalars derived from time-series descriptors using the software tools Sound Description Toolbox <ref type="bibr" target="#b12">[13]</ref> and MPEG-7 Audio Reference Software Toolkit <ref type="bibr" target="#b1">[2]</ref>, whereas we are also implementing a long short-term memory network that will use the time series directly. They also used only one classifier in their experiments, a combination of a radial basis function network and a support vector machine.</p><p>In <ref type="bibr" target="#b10">[11]</ref>, emotions are recognized using pitch and prosody features, which are mostly in the time domain. Also in that paper, the authors used only one classifier in their experiments, this time a support vector machine (SVM).</p><p>The authors of <ref type="bibr" target="#b15">[16]</ref> proposed a set of 68 new features, some of them based on harmonic frequencies or on the Zipf distribution, for better speech emotion recognition. This set of features is used in a multi-stage classification. When performing the sentiment analysis of the Berlin Database, the utterance classification into the considered emotional categories was preceded by a gender classification of the speakers, and the gender of the speaker was subsequently used as an additional feature for the classification of the utterances.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">MPEG-7 Audio Descriptors</head><p>MPEG-7 is a standard for low-level description of audio signals, describing a signal by means of the following groups of descriptors <ref type="bibr" target="#b9">[10]</ref>:</p><p>1. Basic: Audio Power (AP), Audio Waveform (AWF). These are temporally sampled scalar values for general use, applicable to all kinds of signals. The AP describes the temporally-smoothed instantaneous power of samples in the frame; in other words, it is a temporal measure of signal content as a function of time and offers a quick summary of a signal in conjunction with other basic spectral descriptors. The AWF describes the audio waveform envelope (minimum and maximum), typically for display purposes.</p><p>2. Basic Spectral: Audio Spectrum Envelope (ASE), Audio Spectrum Centroid (ASC), Audio Spectrum Spread (ASS), Audio Spectrum Flatness (ASF). All share a common basis, deriving from the short-term audio signal spectrum (analysis of frequency over time). They are all based on the ASE descriptor, which is a logarithmic-frequency spectrum. This descriptor provides a compact description of the signal spectral content and approximates the logarithmic frequency response of the human ear. The ASE descriptor indicates whether the spectral content of a signal is dominated by high or low frequencies. The ASC descriptor can be considered an approximation of the perceptual sharpness of the signal. The ASS descriptor indicates whether the signal content, as represented by the power spectrum, is concentrated around its centroid or spread out over a wider range of the spectrum. This gives a measure which allows the distinction of noise-like sounds from tonal sounds. The ASF describes the flatness properties of the spectrum of an audio signal for each of a number of frequency bands.</p><p>3. Basic Signal Parameters: Audio Fundamental Frequency (AFF) and Audio Harmonicity (AH). The signal parameters constitute a simple parametric description of the audio signal. This group includes the computation of an estimate of the fundamental frequency (F0) of the audio signal. The AFF descriptor provides estimates of the fundamental frequency in segments in which the audio signal is assumed to be periodic. The AH represents the harmonicity of a signal, allowing distinction between sounds with a harmonic spectrum (e.g., musical tones or voiced speech such as vowels), sounds with an inharmonic spectrum (e.g., bell-like sounds) and sounds with a non-harmonic spectrum (e.g., noise, unvoiced speech).</p><p>4. Temporal Timbral: Log Attack Time (LAT), Temporal Centroid (TC). Timbre refers to features that allow one to distinguish two sounds that are equal in pitch, loudness and subjective duration. These descriptors take into account several perceptual dimensions at the same time in a complex way. Temporal Timbral descriptors describe the signal power function over time. The power function is estimated as a local mean square value of the signal amplitude within a running window. The LAT descriptor characterizes the "attack" of a sound, the time it takes for the signal to rise from silence to its maximum amplitude. This feature signifies the difference between a sudden and a smooth sound. The TC descriptor computes a time-based centroid as the time average over the energy envelope of the signal.</p><p>5. Timbral Spectral: Harmonic Spectral Centroid (HSC), Harmonic Spectral Deviation (HSD), Harmonic Spectral Spread (HSS), Harmonic Spectral Variation (HSV) and Spectral Centroid. These are spectral features extracted in a linear-frequency space. The HSC descriptor is defined as the average, over the signal duration, of the amplitude-weighted mean of the frequencies of the bins (the harmonic peaks of the spectrum) in the linear power spectrum. It has a high correlation with the perceptual feature of "sharpness" of a sound. The HSD descriptor measures the spectral deviation of the harmonic peaks from the global envelope. The HSS descriptor measures the amplitude-weighted standard deviation (root mean square) of the harmonic peaks of the spectrum, normalized by the HSC. The HSV descriptor is the normalized correlation between the amplitudes of the harmonic peaks in two subsequent time slices of the signal.</p><p>6. Spectral Basis: Audio Spectrum Basis (ASB) and Audio Spectrum Projection (ASP). The ASB descriptor is a series of basis functions derived from the singular value decomposition of a normalized power spectrum, and the ASP descriptor is the projection of the spectrum onto that low-dimensional basis.</p></div>
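<div xmlns="http://www.tei-c.org/ns/1.0"><p>For illustration, the quantities behind the ASC, ASS and ASF descriptors for a single signal frame can be sketched in a few lines of numpy. This is a simplification for illustration only: the MPEG-7 standard prescribes specific window sizes and logarithmic frequency bands, which are omitted here, and the function and variable names are ours, not part of the standard.</p><code lang="python">
import numpy as np

def basic_spectral_descriptors(frame, sample_rate):
    """Illustrative centroid/spread/flatness of one audio frame.

    Simplified sketch of the quantities behind ASC, ASS and ASF;
    the normative MPEG-7 extraction uses logarithmic frequency bands.
    """
    spectrum = np.abs(np.fft.rfft(frame)) ** 2                # power spectrum
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)  # bin frequencies
    power = spectrum.sum() + 1e-12
    centroid = (freqs * spectrum).sum() / power               # cf. ASC
    spread = np.sqrt(((freqs - centroid) ** 2 * spectrum).sum() / power)  # cf. ASS
    # flatness: ratio of geometric to arithmetic mean of the spectrum, cf. ASF
    flatness = np.exp(np.mean(np.log(spectrum + 1e-12))) / np.mean(spectrum + 1e-12)
    return centroid, spread, flatness
</code></div>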
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Tools for Working with MPEG-7 Descriptors</head><p>We utilized the Sound Description Toolbox (SDT) <ref type="bibr" target="#b12">[13]</ref> and the MPEG-7 Audio Analyzer -Low Level Descriptors Extractor <ref type="bibr" target="#b14">[15]</ref> for our experiments. Both of them extract a number of MPEG-7 standard descriptors, both scalar ones and time series. In addition, the SDT also calculates perceptual features such as Mel Frequency Cepstral Coefficients, Specific Loudness and Sensation Coefficients. From these descriptors, it calculates means, covariances, means of first-order differences and covariances of first-order differences. The total number of features provided by this toolbox is 187.</p></div>
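<div xmlns="http://www.tei-c.org/ns/1.0"><p>How fixed-length features can arise from variable-length descriptor time series is sketched below, following the aggregation just described (means, covariances, and the same statistics of first-order differences). The exact feature set and ordering used by the Sound Description Toolbox are assumptions here; the sketch only illustrates the principle.</p><code lang="python">
import numpy as np

def summarize_descriptors(series_list):
    """Turn a list of per-descriptor time series (all of the same length)
    into one fixed-size feature vector, as sketched in Subsection 3.1."""
    X = np.vstack(series_list).T               # frames x descriptors
    dX = np.diff(X, axis=0)                    # first-order differences
    upper = np.triu_indices(X.shape[1])        # covariances without duplicates
    feats = [X.mean(axis=0),
             np.cov(X, rowvar=False)[upper],
             dX.mean(axis=0),
             np.cov(dX, rowvar=False)[upper]]
    return np.concatenate(feats)
</code></div>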
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Employed Classification Methods</head><p>We have elaborated our approach to sentiment analysis from utterances for six classification methods: k nearest neighbours, support vector machines, multilayer perceptrons, classification trees, random forests <ref type="bibr" target="#b6">[7]</ref> and long short-term memory (LSTM) networks <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b7">8]</ref>. The first five of them have already been implemented and tested (cf. Section 5); the last and most advanced one is still being implemented.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">k Nearest Neighbours (kNN)</head><p>A very traditional way of classifying a new feature vector x ∈ X , if a sequence of training data (x_1, c_1), . . . , (x_p, c_p) is available, is the nearest neighbour method: take the x_j that is closest to x among x_1, . . . , x_p, and assign to x the class assigned to x_j, i.e., c_j. A straightforward generalization of the nearest neighbour method is to take among x_1, . . . , x_p not one, but k feature vectors x_{j_1}, . . . , x_{j_k} closest to x. Then x is assigned the class c ∈ C fulfilling</p><formula xml:id="formula_0">|\{i,\ 1 \le i \le k \mid c_{j_i} = c\}| = \max_{c' \in C} |\{i,\ 1 \le i \le k \mid c_{j_i} = c'\}|.<label>(1)</label></formula><p>This method is called, expectedly, k nearest neighbours, or k-NN for short.</p></div>
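<div xmlns="http://www.tei-c.org/ns/1.0"><p>A minimal sketch of rule (1) in Python follows; the names are illustrative and ties between classes are broken arbitrarily, as (1) itself permits.</p><code lang="python">
import numpy as np
from collections import Counter

def knn_classify(x, X_train, c_train, k=9):
    """k nearest neighbours according to rule (1): among the k training
    vectors closest to x, return the most frequent class."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]               # indices j_1, ..., j_k
    return Counter(c_train[i] for i in nearest).most_common(1)[0][0]
</code></div>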
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Support Vector Machines (SVM)</head><p>Support vector machines are classifiers into two classes. This method attempts to derive from the training data (x_1, c_1), . . . , (x_p, c_p) the best possible generalization to unseen feature vectors.</p><p>If both classes, more precisely their intersections with the set {x_1, . . . , x_p} of training inputs, are linearly separable in the space of feature vectors, the method constructs two parallel hyperplanes</p><formula xml:id="formula_2">H_+ = \{x \in \mathbb{R}^n \mid x^\top w + b_+ = 0\}, \quad H_- = \{x \in \mathbb{R}^n \mid x^\top w + b_- = 0\}</formula><p>such that the training data fulfil</p><formula xml:id="formula_2b">c_k = 1 \text{ if } x_k^\top w + b_+ \ge 0, \qquad c_k = -1 \text{ if } x_k^\top w + b_- \le 0, \qquad k = 1, \dots, p,<label>(2)</label></formula><formula xml:id="formula_3">H_+ \cap \{x_1, \dots, x_p\} \ne \emptyset, \quad H_- \cap \{x_1, \dots, x_p\} \ne \emptyset.<label>(3)</label></formula><p>The hyperplanes H_+ and H_- are called support hyperplanes. Their common normal vector w and intercepts b_+, b_- are obtained through solving the following constrained optimization task: maximize with respect to w, b_+, b_- the distance</p><formula xml:id="formula_4">d(H_+, H_-) = \frac{b_+ - b_-}{\|w\|}<label>(4)</label></formula><p>on condition that the p inequalities (2) hold. The distance (<ref type="formula" target="#formula_4">4</ref>) is commonly called margin. The solution to this optimization task coincides with the saddle point (w^*, b_+^*, b_-^*, \alpha_1^*, \dots, \alpha_p^*) of the Lagrange function</p><formula xml:id="formula_5">L(w, b_+, b_-, \alpha_1, \dots, \alpha_p) = \|w\|^2 + \sum_{k=1}^{p} \alpha_k \Bigl( \frac{b_+ - b_-}{2} - c_k x_k^\top w \Bigr),<label>(5)</label></formula><p>where \alpha_1, \dots, \alpha_p \ge 0 are Lagrange coefficients of the optimization task.</p><p>Once the saddle point (w^*, b_+^*, b_-^*, \alpha_1^*, \dots, \alpha_p^*) is found, the classifier is defined by</p><formula xml:id="formula_6">\phi(x) = 1 \text{ if } \sum_{x_k \in S} \alpha_k^* c_k x^\top x_k + b^* \ge 0, \qquad \phi(x) = -1 \text{ if } \sum_{x_k \in S} \alpha_k^* c_k x^\top x_k + b^* &lt; 0,<label>(6)</label></formula><p>where</p><formula xml:id="formula_7">b^* = \frac{1}{2}(b_+^* + b_-^*) \quad \text{and} \quad S = \{x_k \mid \alpha_k^* > 0\}.<label>(7)</label></formula><p>Due to the Karush-Kuhn-Tucker (KKT) conditions,</p><formula xml:id="formula_9">\alpha_k^* \Bigl( \frac{b_+^* - b_-^*}{2} - c_k x_k^\top w^* \Bigr) = 0, \quad k = 1, \dots, p,<label>(8)</label></formula><p>all feature vectors from the set S lie on one of the support hyperplanes (3). Therefore, they are called support vectors. This name, together with the observation that they completely determine the classifier defined in (<ref type="formula" target="#formula_6">6</ref>), explains why such a classifier is called a support vector machine.</p><p>If the intersections of both classes with the training inputs are not linearly separable, an SVM is constructed similarly, but instead of the set of possible feature vectors, now the set of functions</p><formula xml:id="formula_10">\kappa(\cdot, x) \text{ for all possible feature vectors } x<label>(9)</label></formula><p>is considered, where \kappa is a kernel, i.e., a mapping on pairs of feature vectors that is symmetric and such that for any k ∈ N and any sequence of different feature vectors x_1, . . . , x_k, the matrix</p><formula xml:id="formula_11">G_\kappa(x_1, \dots, x_k) = \bigl( \kappa(x_i, x_j) \bigr)_{i,j=1}^{k},<label>(10)</label></formula><p>which is called the Gram matrix of x_1, . . . , x_k, is positive semidefinite, i.e.,</p><formula xml:id="formula_13">(\forall y \in \mathbb{R}^k)\ y^\top G_\kappa(x_1, \dots, x_k)\, y \ge 0.<label>(11)</label></formula><p>The most commonly used kinds of kernels are the Gaussian kernel with a parameter \varsigma > 0,</p><formula xml:id="formula_14">(\forall x, x' \in \mathbb{R}^n)\ \kappa(x, x') = \exp\Bigl( -\frac{1}{\varsigma} \|x - x'\|^2 \Bigr),<label>(12)</label></formula><p>and the polynomial kernel with parameters d ∈ N and c \ge 0,</p><formula xml:id="formula_15">(\forall x, x' \in \mathbb{R}^n)\ \kappa(x, x') = (x^\top x' + c)^d.<label>(13)</label></formula><p>It is known <ref type="bibr" target="#b13">[14]</ref> that, due to the properties of kernels, if the joint distribution of a sequence of different feature vectors x_1, . . . , x_k is continuous, then almost surely any proper subset of the set of functions {\kappa(\cdot, x_1), . . . , \kappa(\cdot, x_k)} is linearly separable from its complement in the space of all functions (9).</p><p>However, the feature vectors x and x_k cannot simply be replaced by the corresponding functions \kappa(\cdot, x) and \kappa(\cdot, x_k) in the definition (6) of an SVM classifier, because a transpose x^\top exists for a finite-dimensional vector, but not for an infinite-dimensional function. Fortunately, the transpose occurs in (6) only as a part of the scalar product x^\top x_k. And a scalar product can also be defined on the space of all functions (9): namely, the function that assigns to a pair of functions (\kappa(\cdot, x), \kappa(\cdot, x')) the value \kappa(x, x') has the properties of a scalar product. Using this scalar product in (6), we obtain the following definition of an SVM classifier for linearly non-separable classes:</p><formula xml:id="formula_16">\phi(x) = 1 \text{ if } \sum_{x_k \in S} \alpha_k^* c_k \kappa(x, x_k) + b^* \ge 0, \qquad \phi(x) = -1 \text{ if } \sum_{x_k \in S} \alpha_k^* c_k \kappa(x, x_k) + b^* &lt; 0.<label>(14)</label></formula></div>
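<div xmlns="http://www.tei-c.org/ns/1.0"><p>A minimal sketch of the classifier (14) follows. It assumes the multipliers \alpha_k^*, the labels c_k, the support vectors and the intercept b^* have already been computed; solving the underlying optimization task (e.g., with a quadratic programming solver) is deliberately omitted, and all names are illustrative.</p><code lang="python">
import numpy as np

def gaussian_kernel(x, z, varsigma=1.0):
    """Gaussian kernel (12) with parameter varsigma."""
    return np.exp(-np.sum((x - z) ** 2) / varsigma)

def svm_decision(x, support_vectors, alphas, labels, b, kernel=gaussian_kernel):
    """Classifier (14): sign of the kernel expansion over the support set S."""
    s = sum(a * c * kernel(x, xk)
            for a, c, xk in zip(alphas, labels, support_vectors)) + b
    return 1 if s >= 0 else -1
</code></div>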
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Multilayer Perceptrons (MLP)</head><p>A multilayer perceptron is a mapping \phi of feature vectors to classes with which a directed graph G_\phi = (V, E) is associated. Due to the inspiration from biological neural networks, the vertices of G_\phi are called neurons and its edges are called connections. In addition, G_\phi is required to have a layered structure, which means that the set V of neurons can be decomposed into L + 1 mutually disjoint layers,</p><formula xml:id="formula_18">V = V_0 \cup V_1 \cup \dots \cup V_L, \quad L \ge 2, \quad \text{such that} \quad (\forall (u, v) \in E)\ u \in V_i,\ i = 0, \dots, L-1 \Rightarrow v \in V_{i+1}.<label>(15)</label></formula><p>The layer I = V_0 is called the input layer of the MLP, the layer O = V_L its output layer, and the layers H_1 = V_1, . . . , H_{L-1} = V_{L-1} its hidden layers.</p><p>The purpose of the graph G_\phi associated with the mapping \phi is to define a decomposition of \phi into simple mappings assigned to hidden and output neurons and to connections between neurons (input neurons normally only accept the components of the input, and no mappings are assigned to them). Inspired by biological terminology, mappings assigned to neurons are called somatic, those assigned to connections are called synaptic.</p><p>To each connection (u, v) ∈ E, the multiplication by a weight w_{(u,v)} is assigned as a synaptic mapping:</p><formula xml:id="formula_20">(\forall \xi \in \mathbb{R})\ f_{(u,v)}(\xi) = w_{(u,v)} \xi.<label>(16)</label></formula><p>To each hidden neuron v ∈ H_i, the following somatic mapping is assigned:</p><formula xml:id="formula_21">(\forall \xi \in \mathbb{R}^{|in(v)|})\ f_v(\xi) = \varphi\Bigl( \sum_{u \in in(v)} [\xi]_u + b_v \Bigr),<label>(17)</label></formula><p>where [\xi]_u for u ∈ in(v) denotes the component of \xi that is the output of the synaptic mapping f_{(u,v)} assigned to the connection (u, v), in(v) = {u ∈ V | (u, v) ∈ E} is the input set of v, and \varphi : R → R is called the activation function.</p><p>As activation functions, typically sigmoidal functions are used in applications, i.e., functions that are non-decreasing, piecewise continuous, and such that</p><formula xml:id="formula_22">-\infty &lt; \lim_{t \to -\infty} \varphi(t) &lt; \lim_{t \to \infty} \varphi(t) &lt; \infty.<label>(18)</label></formula><p>The activation functions most frequently encountered in MLPs are:</p><p>• the logistic function,</p><formula xml:id="formula_24">(\forall t \in \mathbb{R})\ \varphi(t) = \frac{1}{1 + e^{-t}};<label>(19)</label></formula><p>• the hyperbolic tangent,</p><formula xml:id="formula_25">\varphi(t) = \tanh t = \frac{e^t - e^{-t}}{e^t + e^{-t}}.<label>(20)</label></formula><p>To an output neuron v ∈ O, a somatic mapping of the kind (17) with the activation function (<ref type="formula" target="#formula_24">19</ref>) or (20) can also be assigned. If this is the case, then the class c predicted for a feature vector x is obtained as c = arg max_i (\phi(x))_i, where (\phi(x))_i denotes the i-th component of \phi(x). Alternatively, the activation function assigned to an output neuron can be the step function, aka the Heaviside function,</p><formula xml:id="formula_26">\varphi(t) = 0 \text{ if } t &lt; 0, \qquad \varphi(t) = 1 \text{ if } t \ge 0.<label>(21)</label></formula><p>In that case, the value (\phi(x))_c already directly indicates whether x belongs to the class c.</p></div>
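<div xmlns="http://www.tei-c.org/ns/1.0"><p>The forward computation defined by (16), (17) and (19) can be sketched compactly as follows; the weight matrices collect the per-connection weights w_{(u,v)}, and all names are illustrative.</p><code lang="python">
import numpy as np

def logistic(t):
    """Logistic activation function (19)."""
    return 1.0 / (1.0 + np.exp(-t))

def mlp_forward(x, weights, biases):
    """Forward pass of an MLP: each layer applies the somatic mapping (17)
    to the weighted outputs (16) of the previous layer."""
    a = x
    for W, b in zip(weights, biases):
        a = logistic(W @ a + b)
    return a

# With sigmoidal outputs, the predicted class is the largest component:
# c = int(np.argmax(mlp_forward(x, weights, biases)))
</code></div>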
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Classification Trees (CT)</head><p>A classifier \phi : X → C = {c_1, . . . , c_m} is called a binary classification tree if there is a binary tree T_\phi = (V_\phi, E_\phi) with vertices V_\phi and edges E_\phi such that:</p><p>(i) V_\phi = {v_1, . . . , v_L, . . . , v_{2L-1}}, where L \ge 2, v_1 is the root of T_\phi, v_1, . . . , v_{L-1} are its forks and v_L, . . . , v_{2L-1} are its leaves.</p><p>(ii) If the children of a fork v ∈ {v_1, . . . , v_{L-1}} are v_ℓ ∈ V_\phi (left child) and v_r ∈ V_\phi (right child), and if v = v_i, v_ℓ = v_j, v_r = v_k, then i &lt; j &lt; k.</p><p>(iii) To each fork v ∈ {v_1, . . . , v_{L-1}}, a predicate \varphi_v of some formal logic is assigned, evaluated on features of the input vectors x ∈ X .</p><p>(iv) To each leaf v ∈ {v_L, . . . , v_{2L-1}}, a class c_v ∈ C is assigned.</p><p>(v) For each input x ∈ X , the predicate \varphi_{v_1} assigned to the root is evaluated.</p><p>(vi) If for a fork v ∈ {v_1, . . . , v_{L-1}} the predicate \varphi_v evaluates to true, then \phi(x) = c_{v_ℓ} in case the left child v_ℓ is already a leaf, and the predicate \varphi_{v_ℓ} is evaluated in case v_ℓ is still a fork.</p><p>(vii) If for a fork v ∈ {v_1, . . . , v_{L-1}} the predicate \varphi_v evaluates to false, then \phi(x) = c_{v_r} in case the right child v_r is already a leaf, and the predicate \varphi_{v_r} is evaluated in case v_r is still a fork.</p></div>
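<div xmlns="http://www.tei-c.org/ns/1.0"><p>The evaluation procedure (v)-(vii) amounts to a simple traversal, sketched below with an illustrative dict encoding of forks and leaves (not a representation used by any particular library).</p><code lang="python">
def tree_classify(x, node):
    """Evaluate a binary classification tree as in (v)-(vii). A fork is a
    dict {'predicate': lambda x: x[3] >= 0.5, 'left': ..., 'right': ...};
    a leaf is a dict {'cls': 'anger'} holding its assigned class."""
    while 'cls' not in node:
        # true goes to the left child, false to the right child
        node = node['left'] if node['predicate'](x) else node['right']
    return node['cls']
</code></div>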
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5">Random Forests (RF)</head><p>Random forests are ensembles of classifiers in which the individual members are classification trees. They are constructed by bagging, i.e., bootstrap aggregation of individual trees, which consists in training each member of the ensemble with a different set of training data, sampled randomly with replacement from the original training pairs (x_1, c_1), . . . , (x_p, c_p). Typical sizes of random forests encountered in applications are dozens to thousands of trees. Subsequently, when new subjects are input to the forest, each tree classifies them separately, according to the leaves at which they end, and the final classification by the forest is obtained by means of an aggregation function. The usual aggregation function of random forests is majority voting, or one of its fuzzy generalizations.</p><p>According to the kind of randomness involved in the construction of the ensemble, two broad groups of random forests can be differentiated:</p><p>1. Random forests grown in the full input space. Each tree is trained using all considered input features. Consequently, any feature has to be taken into account when looking for the split condition assigned to an inner node of the tree. However, the features actually occurring in the split conditions can differ from tree to tree, as a consequence of the fact that each tree is trained with a different set of training data. For the same reason, even if a particular feature occurs in the split conditions of two different trees, those conditions can be assigned to nodes at different levels of the tree.</p><p>A great advantage of this kind of random forest is that each tree is trained using all the information available in its set of training data. Its main disadvantage is high computational complexity. In addition, if several or even only one variable are very noisy, that noise nonetheless gets incorporated into all trees in the forest. Because of those disadvantages, random forests are grown in the complete input space primarily if its dimension is not high and no input feature is substantially noisier than the remaining ones.</p><p>2. Random forests grown in subspaces of the input space. Each tree is trained using only a randomly chosen fraction of the features, typically a small one. This means that a tree t is actually trained with projections of the training data into a low-dimensional space spanned by some randomly selected dimensions i_{t,1} ≤ · · · ≤ i_{t,d_t} ∈ {1, . . . , d}, where d is the dimension of the input space and d_t is typically much smaller than d. Using only a subset of features not only makes forest training much faster, but also makes it possible to eliminate noise originating from only several features. The price paid for both these advantages is that training makes use of only a part of the information available in the training data.</p></div>
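<div xmlns="http://www.tei-c.org/ns/1.0"><p>Both construction variants can be sketched on top of scikit-learn trees as follows; this is our own illustration (X being a p × d numpy array and c an array of class labels), not the Matlab implementation used in the experiments. Passing n_feats selects the subspace variant, omitting it gives the full-input-space variant.</p><code lang="python">
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_forest(X, c, n_trees=50, n_feats=None, rng=None):
    """Bagging sketch: each tree sees a bootstrap sample of the rows and,
    optionally, a random subspace of the feature columns (group 2 above)."""
    rng = rng or np.random.default_rng(0)
    p, d = X.shape
    forest = []
    for _ in range(n_trees):
        rows = rng.integers(0, p, size=p)          # sampling with replacement
        cols = (np.arange(d) if n_feats is None
                else rng.choice(d, size=n_feats, replace=False))
        tree = DecisionTreeClassifier().fit(X[np.ix_(rows, cols)], c[rows])
        forest.append((tree, cols))
    return forest

def forest_classify(x, forest):
    """Aggregation by majority voting over the individual trees."""
    votes = [tree.predict(x[cols].reshape(1, -1))[0] for tree, cols in forest]
    return max(set(votes), key=votes.count)
</code></div>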
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.6">Long Short-Term Memory (LSTM)</head><p>An LSTM network is used for the classification of sequences of feature vectors, or equivalently, multidimensional time series with discrete time. Alternatively, it can also be employed to obtain sequences of such classifications, i.e., in situations where the neural network input is a sequence of feature vectors and its output is a sequence of classes. Differently from most other commonly encountered kinds of artificial neural networks, an LSTM layer connects not simple neurons, but units with their own inner structure. Several variants of LSTM have been proposed (e.g., <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6]</ref>); all of them include at least the following four kinds of units described below. Each of them has certain properties of usual ANN neurons; in particular, the values assigned to them depend, apart from a bias, on values assigned to the unit input at the same time step and on values assigned to the unit output at the previous time step. Hence, a network with LSTM layers is a recurrent network.</p><p>(i) Memory cells can store values, aka cell states, for an arbitrary time. They have no activation function, thus their output is actually a biased linear combination of unit inputs and of the values from the previous time step coming through recurrent connections.</p><p>(ii) The input gate controls the extent to which values from the previous unit or from the preceding layer influence the value stored in the memory cell. It has a sigmoidal activation function, which is applied to a biased linear combination of unit inputs and of values from the previous time step, though the bias and synaptic weights of the input and recurrent connections are specific and in general different from the bias and synaptic weights of the memory cell.</p><p>(iii) The forget gate controls the extent to which the memory cell state is suppressed. It again has a sigmoidal activation function, which is applied to a specific biased linear combination of unit inputs and of values from the previous time step.</p><p>(iv) The output gate controls the extent to which the memory cell state influences the unit output. This gate also has a sigmoidal activation function, which is applied to a specific biased linear combination of unit inputs and of values from the previous time step, and subsequently composed either directly with the cell state or with its sigmoidal transformation, using another sigmoid than is used by the gates.</p></div>
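<div xmlns="http://www.tei-c.org/ns/1.0"><p>One time step of an LSTM unit, following (i)-(iv) above, can be sketched as follows. The parameterization (one input weight matrix W, one recurrent weight matrix U and one bias b per component) is illustrative; published variants differ in such details, e.g., in applying a sigmoidal transformation to the cell state before the output gate.</p><code lang="python">
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def lstm_step(x_t, h_prev, c_prev, P):
    """One time step of an LSTM unit; P is a dict of weights and biases."""
    i = sigmoid(P['Wi'] @ x_t + P['Ui'] @ h_prev + P['bi'])  # (ii) input gate
    f = sigmoid(P['Wf'] @ x_t + P['Uf'] @ h_prev + P['bf'])  # (iii) forget gate
    o = sigmoid(P['Wo'] @ x_t + P['Uo'] @ h_prev + P['bo'])  # (iv) output gate
    # (i) memory cell input: biased linear combination, no activation function
    g = P['Wc'] @ x_t + P['Uc'] @ h_prev + P['bc']
    c_t = f * c_prev + i * g     # suppressed old state plus gated new input
    h_t = o * c_t                # output gate composed directly with the state
    return h_t, c_t
</code></div>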
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Experimental Testing</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Berlin Database of Emotional Speech</head><p>For the evaluation of the already implemented classifiers, we used the publicly available dataset "EmoDB", aka the Berlin database of emotional speech. It consists of 535 emotional utterances in 7 emotional categories, namely anger, boredom, disgust, fear, happiness, sadness and neutral. These utterances are sentences read by 10 professional actors, 5 male and 5 female <ref type="bibr" target="#b0">[1]</ref>, which were recorded in an anechoic chamber under the supervision of linguists and psychologists. The actors were advised to read these predefined sentences in the targeted emotional categories, but the sentences do not contain any emotional bias. A human perception test was conducted with 20 persons, different from the speakers, in order to evaluate the quality of the recorded data with respect to the recognisability and naturalness of the presented emotion. This evaluation yielded a mean accuracy of 86% over all emotional categories.</p></div>
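<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a practical aside, the distributed EmoDB recordings encode the speaker, text and emotion in the file name (e.g., 03a01Fa.wav), with the emotion letter derived from the German emotion word; this naming scheme is an assumption about the downloaded data, not something stated in the paper. A hedged loading sketch:</p><code lang="python">
# Assumed EmoDB file-name layout: positions 0-1 speaker id, 2-4 text code,
# 5 emotion letter (German initial), 6 version letter.
EMOTIONS = {'W': 'anger', 'L': 'boredom', 'E': 'disgust', 'A': 'fear',
            'F': 'happiness', 'T': 'sadness', 'N': 'neutral'}

def parse_emodb_name(filename):
    """Return (speaker_id, emotion) for one EmoDB recording, e.g.
    parse_emodb_name('03a01Fa.wav') == ('03', 'happiness')."""
    stem = filename.rsplit('.', 1)[0]
    return stem[:2], EMOTIONS[stem[5]]
</code></div>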
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Experimental Settings</head><p>As input features, the outputs from the Sound Description Toolbox were used. Consequently, the input dimension was 187. The already implemented classifiers were compared by means of a 10-fold cross-validation, using the following settings for each of them:</p><p>• For the k nearest neighbours classification, the value k = 9 was chosen by a grid method from {1, . . . , 80}. This classifier was applied to data normalized to zero mean and unit variance.</p><p>• Support vector machines were constructed for each of the 7 considered emotions, to classify between that emotion and all the remaining ones. They employ auto-scaled Gaussian kernels and do not use slack variables.</p><p>• The MLP has 1 hidden layer with 70 neurons. Hence, taking into account the input dimension and the number of classes, the overall architecture of the MLP is 187-70-7.</p><p>• Classification trees are restricted to have at most 23 leaves. This upper limit was chosen by a grid method from {1, . . . , 50}, taking into account the way classification trees are grown in their Matlab implementation.</p><p>• Random forests consist of 50 classification trees, each of them subject to the above restriction on the number of leaves. The number of trees was selected by a grid method from {10, 20, . . . , 100}.</p></div>
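<div xmlns="http://www.tei-c.org/ns/1.0"><p>For readers who want to approximate these settings outside the original Matlab environment, a scikit-learn sketch could look as follows. The mapping of "auto-scaled Gaussian kernels" to gamma='scale', of "no slack variables" to a very large C, and the logistic MLP activation are our assumptions, not statements from the experiments.</p><code lang="python">
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

classifiers = {
    # kNN on data normalized to zero mean and unit variance
    'kNN': make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=9)),
    # one SVM per emotion against the rest; large C approximates a hard margin
    'SVM': OneVsRestClassifier(SVC(kernel='rbf', gamma='scale', C=1e6)),
    # architecture 187-70-7 (input and output sizes are inferred from data)
    'MLP': MLPClassifier(hidden_layer_sizes=(70,), activation='logistic'),
    'DT': DecisionTreeClassifier(max_leaf_nodes=23),
    'RF': RandomForestClassifier(n_estimators=50, max_leaf_nodes=23),
}

# X: 535 x 187 feature matrix, y: emotion labels (cf. Subsection 5.1)
# for name, clf in classifiers.items():
#     print(name, cross_val_score(clf, X, y, cv=10).mean())
</code></div>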
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Comparison of Already Implemented Classifiers</head><p>First, we compared the already implemented classifiers on the whole Berlin database of emotional speech, with respect to accuracy and area under the ROC curve (area under curve, AUC). Since a ROC curve makes sense only for a binary classifier, we computed the areas under 7 separate curves corresponding to classifiers always classifying 1 emotion against the rest. The results are presented in Table 1 and in Figure <ref type="figure" target="#fig_4">1</ref>. They clearly show SVM to be the most promising classifier. It has the highest accuracy, and also the highest AUC for the binary classifiers corresponding to 5 of the 7 emotions.</p><p>Then we compared the classifiers separately on the utterances of each of the 10 speakers who created the database. The results are summarized in Table <ref type="table" target="#tab_1">2</ref> for accuracy and in Table <ref type="table">3</ref> for the AUC averaged over all 7 emotions. They indicate a great difference between most of the compared classifiers. This is confirmed by the Friedman test of the hypotheses that all classifiers have equal accuracy and equal average AUC. The Friedman test rejected both hypotheses with a high significance: with the Holm correction for simultaneously tested hypotheses <ref type="bibr" target="#b8">[9]</ref>, the achieved significance level (aka p-value) was 4 · 10^-6. For both hypotheses, post-hoc tests according to <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4]</ref> were performed, testing equal accuracy and equal average AUC between individual pairs of classifiers. At the family-wise significance level 5%, they reveal the following Holm-corrected significant differences between individual pairs of classifiers: (SVM,DT) and (MLP,DT) both for accuracy and averaged AUC, and in addition (kNN,SVM) and (SVM,RF) for accuracy.</p></div>
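<div xmlns="http://www.tei-c.org/ns/1.0"><p>This statistical protocol can be sketched in a few lines, combining scipy's Friedman test with Holm's step-down procedure <ref type="bibr" target="#b8">[9]</ref>; the per-speaker accuracy arrays acc[...] are placeholders for the values behind Tables 2 and 3.</p><code lang="python">
import numpy as np
from scipy.stats import friedmanchisquare

def holm_correction(p_values, alpha=0.05):
    """Holm's sequentially rejective procedure [9]: sort the p-values and
    compare the i-th smallest against alpha / (m - i + 1), stopping at the
    first non-rejection. Returns a boolean rejection mask."""
    p_values = np.asarray(p_values)
    order = np.argsort(p_values)
    m = len(p_values)
    rejected = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(order):
        if p_values[idx] > alpha / (m - rank):
            break
        rejected[idx] = True
    return rejected

# acc[c] = accuracies of classifier c on the 10 speaker-specific parts
# stat, p = friedmanchisquare(acc['kNN'], acc['SVM'], acc['MLP'],
#                             acc['DT'], acc['RF'])
</code></div>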
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion</head><p>The presented work in progress investigated the possibilities of analysing emotions in utterances based on MPEG-7 features. So far, we have implemented only five classification methods, not using time-series features but only the 187 scalar features: the k nearest neighbours classifier, support vector machines, multilayer perceptrons, decision trees and random forests. The obtained results indicate that especially support vector machines and multilayer perceptrons are quite successful for this task. Statistical testing confirms significant differences between these two kinds of classifiers on the one hand, and decision trees and random forests on the other hand. The next step in this ongoing research is to implement the long short-term memory neural network, recalled in Subsection 4.6, because it can work not only with scalar features but also with features represented by time series.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3 :</head><label>3</label><figDesc>Comparison between pairs of implemented classifiers with respect to the AUC averaged over all 7 emotions, based on 10 independent parts of the Berlin database of emotional speech corresponding to 10 different speakers. The result in a cell of the table indicates on how many parts the AUC of the row classifier was higher : on how many parts the AUC of the column classifier was higher. A result in bold indicates that after the Friedman test rejected the hypothesis of equal AUC of all classifiers, the post-hoc test according to <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4]</ref> rejects the hypothesis of equal AUC of the particular row and column classifiers. All simultaneously tested hypotheses were corrected in accordance with Holm <ref type="bibr" target="#b8">[9]</ref>.</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: ROC curve for all emotions on the whole Berlin database</figDesc><graphic coords="8,72.99,84.34,238.01,178.51" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Accuracy and area under curve (AUC) of the implemented classifiers on the whole Berlin database of emotional speech. AUC is measured for binary classification of each of the considered 7 emotions against the rest</figDesc><table><row><cell>Classifier</cell><cell>Accuracy</cell><cell>Anger</cell><cell>Boredom</cell><cell>Disgust</cell><cell>Fear</cell><cell>Happiness</cell><cell>Neutral</cell><cell>Sadness</cell></row><row><cell>kNN</cell><cell>0.73</cell><cell>0.956</cell><cell>0.933</cell><cell>0.901</cell><cell>0.902</cell><cell>0.856</cell><cell>0.962</cell><cell>0.995</cell></row><row><cell>SVM</cell><cell>0.93</cell><cell>0.979</cell><cell>0.973</cell><cell>0.966</cell><cell>0.983</cell><cell>0.904</cell><cell>0.974</cell><cell>0.997</cell></row><row><cell>MLP</cell><cell>0.78</cell><cell>0.977</cell><cell>0.969</cell><cell>0.964</cell><cell>0.969</cell><cell>0.933</cell><cell>0.983</cell><cell>0.996</cell></row><row><cell>DT</cell><cell>0.59</cell><cell>0.871</cell><cell>0.836</cell><cell>0.772</cell><cell>0.782</cell><cell>0.683</cell><cell>0.855</cell><cell>0.865</cell></row><row><cell>RF</cell><cell>0.71</cell><cell>0.962</cell><cell>0.949</cell><cell>0.920</cell><cell>0.921</cell><cell>0.882</cell><cell>0.972</cell><cell>0.992</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 :</head><label>2</label><figDesc>Comparison between pairs of implemented classifiers with respect to accuracy, based on 10 independent parts of the Berlin database of emotional speech corresponding to 10 different speakers. The result in a cell of the table indicates on how many parts the accuracy of the row classifier was higher : on how many parts the accuracy of the column classifier was higher. A result in bold indicates that after the Friedman test rejected the hypothesis of equal accuracy of all classifiers, the post-hoc test according to <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4]</ref> rejects the hypothesis of equal accuracy of the particular row and column classifiers. All simultaneously tested hypotheses were corrected in accordance with Holm <ref type="bibr" target="#b8">[9]</ref>.</figDesc><table /></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgement</head><p>The research reported in this paper has been supported by the Czech Science Foundation (GA ČR) grant 18-18080S.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">A database of german emotional speech</title>
		<author>
			<persName><forename type="first">F</forename><surname>Burkhardt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Paeschke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rolfes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Sendlmeier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Weiss</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Interspeech</title>
				<imprint>
			<date type="published" when="2005">2005</date>
			<biblScope unit="page" from="1517" to="1520" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Casey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>De Cheveigne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Gardner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Jackson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Peeters</surname></persName>
		</author>
		<ptr target="http://mpeg7.doc.gold.ac.uk/" />
		<title level="m">MPEG-7 multimedia software resources</title>
				<imprint>
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Statistical comparisons of classifiers over multiple data sets</title>
		<author>
			<persName><forename type="first">J</forename><surname>Demšar</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="page" from="1" to="30" />
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">An extension on &quot;Statistical Comparisons of Classifiers over Multiple Data Sets&quot; for all pairwise comparisons</title>
		<author>
			<persName><forename type="first">S</forename><surname>Garcia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Herrera</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page" from="2677" to="2694" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Learning to forget: Continual prediction with LSTM</title>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">A</forename><surname>Gers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schmidhuber</surname></persName>
		</author>
		<author>
<persName><forename type="first">F</forename><surname>Cummins</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">9th International Conference on Artificial Neural Networks: ICANN &apos;99</title>
				<imprint>
			<date type="published" when="1999">1999</date>
			<biblScope unit="page" from="850" to="855" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Supervised Sequence Labelling with Recurrent Neural Networks</title>
		<author>
			<persName><forename type="first">A</forename><surname>Graves</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2008">2008</date>
		</imprint>
		<respStmt>
			<orgName>TU München</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">PhD thesis</note>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">The Elements of Statistical Learning</title>
		<author>
			<persName><forename type="first">T</forename><surname>Hastie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Tibshirani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Friedman</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2008">2008</date>
			<publisher>Springer</publisher>
		</imprint>
	</monogr>
	<note>2nd Edition</note>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Long short-term memory</title>
		<author>
			<persName><forename type="first">S</forename><surname>Hochreiter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schmidhuber</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Neural Computation</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page" from="1735" to="1780" />
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">A simple sequentially rejective multiple test procedure</title>
		<author>
			<persName><forename type="first">S</forename><surname>Holm</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Scandinavian Journal of Statistics</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="page" from="65" to="70" />
			<date type="published" when="1979">1979</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval</title>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">G</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Moreau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Sikora</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2005">2005</date>
			<publisher>John Wiley and Sons</publisher>
			<pubPlace>New York</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Speech emotion recognition</title>
		<author>
			<persName><forename type="first">S</forename><surname>Lalitha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Madhavan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Bhusan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Saketh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Advances in Electronics</title>
				<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="92" to="95" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Evaluation of MPEG-7 descriptors for speech emotional recognition</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">S</forename><surname>Lampropoulos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">A</forename><surname>Tsihrintzis</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Eighth International Conference on Intelligent Information Hiding and Multimedia Signal Processing</title>
				<imprint>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="98" to="101" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">MUSCLE network of excellence: Multimedia understanding through semantics, computation and learning</title>
		<author>
			<persName><forename type="first">A</forename><surname>Rauber</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lidy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Frank</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Benetos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Zenz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Bertini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Virtanen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">T</forename><surname>Cemgil</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Godsill</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Peeling</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Peisyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Laprie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sloin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Alfandary</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Burshtein</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2004">2004</date>
		</imprint>
		<respStmt>
			<orgName>TU Vienna, Information and Software Engineering Group</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Technical report</note>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">Learning with Kernels</title>
		<author>
			<persName><forename type="first">B</forename><surname>Schölkopf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">J</forename><surname>Smola</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2002">2002</date>
			<publisher>MIT Press</publisher>
			<pubPlace>Cambridge</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><surname>Sikora</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">G</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Moreau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Amjad</surname></persName>
		</author>
		<ptr target="http://mpeg7lld.nue.tu-berlin.de/" />
		<title level="m">MPEG-7-based audio annotation for the archival of digital video</title>
				<imprint>
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Multi-stage classification of emotional speech motivated by a dimensional emotion model</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Xiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Dellandrea</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Dou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Chen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Multimedia Tools and Applicaions</title>
		<imprint>
			<biblScope unit="volume">46</biblScope>
			<biblScope unit="page" from="119" to="145" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
