S. Krajči (ed.): ITAT 2018 Proceedings, pp. 92–99, CEUR Workshop Proceedings Vol. 2203, ISSN 1613-0073, © 2018 Jiří Kožusznik, Petr Pulc, and Martin Holeňa

Sentiment Analysis from Utterances

Jiří Kožusznik1, Petr Pulc1,2, Martin Holeňa2

1 Faculty of Information Technology, Czech Technical University, Thákurova 7, Prague, Czech Republic
2 Institute of Computer Science, Czech Academy of Sciences, Pod vodárenskou věží 2, Prague, Czech Republic

Abstract: The recognition of emotional states in speech is starting to play an increasingly important role. However, it is a complicated process, which heavily relies on the extraction and selection of utterance features related to the emotional state of the speaker. In the reported research, MPEG-7 low-level audio descriptors [10] serve as features for the recognition of emotional categories. To this end, a methodology combining MPEG-7 with several important kinds of classifiers is elaborated.

1 Introduction

The recognition of emotional states in speech is expected to play an increasingly important role in applications such as media retrieval systems, car management systems, call center applications, personal assistants and the like. In many languages it is common that the meaning of spoken words changes depending on the speaker's emotions; consequently, the emotional information is important in order to understand the intended meaning. Emotional speech recognition is a complicated process. Its performance heavily relies on the extraction and selection of features related to the emotional state of the speaker in the audio signal of an utterance.

In the reported work in progress, we use MPEG-7 low-level audio descriptors [10] as features for the recognition of emotional categories. To this end, we elaborate a methodology combining MPEG-7 with several important kinds of classifiers. For most of them, the methodology has already been implemented and experimentally tested with the publicly available Berlin Database of Emotional Speech [1].

In the next section, the task of sentiment analysis from utterances is briefly sketched. Section 3 recalls the necessary background concerning MPEG-7 audio descriptors and the considered classification methods. In Section 4, the principles of the proposed approach are explained. Finally, Section 5 presents results of experimental testing and comparison of the already implemented classifiers on the publicly available Berlin database of emotional speech.

2 Sentiment Analysis from Utterances

Due to the importance of recognizing emotional states in speech, research into sentiment analysis from utterances has been emerging during recent years. We are aware of 3 publications reporting research with the same database of emotional utterances as used in our research – the Berlin Database of Emotional Speech. Let us recall each of them.

The research most similar to ours has been reported in [12], where the authors also used MPEG-7 descriptors for sentiment analysis from utterances. However, they used only scalar MPEG-7 descriptors or scalars derived from time-series descriptors using the software tools Sound Description Toolbox [13] and MPEG-7 Audio Reference Software Toolkit [2], whereas we are also implementing a long short-term memory network that will use the time series directly. They also used only one classifier in their experiments, a combination of a radial basis function network and a support vector machine.

In [11], emotions are recognized using pitch and prosody features, which are mostly in the time domain. Also in that paper, only one classifier was used in the experiments, this time a support vector machine (SVM).

The authors of [16] proposed a set of 68 new features, such as some based on harmonic frequencies or on the Zipf distribution, for better speech emotion recognition. This set of features is used in a multi-stage classification. When performing the sentiment analysis of the Berlin Database, the classification of utterances into the considered emotional categories was preceded by a gender classification of the speakers, and the gender of the speaker was subsequently used as an additional feature for the classification of the utterances.

3 MPEG-7 Audio Descriptors
MPEG-7 is a standard for the low-level description of audio signals, describing a signal by means of the following groups of descriptors [10]:

1. Basic: Audio Power (AP), Audio Waveform (AWF). Temporally sampled scalar values for general use, applicable to all kinds of signals. The AP describes the temporally-smoothed instantaneous power of samples in the frame; in other words, it is a temporal measure of signal content as a function of time and offers a quick summary of a signal in conjunction with other basic spectral descriptors. The AWF describes the audio waveform envelope (minimum and maximum), typically for display purposes.

2. Basic Spectral: Audio Spectrum Envelope (ASE), Audio Spectrum Centroid (ASC), Audio Spectrum Spread (ASS), Audio Spectrum Flatness (ASF). All share a common basis, deriving from the short-term audio signal spectrum (analysis of frequency over time). They are all based on the ASE descriptor, which is a logarithmic-frequency spectrum. This descriptor provides a compact description of the signal spectral content and represents a similar approximation of the logarithmic response of the human ear. The ASE descriptor is an indicator as to whether the spectral content of a signal is dominated by high or low frequencies. The ASC descriptor can be considered as an approximation of the perceptual sharpness of the signal. The ASS descriptor indicates whether the signal content, as represented by the power spectrum, is concentrated around its centroid or spread out over a wider range of the spectrum. This gives a measure which allows the distinction of noise-like sounds from tonal sounds. The ASF describes the flatness properties of the spectrum of an audio signal for each of a number of frequency bands.

3. Basic Signal Parameters: Audio Fundamental Frequency (AFF) and Audio Harmonicity (AH). The signal parameters constitute a simple parametric description of the audio signal. This group includes the computation of an estimate of the fundamental frequency (F0) of the audio signal. The AFF descriptor provides estimates of the fundamental frequency in segments in which the audio signal is assumed to be periodic. The AH represents the harmonicity of a signal, allowing distinction between sounds with a harmonic spectrum (e.g., musical tones or voiced speech such as vowels), sounds with an inharmonic spectrum (e.g., bell-like sounds) and sounds with a non-harmonic spectrum (e.g., noise or unvoiced speech).

4. Temporal Timbral: Log Attack Time (LAT), Temporal Centroid (TC). Timbre refers to features that allow one to distinguish two sounds that are equal in pitch, loudness and subjective duration. These descriptors take into account several perceptual dimensions at the same time in a complex way. Temporal Timbral descriptors describe the signal power function over time. The power function is estimated as a local mean square value of the signal amplitude within a running window. The LAT descriptor characterizes the "attack" of a sound, the time it takes for the signal to rise from silence to its maximum amplitude. This feature signifies the difference between a sudden and a smooth sound. The TC descriptor computes a time-based centroid as the time average over the energy envelope of the signal.

5. Timbral Spectral: Harmonic Spectral Centroid (HSC), Harmonic Spectral Deviation (HSD), Harmonic Spectral Spread (HSS), Harmonic Spectral Variation (HSV) and Spectral Centroid. These are spectral features extracted in a linear-frequency space. The HSC descriptor is defined as the average, over the signal duration, of the amplitude-weighted mean of the frequency of the bins (the harmonic peaks of the spectrum) in the linear power spectrum. It has a high correlation with the perceptual feature of "sharpness" of a sound. The HSD descriptor measures the spectral deviation of the harmonic peaks from the global envelope. The HSS descriptor measures the amplitude-weighted standard deviation (root mean square) of the harmonic peaks of the spectrum, normalized by the HSC. The HSV descriptor is the normalized correlation between the amplitudes of the harmonic peaks in two subsequent time slices of the signal.

6. Spectral Basis: Audio Spectrum Basis (ASB) and Audio Spectrum Projection (ASP).

3.1 Tools for Working with MPEG-7 Descriptors

We utilized the Sound Description Toolbox [13] and the MPEG-7 Audio Analyzer – Low Level Descriptors Extractor [15] for our experiments. Both of them extract a number of MPEG-7 standard descriptors, both scalar ones and time series. In addition, the SDT also calculates perceptual features such as Mel Frequency Cepstral Coefficients, Specific Loudness and Sensation Coefficients. From these descriptors, it calculates means, covariances, means of first-order differences and covariances of first-order differences. The total number of features provided by this toolbox is 187.
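As a rough illustration of how such scalar features can be distilled from descriptor time series, the following Python sketch computes two simplified per-frame descriptors (loosely analogous to the MPEG-7 spectral centroid and flatness) and collapses them into means, covariances and first-order-difference statistics in the spirit of the SDT. The framing parameters and descriptor formulas are our own illustrative assumptions, not the actual implementation of either toolbox.

```python
import numpy as np

def frame_signal(x, frame_len=1024, hop=512):
    """Split a mono signal into overlapping frames."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n)])

def descriptor_time_series(x, sr):
    """Per-frame spectral centroid and flatness (simplified analogues
    of the MPEG-7 ASC and ASF descriptors)."""
    frames = frame_signal(x) * np.hanning(1024)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2 + 1e-12
    freqs = np.fft.rfftfreq(1024, d=1.0 / sr)
    centroid = (power * freqs).sum(axis=1) / power.sum(axis=1)
    flatness = np.exp(np.mean(np.log(power), axis=1)) / power.mean(axis=1)
    return np.stack([centroid, flatness], axis=1)  # shape (n_frames, 2)

def scalar_features(ts):
    """Collapse a descriptor time series into scalars the way the SDT
    does: means, covariances, and the same for first-order differences."""
    d = np.diff(ts, axis=0)
    return np.concatenate([ts.mean(axis=0), np.cov(ts.T).ravel(),
                           d.mean(axis=0), np.cov(d.T).ravel()])

# A synthetic one-second "utterance": a 220 Hz tone plus noise.
sr = 16000
t = np.arange(sr) / sr
utterance = np.sin(2 * np.pi * 220 * t) + 0.1 * np.random.randn(sr)
feats = scalar_features(descriptor_time_series(utterance, sr))
```

With two descriptors this yields 2 + 4 + 2 + 4 = 12 scalar features; applying the same recipe to the full set of SDT descriptors is what produces the 187-dimensional feature vectors used below.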
4 Employed Classification Methods

We have elaborated our approach to sentiment analysis from utterances for six classification methods: k nearest neighbours, support vector machines, multilayer perceptrons, classification trees, random forests [7] and long short-term memory (LSTM) networks [5, 6, 8]. The first five of them have already been implemented and tested (cf. Section 5); the last and most advanced one is still being implemented.

4.1 k Nearest Neighbours (kNN)

A very traditional way of classifying a new feature vector x ∈ X, if a sequence of training data (x_1, c_1), ..., (x_p, c_p) is available, is the nearest neighbour method: take the x_j that is closest to x among x_1, ..., x_p, and assign to x the class assigned to x_j, i.e., c_j.

A straightforward generalization of the nearest neighbour method is to take among x_1, ..., x_p not one, but the k feature vectors x_{j_1}, ..., x_{j_k} closest to x. Then x is assigned the class c ∈ C fulfilling

  |\{i,\ 1 \le i \le k \mid c_{j_i} = c\}| = \max_{c' \in C} |\{i,\ 1 \le i \le k \mid c_{j_i} = c'\}|.  (1)

This method is called, expectedly, k nearest neighbours, or k-NN for short.

4.2 Support Vector Machines (SVM)

Support vector machines are classifiers into two classes. This method attempts to derive from the training data (x_1, c_1), ..., (x_p, c_p) the best possible generalization to unseen feature vectors.

If both classes, more precisely their intersections with the set {x_1, ..., x_p} of training inputs, are linearly separable in the space of feature vectors, the method constructs two parallel hyperplanes H_+ = \{x \in \mathbb{R}^n \mid x^\top w + b_+ = 0\}, H_- = \{x \in \mathbb{R}^n \mid x^\top w + b_- = 0\} such that the training data fulfil

  c_k = \begin{cases} 1 & \text{if } x_k^\top w + b_+ \ge 0, \\ -1 & \text{if } x_k^\top w + b_- \le 0, \end{cases} \qquad k = 1, \dots, p,  (2)

  H_+ \cap \{x_1, \dots, x_p\} \ne \emptyset, \quad H_- \cap \{x_1, \dots, x_p\} \ne \emptyset.  (3)

The hyperplanes H_+ and H_- are called support hyperplanes. Their common normal vector w and intercepts b_+, b_- are obtained by solving the following constrained optimization task: maximize with respect to w, b_+, b_- the distance

  d(H_+, H_-) = \frac{b_+ - b_-}{\|w\|}  (4)

on condition that the p inequalities (2) hold. The distance (4) is commonly called the margin. The solution to this optimization task coincides with the saddle point (w^*, b_+^*, b_-^*, \alpha_1^*, \dots, \alpha_p^*) of the Lagrange function

  L(w, b_+, b_-, \alpha_1, \dots, \alpha_p) = \|w\|^2 + \sum_{k=1}^{p} \alpha_k \left( \frac{b_+ - b_-}{2} - c_k x_k^\top w \right),  (5)

where \alpha_1, \dots, \alpha_p \ge 0 are the Lagrange coefficients of the optimization task. Once the saddle point (w^*, b_+^*, b_-^*, \alpha_1^*, \dots, \alpha_p^*) is found, the classifier is defined by

  \phi(x) = \begin{cases} 1 & \text{if } \sum_{x_k \in S} \alpha_k^* c_k x^\top x_k + b^* \ge 0, \\ -1 & \text{if } \sum_{x_k \in S} \alpha_k^* c_k x^\top x_k + b^* < 0, \end{cases}  (6)

where b^* = \frac{1}{2}(b_+^* + b_-^*) and

  S = \{x_k \mid \alpha_k^* > 0\}.  (7)

Due to the Karush-Kuhn-Tucker (KKT) conditions,

  \alpha_k^* \left( \frac{b_+^* - b_-^*}{2} - c_k x_k^\top w^* \right) = 0, \qquad k = 1, \dots, p,  (8)

all feature vectors from the set S lie on one of the support hyperplanes (3). Therefore, they are called support vectors. This name, together with the observation that they completely determine the classifier defined in (6), explains why such a classifier is called a support vector machine.

If the intersections of both classes with the training inputs are not linearly separable, an SVM is constructed similarly, but instead of the set of possible feature vectors, the set of functions

  \kappa(\cdot, x) \quad \text{for all possible feature vectors } x  (9)

is now considered, where \kappa is a kernel, i.e., a mapping on pairs of feature vectors that is symmetric and such that for any k \in \mathbb{N} and any sequence of different feature vectors x_1, \dots, x_k, the matrix

  G_\kappa(x_1, \dots, x_k) = \begin{pmatrix} \kappa(x_1, x_1) & \dots & \kappa(x_1, x_k) \\ \vdots & & \vdots \\ \kappa(x_k, x_1) & \dots & \kappa(x_k, x_k) \end{pmatrix},  (10)

which is called the Gram matrix of x_1, \dots, x_k, is positive semidefinite, i.e.,

  (\forall y \in \mathbb{R}^k)\ y^\top G_\kappa(x_1, \dots, x_k)\, y \ge 0.  (11)

The most commonly used kinds of kernels are the Gaussian kernel with a parameter \varsigma > 0,

  (\forall x, x' \in \mathbb{R}^n)\ \kappa(x, x') = \exp\left( -\frac{1}{\varsigma} \|x - x'\|^2 \right),  (12)

and the polynomial kernel with parameters d \in \mathbb{N} and c \ge 0,

  (\forall x, x' \in \mathbb{R}^n)\ \kappa(x, x') = (x^\top x' + c)^d.  (13)

It is known [14] that, due to the properties of kernels, if the joint distribution of a sequence of different feature vectors x_1, \dots, x_k is continuous, then almost surely any proper subset of the set of functions \{\kappa(\cdot, x_1), \dots, \kappa(\cdot, x_k)\} is linearly separable from its complement in the space of all functions (9).

However, the feature vectors x and x_k cannot simply be replaced by the corresponding functions \kappa(\cdot, x) and \kappa(\cdot, x_k) in the definition (6) of an SVM classifier, because a transpose x^\top exists for a finite-dimensional vector, but not for an infinite-dimensional function. Fortunately, the transpose occurs in (6) only as a part of the scalar product x^\top x_k, and a scalar product can also be defined on the space of all functions (9). Namely, the properties of a scalar product are possessed by the function that assigns to the pair of functions (\kappa(\cdot, x), \kappa(\cdot, x')) the value \kappa(x, x'). Using this scalar product in (6), we obtain the following definition of an SVM classifier for linearly non-separable classes:

  \phi(x) = \begin{cases} 1 & \text{if } \sum_{x_k \in S} \alpha_k^* c_k \kappa(x, x_k) + b^* \ge 0, \\ -1 & \text{if } \sum_{x_k \in S} \alpha_k^* c_k \kappa(x, x_k) + b^* < 0. \end{cases}  (14)
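To make the kernel formulation concrete, the following sketch implements the Gaussian kernel (12), the Gram matrix (10), and the decision rule (14) in Python. The multipliers α and intercept b are illustrative placeholders, since solving the underlying optimization task is beyond this snippet.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel (12); sigma plays the role of the parameter ς."""
    return np.exp(-np.sum((x - y) ** 2) / sigma)

def gram_matrix(X, kernel):
    """Gram matrix (10) of the rows of X."""
    return np.array([[kernel(a, b) for b in X] for a in X])

def svm_decision(x, support, alpha, c, b, kernel):
    """Decision rule (14): sign of sum_k alpha_k c_k kappa(x, x_k) + b."""
    s = sum(a * ck * kernel(x, xk) for a, ck, xk in zip(alpha, c, support))
    return 1 if s + b >= 0 else -1

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
G = gram_matrix(X, gaussian_kernel)
# Positive semidefiniteness (11): no eigenvalue below zero (up to rounding).
psd = np.linalg.eigvalsh(G).min() >= -1e-8

# Illustrative multipliers and class labels (placeholders, not optimized):
alpha = np.ones(5)
c = np.array([1, -1, 1, -1, 1])
label = svm_decision(rng.normal(size=3), X, alpha, c, 0.0, gaussian_kernel)
```

Note that only the training vectors with α_k > 0 (the support vectors of (7)) contribute to the sum, which is why a trained SVM can discard the rest of the training data.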
4.3 Multilayer Perceptrons (MLP)

A multilayer perceptron is a mapping \phi of feature vectors to classes with which a directed graph G_\phi = (V, E) is associated. Due to the inspiration from biological neural networks, the vertices of G_\phi are called neurons and its edges are called connections. In addition, G_\phi is required to have a layered structure, which means that the set V of neurons can be decomposed into L + 1 mutually disjoint layers, V = V_0 \cup V_1 \cup \dots \cup V_L, L \ge 2, such that

  (\forall (u, v) \in E)\ u \in V_i,\ i = 0, \dots, L-1\ \&\ v \notin V_i \implies v \in V_{i+1}.  (15)

The layer I = V_0 is called the input layer of the MLP, the layer O = V_L its output layer, and the layers H_1 = V_1, \dots, H_{L-1} = V_{L-1} its hidden layers.

The purpose of the graph G_\phi associated with the mapping \phi is to define a decomposition of \phi into simple mappings assigned to hidden and output neurons and to connections between neurons (input neurons normally only accept the components of the input, and no mappings are assigned to them). Inspired by biological terminology, mappings assigned to neurons are called somatic, those assigned to connections are called synaptic.

To each connection (u, v) \in E, multiplication by a weight w_{(u,v)} is assigned as a synaptic mapping:

  (\forall \xi \in \mathbb{R})\ f_{(u,v)}(\xi) = w_{(u,v)} \xi.  (16)

To each hidden neuron v \in H_i, the following somatic mapping is assigned:

  (\forall \xi \in \mathbb{R}^{|in(v)|})\ f_v(\xi) = \varphi\Big( \sum_{u \in in(v)} [\xi]_u + b_v \Big),  (17)

where [\xi]_u for u \in in(v) denotes the component of \xi that is the output of the synaptic mapping f_{(u,v)} assigned to the connection (u, v), in(v) = \{u \in V \mid (u, v) \in E\} is the input set of v, and \varphi : \mathbb{R} \to \mathbb{R} is called the activation function. As activation functions, in applications typically sigmoidal functions are used, i.e., functions that are non-decreasing, piecewise continuous, and such that

  -\infty < \lim_{t \to -\infty} \varphi(t) < \lim_{t \to \infty} \varphi(t) < \infty.  (18)

The activation functions most frequently encountered in MLPs are:

• the logistic function,

  (\forall t \in \mathbb{R})\ \varphi(t) = \frac{1}{1 + e^{-t}};  (19)

• the hyperbolic tangent,

  \varphi(t) = \tanh t = \frac{e^t - e^{-t}}{e^t + e^{-t}}.  (20)

To an output neuron v \in O, a somatic mapping of the kind (17) with the activation functions (19) or (20) can also be assigned. If that is the case, then the class c predicted for a feature vector x is obtained as c = \arg\max_i (\phi(x))_i, where (\phi(x))_i denotes the i-th component of \phi(x). Alternatively, the activation function assigned to an output neuron can be the step function, aka the Heaviside function,

  \varphi(t) = \begin{cases} 0 & \text{if } t < 0, \\ 1 & \text{if } t \ge 0. \end{cases}  (21)

In that case, the value (\phi(x))_c already directly indicates whether x belongs to the class c.

4.4 Classification Trees (CT)

A classifier \phi : X \to C = \{c_1, \dots, c_m\} is called a binary classification tree if there is a binary tree T_\phi = (V_\phi, E_\phi) with vertices V_\phi and edges E_\phi such that:

(i) V_\phi = \{v_1, \dots, v_L, \dots, v_{2L-1}\}, where L \ge 2, v_1 is the root of T_\phi, v_1, \dots, v_{L-1} are its forks and v_L, \dots, v_{2L-1} are its leaves.

(ii) If the children of a fork v \in \{v_1, \dots, v_{L-1}\} are v_L \in V_\phi (left child) and v_R \in V_\phi (right child) and if v = v_i, v_L = v_j, v_R = v_k, then i < j < k.

(iii) To each fork v \in \{v_1, \dots, v_{L-1}\}, a predicate \varphi_v of some formal logic is assigned, evaluated on features of the input vectors x \in X.

(iv) To each leaf v \in \{v_L, \dots, v_{2L-1}\}, a class c_v \in C is assigned.

(v) For each input x \in X, the predicate \varphi_{v_1} assigned to the root is evaluated.

(vi) If for a fork v \in \{v_1, \dots, v_{L-1}\} the predicate \varphi_v evaluates to true, then \phi(x) = c_{v_L} in case v_L is already a leaf, and the predicate \varphi_{v_L} is evaluated in case v_L is still a fork.

(vii) If for a fork v \in \{v_1, \dots, v_{L-1}\} the predicate \varphi_v evaluates to false, then \phi(x) = c_{v_R} in case v_R is already a leaf, and the predicate \varphi_{v_R} is evaluated in case v_R is still a fork.

4.5 Random Forests (RF)

Random forests are ensembles of classifiers in which the individual members are classification trees. They are constructed by bagging, i.e., bootstrap aggregation of individual trees, which consists in training each member of the ensemble with a different set of training data, sampled randomly with replacement from the original training pairs (x_1, c_1), \dots, (x_p, c_p). Typical sizes of random forests encountered in applications are dozens to thousands of trees. Subsequently, when new subjects are input to the forest, each tree classifies them separately, according to the leaves at which they end, and the final classification by the forest is obtained by means of an aggregation function. The usual aggregation function of random forests is majority voting, or some of its fuzzy generalizations.
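A minimal sketch of bagging with majority voting follows; to keep it short, one-split decision stumps stand in for full classification trees (our own simplification, not the Matlab implementation used in the experiments).

```python
import numpy as np

class Stump:
    """A one-split classification tree: threshold on a single feature."""
    def fit(self, X, y):
        best_err = np.inf
        for f in range(X.shape[1]):
            for t in X[:, f]:
                mask = X[:, f] <= t
                if mask.all() or not mask.any():
                    continue  # the split must separate something
                cl = np.bincount(y[mask]).argmax()   # majority class left
                cr = np.bincount(y[~mask]).argmax()  # majority class right
                err = (y[mask] != cl).sum() + (y[~mask] != cr).sum()
                if err < best_err:
                    best_err = err
                    self.f, self.t, self.cl, self.cr = f, t, cl, cr
        return self

    def predict(self, X):
        return np.where(X[:, self.f] <= self.t, self.cl, self.cr)

def bagged_forest(X, y, n_trees=25, seed=0):
    """Bootstrap aggregation: each tree sees a resampled training set."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))  # with replacement
        trees.append(Stump().fit(X[idx], y[idx]))
    return trees

def forest_predict(trees, X):
    """Majority voting over the individual trees."""
    votes = np.stack([t.predict(X) for t in trees])
    return np.array([np.bincount(col).argmax() for col in votes.T])

# Toy two-class data: two well-separated Gaussian blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, size=(20, 2)), rng.normal(2, 1, size=(20, 2))])
y = np.array([0] * 20 + [1] * 20)
trees = bagged_forest(X, y, n_trees=15)
acc = (forest_predict(trees, X) == y).mean()
```

Even though each bootstrap replica yields a slightly different tree, the vote of the ensemble is more stable than any single member, which is the point of bagging.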
According to which kind of randomness is involved in the construction of the ensemble, two broad groups of random forests can be differentiated:

1. Random forests grown in the full input space. Each tree is trained using all considered input features. Consequently, any feature has to be taken into account when looking for the split condition assigned to an inner node of the tree. However, the features actually occurring in the split conditions can be different from tree to tree, as a consequence of the fact that each tree is trained with a different set of training data. For the same reason, even if a particular feature occurs in split conditions of two different trees, those conditions can be assigned to nodes at different levels of the tree. A great advantage of this kind of random forests is that each tree is trained using all the information available in its set of training data. Its main disadvantage is high computational complexity. In addition, if several or even only one variable are very noisy, that noise nonetheless gets incorporated into all trees in the forest. Because of those disadvantages, random forests are grown in the complete input space primarily if its dimension is not high and no input feature is substantially noisier than the remaining ones.

2. Random forests grown in subspaces of the input space. Each tree is trained using only a randomly chosen fraction of the features, typically a small one. This means that a tree t is actually trained with projections of the training data into a low-dimensional space spanned by some randomly selected dimensions i_{t,1} \le \dots \le i_{t,d_t} \in \{1, \dots, d\}, where d is the dimension of the input space and d_t is typically much smaller than d. Using only a subset of features not only makes forest training much faster, but also allows eliminating noise originating from only several features. The price paid for both these advantages is that training makes use of only a part of the information available in the training data.

4.6 Long Short-Term Memory (LSTM)

An LSTM network is used for the classification of sequences of feature vectors, or equivalently, multidimensional time series with discrete time. Alternatively, it can also be employed to obtain sequences of such classifications, i.e., in situations when the neural network input is a sequence of feature vectors and its output is a sequence of classes. Differently from most other commonly encountered kinds of artificial neural networks, an LSTM layer connects not simple neurons, but units with their own inner structure. Several variants of an LSTM have been proposed (e.g., [5, 6]); all of them include at least the following four kinds of units, described below. Each of them has certain properties of usual ANN neurons; in particular, the values assigned to them depend, apart from a bias, on values assigned to the unit input at the same time step and on values assigned to the unit output at the previous time step. Hence, an LSTM network layer is a recurrent network.

(i) Memory cells can store values, aka cell states, for an arbitrary time. They have no activation function, thus their output is actually a biased linear combination of unit inputs and of the values from the previous time step coming through recurrent connections.

(ii) The input gate controls the extent to which values from the previous unit or from the preceding layer influence the value stored in the memory cell. It has a sigmoidal activation function, which is applied to a biased linear combination of unit inputs and of values from the previous time step, though the bias and synaptic weights of the input and recurrent connections are specific and in general different from the bias and synaptic weights of the memory cell.

(iii) The forget gate controls the extent to which the memory cell state is suppressed. It again has a sigmoidal activation function, which is applied to a specific biased linear combination of unit inputs and of values from the previous time step.

(iv) The output gate controls the extent to which the memory cell state influences the unit output. Also this gate has a sigmoidal activation function, which is applied to a specific biased linear combination of unit inputs and of values from the previous time step, and subsequently composed either directly with the cell state or with its sigmoidal transformation, using another sigmoid than is used by the gates.

5 Experimental Testing

5.1 Berlin Database of Emotional Speech

For the evaluation of the already implemented classifiers, we used the publicly available dataset "EmoDB", aka the Berlin database of emotional speech. It consists of 535 emotional utterances in 7 emotional categories, namely anger, boredom, disgust, fear, happiness, sadness and neutral. These utterances are sentences read by 10 professional actors, 5 males and 5 females [1], which were recorded in an anechoic chamber under the supervision of linguists and psychologists. The actors were advised to read these predefined sentences in the targeted emotional categories, but the sentences do not contain any emotional bias. A human perception test was conducted with 20 persons, different from the speakers, in order to evaluate the quality of the recorded data with respect to the recognisability and naturalness of the presented emotion. This evaluation yielded a mean accuracy of 86% over all emotional categories.

5.2 Experimental Settings
As input features, the outputs from the Sound Description Toolbox were used. Consequently, the input dimension was 187. The already implemented classifiers were compared by means of a 10-fold cross-validation, using the following settings for each of them:

• For the k nearest neighbours classification, the value k = 9 was chosen by a grid method from ⟨1, 80⟩. This classifier was applied to data normalized to zero mean and unit variance.

• Support vector machines are constructed for each of the 7 considered emotions, to classify between that emotion and all the remaining ones. They employ auto-scaled Gaussian kernels and do not use slack variables.

• The MLP has 1 hidden layer with 70 neurons. Hence, taking into account the input dimension and the number of classes, the overall architecture of the MLP is 187-70-7.

• Classification trees are restricted to have at most 23 leaves. This upper limit was chosen by a grid method from ⟨1, 50⟩, taking into account the way classification trees are grown in their Matlab implementation.

• Random forests consist of 50 classification trees, each of them taking over the above restriction. The number of trees was selected by a grid method from 10, 20, ..., 100.

5.3 Comparison of Already Implemented Classifiers

First, we compared the already implemented classifiers on the whole Berlin database of emotional speech, with respect to accuracy and area under the ROC curve (area under curve, AUC). Since a ROC curve makes sense only for a binary classifier, we computed the areas under 7 separate curves corresponding to classifiers classifying always 1 emotion against the rest. The results are presented in Table 1 and in Figure 1. They clearly show SVM as the most promising classifier. It has the highest accuracy, and also the highest AUC for the binary classifiers corresponding to 5 of the 7 emotions.

Table 1: Accuracy and area under curve (AUC) of the implemented classifiers on the whole Berlin database of emotional speech. AUC is measured for binary classification of each of the considered 7 emotions against the rest.

  Classifier  Accuracy  Anger  Boredom  Disgust  Fear   Happiness  Neutral  Sadness
  kNN         0.73      0.956  0.933    0.901    0.902  0.856      0.962    0.995
  SVM         0.93      0.979  0.973    0.966    0.983  0.904      0.974    0.997
  MLP         0.78      0.977  0.969    0.964    0.969  0.933      0.983    0.996
  DT          0.59      0.871  0.836    0.772    0.782  0.683      0.855    0.865
  RF          0.71      0.962  0.949    0.920    0.921  0.882      0.972    0.992

Then we compared the classifiers separately on the utterances of each of the 10 speakers who created the database. The results are summarized in Table 2 for accuracy and in Table 3 for AUC averaged over all 7 emotions.

Table 2: Comparison between pairs of implemented classifiers with respect to accuracy, based on 10 independent parts of the Berlin database of emotional speech corresponding to 10 different speakers. The result in a cell of the table indicates on how many parts the accuracy of the row classifier was higher : on how many parts the accuracy of the column classifier was higher. A result in bold indicates that after the Friedman test rejected the hypothesis of equal accuracy of all classifiers, the post-hoc test according to [3, 4] rejects the hypothesis of equal accuracy of the particular row and column classifiers. All simultaneously tested hypotheses were corrected in accordance with Holm [9].

  classifier  kNN      SVM   MLP      DT    RF
  kNN         –        0:10  3.5:6.5  9:1   5:5
  SVM         10:0     –     10:0     10:0  10:0
  MLP         6.5:3.5  0:10  –        10:0  7:3
  DT          1:9      0:10  0:10     –     0:10
  RF          5:5      0:10  3:7      10:0  –
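The Friedman statistic underlying this pairwise comparison can be sketched as follows; the per-speaker scores below are made-up illustrative numbers, not the measured results of the paper.

```python
import numpy as np

def avg_ranks(row):
    """Ranks 1..k within one dataset; ties receive their average rank."""
    order = np.argsort(row)
    ranks = np.empty(len(row))
    ranks[order] = np.arange(1, len(row) + 1)
    for v in np.unique(row):
        ranks[row == v] = ranks[row == v].mean()
    return ranks

def friedman_statistic(scores):
    """Friedman chi-square statistic for an N x k table of scores
    (N datasets as rows, k classifiers as columns), cf. [3]."""
    N, k = scores.shape
    ranks = np.apply_along_axis(avg_ranks, 1, scores)
    mean_ranks = ranks.mean(axis=0)
    return 12 * N / (k * (k + 1)) * np.sum((mean_ranks - (k + 1) / 2) ** 2)

# Hypothetical per-speaker accuracies of 3 classifiers on 10 speakers
# (illustrative values only): the middle classifier always wins.
scores = np.array([[0.70, 0.92, 0.60]] * 10) + 0.01 * np.arange(10)[:, None]
chi2 = friedman_statistic(scores)  # large value => reject equal performance
```

A large statistic (compared against a chi-square distribution with k − 1 degrees of freedom) rejects the hypothesis of equal performance, after which post-hoc pairwise tests with the Holm correction [3, 4, 9] localize the differing pairs, as in Tables 2 and 3.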
These results indicate a great difference between most of the compared classifiers. This is confirmed by the Friedman test of the hypotheses that all classifiers have equal accuracy and equal average AUC. The Friedman test rejected both hypotheses with a high significance: with the Holm correction for simultaneously tested hypotheses [9], the achieved significance level (aka p-value) was 4 · 10⁻⁶. For both hypotheses, post-hoc tests according to [3, 4] were performed, testing equal accuracy and equal average AUC between individual pairs of classifiers. At the family-wise significance level of 5%, they reveal the following Holm-corrected significant differences between individual pairs of classifiers: both for accuracy and averaged AUC, (SVM, DT) and (MLP, DT); and in addition, (kNN, SVM) and (SVM, RF) for accuracy.

Table 3: Comparison between pairs of implemented classifiers with respect to the AUC averaged over all 7 emotions, based on 10 independent parts of the Berlin database of emotional speech corresponding to 10 different speakers. The result in a cell of the table indicates on how many parts the AUC of the row classifier was higher : on how many parts the AUC of the column classifier was higher. A result in bold indicates that after the Friedman test rejected the hypothesis of equal AUC of all classifiers, the post-hoc test according to [3, 4] rejects the hypothesis of equal AUC of the particular row and column classifiers. All simultaneously tested hypotheses were corrected in accordance with Holm [9].

  classifier  kNN   SVM   MLP   DT    RF
  kNN         –     2:8   0:10  10:0  4:6
  SVM         8:2   –     5:5   10:0  9:1
  MLP         10:0  5:5   –     10:0  9:1
  DT          0:10  0:10  0:10  –     0:10
  RF          6:4   1:9   1:9   10:0  –

6 Conclusion

The presented work in progress investigated the possibilities of analysing emotions in utterances based on MPEG-7 features. So far, we have implemented only five classification methods not using time-series features, but only 187 scalar features, namely the k nearest neighbours classifier, support vector machines, multilayer perceptrons, decision trees and random forests. The obtained results indicate that especially support vector machines and multilayer perceptrons are quite successful for this task. Statistical testing confirms significant differences between these two kinds of classifiers on the one hand, and decision trees and random forests on the other hand.

The next step in this ongoing research is to implement the long short-term memory neural network, recalled in Subsection 4.6, because it can work not only with scalar features but also with features represented as time series.

Acknowledgement

The research reported in this paper has been supported by the Czech Science Foundation (GAČR) grant 18-18080S.

References

[1] F. Burkhardt, A. Paeschke, M. Rolfes, W. Sendlmeier, and B. Weiss. A database of German emotional speech. In Interspeech, pages 1517–1520, 2005.
[2] M. Casey, A. De Cheveigne, P. Gardner, M. Jackson, and G. Peeters. MPEG-7 multimedia software resources. http://mpeg7.doc.gold.ac.uk/, 2001.
[3] J. Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30, 2006.
[4] S. Garcia and F. Herrera. An extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all pairwise comparisons. Journal of Machine Learning Research, 9:2677–2694, 2008.
[5] F.A. Gers, J. Schmidhuber, and J. Cummins. Learning to forget: Continual prediction with LSTM. In 9th International Conference on Artificial Neural Networks: ICANN '99, pages 850–855, 1999.
[6] A. Graves. Supervised Sequence Labelling with Recurrent Neural Networks. PhD thesis, TU München, 2008.
[7] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning, 2nd Edition. Springer, 2008.
[8] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9:1735–1780, 1997.
[9] S. Holm. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6:65–70, 1979.
[10] H.G. Kim, N. Moreau, and T. Sikora. MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval. John Wiley and Sons, New York, 2005.
[11] S. Lalitha, A. Madhavan, B. Bhusan, and S. Saketh. Speech emotion recognition. In International Conference on Advances in Electronics, pages 92–95, 2014.
[12] A.S. Lampropoulos and G.A. Tsihrintzis. Evaluation of MPEG-7 descriptors for speech emotional recognition. In Eighth International Conference on Intelligent Information Hiding and Multimedia Signal Processing, pages 98–101, 2012.
[13] A. Rauber, T. Lidy, J. Frank, E. Benetos, V. Zenz, G. Bertini, T. Virtanen, A.T. Cemgil, S. Godsill, D. Clark, P. Peeling, E. Peisyer, Y. Laprie, A. Sloin, A. Alfandary, and D. Burshtein. MUSCLE network of excellence: Multimedia understanding through semantics, computation and learning. Technical report, TU Vienna, Information and Software Engineering Group, 2004.
[14] B. Schölkopf and A.J. Smola. Learning with Kernels. MIT Press, Cambridge, 2002.
[15] T. Sikora, H.G. Kim, N. Moreau, and S. Amjad. MPEG-7-based audio annotation for the archival of digital video. http://mpeg7lld.nue.tu-berlin.de/, 2003.
[16] Z. Xiao, E. Dellandrea, W. Dou, and L. Chen. Multi-stage classification of emotional speech motivated by a dimensional emotion model. Multimedia Tools and Applications, 46:119–145, 2010.

Figure 1: ROC curves for all emotions on the whole Berlin database.