                  Audio Sentiment Analysis by Heterogeneous Signal Features
                   Learned from Utterance-Based Parallel Neural Network

                                               Ziqian Luo, Hua Xu, and Feiyang Chen
                                                         Tsinghua University, China
                                                        luoziqian@bupt.edu.cn



                            Abstract

Audio Sentiment Analysis is an increasingly popular research area which extends conventional text-based sentiment analysis to depend on the effectiveness of acoustic features extracted from speech. However, current progress on audio sentiment analysis mainly focuses on extracting homogeneous acoustic features or does not fuse heterogeneous features effectively. In this paper, we propose an utterance-based deep neural network model, which has a parallel combination of CNN and LSTM based networks, to obtain representative features termed the Audio Sentiment Vector (ASV), which can maximally reflect the sentiment information in an audio. Specifically, our model is trained with utterance-level labels, and the ASV is extracted and fused creatively from two branches. In the CNN branch, spectrum graphs produced by the signals are fed as inputs, while in the LSTM branch the inputs include spectral centroid, MFCC and other recognized traditional acoustic features extracted from dependent utterances in an audio. Besides, BiLSTM with an attention mechanism is used for feature fusion. Extensive experiments show that our model can recognize audio sentiment precisely and quickly, and demonstrate that our ASV is better than traditional acoustic features or vectors extracted from other deep learning models. Furthermore, experimental results indicate that the proposed model outperforms the state-of-the-art approach by 9.33% on the MOSI dataset.

                      1 Introduction

Sentiment Analysis is a well-studied research area in Natural Language Processing (NLP) (Pang B et al. 2008); it is the computational study of people's opinions, sentiments, appraisals, and attitudes towards entities such as products, services, organizations and so on (Liu B et al. 2015). Traditional sentiment analysis methods are mostly based on text. With the rapid development of communication technology, the abundance of smartphones and the rapid rise of social media, large amounts of data are uploaded by web users in the form of audio or video rather than text (S. Poria et al. 2017). Interestingly, a recent study shows that a voice-only modality seems best for humans' empathic accuracy, compared to video-only or audiovisual communication (Kraus et al. 2017). In fact, audio sentiment analysis is a difficult task due to the complexity of the audio signal. It is generally known that speech is the most convenient and natural medium for human communication; it not only carries implicit semantic information, but also contains rich affective information (S. Zhang et al. 2017). Therefore, audio sentiment analysis, which aims to correctly analyze the sentiment of the speaker from speech signals, has drawn a great deal of attention from researchers.
   In recent years, there have been three main methods for audio sentiment analysis. The first utilizes automatic speech recognition (ASR) technology to convert speech into text, followed by conventional text-based sentiment detection systems (S. Ezzat et al. 2012). The second adopts a generative model operating directly on the raw audio waveform (Van Den Oord A et al. 2016). The third focuses on extracting signal features from the raw audio files (Bertin-Mahieux et al. 2011), which captures the tonal content of the audio well and has been proved to be more effective than raw spectrum descriptors such as Mel-frequency cepstrum coefficients (MFCC).
   However, converting speech into text, by recognizing each word said by the person in an audio, turning the words into word embeddings and applying NLP techniques such as TF-IDF and bag-of-words models, does not always give accurate results, because sentiment detection accuracy depends on being able to reliably detect a very focused vocabulary in the spoken comments (Kaushik L et al. 2015). Furthermore, when the voice is transferred to text, some sentiment-related signal characteristics are also lost, resulting in a decrease in the accuracy of sentiment classification. As for methods that extract features from the raw audio files by hand and then feed them into a support vector machine (SVM) classifier, they require a lot of human work and are heavily dependent on language types. Luckily, along with its success in many other application domains, deep learning has also become popular in audio sentiment analysis in recent years (Mariel W C F et al. 2018). More recently, (G. Trigeorgis et al. 2016) directly use the raw audio samples to train a convolutional recurrent neural network (CRNN) to predict continuous arousal/valence space. (Mirsamadi et al. 2017) study the use of deep learning to automatically discover emotionally relevant features from speech. They propose a novel strategy for feature pooling over time which uses local attention in order to focus on specific regions of a speech signal that are
more emotionally salient. (Neumann et al. 2017) use an attentive convolutional neural network with a multi-view learning objective function and achieve state-of-the-art results on the improvised speech data of IEMOCAP. (Wang et al. 2017) propose to use deep neural networks (DNN) to encode each utterance into a fixed-length vector by pooling the activations of the last hidden layer over time. The feature encoding process is designed to be jointly trained with the utterance-level classifier for better classification. (Chen et al. 2018) propose a 3-D attention-based convolutional recurrent neural network to learn discriminative features for speech emotion recognition, where the Mel-spectrogram with deltas and delta-deltas is creatively used as input. But most of the previous methods still either considered only one single audio feature (Chen et al. 2018) or high-dimensional vectors from one homogeneous feature (Poria et al. 2017), and did not effectively extract and fuse audio features.
   We believe the information extracted from a single utterance must have dependency on its context. For example, a flash of loud expression may not indicate that a person has a strong emotion, since it may just be caused by a cough, while a continuous loud one is far more likely to indicate that the speaker has a strong emotion.
   In this paper, based on a large number of experiments, we extract the features of each utterance in an audio through the Librosa toolkit, obtain the four most effective features representing sentiment information, and merge them by adopting a BiLSTM with attention mechanism. Moreover, we design a novel model called Audio Feature Fusion-Attention based CNN and RNN (AFF-ACRNN) for audio sentiment analysis. Spectrum graphs and selected traditional acoustic features are fed as input to two separate branches, and we can obtain a new fused audio feature vector before the softmax layer, which we call the Audio Sentiment Vector (ASV). Finally, the output of the softmax layer is the class of sentiment.
   Major contributions of the paper are that:
• We propose an effective AFF-ACRNN model for audio sentiment analysis, combining multiple traditional acoustic features and spectrum graphs to learn more comprehensive sentiment information in audio.
• Our model is language insensitive and pays more attention to acoustic features of the original audio rather than to words recognized from the audio.
• Experimental results indicate that the proposed method outperforms the state-of-the-art methods (Poria et al. 2017) on the Multimodal Corpus of Sentiment Intensity dataset (MOSI) and the Multimodal Opinion Utterances Dataset (MOUD).
   The rest of the paper is organized as follows. In the following section, we review related work. In Section 3, we exhibit more details of our methodology. In Section 4, experiments and results are presented, and the conclusion follows in Section 5.

                      2 Related Work

Current state-of-the-art methods for audio sentiment analysis are mostly based on deep neural networks. In this section, we briefly present the advances on the audio sentiment analysis task achieved by utilizing deep learning, and then we give a summary of the progress on extracting the audio feature representation.

Long Short-Term Memory (LSTM)
It has been demonstrated that LSTM (Hochreiter S et al. 1997) is well-suited to making predictions based on time series data, by utilizing a cell to remember values over arbitrary time intervals and three gates (input gate i, output gate o, forget gate f) to regulate the flow of information into and out of the cell, which can be described as follows:

    f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
    i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
    o_t = σ(W_o · [h_{t-1}, x_t] + b_o)

where h_t = o_t ∗ tanh(C_t) is the output of the cell and x_t is the input of the current cell. Besides, the current cell state C_t can be updated by the following formulas:

    C̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)
    C_t = f_t ∗ C_{t-1} + i_t ∗ C̃_t

where C_{t-1} stands for the previous cell state.
   One of the most effective variants of LSTM is the bidirectional LSTM (BiLSTM). Each input sequence is fed into both the forward and the backward LSTM layers, and a hidden layer receives its input by joining the forward and backward LSTM layers.
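To make the gate equations concrete, here is a minimal NumPy sketch of a single LSTM step; the toy dimensions and random weights are illustrative assumptions, not the configuration used in our model. A BiLSTM simply runs one such recurrence forward and one backward over the sequence and joins the two hidden states.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # Concatenate previous hidden state and current input: [h_{t-1}, x_t]
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ z + b["f"])        # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])        # input gate
    o_t = sigmoid(W["o"] @ z + b["o"])        # output gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])    # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde        # updated cell state
    h_t = o_t * np.tanh(c_t)                  # cell output
    return h_t, c_t

# Toy sizes for illustration only.
hidden, inputs = 4, 3
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((hidden, hidden + inputs)) for k in "fioc"}
b = {k: np.zeros(hidden) for k in "fioc"}
h, c = np.zeros(hidden), np.zeros(hidden)
h, c = lstm_step(rng.standard_normal(inputs), h, c, W, b)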
Convolutional Neural Network (CNN)
CNNs (Y. Le Cun et al. 1990) are well known for extracting features from an image by using convolutional kernels and pooling layers to emulate the response of an individual to visual stimuli. Moreover, CNNs have been successfully used not only for computer vision, but also for speech (T. N. Sainath et al. 2015). For speech recognition, CNNs have been proved to be robust against noise compared to other DL models (D. Palaz et al. 2015).

Audio Feature Representation and Extraction
Researchers have found that pitch and energy related features play a key role in affect recognition (Poria S et al. 2017). Other features that have been used by some researchers for feature extraction include formants, MFCC, root-mean-square energy, spectral centroid and tonal centroid features. During speech production there are several utterances, and for each utterance the audio signal can be divided into several segments. Global features are calculated by measuring several statistics, e.g., the average, mean and deviation of the local features. Global features are the most commonly used features in the literature. They are fast to compute and, as they are fewer in number compared to local features, the overall speed of computation is enhanced
(El Ayadi M et al. 2011). However, there are some drawbacks to calculating global features, as some of them are only useful for detecting affect of high arousal, e.g., anger and disgust. For lower arousal, global features are not that effective; e.g., global features are less prominent for distinguishing between anger and joy. Global features also lack temporal information and dependence between two segments in an utterance. In a recent study (Cummins N et al. 2017), a new acoustic feature representation, denoted as deep spectrum features, was derived by feeding spectrum graphs through a very deep image classification CNN and forming a feature vector from the activation of the last fully connected layer. Librosa (McFee B et al. 2015) is an open-source Python package for music and audio analysis which is able to extract all the key features elaborated above.
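As an illustration of this kind of feature extraction, the following minimal sketch pulls the frame-level descriptors discussed above with Librosa (assuming a recent Librosa version) and summarizes them with simple global statistics; the input file name and the number of MFCC coefficients are assumptions made only for the example.

import librosa
import numpy as np

# Hypothetical input file; any mono utterance-level clip works the same way.
y, sr = librosa.load("utterance.wav", sr=None)

features = {
    "mfcc": librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13),              # (13, frames)
    "rms": librosa.feature.rms(y=y),                                   # root-mean-square energy
    "spectral_centroid": librosa.feature.spectral_centroid(y=y, sr=sr),
    "spectral_contrast": librosa.feature.spectral_contrast(y=y, sr=sr),
    "chroma_stft": librosa.feature.chroma_stft(y=y, sr=sr),
    "tonnetz": librosa.feature.tonnetz(y=y, sr=sr),                    # tonal centroid features
}

# Simple "global" statistics per feature, as discussed above.
global_stats = {name: (m.mean(axis=1), m.std(axis=1)) for name, m in features.items()}
print({name: m.shape for name, m in features.items()})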
                      3 Methodology
In this section, we describe the proposed AFF-ACRNN model for audio sentiment analysis in detail. We first introduce an overview of the whole neural network architecture. After that, the two separate branches of AFF-ACRNN are explained in detail. Finally, we discuss the fusion mechanism used in our model.

Model—AFF-ACRNN

Figure 1: Overview of AFF-ACRNN Model

   We concentrate on a model that has two parallel branches, the utterance-based BiLSTM branch (UB-BiLSTM) and the spectrum-based CNN branch (SBCNN), whose core mechanisms are based on LSTM and CNN. One branch of the proposed model uses the BiLSTM to extract temporal information between adjacent utterances; the other branch uses the renowned CNN based network to extract features from the spectrum graph that a sequence model cannot capture. Furthermore, the audio feature vector of each utterance is the input of the proposed neural network based on Audio Feature Fusion (AFF); we can obtain a new fused audio feature vector before the softmax layer, which we call the Audio Sentiment Vector (ASV). Finally, the output of the softmax layer produces our final sentiment classification results, as shown in Figure 1.

Audio Sentiment Vector (ASV) from Audio Feature Fusion (AFF)

Figure 2: Overview of Our UB-BiLSTM Model

LSTM Layers  The hidden layers of LSTM have self-recurrent weights. These enable the cell in the memory block to retain previous information (Bae et al. 2016). Firstly, we separate the different videos and take three continuous utterances (e.g., u1, u2, u3) of one video at a time. For each utterance (e.g., u1), we extract its internal acoustic features through the Librosa toolkit, say f11, f12, ..., f1n, which are then trained by two layers of BiLSTM in AFF1 to obtain features extracted from the traditional acoustic features. Therefore, the three utterances correspond to three more efficient and representative vectors v1, v2, v3, which are the inputs to the BiLSTM in AFF2. AFF2 effectively combines the contextual information between adjacent utterances, and then subtly acquires the utterance that has the greatest impact on the final sentiment classification through the attention mechanism. Finally, after the dropout layer, a more representative LASV extracted by our LSTM framework is obtained before the softmax layer, as shown in Figure 2. The process is described in the LSTM branch procedure in Algorithm 1.
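A rough Keras sketch of how such an AFF1/AFF2 structure can be wired is given below; the layer sizes and input shape follow the settings reported in the experiment section, while the dropout rate and the simple softmax attention used for pooling are illustrative assumptions rather than the exact implementation.

import tensorflow as tf
from tensorflow.keras import layers, models

FRAMES, FEAT_DIM = 256, 33      # per-utterance feature matrix size (see experiment section)

def utterance_encoder():
    # AFF1: two stacked BiLSTM layers shared by all utterances.
    inp = layers.Input(shape=(FRAMES, FEAT_DIM))
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(inp)
    x = layers.Bidirectional(layers.LSTM(32))(x)
    x = layers.Dropout(0.5)(x)          # dropout rate assumed for the sketch
    return models.Model(inp, x, name="AFF1")

encoder = utterance_encoder()
utts = [layers.Input(shape=(FRAMES, FEAT_DIM), name=f"utterance_{i}") for i in range(3)]
vecs = [encoder(u) for u in utts]                          # v1, v2, v3

seq = layers.Lambda(lambda v: tf.stack(v, axis=1))(vecs)   # (batch, 3, 64)
ctx = layers.Bidirectional(layers.LSTM(32, return_sequences=True))(seq)   # AFF2

# Soft attention over the three context-aware utterance representations.
scores = layers.Dense(1)(ctx)
alpha = layers.Softmax(axis=1)(scores)
lasv = layers.Lambda(lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([ctx, alpha])

lasv = layers.Dropout(0.5)(lasv)
out = layers.Dense(2, activation="softmax")(lasv)
model = models.Model(utts, out, name="UB_BiLSTM_branch")
model.summary()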
Algorithm 1 Related Procedure
 1: procedure LSTMBranch
 2:    for i : [0, n] do
 3:        fi = getAudioFeature(ui)
 4:        ASVi = getASV(fi)
 5:    end for
 6:    for i : [0, M] do                // M is the number of videos
 7:        inputi = GetTopUtter(vi)
 8:        ufi = getUtterFeature(inputi)
 9:    end for
10:    shuffle(v)
11: end procedure
12: procedure CNNBranch
13:    for i : [0, n] do
14:        xi ← getSpectrogramImage(ui)
15:        ci ← CNNModel(xi)
16:        li ← BiLSTM(ci)
17:    end for
18: end procedure
19: procedure FindCorrespondingLabel
20:    for i : [0, 2199] do
21:        rename(ui)                   // for better order in sorting
22:        NameAndLabel = createIndex(ui)
23:        // a dictionary [utterance name : label]
24:    end for
25:    Labelx = NameAndLabel(ux)
26: end procedure

CNN Layers  Similar to the UB-BiLSTM model proposed above, we extract the spectrum graph of each utterance through the Librosa toolkit and use it as the input of our CNN branch. After a lot of experiments, we found that the audio feature vector learned by the ResNet152 network structure has the best effect on the final sentiment classification, so we choose the ResNet model in this branch. The convolutional layer performs 2-dimensional convolution between the spectrum graph and the predefined linear filters. To enable the network to extract complementary features and learn the characteristics of the input spectrum graph, a number of filters with different functions are used. A more refined audio feature vector is obtained through the deep convolutional neural network and is then put into the BiLSTM layer to learn related sentiment information between adjacent utterances. Finally, before the softmax layer, we get another effective vector, CASV, extracted by our CNN framework, as shown in Figure 3. The process is described in the CNN branch procedure in Algorithm 1.

Figure 3: Overview of Our ResNet152 CNN Model
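A minimal sketch of the front end of this branch is shown below, assuming a mel-spectrogram rendered with Librosa and the ImageNet-pretrained ResNet152 available in tf.keras.applications; the 512 ∗ 512 input size follows the experiment section, while the simple rescaling and the input file name are assumptions for the example. In the full model, the per-utterance vectors produced this way are further passed through a BiLSTM over adjacent utterances.

import numpy as np
import librosa
import tensorflow as tf

def spectrogram_image(path, size=512):
    # Mel-spectrogram in dB, rescaled and tiled to a 3-channel square "image".
    y, sr = librosa.load(path, sr=None)
    mel = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr), ref=np.max)
    img = (mel - mel.min()) / (mel.max() - mel.min() + 1e-8) * 255.0
    img = tf.image.resize(img[..., np.newaxis], (size, size))
    return tf.repeat(img, 3, axis=-1)                        # (size, size, 3)

# Pretrained ResNet152 as the feature encoder (fine-tuned in practice).
encoder = tf.keras.applications.ResNet152(include_top=False, weights="imagenet",
                                           pooling="avg", input_shape=(512, 512, 3))
x = spectrogram_image("utterance.wav")                       # hypothetical input file
x = tf.keras.applications.resnet.preprocess_input(x)
casv_like = encoder(tf.expand_dims(x, 0))                    # (1, 2048) per-utterance vector
print(casv_like.shape)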
Fusion Layers  Through the LSTM and CNN branches proposed above, we can extract two refined audio sentiment vectors, LASV and CASV, for each utterance. We use these two kinds of vectors in parallel as the input of the BiLSTM in the AFF-ACRNN model. While effectively learning the relevant sentiment information of adjacent utterances, we extract the Audio Sentiment Vector (ASV) that has the greatest influence on the sentiment classification among the three utterances through the action of the attention mechanism. Finally, the sentiment classification result is obtained by the softmax layer. In (J. Donahue et al. 2015), the long-term recurrent convolutional network (LRCN) model was proposed for visual recognition. LRCN is a consecutive structure of CNN and LSTM: it processes the variable-length input with a CNN, whose outputs are fed into an LSTM network, which finally predicts the class of the input. In (T. N. Sainath et al. 2015), a cascade structure was used for voice search. Compared to the methods mentioned above, the proposed network forms a parallel structure in which the LSTM and CNN accept different inputs separately. Therefore, the Audio Sentiment Vector (ASV) can be extracted more comprehensively, and a better classification result can be obtained.
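The parallel fusion described here can be sketched as follows; the vector sizes, dropout rate and attention pooling are illustrative assumptions consistent with the two branch sketches above, not the exact implementation.

import tensorflow as tf
from tensorflow.keras import layers, models

# Assumed sizes: LASV from the LSTM branch, CASV from the ResNet152 branch,
# three adjacent utterances per sample.
LASV_DIM, CASV_DIM, N_UTT = 64, 2048, 3

lasv_in = layers.Input(shape=(N_UTT, LASV_DIM), name="lasv_sequence")
casv_in = layers.Input(shape=(N_UTT, CASV_DIM), name="casv_sequence")

fused = layers.Concatenate(axis=-1)([lasv_in, casv_in])     # parallel fusion per utterance
ctx = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(fused)

# Attention pooling over the three utterances (same idea as in the LSTM branch sketch).
alpha = layers.Softmax(axis=1)(layers.Dense(1)(ctx))
asv = layers.Lambda(lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([ctx, alpha])

out = layers.Dense(2, activation="softmax")(layers.Dropout(0.5)(asv))
fusion_model = models.Model([lasv_in, casv_in], out, name="AFF_fusion")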
Feature Fusion based on Attention Mechanism
Inspired by human visual attention, the attention mechanism was proposed by (Bahdanau et al. 2015) in machine translation; it is introduced into the Encoder-Decoder framework to select the reference words in the source language for words in the target language. We use the attention mechanism to preserve the intermediate outputs of the input sequence by retaining the LSTM encoder, and then a model is trained to selectively learn these inputs and to correlate the output sequences with the model output. Specifically, when we fuse the features, each phoneme of the output sequence is associated with some specific frames in the input speech sequence, so that the feature representation that has the greatest influence on the final sentiment classification can be obtained, finally yielding a fused Audio Feature Vector. At the same time, the attention mechanism behaves like a regulator, since it can judge the importance of the contribution of adjacent relevant utterances for classifying the target utterance. Indeed, it is very hard to tell the sentiment of a single utterance if you do not consider its contextual information. However, you will also make a wrong estimation if contextual information
is overly concerned. More specifically, in Figure 2, let A_x be the x-th attention network for utterance U_x; the corresponding attention weight vector is α_x and the weighted hidden representation is R_x. We have:

    P_x = tanh(W_h[x] · H)
    α_x = softmax(w[x]^T · P_x)
    R_x = H · α_x^T

The final representation for the x-th utterance is:

    h_x* = tanh(W_m[x] · R_x + W_n[x] · h_x)

where W_m[x] and W_n[x] are weights to be learned during training.
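A small NumPy sketch of this attention step, with H taken as the matrix whose columns are the hidden states of the BiLSTM over the adjacent utterances (all dimensions and weights below are toy values for illustration), is:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_pool(H, W_h, w, W_m, W_n, h_x):
    # H: (d, T) hidden states over T adjacent utterances.
    P = np.tanh(W_h @ H)                   # (k, T) projected states
    alpha = softmax(w @ P)                 # (T,) attention weights
    R = H @ alpha                          # (d,) weighted hidden representation
    return np.tanh(W_m @ R + W_n @ h_x)    # final representation h_x*

d, k, T = 64, 32, 3                        # hidden dim, projection dim, 3 utterances
rng = np.random.default_rng(1)
H = rng.standard_normal((d, T))
h_star = attention_pool(H, rng.standard_normal((k, d)), rng.standard_normal(k),
                        rng.standard_normal((d, d)), rng.standard_normal((d, d)),
                        H[:, -1])          # treat the last utterance as the target
print(h_star.shape)                        # (64,)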
                      4 Experiments

In this section, we exhibit our experimental results and the analysis of our proposed model. More specifically, our model is trained and evaluated on utterance-level audio from the CMU-MOSI dataset (A. Zadeh et al. 2016) and is also tested on MOUD (Pérez-Rosas V et al. 2013).

Experiment Setting

Evaluation Metrics  We evaluate our performance by weighted accuracy on 2-class, 5-class and 7-class classification:

    weighted accuracy = (correctly classified utterances) / (total utterances)

Additionally, the F-score is used to evaluate 2-class classification:

    F_β = (1 + β²) · (precision · recall) / (β² · precision + recall)

where β represents the weight between precision and recall. During our evaluation process, we set β = 1, since we regard precision and recall as having the same weight; thus the F1-score is adopted.
   However, in 5-class and 7-class classification, we use the Macro F1-score to evaluate the result:

    Macro F1 = (1/n) · Σ_{i=1}^{n} F1_i

where n represents the number of classes and F1_i is the F1 score on the i-th category.
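These metrics can be computed directly from the predictions; the plain-Python sketch below follows the definitions above (with β = 1), while the toy labels are assumptions for the example.

def weighted_accuracy(y_true, y_pred):
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def f_beta(y_true, y_pred, positive, beta=1.0):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = beta**2 * precision + recall
    return (1 + beta**2) * precision * recall / denom if denom else 0.0

def macro_f1(y_true, y_pred):
    classes = sorted(set(y_true))
    return sum(f_beta(y_true, y_pred, c) for c in classes) / len(classes)

# Toy 5-class example with labels -2..2 (illustrative only).
y_true = [-2, -1, 0, 1, 2, 2, 0, -1]
y_pred = [-2,  0, 0, 1, 2, 1, 0, -1]
print(weighted_accuracy(y_true, y_pred), macro_f1(y_true, y_pred))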
Dataset details  The CMU-MOSI dataset is rich in sentiment expressions, consisting of 2199 opinionated utterances in 93 videos by 89 speakers. The videos address a large array of topics, such as movies, books, and products. Videos were crawled from YouTube and segmented into utterances, where each utterance is annotated with a score between −3 (strongly negative) and +3 (strongly positive) by five annotators. We took the average of these five annotations as the sentiment polarity and considered three conditions consisting of two classes (positive and negative), five classes (strong positive, positive, neutral, negative and strong negative) and seven classes (strong positive, positive, weak positive, neutral, weak negative, negative and strong negative). Our train/test splits of the dataset are completely disjoint with respect to speakers. In order to better compare with previous work, similar to (Poria et al. 2017), we divide the data set by approximately 7:3; 1616 and 583 utterances are used for training and testing respectively. Furthermore, in order to verify that our model is not heavily dependent on the language category, we also tested it with the Spanish dataset MOUD. MOUD contains product review videos provided by 55 persons; the reviews are in Spanish. The detailed dataset setup is depicted in Table 1.

                        Train                Test
Datasets          utterance  video     utterance  video
MOSI                  1616     65           583     28
MOSI→MOUD             2199     93           437     79

Table 1: Datasets setting

Network structure parameter  Our proposed architecture is implemented based on the open-source deep learning framework Keras. More specifically, for the proposed UB-BiLSTM framework, after a lot of experiments, we extracted the four most representative audio features of each utterance in a video through the Librosa toolkit, which are MFCC, spectral centroid, chroma stft and spectral contrast respectively. In data processing, we put each utterance in one-to-one correspondence with its label and rename the utterance. Accordingly, we extend each utterance to a feature matrix of 256 ∗ 33 dimensions. The output dimension of the first BiLSTM layer is 128, and that of the second is 32. The output dimension of the first Dense layer is 200, and that of the second is 2.
   For the proposed CNN framework, the input images are warped into a fixed size of 512 ∗ 512. If the bounding box of the training samples is provided, we firstly crop the images and then warp them to the fixed size. To train the feature encoder, we follow the fine-tuning training strategy.
   In all experiments, our networks are trained with the Adam or SGD optimizer. In the LSTM branch, we initialize the learning rate to 0.0001, and there are 200 epochs in the training part with a batch size of 30. In the CNN branch, we initialize the learning rate to 0.001, and there are 200 epochs in training ResNet-152 with a batch size of 20.
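With the settings above, the LSTM-branch configuration can be sketched in Keras as follows; interpreting the stated output dimensions as LSTM unit sizes, the ReLU activation of the first Dense layer and the random placeholder data are assumptions for the example.

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# Per-utterance feature matrix: 256 x 33, as described above.
inp = layers.Input(shape=(256, 33))
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(inp)   # first BiLSTM (128)
x = layers.Bidirectional(layers.LSTM(32))(x)                              # second BiLSTM (32)
x = layers.Dense(200, activation="relu")(x)                               # first Dense layer (200)
out = layers.Dense(2, activation="softmax")(x)                            # second Dense layer (2 classes)
model = models.Model(inp, out)

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),     # LSTM-branch settings above
              loss="categorical_crossentropy", metrics=["accuracy"])

# Placeholder tensors standing in for the 1616 training utterances.
x_train = np.random.rand(1616, 256, 33).astype("float32")
y_train = tf.keras.utils.to_categorical(np.random.randint(0, 2, size=1616), 2)
model.fit(x_train, y_train, epochs=200, batch_size=30, validation_split=0.1)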
Performance Comparison

Comparison of different feature combinations.  Firstly, we considered seven types of acoustic features that can best represent an audio, which mainly include MFCC, root-mean-square energy, and spectral and tonal features. A lot of experiments have been done in order to get the best feature combinations with different models on the three types of classification. The results are listed in Table 2. What's more, we have also compared the performance of our LASV when extracted from the LSTM-based and from the BiLSTM-based fusion model. As Table 2 shows, the LASV extracted from the BiLSTM-based model behaves better, since the acoustic information that comes later may also have an impact
on the acoustic information that comes before it. It can be seen that the best number of features to combine is four, and those four features are MFCC, spectral centroid, spectral contrast and chroma stft. That means the other three features, such as root-mean-square energy and tonal centroid, may introduce some noise or be misleading in our sentiment analysis, since using all seven types of features does not give the best result.

  Best Feature                        Accuracy(%)
  Combination     Model      2-class   5-class   7-class
  Single Type     LSTM        55.12     23.64     16.99
                  BiLSTM      55.98     23.75     17.24
  Two Types       LSTM        62.26     28.23     21.54
                  BiLSTM      63.76     29.77     22.92
  Three Types     LSTM        66.36     32.98     24.66
                  BiLSTM      67.02     33.75     25.80
  Four Types      LSTM        68.23     33.15     26.27
                  BiLSTM      68.72     34.27     26.82
  Five Types      LSTM        67.86     31.29     25.79
                  BiLSTM      67.97     32.66     26.01
  Six Types       LSTM        67.88     32.23     26.07
                  BiLSTM      68.61     33.97     26.78
  Seven Types     LSTM        68.01     33.06     25.99
                  BiLSTM      68.67     34.18     26.12

Table 2: Comparison of different feature combinations

Comparison of several renowned CNN-based models.  We have compared our CASV performance, extracted from the spectral map, with several genres of popular CNN models and their variants: LeNet (LeCun Y et al. 1998), AlexNet (Krizhevsky A et al. 2012), VGG16 (Simonyan K et al. 2014), ZFNet (Zeiler M D et al. 2014) and ResNet (He K et al. 2016). The results are listed in Table 3. As the neural network goes deeper, more representative features can be extracted from the spectrum graph, which is why ResNet152 has the best performance. It benefits from the residual unit, which guarantees that the network will not degrade when it goes deeper.

                         2-class             5-class              7-class
  Methods           Acc(%)     F1      Acc(%)  Macro F1     Acc(%)  Macro F1
  LeNet              56.75    55.62     23.67     21.87      15.63     15.12
  AlexNet            58.71    57.88     26.43     23.19      19.21     18.79
  VGG16              57.88    55.97     27.37     25.78      17.34     16.25
  ZFNet              55.37    53.12     21.90     21.38      12.82     11.80
  ResNet18           58.94    56.79     25.26     24.63      18.35     17.89
  ResNet50           62.52    61.21     28.13     27.04      20.21     20.01
  ResNet152          65.42    64.86     28.78     28.08      21.56     20.57

Table 3: Comparison of SBCNN with different structures

Comparison of different combinations between SBCNN and UB-BiLSTM.  At last, we performed fusion experiments between several of the best SBCNN models and UB-BiLSTM and UB-LSTM. More precisely, we choose the three best SBCNN models, which are ResNet18, ResNet50 and ResNet152, to combine with the two kinds of utterance-dependent LSTM. The best combination is UB-BiLSTM with ResNet152. The final results are shown in Table 4.

                             2-class             5-class              7-class
  Methods               Acc(%)     F1      Acc(%)  Macro F1     Acc(%)  Macro F1
  UB-LSTM+Res18          67.19    66.37     33.83     31.97      26.78     25.83
  UB-LSTM+Res50          67.83    66.69     34.21     33.78      27.75     26.41
  UB-LSTM+Res152         68.64    67.94     35.87     34.11      28.15     27.03
  UB-BiLSTM+Res18        68.26    66.25     35.43     33.52      27.63     26.09
  UB-BiLSTM+Res50        69.18    68.22     36.93     34.67      28.11     27.54
  UB-BiLSTM+Res152       69.64    68.51     37.71     35.12      29.26     28.45

Table 4: Comparison of different combinations between SBCNN and UB-BiLSTM

Comparison with traditional methods.  Apart from training deep neural networks, a number of traditional binary classifiers have been used for sentiment analysis. In order to demonstrate the effectiveness of our model, we first compare our model with those traditional methods.
   [I2C2, 2017] (Maghilnan S et al. 2017) introduced a text-based SVM and Naive Bayes model for binary sentiment classification; thus we test their model on MOUD, rather than MOSI, to make a comparison with our model, because MOUD has only two sentiment levels and each utterance has a text record in the dataset.
   [BAJECE, 2018] (C. Bakir et al. 2018) In this paper, besides SVM, feature vectors such as Mel Frequency Discrete Wavelet Coefficients (MFDWC), MFCC and LPCC extracted from the original recorded signal are trained with classification algorithms such as Dynamic Time Warping (DTW), Hidden Markov Model (HMM) and Gaussian Mixture Model (GMM).
   As shown in Table 5, we use weighted accuracy (ACC) and F1-score to evaluate our results. In particular, for the ACC on MOUD, our proposed model outperforms the best of these models, the SVM classifier, by 11.51%.

                        MOUD
  Model           ACC(%)      F1
  SVM              57.23    54.83
  Naive Bayes      55.72    52.14
  GMM              54.66    52.89
  HMM              56.63    55.84
  DTW              53.92    53.06
  AFF-ACRNN        68.74    66.37

Table 5: Comparison with traditional methods on MOUD

Comparison with the state-of-the-art.  (Poria et al. 2017) introduced an LSTM-based model to utilize the contextual information extracted from each utterance in a video. However, the input of that neural network model has only one type of feature, which is MFCC. This means all the utterance information is merely represented by one single feature. The acoustic information contained by the feature is somewhat duplicated and is bound to omit much sentiment information that might be hidden in many other useful features. What's worse, one type of feature means the input vector should be large enough to make sure that it carries enough information before it is fed into the neural network. This will undoubtedly increase the parameters to be trained in the network; meanwhile, it is time consuming and computationally costly.
   Our proposed model not only extracts the feature or sentiment vector from four types of traditionally recognized acoustic features and considers utterance dependency, but also extracts features from the spectrum graph, which may reveal some sentiment information that acoustic features cannot reflect. The final AFF-ACRNN consists of the best combination of SBCNN and UB-BiLSTM and outperforms the state-of-the-art approach by 9.33% in binary classification on the MOSI dataset and by 8.75% on MOUD. The results are shown in Table 6.
   We have also run our model 1000 times on one audio clip of 10 s length, and the average time to get the sentiment classification result from the input is only 655.94 ms, thanks to our concentrated ASV extracted from AFF-ACRNN.

                          ACC(%)
  Model               MOSI     → MOUD
  State-of-the-art    60.31      59.99
  AFF-ACRNN           69.64      57.74

Table 6: Comparison with the state-of-the-art result (Poria et al. 2017). The right arrow means the model is trained and validated on MOSI and tested on MOUD.

Discussion
The above experimental results have already shown that the proposed method brings a great improvement in the performance of audio sentiment analysis. In order to get the best structure of our AFF-ACRNN model, we tested the two separate branches respectively, and compared the final AFF-ACRNN with traditional and state-of-the-art methods. Weighted accuracy, F1-score and Macro F1-score are used as metrics to evaluate the model's performance. In the UB-BiLSTM branch, a lot of experiments have shown that four types of heterogeneous traditional features trained by BiLSTM give the best result, with a weighted accuracy of 68.72% on MOSI. In the SBCNN branch, we carried out seven experiments to show that the ResNet152 used in SBCNN has the best result, for instance with a weighted accuracy of 65.42% on MOSI, due to its extreme depth and the helpful residual units used to prevent degradation. We selected the six best combinations of SBCNN and UB-BiLSTM and found that the best is ResNet152 used in SBCNN together with UB-BiLSTM, whose weighted accuracy is 69.64% on MOSI and which outperforms not only traditional classifiers like SVM, but also the state-of-the-art approach, by 9.33% on the MOSI dataset. The attention mechanism is used in both branches to subtly combine the heterogeneous acoustic features and choose the feature vectors that have the greatest impact on the sentiment classification. Furthermore, the experiment using MOSI as training and validation set and MOUD as test set also shows that our proposed model has strong generalization ability.

                      5 Conclusion

In this paper, we propose a novel utterance-based deep neural network model termed AFF-ACRNN, which has a parallel combination of CNN and LSTM based networks, to obtain representative features termed ASV that can maximally reflect the sentiment information of an utterance in an audio. We extract several traditional heterogeneous acoustic features with the Librosa toolkit, choose the four most representative features through a large number of experiments, and regard them as the input of the neural network. We obtain CASV and LASV from the CNN branch and the LSTM branch respectively, and finally merge the two branches to obtain the final ASV for the sentiment classification of each utterance. Besides, BiLSTM with an attention mechanism is used for feature fusion. The experimental results show that our model can recognize audio sentiment precisely and quickly, and demonstrate that our heterogeneous ASV is better than traditional acoustic features or vectors extracted from other deep learning models. Furthermore, experimental results indicate that the proposed model outperforms the state-of-the-art approach by 9.33% on the MOSI dataset. We have also tested our model on MOUD to show that it does not heavily depend on language types. In the future, we will combine feature engineering technologies to further discuss the fusion dimension of audio features, consider the fusion of different dimensions of different categories of features, and even apply them to multimodal sentiment analysis.
                        References

Pang B, Lee L. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2008, 2(1-2): 1-135.
Liu B. Sentiment Analysis: Mining Opinions, Sentiments, and Emotions. Cambridge University Press, 2015.
S. Poria, E. Cambria, R. Bajpai, and A. Hussain. A review of affective computing: From unimodal analysis to multimodal fusion. Information Fusion, vol. 37, pp. 98-125, 2017.
Kraus M. W. Voice-only communication enhances empathic accuracy. American Psychologist, 72(7): 644, 2017.
S. Zhang, T. Huang and W. Gao. Speech emotion recognition using deep convolutional neural network and discriminant temporal pyramid matching. IEEE Transactions on Multimedia, vol. PP, no. 99, pp. 1-1, 2017.
S. Ezzat, N. Gayar and M. M. Ghanem. Sentiment analysis of call centre audio conversations using text classification. International Journal of Computer Information Systems and Industrial Management Applications, vol. 4, pp. 619-627, 2012.
Van Den Oord A, Dieleman S, Zen H, et al. WaveNet: A generative model for raw audio. In SSW, 2016: 125.
Bertin-Mahieux T, and Ellis D P. Large-scale cover song recognition using hashed chroma landmarks. In 2011 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 117-120. IEEE, 2011.
Kaushik L, Sangwan A, Hansen J H L. Automatic audio sentiment extraction using keyword spotting. In Sixteenth Annual Conference of the International Speech Communication Association, 2015.
Mariel W C F, Mariyah S, Pramana S. Sentiment analysis: a comparison of deep learning neural network algorithm with SVM and naïve Bayes for Indonesian text. Journal of Physics: Conference Series, IOP Publishing, 2018, 971(1): 012049.
G. Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi, M. A. Nicolaou, B. Schuller, and S. Zafeiriou. Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 5200-5204.
Mirsamadi S, Barsoum E, and Zhang C. Automatic speech emotion recognition using recurrent neural networks with local attention. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2017.
Neumann M, and Vu N T. Attentive convolutional neural network based speech emotion recognition: A study on the impact of input features, signal length, and acted speech. arXiv preprint arXiv:1706.00612, 2017.
Wang Z-Q, and Tashev I. Learning utterance-level representations for speech emotion and age/gender recognition using deep neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2017.
Chen M, He X, Yang J, et al. 3-D convolutional recurrent neural networks with attention model for speech emotion recognition. IEEE Signal Processing Letters, 2018.
Poria S, et al. Context-dependent sentiment analysis in user-generated videos. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, 2017.
Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation, 1997, 9(8): 1735-1780.
Y. Le Cun, B. Boser, et al. Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems, 1990.
T. N. Sainath, O. Vinyals, A. Senior, and H. Sak. Convolutional, long short-term memory, fully connected deep neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2015, pp. 4580-4584.
D. Palaz, R. Collobert, et al. Analysis of CNN-based speech recognition system using raw speech as input. In Proceedings of Interspeech, 2015.
El Ayadi M, Kamel M S, Karray F. Survey on speech emotion recognition: Features, classification schemes, and databases. Pattern Recognition, 2011, 44(3): 572-587.
Cummins N, Amiriparian S, Hagerer G, et al. An image-based deep spectrum feature representation for the recognition of emotional speech. In Proceedings of the 2017 ACM on Multimedia Conference, ACM, 2017: 478-484.
McFee B, Raffel C, Liang D, et al. librosa: Audio and music signal analysis in Python. In Proceedings of the 14th Python in Science Conference, 2015: 18-25.
Bae S H, Choi I, Kim N S. Acoustic scene classification using parallel combination of LSTM and CNN. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016), 2016: 11-15.
Pérez-Rosas V, Mihalcea R, Morency L P. Utterance-level multimodal sentiment analysis. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2013, 1: 973-982.
LeCun Y, Bottou L, Bengio Y, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998, 86(11): 2278-2324.
Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
Zeiler M D, Fergus R. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, Springer, Cham, 2014: 818-833.
Dong C, Loy C C, He K, et al. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016, 38(2): 295-307.
Balamurugan B, Maghilnan S, Kumar M R. Source camera identification using SPN with PRNU estimation and enhancement. In Intelligent Computing and Control (I2C2), 2017 International Conference on, IEEE, 2017: 1-6.
Bakir C, Jarvis D S L. Institutional entrepreneurship and policy change. Policy and Society, 2018.
Zadeh A, Liang P P, Poria S, et al. Multi-attention recurrent network for human communication comprehension. arXiv preprint arXiv:1802.00923, 2018.