=Paper=
{{Paper
|id=Vol-2328/session3_paper3
|storemode=property
|title=Audio Sentiment Analysis by Heterogeneous Signal Features Learned from Utterance-Based Parallel Neural Network
|pdfUrl=https://ceur-ws.org/Vol-2328/3_2_paper_17.pdf
|volume=Vol-2328
|authors=Ziqian Luo,Hua Xu,Feiyang Chen
|dblpUrl=https://dblp.org/rec/conf/aaai/LuoXC19
}}
==Audio Sentiment Analysis by Heterogeneous Signal Features Learned from Utterance-Based Parallel Neural Network==
Ziqian Luo, Hua Xu, and Feiyang Chen. Tsinghua University, China. luoziqian@bupt.edu.cn

===Abstract===

Audio sentiment analysis is an increasingly popular research area which extends conventional text-based sentiment analysis to depend on the effectiveness of acoustic features extracted from speech. However, current progress on audio sentiment analysis mainly focuses on extracting homogeneous acoustic features or does not fuse heterogeneous features effectively. In this paper, we propose an utterance-based deep neural network model, which has a parallel combination of CNN- and LSTM-based networks, to obtain representative features termed the Audio Sentiment Vector (ASV) that can maximally reflect the sentiment information in an audio. Specifically, our model is trained with utterance-level labels, and the ASV is extracted and fused from two branches. In the CNN branch, spectrum graphs produced by the signals are fed as inputs, while in the LSTM branch, the inputs include spectral centroid, MFCC and other recognized traditional acoustic features extracted from dependent utterances in an audio. Besides, a BiLSTM with attention mechanism is used for feature fusion. Extensive experiments show that our model can recognize audio sentiment precisely and quickly, and demonstrate that our ASV is better than traditional acoustic features or vectors extracted from other deep learning models. Furthermore, experimental results indicate that the proposed model outperforms the state-of-the-art approach by 9.33% on the MOSI dataset.

===1 Introduction===

Sentiment analysis is a well-studied research area in Natural Language Processing (NLP) (Pang B et al. 2008); it is the computational study of people's opinions, sentiments, appraisals, and attitudes towards entities such as products, services and organizations (Liu B et al. 2015). Traditional sentiment analysis methods are mostly based on text. With the rapid development of communication technology, the abundance of smartphones and the rapid rise of social media, large amounts of data are uploaded by web users in the form of audio or video rather than text (S. Poria et al. 2017). Interestingly, a recent study shows that a voice-only modality seems best for human empathic accuracy, compared to video-only or audiovisual communication (Kraus et al. 2017). In fact, audio sentiment analysis is a difficult task due to the complexity of the audio signal. It is generally known that speech is the most convenient and natural medium for human communication; it not only carries implicit semantic information, but also contains rich affective information (S. Zhang et al. 2017). Therefore, audio sentiment analysis, which aims to correctly analyze the sentiment of the speaker from speech signals, has drawn a great deal of attention from researchers.

In recent years, three main methods have been used for audio sentiment analysis. The first utilizes automatic speech recognition (ASR) technology to convert speech into text, followed by conventional text-based sentiment detection systems (S. Ezzat et al. 2012). The second adopts a generative model operating directly on the raw audio waveform (Van Den Oord A et al. 2016). The third focuses on extracting signal features from the raw audio files (Bertin-Mahieux et al. 2011), which captures the tonal content well and has been proved to be more effective than raw audio spectrum descriptors such as Mel-frequency cepstral coefficients (MFCC).

However, converting speech into text (recognizing each word said by the person in an audio, mapping the words into word embeddings and applying NLP techniques such as TF-IDF and bag-of-words) does not always give accurate results, because sentiment detection accuracy depends on being able to reliably detect a very focused vocabulary in the spoken comments (Kaushik L et al. 2015). Furthermore, when the voice is transferred to text, some sentiment-related signal characteristics are lost, resulting in a decrease in the accuracy of sentiment classification.
As for extracting features from the raw audio files by hand and then feeding them into a support vector machine (SVM) classifier, such methods require a great deal of human work and are heavily dependent on language types.

Luckily, along with its success in many other application domains, deep learning has also become popular in audio sentiment analysis in recent years (Mariel W C F et al. 2018). More recently, (G. Trigeorgis et al. 2016) directly use the raw audio samples to train a convolutional recurrent neural network (CRNN) to predict continuous arousal/valence space. (Mirsamadi et al. 2017) study the use of deep learning to automatically discover emotionally relevant features from speech, proposing a novel strategy for feature pooling over time which uses local attention to focus on the regions of a speech signal that are more emotionally salient. (Neumann et al. 2017) use an attentive convolutional neural network with a multi-view learning objective function and achieve state-of-the-art results on the improvised speech data of IEMOCAP. (Wang et al. 2017) propose to use deep neural networks (DNN) to encode each utterance into a fixed-length vector by pooling the activations of the last hidden layer over time; the feature encoding process is designed to be jointly trained with the utterance-level classifier for better classification. (Chen et al. 2018) propose a 3-D attention-based convolutional recurrent neural network to learn discriminative features for speech emotion recognition, where the Mel-spectrogram with deltas and delta-deltas is creatively used as input. Still, most previous methods either consider only one single audio feature (Chen et al. 2018) or high-dimensional vectors from one homogeneous feature (Poria et al. 2017), and do not effectively extract and fuse audio features.

We believe the information extracted from a single utterance must have a dependency on its context. For example, a flash of loud expression may not indicate that a person has a strong emotion, since it may just be caused by a cough, while a continuously loud one is far more likely to indicate that the speaker has a strong emotion.

In this paper, based on a large number of experiments, we extract the features of each utterance in an audio through the Librosa toolkit, obtain the four most effective features representing sentiment information, and merge them by adopting a BiLSTM with attention mechanism. Moreover, we design a novel model called Audio Feature Fusion-Attention based CNN and RNN (AFF-ACRNN) for audio sentiment analysis. Spectrum graphs and selected traditional acoustic features are fed as inputs into two separate branches, and we obtain a new fused audio feature vector before the softmax layer, which we call the Audio Sentiment Vector (ASV). Finally, the output of the softmax layer is the sentiment class.

The major contributions of the paper are:
* We propose an effective AFF-ACRNN model for audio sentiment analysis, combining multiple traditional acoustic features and spectrum graphs to learn more comprehensive sentiment information in audio.
* Our model is language insensitive and pays more attention to acoustic features of the original audio than to words recognized from the audio.
* Experimental results indicate that the proposed method outperforms the state-of-the-art methods (Poria et al. 2017) on the Multimodal Corpus of Sentiment Intensity dataset (MOSI) and the Multimodal Opinion Utterances Dataset (MOUD).

The rest of the paper is organized as follows. In the following section, we review related work. In Section 3, we present our methodology in detail. In Section 4, experiments and results are presented, and the conclusion follows in Section 5.
===2 Related Work===

Current state-of-the-art methods for audio sentiment analysis are mostly based on deep neural networks. In this section, we briefly present the advances on the audio sentiment analysis task obtained by utilizing deep learning, and then we summarize the progress on extracting audio feature representations.

====Long Short-Term Memory (LSTM)====

It has been demonstrated that LSTM (Hochreiter S et al. 1997) is well suited to making predictions based on time series data, by utilizing a cell to remember values over arbitrary time intervals and three gates (input gate i, output gate o, forget gate f) to regulate the flow of information into and out of the cell, which can be described as follows:

: f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
: i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
: o_t = σ(W_o · [h_{t-1}, x_t] + b_o)

where h_t = o_t ∗ tanh(C_t) is the output of the cell and x_t is the input of the current cell. Besides, the current cell state C_t is updated by:

: C̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)
: C_t = f_t ∗ C_{t-1} + i_t ∗ C̃_t

where C_{t-1} stands for the previous cell state.

One of the most effective variants of LSTM is the bidirectional LSTM (BiLSTM). Each input sequence is fed into both a forward and a backward LSTM layer, and a hidden layer receives its input by joining the forward and backward LSTM layers.
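For clarity, the gate equations above can be written out directly in code. The following is a minimal NumPy sketch of a single LSTM step; the shapes, the sigmoid helper and the toy initialization are illustrative choices of ours, not taken from the paper.

<syntaxhighlight lang="python">
# Minimal sketch of one LSTM step following the equations f_t, i_t, o_t, C~_t, C_t, h_t.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, W_i, W_o, W_c, b_f, b_i, b_o, b_c):
    """One LSTM time step; all weight matrices act on the concatenation [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)             # forget gate
    i_t = sigmoid(W_i @ z + b_i)             # input gate
    o_t = sigmoid(W_o @ z + b_o)             # output gate
    C_tilde = np.tanh(W_c @ z + b_c)         # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde       # updated cell state
    h_t = o_t * np.tanh(C_t)                 # hidden state / output
    return h_t, C_t

# Toy usage with hidden size 4 and input size 3 (illustrative only).
rng = np.random.default_rng(0)
H, D = 4, 3
W_f, W_i, W_o, W_c = (rng.standard_normal((H, H + D)) * 0.1 for _ in range(4))
b = np.zeros(H)
h0, C0 = np.zeros(H), np.zeros(H)
h1, C1 = lstm_step(rng.standard_normal(D), h0, C0, W_f, W_i, W_o, W_c, b, b, b, b)
</syntaxhighlight>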
====Convolutional Neural Network (CNN)====

CNNs (Y. Le Cun et al. 1990) are well known for extracting features from an image by using convolutional kernels and pooling layers to emulate the response of an individual neuron to visual stimuli. Moreover, CNNs have been successfully used not only for computer vision but also for speech (T. N. Sainath et al. 2015). For speech recognition, CNNs have proved to be robust against noise compared to other deep learning models (D. Palaz et al. 2015).

====Audio Feature Representation and Extraction====

Researchers have found that pitch and energy related features play a key role in affect recognition (Poria S et al. 2017). Other features that have been used for feature extraction include formants, MFCC, root-mean-square energy, spectral centroid and tonal centroid features. During speech production there are several utterances, and for each utterance the audio signal can be divided into several segments. Global features are calculated by measuring several statistics of the local features, e.g., average, mean and deviation. Global features are the most commonly used features in the literature. They are fast to compute and, as they are fewer in number than local features, the overall speed of computation is enhanced (El Ayadi M et al. 2011). However, there are some drawbacks to calculating global features, as some of them are only useful for detecting affect of high arousal, e.g., anger and disgust. For lower arousal, global features are not that effective; for example, global features are less prominent for distinguishing between anger and joy. Global features also lack temporal information and the dependence between two segments in an utterance. In a recent study (Cummins N et al. 2017), a new acoustic feature representation, denoted deep spectrum features, was derived by feeding spectrum graphs through a very deep image classification CNN and forming a feature vector from the activation of the last fully connected layer. Librosa (McFee B et al. 2015) is an open-source Python package for music and audio analysis which is able to extract all the key features elaborated above.
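As a concrete illustration, the kind of per-utterance features discussed above can be pulled out with Librosa in a few lines. This is only a hedged sketch of such a pipeline: the file name, the choice to keep the native sampling rate, and the simple stacking of frame-level features are assumptions for illustration, not the authors' exact preprocessing.

<syntaxhighlight lang="python">
# Sketch: extract the four feature types highlighted later in the paper
# (MFCC, spectral centroid, chroma_stft, spectral contrast) with Librosa.
import numpy as np
import librosa

def utterance_features(path):
    y, sr = librosa.load(path, sr=None)  # keep the native sampling rate (assumption)
    feats = [
        librosa.feature.mfcc(y=y, sr=sr),
        librosa.feature.spectral_centroid(y=y, sr=sr),
        librosa.feature.chroma_stft(y=y, sr=sr),
        librosa.feature.spectral_contrast(y=y, sr=sr),
    ]
    # Stack per-frame features into one matrix (rows = feature bins, cols = frames).
    return np.vstack(feats)

# X = utterance_features("utterance_001.wav")  # hypothetical file; shape (n_bins, n_frames)
</syntaxhighlight>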
===3 Methodology===

In this section, we describe the proposed AFF-ACRNN model for audio sentiment analysis in detail. We first give an overview of the whole neural network architecture. After that, the two separate branches of AFF-ACRNN are explained in detail. Finally, we describe the fusion mechanism used in our model.

====Model: AFF-ACRNN====

'''Audio Sentiment Vector (ASV) from Audio Feature Fusion (AFF).''' We concentrate on a model that has two parallel branches, the utterance-based BiLSTM branch (UB-BiLSTM) and the spectrum-based CNN branch (SBCNN), whose core mechanisms are based on LSTM and CNN. One branch of the proposed model uses the BiLSTM to extract temporal information between adjacent utterances, while the other uses a renowned CNN-based network to extract features from the spectrum graph that a sequence model cannot capture. Furthermore, the audio feature vector of each utterance is the input of the proposed network based on Audio Feature Fusion (AFF); we obtain a new fused audio feature vector before the softmax layer, which we call the Audio Sentiment Vector (ASV). Finally, the output of the softmax layer produces our final sentiment classification results, as shown in Figure 1.

[Figure 1: Overview of AFF-ACRNN Model]

'''LSTM Layers.''' The hidden layers of an LSTM have self-recurrent weights. These enable the cell in the memory block to retain previous information (Bae et al. 2016). Firstly, we separate the different videos and take three continuous utterances (e.g. u1, u2, u3) of one video at a time. For each utterance (e.g. u1), we extract its internal acoustic features through the Librosa toolkit, say f11, f12, ..., f1n, which are then processed by two layers of BiLSTM in AFF1 to obtain features extracted from the traditional acoustic features. Thus the three utterances correspond to three more efficient and representative vectors v1, v2, v3, which serve as the inputs to the BiLSTM in AFF2. AFF2 effectively combines the contextual information between adjacent utterances and then, through the attention mechanism, subtly identifies the utterance that has the greatest impact on the final sentiment classification. Finally, after the dropout layer, a more representative LASV extracted by our LSTM framework is obtained before the softmax layer, as shown in Figure 2. The process is described in the LSTM branch procedure in Algorithm 1.

[Figure 2: Overview of Our UB-BiLSTM Model]

Algorithm 1: Related procedures
 procedure LSTM-Branch
   for i in [0, n] do
     f_i = getAudioFeature(u_i)
     ASV_i = getASV(f_i)
   end for
   for i in [0, M] do            // M is the number of videos
     input_i = getTopUtter(v_i)
     uf_i = getUtterFeature(input_i)
   end for
   shuffle(v)
 end procedure
 procedure CNN-Branch
   for i in [0, n] do
     x_i ← getSpectrogramImage(u_i)
     c_i ← CNNModel(x_i)
     l_i ← BiLSTM(c_i)
   end for
 end procedure
 procedure FindCorrespondingLabel
   for i in [0, 2199] do
     rename(u_i)                      // for better order in sorting
     NameAndLabel = createIndex(u_i)  // a dictionary [utterance name: label]
   end for
   Label_x = NameAndLabel(u_x)
 end procedure
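To make the UB-BiLSTM branch concrete, the following hedged Keras sketch stacks two BiLSTM layers over a per-utterance feature matrix and ends in a small classification head. The layer widths follow the dimensions reported later in the experiment settings (BiLSTM outputs of 128 and 32, Dense layers of 200 and 2, a 256 × 33 feature matrix per utterance); the dropout rate, activation choices, the time-major orientation of the input, and the omission of the AFF2 context/attention stage are simplifying assumptions.

<syntaxhighlight lang="python">
# Hedged sketch of the per-utterance BiLSTM encoder (AFF1) plus classification head.
from tensorflow.keras import layers, models

N_FRAMES, N_FEATS = 33, 256  # paper reports a 256 x 33 matrix; treating 33 as time is our assumption

inputs = layers.Input(shape=(N_FRAMES, N_FEATS))
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(inputs)  # "128" read as units per direction
x = layers.Bidirectional(layers.LSTM(32))(x)        # utterance vector (LASV precursor)
x = layers.Dropout(0.5)(x)                          # dropout rate is an assumption
x = layers.Dense(200, activation="tanh")(x)         # activation is an assumption
outputs = layers.Dense(2, activation="softmax")(x)  # binary sentiment head

ub_bilstm = models.Model(inputs, outputs)
ub_bilstm.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
ub_bilstm.summary()
</syntaxhighlight>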
'''CNN Layers.''' Similar to the UB-BiLSTM model proposed above, we extract the spectrum graph of each utterance through the Librosa toolkit and use it as the input of our CNN branch. After a lot of experiments, we found that the audio feature vector learned by the ResNet152 network structure has the best effect on the final sentiment classification, so we choose the ResNet model in this branch. The convolutional layer performs 2-dimensional convolution between the spectrum graph and predefined linear filters. To enable the network to extract complementary features and learn the characteristics of the input spectrum graph, a number of filters with different functions are used. A more refined audio feature vector is obtained through the deep convolutional neural network and is then put into the BiLSTM layer to learn related sentiment information between adjacent utterances. Finally, before the softmax layer, we get another effective vector, CASV, extracted by our CNN framework, as shown in Figure 3. The process is described in the CNN branch procedure in Algorithm 1.

[Figure 3: Overview of Our ResNet152 CNN Model]

'''Fusion Layers.''' Through the LSTM and CNN branches proposed above, we can extract two refined audio sentiment vectors, LASV and CASV, for each utterance. We use these two kinds of vectors in parallel as the input of the BiLSTM in the AFF-ACRNN model. While effectively learning the relevant sentiment information of adjacent utterances, we extract the Audio Sentiment Vector (ASV) that has the greatest influence on the sentiment classification among the three utterances through the attention mechanism. Finally, the sentiment classification result is obtained by the softmax layer. In (J. Donahue et al. 2015), the long-term recurrent convolutional network (LRCN) model was proposed for visual recognition. LRCN is a consecutive structure of CNN and LSTM: it processes the variable-length input with a CNN, whose outputs are fed into an LSTM network, which finally predicts the class of the input. In (T. N. Sainath et al. 2015), a cascade structure was used for voice search. Compared to the methods mentioned above, the proposed network forms a parallel structure in which the LSTM and CNN accept different inputs separately. Therefore, the Audio Sentiment Vector (ASV) can be extracted more comprehensively, and a better classification result can be obtained.

====Feature Fusion Based on Attention Mechanism====

Inspired by human visual attention, the attention mechanism was proposed by (Bahdanau et al. 2015) in machine translation, where it is introduced into the encoder-decoder framework to select the reference words in the source language for words in the target language. We use the attention mechanism to preserve the intermediate outputs of the input sequence by retaining the LSTM encoder, and then a model is trained to selectively learn these inputs and to correlate the output sequences with the model output. Specifically, when we fuse the features, each phoneme of the output sequence is associated with specific frames in the input speech sequence, so that the feature representation that has the greatest influence on the final sentiment classification can be obtained, and a fused audio feature vector is finally produced. At the same time, the attention mechanism behaves like a regulator, since it can judge the importance of the contributions of adjacent relevant utterances for classifying the target utterance. Indeed, it is very hard to tell the sentiment of a single utterance without considering its contextual information; however, one will also make a wrong estimation if contextual information is overly emphasized. More specifically, in Figure 2, let A_x be the x-th attention network for utterance U_x, with corresponding attention weight vector α_x and weighted hidden representation R_x. We have:

: P_x = tanh(W_h[x] · H)
: α_x = softmax(w[x]^T · P_x)
: R_x = H · α_x^T

The final representation for the x-th utterance is:

: h_x* = tanh(W_m[x] · R_x + W_n[x] · h_x)

where W_m[x] and W_n[x] are weights learned during training.
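Written out in code, the fusion equations above amount to a softmax-weighted sum of the BiLSTM hidden states followed by a combination with the target utterance's own state. The NumPy sketch below is illustrative only: the matrix shapes (hidden states stored as columns of H) and the toy dimensions are our assumptions.

<syntaxhighlight lang="python">
# Sketch of the attention-based fusion: P_x, alpha_x, R_x, h_x*.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_fusion(H, h_x, W_h, w, W_m, W_n):
    """H: (d, T) hidden states of neighbouring utterances; h_x: (d,) target-utterance state."""
    P_x = np.tanh(W_h @ H)                   # (k, T)
    alpha_x = softmax(w @ P_x)               # (T,) attention weights
    R_x = H @ alpha_x                        # (d,) weighted hidden representation
    return np.tanh(W_m @ R_x + W_n @ h_x)    # fused representation h_x*

# Toy shapes: d = 64 hidden units, T = 3 context utterances, k = 32 (all illustrative).
rng = np.random.default_rng(1)
d, T, k = 64, 3, 32
h_star = attention_fusion(rng.standard_normal((d, T)), rng.standard_normal(d),
                          rng.standard_normal((k, d)), rng.standard_normal(k),
                          rng.standard_normal((d, d)), rng.standard_normal((d, d)))
</syntaxhighlight>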
===4 Experiments===

In this section, we exhibit our experimental results and the analysis of our proposed model. More specifically, our model is trained and evaluated on utterance-level audio from the CMU-MOSI dataset (A. Zadeh et al. 2016) and tested on MOUD (Prez-Rosas V et al. 2013).

====Experiment Setting====

'''Evaluation metrics.''' We evaluate our performance by weighted accuracy on 2-class, 5-class and 7-class classification:

: weighted accuracy = correct utterances / total utterances

Additionally, the F-score is used to evaluate 2-class classification:

: F_β = (1 + β²) · (precision · recall) / (β² · precision + recall)

where β represents the weight between precision and recall. During our evaluation we set β = 1, since we regard precision and recall as having the same weight; thus the F1-score is adopted. In 5-class and 7-class classification, we use the macro F1-score to evaluate the result:

: Macro F1 = (1/n) Σ_{i=1}^{n} F1_i

where n represents the number of classes and F1_i is the F1 score on the i-th category.
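For reference, these metrics correspond directly to standard scikit-learn calls; the toy label arrays below are purely illustrative and not taken from the paper.

<syntaxhighlight lang="python">
# Sketch of the evaluation metrics: overall accuracy, F1 (beta = 1), and macro F1.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# 2-class case: accuracy plus the standard F1 score.
y_true_2 = np.array([0, 1, 1, 0, 1])
y_pred_2 = np.array([0, 1, 0, 0, 1])
print(accuracy_score(y_true_2, y_pred_2), f1_score(y_true_2, y_pred_2))

# 5-/7-class case: accuracy plus macro-averaged F1 (mean of per-class F1 scores).
y_true_m = np.array([0, 1, 2, 3, 3, 4])
y_pred_m = np.array([0, 1, 2, 2, 3, 4])
print(accuracy_score(y_true_m, y_pred_m), f1_score(y_true_m, y_pred_m, average="macro"))
</syntaxhighlight>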
'''Dataset details.''' The CMU-MOSI dataset is rich in sentiment expressions, consisting of 2199 opinionated utterances in 93 videos by 89 speakers. The videos address a large array of topics, such as movies, books and products. The videos were crawled from YouTube and segmented into utterances, where each utterance is annotated with a score between -3 (strongly negative) and +3 (strongly positive) by five annotators. We took the average of these five annotations as the sentiment polarity and considered three conditions consisting of two classes (positive and negative), five classes (strong positive, positive, neutral, negative and strong negative) and seven classes (strong positive, positive, weak positive, neutral, weak negative, negative and strong negative). Our train/test splits of the dataset are completely disjoint with respect to speakers. In order to better compare with previous work, similar to (Poria et al. 2017), we divide the dataset approximately 7:3, so 1616 and 583 utterances are used for training and testing respectively. Furthermore, in order to verify that our model does not depend heavily on the language category, we tested it on the Spanish dataset MOUD, which contains product review videos provided by 55 persons; the reviews are in Spanish. The detailed dataset setup is given in Table 1.

{| class="wikitable"
|+ Table 1: Dataset setting
! Datasets !! Train utterances !! Train videos !! Test utterances !! Test videos
|-
| MOSI || 1616 || 65 || 583 || 28
|-
| MOSI → MOUD || 2199 || 93 || 437 || 79
|}

'''Network structure parameters.''' Our proposed architecture is implemented on the open-source deep learning framework Keras. More specifically, for the proposed UB-BiLSTM framework, after a lot of experiments we extracted the four most representative audio features of each utterance in a video through the Librosa toolkit, namely MFCC, spectral centroid, chroma stft and spectral contrast. In data processing, we put each utterance in one-to-one correspondence with its label and rename the utterance. Accordingly, we extend each utterance to a feature matrix of 256 × 33 dimensions. The output dimension of the first BiLSTM layer is 128 and that of the second is 32; the output dimension of the first Dense layer is 200 and that of the second is 2. For the proposed CNN framework, the input images are warped to a fixed size of 512 × 512. If the bounding boxes of the training samples are provided, we first crop the images and then warp them to the fixed size. To train the feature encoder, we follow the fine-tuning training strategy. In all experiments, our networks are trained with the Adam or SGD optimizer. In the LSTM branch, we set the initial learning rate to 0.0001 and train for 200 epochs with a batch size of 30. In the CNN branch, we set the initial learning rate to 0.001 and train ResNet-152 for 200 epochs with a batch size of 20.
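As an illustration of how such a spectrum-based encoder can be wired up, the hedged Keras sketch below runs a ResNet152 backbone over each utterance's 512 × 512 spectrogram image and a BiLSTM over the three adjacent utterance encodings. The ImageNet initialization, pooling head, dropout rate, sequence length and output layer are assumptions for illustration rather than the authors' exact configuration, and the spectrogram images are assumed to be produced beforehand (e.g. with Librosa).

<syntaxhighlight lang="python">
# Hedged sketch of an SBCNN-style branch: ResNet152 per spectrogram, BiLSTM over utterances.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet152

IMG = (512, 512, 3)  # input size follows the paper
SEQ = 3              # three consecutive utterances per video (assumption)

backbone = ResNet152(include_top=False, weights="imagenet", pooling="avg", input_shape=IMG)

seq_in = layers.Input(shape=(SEQ,) + IMG)
x = layers.TimeDistributed(backbone)(seq_in)   # per-utterance encodings, shape (SEQ, 2048)
x = layers.Bidirectional(layers.LSTM(32))(x)   # CASV-like summary over adjacent utterances
x = layers.Dropout(0.5)(x)                     # dropout rate is an assumption
out = layers.Dense(2, activation="softmax")(x)

sbcnn = models.Model(seq_in, out)
sbcnn.compile(optimizer="sgd", loss="categorical_crossentropy", metrics=["accuracy"])
</syntaxhighlight>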
====Performance Comparison====

'''Comparison of different feature combinations.''' Firstly, we considered seven types of acoustic features that can best represent an audio, mainly including MFCC, root-mean-square energy, spectral and tonal features. A large number of experiments were done in order to find the best feature combination for the different models on the three types of classification. The results are listed in Table 2. What's more, we also compared the performance of our LASV extracted from the LSTM-based and the BiLSTM-based fusion models. As Table 2 shows, the LASV extracted from the BiLSTM-based model performs better, since the acoustic information that follows may also have an impact on the acoustic information that precedes it. It can be seen that the best number of features in a combination is four, and those four features are MFCC, spectral centroid, spectral contrast and chroma stft. This means the remaining features, such as root-mean-square energy and tonal centroid, may introduce noise or misleading information into our sentiment analysis, since using all seven types of features does not give the best result.

{| class="wikitable"
|+ Table 2: Comparison of different feature combinations (accuracy, %)
! Feature combination !! Model !! 2-class !! 5-class !! 7-class
|-
| Single type || LSTM || 55.12 || 23.64 || 16.99
|-
| Single type || BiLSTM || 55.98 || 23.75 || 17.24
|-
| Two types || LSTM || 62.26 || 28.23 || 21.54
|-
| Two types || BiLSTM || 63.76 || 29.77 || 22.92
|-
| Three types || LSTM || 66.36 || 32.98 || 24.66
|-
| Three types || BiLSTM || 67.02 || 33.75 || 25.80
|-
| Four types || LSTM || 68.23 || 33.15 || 26.27
|-
| Four types || BiLSTM || 68.72 || 34.27 || 26.82
|-
| Five types || LSTM || 67.86 || 31.29 || 25.79
|-
| Five types || BiLSTM || 67.97 || 32.66 || 26.01
|-
| Six types || LSTM || 67.88 || 32.23 || 26.07
|-
| Six types || BiLSTM || 68.61 || 33.97 || 26.78
|-
| Seven types || LSTM || 68.01 || 33.06 || 25.99
|-
| Seven types || BiLSTM || 68.67 || 34.18 || 26.12
|}

'''Comparison of several renowned CNN-based models.''' We compared the performance of the CASV extracted from the spectral map with several popular CNN models and their variants: LeNet (LeCun Y et al. 1998), AlexNet (Krizhevsky A et al. 2012), VGG16 (Simonyan K et al. 2014), ZFNet (Zeiler M D et al. 2014) and ResNet (He K et al. 2016). The results are listed in Table 3. As the neural network goes deeper, more representative features can be obtained from the spectrum graph, which is why ResNet152 has the best performance. It benefits from the residual units, which guarantee that the network will not degrade as it goes deeper.

{| class="wikitable"
|+ Table 3: Comparison of SBCNN with different structures
! Methods !! 2-class Acc(%) !! 2-class F1 !! 5-class Acc(%) !! 5-class Macro F1 !! 7-class Acc(%) !! 7-class Macro F1
|-
| LeNet || 56.75 || 55.62 || 23.67 || 21.87 || 15.63 || 15.12
|-
| AlexNet || 58.71 || 57.88 || 26.43 || 23.19 || 19.21 || 18.79
|-
| VGG16 || 57.88 || 55.97 || 27.37 || 25.78 || 17.34 || 16.25
|-
| ZFNet || 55.37 || 53.12 || 21.90 || 21.38 || 12.82 || 11.80
|-
| ResNet18 || 58.94 || 56.79 || 25.26 || 24.63 || 18.35 || 17.89
|-
| ResNet50 || 62.52 || 61.21 || 28.13 || 27.04 || 20.21 || 20.01
|-
| ResNet152 || 65.42 || 64.86 || 28.78 || 28.08 || 21.56 || 20.57
|}

'''Comparison of different combinations of SBCNN and UB-BiLSTM.''' At last, we performed fusion experiments between the best SBCNNs and UB-BiLSTM or UB-LSTM. More precisely, we chose the three best SBCNNs, namely ResNet18, ResNet50 and ResNet152, to combine with the two kinds of utterance-dependent LSTM. The best combination is UB-BiLSTM with ResNet152. The final results are shown in Table 4.

{| class="wikitable"
|+ Table 4: Comparison of different combinations of SBCNN and UB-BiLSTM
! Methods !! 2-class Acc(%) !! 2-class F1 !! 5-class Acc(%) !! 5-class Macro F1 !! 7-class Acc(%) !! 7-class Macro F1
|-
| UB-LSTM + Res18 || 67.19 || 66.37 || 33.83 || 31.97 || 26.78 || 25.83
|-
| UB-LSTM + Res50 || 67.83 || 66.69 || 34.21 || 33.78 || 27.75 || 26.41
|-
| UB-LSTM + Res152 || 68.64 || 67.94 || 35.87 || 34.11 || 28.15 || 27.03
|-
| UB-BiLSTM + Res18 || 68.26 || 66.25 || 35.43 || 33.52 || 27.63 || 26.09
|-
| UB-BiLSTM + Res50 || 69.18 || 68.22 || 36.93 || 34.67 || 28.11 || 27.54
|-
| UB-BiLSTM + Res152 || 69.64 || 68.51 || 37.71 || 35.12 || 29.26 || 28.45
|}

'''Comparison with traditional methods.''' Apart from training deep neural networks, a number of traditional binary classifiers have been used for sentiment analysis. In order to demonstrate the effectiveness of our model, we first compare it with those traditional methods.

[I2C2, 2017] (Maghilnan S et al. 2017) introduced a text-based SVM and Naive Bayes model for binary sentiment classification; we therefore test their model on MOUD rather than MOSI to compare with our model, because MOUD has only two sentiment levels and each utterance has a text transcript in the dataset.

[BAJECE, 2018] (C. Bakir et al. 2018) In this paper, besides SVM, feature vectors such as Mel Frequency Discrete Wavelet Coefficients (MFDWC), MFCC and LPCC extracted from the original recorded signal are trained with classification algorithms such as Dynamic Time Warping (DTW), Hidden Markov Models (HMM) and Gaussian Mixture Models (GMM).

As shown in Table 5, we use weighted accuracy (ACC) and F1-score to evaluate the results. In particular, for the ACC on MOUD, our proposed model outperforms the best traditional model, the SVM classifier, by 11.51%.

{| class="wikitable"
|+ Table 5: Comparison with traditional methods on MOUD
! Model !! ACC(%) !! F1
|-
| SVM || 57.23 || 54.83
|-
| Naive Bayes || 55.72 || 52.14
|-
| GMM || 54.66 || 52.89
|-
| HMM || 56.63 || 55.84
|-
| DTW || 53.92 || 53.06
|-
| AFF-ACRNN || 68.74 || 66.37
|}

'''Comparison with the state of the art.''' (Poria et al. 2017) introduced an LSTM-based model to utilize the contextual information extracted from each utterance in a video. However, the input of their neural network model has only one type of feature, MFCC, which means all the utterance information is represented by one single feature. The acoustic information contained in that feature is somewhat duplicated and is bound to omit much sentiment information that might be hidden in many other useful features. What's worse, a single type of feature means the input vector has to be large enough to carry sufficient information before it is fed into the neural network. This undoubtedly increases the number of parameters to be trained in the network and is, at the same time, time consuming and computationally costly. Our proposed model not only extracts the sentiment vector from four types of traditionally recognized acoustic features while taking utterance dependency into account, but also extracts features from the spectrum graph, which may reveal sentiment information that acoustic features cannot reflect. The final AFF-ACRNN consists of the best combination of SBCNN and UB-BiLSTM and outperforms the state-of-the-art approach by 9.33% in binary classification on the MOSI dataset and by 8.75% on MOUD. The results are shown in Table 6.

We have also run our model 1000 times on one audio clip of length 10 s; the average time to obtain the sentiment classification result from the input is only 655.94 ms, thanks to our concentrated ASV extracted from AFF-ACRNN.

{| class="wikitable"
|+ Table 6: Comparison with the state-of-the-art result (Poria et al. 2017); MOSI → MOUD means the model is trained and validated on MOSI and tested on MOUD
! Model !! ACC(%) MOSI !! ACC(%) MOSI → MOUD
|-
| State-of-the-art || 60.31 || 59.99
|-
| AFF-ACRNN || 69.64 || 57.74
|}

====Discussion====

The above experimental results show that the proposed method brings a great improvement in the performance of audio sentiment analysis. In order to find the best structure for our AFF-ACRNN model, we tested the two separate branches respectively and compared the final AFF-ACRNN with traditional and state-of-the-art methods. Weighted accuracy, F1-score and macro F1-score are used as metrics to evaluate the model's performance. In the UB-BiLSTM branch, a large number of experiments showed that four types of heterogeneous traditional features trained by BiLSTM give the best result, with a weighted accuracy of 68.72% on MOSI. In the SBCNN branch, we carried out seven experiments showing that ResNet152 used in SBCNN gives the best result, with a weighted accuracy of 65.42% on MOSI, due to its extreme depth and the helpful residual units used to prevent degradation. We selected the six best combinations of SBCNN and UB-BiLSTM and found that the best is ResNet152 (SBCNN) combined with UB-BiLSTM, whose weighted accuracy is 69.64% on MOSI, outperforming not only traditional classifiers like SVM but also the state-of-the-art approach, by 9.33% on the MOSI dataset. The attention mechanism is used in both branches to subtly combine the heterogeneous acoustic features and choose the feature vectors that have the greatest impact on the sentiment classification. Furthermore, the experiment using MOSI as training and validation set and MOUD as test set also shows that our proposed model has strong generalization ability.
===5 Conclusion===

In this paper, we propose a novel utterance-based deep neural network model termed AFF-ACRNN, which has a parallel combination of CNN- and LSTM-based networks, to obtain representative features termed ASV that can maximally reflect the sentiment information in an utterance of an audio. We extract several traditional heterogeneous acoustic features with the Librosa toolkit, choose the four most representative features through a large number of experiments, and regard them as the input of the neural network. We obtain CASV and LASV from the CNN branch and the LSTM branch respectively, and finally merge the two branches to obtain the final ASV for sentiment classification of each utterance. Besides, a BiLSTM with attention mechanism is used for feature fusion. The experimental results show that our model can recognize audio sentiment precisely and quickly, and demonstrate that our heterogeneous ASV is better than traditional acoustic features or vectors extracted from other deep learning models. Furthermore, the experimental results indicate that the proposed model outperforms the state-of-the-art approach by 9.33% on the MOSI dataset. We have also tested our model on MOUD to show that it does not depend heavily on language types. In the future, we will combine feature engineering technologies to further investigate the fusion dimension of audio features, consider the fusion of different dimensions of different categories of features, and even apply them to multimodal sentiment analysis.
===References===

* Pang B, Lee L. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2008, 2(1-2): 1-135.
* Liu B. Sentiment Analysis: Mining Opinions, Sentiments, and Emotions. Cambridge University Press, 2015.
* S. Poria, E. Cambria, R. Bajpai, A. Hussain. A review of affective computing: From unimodal analysis to multimodal fusion. Information Fusion, 2017, 37: 98-125.
* Kraus M W. Voice-only communication enhances empathic accuracy. American Psychologist, 2017, 72(7): 644.
* S. Zhang, T. Huang, W. Gao. Speech Emotion Recognition Using Deep Convolutional Neural Network and Discriminant Temporal Pyramid Matching. IEEE Transactions on Multimedia, 2017.
* S. Ezzat, N. Gayar, M. M. Ghanem. Sentiment Analysis of Call Centre Audio Conversations using Text Classification. International Journal of Computer Information Systems and Industrial Management Applications, 2012, 4: 619-627.
* Van Den Oord A, Dieleman S, Zen H, et al. WaveNet: A generative model for raw audio. SSW, 2016: 125.
* Bertin-Mahieux T, Ellis D P. Large-scale cover song recognition using hashed chroma landmarks. 2011 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2011: 117-120.
* Kaushik L, Sangwan A, Hansen J H L. Automatic audio sentiment extraction using keyword spotting. Sixteenth Annual Conference of the International Speech Communication Association, 2015.
* Mariel W C F, Mariyah S, Pramana S. Sentiment analysis: a comparison of deep learning neural network algorithm with SVM and naive Bayes for Indonesian text. Journal of Physics: Conference Series, IOP Publishing, 2018, 971(1): 012049.
* G. Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi, M. A. Nicolaou, B. Schuller, S. Zafeiriou. Adieu Features? End-to-end Speech Emotion Recognition using a Deep Convolutional Recurrent Network. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016: 5200-5204.
* Mirsamadi S, Barsoum E, Zhang C. Automatic speech emotion recognition using recurrent neural networks with local attention. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.
* Neumann M, Vu N T. Attentive convolutional neural network based speech emotion recognition: A study on the impact of input features, signal length, and acted speech. arXiv preprint arXiv:1706.00612, 2017.
* Wang Z-Q, Tashev I. Learning utterance-level representations for speech emotion and age/gender recognition using deep neural networks. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.
* Chen M, He X, Yang J, et al. 3-D Convolutional Recurrent Neural Networks with Attention Model for Speech Emotion Recognition. IEEE Signal Processing Letters, 2018.
* Poria S, et al. Context-dependent sentiment analysis in user-generated videos. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017.
* Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation, 1997, 9(8): 1735-1780.
* Y. Le Cun, B. Boser, et al. Handwritten digit recognition with a back-propagation network. Advances in Neural Information Processing Systems, 1990.
* T. N. Sainath, O. Vinyals, A. Senior, H. Sak. Convolutional, long short-term memory, fully connected deep neural networks. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015: 4580-4584.
* D. Palaz, R. Collobert, et al. Analysis of CNN-based speech recognition system using raw speech as input. Proceedings of Interspeech, 2015.
* El Ayadi M, Kamel M S, Karray F. Survey on speech emotion recognition: Features, classification schemes, and databases. Pattern Recognition, 2011, 44(3): 572-587.
* Cummins N, Amiriparian S, Hagerer G, et al. An image-based deep spectrum feature representation for the recognition of emotional speech. Proceedings of the 2017 ACM on Multimedia Conference, ACM, 2017: 478-484.
* McFee B, Raffel C, Liang D, et al. librosa: Audio and music signal analysis in Python. Proceedings of the 14th Python in Science Conference, 2015: 18-25.
* Bae S H, Choi I, Kim N S. Acoustic scene classification using parallel combination of LSTM and CNN. Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016), 2016: 11-15.
* Prez-Rosas V, Mihalcea R, Morency L P. Utterance-level multimodal sentiment analysis. Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2013, 1: 973-982.
* LeCun Y, Bottou L, Bengio Y, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998, 86(11): 2278-2324.
* Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
* Zeiler M D, Fergus R. Visualizing and understanding convolutional networks. European Conference on Computer Vision, Springer, 2014: 818-833.
* Dong C, Loy C C, He K, et al. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016, 38(2): 295-307.
* Balamurugan B, Maghilnan S, Kumar M R. Source camera identification using SPN with PRNU estimation and enhancement. Intelligent Computing and Control (I2C2), 2017 International Conference on, IEEE, 2017: 1-6.
* Bakir C, Jarvis D S L. Institutional entrepreneurship and policy change. Policy and Society, 2018.
* Zadeh A, Liang P P, Poria S, et al. Multi-attention recurrent network for human communication comprehension. arXiv preprint arXiv:1802.00923, 2018.