      Complementary Fusion of Multi-Features and
        Multi-Modalities in Sentiment Analysis

            Feiyang Chen1, Ziqian Luo2, Yanyan Xu1*, and Dengfeng Ke3**
       1
         School of Computer Science and Technology, Beijing Forestry University
             2
               Language Technologies Institute, Carnegie Mellon University
     3
       National Laboratory of Pattern Recognition, Institute of Automation, Chinese
                                  Academy of Sciences
                       {fychen98.ai,luoziqian98}@gmail.com
                                xuyanyan@bjfu.edu.cn
                              dengfeng.ke@nlpr.ia.ac.cn



           Abstract. Sentiment analysis, mostly based on text, has been rapidly
           developing in the last decade and has attracted widespread attention
           in both academia and industry. However, information in the real world
           usually comes from multiple modalities, such as audio and text. Therefore,
           in this paper, based on audio and text, we consider the task of multimodal
           sentiment analysis and propose a novel fusion strategy including both
           multi-feature fusion and multi-modality fusion to improve the accuracy of
           audio-text sentiment analysis. We call it the DFF-ATMF (Deep Feature
           Fusion - Audio and Text Modality Fusion) model, which consists of
           two parallel branches, the audio modality based branch and the text
           modality based branch. Its core mechanisms are the fusion of multiple
           feature vectors and multiple modality attention. Experiments on the
           CMU-MOSI dataset and the recently released CMU-MOSEI dataset, both
           collected from YouTube for sentiment analysis, show the very competitive
           results of our DFF-ATMF model. Furthermore, by virtue of attention
           weight distribution heatmaps, we also demonstrate that the deep features
           learned by DFF-ATMF are complementary to each other and robust.
           Surprisingly, DFF-ATMF also achieves new state-of-the-art results on
           the IEMOCAP dataset, indicating that the proposed fusion strategy also
           has a good generalization ability for multimodal emotion recognition.

           Keywords: Multimodal Fusion · Multi-Feature Fusion · Multimodal
           Sentiment Analysis


1      Introduction

Sentiment analysis provides beneficial information to understand an individual’s
attitude, behavior, and preference [33]. Understanding and analyzing context-
related sentiment is an innate ability of a human being, which is also an important
*  Corresponding Author
** Corresponding Author



 Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License
 Attribution 4.0 International (CC BY 4.0). In: N. Chhaya, K. Jaidka, J. Healey, L. H. Ungar, A. Sinha
 (eds.): Proceedings of the 3rd Workshop of Affective Content Analysis, New York, USA, 07-
 FEB-2020, published at http://ceur-ws.org

distinction between a machine and a human being [11]. Therefore, sentiment
analysis becomes a crucial issue in the field of artificial intelligence to be explored.
    In recent years, sentiment analysis mainly focuses on textual data, and
consequently, text-based sentiment analysis becomes relatively mature [33]. With
the popularity of social media such as Facebook and YouTube, many users are
more inclined to express their views with audio or video [21]. Audio reviews
have become an increasingly important source of consumer information and are
attracting growing attention from companies, researchers, and consumers. They
also provide more natural experiences than traditional text comments because
they allow viewers to better perceive a commentator's sentiment, belief, and
intention through richer channels such as intonation [24]. The combination of multiple modalities
[32,24] brings significant advantages over using only text, including language
disambiguation (audio features can help eliminate ambiguous language meanings)
and language sparsity (audio features can bring additional emotional information).
Also, basic audio patterns can enhance links to the real world environment.
Actually, people often associate information with learning and interact with the
external environment through multiple modalities such as audio and text [1].
Consequently, multimodal learning becomes a new effective method for sentiment
analysis [17]. Its main challenge lies in inferring joint representations that can
process and connect information from multiple modalities [25].
    In this paper, we propose a novel fusion strategy, including the multi-feature
fusion and the multi-modality fusion, to improve the accuracy of multimodal
sentiment analysis based on audio and text. We call it the DFF-ATMF model, and
the learned features have strong complementarity and robustness. We conduct
experiments on the CMU Multimodal Opinion-level Sentiment Intensity (CMU-
MOSI) [31] dataset and the recently released CMU Multimodal Opinion Sentiment
and Emotion Intensity (CMU-MOSEI) [32] dataset, both collected from YouTube,
and make comparisons with other state-of-the-art models to show the very
competitive performance of our proposed model. It is worth mentioning that
DFF-ATMF also achieves the most advanced results on the IEMOCAP dataset in
the generalized verification experiments, meaning that it has a good generalization
ability for multimodal emotion recognition.
    The major contributions of this paper are as follows:
    – We propose the DFF-ATMF model for audio-text sentiment analysis, com-
      bining the multi-feature fusion with the multi-modality fusion to learn more
      comprehensive sentiment information.
    – The features learned by the DFF-ATMF model have good complementarity
      and excellent robustness, and even show an amazing performance when
      generalized to emotion recognition tasks.
    – Experimental results indicate that the proposed model outperforms the
      state-of-the-art models on the CMU-MOSI dataset [7] and the IEMOCAP
      dataset [25], and also has very competitive results on the recently released
      CMU-MOSEI dataset.
   The rest of this paper is structured as follows. In the following section, we
review related work. We exhibit the details of our proposed methodologies in

Section 3. Then, in Section 4, experimental results and further discussions are
presented. Finally, we conclude this paper in Section 5.


2     Related Work

2.1   Audio Sentiment Analysis

Audio features are usually extracted from the channel, excitation, and prosody
characteristics of audio samples. Among them, prosody parameters extracted
from segments, sub-segments, and hyper-segments are used for sentiment analysis
in [14]. In the past several years, classical machine learning algorithms, such as
Hidden Markov Model (HMM), Support Vector Machine (SVM), and decision
tree-based methods, have been utilized for audio sentiment analysis [27,26,13].
Recently, researchers have proposed various neural network-based architectures
to improve audio sentiment analysis. In 2014, an initial study employed deep
neural networks (DNNs) to extract high-level features from raw audio data
and demonstrated its effectiveness [8]. With the development of deep learning,
more complex neural-based architectures have been proposed. For example,
convolutional neural network (CNN)-based models have been trained on
spectrograms or on audio features derived from the original audio signals, such as
Mel Frequency Cepstral Coefficients (MFCCs) and Low-Level Descriptors (LLDs)
[2,20,19].


2.2   Text Sentiment Analysis

After decades of development, text sentiment analysis has become mature in
recent years [9]. The most commonly used classification techniques, such as SVM,
maximum entropy and naive Bayes, are based on the bag-of-words model, which
ignores word order; this can hamper the extraction of sentiment from the input,
because word order affects the sentiment being expressed [4]. Later research has
overcome this problem by using deep learning in sentiment
analysis [33]. For instance, a DNN model using word-level, character-level and
sentence-level representations is proposed for sentiment analysis [10]. In
order to better capture the temporal information, [5] proposes a novel neural
architecture, called Transformer-XL, which enables learning dependency beyond a
fixed length without disrupting temporal coherence. It consists of a segment-level
recurrence mechanism and a novel positional encoding scheme, not only capturing
longer-term dependency but also resolving the context fragmentation problem.


2.3   Multimodal Learning

Multimodal learning is an emerging field of research [1]. Learning from multiple
modalities needs to capture the correlation among these modalities. Data from
different modalities may have different predictive power and noise topology, and
the information of at least one of the modalities may be lost [1]. [17] presents a

novel feature fusion strategy that proceeds in a hierarchical manner for multimodal
sentiment analysis. [7] proposes a recurrent neural network-based multimodal
attention framework that leverages contextual information for utterance-level
sentiment prediction and achieves state-of-the-art results on the CMU-MOSI and
CMU-MOSEI datasets.


3    Proposed Methodology
In this section, we describe the proposed DFF-ATMF model for audio-text
sentiment analysis in detail. We first give an overview of the whole neural
network architecture, illustrating how to fuse audio and text modalities. After
that, two separate branches of DFF-ATMF are respectively explained to show
how to fuse the audio feature vector and the text feature vector. Finally, we
present the multimodal-attention mechanism used in the DFF-ATMF model.




[Figure 1 diagram: utterances U1, U2, U3 from the dataset feed the audio-modal and text-modal branches; each branch's Bi-LSTM produces hidden states ht and the sentiment vectors ASV and TSV, which are fused via attention weights at and fully-connected (FC) layers to produce the sentiment analysis output.]




Fig. 1. The overall architecture of the proposed DFF-ATMF framework. ht represents
the hidden state of Bi-LSTM at time t. e denotes the final audio sentiment vector. at
represents the attention weight, calculated from the dot product of the intermediate
vector e and the hidden state ht (see Equation 4). "FC" means a fully-connected layer.

3.1   The DFF-ATMF Framework

The overall architecture of the proposed DFF-ATMF framework is shown in
Figure 1. We fuse audio and text modalities in this framework through two parallel
branches, that is, the audio modality based branch and the text modality based
branch. DFF-ATMF’s core mechanisms are feature vector fusion and multimodal-
attention fusion. The audio modality branch uses Bi-LSTM [3] to extract audio
sentiment information between adjacent utterances (U1, U2, U3), while another
branch uses the same network architecture to extract text features. Furthermore,
the audio feature vector of each piece of utterance is used as the input of our
proposed neural network, which is based on the audio feature fusion, so we can
obtain a new feature vector before the softmax layer, called the audio sentiment
vector (ASV). The text sentiment vector (TSV) can be achieved similarly. Finally,
after the multimodal-attention fusion, the output of the softmax layer produces
final sentiment analysis results, as shown in Figure 1.


3.2   Audio Sentiment Vector (ASV) from Audio Feature Fusion
      (AFF)




[Figure 2 diagram: raw waveforms and acoustic features of utterances U1, U2, U3 pass through Bi-LSTM, attention and CNN layers, yielding the concatenated feature vector, i.e., the Audio Sentiment Vector (ASV).]




                    Fig. 2. The architecture of ASV from AFF.


   Based on the work in [15], in order to further explore the fusion of feature
vectors within the audio modality, we extend the experiments to different types of
audio features on the CMU-MOSI dataset, and the results are shown in Table 1.

    Feature                                                      Model                   Accuracy(%)
                                                                               2-class      5-class  7-class
    1 Chromagram from spectrogram (chroma_stft)                  LSTM          43.24        20.23    13.96
                                                                 BiLSTM        45.37        2.29     12.39
    2 Chroma Energy Normalized (chroma_cens)                     LSTM          42.98        20.87    13.31
                                                                 BiLSTM        45.85        20.53    13.76
    3 Mel-frequency cepstral coefficients (MFCC)                 LSTM          55.12        23.64    16.99
                                                                 BiLSTM        55.98        23.75    17.24
    4 Root-Mean-Square Energy (RMSE)                             LSTM          52.30        21.14    15.33
                                                                 BiLSTM        52.76        22.35    15.87
    5 Spectral_Centroid                                          LSTM          48.39        22.25    14.97
                                                                 BiLSTM        48.84        22.36    15.79
    6 Spectral_Contrast                                          LSTM          48.34        22.50    15.02
                                                                 BiLSTM        48.97        22.28    15.98
    7 Tonal Centroid Features (tonnetz)                          LSTM          53.78        22.67    15.83
                                                                 BiLSTM        54.24        21.87    16.01
Table 1. Comparison of different types of audio features on the CMU-MOSI dataset.




    In addition, we also implement an improved serial neural network of Bi-LSTM
and CNN [28], combined with the attention mechanism to learn the deep features
of different sound representations. The multi-feature fusion procedure is described
with the LSTM branch and the CNN branch respectively in Algorithm 1.
   The features are learned from raw waveforms and acoustic features, which
are complementary to each other. Therefore, audio sentiment analysis can be
improved by applying our feature fusion technique, that is, ASV from AFF,
whose architecture is shown in Figure 2.




[Figure 3 plot: histogram of raw audio waveform lengths; x-axis: audio vector length (0 to 600,000), y-axis: frequency.]

    Fig. 3. The raw audio waveform sampling distribution on the CMU-MOSI dataset.

Algorithm 1 The Multi-Feature Fusion Procedure
 1: procedure LSTM Branch
 2:    for i:[0,n] do
 3:        fi = getAudioFeature(ui ) // get the audio feature of the i-th utterance
 4:        ai = getASV (fi )
 5:    end for
 6:    for i:[0,M] do // M is the number of videos
 7:        inputi = GetTopUtter(vi )
 8:        ufi = getUtterFeature(inputi )
 9:    end for
10:    shuffle(v)
11:    Attention(Ai )
12:    Multi-Feature Fusion from the LSTM branch
13: end procedure
14: procedure CNN Branch
15:    for i:[0,n] do
16:        xi ← getSpectrogramImage(ui )
17:        ci ← CNNModel(xi )
18:    end for
19:    Attention(Ci )
20:    Multi-Feature Fusion from the CNN branch
21: end procedure
22: procedure Feature Fusion
23:    for i:[0,n] do
24:        Li = Attention(ai )
25:        Ci = Attention(ci )
26:    end for
27:    Attention(Li + Ci )
28:    Multi-Feature Fusion
29: end procedure



    In terms of raw audio waveforms, taking the CMU-MOSI dataset as an
example, we illustrate their sampling distribution in Figure 3. The inputs to the
network are raw audio waveforms sampled at 22 kHz. We also scale the waveforms
to be in the range [-256, 256], so that we do not need to subtract the mean value
as the data are naturally near zero already. To obtain a better sentiment analysis
accuracy, batch normalization (BN) and the ReLU function are employed after
each convolutional layer. Additionally, dropout regularization is also applied to
the proposed serial network architecture.
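    As an illustration only, the following sketch shows one plausible way to load and scale a raw waveform as described above; the 22 kHz sampling rate and the [-256, 256] range come from the text, while the use of Librosa for loading and the fixed padding length are our own assumptions.

import numpy as np
import librosa

def load_scaled_waveform(path, max_len=220500):
    # Load the raw waveform at 22 kHz, as described above.
    y, _ = librosa.load(path, sr=22050)
    # Scale amplitudes into [-256, 256]; the data are already roughly zero-centered,
    # so no mean subtraction is needed.
    y = y / (np.max(np.abs(y)) + 1e-8) * 256.0
    # Pad or truncate to a common length for batching (an assumption of this sketch).
    if len(y) < max_len:
        y = np.pad(y, (0, max_len - len(y)))
    return y[:max_len]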
    In terms of acoustic features, we extract them using the Librosa [18] toolkit and
obtain four effective kinds of features to represent sentiment information, which
are MFCCs, spectral_centroid, chroma_stft and spectral_contrast, respectively.
In particular, taking log-Mel spectrogram extraction [29] as an example, we

use 44.1 kHz without downsampling and extract the spectrograms with 64 Bin
Mel-scale. The window size for short-time Fourier transform is 1,024 with a hop
size of 512. The resulting Mel-spectrograms are next converted into log-scaled
ones and standardized by subtracting the mean value and dividing by the standard
deviation.
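    A minimal Librosa sketch of the acoustic feature extraction described above; the exact feature dimensions and any further post-processing used in the original implementation are not specified here, so library defaults are assumed.

import librosa

def extract_acoustic_features(path):
    # Load at 44.1 kHz without downsampling, matching the log-Mel setting above.
    y, sr = librosa.load(path, sr=44100)
    feats = {
        'mfcc': librosa.feature.mfcc(y=y, sr=sr),
        'spectral_centroid': librosa.feature.spectral_centroid(y=y, sr=sr),
        'chroma_stft': librosa.feature.chroma_stft(y=y, sr=sr),
        'spectral_contrast': librosa.feature.spectral_contrast(y=y, sr=sr),
    }
    # 64-bin Mel spectrogram with a 1,024-sample STFT window and a hop size of 512,
    # converted to log scale and standardized.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=512, n_mels=64)
    log_mel = librosa.power_to_db(mel)
    feats['log_mel'] = (log_mel - log_mel.mean()) / (log_mel.std() + 1e-8)
    return feats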
    Finally, we feed the feature vectors of raw waveforms and acoustic features into
our improved serial neural network of Bi-LSTM and CNN, combined with the
attention mechanism, to learn the deep features of different sound representations,
that is, the ASV.
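    To make the serial Bi-LSTM-CNN branch concrete, the following is a hedged Keras sketch of one possible ASV encoder: the 200-unit Bi-LSTM and the kernel sizes 3, 5, 7 and 9 follow Section 4.1, and the Bi-LSTM, attention, CNN ordering follows Figure 2, whereas the exact attention form, the pooling, and other layer details are assumptions rather than the authors' implementation.

import tensorflow as tf
from tensorflow.keras import layers

def build_asv_encoder(seq_len, feat_dim):
    # Serial Bi-LSTM -> attention -> CNN branch producing the audio sentiment vector.
    inputs = layers.Input(shape=(seq_len, feat_dim))
    h = layers.Bidirectional(layers.LSTM(200, return_sequences=True))(inputs)  # 2*200-dim states
    # Soft attention over time steps (assumed form of the attention layer).
    scores = layers.Dense(1, activation='tanh')(h)
    weights = layers.Softmax(axis=1)(scores)
    h_att = layers.Multiply()([h, weights])
    # Parallel convolutions with kernel sizes 3, 5, 7 and 9 (see Section 4.1),
    # each followed by batch normalization and ReLU as described above.
    pooled = []
    for k in (3, 5, 7, 9):
        c = layers.Conv1D(200, k, padding='same')(h_att)
        c = layers.BatchNormalization()(c)
        c = layers.Activation('relu')(c)
        pooled.append(layers.GlobalMaxPooling1D()(c))
    asv = layers.Concatenate()(pooled)   # concatenated feature vector, i.e., the ASV
    asv = layers.Dropout(0.35)(asv)      # dropout rate in the 0.3-0.4 range
    return tf.keras.Model(inputs, asv)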




[Figure 4 diagram: utterances U1, U2, U3 are embedded by BERT and pass through Bi-LSTM, attention and CNN layers, yielding the concatenated feature vector, i.e., the Text Sentiment Vector (TSV).]




                    Fig. 4. The architecture of TSV from TFF.




3.3   Text Sentiment Vector (TSV) from Text Feature Fusion (TFF)

The architecture of TSV from TFF is shown in Figure 4. BERT [6] is a new
language representation model, standing for Bidirectional Encoder Representations
from Transformers. Thus far, to the best of our knowledge, no studies have
leveraged BERT to pre-train text feature representations on multimodal datasets
such as CMU-MOSI. We therefore utilize BERT embeddings for CMU-MOSI.
Next, the Bi-LSTM layer takes the concatenated word embeddings and POS tags

as its inputs and it outputs each hidden state. Let hi be the output hidden state
at time i. Then its attention weight ai can be formulated by Equation 1.


\[
m_i = \tanh(h_i), \qquad
\hat{a}_i = w_i m_i + b_i, \qquad
a_i = \frac{\exp(\hat{a}_i)}{\sum_j \exp(\hat{a}_j)} \tag{1}
\]

   In Equation 1, wi mi + bi denotes a linear transformation of mi . Therefore,
the output representation ri is given by:

\[
r_i = a_i h_i \tag{2}
\]
    Based on such text representations, the sequence of features will be assigned
different attention weights. Thus, crucial information such as emotional words
can be identified more easily. The convolutional layer takes the text representation
ri as its input, and the output CNN feature maps are concatenated together.
Finally, text sentiment analysis can be improved by using TSV from TFF.
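    For comparison, a sketch of the text branch under the same assumptions follows; the 768-dimensional BERT embeddings are assumed to be pre-computed, and the attention follows Equations 1 and 2, while the kernel sizes and pooling are assumptions.

import tensorflow as tf
from tensorflow.keras import layers

def build_tsv_encoder(seq_len, emb_dim=768):
    # Inputs are pre-computed BERT word embeddings (in the original setup, concatenated
    # with POS-tag features); a single 768-dim embedding tensor is assumed here.
    emb = layers.Input(shape=(seq_len, emb_dim))
    h = layers.Bidirectional(layers.LSTM(200, return_sequences=True))(emb)
    m = layers.Activation('tanh')(h)        # m_i = tanh(h_i)
    a_hat = layers.Dense(1)(m)              # a_hat_i = w_i m_i + b_i
    a = layers.Softmax(axis=1)(a_hat)       # a_i, Equation 1
    r = layers.Multiply()([a, h])           # r_i = a_i h_i, Equation 2
    # Convolutional layer over the attended representation; feature maps are concatenated.
    pooled = [layers.GlobalMaxPooling1D()(
                  layers.Conv1D(200, k, padding='same', activation='relu')(r))
              for k in (3, 5, 7, 9)]
    tsv = layers.Concatenate()(pooled)      # text sentiment vector (TSV)
    return tf.keras.Model(emb, tsv)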


3.4   Audio and Text Modal Fusion with the Multimodal-Attention
      Mechanism

Inspired by human visual attention, the attention mechanism, proposed by [16]
for neural machine translation, is introduced into the encoder-decoder framework
to select reference words from the source language for the words in the target
language. Based on the existing attention mechanism, inspired by the work in [30],
we improve the multimodal-attention method on the basis of the multi-feature
fusion strategy, focusing on the fusion of comprehensive and complementary
sentiment information from audio and text. We leverage the multimodal-attention
mechanism to preserve the intermediate outputs of the input sequences by
retaining the Bi-LSTM encoder, and the model is then trained to selectively attend
to these intermediate outputs and to correlate them with the final output sequence.
    More specifically, ASV and TSV are firstly encoded with Audio-BiLSTM and
Text-BiLSTM using Equation 3.


\[
A_{t+1} = f_\theta(A_t, x_{t+1}), \quad
A_{t-1} = f_\theta(A_t, x_{t-1}), \quad
T_{t+1} = f_\theta(T_t, x_{t+1}), \quad
T_{t-1} = f_\theta(T_t, x_{t-1}) \tag{3}
\]

   In Equation 3, fθ is the LSTM function with the weight parameter θ. At+1,
At and At−1 represent the hidden states of the audio modality at times t + 1, t
and t − 1, respectively. xt+1 and xt−1 represent the features at times t + 1 and
t − 1, respectively. The text modality is similar, represented by T.


\[
a_t = \frac{\exp(e^{T} h_t)}{\sum_{t} \exp(e^{T} h_t)}, \qquad
t_t = \frac{\exp(e^{T} h'_t)}{\sum_{t} \exp(e^{T} h'_t)}, \qquad
Z_a = \sum_{t} a_t h_t, \qquad
Z_t = \sum_{t} t_t h'_t \tag{4}
\]

\[
\hat{y}_{i,j} = \mathrm{softmax}\big(\mathrm{concat}(\mathrm{concat}(Z_a, Z_t), A)^{T} M + b\big) \tag{5}
\]

    We then consider the final ASV e as an intermediate vector, as shown in
Figure 1. During each time step t, the dot product of the intermediate vector
e and the hidden state ht is evaluated to calculate a similarity score at. Using
this score as a weight, the weighted sum Σt at ht is calculated to generate a
multi-feature fusion vector Za. The multi-feature fusion vector of the text modality,
Zt, is calculated similarly. We thereby obtain two multi-feature fusion vectors, one
for the audio modality and one for the text modality, as shown in Equations 4
and 5. These multi-feature fusion vectors are respectively concatenated with the
final intermediate vectors of ASV and TSV, and then passed through the softmax
function to perform sentiment analysis, as shown in Equations 6 and 7.


\[
\mathrm{ASV} = g_\theta(e), \qquad
\mathrm{TSV} = g_{\theta'}(h_t) \tag{6}
\]

\[
\hat{y}_i = \mathrm{softmax}\big(\mathrm{concat}(\mathrm{ASV}, \mathrm{TSV})^{T} M + b\big) \tag{7}
\]
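    The following is a minimal sketch of the multimodal-attention fusion of Equations 4 and 5, assuming e is the final audio sentiment vector and h_audio and h_text are the hidden-state sequences of the two Bi-LSTM branches; the tensor shapes and the choice of the third concatenated term are assumptions, not the authors' exact implementation.

import tensorflow as tf

def multimodal_attention_fusion(e, h_audio, h_text, n_classes=2):
    # e: (batch, d) final audio sentiment vector; h_audio, h_text: (batch, T, d)
    # hidden-state sequences of the two Bi-LSTM encoders.
    score_a = tf.einsum('bd,btd->bt', e, h_audio)     # e^T h_t for the audio branch
    score_t = tf.einsum('bd,btd->bt', e, h_text)      # e^T h'_t for the text branch
    a = tf.nn.softmax(score_a, axis=-1)               # a_t in Equation 4
    t = tf.nn.softmax(score_t, axis=-1)               # t_t in Equation 4
    z_a = tf.einsum('bt,btd->bd', a, h_audio)         # Z_a = sum_t a_t h_t
    z_t = tf.einsum('bt,btd->bd', t, h_text)          # Z_t = sum_t t_t h'_t
    # Concatenate the fusion vectors with the intermediate vector (the term A in
    # Equation 5 is taken to be e here, which is an assumption of this sketch).
    fused = tf.concat([z_a, z_t, e], axis=-1)
    logits = tf.keras.layers.Dense(n_classes)(fused)  # the linear map M, b
    return tf.nn.softmax(logits, axis=-1)             # predicted distribution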


4    Empirical Evaluation

In this section, we firstly introduce the datasets, the evaluation metrics and the
network structure parameters used in our experiments, and then exhibit the
experimental results and make comparisons with other state-of-the-art models to
show the advantages of DFF-ATMF. Finally, further discussions are presented to
better understand the learning behavior of DFF-ATMF.


                Table 2. Datasets for training and test in our experiments.


                                      Training                            Test
       Dataset
                               #utterance      #video             #utterance     #video
       CMU-MOSI                  1,616            65                  583           28
       CMU-MOSEI                18,051         1,550                4,625          679
       IEMOCAP                   4,290           120                1,208           31



4.1     Experiment Settings

Datasets The datasets used for training and test are depicted in Table 2. The
CMU-MOSI dataset is rich in sentiment expression, consisting of 2,199 utterances,
that is, 93 videos by 89 speakers. The videos involve a large array of topics such
as movies, books, and other products. These videos were crawled from YouTube
and segmented into utterances where each utterance is annotated with scores
between −3 (strongly negative) and +3 (strongly positive) by five annotators.
We take the average of these five annotations as the sentiment polarity and then
consider only two classes, that is, “positive” and “negative”. Our training and test
splits of the dataset are completely disjoint with respect to speakers. In order
to better compare with the previous work, similar to [25], we divide the dataset
by 7:3 approximately, resulting in 1,616 and 583 utterances for training and test
respectively.
    The CMU-MOSEI dataset is an upgraded version of the CMU-MOSI dataset,
which has 3,229 videos, that is, 22,676 utterances, from more than 1,000 online
YouTube speakers. The training and test sets include 18,051 and 4,625 utterances
respectively, similar to [7].
    The IEMOCAP dataset was collected following theatrical theory in order to
simulate natural dyadic interactions between actors. We use categorical evalua-
tions with majority agreement and use only four emotional categories, that is,
“happy”, “sad”, “angry”, and “neutral” to compare the performance of our model
with other researches using the same categories [25].


Evaluation Metrics We evaluate the performance of our proposed model by
the weighted accuracy on 2-class or multi-class classifications.
\[
\text{weighted accuracy} = \frac{\text{number of correctly classified utterances}}{\text{total number of utterances}} \tag{8}
\]
      Additionally, F1-Score is used to evaluate 2-class classification.

\[
F_\beta = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}} \tag{9}
\]
   In Equation 9, β represents the weight between precision and recall. During
our evaluation process, we set β = 1 since we consider precision and recall to
have the same weight, and thus the F1-score is adopted.

   However, in emotion recognition, we use the Macro F1-score to evaluate the
performance.
\[
\text{Macro } F1 = \frac{1}{n} \sum_{i=1}^{n} F1_i \tag{10}
\]
   In Equation 10, n represents the number of categories and F1i is the F1 score
on the i-th category.
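   These metrics can be computed with scikit-learn; the sketch below mirrors Equations 8-10, with weighted accuracy reducing to plain utterance-level accuracy under equal class weights.

from sklearn.metrics import accuracy_score, f1_score

def evaluate(y_true, y_pred):
    # Equation 8: fraction of correctly classified utterances.
    acc = accuracy_score(y_true, y_pred)
    # Equation 9 with beta = 1, for the 2-class sentiment task.
    f1 = f1_score(y_true, y_pred, average='binary')
    # Equation 10: unweighted mean of the per-class F1 scores (emotion recognition).
    macro_f1 = f1_score(y_true, y_pred, average='macro')
    return acc, f1, macro_f1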


Network Structure Parameters Our proposed architecture is implemented
on the open-source deep learning framework TensorFlow. More specifically, for the
proposed audio and text multi-modality fusion framework, we use Bi-LSTM with
200 neurons, each followed by a dense layer consisting of 100 neurons. Utilizing the
dense layer, we project the input features of audio and text to the same dimension,
and next combine them with the multimodal-attention mechanism. We set the
dropout hyperparameter to be 0.4 for CMU-MOSI and 0.3 for CMU-MOSEI &
IEMOCAP as a measure of regularization. We also use the same dropout rates
for the Bi-LSTM layers. We employ the ReLU function in the dense layers and
softmax in the final classification layer. When training the network, we set the
batch size to 32, use the Adam optimizer with the cross-entropy loss function,
and train for 50 epochs. In data preprocessing, we establish a one-to-one
correspondence between each utterance and its label and rename the utterances
accordingly.
    The network structure of the proposed audio and text multi-feature fusion
framework is similar. Taking the audio multi-feature fusion framework as an
example, the hidden states of Bi-LSTM are of 2 ∗ 200-dim. The kernel sizes of
CNN are 3, 5, 7 and 9, respectively. The size of the feature map is 4 ∗ 200. The
dropout rate is randomly chosen between 0.3 and 0.4. The loss function used
is MAE, and the batch size is set to 16. We combine the training set and the
development set in our experiments. We use 90% for training and reserve 10% for
cross-validation. To train the feature encoder, we follow the fine-tuning training
strategy.
    In order to reduce randomness and improve credibility, we report the average
value over 3 runs for all experiments.
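    For reference, the following is a hedged sketch of the training configuration of the fusion framework described above (Adam, cross-entropy, batch size 32, 50 epochs); the specific loss name and label encoding are assumptions.

import tensorflow as tf

def train_fusion_model(model, x_train, y_train, x_val, y_val):
    # Training setup from this section: Adam optimizer, cross-entropy loss,
    # batch size 32, 50 epochs; reported results are averaged over 3 runs.
    model.compile(optimizer=tf.keras.optimizers.Adam(),
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model.fit(x_train, y_train,
                     validation_data=(x_val, y_val),
                     batch_size=32, epochs=50)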


4.2   Experimental Results

Comparison with Other Models

 – [22] proposes an LSTM-based model that enables utterances to capture
   contextual information from their surroundings in the video, thus aiding the
   classification.
 – [23] introduces attention-based networks to improve both context learning
   and dynamic feature fusion.
 – [32] proposes a novel multimodal fusion technique called Dynamic Fusion
   Graph (DFG).

                    CMU-MOSI           CMU-MOSEI                   IEMOCAP
      Model
                  Acc(%)  F1          Acc(%)  F1          Overall Acc(%) Macro F1
    [22]           79.30 80.12           -     -              75.60        76.31
    [23]           80.10 80.62           -     -                 -           -
    [32]           74.93 75.42         76.24 77.03               -           -
    [25]           76.60 76.93           -     -              78.20        78.79
     [7]           80.58 80.96         79.74 80.15               -           -
    [12]             -     -          84.08  88.89               -           -
 DFF-ATMF         80.98  81.26         77.15 78.33            81.37        82.29


              Table 3. Comparison with other state-of-the-art models.



 – [25] explores three different deep learning-based architectures, each improving
   upon the previous one, which is the state-of-the-art method on the
   IEMOCAP dataset at present.
 – [7] proposes a recurrent neural network-based multimodal-attention frame-
   work that leverages the contextual information, which is the state-of-the-
   art model on the CMU-MOSI dataset at present.
 – [12] proposes a new method of learning about the hidden representations
   between speech and text data using CNN, which is the state-of-the-art
   model on the CMU-MOSEI dataset at present.
   Table 3 shows the comparison of DFF-ATMF with other state-of-the-art
models. From Table 3, we can see that DFF-ATMF outperforms the other models
on the CMU-MOSI dataset and the IEMOCAP dataset. At the same time,
the experimental results on the CMU-MOSEI dataset also show DFF-ATMF’s
competitive performance.

Generalization Ability Analysis In order to verify the feature complemen-
tarity of our proposed fusion strategy and its robustness, we conduct experiments
on the IEMOCAP dataset to examine DFF-ATMF’s generalization capability.
Surprisingly, our proposed fusion strategy is effective on the IEMOCAP dataset
and outperforms the current state-of-the-art method in [25], as can be seen in
Table 3; the overall accuracy is improved by 3.17%. More detailed
experimental results on the IEMOCAP dataset are illustrated in Table 4.

4.3   Further Discussions
The above experimental results have already shown that DFF-ATMF can improve
the performance of audio-text sentiment analysis. We now analyze the attention
values to better understand the learning behavior of the proposed architecture.

                                                   IEMOCAP
             Emotion
                                         Acc(%)                   Macro F1
              happy                       74.41                    75.66
               sad                        73.62                    74.31
              angry                       78.57                    79.14
             neutral                      64.35                    65.72
             Overall                      81.37                    82.29

                Table 4. Experimental results on the IEMOCAP dataset.

    We take a video from the CMU-MOSI test set as an example. From the
attention heatmap in Figure 5, we can clearly see that by applying different
weights across contextual utterances and modalities, the model is able to predict
the labels of all the utterances correctly, which shows that our proposed fusion
strategy with multi-feature and multi-modality fusion is indeed effective and thus
exhibits good feature complementarity and robust generalization ability. However,
one open issue remains in the multi-feature fusion: when the raw audio waveform
is fused with the vector of acoustic features, the dimensions are inconsistent, and
if an existing dimensionality-reduction method is applied, some audio information
may be lost. We intend to address this problem from the perspective of mathematical
theory, such as the angle between two vectors.


[Figure 5 heatmap: attention weights over utterances u1-u3 for the two modality branches; color scale from 0.15 to 0.90.]

     Fig. 5. Softmax attention weights of an example from the CMU-MOSI test set.




[Figure 6 heatmap: attention weights over utterances u1-u3 for the two modality branches; color scale from 0.15 to 0.90.]

     Fig. 6. Softmax attention weights of an example from the CMU-MOSEI test set.


[Figure 7 heatmap: attention weights over utterances u1-u3 for the two modality branches; color scale from 0.15 to 0.75.]

    Fig. 7. Softmax attention weights of an example from the IEMOCAP test set.
[Figure 8 heatmap: attention weight comparison over utterances u1-u9 for the two modality branches; color scale from 0.2 to 0.8.]

Fig. 8. Softmax attention weight comparison of the CMU-MOSI, CMU-MOSEI, and
IEMOCAP test sets.



   Similarly, the attention weight distribution heatmaps on the CMU-MOSEI
and IEMOCAP test sets are shown in Figure 6 and 7, respectively. Furthermore,
we also give the softmax attention weight comparison of the CMU-MOSI, CMU-
MOSEI, and IEMOCAP test sets in Figure 8.



5    Conclusions


In this paper, we propose a novel fusion strategy, including multi-feature fusion
and multi-modality fusion, and the learned features have strong complementarity
and robustness, leading to the most advanced experimental results on the audio-
text multimodal sentiment analysis tasks. Experiments on both the CMU-MOSI
and CMU-MOSEI datasets show that our proposed model is very competitive.
More surprisingly, the experiments on the IEMOCAP dataset achieve new
state-of-the-art results, indicating that DFF-ATMF can also be generalized to
multimodal emotion recognition. In this paper, we did not consider the video
modality because we aim to use only the audio and text information derived from
videos, which, to the best of our knowledge, is the first such attempt in the
multimodal domain. In the future, we will consider more fusion strategies supported by basic
mathematical theories for multimodal sentiment analysis.

6    Acknowledgements

This research work was supported by the National Undergraduate Training
Programs for Innovation and Entrepreneurship (Grant No. 201810022064) and the
World-Class Discipline Construction and Characteristic Development Guidance
Funds for Beijing Forestry University (Grant No. 2019XKJS0310). We also thank
the anonymous reviewers for their thoughtful comments. Special thanks to the
support of AAAI 2020 and AffCon2020.


References

 1. Baltrušaitis, T., Ahuja, C., Morency, L.P.: Multimodal machine learning: A survey
    and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence
    41(2), 423–443 (2019)
 2. Bertero, D., Fung, P.: A first look into a convolutional neural network for speech
    emotion detection. In: 2017 IEEE international conference on acoustics, speech and
    signal processing (ICASSP). pp. 5115–5119. IEEE (2017)
 3. Cai, X., Hao, Z.: Multi-view and attention-based bi-lstm for weibo emotion recogni-
    tion. In: 2018 International Conference on Network, Communication, Computer
    Engineering (NCCE), Advances in Intelligent Systems Research, volume 147.
    pp. 772–779. Atlantis Press (2018). https://doi.org/10.2991/ncce-18.2018.127
 4. Chaturvedi, I., Cambria, E., Welsch, R.E., Herrera, F.: Distinguishing between facts
    and opinions for sentiment analysis: Survey and challenges. Information Fusion 44,
    65–77 (2018)
 5. Dai, Z., Yang, Z., Yang, Y., Cohen, W.W., Carbonell, J., Le, Q.V., Salakhutdinov,
    R.: Transformer-xl: Attentive language models beyond a fixed-length context. arXiv
    preprint arXiv:1901.02860 (2019)
 6. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidi-
    rectional transformers for language understanding. arXiv preprint arXiv:1810.04805
    (2018)
 7. Ghosal, D., Akhtar, M.S., Chauhan, D., Poria, S., Ekbal, A., Bhattacharyya, P.:
    Contextual inter-modal attention for multi-modal sentiment analysis. In: Proceed-
    ings of the 2018 Conference on Empirical Methods in Natural Language Processing.
    pp. 3454–3466 (2018)
 8. Han, K., Yu, D., Tashev, I.: Speech emotion recognition using deep neural net-
    work and extreme learning machine. In: The fifteenth annual conference of the
    international speech communication association (INTERSPEECH). pp. 223–227
    (2014)
 9. Hussein, D.M.E.D.M.: A survey on sentiment analysis challenges. Journal of King
    Saud University-Engineering Sciences 30(4), 330–338 (2018)
10. Jianqiang, Z., Xiaolin, G., Xuejun, Z.: Deep convolution neural networks for twitter
    sentiment analysis. IEEE Access 6, 23253–23260 (2018)
11. Kozinets, R.V., Scaraboto, D., Parmentier, M.A.: Evolving netnography: how brand
    auto-netnography, a netnographic sensibility, and more-than-human netnography
    can transform your research. JOURNAL OF MARKETING MANAGEMENT
    34(3-4), 231–242 (2018)

12. Lee, C.W., Song, K.Y., Jeong, J., Choi, W.Y.: Convolutional attention networks
    for multimodal emotion recognition from speech and text data. arXiv preprint
    arXiv:1805.06606 (2018)
13. Lee, C.C., Mower, E., Busso, C., Lee, S., Narayanan, S.: Emotion recognition using
    a hierarchical binary decision tree approach. Speech Communication 53(9-10),
    1162–1171 (2011)
14. Liu, Z.T., Wu, M., Cao, W.H., Mao, J.W., Xu, J.P., Tan, G.Z.: Speech emotion
    recognition based on feature selection and extreme learning machine decision tree.
    Neurocomputing 273, 271–280 (2018)
15. Luo, Z., Xu, H., Chen, F.: Audio sentiment analysis by heterogeneous signal features
    learned from utterance-based parallel neural network. In: Proceedings of the AAAI-
    19 Workshop on Affective Content Analysis, Honolulu, USA, AAAI (2019)
16. Luong, M.T., Pham, H., Manning, C.D.: Effective approaches to attention-based
    neural machine translation. arXiv preprint arXiv:1508.04025 (2015)
17. Majumder, N., Hazarika, D., Gelbukh, A., Cambria, E., Poria, S.: Multimodal
    sentiment analysis using hierarchical fusion with context modeling. Knowledge-
    Based Systems 161, 124–133 (2018)
18. McFee, B., Raffel, C., Liang, D., Ellis, D.P., McVicar, M., Battenberg, E., Nieto,
    O.: librosa: Audio and music signal analysis in python. In: Proceedings of the 14th
    python in science conference. pp. 18–25 (2015)
19. Minaee, S., Abdolrashidi, A.: Deep-emotion: Facial expression recognition using
    attentional convolutional network. arXiv preprint arXiv:1902.01019 (2019)
20. Parthasarathy, S., Tashev, I.: Convolutional neural network techniques for speech
    emotion recognition. In: 2018 16th International Workshop on Acoustic Signal
    Enhancement (IWAENC). pp. 121–125. IEEE (2018)
21. Poria, S., Cambria, E., Bajpai, R., Hussain, A.: A review of affective computing:
    From unimodal analysis to multimodal fusion. Information Fusion 37, 98–125 (2017)
22. Poria, S., Cambria, E., Hazarika, D., Majumder, N., Zadeh, A., Morency, L.P.:
    Context-dependent sentiment analysis in user-generated videos. In: Proceedings of
    the 55th Annual Meeting of the Association for Computational Linguistics (Volume
    1: Long Papers). vol. 1, pp. 873–883 (2017)
23. Poria, S., Cambria, E., Hazarika, D., Mazumder, N., Zadeh, A., Morency, L.P.:
    Multi-level multiple attentions for contextual multimodal sentiment analysis. In:
    2017 IEEE International Conference on Data Mining (ICDM). pp. 1033–1038. IEEE
    (2017)
24. Poria, S., Hussain, A., Cambria, E.: Combining textual clues with audio-visual
    information for multimodal sentiment analysis. In: Multimodal Sentiment Analysis,
    pp. 153–178. Springer (2018)
25. Poria, S., Majumder, N., Hazarika, D., Cambria, E., Gelbukh, A., Hussain, A.:
    Multimodal sentiment analysis: Addressing key issues and setting up the baselines.
    IEEE Intelligent Systems 33(6), 17–25 (2018)
26. Schuller, B., Rigoll, G., Lang, M.: Hidden markov model-based speech emotion
    recognition. In: 2003 IEEE International Conference on Acoustics, Speech, and
    Signal Processing, 2003. Proceedings.(ICASSP’03). vol. 2, pp. II–1. IEEE (2003)
27. Schuller, B., Rigoll, G., Lang, M.: Speech emotion recognition combining acoustic
    features and linguistic information in a hybrid support vector machine-belief network
    architecture. In: 2004 IEEE International Conference on Acoustics, Speech, and
    Signal Processing. vol. 1, pp. I–577. IEEE (2004)
28. Wu, C., Wu, F., Liu, J., Yuan, Z., Wu, S., Huang, Y.: Thu_ngn at semeval-2018
    task 1: Fine-grained tweet sentiment intensity analysis with attention cnn-lstm.

    In: Proceedings of The 12th International Workshop on Semantic Evaluation. pp.
    186–192 (2018)
29. Yin, Y., Shah, R.R., Zimmermann, R.: Learning and fusing multimodal deep
    features for acoustic scene categorization. In: 2018 ACM Multimedia Conference
    on Multimedia Conference. pp. 1892–1900. ACM (2018)
30. Yoon, S., Byun, S., Jung, K.: Multimodal speech emotion recognition using audio
    and text. In: 2018 IEEE Spoken Language Technology Workshop (SLT). pp. 112–118.
    IEEE (2018)
31. Zadeh, A., Zellers, R., Pincus, E., Morency, L.P.: Mosi: multimodal corpus of
    sentiment intensity and subjectivity analysis in online opinion videos. arXiv preprint
    arXiv:1606.06259 (2016)
32. Zadeh, A.B., Liang, P.P., Poria, S., Cambria, E., Morency, L.P.: Multimodal language
    analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph.
    In: Proceedings of the 56th Annual Meeting of the Association for Computational
    Linguistics (Volume 1: Long Papers). vol. 1, pp. 2236–2246 (2018)
33. Zhang, L., Wang, S., Liu, B.: Deep learning for sentiment analysis: A survey. Wiley
    Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8(4), e1253 (2018)