<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Complementary Fusion of Multi-Features and Multi-Modalities in Sentiment Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Feiyang Chen</string-name>
          <email>fychen98.ai@gmail.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ziqian Luo</string-name>
          <email>luoziqian98@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yanyan Xu</string-name>
          <email>xuyanyan@bjfu.edu.cn</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dengfeng Ke</string-name>
          <email>dengfeng.ke@nlpr.ia.ac.cn</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Language Technologies Institute, Carnegie Mellon University</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>School of Computer Science and Technology, Beijing Forestry University</institution>
        </aff>
      </contrib-group>
      <abstract>
<p>Sentiment analysis, mostly based on text, has been rapidly developing in the last decade and has attracted widespread attention in both academia and industry. However, information in the real world usually comes from multiple modalities, such as audio and text. Therefore, in this paper, based on audio and text, we consider the task of multimodal sentiment analysis and propose a novel fusion strategy, including both multi-feature fusion and multi-modality fusion, to improve the accuracy of audio-text sentiment analysis. We call it the DFF-ATMF (Deep Feature Fusion - Audio and Text Modality Fusion) model, which consists of two parallel branches, the audio modality based branch and the text modality based branch. Its core mechanisms are the fusion of multiple feature vectors and multiple modality attention. Experiments on the CMU-MOSI dataset and the recently released CMU-MOSEI dataset, both collected from YouTube for sentiment analysis, show the very competitive results of our DFF-ATMF model. Furthermore, by virtue of attention weight distribution heatmaps, we also demonstrate that the deep features learned by DFF-ATMF are complementary to each other and robust. Surprisingly, DFF-ATMF also achieves new state-of-the-art results on the IEMOCAP dataset, indicating that the proposed fusion strategy also has a good generalization ability for multimodal emotion recognition.</p>
      </abstract>
      <kwd-group>
        <kwd>Multimodal Fusion</kwd>
        <kwd>Multi-Feature Fusion</kwd>
        <kwd>Sentiment Analysis</kwd>
        <kwd>Multimodal</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Sentiment analysis provides beneficial information to understand an individual’s
attitude, behavior, and preference [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ]. Understanding and analyzing
context-related sentiment is an innate ability of a human being, which is also an important
distinction between a machine and a human being [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Therefore, sentiment
analysis becomes a crucial issue in the field of artificial intelligence to be explored.
      </p>
      <p>
        In recent years, sentiment analysis mainly focuses on textual data, and
consequently, text-based sentiment analysis becomes relatively mature [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ]. With
the popularity of social media such as Facebook and YouTube, many users are
more inclined to express their views with audio or video [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. Audio reviews
become an increasing source of consumer information and are increasingly being
followed with interest by companies, researchers and consumers. They also provide
more natural experiences than traditional text comments due to allowing viewers
to better perceive a commentator’s sentiment, belief, and intention through
richer channels such as intonation [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. The combination of multiple modalities
[
        <xref ref-type="bibr" rid="ref24 ref32">32,24</xref>
        ] brings significant advantages over using only text, including language
disambiguation (audio features can help eliminate ambiguous language meanings)
and language sparsity (audio features can bring additional emotional information).
Also, basic audio patterns can enhance links to the real world environment.
Actually, people often associate information with learning and interact with the
external environment through multiple modalities such as audio and text [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
Consequently, multimodal learning becomes a new effective method for sentiment
analysis [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. Its main challenge lies in inferring joint representations that can
process and connect information from multiple modalities [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ].
      </p>
      <p>
        In this paper, we propose a novel fusion strategy, including the multi-feature
fusion and the multi-modality fusion, to improve the accuracy of multimodal
sentiment analysis based on audio and text. We call it the DFF-ATMF model, and
the learned features have strong complementarity and robustness. We conduct
experiments on the CMU Multimodal Opinion-level Sentiment Intensity
(CMU-MOSI) [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ] dataset and the recently released CMU Multimodal Opinion Sentiment
and Emotion Intensity (CMU-MOSEI) [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ] dataset, both collected from YouTube,
and make comparisons with other state-of-the-art models to show the very
competitive performance of our proposed model. It is worth mentioning that
DFF-ATMF also achieves the most advanced results on the IEMOCAP dataset in
the generalized verification experiments, meaning that it has a good generalization
ability for multimodal emotion recognition.
      </p>
      <p>
        The major contributions of this paper are as follows:
– We propose the DFF-ATMF model for audio-text sentiment analysis,
combining the multi-feature fusion with the multi-modality fusion to learn more
comprehensive sentiment information.
– The features learned by the DFF-ATMF model have good complementarity
and excellent robustness, and even show an amazing performance when
generalized to emotion recognition tasks.
– Experimental results indicate that the proposed model outperforms the
state-of-the-art models on the CMU-MOSI dataset [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and the IEMOCAP
dataset [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ], and also has very competitive results on the recently released
CMU-MOSEI dataset.
      </p>
      <p>The rest of this paper is structured as follows. In the following section, we
review related work. We exhibit the details of our proposed methodologies in
Section 3. Then, in Section 4, experimental results and further discussions are
presented. Finally, we conclude this paper in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <sec id="sec-2-1">
        <title>Audio Sentiment Analysis</title>
        <p>
Audio features are usually extracted from the characteristics of audio samples’
channel, excitation, and prosody. Among them, prosody parameters extracted
from segments, sub-segments, and hyper-segments are used for sentiment analysis
in [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. In the past several years, classical machine learning algorithms, such as
Hidden Markov Model (HMM), Support Vector Machine (SVM), and decision
tree-based methods, have been utilized for audio sentiment analysis [
          <xref ref-type="bibr" rid="ref13 ref26 ref27">27,26,13</xref>
          ].
Recently, researchers have proposed various neural network-based architectures
to improve audio sentiment analysis. In 2014, an initial study employed deep
neural networks (DNNs) to extract high-level features from raw audio data
and demonstrated its effectiveness [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. With the development of deep learning,
more complex neural-based architectures have been proposed. For example,
convolutional neural network (CNN)-based models have been used to train
spectrograms or audio features derived from original audio signals such as Mel
Frequency Cepstral Coefficients (MFCCs) and Low-Level Descriptors (LLDs)
[
          <xref ref-type="bibr" rid="ref19 ref2 ref20">2,20,19</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Text Sentiment Analysis</title>
        <p>
          After decades of development, text sentiment analysis has become mature in
recent years [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. The most commonly used classification techniques, such as SVM,
maximum entropy, and naive Bayes, are based on the bag-of-words model, where the
sequence of words is ignored, which may result in inefficient extraction of sentiment
from the input because the order of words affects the expressed sentiment
[
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Later research has overcome this problem by using deep learning in sentiment
analysis [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ]. For instance, a kind of DNN model is proposed, using word-level,
character-level and sentence-level representations for sentiment analysis [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. In
order to better capture the temporal information, [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] proposes a novel neural
architecture, called Transformer-XL, which enables learning dependency beyond a
fixed length without disrupting temporal coherence. It consists of a segment-level
recurrence mechanism and a novel positional encoding scheme, not only capturing
longer-term dependency but also resolving the context fragmentation problem.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>Multimodal Learning</title>
        <p>
          Multimodal learning is an emerging field of research [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Learning from multiple
modalities needs to capture the correlation among these modalities. Data from
different modalities may have different predictive power and noise topology,
and the information of at least one of the modalities may possibly be lost [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] presents a
novel feature fusion strategy that proceeds in a hierarchical manner for multimodal
sentiment analysis. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] proposes a recurrent neural network-based multimodal
attention framework that leverages contextual information for utterance-level
sentiment prediction and shows a state-of-the-art model on the CMU-MOSI and
CMU-MOSEI datasets.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Proposed Methodology</title>
      <p>In this section, we describe the proposed DFF-ATMF model for audio-text
sentiment analysis in detail. We firstly introduce an overview of the whole neural
network architecture, illustrating how to fuse audio and text modalities. After
that, two separate branches of DFF-ATMF are respectively explained to show
how to fuse the audio feature vector and the text feature vector. Finally, we
present the multimodal-attention mechanism used in the DFF-ATMF model.</p>
      <p>Fig. 1. The overall DFF-ATMF architecture: the audio modality branch (ASV) and the text modality branch (TSV) each encode utterances (U1, U2, U3) with LSTM layers, followed by fully-connected (FC) layers and multimodal-attention fusion for sentiment analysis.</p>
      <sec id="sec-3-1">
        <title>The DFF-ATMF Framework</title>
        <p>
          The overall architecture of the proposed DFF-ATMF framework is shown in
Figure 1. We fuse audio and text modalities in this framework through two parallel
branches, that is, the audio modality based branch and the text modality based
branch. DFF-ATMF’s core mechanisms are feature vector fusion and
multimodal-attention fusion. The audio modality branch uses Bi-LSTM [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] to extract audio
sentiment information between adjacent utterances (U1, U2, U3), while another
branch uses the same network architecture to extract text features. Furthermore,
the audio feature vector of each piece of utterance is used as the input of our
proposed neural network, which is based on the audio feature fusion, so we can
obtain a new feature vector before the softmax layer, called the audio sentiment
vector (ASV). The text sentiment vector (TSV) can be achieved similarly. Finally,
after the multimodal-attention fusion, the output of the softmax layer produces
final sentiment analysis results, as shown in Figure 1.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Audio Sentiment Vector (ASV) from Audio Feature Fusion (AFF)</title>
        <p>Fig. 2. Audio Sentiment Vector (ASV) from Audio Feature Fusion (AFF): raw waveforms and acoustic features of utterances (U1, U2, U3) pass through CNN, attention, and Bi-LSTM layers to produce the concatenated ASV feature vector.</p>
        <p>
          Based on the work in [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], in order to further explore the fusion of feature
vectors within the audio modality, we extend the experiments on different types of
audio features on the CMU-MOSI dataset, and the results are shown in Table 1.
        </p>
        <p>Table 1 lists the audio feature types examined, each evaluated with both an LSTM and a Bi-LSTM model: (1) chromagram from spectrogram (chroma_stft); (2) chroma energy normalized (chroma_cens); (3) Mel-frequency cepstral coefficients (MFCC); (4) root-mean-square energy (RMSE); (5) spectral centroid; (6) spectral contrast; (7) tonal centroid features (tonnetz).</p>
        <sec id="sec-3-2-1">
          <p>
            We then feed the selected features into an improved serial neural network of Bi-LSTM
and CNN [
            <xref ref-type="bibr" rid="ref28">28</xref>
            ], combined with the attention mechanism to learn the deep features
of different sound representations. The multi-feature fusion procedure is described
with the LSTM branch and the CNN branch respectively in Algorithm 1.
          </p>
          <p>The features are learned from raw waveforms and acoustic features, which
are complementary to each other. Therefore, audio sentiment analysis can be
improved by applying our feature fusion technique, that is, ASV from AFF,
whose architecture is shown in Figure 2.</p>
          <p>Fig. 3. The raw audio waveform sampling distribution on the CMU-MOSI dataset.</p>
          <p>In terms of raw audio waveforms, taking the CMU-MOSI dataset as an
example, we illustrate their sampling distribution in Figure 3. The inputs to the
network are raw audio waveforms sampled at 22 kHz. We also scale the waveforms
to be in the range [-256, 256], so that we do not need to subtract the mean value
as the data are naturally near zero already. To obtain a better sentiment analysis
accuracy, batch normalization (BN) and the ReLU function are employed after
each convolutional layer. Additionally, dropout regularization is also applied to
the proposed serial network architecture.</p>
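          <p>As a minimal sketch (assuming waveforms arrive in Librosa's default [-1, 1] float range, which the text does not state), the scaling step above amounts to:</p>

```python
# Sketch of the raw-waveform preprocessing described above: waveforms sampled
# at 22 kHz are scaled into [-256, 256]; no mean subtraction is needed since
# the data are already centered near zero.
import numpy as np

def preprocess_waveform(y):
    """Scale a float waveform (assumed to lie in [-1, 1]) into [-256, 256]."""
    y = np.asarray(y, dtype=np.float32)
    peak = np.abs(y).max()
    return y / peak * 256.0 if peak > 0 else y
```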
          <p>
            In terms of acoustic features, we extract them using the Librosa [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ] toolkit and
obtain four effective kinds of features to represent sentiment information, which
are MFCCs, spectral_centroid, chroma_stft and spectral_contrast, respectively.
In particular, taking log-Mel spectrogram extraction [
            <xref ref-type="bibr" rid="ref29">29</xref>
            ] as an example, we
use 44.1 kHz without downsampling and extract the spectrograms with 64 Bin
Mel-scale. The window size for short-time Fourier transform is 1,024 with a hop
size of 512. The resulting Mel-spectrograms are next converted into log-scaled
ones and standardized by subtracting the mean value and dividing by the standard
deviation.
          </p>
          <p>Finally, we feed feature vectors of raw waveforms and acoustic features into
our improved serial neural network of Bi-LSTM and CNN, combining with the
attention mechanism to learn the deep features of different sound representations,
that is, ASV.</p>
          <p>Fig. 4. Text Sentiment Vector (TSV) from Text Feature Fusion (TFF): BERT embeddings of utterances (U1, U2, U3) pass through Bi-LSTM, attention, and CNN layers to produce the concatenated TSV feature vector.</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>Text Sentiment Vector (TSV) from Text Feature Fusion (TFF)</title>
        <p>
          The architecture of TSV from TFF is shown in Figure 4. BERT [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] is a new
language representation model, standing for Bidirectional Encoder
Representations from Transformers. Thus far, to the best of our knowledge, no studies
have leveraged BERT to pre-train text feature representations on a multimodal
dataset such as CMU-MOSI. We therefore utilize BERT embeddings for CMU-MOSI.
Next, the Bi-LSTM layer takes the concatenated word embeddings and POS tags
as its inputs and outputs the hidden states. Let h_i be the output hidden state
at time i. Then its attention weight a_i can be formulated by Equation 1:
        </p>
        <p>m_i = tanh(h_i),  â_i = w_i·m_i + b_i,  a_i = exp(â_i) / Σ_j exp(â_j) (1)</p>
        <p>In Equation 1, w_i·m_i + b_i denotes a linear transformation of m_i. Therefore,
the output representation r_i is given by Equation 2:</p>
        <p>r_i = a_i·h_i (2)</p>
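        <p>A minimal NumPy sketch of this attention weighting, with randomly initialized stand-ins for the learned parameters w_i and b_i:</p>

```python
# NumPy sketch of the word-level attention: m_i = tanh(h_i),
# a^_i = w*m_i + b, a_i = softmax over the sequence, r_i = a_i * h_i.
# The learned parameters w and b are random stand-ins for illustration.
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 8                      # sequence length, hidden size (illustrative)
h = rng.normal(size=(T, d))      # Bi-LSTM hidden states h_i

w = rng.normal(size=d)           # stand-in for the learned scoring vector
b = 0.0                          # stand-in for the learned bias

m = np.tanh(h)                               # m_i = tanh(h_i)
a_hat = m @ w + b                            # linear transformation of m_i
a = np.exp(a_hat) / np.exp(a_hat).sum()      # normalized attention weights
r = a[:, None] * h               # r_i = a_i * h_i, fed to the convolutional layer
```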
        <p>Based on such text representations, the sequence of features will be assigned
with different attention weights. Thus, crucial information such as emotional words
can be identified more easily. The convolutional layer takes the text representation
ri as its input, and the output CNN feature maps are concatenated together.
Finally, text sentiment analysis can be improved by using TSV from TFF.
3.4</p>
      </sec>
      <sec id="sec-3-4">
        <title>Audio and Text Modal Fusion with the Multimodal-Attention Mechanism</title>
        <p>
          Inspired by human visual attention, the attention mechanism, proposed by [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]
for neural machine translation, is introduced into the encoder-decoder framework
to select reference words from the source language for the words in the target
language. Based on the existing attention mechanism, inspired by the work in [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ],
we improve the multimodal-attention method on the basis of the multi-feature
fusion strategy, focusing on the fusion of comprehensive and complementary
sentiment information from audio and text. We leverage the multimodal-attention
mechanism to preserve the intermediate outputs of the input sequences by
retaining the Bi-LSTM encoder, and then a model is trained to selectively learn
these inputs and to correlate them with the output sequence.
        </p>
        <p>More specifically, ASV and TSV are firstly encoded with Audio-BiLSTM and
Text-BiLSTM using Equation 3.</p>
        <p>A_{t+1} = f_θ(A_t, x_{t+1}),  A_{t-1} = f_θ(A_t, x_{t-1}),  T_{t+1} = f_θ(T_t, x_{t+1}),  T_{t-1} = f_θ(T_t, x_{t-1}) (3)</p>
        <p>In Equation 3, f is the LSTM function with the weight parameter θ. A_{t+1}, A_t, and A_{t-1} represent the hidden states at the (t+1)th, tth, and (t-1)th time steps of the audio modality, respectively, and x_{t+1} and x_{t-1} represent the features at the (t+1)th and (t-1)th time steps, respectively. The text modality is similar, represented by T.</p>
        <p>a_t = exp(eᵀ·h_t) / Σ_t exp(eᵀ·h_t),  Z_a = Σ_t a_t·h_t (4)</p>
        <p>t_t = exp(eᵀ·h′_t) / Σ_t exp(eᵀ·h′_t),  Z_t = Σ_t t_t·h′_t (5)</p>
        <p>ŷ_{i,j} = softmax(concat(concat(Z_a, Z_t), A)ᵀ·M + b)</p>
        <p>We then consider the final ASV e as an intermediate vector, as shown in
Figure 1. During each time step t, the dot product of the intermediate vector
e and the hidden state h_t is evaluated to calculate a similarity score a_t. Using
this score as a weight parameter, the weighted sum Σ_t a_t·h_t is calculated to
generate a multi-feature fusion vector Z_a. The multi-feature fusion vector of the
text modality is calculated similarly, represented by Z_t. We are therefore able to
obtain two kinds of multi-feature fusion vectors for the audio modality and the
text modality respectively, as shown in Equations 4 and 5. These multi-feature
fusion vectors are respectively concatenated with the final intermediate vectors
of ASV and TSV, which will pass through the softmax function to perform
sentiment analysis, as shown in Equations 6 and 7.</p>
        <p>ASV = g_θ(e),  TSV = g′_θ(h_t) (6)</p>
        <p>ŷ_i = softmax(concat(ASV, TSV)ᵀ·M + b) (7)</p>
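        <p>The multimodal-attention fusion can be sketched in NumPy as follows; the dimensions, the shared use of a single intermediate vector e for both modalities, and the random parameters M and b are illustrative assumptions:</p>

```python
# NumPy sketch of the multimodal-attention fusion: attention weights from
# dot products with the intermediate vector e, weighted sums Z_a and Z_t,
# then concatenation and softmax. Shapes and parameters are illustrative.
import numpy as np

rng = np.random.default_rng(1)
T, d, n_classes = 5, 4, 2
h_audio = rng.normal(size=(T, d))     # audio hidden states h_t
h_text = rng.normal(size=(T, d))      # text hidden states h'_t
e = rng.normal(size=d)                # intermediate vector e (final ASV)

def softmax(x):
    x = x - x.max()
    return np.exp(x) / np.exp(x).sum()

a = softmax(h_audio @ e)              # similarity scores a_t for the audio branch
t = softmax(h_text @ e)               # similarity scores t_t for the text branch
Z_a = (a[:, None] * h_audio).sum(0)   # audio multi-feature fusion vector
Z_t = (t[:, None] * h_text).sum(0)    # text multi-feature fusion vector

z = np.concatenate([Z_a, Z_t])        # concat of the two fusion vectors
M = rng.normal(size=(2 * d, n_classes))
b = np.zeros(n_classes)
y_hat = softmax(z @ M + b)            # final sentiment prediction
```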
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Empirical Evaluation</title>
      <p>
        In this section, we firstly introduce the datasets, the evaluation metrics and the
network structure parameters used in our experiments, and then exhibit the
experimental results and make comparisons with other state-of-the-art models to
show the advantages of DFF-ATMF. Finally, further discussions are presented to
help understand the learning behavior of DFF-ATMF better.
      </p>
      <p>
        Datasets The datasets used for training and test are depicted in Table 2. The
CMU-MOSI dataset is rich in sentiment expression, consisting of 2,199 utterances,
that is, 93 videos by 89 speakers. The videos involve a large array of topics such
as movies, books, and other products. These videos were crawled from YouTube
and segmented into utterances where each utterance is annotated with scores
between −3 (strongly negative) and +3 (strongly positive) by five annotators.
We take the average of these five annotations as the sentiment polarity and then
consider only two classes, that is, “positive” and “negative”. Our training and test
splits of the dataset are completely disjoint with respect to speakers. In order
to better compare with the previous work, similar to [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ], we divide the dataset
by 7:3 approximately, resulting in 1,616 and 583 utterances for training and test
respectively.
      </p>
      <p>
        The CMU-MOSEI dataset is an upgraded version of the CMU-MOSI dataset,
which has 3,229 videos, that is, 22,676 utterances, from more than 1,000 online
YouTube speakers. The training and test sets include 18,051 and 4,625 utterances
respectively, similar to [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        The IEMOCAP dataset was collected following theatrical theory in order to
simulate natural dyadic interactions between actors. We use categorical
evaluations with majority agreement and use only four emotional categories, that is,
“happy”, “sad”, “angry”, and “neutral” to compare the performance of our model
with other studies using the same categories [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ].
      </p>
      <p>Evaluation Metrics We evaluate the performance of our proposed model by
the weighted accuracy on 2-class or multi-class classifications.</p>
      <p>Additionally, F1-Score is used to evaluate 2-class classification.</p>
      <p>weighted accuracy = correct utterances / total utterances (8)</p>
      <p>F = (1 + β²) · precision · recall / (β² · precision + recall) (9)</p>
      <p>In Equation 9, β represents the weight between precision and recall. During
our evaluation process, we set β = 1 since we consider precision and recall to
have the same weight, and thus the F1-score is adopted.</p>
      <p>However, in emotion recognition, we use the Macro F1-score to evaluate the
performance, as shown in Equation 10:</p>
      <p>Macro F1 = (1/n) Σ_n F1_n (10)</p>
      <p>In Equation 10, n represents the number of categories and F1_n is the F1
score on the nth category.</p>
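      <p>The three metrics amount to the following direct computations (a sketch; the helper names are ours):</p>

```python
# Direct sketch of the evaluation metrics in Equations 8-10.
def weighted_accuracy(y_true, y_pred):
    # Equation 8: correct utterances / total utterances
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def f_score(precision, recall, beta=1.0):
    # Equation 9: with beta = 1, precision and recall are weighted equally
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)

def macro_f1(per_class_f1):
    # Equation 10: mean of the per-class F1 scores over the n categories
    return sum(per_class_f1) / len(per_class_f1)
```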
      <p>Network Structure Parameters Our proposed architecture is implemented
on the open-source deep learning framework TensorFlow. More specifically, for the
proposed audio and text multi-modality fusion framework, we use Bi-LSTM with
200 neurons, each followed by a dense layer consisting of 100 neurons. Utilizing the
dense layer, we project the input features of audio and text to the same dimension,
and next combine them with the multimodal-attention mechanism. We set the
dropout hyperparameter to be 0.4 for CMU-MOSI and 0.3 for CMU-MOSEI &amp;
IEMOCAP as a measure of regularization. We also use the same dropout rates
for the Bi-LSTM layers. We employ the ReLU function in the dense layers and
softmax in the final classification layer. When training the network, we set the
batch size to be 32, and use Adam optimizer with the cross-entropy loss function
and train for 50 epochs. In data processing, we put each utterance in one-to-one
correspondence with its label and rename the utterance.</p>
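      <p>A Keras sketch of the fusion network under the stated hyperparameters (Bi-LSTM with 200 units per branch, 100-neuron dense layers projecting to a shared dimension, dropout 0.4, ReLU, softmax, Adam with cross-entropy); the input feature dimensions and the exact layer wiring are assumptions:</p>

```python
# TensorFlow/Keras sketch of the audio-text fusion network described above.
# Input feature dimensions and the exact wiring are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers

def branch(input_dim, name):
    """One modality branch: Bi-LSTM (200 units) then a 100-neuron dense layer."""
    inp = tf.keras.Input(shape=(None, input_dim), name=name)
    x = layers.Bidirectional(layers.LSTM(200, dropout=0.4))(inp)
    x = layers.Dense(100, activation="relu")(x)  # project to a shared dimension
    x = layers.Dropout(0.4)(x)
    return inp, x

audio_in, audio_vec = branch(74, "audio")   # audio feature dim is illustrative
text_in, text_vec = branch(768, "text")     # e.g. a BERT embedding size

fused = layers.Concatenate()([audio_vec, text_vec])
out = layers.Dense(2, activation="softmax")(fused)  # 2-class sentiment

model = tf.keras.Model([audio_in, text_in], out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

<p>Training then uses batch size 32 for 50 epochs, as stated above.</p>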
      <p>The network structure of the proposed audio and text multi-feature fusion
framework is similar. Taking the audio multi-feature fusion framework as an
example, the hidden states of Bi-LSTM are of 2 × 200 dimensions. The kernel sizes of
the CNN are 3, 5, 7, and 9, respectively. The size of the feature map is 4 × 200. The
dropout rate is a random number between 0.3 and 0.4. The loss function used
is MAE, and the batch size is set to 16. We combine the training set and the
development set in our experiments. We use 90% for training and reserve 10% for
cross-validation. To train the feature encoder, we follow the fine-tuning training
strategy.</p>
      <p>
        In order to reduce randomness and improve credibility, we report the average
value over 3 runs for all experiments.
      </p>
      <p>
        We compare DFF-ATMF with the following state-of-the-art models:
– [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] proposes an LSTM-based model that enables utterances to capture
contextual information from their surroundings in the video, thus aiding the
classification.
– [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] introduces attention-based networks to improve both context learning
and dynamic feature fusion.
– [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ] proposes a novel multimodal fusion technique called the Dynamic Fusion
Graph (DFG).
– [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] explores three different deep learning-based architectures, each improving
upon the previous one, which is the current state-of-the-art method on the
IEMOCAP dataset.
– [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] proposes a recurrent neural network-based multimodal-attention
framework that leverages contextual information, which is the current
state-of-the-art model on the CMU-MOSI dataset.
– [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] proposes a new method of learning the hidden representations
between speech and text data using CNN, which is the current state-of-the-art
model on the CMU-MOSEI dataset.
      </p>
      <p>Table 3 shows the comparison of DFF-ATMF with other state-of-the-art
models. From Table 3, we can see that DFF-ATMF outperforms the other models
on the CMU-MOSI dataset and the IEMOCAP dataset. At the same time,
the experimental results on the CMU-MOSEI dataset also show DFF-ATMF’s
competitive performance.</p>
      <p>
        Generalization Ability Analysis In order to verify the feature
complementarity of our proposed fusion strategy and its robustness, we conduct experiments
on the IEMOCAP dataset to examine DFF-ATMF’s generalization capability.
Surprisingly, our proposed fusion strategy is effective on the IEMOCAP dataset
and outperforms the current state-of-the-art method in [
        <xref ref-type="bibr" rid="ref25">25</xref>
          ], as can be seen
from Table 3, with the overall accuracy improved by 3.17%. More detailed
experimental results on the IEMOCAP dataset are illustrated in Table 4.
      </p>
      <sec id="sec-4-1">
        <title>Further Discussions</title>
        <p>The above experimental results have already shown that DFF-ATMF can improve
the performance of audio-text sentiment analysis. We now analyze the attention
values to understand the learning behavior of the proposed architecture better.</p>
        <p>We take a video from the CMU-MOSI test set as an example. From the
attention heatmap in Figure 5, we can see evidently that by applying different
weights across contextual utterances and modalities, the model is able to predict
labels of all the utterances correctly, which shows that our proposed fusion
strategy with multi-feature and multi-modality is indeed effective, and thus has
good feature complementarity and excellent robustness of generalization ability.
However, at the same time, we have a doubt about the multi-feature fusion.
When the raw waveform of the audio is fused with the vector of acoustic features,
the dimensions are inconsistent. If the existing method is utilized to reduce the
dimension, some audio information may also be lost. We intend to solve this
problem from the perspective of some mathematical theory such as the angle
between two vectors.</p>
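        <p>To make the fused-attention mechanism discussed above concrete, the
following is a minimal sketch of softmax attention weights computed over
utterance-level audio and text features. This is not the authors’
implementation; the function name, shapes, and the additive combination are
illustrative assumptions, but the weight matrix it produces is exactly the kind
of row-stochastic matrix visualized in the heatmaps.</p>

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def attention_fusion(audio_feats, text_feats):
    """Fuse per-utterance audio and text features with softmax attention.

    audio_feats, text_feats: (n_utterances, d) arrays.
    Returns the fused (n_utterances, d) representation and the
    (n_utterances, n_utterances) attention weights, whose rows each
    sum to 1 (the matrix shown as a heatmap).
    """
    # Cross-modal matching scores between every pair of utterances.
    scores = audio_feats @ text_feats.T           # (n, n)
    weights = softmax(scores, axis=1)             # each row sums to 1
    # Attend over the text features and combine with the audio stream.
    attended = weights @ text_feats               # (n, d)
    fused = audio_feats + attended                # illustrative combination
    return fused, weights

rng = np.random.default_rng(0)
audio = rng.standard_normal((3, 4))
text = rng.standard_normal((3, 4))
fused, w = attention_fusion(audio, text)
print(w.sum(axis=1))  # rows of the attention heatmap sum to 1
```

<p>An utterance that dominates a row of the weight matrix is exactly the
bright cell in the corresponding heatmap, so inspecting the rows shows which
contextual utterances and which modality the model relied on.</p>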
        <p>Fig. 5. Softmax attention weights of an example from the CMU-MOSI test set.</p>
        <p>Fig. 6. Softmax attention weights of an example from the CMU-MOSEI test set.</p>
        <p>Similarly, the attention weight distribution heatmaps on the CMU-MOSEI
and IEMOCAP test sets are shown in Figures 6 and 7, respectively. Furthermore,
we also give the softmax attention weight comparison of the CMU-MOSI,
CMU-MOSEI, and IEMOCAP test sets in Figure 8.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>In this paper, we propose a novel fusion strategy, including both
multi-feature fusion and multi-modality fusion. The learned features have
strong complementarity and robustness, leading to state-of-the-art experimental
results on audio-text multimodal sentiment analysis tasks. Experiments on both
the CMU-MOSI and CMU-MOSEI datasets show that our proposed model is very
competitive. More surprisingly, the experiments on the IEMOCAP dataset achieve
unexpected state-of-the-art results, indicating that DFF-ATMF can also be
generalized to multimodal emotion recognition. We did not consider the video
modality because we aim to use only the audio and text information derived from
videos; to the best of our knowledge, this is the first such attempt in the
multimodal domain. In the future, we will consider more fusion strategies
supported by basic mathematical theories for multimodal sentiment analysis.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>This research work was supported by the National Undergraduate Training
Programs for Innovation and Entrepreneurship (Grant No. 201810022064) and the
World-Class Discipline Construction and Characteristic Development Guidance
Funds for Beijing Forestry University (Grant No. 2019XKJS0310). We also thank
the anonymous reviewers for their thoughtful comments. Special thanks to the
support of AAAI 2020 and AffCon2020.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Baltrušaitis</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ahuja</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morency</surname>
            ,
            <given-names>L.P.:</given-names>
          </string-name>
          <article-title>Multimodal machine learning: A survey and taxonomy</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>41</volume>
          (
          <issue>2</issue>
          ),
          <fpage>423</fpage>
          -
          <lpage>443</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bertero</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fung</surname>
            ,
            <given-names>P.:</given-names>
          </string-name>
          <article-title>A first look into a convolutional neural network for speech emotion detection</article-title>
          .
          <source>In: 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP)</source>
          . pp.
          <fpage>5115</fpage>
          -
          <lpage>5119</lpage>
          . IEEE (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Cai</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hao</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <article-title>: Multi-view and attention-based bi-lstm for weibo emotion recognition</article-title>
          . In: 2018 International Conference on Network, Communication, Computer Engineering (NCCE),
          <source>Advances in Intelligent Systems Research</source>
          , volume
          <volume>147</volume>
          . pp.
          <fpage>772</fpage>
          -
          <lpage>779</lpage>
          . Atlantis Press (
          <year>2018</year>
          ). https://doi.org/10.2991/ncce-18.2018.127
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Chaturvedi</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cambria</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Welsch</surname>
            ,
            <given-names>R.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Herrera</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Distinguishing between facts and opinions for sentiment analysis: Survey and challenges</article-title>
          .
          <source>Information Fusion</source>
          <volume>44</volume>
          ,
          <fpage>65</fpage>
          -
          <lpage>77</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Dai</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cohen</surname>
            ,
            <given-names>W.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carbonell</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Q.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salakhutdinov</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Transformer-xl: Attentive language models beyond a fixed-length context</article-title>
          . arXiv preprint arXiv:1901.02860
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Bert: Pre-training of deep bidirectional transformers for language understanding</article-title>
          . arXiv preprint arXiv:1810.04805
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Ghosal</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Akhtar</surname>
            ,
            <given-names>M.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chauhan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Poria</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ekbal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bhattacharyya</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Contextual inter-modal attention for multi-modal sentiment analysis</article-title>
          .
          <source>In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing</source>
          . pp.
          <fpage>3454</fpage>
          -
          <lpage>3466</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tashev</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Speech emotion recognition using deep neural network and extreme learning machine</article-title>
          . In:
          <source>The fifteenth annual conference of the international speech communication association (INTERSPEECH)</source>
          . pp.
          <fpage>223</fpage>
          -
          <lpage>227</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Hussein</surname>
            ,
            <given-names>D.M.E.D.M.</given-names>
          </string-name>
          :
          <article-title>A survey on sentiment analysis challenges</article-title>
          .
          <source>Journal of King Saud University-Engineering Sciences</source>
          <volume>30</volume>
          (
          <issue>4</issue>
          ),
          <fpage>330</fpage>
          -
          <lpage>338</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Jianqiang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xiaolin</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xuejun</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>Deep convolution neural networks for twitter sentiment analysis</article-title>
          .
          <source>IEEE Access 6</source>
          ,
          <fpage>23253</fpage>
          -
          <lpage>23260</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Kozinets</surname>
            ,
            <given-names>R.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scaraboto</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parmentier</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          :
          <article-title>Evolving netnography: how brand auto-netnography, a netnographic sensibility, and more-than-human netnography can transform your research</article-title>
          .
          <source>JOURNAL OF MARKETING MANAGEMENT 34</source>
          (
          <issue>3-4</issue>
          ),
          <fpage>231</fpage>
          -
          <lpage>242</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>C.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Song</surname>
            ,
            <given-names>K.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jeong</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Choi</surname>
            ,
            <given-names>W.Y.</given-names>
          </string-name>
          :
          <article-title>Convolutional attention networks for multimodal emotion recognition from speech and text data</article-title>
          .
          <source>arXiv preprint arXiv:1805.06606</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>C.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mower</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Busso</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Narayanan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Emotion recognition using a hierarchical binary decision tree approach</article-title>
          .
          <source>Speech Communication</source>
          <volume>53</volume>
          (
          <issue>9-10</issue>
          ),
          <fpage>1162</fpage>
          -
          <lpage>1171</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Z.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cao</surname>
            ,
            <given-names>W.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mao</surname>
            ,
            <given-names>J.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>G.Z.</given-names>
          </string-name>
          :
          <article-title>Speech emotion recognition based on feature selection and extreme learning machine decision tree</article-title>
          .
          <source>Neurocomputing</source>
          <volume>273</volume>
          ,
          <fpage>271</fpage>
          -
          <lpage>280</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Luo</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Audio sentiment analysis by heterogeneous signal features learned from utterance-based parallel neural network</article-title>
          .
          <source>In: Proceedings of the AAAI19 Workshop on Affective Content Analysis</source>
          , Honolulu, USA, AAAI (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Luong</surname>
            ,
            <given-names>M.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pham</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          :
          <article-title>Effective approaches to attention-based neural machine translation</article-title>
          .
          <source>arXiv preprint arXiv:1508.04025</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Majumder</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hazarika</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gelbukh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cambria</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Poria</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Multimodal sentiment analysis using hierarchical fusion with context modeling</article-title>
          .
          <source>KnowledgeBased Systems</source>
          <volume>161</volume>
          ,
          <fpage>124</fpage>
          -
          <lpage>133</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>McFee</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raffel</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liang</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ellis</surname>
            ,
            <given-names>D.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McVicar</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Battenberg</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nieto</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          <article-title>: librosa: Audio and music signal analysis in python</article-title>
          .
          <source>In: Proceedings of the 14th python in science conference</source>
          . pp.
          <fpage>18</fpage>
          -
          <lpage>25</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Minaee</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abdolrashidi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Deep-emotion: Facial expression recognition using attentional convolutional network</article-title>
          . arXiv preprint arXiv:1902.01019
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Parthasarathy</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tashev</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Convolutional neural network techniques for speech emotion recognition</article-title>
          .
          <source>In: 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC)</source>
          . pp.
          <fpage>121</fpage>
          -
          <lpage>125</lpage>
          . IEEE (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Poria</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cambria</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bajpai</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hussain</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>A review of affective computing: From unimodal analysis to multimodal fusion</article-title>
          .
          <source>Information Fusion</source>
          <volume>37</volume>
          ,
          <fpage>98</fpage>
          -
          <lpage>125</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Poria</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cambria</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hazarika</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Majumder</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zadeh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morency</surname>
            ,
            <given-names>L.P.</given-names>
          </string-name>
          :
          <article-title>Context-dependent sentiment analysis in user-generated videos</article-title>
          . In:
          <article-title>Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</article-title>
          .
          <source>vol. 1</source>
          , pp.
          <fpage>873</fpage>
          -
          <lpage>883</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Poria</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cambria</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hazarika</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mazumder</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zadeh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morency</surname>
            ,
            <given-names>L.P.:</given-names>
          </string-name>
          <article-title>Multi-level multiple attentions for contextual multimodal sentiment analysis</article-title>
          .
          <source>In: 2017 IEEE International Conference on Data Mining (ICDM)</source>
          . pp.
          <fpage>1033</fpage>
          -
          <lpage>1038</lpage>
          . IEEE (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Poria</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hussain</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cambria</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Combining textual clues with audio-visual information for multimodal sentiment analysis</article-title>
          .
          <source>In: Multimodal Sentiment Analysis</source>
          , pp.
          <fpage>153</fpage>
          -
          <lpage>178</lpage>
          . Springer (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Poria</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Majumder</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hazarika</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cambria</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gelbukh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hussain</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Multimodal sentiment analysis: Addressing key issues and setting up the baselines</article-title>
          .
          <source>IEEE Intelligent Systems</source>
          <volume>33</volume>
          (
          <issue>6</issue>
          ),
          <fpage>17</fpage>
          -
          <lpage>25</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Schuller</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rigoll</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lang</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>Hidden markov model-based speech emotion recognition</article-title>
          .
          <source>In: 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing</source>
          ,
          <year>2003</year>
          . Proceedings.
          <source>(ICASSP'03)</source>
          . vol.
          <volume>2</volume>
          , pp. II-1
          . IEEE (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Schuller</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rigoll</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lang</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture</article-title>
          .
          <source>In: 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing</source>
          . vol.
          <volume>1</volume>
          , pp. I-577
          . IEEE (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yuan</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>THU_NGN at SemEval-2018 task 1: Fine-grained tweet sentiment intensity analysis with attention CNN-LSTM</article-title>
          .
          <source>In: Proceedings of The 12th International Workshop on Semantic Evaluation</source>
          . pp.
          <fpage>186</fpage>
          -
          <lpage>192</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Yin</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shah</surname>
            ,
            <given-names>R.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zimmermann</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Learning and fusing multimodal deep features for acoustic scene categorization</article-title>
          .
          <source>In: 2018 ACM Multimedia Conference on Multimedia Conference</source>
          . pp.
          <fpage>1892</fpage>
          -
          <lpage>1900</lpage>
          . ACM (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <surname>Yoon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Byun</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jung</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Multimodal speech emotion recognition using audio and text</article-title>
          .
          <source>In: 2018 IEEE Spoken Language Technology Workshop (SLT)</source>
          . pp.
          <fpage>112</fpage>
          -
          <lpage>118</lpage>
          . IEEE (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <surname>Zadeh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zellers</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pincus</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morency</surname>
            ,
            <given-names>L.P.</given-names>
          </string-name>
          :
          <article-title>MOSI: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos</article-title>
          .
          <source>arXiv preprint arXiv:1606.06259</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          32.
          <string-name>
            <surname>Zadeh</surname>
            ,
            <given-names>A.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liang</surname>
            ,
            <given-names>P.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Poria</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cambria</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morency</surname>
            ,
            <given-names>L.P.</given-names>
          </string-name>
          :
          <article-title>Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph</article-title>
          . In:
          <source>Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          . vol.
          <volume>1</volume>
          , pp.
          <fpage>2236</fpage>
          -
          <lpage>2246</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          33.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Deep learning for sentiment analysis: A survey</article-title>
          .
          <source>Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery</source>
          <volume>8</volume>
          (
          <issue>4</issue>
          ),
          <fpage>e1253</fpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>