<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Attentive RNNs for Continuous-time Emotion Prediction in Music Clips</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sanga Chaki</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pranjal Doshi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Priyadarshi Patnaik</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sourangshu Bhattacharya</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Advanced Technology Development Centre</institution>
          ,
          <addr-line>IIT Kharagpur</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Computer Science and Engineering Department</institution>
          ,
          <addr-line>IIT Kharagpur</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Humanities &amp; Social Sciences Department</institution>
          ,
          <addr-line>IIT Kharagpur</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Continuous-time prediction of self-reported musical emotions is a challenging problem with many applications. However, there are relatively few studies on the design of deep learning models for this problem. Existing methods have used LSTMs, with modest success. In this work, we describe an attentive LSTM-based approach for emotion prediction from music clips. We postulate that attending to specific regions in the past gives the model a better chance of predicting the emotions evoked by the present notes. We validate our model through extensive experimentation on the standard 1000 Songs for Emotional Analysis of Music dataset, which is annotated with arousal and valence values in continuous time. We find that the attentive models significantly improve the prediction performance for arousal and valence over a vanilla LSTM, both in terms of the R<sup>2</sup> and Kendall-τ metrics.</p>
      </abstract>
      <kwd-group>
        <kwd>Attention</kwd>
        <kwd>Emotion Prediction</kwd>
        <kwd>LSTM</kwd>
        <kwd>Music Emotion</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>
Music is well known as an effective means of eliciting emotions in listeners [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
Automatic determination of the perceived emotion in music has become a
major area of focus for the music information retrieval (MIR) community. It finds
varied applications in the domains of personalized and/or generalized music
recommendation, organizing music databases, automatic music creation, etc. Many
recent studies have used the Circumplex model of affect, proposed by Russell [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ],
to denote music emotions. According to the dimensional Circumplex model [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ],
emotion is mapped onto a 2-D plane, spanned by two axes denoting arousal
and valence, as points given by the pair of values &lt;arousal, valence&gt;. Thus,
the problem of emotion recognition/prediction is turned into a two-dimensional
regression problem [
        <xref ref-type="bibr" rid="ref18">18</xref>
]. Keeping this in mind, a number of publicly available
music-clip datasets have been developed, which help to test novel methods for
music emotion prediction [
        <xref ref-type="bibr" rid="ref13">13</xref>
]. The emotions related to
music form a time-continuous process, in which the context of the sequential music
frames plays an immense role in the related emotion. From a machine learning
perspective, this points to the need for context-sensitive models
such as recurrent neural networks (RNNs) in music emotion prediction. RNNs can
access previous-step activations through their hidden states, which remember
relevant information about the sequence so far, to predict the current emotion. Long
Short-Term Memory (LSTM) networks are one such type of RNN, and they have performed
well in several MIR tasks including music emotion regression [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <p>
In this study, we propose to use the attention mechanism with a deep RNN
structure composed of LSTMs, to continuously predict the perceived emotion in each defined
time frame of music. We use the well-known ComParE 2013 [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]
set of features, extracted using the openSMILE tool [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and the 1000 Songs for
Emotional Analysis of Music [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] dataset for evaluation in the present study, as
it has been shown to produce significant results in many recent works [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
Current state-of-the-art methods for audio sentiment analysis are mostly based
on deep neural networks. RNNs are a class of neural networks suited
to time-series data. They use the outputs of network units at time t as inputs
to other units at time t + 1. This allows RNNs to store temporal information
present within the input data. Though, in theory, RNNs can keep track of
arbitrarily long-term dependencies in the input sequences, in practice they suffer from
the problem of vanishing gradients [
        <xref ref-type="bibr" rid="ref10">10</xref>
]. RNNs using Long Short-Term Memory
(LSTM) [
        <xref ref-type="bibr" rid="ref7">7</xref>
] units partially solve this problem. LSTMs have been found to be
extremely useful for capturing long-term context or dependencies in data [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and
are now widely used to solve a large variety of problems, including MIR tasks.
Recently, Coutinho et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
] and Weninger et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] used RNN-LSTM
networks successfully to perform continuous-time music mood regression. Weninger
et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
] report performance by averaging predictions, achieving R<sup>2</sup> of up to
0.70 and 0.50 for continuous-time arousal and valence respectively. Another of
their works [
        <xref ref-type="bibr" rid="ref17">17</xref>
] also tries to improve on this performance by using a different
cost function.
      </p>
      <p>
Though RNN-LSTMs are useful, it must be acknowledged that the difficulty
of successfully capturing the context increases with the length of the input sequence
[
        <xref ref-type="bibr" rid="ref9">9</xref>
]. This may become problematic for the neural network in the case of longer input
sequences like those in music. Here, inter-(musical-)event relationships might play
a bigger role in eliciting emotions than the actual sequence of (musical) events.
A change in the order of the musical notes or other events might change the
emotions considerably, much as in context-sensitive languages. To address this
issue, Bahdanau et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
] proposed the attention model for the encoder-decoder
architecture for neural machine translation. According to the attention model, to
compute each output, the model attends to those parts of the input sequence
which are more relevant for that particular output, by assigning higher weights
to the associated encoder-side hidden states, using an alignment model. Though
this model was originally proposed for the purpose of encoder-decoder-based
neural machine translation [
        <xref ref-type="bibr" rid="ref1">1</xref>
], it finds application in many different problems.
Early works include the use of LSTMs for finding temporal structure in music [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
music composition and generation [
        <xref ref-type="bibr" rid="ref4">4</xref>
] by Eck et al. Recently, Coutinho et al.
[
        <xref ref-type="bibr" rid="ref2">2</xref>
] and Weninger et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] used RNN-LSTM networks successfully
to perform continuous-time music mood regression.
      </p>
      <p>
Most of the MIR tasks utilizing deep RNN-LSTM structures need a
considerable amount of training data to produce good results. In the domain of music
emotion recognition, one such widely used dataset is the 1000 Songs for
Emotional Analysis of Music [
        <xref ref-type="bibr" rid="ref13">13</xref>
]. In the present work, we use this dataset for
evaluation. The openSMILE [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] toolkit is used to extract the ComParE [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] feature
set for training.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Methodology</title>
      <sec id="sec-3-1">
        <title>Dataset and Acoustic Features Used</title>
        <p>
          In the present work, we use the 1000 Songs for Emotional Analysis of
Music dataset [
          <xref ref-type="bibr" rid="ref13">13</xref>
] for all experiments. Of the thousand clips, the dataset provides
arousal and valence annotations for only 744 clips, which are used as ground-truth
values. Among these, 10% of the clips were assigned to the test set and the
remaining formed the training set. We also use a set of purely acoustic affective
features, given by the baseline feature set of the 2013 Computational
Paralinguistics Evaluation (ComParE) tasks [
          <xref ref-type="bibr" rid="ref12">12</xref>
]. It has been shown by Weninger et al. [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] that this set performs well in assessing emotion in terms of arousal and
valence. The feature set contains 6670 features. These features are calculated by
applying statistical functionals to the contours of low-level descriptors (LLDs) over
fixed-length segments or time frames of the music audio signal, or over the
whole song. The statistical functionals include the mean, moments, etc. The LLDs
include auditory-weighted frequency bands, their sum, spectral measures such as
the centroid, roll-off point, skewness, sharpness, and spectral flux, MFCCs, etc. The
complete set of the LLDs and functionals and their detailed analysis can be found
in [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. In the present work, we use TUM's open-source openSMILE feature
extractor [
          <xref ref-type="bibr" rid="ref6">6</xref>
] to extract these features at non-overlapping intervals of 500 ms for
each music clip. The feature values for the dataset were observed to span different
ranges. Thus, before performing multivariate regression, standard normalization
was performed on the feature set. The features of the last 30 seconds of each
clip from the dataset are used for this work. So, each clip is characterised by 61
feature vectors, each of size 6670. The arousal and valence annotations for each
500 ms time frame provided by the dataset [
          <xref ref-type="bibr" rid="ref13">13</xref>
] are used as the ground-truth values.
        </p>
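        <p>
          As an illustration, the standardization step described above can be sketched as follows; the array shapes and function names are illustrative assumptions, not the authors' released pipeline.
        </p>
        <preformat>
# A minimal sketch of the feature standardization described above
# (shapes and names are illustrative assumptions, not the authors' code).
import numpy as np

def standardize(train, test):
    """Z-score each of the 6670 ComParE features using training-set
    statistics only. train/test: (num_clips, 61, 6670) arrays of
    per-500-ms feature vectors covering the last 30 s of each clip."""
    flat = train.reshape(-1, train.shape[-1])   # pool frames over clips
    mean = flat.mean(axis=0)
    std = flat.std(axis=0) + 1e-8               # guard against constant features
    return (train - mean) / std, (test - mean) / std
        </preformat>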
      </sec>
      <sec id="sec-3-2">
        <title>LSTM-RNN</title>
        <p>The key component of an LSTM is the cell state C, through which the relevant
context/dependency information between the elements in the input sequence
flows, with careful regulation by the forget gate (f), input gate (i) and the
output gate (o). Intuitively, it can be understood that the forget gate (Equation 1)
decides which information is irrelevant and can be thrown away from the cell
state.</p>
        <p>f<sub>t</sub> = σ(W<sub>f</sub> · [h<sub>t−1</sub>; x<sub>t</sub>] + b<sub>f</sub>)   (1)</p>
        <p>Next, the input gate (Equation 2) regulates what new information needs to be
stored in the cell state, with the help of a vector of new candidate values
(Equation 3).</p>
        <p>i<sub>t</sub> = σ(W<sub>i</sub> · [h<sub>t−1</sub>; x<sub>t</sub>] + b<sub>i</sub>)   (2)</p>
        <p>C̃<sub>t</sub> = tanh(W<sub>C</sub> · [h<sub>t−1</sub>; x<sub>t</sub>] + b<sub>C</sub>)   (3)</p>
        <p>Thus, an update to the cell state is performed (Equation 4).</p>
        <p>C<sub>t</sub> = f<sub>t</sub> ⊙ C<sub>t−1</sub> + i<sub>t</sub> ⊙ C̃<sub>t</sub>   (4)</p>
        <p>Finally, the output gate (Equation 5) decides the output of the network (Equation 6),
based on a filtered version of the cell state.</p>
        <p>o<sub>t</sub> = σ(W<sub>o</sub> · [h<sub>t−1</sub>; x<sub>t</sub>] + b<sub>o</sub>)   (5)</p>
        <p>h<sub>t</sub> = o<sub>t</sub> ⊙ tanh(C<sub>t</sub>)   (6)</p>
        <p>In the above equations, x<sub>t</sub> and h<sub>t</sub> denote the input and output at time t,
each W and b denote the associated weights and biases of each of the gates,
σ is the sigmoid function used as the gate activation, and ⊙ denotes element-wise
multiplication. Though RNN-LSTMs (Equation 7) provide a great way to carry relevant
information from one step to the next through the cell state, the difficulty increases
with the length of the input sequence. Basically, the neural network has to compress
all the necessary information of an input sequence into a single fixed-length vector,
the last hidden state (Equation 8). This may become problematic for the neural network
in the case of longer input sequences.</p>
        <p>h<sub>t</sub> = f(x<sub>t</sub>, h<sub>t−1</sub>)   (7)</p>
        <p>c = q(h<sub>1</sub>, h<sub>2</sub>, …, h<sub>T</sub>) = h<sub>T</sub>   (8)</p>
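        <p>
          For concreteness, one step of the gate equations above can be realized in a few lines; the following NumPy sketch assumes the four gates share one stacked weight matrix W, an implementation convenience that the text does not specify.
        </p>
        <preformat>
# A minimal NumPy sketch of one LSTM step, Equations (1)-(6).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """W: (4n, n+d) stacked gate weights; b: (4n,) stacked biases."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    n = h_prev.shape[0]
    f_t = sigmoid(z[0:n])                  # forget gate, Eq. (1)
    i_t = sigmoid(z[n:2*n])                # input gate, Eq. (2)
    C_tilde = np.tanh(z[2*n:3*n])          # candidate values, Eq. (3)
    C_t = f_t * C_prev + i_t * C_tilde     # cell-state update, Eq. (4)
    o_t = sigmoid(z[3*n:4*n])              # output gate, Eq. (5)
    h_t = o_t * np.tanh(C_t)               # output, Eq. (6)
    return h_t, C_t
        </preformat>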
      </sec>
      <sec id="sec-3-3">
        <title>Attention Mechanism</title>
        <p>
To address this issue, Bahdanau et al. [
          <xref ref-type="bibr" rid="ref1">1</xref>
] proposed the attention model for
the encoder-decoder architecture for neural machine translation. Let x<sub>i</sub> and y<sub>i</sub>
denote the i-th input and output of the model; h<sub>i</sub> and s<sub>i</sub> are the hidden states of
the encoder and decoder associated with the i-th input and output respectively; each
annotation h<sub>i</sub> contains information about the whole input sequence with a strong
focus on the parts surrounding the i-th input; c<sub>i</sub> is the unique context vector
associated with the i-th input; and g(·) is a function of y<sub>i−1</sub>, s<sub>i</sub> and c<sub>i</sub>. According to
the attention model [
          <xref ref-type="bibr" rid="ref1">1</xref>
], to compute each output (Equation 9), a distinct context
vector (Equation 11) is used, which is a function of all the hidden states at the
encoder side and not just the last one. Here, Equation 10 is a modified form of
Equation 7.
        </p>
        <p>p(y<sub>i</sub> | y<sub>1</sub>, y<sub>2</sub>, …, y<sub>i−1</sub>, x) = g(y<sub>i−1</sub>, s<sub>i</sub>, c<sub>i</sub>)   (9)</p>
        <p>s<sub>i</sub> = f(s<sub>i−1</sub>, y<sub>i</sub>, c<sub>i</sub>)   (10)</p>
        <p>c<sub>i</sub> = Σ<sub>j=1</sub><sup>T<sub>x</sub></sup> α<sub>ij</sub> h<sub>j</sub>   (11)</p>
        <p>Each time, the context vector c<sub>i</sub> is calculated as a weighted sum of all the
hidden states (Equation 12). The idea is that, for each output, the context vector
will attend to those parts of the input sequence which are more relevant for
that particular output, by assigning higher weights to the associated
encoder-side hidden states, using an alignment model. In Equation 13, e<sub>ij</sub> is the score of
how well the inputs around position j and the output at position i align or match.</p>
        <p>α<sub>ij</sub> = exp(e<sub>ij</sub>) / Σ<sub>k=1</sub><sup>T<sub>x</sub></sup> exp(e<sub>ik</sub>)   (12)</p>
        <p>e<sub>ij</sub> = a(s<sub>i−1</sub>, h<sub>j</sub>)   (13)</p>
        <p>We use a modified form of attention, as the problem we are tackling is music
mood regression and not machine translation; thus, there is no need for a
decoder-side architecture. The encoder encodes the input into a set of hidden states, and
attention is applied to them to produce the target arousal and valence values.
Generally, the encoder in neural machine translation reads the input sequence
x = (x<sub>1</sub>, x<sub>2</sub>, …, x<sub>T</sub>), a sequence of vectors, and produces the hidden
states (h<sub>1</sub>, h<sub>2</sub>, …, h<sub>T</sub>), using some RNN approach as in Equation 7. In case
an LSTM is used, Equation 7 takes the specific form of Equation 6. In most cases, the
whole set of hidden states (h<sub>1</sub>, h<sub>2</sub>, …, h<sub>T</sub>) is available to compute the context
vector for the translation. So, all the hidden states are used for the context vector,
either with attention (Equation 11) or without (Equation 8). This also makes
sense for natural language processing, as the translation of an input x<sub>t</sub> might
depend on any input x<sub>i</sub>, where both i &lt; t and i &gt; t are possible.</p>
        <p>But, when we listen to music, the emotion associated with the music at the t-th
second is seldom influenced by the music following it. Rather, it might be argued
that the associated emotions at the t-th second will be more dependent on the
music preceding it. Let the output be y = (y<sub>1</sub>, y<sub>2</sub>, …, y<sub>T</sub>). The t-th output,
y<sub>t</sub>, will then be a function of a) the present hidden state h<sub>t</sub>, b) the previous output
y<sub>t−1</sub>, and c) the unique context vector c<sub>t</sub> (Equation 14).</p>
        <p>p(y<sub>t</sub> | y<sub>1</sub>, y<sub>2</sub>, …, y<sub>t−1</sub>, x) = g<sub>1</sub>(h<sub>t</sub>, y<sub>t−1</sub>, c<sub>t</sub>)   (14)</p>
        <p>The unique context vector c<sub>t</sub> depends on the sequence of annotations (h<sub>1</sub>, h<sub>2</sub>, …, h<sub>t−1</sub>)
and is computed as a weighted sum of these annotations h<sub>j</sub> (Equation 15). So, the model is
attending to each h<sub>j</sub> corresponding to each of the past inputs.</p>
        <p>c<sub>t</sub> = Σ<sub>j=1</sub><sup>t−1</sup> α<sub>tj</sub> h<sub>j</sub>   (15)</p>
        <p>
          As in Bahdanau et al.'s [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] work, referring to our Equations 12 and 13, for each output y<sub>t</sub> we calculate the alignment between the corresponding h<sub>t−1</sub> and each h<sub>j</sub> (Equation 16). Each of these scores e<sub>tj</sub> is then used to calculate the attention weight for each h<sub>j</sub> (Equation 17).
        </p>
        <p>e<sub>tj</sub> = a(h<sub>t−1</sub>, h<sub>j</sub>)   (16)</p>
        <p>α<sub>tj</sub> = exp(e<sub>tj</sub>) / Σ<sub>k=1</sub><sup>t−1</sup> exp(e<sub>tk</sub>)   (17)</p>
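        <p>
          The causal attention of Equations 15-17 admits a compact sketch; here the alignment model a(·,·) is taken to be a plain dot product, an illustrative assumption (it could equally be a small learned network).
        </p>
        <preformat>
# A minimal NumPy sketch of the causal attention, Equations (15)-(17),
# with a dot-product alignment (an assumption made for illustration).
import numpy as np

def causal_context(H, t):
    """Context vector c_t over past states h_1..h_{t-1} (requires t >= 2).
    H: (T, n) array; H[j-1] holds h_j in the paper's 1-indexed notation."""
    past = H[:t-1]                        # h_1 .. h_{t-1}
    scores = past @ H[t-2]                # e_tj = a(h_{t-1}, h_j), Eq. (16)
    w = np.exp(scores - scores.max())     # numerically stable softmax
    alpha = w / w.sum()                   # attention weights, Eq. (17)
    return alpha @ past                   # c_t = sum_j alpha_tj h_j, Eq. (15)
        </preformat>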
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experimental Setup</title>
      <sec id="sec-4-1">
        <title>Training and Evaluation</title>
        <p>
          10-fold cross-validation was used on the training and test sets. Evaluation
measures are computed and reported on the entire test set, not by averaging
across folds. We compare the proposed attention approach to the more
traditional LSTM-RNN approach, which has provided good results in the past [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
Both use the same input features, standardized to zero mean and unit variance.
Neural networks with one or two hidden layers were used for the experiments.
The number of LSTM units (linear activation) used in each case varied from 32
to 1024. For the attention networks, the attention layer is added after the
hidden layers, using a sigmoid attention activation. Root Mean Square Propagation
(RMSProp) optimization with 10 sequences per weight update is used for
training. Training is done for a maximum of 30 epochs. An early-stopping strategy is also
used, making use of a validation set from each fold's training set: if the validation
error shows no improvement of more than 10<sup>−4</sup> after 5 epochs, training is stopped.
Mean squared error (MSE) is used to calculate the loss. Sequences are presented
in random order during training. All hyper-parameters not explicitly mentioned
here are left at their default values as in TensorFlow 1.14.
        </p>
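        <p>
          A hedged sketch of this training setup in the Keras API shipped with TensorFlow is given below; the single-layer topology, the built-in dot-product attention layer, and the callback wiring are illustrative assumptions rather than the exact experimental code.
        </p>
        <preformat>
# Illustrative training-setup sketch (TensorFlow/Keras); the topology and
# the causal Attention layer are assumptions, not the authors' exact code.
import tensorflow as tf

def build_model(seq_len=61, n_feats=6670, n_units=400):
    inp = tf.keras.Input(shape=(seq_len, n_feats))
    h = tf.keras.layers.LSTM(n_units, return_sequences=True)(inp)
    # Dot-product attention restricted to past frames, cf. Eqs. (15)-(17).
    # (Recent Keras versions pass use_causal_mask=True at call time instead.)
    ctx = tf.keras.layers.Attention(causal=True)([h, h])
    out = tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(1))(ctx)
    model = tf.keras.Model(inp, out)
    model.compile(optimizer=tf.keras.optimizers.RMSprop(), loss='mse')
    return model

model = build_model()
early = tf.keras.callbacks.EarlyStopping(monitor='val_loss',
                                         min_delta=1e-4, patience=5)
# model.fit(X_train, y_train, batch_size=10, epochs=30, shuffle=True,
#           validation_data=(X_val, y_val), callbacks=[early])
        </preformat>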
      </sec>
      <sec id="sec-4-2">
        <title>Models Used</title>
        <p>
          The networks used in the current work are assigned names depending on whether
they apply attention (AT) or not (NAT), followed by the layer sizes. For
example, an LSTM no-attention network with 1 layer of 128 hidden units is
named LSTM NAT 128, and an LSTM attention network with 2 layers of 700
and 128 hidden units is named LSTM AT 700 128. Thus, all networks
belonging to each proposed model class are assigned the suffixes a) LSTM NAT for
LSTM without attention and b) LSTM AT for LSTM with attention. We replicate one
of the best models proposed in Weninger et al.'s work [
          <xref ref-type="bibr" rid="ref15">15</xref>
], with a single-layer
LSTM-RNN, though using the whole dataset [
          <xref ref-type="bibr" rid="ref13">13</xref>
] and the entire feature set [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. We
get comparable results for a layer size of 400 units. This network is named LSTM NAT 400
and used as the baseline in this work.
        </p>
        <p>[Fig. 1: (a) Clip 2 - Arousal Comparison; (b) Clip 2 - Valence Comparison]</p>
        <p>
The metrics used for reporting the results are the coefficient of determination (R<sup>2</sup>),
the average Kendall's τ per song (τ), and the mean absolute error (MAE). The
coefficient of determination (R<sup>2</sup>) is a key output of regression analysis, which provides a
measure of how well observed outcomes are replicated by the model, based on the
proportion of the total variation of outcomes explained by the model. The best possible
score is 1.0; it can also be negative. If a data set has n values (y<sub>1</sub> … y<sub>n</sub>),
each associated with a predicted value (f<sub>1</sub> … f<sub>n</sub>), then R<sup>2</sup> is defined as
        </p>
        <p>R<sup>2</sup> = 1 − SS<sub>res</sub> / SS<sub>tot</sub>   (18)</p>
        <p>
where SS<sub>res</sub> = Σ<sub>i</sub> (y<sub>i</sub> − f<sub>i</sub>)<sup>2</sup> and SS<sub>tot</sub> = Σ<sub>i</sub> (y<sub>i</sub> − ȳ)<sup>2</sup>, given ȳ = (1/n) Σ<sub>i=1</sub><sup>n</sup> y<sub>i</sub>.
Kendall's τ per song (τ) is a measure of how well the emotional profile of each
song is captured by the regressor, as opposed to overall correlation. It measures
the correspondence between two rankings. Values close to 1 indicate strong
agreement; values close to −1 indicate strong disagreement. It is defined as
        </p>
        <p>τ = (P − Q) / √((P + Q + T)(P + Q + U))   (19)</p>
        <p>
where P is the number of concordant pairs, Q the number of discordant pairs, T
the number of ties only in the target set (y<sub>1</sub> … y<sub>n</sub>), and U the number of ties only in
the predicted set (f<sub>1</sub> … f<sub>n</sub>). The mean absolute error (MAE) is given for reference.
In the next section, we report the results of applying the proposed model for
dynamic music emotion regression.
        </p>
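        <p>
          The three metrics can be computed with standard libraries; the per-song averaging below follows the description above, with hypothetical variable names.
        </p>
        <preformat>
# Sketch of the evaluation metrics: R2 (Eq. 18), per-song Kendall's tau-b
# (Eq. 19, the tie-corrected variant) and MAE. Names are illustrative.
import numpy as np
from scipy.stats import kendalltau
from sklearn.metrics import r2_score, mean_absolute_error

def evaluate(y_true, y_pred, song_ids):
    r2 = r2_score(y_true, y_pred)                    # over the whole test set
    mae = mean_absolute_error(y_true, y_pred)
    taus = [kendalltau(y_true[song_ids == s], y_pred[song_ids == s])[0]
            for s in np.unique(song_ids)]            # one tau per song
    return r2, float(np.mean(taus)), mae
        </preformat>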
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experimental Results</title>
      <sec id="sec-5-1">
        <title>Comparison of methods for emotion prediction</title>
        <p>In the current work, we use four main types of LSTM-RNN models:</p>
        <list list-type="bullet">
          <list-item>
            <p>No Attention (LSTM NAT): single layer (e.g. LSTM NAT 128) or double layer (e.g. LSTM NAT 700 128)</p>
          </list-item>
          <list-item>
            <p>With Attention (LSTM AT): single layer (e.g. LSTM AT 128) or double layer (e.g. LSTM AT 400 128)</p>
          </list-item>
        </list>
        <p>In the first set of experiments, the performances of different models using
different network topologies are compared. The best results obtained from each
of these models are summarized in Table 1.</p>
        <p>In the case of LSTM NAT networks, separate models for arousal and
valence are trained using the LSTM-RNN architecture. Performances of networks
having one hidden layer with 128, 300, 400, 512, 700, and 1024 units and two
hidden layers with (700, 128) and (700, 400) units are calculated. Table 2 reports
the results for regression without attention, using different network topologies.
For the single-layer topologies, a clear trend can be seen for arousal. With the
increase in layer size (L1 Size), R<sup>2</sup><sub>A</sub> increases. τ<sub>A</sub> increases till L1 Size = 700,
but decreases for L1 Size = 1024. The metrics for valence do not follow a clear
trend. It can be seen that LSTM NAT 700 performs best in terms of all the
evaluation metrics considered, for both arousal and valence, giving R<sup>2</sup><sub>A</sub> = 0.70,
τ<sub>A</sub> = 0.21, R<sup>2</sup><sub>V</sub> = 0.39, and τ<sub>V</sub> = 0.10. Though LSTM NAT 1024 performs better
for arousal (R<sup>2</sup><sub>A</sub> = 0.73), its performance dips for valence (R<sup>2</sup><sub>V</sub> = 0.39); τ is also
reduced for both arousal and valence. The two-layer topologies of this model,
LSTM NAT 700 128 and LSTM NAT 700 400, perform comparably to the best
single-layer networks LSTM NAT 700 and LSTM NAT 1024 for arousal, both in
terms of R<sup>2</sup><sub>A</sub> and τ<sub>A</sub>. The performance for valence decreases in the 2-layer
topologies. Thus, increasing layer size might help improve performance for arousal, but
not for valence. Also, increasing the number of hidden layers might be unable
to produce any significant improvement in performance for both arousal and
valence.</p>
        <p>The performances of LSTM AT networks using different network
topologies are presented in Table 3. Performances of networks having one hidden layer
with 32, 64, 128, 300, and 400 units and two hidden layers with (300, 128) and
(400, 128) units are calculated. A clear trend in the performances of arousal and
valence predictions is observed in this case. For arousal, among the
single-layer topologies, the best performance in terms of R<sup>2</sup> is recorded for the networks
LSTM AT 300 and LSTM AT 400. It can be seen that the addition of the attention
mechanism improves the performance according to both metrics. For both arousal
and valence, the best performance among all the models used is recorded for
LSTM AT 400, with R<sup>2</sup><sub>A</sub> = 0.75 and R<sup>2</sup><sub>V</sub> = 0.53. Henceforth, for all comparison
purposes, we use this model as the best proposed model of this study. Increasing the
number of layers produces comparable performance for both arousal and valence,
and no significant change is observed.</p>
        <p>[Figure: (a) Arousal - LSTM AT 400 (Best) Prediction; (b) Valence - LSTM AT 400 (Best) Prediction; (c) Arousal - LSTM NAT 400 (Baseline) Prediction; (d) Valence - LSTM NAT 400 (Baseline) Prediction]</p>
      </sec>
      <sec id="sec-5-2">
        <title>Fine-grained emotion prediction</title>
        <p>
          In the second set of experiments, the best models for arousal and valence
predictions, as obtained in the previous section, are used for fine-grained (per 500
ms) emotion prediction of some music clips. For arousal and valence predictions,
we use the LSTM AT 400 model (Table 1). We choose two clips from the 1000
Songs for Emotional Analysis of Music [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] dataset, with clip ids 2 and 584
respectively. Clip 2 is of the Blues genre and has negative valence (gloomy). Clip 584
is of the Folk genre, and is significantly upbeat with positive valence (happy). We
compare the predicted values with a) the ground-truth values as provided by
the 1000 Songs for Emotional Analysis of Music dataset [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], and b) the
baseline model [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], as represented by LSTM NAT 400. Figures 1(a) and 2(a) denote
the time-varying arousal predictions, and Figures 1(b) and 2(b) denote the
time-varying predictions for valence. In the case of clip 2, Figure 1(a) shows that the arousal
prediction errors are lower for the proposed model initially, for the first 20 seconds.
In the last 10 seconds, the errors of the proposed model and the baseline model
are comparable. But for valence prediction, the errors of the proposed model are
significantly lower, as seen in Figure 1(b). For clip 584, Figure 2(a) shows that
the arousal prediction errors are lower across the entire clip for the proposed
model, thus matching the ground truth more closely. For valence prediction, as seen in
Figure 2(b), the errors of the proposed attention model are significantly low for
the entire clip.
        </p>
      </sec>
      <sec id="sec-5-2">
        <title>Cross analysis of errors</title>
        <p>In the third set of experiments, we use the best proposed model LSTM AT 400
and the baseline model LSTM NAT 400 on the validation set, to group the clips
into error bins for arousal and valence prediction. These are shown as histograms
in Figure 3. Comparing Figures 3(a) and 3(c), it can be seen that, for the proposed
model, the number of clips with higher values of errors is lower in the case of arousal.
In the case of valence, for the proposed model, almost all the clips are grouped into
the error bins up to 0.05 (Figure 3(b)), whereas for the baseline model, a significant
number of clips is present across bins.</p>
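        <p>
          The error-bin grouping can be sketched as below; the 0.05 bin width is taken from the discussion above, while the number of bins and the per-clip error measure are assumptions.
        </p>
        <preformat>
# Illustrative sketch of the per-clip error-binning analysis.
import numpy as np

def error_bins(y_true, y_pred, song_ids, bin_width=0.05, n_bins=10):
    """Histogram of per-clip mean absolute errors."""
    errs = [np.mean(np.abs(y_true[song_ids == s] - y_pred[song_ids == s]))
            for s in np.unique(song_ids)]
    edges = np.arange(0.0, bin_width * (n_bins + 1), bin_width)
    return np.histogram(errs, bins=edges)
        </preformat>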
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>We demonstrate that the state-of-the-art models for continuous-time emotion
prediction perform modestly, thus emphasizing the need for further research in
this area. We have proposed an attentive LSTM-based model which improves on
the state-of-the-art performance significantly, on a standard benchmark dataset
with standard metrics.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bahdanau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cho</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.:</given-names>
          </string-name>
          <article-title>Neural machine translation by jointly learning to align and translate</article-title>
          .
          <source>In: 3rd International Conference on Learning Representations, ICLR</source>
          <year>2015</year>
          , San Diego, CA, USA, May 7-9,
          <year>2015</year>
          , Conference Track Proceedings (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Coutinho</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weninger</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schuller</surname>
            ,
            <given-names>B.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scherer</surname>
            ,
            <given-names>K.R.:</given-names>
          </string-name>
          <article-title>The Munich LSTM-RNN approach to the MediaEval 2014 "Emotion in Music" task</article-title>
          . In: MediaEval (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Eck</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Finding temporal structure in music: Blues improvisation with lstm recurrent networks</article-title>
          .
          <source>In: Proceedings of the 12th IEEE workshop on neural networks for signal processing</source>
          . pp.
          <volume>747</volume>
          -
          <fpage>756</fpage>
          .
          IEEE
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Eck</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>A rst look at music composition using lstm recurrent neural networks</article-title>
          .
          <source>Istituto Dalle Molle Di Studi Sull Intelligenza Artificiale 103</source>
          ,
          <issue>48</issue>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Eyben</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Real-time speech and music classification by large audio feature space extraction</article-title>
          . Springer (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Eyben</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          , Wollmer,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Schuller</surname>
          </string-name>
          ,
          <string-name>
            <surname>B.</surname>
          </string-name>
          :
          <article-title>Opensmile: the munich versatile and fast open-source audio feature extractor</article-title>
          .
          <source>In: Proceedings of the 18th ACM international conference on Multimedia</source>
          . pp.
          <volume>1459</volume>
          -
          <fpage>1462</fpage>
          .
          ACM
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Hochreiter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural computation 9(8)</source>
          ,
          <volume>1735</volume>
          -
          <fpage>1780</fpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Juslin</surname>
            ,
            <given-names>P.N.</given-names>
          </string-name>
          :
          <article-title>From mimesis to catharsis: expression, perception, and induction of emotion in music</article-title>
          .
          <source>Musical communication</source>
          pp.
          <volume>85</volume>
          -
          <issue>115</issue>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Koehn</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Knowles</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Six challenges for neural machine translation</article-title>
          .
          <source>In: Proceedings of the First Workshop on Neural Machine Translation</source>
          . pp.
          <volume>28</volume>
          -
          <fpage>39</fpage>
          . Association for Computational Linguistics (
          <year>Aug 2017</year>
          ). https://doi.org/10.18653/v1/W17-3204
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Pascanu</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>On the difficulty of training recurrent neural networks</article-title>
          .
          <source>In: International conference on machine learning</source>
          . pp.
          <volume>1310</volume>
          -
          <issue>1318</issue>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Russell</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>A circumplex model of affect</article-title>
          .
          <source>Journal of Personality and Social Psychology</source>
          <volume>39</volume>
          (
          <issue>6</issue>
          ),
          <volume>1161</volume>
          -
          <fpage>1178</fpage>
          (
          <year>1980</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Schuller</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Steidl</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Batliner</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vinciarelli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scherer</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ringeval</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chetouani</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weninger</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eyben</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marchi</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          , et al.:
          <article-title>The interspeech 2013 computational paralinguistics challenge: social signals, conflict, emotion, autism</article-title>
          .
          <source>In: Proceedings INTERSPEECH</source>
          <year>2013</year>
          ,
          <article-title>14th Annual Conference of the International Speech Communication Association</article-title>
          , Lyon, France (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Soleymani</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Caro</surname>
            ,
            <given-names>M.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidt</surname>
            ,
            <given-names>E.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sha</surname>
            ,
            <given-names>C.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.H.:</given-names>
          </string-name>
          <article-title>1000 songs for emotional analysis of music</article-title>
          .
          <source>In: Proceedings of the 2nd ACM international workshop on Crowdsourcing for multimedia</source>
          . pp.
          <volume>1</volume>
          -
          <issue>6</issue>
          .
          ACM
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Weninger</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eyben</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schuller</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>The tum approach to the mediaeval music emotion task using generic a ective audio features</article-title>
          .
          <source>In: Proceedings MediaEval 2013 Workshop</source>
          , Barcelona, Spain (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Weninger</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eyben</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schuller</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>On-line continuous-time music mood regression with deep recurrent neural networks</article-title>
          .
          <source>In: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          . pp.
          <volume>5412</volume>
          -
          <fpage>5416</fpage>
          .
          IEEE
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Weninger</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eyben</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schuller</surname>
            ,
            <given-names>B.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mortillaro</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scherer</surname>
            ,
            <given-names>K.R.</given-names>
          </string-name>
          :
          <article-title>On the acoustics of emotion in audio: what speech, music, and sound have in common</article-title>
          .
          <source>Frontiers in psychology 4</source>
          ,
          <issue>292</issue>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Weninger</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ringeval</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marchi</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schuller</surname>
            ,
            <given-names>B.W.</given-names>
          </string-name>
          :
          <article-title>Discriminatively trained recurrent neural networks for continuous dimensional emotion recognition from audio</article-title>
          .
          <source>In: IJCAI</source>
          . vol.
          <year>2016</year>
          , pp.
          <volume>2196</volume>
          -
          <issue>2202</issue>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>Y.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Su</surname>
            ,
            <given-names>Y.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>H.H.:</given-names>
          </string-name>
          <article-title>A regression approach to music emotion recognition</article-title>
          .
          <source>IEEE Transactions on audio, speech, and language processing 16(2)</source>
          ,
          <volume>448</volume>
          -
          <fpage>457</fpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>