<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Partners in Crime: Utilizing Arousal-Valence Relationship for Continuous Prediction of Valence in Movies</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tanmayee Joshi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sarath Sivaprasad</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Niranjan Pedanekar</string-name>
          <email>n.pedanekar@tcs.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>TCS Research, Tata Consultancy Services Limited, 54B Hadapsar Industrial Estate</institution>
          ,
          <addr-line>Pune 411002</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The arousal-valence model is often used in characterizing human emotions. Arousal is defined as the intensity of emotion, while valence is defined as the polarity of emotion. Continuous prediction of valence in entertainment media such as movies is important for applications such as ad placement and personalized recommendations. While arousal can be effectively predicted using audio-visual information in movies, valence is reported to be more difficult to predict as it also involves understanding the semantics of the movie. In this paper, for improving valence prediction, we utilize the insight from psychology that valence and arousal are interrelated. We use Long Short Term Memory networks (LSTMs) to model the temporal context in movies using standard audio features as input. We incorporate arousal-valence interdependence in two ways: (1) as a joint loss function to optimize the prediction network, and (2) as a geometric constraint simulating the distribution of arousal-valence observed in psychology literature. Using a joint arousal-valence model, we predict continuous valence for a dataset containing Academy Award winning movies. We report a significant improvement over the state-of-the-art results, with an improved Pearson correlation of 0.69 between the annotation and prediction using the joint model, as compared to a baseline prediction of 0.49 using an independent valence model.</p>
      </abstract>
      <kwd-group>
        <kwd>Emotion Prediction</kwd>
        <kwd>Movies</kwd>
        <kwd>Audio</kwd>
        <kwd>LSTM</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Entertainment media such as movies can create a variety of emotions in viewers'
        minds. These emotions vary in intensity as well as in polarity, and change
        continuously with time. A single scene can go from low intensity to high
        intensity and from positive to negative polarity in a matter of seconds. Such
        changes are often accompanied by cinematic devices such as variation in music
        intensity, speech intensity, shot framing, composition and character movements.
        In addition, static aspects such as scene color tones and ambient sound also
        contribute towards setting the polarity of the scene. Prediction and profiling
        of emotions that movies can generate in viewers finds utility in a variety of
        affective computing applications. For example, the predicted intensity of
        emotions in a movie can be used to place advertisements. A viewer is likely to
        pay attention to an advertisement placed where emotional intensity is low.
        Similarly, the viewer experience is likely to be adversely affected if one
        places a happy advertisement after a sad scene. Using such insights, Yadati et
        al. used motion, cut density and audio energy to predict the emotion profile of
        YouTube videos for optimizing advertisement placement in videos [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Additional uses of emotion prediction
        have been reported for content recommendation [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and content indexing [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>Note: Tanmayee Joshi and Sarath Sivaprasad contributed equally to this work.</p>
      <p>Fig. 1. (a) The 2-D emotion map; (b) high arousal, positive valence; (c) high arousal, negative valence; (d) low arousal, neutral valence.</p>
      <p>
        Hanjalic and Xu proposed that emotional content in entertainment media
such as movies and videos be modeled as a continuous 2-dimensional space of
arousal and valence, the 2-D emotion map, shown in Fig. 1(a) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Arousal is a
measure of how intense a perceived emotion is, while valence is an indication
of whether it is positive or negative or neutral. For example, excited is a high
arousal and positive valence emotional state, distressed is a high arousal and
negative valence emotional state, while relaxed is a low arousal and neutral
valence emotional state. One can find scenes from movies corresponding to such
emotional states. For example, a scene from American Beauty in Fig. 1(b) shows
an excited protagonist in a high intensity romantic dream sequence and is located
in the top right of the parabolic contour of the 2-D emotion map. Similarly, a high
intensity scene from the movie Crash, where the character is distressed thinking
that his daughter is shot, is located on the top left part of this contour. A scene
from the movie Million Dollar Baby where the protagonist and her coach are
taking a relaxed car ride is located near the bottom at the centerline.
      </p>
      <p>
        Continuous prediction of arousal and valence, while important to the
aforementioned applications in entertainment, is a challenging task since movies
feature a dynamic interplay of audio, visual and textual (semantic) information [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
Baveye et al. predicted continuous valence and arousal profiles for a dataset of
30 short films using kernel methods and deep learning [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Malandrakis et al.
predicted continuous valence and arousal profiles using hand-crafted audio-visual
features on an annotated dataset of 30-minute clips from 12 Academy Award
winning movies [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Goyal et al. reported an improvement over these results
using a Mixture-of-Experts (MoE) model for fusing the audio and visual model
predictions of emotion [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Sivaprasad et al. improved the predictions further by
using Long Short Term Memory networks (LSTMs) to capture the context of
the changing audio-visual information for predicting the changes in emotions [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        A consistent observation across the aforementioned results of continuous
emotion prediction has been that the correlation of valence prediction to annotation
is worse than that for arousal. This is because valence prediction often requires
higher-order semantic information about the movie over and above lower-order
audio-visual information [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. For example, a violent fight scene has a negative
connotation, but if the protagonist is winning, it is perceived as a positive scene.
Also, a bright visual of a garden may lead to a positive connotation, but the
dialogs might indicate a more negative note.
      </p>
      <p>
        We found that in all aforementioned results for continuous prediction, arousal
and valence were modeled separately. Zhang and Zhang suggested that arousal
and valence for videos be modeled together [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. They created a dataset of 200
short videos (5 to 30 seconds) consisting of movies, talk shows, news, dramas
and sports. They annotated the videos on a five-point categorical scale of arousal
and valence. Training a single LSTM model with audio and visual features as
input, they predicted a single value of arousal and valence for each video clip.
      </p>
      <p>
        We believe that for real-life applications such as optimal placement of
advertisements, continuous prediction of arousal and valence on longer videos is
necessary, unlike prediction over short clips mentioned in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. A more useful dataset
for this purpose is that created by Malandrakis et al. consisting of 30-minute
clips from 12 Academy Award winning movies with continuous annotations for
arousal and valence [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. We found that the state-of-the-art results on this dataset
reported a Pearson Correlation Coefficient of 0.84 between predicted and
annotated arousal, and that of 0.50 between predicted and annotated valence [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ],
where arousal and valence models were trained independently. This indicated
that the independent arousal model could capture the variation in the dataset
much better than the independent valence model. Also, the correlation between
annotated arousal and absolute annotated valence was relatively high (0.62) for
this dataset. We argued that, given the high accuracy of arousal prediction
models and the high correlation in annotations, we could use the information
learned by the arousal models while predicting valence. Furthermore, if we could
incorporate the insight from cognitive psychology that arousal and valence
values typically lie within the parabolic contour shown in Fig. 1, then we could
further improve valence prediction.
      </p>
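      <p>As an illustration of the statistic quoted above, the following minimal sketch computes the Pearson correlation between an arousal series and the absolute value of a valence series. The function name and the toy values are hypothetical and are not taken from the dataset; they only show how a V-shaped arousal-valence relationship can give a high correlation with absolute valence even when the correlation with signed valence is close to zero.</p>
      <preformat><![CDATA[
```python
import numpy as np


def pearson(x, y):
    """Pearson correlation coefficient between two 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))


# Toy annotations on the [-1, 1] scale (made up for illustration only).
valence = np.array([-0.8, -0.4, 0.0, 0.4, 0.8])
arousal = np.array([0.9, 0.1, -0.6, 0.2, 0.8])

r_signed = pearson(arousal, valence)            # arousal vs. signed valence
r_absolute = pearson(arousal, np.abs(valence))  # arousal vs. |valence|
```
]]></preformat>
      <p>With these toy values, the correlation with signed valence is close to zero while the correlation with absolute valence is high, which is the kind of pattern that motivates sharing information between the arousal and valence models.</p>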
      <p>Our Contribution</p>
      <p>
        Zhang and Zhang used a single joint LSTM model to predict arousal and
        valence simultaneously. We argued that such a model was not adequate to capture
        the interdependence of arousal and valence for the continuous dataset. In this
        paper, we use separate LSTM models for continuous prediction of arousal and
        valence, but incorporate arousal-valence interdependence in two distinct ways:
        (1) as a joint loss function to optimize the prediction LSTM network, and (2) as
        a geometric constraint simulating the distribution of arousal-valence observed
        in psychology literature. Using these models, we significantly improve on the
        baseline for continuous valence prediction reported in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Since previous work has reported
        audio being more important to the prediction of valence [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], we use only
        audio features as input to our models.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Dataset and Features</title>
      <p>
        In this paper, we used the dataset described by Malandrakis et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] containing
        continuous annotations of arousal and valence by experts for 30-minute clips
        from 12 Academy Award winning movies. The annotation scale for both arousal
        and valence was [-1, 1]. A valence annotation of -1 indicated extreme negative
        emotions, while that of +1 indicated extreme positive emotions. Similarly, an
        arousal annotation of -1 indicated extremely low intensity, while that of +1
        indicated extremely high intensity. We sampled the annotations of arousal and
        valence at 5-second intervals as previously suggested by Goyal et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. We found
that previous work reported audio being more important to the prediction of
valence [
        <xref ref-type="bibr" rid="ref5 ref8">5, 8</xref>
        ]. So we decided to use only audio features as input to our models.
We calculated the following audio features for non-overlapping 5-second clips as
described by Goyal et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]: audio compressibility, harmonicity, Mel-frequency
        cepstral coefficients (MFCC) with derivatives and statistics (min, max, mean),
and Chroma with derivatives and statistics (min, max, mean). We further used
a correlation-based feature selection prescribed by Witten et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] to narrow
down the set of input features.
      </p>
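      <p>The clip-level statistics described above can be sketched as follows, assuming that frame-level features (for example MFCCs) have already been extracted by some audio library. The function name, the frame rate and the random input are hypothetical; only the non-overlapping 5-second clips and the min, max and mean statistics come from the text. The derivative is approximated here by a first-order frame difference, which is an assumption of this sketch.</p>
      <preformat><![CDATA[
```python
import numpy as np


def clip_statistics(frames, frames_per_clip):
    """Summarize frame-level features over non-overlapping clips.

    frames: array of shape (n_frames, n_features).
    Returns an array of shape (n_clips, 6 * n_features) holding the
    min, max and mean of each feature and of its frame-to-frame difference.
    Trailing frames that do not fill a whole clip are dropped.
    """
    frames = np.asarray(frames, dtype=float)
    # First-order difference as a simple stand-in for the derivative;
    # the first frame is prepended so the output keeps n_frames rows.
    deltas = np.diff(frames, axis=0, prepend=frames[:1])
    both = np.concatenate([frames, deltas], axis=1)

    n_clips = both.shape[0] // frames_per_clip
    clips = both[: n_clips * frames_per_clip]
    clips = clips.reshape(n_clips, frames_per_clip, both.shape[1])
    return np.concatenate(
        [clips.min(axis=1), clips.max(axis=1), clips.mean(axis=1)], axis=1
    )


# Hypothetical input: 13 coefficients at 100 frames per second for 12 seconds.
rng = np.random.default_rng(0)
frames = rng.normal(size=(1200, 13))
features = clip_statistics(frames, frames_per_clip=500)  # 5-second clips
```
]]></preformat>
      <p>Each row of the result corresponds to one 5-second clip and could then be passed to a feature selection step before being fed to the models.</p>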
    </sec>
    <sec id="sec-3">
      <title>Prediction Model</title>
      <p>Fig. 2. Generalized schematic of the prediction models. Movie clips pass through audio feature extraction and feature selection into two networks (fc, LSTM, LSTM, fc) that output A<sub>pred</sub> and V<sub>pred</sub>. MSE losses L1 and L2 are computed against the arousal and valence annotations, respectively, together with the Euclidean loss (L3) and the shape loss (L4). The arousal network is trained by backpropagating L1 + (L3 or L3 + L4), and the valence network by backpropagating L2 + (L3 or L3 + L4).</p>
      <p>
        We implemented a single model similar to the one described by Zhang and Zhang
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] to predict valence and arousal simultaneously. We found that such a model
        was not adequately complex to capture the interdependence of arousal and
        valence, and performed worse than the baseline results obtained by Sivaprasad et
al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. So, we decided to model arousal and valence independently, but utilize a
joint loss function to train the models thus allowing the interdependence to be
modeled.
      </p>
      <p>
        In particular, we designed one model for independent prediction of valence,
and two models to predict valence using arousal information. Fig. 2 shows a
generalized schematic representing these models. For all models, we used the
LSTM model architecture proposed by Sivaprasad et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], but designed custom
        losses to incorporate the arousal-valence relationship in two of them. In the basic
        model architecture (denoted by the dotted box in Fig. 2), two LSTMs were used,
        the first to build a context around a representation of the input (audio features)
        and the second to model the context for the output (arousal or valence). The details of
        the LSTM models used are available in Sivaprasad et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. We used a leave-one-movie-out
        validation strategy with 12 folds (one for every movie in the dataset). Because
        the data was limited, we did not use a separate validation set. We instead
        used the second derivative of the training loss as an indicator for early stopping of
        training. To incorporate the arousal-valence relationship, we used different loss
        functions, giving us three different models as described below:
      </p>
      <p>1. Independent Model. We created two models to predict arousal and valence
        independently. We used Mean Squared Error (MSE) as the loss function for
        training the arousal and valence models, denoted by L1 and L2 in Fig. 2,
        respectively.</p>
      <p>2. Euclidean Distance-based Model. We first used the independent
        models of arousal and valence to obtain the respective predictions, and then used
        the independent model weights as initialization for this model. We
        computed the Euclidean distance between the two points P(V<sub>pred</sub>, A<sub>pred</sub>) and
        Q(V<sub>anno</sub>, A<sub>anno</sub>), where V<sub>pred</sub> and A<sub>pred</sub> are the predicted valence and arousal,
        and V<sub>anno</sub> and A<sub>anno</sub> are the annotated valence and arousal, respectively. This
        distance was treated as an additional loss, called the Euclidean loss (L3), while
        training the LSTM network. We used combined losses to train the models:
        (L1 + L3) for arousal and (L2 + L3) for valence. Thus we allowed the
        Euclidean loss to propagate in both the arousal and valence models, ensuring
        joint prediction.</p>
      <p>
        3. Shape Loss-based Model. We used the independent models of arousal
        and valence to obtain the respective predictions, and then used the independent
        model weights as initialization for this model. As can be seen from Fig. 1,
        the range of valence at any instant is governed by the value of arousal
        at that instant (and vice versa). It has also been observed that the
        position of a point in the 2-D emotion map is typically contained within a
        parabolic contour on this map [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. We argued that the shape could be
        described as a set of two parabolas as shown in Fig. 3, one forming an upper
        limit and another forming a lower limit on the 2-D emotion map. We used
        annotations of arousal and valence from this dataset as well as from the
        LIRIS-ACCEDE dataset [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and fitted these two parabolas as boundaries
        of convex hulls obtained over the combined datasets. We then incorporated
        this geometric constraint as an additional loss, called the shape loss (L4), in
        the prediction model. We measured the distance from the point P(V<sub>pred</sub>, A<sub>pred</sub>) to
        each of the two parabolas along the direction of the line joining P(V<sub>pred</sub>, A<sub>pred</sub>)
        and Q(V<sub>anno</sub>, A<sub>anno</sub>). This distance was computed for both parabolas
        and was used as two additional losses on top of the MSE loss and Euclidean loss
        described in models 1 and 2. If both the predicted and annotated points lay on
        the same side of the parabolas, then the shape loss was zero. We used this
        scheme since considering the perpendicular distance of the predicted point to the
        parabola as an error would not capture the co-dependent nature of arousal
        and valence. The shape loss was in addition to the Euclidean loss considered
        above. Thus we used combined losses to train the models: (L1 + L3 + L4) for
        arousal and (L2 + L3 + L4) for valence.
      </p>
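      <p>The Euclidean loss (L3) and shape loss (L4) described above can be sketched numerically as follows. This is a minimal NumPy illustration rather than the training code: the function names, the parabola form A = a V^2 + c, the coefficient values, the averaging over time steps and the bisection search are all assumptions of this sketch, since the fitted parabolas are not given in the text. In actual training, the same quantities would have to be expressed with differentiable operations of a deep learning framework.</p>
      <preformat><![CDATA[
```python
import numpy as np


def mse_loss(pred, anno):
    """Mean squared error (L1 for arousal, L2 for valence)."""
    return float(np.mean((np.asarray(pred) - np.asarray(anno)) ** 2))


def euclidean_loss(v_pred, a_pred, v_anno, a_anno):
    """L3: distance between P(Vpred, Apred) and Q(Vanno, Aanno).

    Averaging over time steps is an assumption of this sketch.
    """
    dv = np.asarray(v_pred) - np.asarray(v_anno)
    da = np.asarray(a_pred) - np.asarray(a_anno)
    return float(np.mean(np.hypot(dv, da)))


def shape_loss_point(p, q, a, c, iters=60):
    """Distance from P to the parabola A = a*V**2 + c, measured along PQ.

    p and q are (valence, arousal) pairs. Returns 0.0 when P and Q lie on
    the same side of the parabola.
    """
    def side(t):
        v = p[0] + t * (q[0] - p[0])
        arousal = p[1] + t * (q[1] - p[1])
        return arousal - (a * v ** 2 + c)

    if side(0.0) * side(1.0) >= 0.0:
        return 0.0
    # P and Q are on opposite sides, so the segment crosses the parabola
    # exactly once; locate the crossing by bisection.
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if side(lo) * side(mid) <= 0.0:
            hi = mid
        else:
            lo = mid
    t = 0.5 * (lo + hi)
    return t * float(np.hypot(q[0] - p[0], q[1] - p[1]))


def joint_valence_loss(v_pred, a_pred, v_anno, a_anno, parabolas=()):
    """L2 + L3, plus one L4 term per parabola when coefficients are given."""
    loss = mse_loss(v_pred, v_anno)
    loss += euclidean_loss(v_pred, a_pred, v_anno, a_anno)
    for a, c in parabolas:
        loss += float(np.mean([
            shape_loss_point((vp, ap), (vq, aq), a, c)
            for vp, ap, vq, aq in zip(v_pred, a_pred, v_anno, a_anno)
        ]))
    return loss


# Hypothetical boundary parabolas (a, c); the fitted values are not given.
PARABOLAS = [(2.0, -1.0), (0.5, 0.6)]
v_pred = np.array([-0.9, 0.1])
a_pred = np.array([-0.5, 0.0])
v_anno = np.array([0.2, 0.1])
a_anno = np.array([-0.3, 0.1])

l2 = mse_loss(v_pred, v_anno)
l3 = euclidean_loss(v_pred, a_pred, v_anno, a_anno)
total = joint_valence_loss(v_pred, a_pred, v_anno, a_anno, PARABOLAS)
```
]]></preformat>
      <p>In this toy example, the first predicted point has extreme negative valence at low arousal and falls on the opposite side of the first hypothetical parabola from its annotation, so it contributes a non-zero shape loss. The second point lies on the same side as its annotation for both parabolas and contributes none.</p>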
      <p>State Reset Noise Removal</p>
      <p>
        Because the data was limited, we trained the models using a batching
        scheme where every training batch contained a number of sequences selected
        from a random movie with a random starting point in the movie. The length
        of these sequences was 3 minutes, given the typical scene lengths of 1.5 to 3
        minutes [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. We used stateless LSTMs in our models mentioned above, and reset
        the state variable of the LSTM model after every sequence, since every sequence
        was disconnected from the others under the aforementioned training scheme. At
        prediction time, we observed that these models sometimes introduced noise in
        the predicted values at every reset of the LSTM (every 3 minutes). This was similar to
        making a fresh prediction without knowing the temporal context of the scene,
        only from the current set of input features. The noise was more noticeable when
        the model had not learned adequately from the given data. To remove such noise,
        we made predictions with a hop length of 1.5 minutes, i.e. half the sequence
        length. Thus we produced two sets of prediction sequences offset by the hop,
        except for the first and last 1.5 minutes of the movie. Since the first half of every reset
        interval was likely to contain the reset noise, we used the second half of every
        prediction and concatenated these predictions to get the final prediction. This
        scheme enabled a crude approximation of a stateful LSTM. We used this scheme
        for all three aforementioned models.
      </p>
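      <p>The overlap-and-concatenate scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the model is replaced by an arbitrary callable that predicts one window from a freshly reset state, and the default lengths of 36 and 18 steps correspond to 3 minutes and 1.5 minutes at the 5-second sampling interval. The toy predictor simply echoes its input and adds an artificial spike after each reset.</p>
      <preformat><![CDATA[
```python
import numpy as np


def stitch_predictions(predict_window, features, seq_len=36, hop=18):
    """Concatenate the second halves of overlapping window predictions.

    predict_window: callable taking an array of shape (seq_len, n_features)
        and returning seq_len predictions made from a freshly reset state.
    features: array of shape (n_steps, n_features), one row per 5-second clip.
    Assumes seq_len == 2 * hop and that n_steps is a multiple of hop.
    """
    n_steps = features.shape[0]
    if seq_len != 2 * hop or n_steps < seq_len or n_steps % hop != 0:
        raise ValueError("need seq_len == 2 * hop and n_steps a multiple of hop")

    stitched = np.empty(n_steps)
    for start in range(0, n_steps - seq_len + 1, hop):
        pred = np.asarray(predict_window(features[start:start + seq_len]))
        if start == 0:
            # The first 1.5 minutes have no earlier window to fall back on.
            stitched[:hop] = pred[:hop]
        # Keep only the half that is far from the state reset.
        stitched[start + hop:start + seq_len] = pred[hop:]
    return stitched


def toy_predictor(window):
    """Stand-in for a stateless LSTM: echoes the input with a reset spike."""
    pred = window[:, 0].copy()
    pred[:3] += 1.0  # artificial noise right after the state reset
    return pred


signal = np.linspace(-1.0, 1.0, 90)  # 90 steps of 5 s = 7.5 minutes
stitched = stitch_predictions(toy_predictor, signal.reshape(-1, 1))
```
]]></preformat>
      <p>In this toy setting the artificial spikes survive only in the first 1.5 minutes, where no earlier window is available; everywhere else the stitched sequence is taken from the second half of some window.</p>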
    </sec>
    <sec id="sec-4">
      <title>Results and Discussion</title>
      <p>
        We treated the valence prediction results obtained by Sivaprasad et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
        using only audio input features as the baseline for our experimentation. Table
        1 summarizes the comparison of Pearson Correlation Coefficients between
        annotated and predicted valence. We report the following observations:
      </p>
      <p>
        - We found that Model 3 performed the best of the three models and showed
        a significant improvement in correlation and MSE over the baseline.
      </p>
      <p>
        - Fig. 4 shows the 2-D emotion maps for the annotations as well as for the three
        models. We observed that the map for the independent models in Fig. 4(b)
        occupied the entire dimension of valence and did not adhere to the parabolic
contour prescribed by Hanjalic and Xu [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. This was because the independent
        valence model could not learn enough variations from the audio features.
        Model 2 with Euclidean loss in Fig. 4(c) brought the predictions closer to
        the parabolic contour. Model 3 with shape loss in Fig. 4(d) further improved
        the adherence to the contour by enforcing its geometric constraints.
      </p>
      <p>
        - Fig. 5 shows the comparison of continuous valence prediction for the movie
        Gladiator, for which the correlation improved significantly, from 0.33 to 0.84. We
        observed that Models 2 and 3 were much more faithful to the annotation than
        the independent model. Specifically, we identify two regions in
        Fig. 5 to discuss the effect of incorporating the arousal-valence
        interdependence in modeling valence:
      </p>
      <p>
        1. Region R1 contains a largely positive scene featuring
        discussions about the protagonist's freedom and hope of reuniting with
        family. The arousal model predicted low arousal for the scene (between
        0.0 and -0.4). However, Model 1 predicted it as a scene with extreme
        negative valence. From Fig. 1, we understand that valence can be
        extremely negative only when arousal is highly positive. The scene had
        harsh tones and ominous sounds in the background, and the independent
        model wrongly predicted negative valence in the absence of any arousal
        information. Model 2 tried to capture the interdependence by predicting
        a positive valence. Model 3 further corrected the predictions by enforcing
        the geometric constraint.
      </p>
      <p>
        2. Region R2 contains a scene boundary between an intense scene where the
        protagonist walks out victorious from a gladiator fight, and a
        conversation between the antagonist and a secondary character. The first part of
        the scene takes place in a noisy Colosseum with loud background music
        (high sound energy) and the latter part takes place in a quiet room with
        no ambient sound (low sound energy). The independent valence model
        (Model 1) failed to interpret this transition in audio as a change in
        polarity of valence, as this was probably not a trend seen in other movies in
        the training set. The independent arousal model interpreted this fall in
        audio energy as a fall in arousal, which was a general trend in detecting
        arousal. But this information was available to the valence Models 2 and
        3, and they could predict the fall in valence accurately. The 2-D emotion
        map indicates that valence cannot be at an extreme end when arousal
        is low. Hence both the models with losses incorporating this constraint
        brought down the valence from extreme positive when arousal fell.
      </p>
      <p>
        - Predicting polarity of valence is challenging owing to the need for semantic
information, which may not always be represented in the audio-visual
features [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. We also calculated the accuracy with which our models predicted
        the polarity of valence, as summarized in Table 1. We found that Model 3
        provided better prediction of polarity (87%) than Model 1 (72%).
        Also, the MSE of valence predictions was lower for Model 3 (0.09) and
        Model 2 (0.12) than for Model 1 (0.27). This indicates that
        incorporating the arousal-valence interdependence better represented polarity
        as well as value information.
      </p>
      <p>
        - Fig. 6 shows the improvement in LSTM prediction after the state reset noise
        correction. We found that this scheme removed the reset noise, seen
        predominantly as spikes in the prediction without noise removal (the dotted
        line). This uniformly gave an additional improvement of 0.06 in correlation
        over the noisy predictions for all valence models. However, for arousal, this
        improvement was only 0.02, which indicated that the arousal models were
        already learning well from the audio features.
      </p>
      <p>
        - We observed that two animated movies in the dataset did not benefit
        significantly from incorporating the interdependence between arousal and valence.
      </p>
      <p>
        For Finding Nemo, the correlation went down from 0.74 (Model 1) to 0.73
(Model 3), while for Ratatouille, it increased slightly from 0.73 (Model 1) to
0.77 (Model 3). We believe this could be because animated movies often use
a set grammar of music and audio to directly convey positive or negative
emotions. So, the independent model could predict valence using such audio
        information without the need for additional arousal information.
      </p>
      <p>
        - There was a slight decrease in arousal prediction performance
        (correlation of 0.81 for the baseline compared to 0.78 for Model 1, 0.75
        for Model 2 and 0.78 for Model 3). While arousal can be modeled well
independently using audio information [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], in our models, it also had to account
        for the error in valence, thus reducing its accuracy. For all practical purposes,
        we recommend using the independent arousal model (Model 1): while it
        performed as well as Model 2 or Model 3, it is more robust owing to
        its lower complexity.
      </p>
      <p>
        In this paper, we proposed a way to model the interdependence of arousal and
        valence using custom joint loss terms for training different LSTM models for
        arousal and valence. We used only audio features to model arousal and valence.
        We found the method to be useful in improving the prediction of valence. We
        believe that a correlation of 0.69 with annotated values is a practically important
        result for applications involving continuous prediction of valence.
      </p>
      <p>In the future, we would like to improve the accuracy of valence prediction
models by utilizing semantic information such as events and characters. We would
also like to incorporate scene boundaries to allow LSTMs to learn more complex
semantic information such as the effect of scene transitions on emotion. This
necessitates the creation of a larger dataset of continuous annotations for movies. We
believe it to be a research direction worth pursuing, making use of crowdsourcing,
wearables and machine/deep learning.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Baveye</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chamaret</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dellandrea</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Affective video content analysis: A multidisciplinary insight</article-title>
          .
          <source>IEEE Transactions on Affective Computing</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Baveye</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dellandrea</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chamaret</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>LIRIS-ACCEDE: A video database for affective content analysis</article-title>
          .
          <source>IEEE Transactions on Affective Computing</source>
          <volume>6</volume>
          (
          <issue>1</issue>
          ),
          <fpage>43</fpage>
          –
          <lpage>55</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bordwell</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>The way Hollywood tells it: Story and style in modern movies</article-title>
          . Univ of California Press (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Canini</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Benini</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leonardi</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Affective recommendation of movies based on selected connotative features</article-title>
          .
          <source>IEEE Transactions on Circuits and Systems for Video Technology</source>
          <volume>23</volume>
          (
          <issue>4</issue>
          ),
          <fpage>636</fpage>
          –
          <lpage>647</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Goyal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guha</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Narayanan</surname>
            ,
            <given-names>S.S.</given-names>
          </string-name>
          :
          <article-title>A multimodal mixture-of-experts model for dynamic emotion prediction in movies</article-title>
          . In:
          <source>2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , pp.
          <fpage>2822</fpage>
          –
          <lpage>2826</lpage>
          .
          <publisher-name>IEEE</publisher-name>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Hanjalic</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>L.Q.</given-names>
          </string-name>
          :
          <article-title>Affective video content representation and modeling</article-title>
          .
          <source>IEEE Transactions on Multimedia</source>
          <volume>7</volume>
          (
          <issue>1</issue>
          ),
          <fpage>143</fpage>
          –
          <lpage>154</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Malandrakis</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potamianos</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Evangelopoulos</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zlatintsi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>A supervised approach to movie emotion tracking</article-title>
          .
          <source>In: 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          . pp.
          <fpage>2376</fpage>
          -
          <lpage>2379</lpage>
          .
          <publisher-name>IEEE</publisher-name>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Sivaprasad</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joshi</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agrawal</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pedanekar</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Multimodal continuous prediction of emotions in movies using long short-term memory networks</article-title>
          .
          <source>In: Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval</source>
          . pp.
          <fpage>413</fpage>
          -
          <lpage>419</lpage>
          .
          <publisher-name>ACM</publisher-name>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Witten</surname>
            ,
            <given-names>I.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hall</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pal</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          :
          <article-title>Data Mining: Practical machine learning tools and techniques</article-title>
          .
          <publisher-name>Morgan Kaufmann</publisher-name>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Yadati</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Katti</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kankanhalli</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>CAVVA: Computational affective video-in-video advertising</article-title>
          .
          <source>IEEE Transactions on Multimedia</source>
          <volume>16</volume>
          (
          <issue>1</issue>
          ),
          <fpage>15</fpage>
          -
          <lpage>23</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Synchronous prediction of arousal and valence using LSTM network for affective video content analysis</article-title>
          .
          <source>In: 2017 13th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD)</source>
          . pp.
          <fpage>727</fpage>
          -
          <lpage>732</lpage>
          .
          <publisher-name>IEEE</publisher-name>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tian</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          :
          <article-title>Affective visualization and retrieval for music video</article-title>
          .
          <source>IEEE Transactions on Multimedia</source>
          <volume>12</volume>
          (
          <issue>6</issue>
          ),
          <fpage>510</fpage>
          -
          <lpage>522</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>