<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>AUTH-SGP in MediaEval 2016 Emotional Impact of Movies Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Timoleon Anastasia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Electrical &amp; Computer Engineering, Aristotle University of Thessaloniki</institution>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <fpage>20</fpage>
      <lpage>21</lpage>
      <abstract>
        <p>This paper presents our submission to the MediaEval 2016 Emotional Impact of Movies task. The tested and adopted solutions are described, and the merits of each feature set relative to the others are discussed. The conclusions agree with state-of-the-art findings and bring new insights into the understanding of emotion prediction.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>In recent years, videos have become the main medium through
which many people interact with each other and share
information. There is therefore a growing need to evaluate the
quality of this interaction in terms of emotions, not only to
analyze the video content itself. To serve this purpose, video
affective content analysis has gained interest among
researchers [12]. Many audiovisual features can be useful for
depicting emotion. For example, imagine a film whose
background is full of warm colors. This can induce positive
emotions in the viewers, namely emotions with high valence
values. Motion is another important film element that can shape
a video's emotion: films with large motion intensity can cause
stronger emotions, with higher arousal scores. This task aims
precisely at predicting the emotional feedback of users while
they watch films of different genres [6].</p>
    </sec>
    <sec id="sec-2">
      <title>2. SYSTEM DESCRIPTION</title>
    </sec>
    <sec id="sec-3">
      <title>2.1 Feature Extraction</title>
      <p>The key points of our system can be summarized as follows:
first, we extract multi-modal features that can successfully
represent emotion. These can be either local features, from
specific patches of the video frames or from overlapping time
windows of the sound signal, or global features computed
directly from the entire image [9]. In the first case, a feature
encoding technique must be applied in order to convert the
local features into global ones. We examined the Bag-of-Words
and Fisher Vector approaches [9]. Finally, the extracted
features are regressed and/or combined in order to predict the
emotion scores.</p>
      <sec id="sec-3-1">
        <title>2.1.1 Development-Data Features</title>
        <p>These features were provided by the organizers of the task.
A great variety of features was given, including diverse
features from the audio signals of the movies, features
regarding the scene cuts, and more. These features were used
almost directly; the only preprocessing step was normalization,
subtracting the mean value and dividing by the standard
deviation of each column.</p>
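        <p>For concreteness, a minimal sketch of this per-column standardization, assuming the features are loaded as NumPy arrays (variable names are illustrative, not from the original implementation):</p>
        <preformat><![CDATA[
import numpy as np

def standardize_columns(X_train, X_test):
    """Z-score each feature column with the training-set statistics:
    subtract the mean and divide by the standard deviation."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant columns
    return (X_train - mu) / sigma, (X_test - mu) / sigma
]]></preformat>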
      </sec>
      <sec id="sec-3-2">
        <title>2.1.2 Improved Dense Trajectories (IDT)</title>
        <p>These features provide information about the motion in the
videos and are calculated at different spatial and temporal
scales [11]. They are extensively used to classify human
actions. We resized the original videos to 320x240. Then,
several descriptors were calculated for each trajectory (length
of 15 frames), including the Histogram of Oriented Gradients
(HOG), the Histogram of Optical Flow (HOF), and the Motion
Boundary Histograms along the x and y axes (MBHx and
MBHy). The total number of descriptor dimensions for each
trajectory is 426 (30+96+108+96+96) [1].</p>
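        <p>The IDT binary of [11] emits one concatenated descriptor per trajectory; a hedged sketch of how such a 426-dimensional row could be sliced back into its parts, assuming the layout follows the counts above (the array and the names are hypothetical):</p>
        <preformat><![CDATA[
import numpy as np

# Assumed layout per trajectory row, following the counts above:
# Trajectory(30) + HOG(96) + HOF(108) + MBHx(96) + MBHy(96) = 426
SLICES = {
    "traj": slice(0, 30),
    "hog":  slice(30, 126),
    "hof":  slice(126, 234),
    "mbhx": slice(234, 330),
    "mbhy": slice(330, 426),
}

def split_idt(descriptors):
    """Split a (num_trajectories, 426) array into per-descriptor blocks."""
    return {name: descriptors[:, s] for name, s in SLICES.items()}
]]></preformat>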
        <p>For the conversion of the local features into global ones,
the Fisher Vector approach was used. A Gaussian Mixture
Model (GMM) was employed to construct a codebook with k
words for each descriptor (k = 64). A total of 2,500,000 points
were sampled from the descriptors of the development-train set
to train the GMM. The features of each descriptor are then
individually projected via PCA to half of their dimensions,
resulting in 213 dimensions per trajectory, and encoded using
the Fisher Kernel method. Power and L2-normalization schemes
were applied to each descriptor and to the resulting vectors,
which can improve the performance of the system. Finally, an
entire video can be described by a vector of 27,264 features
(= 2 [mean value and standard deviation of the Gaussian
model] x 213 [features] x 64 [codebook size]).</p>
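        <p>A condensed sketch of this encoding pipeline with scikit-learn, keeping only the Fisher vector gradients with respect to the Gaussian means and standard deviations (hence the 2 x D x K dimensionality stated above); this is an illustration under those assumptions, not the exact implementation used:</p>
        <preformat><![CDATA[
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def train_codebook(sampled_descs, k=64):
    """Fit PCA (halving the dimension) and a diagonal-covariance GMM
    codebook on points sampled from the development-train set."""
    pca = PCA(n_components=sampled_descs.shape[1] // 2).fit(sampled_descs)
    gmm = GaussianMixture(n_components=k, covariance_type="diag")
    gmm.fit(pca.transform(sampled_descs))
    return pca, gmm

def fisher_vector(local_descs, pca, gmm):
    """Encode one video's local descriptors as a power- and
    L2-normalized Fisher vector of size 2 * D * K."""
    x = pca.transform(local_descs)                      # (N, D)
    q = gmm.predict_proba(x)                            # (N, K) posteriors
    n = x.shape[0]
    mu, sigma = gmm.means_, np.sqrt(gmm.covariances_)   # (K, D) each
    pi = gmm.weights_                                   # (K,)
    diff = (x[:, None, :] - mu[None, :, :]) / sigma     # (N, K, D)
    g_mu = (q[:, :, None] * diff).sum(0) / (n * np.sqrt(pi)[:, None])
    g_sg = (q[:, :, None] * (diff ** 2 - 1)).sum(0) / (n * np.sqrt(2 * pi)[:, None])
    fv = np.hstack([g_mu.ravel(), g_sg.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))              # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)            # L2 normalization
]]></preformat>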
      </sec>
      <sec id="sec-3-3">
        <title>2.1.3 Deep Learning Features</title>
        <p>Deep learning is a modern subfield of computer vision and
machine learning which uses artificial neural networks,
combined with the principles of convolution on images, to
describe pictures using more abstract, high-level features. We
used the popular BVLC Caffe deep learning framework and
treated the BVLC Reference CaffeNet pre-trained model as a
feature extractor [8]. In particular, this network contains 5
convolutional layers, 2 fully connected layers, and a soft-max
classifier. We extracted features from the last fully connected
layer, which outputs 4096 activations.</p>
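        <p>A sketch of this extraction step with Caffe's Python interface; the prototxt and weight file names follow the standard Caffe distribution, and the keyframe path is a placeholder:</p>
        <preformat><![CDATA[
import numpy as np
import caffe

# Load the BVLC Reference CaffeNet as a fixed feature extractor
net = caffe.Net('deploy.prototxt',
                'bvlc_reference_caffenet.caffemodel',
                caffe.TEST)

# Standard Caffe preprocessing: HWC -> CHW, RGB -> BGR, [0,1] -> [0,255]
transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2, 0, 1))
transformer.set_channel_swap('data', (2, 1, 0))
transformer.set_raw_scale('data', 255)

img = caffe.io.load_image('keyframe.jpg')        # placeholder keyframe path
crops = caffe.io.oversample([img], (227, 227))   # the 10 crops per image
data = np.asarray([transformer.preprocess('data', c) for c in crops])

net.blobs['data'].reshape(*data.shape)
net.blobs['data'].data[...] = data
net.forward()
fc7 = net.blobs['fc7'].data.copy()               # (10, 4096) activations
]]></preformat>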
        <p>The input frames were the keyframes of the 10-second video
segments, of size 256x256 [5]. Instead of averaging the results
over the 10 crops that the network produces for each image,
the 4096 output activations of each of the 10 crops were kept,
resulting in a 10x4096 feature representation for each video.
Then the classic Bag-of-Words concept was used to encode
these features. The size of the codebook was 8, and the
BOWKMeansTrainer class from OpenCV [7] was used to find
the clusters. Each video was finally represented by an 8-bin
normalized histogram of the frequency of appearance of each
codeword. These features were added to the development-data
features to explore whether the performance actually improves
with their presence.</p>
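        <p>A simplified sketch of this Bag-of-Words step with OpenCV's BOWKMeansTrainer [7]; the nearest-codeword assignment is written out with NumPy for clarity:</p>
        <preformat><![CDATA[
import numpy as np
import cv2

def build_codebook(all_activations, k=8):
    """Cluster fc7 activations from the training videos into k codewords."""
    bow = cv2.BOWKMeansTrainer(k)
    bow.add(all_activations.astype(np.float32))
    return bow.cluster()                         # (k, 4096) cluster centers

def bow_histogram(video_activations, codebook):
    """Assign each of a video's 10 crop activations to its nearest
    codeword and return the normalized 8-bin frequency histogram."""
    dists = np.linalg.norm(
        video_activations[:, None, :] - codebook[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()
]]></preformat>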
      </sec>
      <sec id="sec-3-4">
        <title>2.1.4 Dense SIFT Features</title>
        <p>The SIFT descriptor was used on the re-scaled videos. A
common approach when dealing with videos is to densely
compute SIFT features over pixel neighborhoods in the frames,
with a specific stride (counted in pixels) and a specific frame
step. In our approach, the neighborhood size is 10x10, and a
new SIFT descriptor is calculated every 5 pixels and every 5
frames [10]. After the extraction of the dense SIFT features,
PCA is applied to reduce the dimension of the descriptor from
128 to 64. Finally, the Fisher Vector is applied, in a similar
manner to the IDT approach.</p>
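        <p>The dense SIFT features come from VLFeat [10]; as a rough stand-in, the same grid sampling can be sketched with OpenCV by placing SIFT keypoints on a fixed lattice with the stride and patch size given above:</p>
        <preformat><![CDATA[
import cv2

def dense_sift(gray, step=5, size=10):
    """Compute SIFT descriptors on a regular grid: one keypoint of
    diameter `size` every `step` pixels (an OpenCV approximation of
    VLFeat's dense SIFT)."""
    sift = cv2.SIFT_create()
    h, w = gray.shape
    kps = [cv2.KeyPoint(float(x), float(y), float(size))
           for y in range(0, h, step)
           for x in range(0, w, step)]
    _, desc = sift.compute(gray, kps)
    return desc                                  # (num_points, 128)
]]></preformat>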
      </sec>
      <sec id="sec-3-5">
        <title>2.1.5 Hue Saturation Histogram (HSH)</title>
        <p>As mentioned above, different colors can evoke different
kinds of emotions. We converted the frames from RGB to Hue
Saturation Value (HSV) space and then computed a
two-dimensional histogram keeping only the hue and saturation
channels. The number of hue bins was 15, while the number of
saturation bins was 16. An HSH was calculated every 5 frames,
exactly like the dense SIFT descriptor. Finally, the PCA and
Fisher Vector approaches were applied.</p>
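        <p>A minimal sketch of this color descriptor with OpenCV (note that OpenCV's hue channel ranges over 0-180 for 8-bit images):</p>
        <preformat><![CDATA[
import cv2

def hue_saturation_histogram(frame_bgr, hue_bins=15, sat_bins=16):
    """Compute the 2-D hue/saturation histogram described above and
    flatten it into a (hue_bins * sat_bins,) local descriptor."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], None, [hue_bins, sat_bins],
                        [0, 180, 0, 256])
    return cv2.normalize(hist, None).flatten()   # L2-normalized, 240-dim
]]></preformat>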
      </sec>
      <sec id="sec-3-6">
        <title>2.1.6 Audio Features</title>
        <p>We used the Mel Frequency Cepstral Coefficients (MFCC)
as the representative audio feature [3]. Each video can be
described by three different types of MFCCs. The first type is
the short-term descriptor, where the input audio signal is
divided into overlapping windows of size 32 ms (with 50%
overlap) and a cepstral representation is computed for each of
them. The other two types of descriptors are the mean and
standard deviation of the above-mentioned features, resulting
in a 39-dimensional (3x13) vector. Finally, PCA dimension
reduction and encoding with the Fisher Vector were
employed.</p>
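        <p>The MFCCs were computed with pyAudioAnalysis [3]; the windowing and the mean/std summaries can be illustrated with librosa as a stand-in (the parameters follow the description above, the library choice is ours):</p>
        <preformat><![CDATA[
import librosa

def mfcc_summary(path, sr=16000):
    """13 short-term MFCCs over 32 ms windows with 50% overlap, plus
    their per-coefficient mean and standard deviation."""
    y, sr = librosa.load(path, sr=sr)
    win = int(0.032 * sr)                        # 32 ms window
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=win, hop_length=win // 2)
    return mfcc, mfcc.mean(axis=1), mfcc.std(axis=1)
]]></preformat>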
      </sec>
    </sec>
    <sec id="sec-4">
      <title>2.2 Regression</title>
      <p>As far as regression is concerned, Support Vector
Regression (SVR) [4] is employed in this project. For each
task, a grid-search cross-validation scheme was used in order
to determine the best hyper-parameters C and gamma, as well
as the type of kernel for each model. We investigated radial
basis function and linear kernels, while C and gamma were in
the ranges [0.01, 10] and [0.001, 1] respectively. The objective
function to be maximized was the Pearson correlation
coefficient between predicted and real output values. The
cross-validation scheme we followed was simple k-fold
validation with k = 5. The distribution of movie genres across
the train and validation sets was not taken into account,
although accounting for it is a promising future direction.</p>
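      <p>A sketch of this model selection with scikit-learn [4], using the Pearson correlation coefficient as the cross-validation objective (the grid values follow the ranges above; the exact grids used are not reported):</p>
      <preformat><![CDATA[
import numpy as np
from scipy.stats import pearsonr
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

# Pearson correlation between predictions and targets as the objective
pearson_scorer = make_scorer(lambda y, y_pred: pearsonr(y, y_pred)[0])

param_grid = [
    {"kernel": ["rbf"],
     "C": np.logspace(-2, 1, 4),        # 0.01 .. 10
     "gamma": np.logspace(-3, 0, 4)},   # 0.001 .. 1
    {"kernel": ["linear"],
     "C": np.logspace(-2, 1, 4)},
]

search = GridSearchCV(SVR(), param_grid, scoring=pearson_scorer, cv=5)
# search.fit(X_train, y_valence)  # one model per target (valence, arousal)
]]></preformat>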
    </sec>
    <sec id="sec-4b">
      <title>3. RESULTS AND DISCUSSION</title>
      <p>1st sub-task. We submitted a total of 5 runs, for the first
sub-task only. The first run used only the already-extracted
features from the development-data. The second run combined
these features with the deep learning ones; the features were
concatenated horizontally and then regressed. The third run
includes only the features from the improved dense trajectories.
The fourth run contains only the HSH, MFCC, DSIFT, and IDT
features. The fifth run mixes the features from the two previous
runs. Due to the large size of the feature space, for the last run
a linear late-fusion strategy was implemented and the scores of
the two regressors were combined linearly [2].</p>
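      <p>A hedged sketch of this linear late fusion: the mixing weight is chosen on validation data to maximize the Pearson correlation, in the spirit of [2] (the grid of candidate weights is an assumption):</p>
      <preformat><![CDATA[
import numpy as np
from scipy.stats import pearsonr

def fuse_predictions(pred_a, pred_b, y_val):
    """Pick the weight w maximizing the Pearson correlation of
    w*pred_a + (1-w)*pred_b against the validation targets."""
    weights = np.linspace(0.0, 1.0, 101)
    scores = [pearsonr(w * pred_a + (1 - w) * pred_b, y_val)[0]
              for w in weights]
    w = weights[int(np.argmax(scores))]
    return w, w * pred_a + (1 - w) * pred_b
]]></preformat>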
      <p>Table 1 displays the name of each run, whether it was an
external or a required run, and the Pearson correlation
coefficients for the valence and arousal models separately, on
both the development set and the release test set. Some cells
of the table do not provide scores for the release test set,
because these runs were executed after the corresponding
deadline. It should be pointed out that some videos had too
little movement and no IDT features could be extracted, so the
models of Runs 3, 4, and 5 were trained, validated, and
evaluated on a slightly smaller set of videos (9786 instead of
the total 9800 movie segments).</p>
      <p>2nd sub-task. It is also worth mentioning that an attempt
was made at the second sub-task. A deep learning model was
trained from scratch for the two variables (valence and arousal)
separately. Because there were difficulties with the convergence
of these models and the results were not encouraging, we
decided not to publish them.</p>
    </sec>
    <sec id="sec-5">
      <title>4. CONCLUSIONS</title>
      <p>Comparing Run 1 and Run 2, we can conclude that deep
learning features do actually improve the performance of the
system. From Run 3 and Run 4 we notice that the IDT features
(Run 3), which represent motion, are more important for
arousal prediction (emotion intensity), while the HSH features
in Run 4, which capture color, contribute more to the
performance of the valence model (positive versus negative
emotions). These conclusions are also confirmed by the
findings in the bibliography [12]. Finally, combining the
features from Run 3 and Run 4 leads to a satisfying
improvement of both models.</p>
    </sec>
    <sec id="sec-6">
      <title>5. REFERENCES</title>
      <p>[1] Activity Recognition in Videos using the UCF101 dataset. https://github.com/anenbergb/CS221_Project.</p>
      <p>[2] Finding optimized weights when combining classifiers. https://www.kaggle.com/c/otto-group-product-classification-challenge/forums/t/13868/ensamble-weights/75870#post75870.</p>
      <p>[3] pyAudioAnalysis: A Python library for audio feature extraction, classification, segmentation and applications. https://github.com/tyiannak/pyAudioAnalysis.</p>
      <p>[4] Scikit-learn: Machine learning in Python. http://scikit-learn.org/stable/.</p>
      <p>[5] Y. Baveye, E. Dellandrea, C. Chamaret, and L. Chen. Deep learning vs. kernel methods: Performance for emotion prediction in videos. In 2015 Humaine Association Conference on Affective Computing and Intelligent Interaction (ACII), 2015.</p>
      <p>[6] E. Dellandrea, L. Chen, Y. Baveye, M. Sjoberg, and C. Chamaret. The MediaEval 2016 Emotional Impact of Movies Task. In Proc. of the MediaEval 2016 Workshop, Hilversum, Netherlands, Oct. 20-21, 2016.</p>
      <p>[7] Itseez. Open source computer vision library. https://github.com/itseez/opencv, 2015.</p>
      <p>[8] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR '14, pages 1725-1732, Washington, DC, USA, 2014. IEEE Computer Society.</p>
      <p>[9] D. Paschalidou and A. Delopoulos. Event detection on video data with topic modeling algorithms. Master's thesis, Department of Electrical and Computer Engineering, Aristotle University of Thessaloniki, Nov. 2015.</p>
      <p>[10] A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms. In Proceedings of the 18th ACM International Conference on Multimedia, MM '10, pages 1469-1472, New York, NY, USA, 2010. ACM.</p>
      <p>[11] H. Wang and C. Schmid. Action recognition with improved trajectories. In IEEE International Conference on Computer Vision, Sydney, Australia, 2013.</p>
      <p>[12] S. Wang and Q. Ji. Video affective content analysis: A survey of state-of-the-art methods. IEEE Transactions on Affective Computing, 6(4):410-430, Oct. 2015.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>