RUCMM at MediaEval 2015 Affective Impact of Movies Task: Fusion of Audio and Visual Cues

Qin Jin*, Xirong Li*, Haibing Cao, Yujia Huo, Shuai Liao, Gang Yang, Jieping Xu
Multimedia Computing Lab, School of Information, Renmin University of China
Key Lab of Data Engineering and Knowledge Engineering, Renmin University of China
{qjin,xirong}@ruc.edu.cn

* Equal contribution and corresponding authors.

Copyright is held by the author/owner(s).
MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany

ABSTRACT
This paper summarizes our first participation in the Violent Scene Detection subtask of the MediaEval 2015 Affective Impact of Movies Task. We build violent scene detectors using both audio and visual cues. In particular, the audio cue is represented by bag-of-audio-words and Fisher vector encodings, while the visual cue is exploited by extracting CNN features from video frames. The detectors are implemented as two-class linear SVM classifiers. Evaluation shows that the audio detectors and the visual detectors are comparable and complementary to each other. Among our submissions, multi-modal late fusion leads to the best performance.

1. INTRODUCTION
The 2015 Affective Impact of Movies Task consists of two subtasks, Induced Affect Detection and Violence Detection, in which we participated for the first time. Violent scene detection (VSD), which automatically detects violent scenes in videos, is challenging due to large variations in video quality and content and the broad semantic meaning of violence. Violence is defined as “violent videos are those one would not let an 8 years old child see because of their physical violence”. MediaEval provides a common corpus and evaluation platform that encourages and enables competition and comparison among research teams. In this paper, we describe our VSD system for our first participation in MediaEval 2015 [8]. We focus on utilizing both audio and visual cues in the video for violent scene detection. Our audio-based system uses bag-of-audio-words and Fisher vector encodings, while our visual-based system uses deep features extracted by pretrained Convolutional Neural Network (CNN) models. We combine the two modalities via late fusion and investigate two weighting strategies: one assigns equal weights to the modalities, the other uses non-equal weights learned on a held-out subset of the development dataset.

2. SYSTEM DESCRIPTION
In this task, we build audio-only and visual-only subsystems and fuse the two modalities via late fusion. The feature representation and prediction model of each subsystem are described in the following subsections.

2.1 Audio Feature Representation
We chunk the audio stream into small segments with some overlap (for example, 3-sec segments with a 1-sec shift yield 2 sec of overlap between adjacent segments) and empirically find that a 2s segment length with a 1s shift achieves the best detection accuracy. We therefore use this setup.
We use Mel-frequency Cepstral Coefficients (MFCCs) as our fundamental frame-level feature. The MFCCs are computed over a sliding short-time window of 25ms with a 10ms shift [1]. Each 25ms frame of an audio segment is represented as a 39-dimensional MFCC feature vector (13-dimensional MFCC + delta + delta-delta), so an audio segment is represented by a set of MFCC feature vectors. Finally, we use two encoding strategies to transform this set of MFCC frames into a single fixed-dimension segment-level feature vector: Bag-of-Audio-Words (BoAW) and Fisher Vector (FV) [6].
Bag-of-Audio-Words: We first use an acoustic codebook to generate the segment-level feature vector. The codebook model is a common technique in document classification (bag-of-words) [10] and image classification (bag-of-visual-words) [5]. The bag-of-audio-words model represents each audio segment by assigning its low-level acoustic features (MFCCs) to a discrete set of codewords in the vocabulary (codebook), thus producing a histogram of codeword counts. The BoAW vocabulary is learned by applying the K-means clustering algorithm with K=4096 on the whole training dataset.
Fisher Vector: The Fisher Vector (FV) [6] representation can be seen as an extension of the bag-of-words representation. Both FV and BoAW are built on an intermediate representation, the audio vocabulary constructed in the low-level feature space. The Fisher encoding uses a Gaussian Mixture Model (GMM) to construct the audio word dictionary. We compute the gradient of the log-likelihood with respect to the parameters of the model to represent an audio segment. The Fisher Vector is the concatenation of these partial derivatives and describes in which direction the parameters of the model should be modified to best fit the data. A GMM with 256 mixtures is used in our experiments to generate the FV representation.
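To make the two segment-level encodings concrete, the sketch below shows one way to turn a segment's MFCC frames into a BoAW histogram and a Fisher vector. It is a minimal illustration rather than the exact implementation used in our system: it assumes librosa for MFCC extraction and scikit-learn for the K=4096 codebook and the 256-component GMM, and the normalisation choices (L1 for BoAW, power plus L2 for FV, following [6]) and helper names such as encode_boaw are assumptions made for the example.

```python
import numpy as np
import librosa
from sklearn.cluster import MiniBatchKMeans
from sklearn.mixture import GaussianMixture

def segment_mfcc(y, sr):
    """39-d MFCC frames (13 MFCC + delta + delta-delta), 25ms window, 10ms shift."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
    feats = np.vstack([mfcc,
                       librosa.feature.delta(mfcc),
                       librosa.feature.delta(mfcc, order=2)])
    return feats.T                                        # shape: (num_frames, 39)

def encode_boaw(frames, kmeans):
    """Histogram of codeword counts over the K=4096 codebook (L1-normalised here)."""
    counts = np.bincount(kmeans.predict(frames), minlength=kmeans.n_clusters)
    return counts / max(counts.sum(), 1)

def encode_fisher_vector(frames, gmm):
    """FV: gradients of the log-likelihood w.r.t. GMM means and (diagonal) variances."""
    T, _ = frames.shape
    post = gmm.predict_proba(frames)                      # (T, K) posteriors
    diff = (frames[:, None, :] - gmm.means_[None]) / np.sqrt(gmm.covariances_)[None]
    g_mu = np.einsum('tk,tkd->kd', post, diff) / (T * np.sqrt(gmm.weights_))[:, None]
    g_sig = np.einsum('tk,tkd->kd', post, diff ** 2 - 1) / (T * np.sqrt(2 * gmm.weights_))[:, None]
    fv = np.hstack([g_mu.ravel(), g_sig.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                # power normalisation [6]
    return fv / max(np.linalg.norm(fv), 1e-12)            # L2 normalisation [6]

# Codebook and GMM are fitted once on MFCC frames pooled from the training set,
# e.g. (train_frames is a placeholder for that pooled matrix):
# kmeans = MiniBatchKMeans(n_clusters=4096).fit(train_frames)
# gmm = GaussianMixture(n_components=256, covariance_type='diag').fit(train_frames)
```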
2.2 Visual Feature Representation
We consider both frame-level and video-level representations. Given a video, we uniformly extract its frames at an interval of 0.5 seconds and then extract CNN features from these frames. In particular, we employ two existing CNN models, i.e., the 16-layer VGGNet [7] and GoogLeNet [9]. The feature vectors are taken from the last fully connected layer of VGGNet and the pool5 layer of GoogLeNet, respectively. A video's feature vector is obtained by mean pooling the feature vectors of its frames.

2.3 Classification Model
For both the audio and visual systems, we train two-class linear SVM classifiers as violent scene detectors. A frame is considered a positive training example if its video is labelled as positive with respect to the violent class. To learn from the large number of training examples, we employ the Negative Bootstrap algorithm [3]. The algorithm takes a fixed number N of positive examples and iteratively selects the negative examples that are misclassified the most by the current classifiers. At each iteration, it randomly samples 10 × N negative examples from the remaining negatives as candidates, and the ensemble of classifiers trained in the previous iterations is used to classify each candidate. The top N most misclassified candidates are selected and used together with the N positive examples to train a new classifier. The algorithm takes several bags of positive examples and performs the training independently on each bag, resulting in multiple ensembles, which are then compressed into a single vector [2], making prediction very fast.

2.4 Prediction at Video Level
Detectors trained on the frame-level representations also make predictions at the frame level. To aggregate the frame-level scores to the video level, we first apply temporal smoothing to refine the per-frame scores. For the visual-based system we take the maximum response over a video's frames as its video score, while for the audio-based system the video score is obtained by averaging over its frames.
We fuse the audio and visual modalities via simple linear fusion at the decision score level. We experiment with two fusion strategies: 1) assigning equal fusion weights to each modality, and 2) learning the optimal fusion weights via coordinate ascent [4]. Illustrative sketches of the visual feature extraction, the Negative Bootstrap loop, and the aggregation and fusion steps are given below.
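To illustrate the frame sampling and CNN feature extraction of Section 2.2, the following sketch uses OpenCV for decoding and a torchvision VGG-16 (recent torchvision) as a stand-in for the pretrained VGGNet; the preprocessing, the choice of the 4096-d penultimate fully connected activation, and the function names are assumptions made for the example rather than details of our system.

```python
import cv2
import numpy as np
import torch
from torchvision import models, transforms

# Standard ImageNet preprocessing expected by torchvision's VGG-16 (assumed here).
preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

vgg = models.vgg16(weights="IMAGENET1K_V1").eval()
# Drop the final classification layer; the remaining 4096-d fully connected
# activation serves as the frame feature (an assumption standing in for the
# paper's "last fully connected layer").
vgg.classifier = torch.nn.Sequential(*list(vgg.classifier.children())[:-1])

def frame_features(video_path, step_sec=0.5):
    """Sample one frame every 0.5 s and return a (num_frames, 4096) array."""
    cap, feats, t = cv2.VideoCapture(video_path), [], 0.0
    while True:
        cap.set(cv2.CAP_PROP_POS_MSEC, t * 1000.0)
        ok, frame = cap.read()
        if not ok:
            break
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        with torch.no_grad():
            feats.append(vgg(preprocess(rgb).unsqueeze(0)).squeeze(0).numpy())
        t += step_sec
    cap.release()
    return np.stack(feats)

def video_feature(video_path):
    """Video-level descriptor: mean pooling over frame features (Section 2.2)."""
    return frame_features(video_path).mean(axis=0)
```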
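The Negative Bootstrap selection loop of Section 2.3 can be sketched as follows. This is a simplified single-bag version built on scikit-learn's LinearSVC; the number of iterations, the SVM cost parameter, and the first-round random selection are illustrative assumptions rather than the settings of [3].

```python
import numpy as np
from sklearn.svm import LinearSVC

def negative_bootstrap(pos, neg_pool, iterations=10, seed=0):
    """Iteratively pick the most misclassified negatives and retrain (cf. [3]).

    pos:      (N, d) positive examples, N fixed.
    neg_pool: (M, d) pool of negative examples, M >> N.
    Returns a single compressed linear model (w, b), i.e. the ensemble average,
    which is what makes prediction fast (cf. [2]).
    """
    rng = np.random.default_rng(seed)
    N = len(pos)
    remaining = np.arange(len(neg_pool))
    ensemble = []
    for _ in range(iterations):
        if len(remaining) < N:
            break
        cand_idx = rng.choice(remaining, size=min(10 * N, len(remaining)), replace=False)
        if ensemble:
            # Ensemble score = average decision value of the classifiers so far;
            # the highest-scoring negatives are the most misclassified ones.
            scores = np.mean([clf.decision_function(neg_pool[cand_idx]) for clf in ensemble], axis=0)
            picked = cand_idx[np.argsort(-scores)[:N]]
        else:
            picked = cand_idx[:N]          # first round: no classifier yet, pick at random
        X = np.vstack([pos, neg_pool[picked]])
        y = np.concatenate([np.ones(N), np.zeros(N)])
        ensemble.append(LinearSVC(C=1.0).fit(X, y))
        remaining = np.setdiff1d(remaining, picked)
    # Linear classifiers can be averaged into one weight vector and one bias.
    w = np.mean([clf.coef_.ravel() for clf in ensemble], axis=0)
    b = np.mean([clf.intercept_[0] for clf in ensemble])
    return w, b
```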
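A sketch of the video-level aggregation and late fusion of Section 2.4. The max/mean pooling, the linear score fusion, and the coordinate-ascent weight search follow the text above; the smoothing window, the weight grid, and the use of average precision on a held-out set as the objective are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def video_score(frame_scores, modality, win=3):
    """Temporal smoothing, then max (visual) or mean (audio) pooling (Section 2.4)."""
    smoothed = np.convolve(frame_scores, np.ones(win) / win, mode="same")
    return smoothed.max() if modality == "visual" else smoothed.mean()

def fuse(run_scores, weights):
    """Linear late fusion of per-run video scores: (num_videos, num_runs) -> (num_videos,)."""
    w = np.asarray(weights, dtype=float)
    return run_scores @ (w / max(w.sum(), 1e-12))

def learn_fusion_weights(run_scores, labels, sweeps=5, grid=np.linspace(0, 1, 21)):
    """Coordinate ascent on fusion weights, maximising AP on a held-out set (cf. [4])."""
    num_runs = run_scores.shape[1]
    weights = np.ones(num_runs) / num_runs            # equal weights = fusion strategy 1
    best_ap = average_precision_score(labels, fuse(run_scores, weights))
    for _ in range(sweeps):
        for j in range(num_runs):                     # optimise one weight at a time
            for w in grid:
                trial = weights.copy()
                trial[j] = w
                ap = average_precision_score(labels, fuse(run_scores, trial))
                if ap > best_ap:
                    best_ap, weights = ap, trial
    return weights, best_ap
```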
3. EXPERIMENTS

3.1 Dataset
There are in total 6,144 labelled videos for development in this year's task. We split the development set randomly into two partitions: 1) dev-train, consisting of 4,300 videos, among which 190 are labelled as violent, and 2) dev-val, consisting of 1,844 videos, among which 82 are labelled as violent. The detectors are trained on dev-train, with hyper-parameters tuned on dev-val.

3.2 Submitted Runs
All runs use the previously described subsystems or their fusion. We use the feature name to indicate a specific system. For instance, BoAW refers to the system using the BoAW feature, Frame-level VGGNet-CNN means the system is learned from frames represented by VGGNet-CNN features, and Video-level VGGNet-CNN means learning directly from video-level vectors. We submitted 5 runs:
Run1: Learned fusion of BoAW and FV.
Run2: Frame-level VGGNet-CNN.
Run3: Video-level VGGNet-CNN.
Run4: Average fusion of all audio and visual runs, including BoAW, FV, Frame-level VGGNet-CNN, Video-level VGGNet-CNN, Frame-level GoogLeNet-CNN, and Video-level GoogLeNet-CNN.
Run5: Learned fusion of all audio and visual runs.

Table 1: Performance of our VSD system with varied settings. Evaluation metric: MAP.

System setting                  | dev-val | test
BoAW                            | 0.320   | –
FV                              | 0.313   | –
Frame-level GoogLeNet-CNN       | 0.245   | –
Video-level GoogLeNet-CNN       | 0.296   | –
Run1 (BoAW + FV)                | 0.348   | 0.106
Run2 (Frame-level VGGNet-CNN)   | 0.347   | 0.118
Run3 (Video-level VGGNet-CNN)   | 0.308   | 0.120
Run4 (Average fusion)           | 0.485   | 0.216
Run5 (Learned fusion)           | 0.500   | 0.211

3.3 Results
The performance of our VSD system with varied settings is summarized in Table 1. We observe that fusion is always helpful: for the audio-only runs, fusing BoAW and FV brings additional gain, and fusing the audio and visual runs yields the best performance. Probably due to the divergence between the dev-val set and the test set, Run2 (Frame-level VGGNet-CNN) outperforms Run3 (Video-level VGGNet-CNN) on dev-val, whereas the latter is better on the test set. For the same reason, fusion with weights learned on dev-val (Run5) does not improve over average fusion (Run4) on the test set.
4. CONCLUSIONS
Our results show that both the audio and visual modalities perform violence detection well, that the two modalities are complementary to each other, and that simple late fusion of the two modalities enhances performance. The CNN features, although engineered without domain-specific information, generalize well to the VSD task. In future work, we will explore more effective fusion strategies to further improve detection performance.

Acknowledgements
This research was supported by the Fundamental Research Funds for the Central Universities and the Research Funds of Renmin University of China (No. 14XNLQ01), the National Science Foundation of China (No. 61303184), the Beijing Natural Science Foundation (No. 4142029), the Specialized Research Fund for the Doctoral Program of Higher Education (No. 20130004120006), and the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry.

5. REFERENCES
[1] Q. Jin, J. Liang, X. He, G. Yang, J. Xu, and X. Li. Semantic concept annotation for user generated videos using soundtracks. In ICMR, 2015.
[2] X. Li and C. Snoek. Classifying tag relevance with relevant positive and negative examples. In ACM MM, 2013.
[3] X. Li, C. Snoek, M. Worring, D. Koelma, and A. Smeulders. Bootstrapping visual categorization with relevant negatives. TMM, 15(4), 2013.
[4] X. Li, C. Snoek, M. Worring, and A. Smeulders. Fusing concept detection and geo context for visual search. In ICMR, 2012.
[5] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In CVPR, 2007.
[6] J. Sanchez, F. Perronnin, T. Mensink, and J. Verbeek. Image classification with the Fisher vector: Theory and practice. IJCV, 105(3), 2013.
[7] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[8] M. Sjöberg, Y. Baveye, H. Wang, V. L. Quang, B. Ionescu, E. Dellandrea, M. Schedl, C.-H. Demarty, and L. Chen. The MediaEval 2015 affective impact of movies task. In MediaEval 2015 Workshop, 2015.
[9] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
[10] X. Xue and Z. Zhou. Distributional features for text categorization. TKDE, 21(3), 2008.