RUCMM at MediaEval 2015 Affective Impact of Movies Task: Fusion of Audio and Visual Cues

Qin Jin*, Xirong Li*, Haibing Cao, Yujia Huo, Shuai Liao, Gang Yang, Jieping Xu
Multimedia Computing Lab, School of Information, Renmin University of China
Key Lab of Data Engineering and Knowledge Engineering, Renmin University of China
{qjin,xirong}@ruc.edu.cn

* Equal contribution and corresponding authors.

Copyright is held by the author/owner(s).
MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany

ABSTRACT
This paper summarizes our first participation in the Violent Scene Detection subtask of the MediaEval 2015 Affective Impact of Movies Task. We build violent scene detectors using both audio and visual cues. In particular, the audio cue is represented by bag-of-audio-words and Fisher vector encodings, while the visual cue is exploited by extracting CNN features from video frames. The detectors are implemented as two-class linear SVM classifiers. Evaluation shows that the audio detectors and the visual detectors are comparable and complementary to each other. Among our submissions, multi-modal late fusion leads to the best performance.

1. INTRODUCTION
The 2015 Affective Impact of Movies Task consists of two subtasks, Induced Affect Detection and Violence Detection, in which we participated for the first time. Violent scene detection (VSD), which automatically detects violent scenes in videos, is challenging due to large variations in video quality and content and the broad semantic meaning of violence. Violence is defined as “violent videos are those one would not let an 8 years old child see because of their physical violence”. MediaEval provides a common corpus and evaluation platform that encourages and enables competition and comparison among research teams. In this paper, we describe our VSD system for our first participation in MediaEval 2015 [8]. We focus on utilizing both audio and visual cues in the video for violent scene detection. Our audio-based system uses bag-of-audio-words and Fisher vector encodings, while our visual-based system uses deep features extracted by pretrained Convolutional Neural Network (CNN) models. We combine the two modalities via late fusion and investigate two weighting strategies: one assigns equal weights to the modalities, the other uses non-equal weights learned on a held-out subset of the development dataset.

2. SYSTEM DESCRIPTION
In this task, we build audio-only and visual-only subsystems and fuse the two modalities via late fusion. The feature representation and prediction model of each subsystem are described in the following subsections.

2.1 Audio Feature Representation
We chunk the audio stream into small segments with some overlap (for example, 3-sec segments with a 1-sec shift yield 2 sec of overlap between adjacent segments) and empirically find that a 2s segment length with a 1s shift achieves the best detection accuracy. We therefore use this setup.
We use Mel-frequency Cepstral Coefficients (MFCCs) as our fundamental frame-level feature. The MFCCs are computed over a sliding short-time window of 25ms with a 10ms shift [1]. Each 25ms frame of an audio segment is represented as a 39-dimensional MFCC feature vector (13-dimensional MFCC + delta + delta-delta), so an audio segment is represented by a set of MFCC feature vectors. Finally, we use two encoding strategies to transform this set of MFCC frames into a single fixed-dimension segment-level feature vector: Bag-of-Audio-Words (BoAW) and Fisher Vector (FV) [6].
Bag-of-Audio-Words: We first use an acoustic codebook to generate the segment-level feature vector. The codebook model is a common technique in document classification (bag-of-words) [10] and image classification (bag-of-visual-words) [5]. The bag-of-audio-words model represents each audio segment by assigning its low-level acoustic features (MFCCs) to a discrete set of codewords in the vocabulary (codebook), thus producing a histogram of codeword counts. The BoAW vocabulary is learned by applying the K-means clustering algorithm with K=4096 on the whole training dataset.
Fisher Vector: The Fisher Vector (FV) [6] representation can be seen as an extension of the bag-of-words representation. Both FV and BoAW are built on an intermediate representation, the audio vocabulary constructed in the low-level feature space. The Fisher encoding uses a Gaussian Mixture Model (GMM) to construct the audio word dictionary. We compute the gradient of the log-likelihood with respect to the parameters of the model to represent an audio segment. The Fisher Vector is the concatenation of these partial derivatives and describes in which direction the parameters of the model should be modified to best fit the data. A GMM with 256 mixtures is used in our experiments to generate the FV representation.
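To make the two segment-level encodings concrete, the sketch below shows one way to turn a segment's MFCC frames into a BoAW histogram and a Fisher vector. It is a minimal illustration rather than the exact implementation used in our system: it assumes librosa for MFCC extraction and scikit-learn for the K=4096 codebook and the 256-component GMM, and the normalisation choices (L1 for BoAW, power plus L2 for FV, following [6]) and helper names such as encode_boaw are assumptions made for the example.

```python
import numpy as np
import librosa
from sklearn.cluster import MiniBatchKMeans
from sklearn.mixture import GaussianMixture

def segment_mfcc(y, sr):
    """39-d MFCC frames (13 MFCC + delta + delta-delta), 25ms window, 10ms shift."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
    feats = np.vstack([mfcc,
                       librosa.feature.delta(mfcc),
                       librosa.feature.delta(mfcc, order=2)])
    return feats.T                                        # shape: (num_frames, 39)

def encode_boaw(frames, kmeans):
    """Histogram of codeword counts over the K=4096 codebook (L1-normalised here)."""
    counts = np.bincount(kmeans.predict(frames), minlength=kmeans.n_clusters)
    return counts / max(counts.sum(), 1)

def encode_fisher_vector(frames, gmm):
    """FV: gradients of the log-likelihood w.r.t. GMM means and (diagonal) variances."""
    T, _ = frames.shape
    post = gmm.predict_proba(frames)                      # (T, K) posteriors
    diff = (frames[:, None, :] - gmm.means_[None]) / np.sqrt(gmm.covariances_)[None]
    g_mu = np.einsum('tk,tkd->kd', post, diff) / (T * np.sqrt(gmm.weights_))[:, None]
    g_sig = np.einsum('tk,tkd->kd', post, diff ** 2 - 1) / (T * np.sqrt(2 * gmm.weights_))[:, None]
    fv = np.hstack([g_mu.ravel(), g_sig.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                # power normalisation [6]
    return fv / max(np.linalg.norm(fv), 1e-12)            # L2 normalisation [6]

# Codebook and GMM are fitted once on MFCC frames pooled from the training set,
# e.g. (train_frames is a placeholder for that pooled matrix):
# kmeans = MiniBatchKMeans(n_clusters=4096).fit(train_frames)
# gmm = GaussianMixture(n_components=256, covariance_type='diag').fit(train_frames)
```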
2.2 Visual Feature Representation
We consider both frame-level and video-level representations. Given a video, we uniformly extract its frames at an interval of 0.5 seconds and then extract CNN features from these frames. In particular, we employ two existing CNN models, i.e., the 16-layer VGGNet [7] and GoogLeNet [9]. The feature vectors are taken from the last fully connected layer of VGGNet and the pool5 layer of GoogLeNet, respectively. A video's feature vector is obtained by mean pooling the feature vectors of its frames.

2.3 Classification Model
For both the audio and visual systems, we train two-class linear SVM classifiers as violent scene detectors. A frame is considered a positive training example if its video is labelled as positive with respect to the violent class. To learn from the large number of training examples, we employ the Negative Bootstrap algorithm [3]. The algorithm takes a fixed number N of positive examples and iteratively selects the negative examples that are misclassified the most by the current classifiers. At each iteration, it randomly samples 10 × N negative examples from the remaining negatives as candidates, and the ensemble of classifiers trained in the previous iterations is used to classify each candidate. The top N most misclassified candidates are selected and used together with the N positive examples to train a new classifier. The algorithm takes several bags of positive examples and performs the training independently on each bag, resulting in multiple ensembles, which are then compressed into a single vector [2], making prediction very fast.

2.4 Prediction at Video Level
Detectors trained on the frame-level representations also make predictions at the frame level. To aggregate the frame-level scores to the video level, we first apply temporal smoothing to refine the per-frame scores. For the visual-based system we take the maximum response over a video's frames as its video score, while for the audio-based system the video score is obtained by averaging over its frames.
We fuse the audio and visual modalities via simple linear fusion at the decision score level. We experiment with two fusion strategies: 1) assigning equal fusion weights to each modality, and 2) learning the optimal fusion weights via coordinate ascent [4]. Illustrative sketches of the visual feature extraction, the Negative Bootstrap loop, and the aggregation and fusion steps are given below.
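To illustrate the frame sampling and CNN feature extraction of Section 2.2, the following sketch uses OpenCV for decoding and a torchvision VGG-16 (recent torchvision) as a stand-in for the pretrained VGGNet; the preprocessing, the choice of the 4096-d penultimate fully connected activation, and the function names are assumptions made for the example rather than details of our system.

```python
import cv2
import numpy as np
import torch
from torchvision import models, transforms

# Standard ImageNet preprocessing expected by torchvision's VGG-16 (assumed here).
preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

vgg = models.vgg16(weights="IMAGENET1K_V1").eval()
# Drop the final classification layer; the remaining 4096-d fully connected
# activation serves as the frame feature (an assumption standing in for the
# paper's "last fully connected layer").
vgg.classifier = torch.nn.Sequential(*list(vgg.classifier.children())[:-1])

def frame_features(video_path, step_sec=0.5):
    """Sample one frame every 0.5 s and return a (num_frames, 4096) array."""
    cap, feats, t = cv2.VideoCapture(video_path), [], 0.0
    while True:
        cap.set(cv2.CAP_PROP_POS_MSEC, t * 1000.0)
        ok, frame = cap.read()
        if not ok:
            break
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        with torch.no_grad():
            feats.append(vgg(preprocess(rgb).unsqueeze(0)).squeeze(0).numpy())
        t += step_sec
    cap.release()
    return np.stack(feats)

def video_feature(video_path):
    """Video-level descriptor: mean pooling over frame features (Section 2.2)."""
    return frame_features(video_path).mean(axis=0)
```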
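The Negative Bootstrap selection loop of Section 2.3 can be sketched as follows. This is a simplified single-bag version built on scikit-learn's LinearSVC; the number of iterations, the SVM cost parameter, and the first-round random selection are illustrative assumptions rather than the settings of [3].

```python
import numpy as np
from sklearn.svm import LinearSVC

def negative_bootstrap(pos, neg_pool, iterations=10, seed=0):
    """Iteratively pick the most misclassified negatives and retrain (cf. [3]).

    pos:      (N, d) positive examples, N fixed.
    neg_pool: (M, d) pool of negative examples, M >> N.
    Returns a single compressed linear model (w, b), i.e. the ensemble average,
    which is what makes prediction fast (cf. [2]).
    """
    rng = np.random.default_rng(seed)
    N = len(pos)
    remaining = np.arange(len(neg_pool))
    ensemble = []
    for _ in range(iterations):
        if len(remaining) < N:
            break
        cand_idx = rng.choice(remaining, size=min(10 * N, len(remaining)), replace=False)
        if ensemble:
            # Ensemble score = average decision value of the classifiers so far;
            # the highest-scoring negatives are the most misclassified ones.
            scores = np.mean([clf.decision_function(neg_pool[cand_idx]) for clf in ensemble], axis=0)
            picked = cand_idx[np.argsort(-scores)[:N]]
        else:
            picked = cand_idx[:N]          # first round: no classifier yet, pick at random
        X = np.vstack([pos, neg_pool[picked]])
        y = np.concatenate([np.ones(N), np.zeros(N)])
        ensemble.append(LinearSVC(C=1.0).fit(X, y))
        remaining = np.setdiff1d(remaining, picked)
    # Linear classifiers can be averaged into one weight vector and one bias.
    w = np.mean([clf.coef_.ravel() for clf in ensemble], axis=0)
    b = np.mean([clf.intercept_[0] for clf in ensemble])
    return w, b
```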
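A sketch of the video-level aggregation and late fusion of Section 2.4. The max/mean pooling, the linear score fusion, and the coordinate-ascent weight search follow the text above; the smoothing window, the weight grid, and the use of average precision on a held-out set as the objective are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def video_score(frame_scores, modality, win=3):
    """Temporal smoothing, then max (visual) or mean (audio) pooling (Section 2.4)."""
    smoothed = np.convolve(frame_scores, np.ones(win) / win, mode="same")
    return smoothed.max() if modality == "visual" else smoothed.mean()

def fuse(run_scores, weights):
    """Linear late fusion of per-run video scores: (num_videos, num_runs) -> (num_videos,)."""
    w = np.asarray(weights, dtype=float)
    return run_scores @ (w / max(w.sum(), 1e-12))

def learn_fusion_weights(run_scores, labels, sweeps=5, grid=np.linspace(0, 1, 21)):
    """Coordinate ascent on fusion weights, maximising AP on a held-out set (cf. [4])."""
    num_runs = run_scores.shape[1]
    weights = np.ones(num_runs) / num_runs            # equal weights = fusion strategy 1
    best_ap = average_precision_score(labels, fuse(run_scores, weights))
    for _ in range(sweeps):
        for j in range(num_runs):                     # optimise one weight at a time
            for w in grid:
                trial = weights.copy()
                trial[j] = w
                ap = average_precision_score(labels, fuse(run_scores, trial))
                if ap > best_ap:
                    best_ap, weights = ap, trial
    return weights, best_ap
```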
3. EXPERIMENTS

3.1 Dataset
There are in total 6,144 labelled videos for development in this year's task. We split the development set randomly into two partitions: 1) dev-train, consisting of 4,300 videos, among which 190 are labelled as violent, and 2) dev-val, consisting of 1,844 videos, among which 82 are labelled as violent. The detectors are trained on dev-train, with hyper-parameters tuned on dev-val.

3.2 Submitted Runs
All runs use the previously described subsystems or their fusion. We use the feature name to indicate a specific system. For instance, BoAW refers to the system using the BoAW feature, Frame-level VGGNet-CNN means the system is learned from frames represented by VGGNet-CNN features, and Video-level VGGNet-CNN means learning directly from video-level vectors. We submitted 5 runs:
Run1: Learned fusion of BoAW and FV.
Run2: Frame-level VGGNet-CNN.
Run3: Video-level VGGNet-CNN.
Run4: Average fusion of all audio and visual runs, including BoAW, FV, Frame-level VGGNet-CNN, Video-level VGGNet-CNN, Frame-level GoogLeNet-CNN, and Video-level GoogLeNet-CNN.
Run5: Learned fusion of all audio and visual runs.

Table 1: Performance of our VSD system with varied settings. Evaluation metric: MAP.

System setting                  | dev-val | test
BoAW                            | 0.320   | –
FV                              | 0.313   | –
Frame-level GoogLeNet-CNN       | 0.245   | –
Video-level GoogLeNet-CNN       | 0.296   | –
Run1 (BoAW + FV)                | 0.348   | 0.106
Run2 (Frame-level VGGNet-CNN)   | 0.347   | 0.118
Run3 (Video-level VGGNet-CNN)   | 0.308   | 0.120
Run4 (Average fusion)           | 0.485   | 0.216
Run5 (Learned fusion)           | 0.500   | 0.211

3.3 Results
The performance of our VSD system with varied settings is summarized in Table 1. We observe that fusion is always helpful: for the audio-only runs, fusing BoAW and FV brings additional gain, and fusing the audio and visual runs yields the best performance. Probably due to the divergence between the dev-val set and the test set, Run2 (Frame-level VGGNet-CNN) outperforms Run3 (Video-level VGGNet-CNN) on dev-val, whereas the latter is better on the test set. For the same reason, fusion with weights learned on dev-val (Run5) does not improve over average fusion (Run4) on the test set.
4. CONCLUSIONS
Our results show that both the audio and visual modalities perform violence detection well, that the two modalities are complementary to each other, and that simple late fusion of the two modalities enhances performance. The CNN features, although engineered without domain-specific information, generalize well to the VSD task. In future work, we will explore more effective fusion strategies to further improve detection performance.

Acknowledgements
This research was supported by the Fundamental Research Funds for the Central Universities and the Research Funds of Renmin University of China (No. 14XNLQ01), the National Science Foundation of China (No. 61303184), the Beijing Natural Science Foundation (No. 4142029), the Specialized Research Fund for the Doctoral Program of Higher Education (No. 20130004120006), and the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry.

5. REFERENCES
[1] Q. Jin, J. Liang, X. He, G. Yang, J. Xu, and X. Li. Semantic concept annotation for user generated videos using soundtracks. In ICMR, 2015.
[2] X. Li and C. Snoek. Classifying tag relevance with relevant positive and negative examples. In ACM MM, 2013.
[3] X. Li, C. Snoek, M. Worring, D. Koelma, and A. Smeulders. Bootstrapping visual categorization with relevant negatives. TMM, 15(4), 2013.
[4] X. Li, C. Snoek, M. Worring, and A. Smeulders. Fusing concept detection and geo context for visual search. In ICMR, 2012.
[5] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In CVPR, 2007.
[6] J. Sanchez, F. Perronnin, T. Mensink, and J. Verbeek. Image classification with the Fisher vector: Theory and practice. IJCV, 105(3), 2013.
[7] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[8] M. Sjöberg, Y. Baveye, H. Wang, V. L. Quang, B. Ionescu, E. Dellandrea, M. Schedl, C.-H. Demarty, and L. Chen. The MediaEval 2015 affective impact of movies task. In MediaEval 2015 Workshop, 2015.
[9] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
[10] X. Xue and Z. Zhou. Distributional features for text categorization. TKDE, 21(3), 2008.