TUDCL at MediaEval 2013 Violent Scenes Detection: Training with Multi-modal Features by MKL

Shinichi Goto (1), Terumasa Aoki (1, 2)
(1) Graduate School of Information Sciences, (2) New Industry Creation Hatchery Center
Tohoku University, Miyagi, Japan
{s-goto, aoki}@riec.tohoku.ac.jp

ABSTRACT

The purpose of this paper is to describe the work carried out by team TUDCL for the Violent Scenes Detection task at MediaEval 2013. Our work is based on the combination of visual, temporal and audio features with machine learning at segment level. A block-saliency-map based dense trajectory is proposed for the visual and temporal features, and MFCC and delta-MFCC are used for the audio features. For classification, Multiple Kernel Learning is applied, which is effective when multi-modal features are available.

1. INTRODUCTION

The MediaEval 2013 Affect Task [1] is intended to detect violent scenes in movies. Although two different definitions of violent events are provided this year, our algorithm is developed only to solve the task for the objective definition, which is "physical violence or accident resulting in human injury or pain."

2. APPROACH

Rather than focusing on video shots from the beginning, our approach first handles fixed-length segments, each of which has 20 frames (0.8 seconds at 25 fps). After segment-based scores are calculated from the extracted feature vectors by machine learning, shot-based scores are generated.

For our runs only the violent and non-violent ground truth is used; neither high-level concepts nor external data are used.

2.1 Visual and Temporal Features

Both visual and temporal features based on dense trajectories [2] are calculated at every frame. The original dense trajectory algorithm samples points densely in every frame except in homogeneous image areas; we additionally apply the saliency maps proposed by Itti et al. [3] to increase precision, on the assumption that events related to violence are located in the areas people tend to pay attention to.

In our algorithm, a normal saliency map is first generated and then transformed into a block-based map by averaging the saliency values within a fixed block area, so that dense sampling can be applied while varying the sampling step size and the maximum spatial scale level according to the saliency level. For instance, the most salient blocks in an image are densely sampled with the smallest step size, which guarantees that the more salient a block is, the more points are obtained there. Figure 1 shows an example of our dense sampling compared with normal dense sampling: our algorithm samples more points in salient regions and fewer points in non-salient regions, whereas normal dense sampling takes points more uniformly over the whole frame. Note that points in homogeneous areas have already been removed.

Figure 1: Example of dense sampling using a saliency map: original image (upper left), normal dense sampling (upper right), block saliency map (bottom left), our dense sampling (bottom right).
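To make the block-based sampling concrete, the following is a minimal sketch of the idea, assuming a per-pixel saliency map (e.g., from an Itti-style model) is already available. The block size, the three candidate step sizes and the quantile thresholds are illustrative choices rather than the exact parameters of our runs, and the homogeneous-area filtering and spatial-scale adaptation of the full method are omitted.

```python
import numpy as np

def block_saliency_map(saliency, block=16):
    """Average a per-pixel saliency map over fixed-size blocks."""
    h, w = saliency.shape
    bh, bw = h // block, w // block
    crop = saliency[:bh * block, :bw * block]
    return crop.reshape(bh, block, bw, block).mean(axis=(1, 3))

def sample_points(saliency, block=16, steps=(4, 8, 16)):
    """Dense sampling whose step size shrinks as block saliency grows,
    so that more points are taken in more salient blocks."""
    bmap = block_saliency_map(saliency, block)
    # split the block saliency values into as many levels as there are step sizes
    thresholds = np.quantile(bmap, [1.0 / 3, 2.0 / 3])
    levels = np.digitize(bmap, thresholds)      # 0 = least salient, 2 = most salient
    points = []
    for (by, bx), level in np.ndenumerate(levels):
        step = steps[len(steps) - 1 - level]    # most salient block -> smallest step
        for y in range(by * block, (by + 1) * block, step):
            for x in range(bx * block, (bx + 1) * block, step):
                points.append((x, y))           # candidate point for a new trajectory
    return points

# illustrative usage with a random stand-in for a saliency map
rng = np.random.default_rng(0)
pts = sample_points(rng.random((240, 320)))
```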
Trajectories, MBH and, additionally, RGB histograms around the trajectories are extracted as visual and temporal information. HOG and HOF, which are also proposed in [2], are not used, because those features contributed poorly in our test runs.

All features are converted to Bag-of-Words form in each segment, giving a 200-d trajectory histogram, a 200-d MBH-x histogram, a 200-d MBH-y histogram and a 400-d RGB histogram. In total, a 1000-d feature vector is used as the visual and temporal feature for classification.

2.2 Audio Features

MFCC, delta-MFCC and audio energy are calculated every 20 ms with 10 ms overlap to create a 200-d Bag-of-Audio-Words vector for each 0.8-second segment.

2.3 Classifier Learning

Although a conventional way of tackling this classification problem is to use a Support Vector Machine (SVM), we apply Multiple Kernel Learning (MKL), which aims at finding optimized weights when multiple SVM kernels are combined [4]. This suits our case well, since multiple feature spaces exist. The whole kernel is composed of multiple base kernels and is computed according to the following equation:

    K(x_i, x_j) = \sum_k d_k K_k(x_i, x_j)    (1)

where the K_k are base kernels and d_k is the weight of each kernel. In our case, kernels for the trajectory, x-direction MBH, y-direction MBH, RGB-histogram and audio features are prepared. As the kernel function, the Histogram Intersection Kernel (HIK) is used, since all of our features are histogram-based.
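As a concrete illustration of Eq. (1) with HIK base kernels, the sketch below builds the combined Gram matrix from per-modality Bag-of-Words histograms. The feature dimensions follow Sections 2.1 and 2.2, but the data and the weights are placeholders; learning the weights d_k themselves requires an MKL solver, which is not shown here.

```python
import numpy as np

def hik(X, Y):
    """Histogram Intersection Kernel: K[i, j] = sum_b min(X[i, b], Y[j, b])."""
    return np.minimum(X[:, None, :], Y[None, :, :]).sum(axis=2)

def combined_kernel(features, weights):
    """Eq. (1): K = sum_k d_k * K_k, with one HIK base kernel per modality."""
    return sum(d * hik(X, X) for d, X in zip(weights, features))

# illustrative usage with random Bag-of-Words histograms for the five modalities
rng = np.random.default_rng(0)
n = 50                                    # number of training segments
dims = (200, 200, 200, 400, 200)          # trajectory, MBH-x, MBH-y, RGB, audio
feats = [rng.random((n, d)) for d in dims]
weights = [0.32, 0.36, 0.37, 0.35, 0.31]  # placeholder values; MKL learns these (cf. Table 1)
K = combined_kernel(feats, weights)       # (n, n) precomputed Gram matrix
```

Such a precomputed Gram matrix can then be passed to any SVM implementation that accepts precomputed kernels (for instance, scikit-learn's SVC(kernel='precomputed')).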
Although MKL can find optimal weights, we found that these values differ from movie to movie. Table 1 shows the difference between the weights learned from three different training movies. Therefore, a classifier is first learned separately for each training movie to give a binary classification for each segment, and these classifiers are finally integrated in the way described below.

Table 1: Weights learned by MKL for three different training movies.

    Movie               Audio   Traj.   MBHx    MBHy    RGB
    Armageddon          0.307   0.319   0.359   0.373   0.350
    The Sixth Sense     0.450   0.180   0.407   0.440   0.171
    Dead Poets Society  0.297   0.267   0.425   0.462   0.286

2.4 Integration

The first step here is to calculate a pre-final violence score for each segment. To do so, for each segment in a test movie we simply count the number of classifiers that classify that segment as violent. The score s_i for the i-th segment is therefore:

    s_i = \sum_{m=0}^{M-1} c_i(m),    c_i(m) \in \{0, 1\}    (2)

where c_i(m) is the result of the binary classification by the m-th classifier, with 0 for non-violence and 1 for violence, and M is the total number of classifiers, which equals the number of training movies.

Finally, a moving average is calculated as a smoothing step for each test movie in order to obtain the final score s'_i of every segment:

    s'_i = ( s_i + \sum_{n=1}^{N} \alpha^n (s_{i-n} + s_{i+n}) ) / (2N + 1),    0 < \alpha < 1    (3)

where \alpha is a smoothing coefficient and N is the neighborhood range around a segment. We used \alpha = 0.5 and N = 2.

This integration process is needed in order to take the continuity of segments into account. Besides, since each classifier learns only one training movie, the violence concepts that this movie does not contain can easily be missed by it; combining all classifiers mitigates this. Scores for shots are obtained by converting the segment-based scores after computing a score per frame. If the score is higher than a threshold, the segment or shot is classified as violent. We chose 0.1 as the segment threshold, and 0.03 and 0.06 as the shot thresholds.

3. RESULTS AND DISCUSSION

Shot-based results of our runs are shown in Table 2, and segment-based results are shown in Table 3. The only difference between mkl-shot-hik-1 and mkl-shot-hik-2 is the value of the scoring threshold (0.03 for the former, 0.06 for the latter), which therefore does not affect MAP@100. In addition to our main runs, results obtained with a normal SVM with an RBF kernel are shown for comparison; no MAP@100 score is given for the SVM runs because they only produce binary classification decisions and no score.

Table 2: Results of shot-level runs (all metrics are AED metrics).

    Run             MAP@100   Prec.    Rec.    F-sc.
    mkl-shot-hik-1  0.470     0.222    0.726   0.340
    mkl-shot-hik-2  0.470     0.284    0.609   0.387
    svm-shot-rbf    -         0.0976   0.738   0.172

Table 3: Results of segment-level runs.

    Run          MAP@100   Prec.    Rec.    F-sc.
    mkl-seg-hik  0.343     0.214    0.309   0.253
    svm-seg-rbf  -         0.0473   0.466   0.0859

Our results show that the approach of Multiple Kernel Learning with the HIK kernel is effective for violent scenes detection, though its F-score is still not high enough. We investigated this and came to the presumption that segments with frequent camera motion, multiple people and loud sound tend to be misclassified as violent. On the other hand, commonly missed violent segments are violent scenes without sound, such as a scene in which a man is wringing another man's neck. It is reasonable to suppose that segments in which the multi-modality cannot be exploited are likely to be missed.

Although MBH, which was proposed as being robust to camera motion, is extracted, the trajectories themselves are easily affected by camera motion, which makes them unreliable. Some countermeasure against this problem is therefore imperative.

It should also be added that, since each classifier has learned a single training movie, the amount of training feature vectors per classifier might not be sufficient compared with the case in which one classifier learns all movies simultaneously. Since not enough comparisons with other methods have been made yet, we will continue our investigation.

4. REFERENCES

[1] C. Demarty, C. Penet, M. Schedl, B. Ionescu, V. L. Quang, and Y. Jiang. The MediaEval 2013 Affect Task: Violent Scenes Detection. In MediaEval 2013 Workshop, Barcelona, Spain, October 18-19, 2013.

[2] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Action Recognition by Dense Trajectories. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3169-3176, Colorado Springs, United States, June 2011.

[3] L. Itti, C. Koch, and E. Niebur. A Model of Saliency-Based Visual Attention for Rapid Scene Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1254-1259, November 1998.

[4] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the Kernel Matrix with Semidefinite Programming. Journal of Machine Learning Research, 5:27-72, 2004.