=Paper=
{{Paper
|id=None
|storemode=property
|title=Technicolor and INRIA/IRISA at MediaEval 2011: learning temporal modality integration with Bayesian Networks
|pdfUrl=https://ceur-ws.org/Vol-807/penet_TECHNICOLOR_Violence_me11wn.pdf
|volume=Vol-807
|dblpUrl=https://dblp.org/rec/conf/mediaeval/PenetDGG11
}}
==Technicolor and INRIA/IRISA at MediaEval 2011: learning temporal modality integration with Bayesian Networks==
Technicolor and INRIA/IRISA at MediaEval 2011: learning temporal modality integration with Bayesian Networks∗

Cédric Penet, Claire-Hélène Demarty
Technicolor/INRIA Rennes & Technicolor
1 ave de Belle Fontaine, 35510 Cesson-Sévigné, France
cedric.penet@technicolor.com, claire-helene.demarty@technicolor.com

Guillaume Gravier, Patrick Gros
CNRS/IRISA & INRIA Rennes
Campus de Beaulieu, 35042 Rennes, France
guig@irisa.fr, Patrick.Gros@inria.fr

∗ This work was partly achieved as part of the Quaero Program, funded by OSEO, French State agency for innovation.
Copyright is held by the author/owner(s). MediaEval 2011 Workshop, September 1-2, 2011, Pisa, Italy.

ABSTRACT

This paper presents the work done at Technicolor and INRIA regarding the Affect Task at MediaEval 2011. This task aims at detecting violent shots in movies. We studied a Bayesian network framework and several ways of introducing temporality and multimodality into it.

Keywords: Violence detection, Bayesian Networks

1. INTRODUCTION

The MediaEval 2011 Affect Task aims at detecting violence in movies. A complete description of the task and datasets may be found in [2].

As explained in the paper cited above, violence has not been a widely studied field, despite its importance for parental control. Moreover, most of the work related to violence focuses mainly on either video or audio. However, our intuition is that, in movies, every information channel is important for correctly assessing violence, even for a human assessor, and that these information streams are complementary. This paper presents our results for audio-only and video-only systems and for the fusion of both sources of information. We also investigate the importance of temporality in the framework.

2. SYSTEM DESCRIPTION

The system we developed for the task contains four different parts:

2.1 Features extraction

Movies have several components that convey different information, namely audio, video, subtitles, etc. We chose to extract features from the audio and video modalities.

Audio modality. Five audio features were extracted from the audio stream over 40 ms frames with 20 ms overlap: the energy, the centroid, the asymmetry, the zero crossing rate (ZCR) and the flatness. These features were then averaged over video shots, as the task is to be performed at the shot level.

Video modality. Four video features were extracted from the video for each shot: the shot duration, the average number of blood pixels, the average activity and the number of flashes.

The feature histograms were then rank-normalised over 22 values for each movie.

2.2 Classification

We chose to use Bayesian networks (BN) [3] to model the probability distributions of the data. BN exhibit interesting features: they model a statistical distribution using a graph whose structure may be learned. On the downside, they consider each feature independently.

For the MediaEval 2011 Affect Task, several types of structures were tested:

Naive Bayesian network (NB). This is the simplest network. Each feature is only linked to the classification node. It therefore makes the assumption that the features are all independent with respect to the classification node.

Forest augmented naive Bayesian network (FAN). This structure was introduced in [4]. The idea is to relax the assumption of feature independence with respect to the classification node. The algorithm learns a forest between the features before connecting them to the classification node.

K2 [1]. This structure learning algorithm is a state-of-the-art score-based greedy search algorithm. In order to reduce the number of possible graphs to test, it requires a node ordering.

We used the Bayes Net Toolbox (http://code.google.com/p/bnt/).
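To make Sections 2.1 and 2.2 concrete, here is a minimal sketch of the per-movie rank normalisation over 22 values and of the NB structure, in which each feature depends only on the classification node. The function names (rank_normalise, NaiveBayes), the exact binning scheme and the Laplace smoothing are illustrative assumptions, not the authors' implementation, which used the Bayes Net Toolbox.

<syntaxhighlight lang="python">
"""Sketch of Sections 2.1-2.2: per-movie rank normalisation over 22 values
and a discrete naive Bayes (NB) classifier. Names, binning details and
Laplace smoothing are illustrative assumptions."""
import numpy as np

N_BINS = 22  # the paper rank-normalises each feature over 22 values


def rank_normalise(features: np.ndarray) -> np.ndarray:
    """Per movie: replace each feature value by its rank, quantised to N_BINS."""
    ranks = features.argsort(axis=0).argsort(axis=0)  # per-column ranks 0..n-1
    return (ranks * N_BINS // len(features)).clip(0, N_BINS - 1)


class NaiveBayes:
    """NB structure of Sec. 2.2: each feature is linked only to the class node."""

    def fit(self, X: np.ndarray, y: np.ndarray) -> "NaiveBayes":
        """X: integer bins from rank_normalise; y: 0/1 violence labels."""
        self.classes = np.unique(y)
        self.prior = np.array([(y == c).mean() for c in self.classes])
        # One table P(feature value | class) per feature, Laplace-smoothed.
        self.cpt = np.ones((len(self.classes), X.shape[1], N_BINS))
        for i, c in enumerate(self.classes):
            for j in range(X.shape[1]):
                vals, counts = np.unique(X[y == c, j], return_counts=True)
                self.cpt[i, j, vals] += counts
        self.cpt /= self.cpt.sum(axis=2, keepdims=True)
        return self

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        """Return normalised P(class | shot) for each shot (row of X)."""
        logp = np.log(self.prior) + np.stack(
            [np.log(self.cpt[i, np.arange(X.shape[1]), X]).sum(axis=1)
             for i in range(len(self.classes))], axis=1)
        p = np.exp(logp - logp.max(axis=1, keepdims=True))
        return p / p.sum(axis=1, keepdims=True)
</syntaxhighlight>

Given shot-level labels for the training movies, NaiveBayes().fit(X, y).predict_proba(X_test) would yield the per-shot probabilities of being violent that the temporal filters of Section 2.3 then smooth.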
2.3 Temporal integration

Considering the temporal structure of movies, we decided to try several types of temporal integration. We used contextual features over $n \in [-5, +5]$: considering $X_t = [x_1, \cdots, x_K]$ the feature vector for the sample at time $t$, the contextual feature vector becomes $X_t^c = [X_{t-n}, \cdots, X_t, \cdots, X_{t+n}]$. We also used two types of temporal filtering over 5 samples, which smooth the decisions:

Decision maximum vote. This intervenes once the decision has been taken and consists in taking the maximum decision over a few samples.

Probability averaging. This intervenes before taking the decision, by directly averaging the samples' probabilities of being violent.

2.4 Modalities fusion

As for multimodal fusion, two cases were considered: late fusion and early fusion. For early fusion, we simply fused the features from both modalities before learning, while for late fusion, we fused the probabilities of both modalities for the $i$th shot $s_i$ using:

$$
P_{fused}(P_{va}^{s_i}, P_{vv}^{s_i}) =
\begin{cases}
\max(P_{va}^{s_i}, P_{vv}^{s_i}) & \text{if both are violent}\\
\min(P_{va}^{s_i}, P_{vv}^{s_i}) & \text{if both are non-violent}\\
P_{va}^{s_i} \cdot P_{vv}^{s_i} & \text{otherwise}
\end{cases}
\qquad (1)
$$

where $P_{va}^{s_i}$ (respectively $P_{vv}^{s_i}$) is the probability of being violent for the audio (respectively video) modality for the $i$th shot. This rule gives a high score when both audio and video find a violent segment, a low score if both do not, and an intermediate score if only one answers yes.
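As an illustration of the temporal integration of Section 2.3, the sketch below stacks contextual features over n ∈ [−5, +5] and implements the two 5-sample decision filters. Edge padding and the centring of the windows are assumptions, as the paper does not specify how movie boundaries are handled.

<syntaxhighlight lang="python">
"""Sketch of Sec. 2.3: contextual feature stacking over n in [-5, +5] and
the two 5-sample decision filters. Padding and centring are assumptions."""
import numpy as np


def contextual_features(X: np.ndarray, n: int = 5) -> np.ndarray:
    """Stack X_{t-n}..X_{t+n} into one vector per shot (edges repeated)."""
    padded = np.pad(X, ((n, n), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(X)] for i in range(2 * n + 1)])


def max_decision_vote(decisions: np.ndarray, win: int = 5) -> np.ndarray:
    """After thresholding: keep the maximum decision over a centred window."""
    padded = np.pad(decisions, win // 2, mode="edge")
    return np.array([padded[t:t + win].max() for t in range(len(decisions))])


def probability_averaging(probs: np.ndarray, win: int = 5) -> np.ndarray:
    """Before thresholding: average violence probabilities over a window."""
    padded = np.pad(probs, win // 2, mode="edge")
    return np.convolve(padded, np.ones(win) / win, mode="valid")
</syntaxhighlight>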
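The late-fusion rule of Equation (1) is equally compact; this sketch assumes a 0.5 threshold to turn each modality's probability into a violent/non-violent decision, which the paper does not state explicitly.

<syntaxhighlight lang="python">
"""Sketch of the late-fusion rule of Eq. (1). The 0.5 decision threshold
per modality is an assumption, not stated in the paper."""


def fuse(p_audio: float, p_video: float, thr: float = 0.5) -> float:
    """Combine per-shot violence probabilities from audio and video."""
    if p_audio >= thr and p_video >= thr:   # both modalities say violent
        return max(p_audio, p_video)
    if p_audio < thr and p_video < thr:     # both say non-violent
        return min(p_audio, p_video)
    return p_audio * p_video                # modalities disagree
</syntaxhighlight>

For instance, fuse(0.9, 0.8) gives 0.9, fuse(0.1, 0.3) gives 0.1, and fuse(0.9, 0.2) gives 0.18, the intermediate case where only one modality answers yes.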
3. RUNS SUBMITTED AND RESULTS

This section describes the runs submitted to the MediaEval 2011 Affect Task. For the audio and video experiments, we chose to submit the two best runs according to the MediaEval cost and to the false alarm vs. missed detection curve, using cross-validation on the learning set, while for the multimodal runs (namely, early and late fusion), we chose the best ones according to both metrics. The selected runs and their results are presented in Table 1.

Table 1: Results for the submitted runs, ordered by increasing MediaEval Cost (MC). F: F-measure, LF: late fusion, EF: early fusion, A: audio, V: video, C: contextual, N: naive BN, Ma: max decision vote, Me: mean probability.

Description                      MC     F
LF C A: Me V: Ma A: N V: K2      0.761  0.397
LF C A: Me V: Me A: N V: K2      0.774  0.391
V C Ma K2                        0.784  0.305
A C Me K2                        0.805  0.295
V C Me K2                        0.840  0.297
A C Me N                         0.843  0.354
EF C Ma K2                       0.892  0.284
A C Ma K2                        0.943  0.268
V - Ma N                         0.950  0.255
A - - K2                         0.967  0.251
EF C - K2                        0.998  0.266
V - Ma FAN                       1.009  0.276

Most of the obtained scores are lower than 1, which is better than the trivial case where each sample is classified as violent, i.e. a false alarm rate of 100% and a missed detection rate of 0%.

The analysis of the produced graphs yields encouraging observations on the quality of the structure learning algorithms. Firstly, the links between features may be easily interpreted. The ZCR and the centroid are linked, as they represent the same information; the activity is linked to the shot length, as the shot detector used tends to oversegment when the activity is high; and finally, blood is not connected to violence, which seems logical considering that the presence of blood in violent scenes highly depends on the movie and on the chosen definition of violence. As for early fusion, while we thought it would improve the results and find correlations between audio and video, it appears that for non-contextual data only the video features are linked to the violence node, and that for contextual data the links are messy. This, together with the better results obtained using late fusion, tends to indicate that the features used in the two modalities are of different levels and cannot be compared as such. Secondly, the algorithms produce a strict temporal structure, i.e. the features from time t = n are linked together and not to features from different times, unless they form chains. There are four chains in the graphs: flatness, energy, activity and blood. It is easy to see that these features have a temporal structure. On the other hand, the flash feature is connected only to the violence node and forms no chain, which is again logical, as the flash feature only detects high luminance variations and therefore has no well-defined temporal structure.

The use of contextual features seems to provide good and promising results, which tends to confirm the importance of the temporal structure of movies. The context depth used for this evaluation was chosen arbitrarily; it would however be interesting to also consider other depths. On the downside, these results seem to depend on the algorithm used for learning the BN structures: FAN seems to work better with non-contextual data, while K2 with contextual data seems to give the best results.

This concludes the preliminary analysis that can be inferred from the evaluation.

4. CONCLUSION

This paper presents a simple framework based on temporal integration, multimodality and Bayesian networks. First, it is experimentally shown that the structure learning algorithms output logical graphs with respect to the provided data: they are able to capture the links between features and provide a coherent temporal structure. It is also shown that early fusion of features of different natures yields poor results, while late fusion seems more promising. Second, the use of contextual data seems to improve the results.

This work provides a promising baseline for future work on the subject. We have several ideas for improvement. We want to add features from the text modality, as we think it also contains important information on the violent nature of video shots. We also want to investigate the contextual data further and test other structure learning algorithms.

5. REFERENCES

[1] G. F. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309–347, 1992.
[2] C.-H. Demarty, C. Penet, G. Gravier, and M. Soleymani. The MediaEval 2011 Affect Task: Violent Scenes Detection in Hollywood Movies. In MediaEval 2011 Workshop, Pisa, Italy, September 1-2, 2011.
[3] D. Heckerman. A Tutorial on Learning with Bayesian Networks. Technical report, Microsoft Research, 1995.
[4] P. Lucas. Restricted Bayesian Network Structure Learning. In Advances in Bayesian Networks, Studies in Fuzziness and Soft Computing, pages 217–232, 2002.