     Technicolor and INRIA/IRISA at MediaEval 2011: learning
      temporal modality integration with Bayesian Networks∗

        Cédric Penet, Claire-Hélène Demarty                                   Guillaume Gravier, Patrick Gros
          Technicolor/INRIA Rennes & Technicolor                                   CNRS/IRISA & INRIA Rennes
                  1 ave de Belle Fontaine                                             Campus de Beaulieu
              35510 Cesson-Sévigné, France                                           35042 Rennes, France
                   cedric.penet@technicolor.com                                              guig@irisa.fr
              claire-helene.demarty@technicolor.com                                      Patrick.Gros@inria.fr


∗ This work was partly achieved as part of the Quaero Program, funded by OSEO, French State agency for innovation.
Copyright is held by the author/owner(s).
MediaEval 2011 Workshop, September 1-2, 2011, Pisa, Italy

ABSTRACT
This paper presents the work done at Technicolor and INRIA on the Affect Task at MediaEval 2011. This task aims at detecting violent shots in movies. We studied a Bayesian network framework, together with several ways of introducing temporality and multimodality into it.

Keywords
Violence detection, Bayesian Networks

1.    INTRODUCTION
The MediaEval 2011 Affect Task aims at detecting violence in movies. A complete description of the task and datasets may be found in [2].

As explained in that paper, violence has not been a widely studied field, despite its importance for parental control. Moreover, most of the work related to violence focuses mainly on either video or audio. In movies, however, our intuition is that every information channel is important in order to correctly assess violence, even for a human assessor, and that these information streams are complementary.

This paper presents our results for audio-only and video-only systems and for the fusion of both sources of information. We also investigate the importance of temporality in the framework.

2.    SYSTEM DESCRIPTION
The system we developed for the task contains four different parts, presented in the following subsections.

2.1    Features extraction
Movies have several components that convey different information, namely audio, video, subtitles, . . . We chose to extract features from the audio and video modalities.

Audio modality Five audio features were extracted from the audio stream over 40 msec frames with 20 msec overlap. These features were then averaged over video shots, as the task is to be performed at the shot level. The audio features are the energy, the centroid, the asymmetry, the zero crossing rate (ZCR) and the flatness.

Video modality Four video features were extracted from the video for each shot: the shot duration, the average number of blood pixels, the average activity and the number of flashes.

The feature histograms were then rank-normalised over 22 values for each movie.
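As an illustration of this step, the sketch below shows one possible per-movie rank normalisation into 22 discrete values; the NumPy implementation and the function name are ours, as the paper does not detail how the discretisation was performed:

```python
import numpy as np

def rank_normalise(features, n_values=22):
    """Rank-normalise the features of one movie over n_values discrete values.

    features: array of shape (n_shots, n_features) with raw feature values.
    Returns an integer array of the same shape with entries in [0, n_values - 1].
    """
    ranked = np.empty(features.shape, dtype=int)
    n_shots = features.shape[0]
    for j in range(features.shape[1]):
        # Rank of each shot for feature j (0 = smallest value).
        ranks = np.argsort(np.argsort(features[:, j]))
        # Spread the ranks uniformly over n_values discrete levels.
        ranked[:, j] = ranks * n_values // n_shots
    return ranked
```

Each movie would be normalised independently, so the discretised features remain comparable across films with very different dynamics.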
2.2    Classification
We chose to use Bayesian networks (BN) [3] to model the probability distribution of the data. BN exhibit interesting properties: they model a statistical distribution using a graph whose structure may be learned. On the downside, they consider each feature independently.

For the MediaEval 2011 Affect Task, several types of structures were tested:

Naive Bayesian network (NB) This is the simplest network: each feature is linked only to the classification node. It therefore assumes that the features are all conditionally independent given the classification node.

Forest augmented naive Bayesian network (FAN) This structure was introduced in [4]. The idea is to relax the assumption of conditional independence of the features given the classification node. The algorithm learns a forest between the features before connecting them to the classification node.

K2 [1] This structure learning algorithm is a state-of-the-art score-based greedy search algorithm. In order to reduce the number of possible graphs to test, it requires a node ordering.

We used the Bayes Net Toolbox (http://code.google.com/p/bnt/).
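To make the naive structure concrete, here is a minimal, self-contained sketch of an NB classifier over the rank-normalised (discrete) features. The original system used the (MATLAB) Bayes Net Toolbox, so this Python version, including the Laplace smoothing, only illustrates the structure and is not the authors' implementation:

```python
import numpy as np

class DiscreteNaiveBayes:
    """Naive Bayesian network: every feature node is linked only to the class node."""

    def fit(self, X, y, n_values=22, alpha=1.0):
        # X: (n_shots, n_features) integer features, y: (n_shots,) class labels.
        self.classes = np.unique(y)
        self.priors = np.array([np.mean(y == c) for c in self.classes])
        # cond[c, j, v] = P(feature j = v | class c), with Laplace smoothing alpha.
        self.cond = np.zeros((len(self.classes), X.shape[1], n_values))
        for ci, c in enumerate(self.classes):
            Xc = X[y == c]
            for j in range(X.shape[1]):
                counts = np.bincount(Xc[:, j], minlength=n_values) + alpha
                self.cond[ci, j] = counts / counts.sum()
        return self

    def predict_proba(self, X):
        # log P(c) + sum_j log P(x_j | c), renormalised over the classes.
        scores = np.zeros((X.shape[0], len(self.classes)))
        for ci in range(len(self.classes)):
            scores[:, ci] = np.log(self.priors[ci])
            for j in range(X.shape[1]):
                scores[:, ci] += np.log(self.cond[ci, j, X[:, j]])
        scores = np.exp(scores - scores.max(axis=1, keepdims=True))
        return scores / scores.sum(axis=1, keepdims=True)
```

The FAN and K2 structures would additionally learn edges between the feature nodes themselves, which this sketch does not attempt to reproduce.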
2.3    Temporal integration
Considering the temporal structure of movies, we decided to try several types of temporal integration. We used contextual features over $n \in [-5, +5]$: given $X_t = [x_1, \cdots, x_K]$ the feature vector of the sample at time $t$, the contextual feature vector becomes $X_t^c = [X_{t-n}, \cdots, X_t, \cdots, X_{t+n}]$. We also used two types of temporal filtering over 5 samples to smooth the decisions:

decision maximum vote This intervenes once the decision has been taken and consists in taking the maximum decision over a few samples.

probability averaging This intervenes before taking the decision, by directly averaging the samples' probabilities of being violent.
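Both mechanisms are simple to express; the sketch below shows one possible implementation of the contextual feature construction and of the two smoothing filters (the edge-padding behaviour at the beginning and end of a movie is our assumption, as the paper does not specify it):

```python
import numpy as np

def contextual_features(X, n=5):
    """Stack each shot with its n left and n right neighbours.

    X: array (n_shots, n_features); returns (n_shots, (2n + 1) * n_features),
    ordered as [X_{t-n}, ..., X_t, ..., X_{t+n}]. Border shots are padded by
    repeating the first/last shot.
    """
    padded = np.concatenate([X[:1].repeat(n, axis=0), X, X[-1:].repeat(n, axis=0)])
    return np.concatenate([padded[i:i + len(X)] for i in range(2 * n + 1)], axis=1)

def max_decision_vote(decisions, width=5):
    """Decision maximum vote: take the maximum decision over a window of shots."""
    half = width // 2
    padded = np.pad(decisions, half, mode="edge")
    return np.array([padded[i:i + width].max() for i in range(len(decisions))])

def probability_averaging(probs, width=5):
    """Probability averaging: smooth the violence probabilities before deciding."""
    half = width // 2
    padded = np.pad(probs, half, mode="edge")
    return np.array([padded[i:i + width].mean() for i in range(len(probs))])
```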
2.4    Modalities fusion
As for multimodal fusion, two cases were considered: late fusion and early fusion. For early fusion, we simply fused the features from both modalities before learning, while for late fusion, we fused the probabilities of both modalities for the i-th shot $s_i$ using:

\[
P_{fused}(P_{va}^{s_i}, P_{vv}^{s_i}) =
\begin{cases}
\max(P_{va}^{s_i}, P_{vv}^{s_i}) & \text{if both are violent}\\
\min(P_{va}^{s_i}, P_{vv}^{s_i}) & \text{if both are non violent}\\
P_{va}^{s_i} \cdot P_{vv}^{s_i} & \text{otherwise}
\end{cases}
\tag{1}
\]

where $P_{va}^{s_i}$ (respectively $P_{vv}^{s_i}$) is the probability of being violent for the audio (respectively video) modality for the i-th shot.

This rule gives a high score when both audio and video find a violent segment, a low score when neither does, and an intermediate score when only one of them answers yes.
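A direct transcription of this rule, assuming a modality is considered violent when its probability exceeds 0.5 (the paper does not state the decision threshold, so that value is our assumption):

```python
def fuse_late(p_audio, p_video, threshold=0.5):
    """Late fusion of per-shot violence probabilities, following equation (1)."""
    audio_violent = p_audio > threshold
    video_violent = p_video > threshold
    if audio_violent and video_violent:
        return max(p_audio, p_video)   # both modalities agree on violence: high score
    if not audio_violent and not video_violent:
        return min(p_audio, p_video)   # both agree on non-violence: low score
    return p_audio * p_video           # disagreement: intermediate score
```

For instance, fuse_late(0.9, 0.8) returns 0.9, fuse_late(0.1, 0.3) returns 0.1, and fuse_late(0.9, 0.2) returns 0.18.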

3.    RUNS SUBMITTED AND RESULTS
This section describes the runs submitted to the MediaEval 2011 Affect Task. For the audio-only and video-only experiments, we chose to submit the two best runs according to the MediaEval cost and to the false alarm vs. missed detection curve, using cross validation on the learning set, while for the multimodal runs (namely, early and late fusion), we chose the best ones according to both metrics. The selected runs and their results are presented in Table 1.

  Description                              MC      F
  LF   C   A: Me  V: Ma  A: N  V: K2      0.761   0.397
  LF   C   A: Me  V: Me  A: N  V: K2      0.774   0.391
  V    C   Ma            K2               0.784   0.305
  A    C   Me            K2               0.805   0.295
  V    C   Me            K2               0.840   0.297
  A    C   Me            N                0.843   0.354
  EF   C   Ma            K2               0.892   0.284
  A    C   Ma            K2               0.943   0.268
  V    -   Ma            N                0.950   0.255
  A    -   -             K2               0.967   0.251
  EF   C   -             K2               0.998   0.266
  V    -   Ma            FAN              1.009   0.276

Table 1: Results for the submitted runs, ordered by increasing MediaEval Cost (MC). F: F-measure, LF: late fusion, EF: early fusion, A: audio, V: video, C: contextual, N: naive BN, Ma: max decision vote, Me: mean probability.

Most of the obtained scores are lower than 1, which is better than the trivial case where every sample is classified as violent, i.e. a false alarm rate of 100% and a missed detection rate of 0%.

The analysis of the produced graphs yields encouraging observations on the quality of the structure learning algorithms. Firstly, the links between features are easily interpreted. The ZCR and the centroid are linked as they represent the same information; the activity is linked to the shot length, as the shot detector used tends to oversegment when the activity is high; and finally blood is not connected to violence, which seems logical considering that the presence of blood in violent scenes highly depends on the movie and on the chosen definition of violence. As for early fusion, while we thought it would improve the results and find correlations between audio and video, it appears that for non-contextual data only the video features are linked to the violence node, and that for contextual data the links are messy. This, together with the better results obtained using late fusion, tends to indicate that the features used in the two modalities are of different levels and cannot be compared as such. Secondly, the algorithms seem to produce a strict temporal structure, i.e. the features from time t = n are linked together and not to features from other times unless they form chains. There are four chains in the graphs: flatness, energy, activity and blood. It is easy to see that these features have a temporal structure. On the other hand, the flash feature is connected only to the violence node and forms no chain, which is again logical as the flash feature only detects high luminance variations and therefore has no well-defined temporal structure.

The use of contextual features seems to provide good and promising results, which tends to confirm the importance of the temporal structure of movies. The context depth used for this evaluation was chosen arbitrarily; it would be interesting to also consider other depths. On the downside, these results seem to depend on the algorithm used for learning the BN structures: FAN seems to work better on non-contextual data, while K2 seems to give the best results on contextual data.

This concludes the preliminary analysis that can be inferred from the evaluation.

4.    CONCLUSION
This paper presents a simple framework based on temporal integration, multimodality and Bayesian networks. First, it is experimentally shown that the structure learning algorithms output logical graphs with respect to the provided data: they are able to capture the links between features and provide a coherent temporal structure. It is also shown that early fusion of features of different natures yields poor results, while late fusion seems more promising. Second, the use of contextual data seems to improve the results.

This work provides a promising baseline for future work on the subject. We have several ideas for improvement. We want to add features from the text modality, as we think it also contains important information on the violent nature of the video shots. We also want to investigate the contextual data further and test other structure learning algorithms.

5.    REFERENCES
[1] G. F. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309–347, 1992.
[2] C.-H. Demarty, C. Penet, G. Gravier, and M. Soleymani. The MediaEval 2011 Affect Task: Violent Scenes Detection in Hollywood Movies. In MediaEval 2011 Workshop, Pisa, Italy, September 1-2, 2011.
[3] D. Heckerman. A Tutorial on Learning with Bayesian Networks. Technical report, Microsoft Research, 1995.
[4] P. Lucas. Restricted Bayesian Network Structure Learning. In Advances in Bayesian Networks, Studies in Fuzziness and Soft Computing, pages 217–232, 2002.