=Paper= {{Paper |id=Vol-379/paper-17 |storemode=property |title=A Unified Framework for Semantic Event Detection |pdfUrl=https://ceur-ws.org/Vol-379/paper7.pdf |volume=Vol-379 }}
                    A UNIFIED FRAMEWORK FOR SEMANTIC EVENT DETECTION

G. Th. Papadopoulos^{1,2}, V. Mezaris^1, I. Kompatsiaris^1 and M. G. Strintzis^{1,2}

^1 Information Processing Lab., Electrical & Comp. Eng. Dep., Aristotle Univ. of Thessaloniki, Greece
^2 Informatics and Telematics Institute, Centre for Research and Technology Hellas, Greece


ABSTRACT

In this poster, a generic multi-modal context-aware framework for detecting high-level semantic events in video sequences is presented. Initially, Hidden Markov Models (HMMs) are employed to perform an initial association of the examined video with the events of interest separately for every modality. Then, an integrated Bayesian Network (BN) is introduced for simultaneously performing information fusion and contextual knowledge exploitation.

1. INTRODUCTION

In recent years, intense research efforts have concentrated on the development of sophisticated and user-friendly systems for the skillful management of video sequences. Most of these systems have adopted the fundamental principle of shifting video manipulation techniques towards the processing of the visual content at a semantic level. Moreover, the use of multi-modal as well as contextual information has emerged as common practice for overcoming the ambiguity that is inherent in the visual medium.

In this poster, a generic multi-modal context-aware framework for detecting high-level semantic events in video sequences, making use of machine learning algorithms for implicit knowledge acquisition, is presented. Initially, HMMs are employed to perform an initial association of the examined video with the events of concern separately for every utilized modality. Then, an integrated BN is introduced for simultaneously performing information fusion of the individual modality analysis results and contextual knowledge exploitation.

2. OBJECTIVE OF WORK

The objective of the proposed approach is the detection of a set of predefined semantic events, denoted by E = {e_j, j = 1, ..., J}, for a particular domain. These events represent semantically meaningful incidents that are of interest in a given application case and that have a temporal duration. Their accurate and efficient detection can facilitate tasks like video indexing, search and retrieval with respect to semantic criteria [1].

3. VIDEO PRE-PROCESSING

At the signal level, the examined video sequence is initially segmented into shots, denoted by S = {s_i, i = 1, ..., I}, which constitute the elementary image sequences of video. For every shot, a global-level color histogram is calculated at equally spaced time intervals. Similarly, a set of dense motion fields is estimated with respect to the motion modality. Moreover, the widely used Mel Frequency Cepstral Coefficients (MFCCs) are utilized for the audio information processing.

The work presented in this paper was supported by the European Commission under contracts FP6-001765 aceMedia, FP6-027685 MESH and FP6-027026 K-Space.

4. HMM-BASED ANALYSIS

4.1. Color- and Audio-based Analysis

After a set of color histograms is estimated for each shot, as described in Section 3, they are utilized to form the corresponding shot's color observation sequence. The latter is provided as input to an HMM structure, which performs the association of every shot with the supported events based solely on color information. In particular, an individual degree of confidence, h^C_ij, is calculated to denote the degree with which shot s_i is associated with every event e_j. With respect to the audio information processing, the computed MFCCs are used to form the shot's audio observation sequence. Similarly to the color analysis case, a degree of confidence, h^A_ij, is calculated to indicate the corresponding association.

4.2. Motion-based Analysis

4.2.1. Polynomial Approximation

For every estimated dense motion field, computed as described in Section 3, a corresponding motion energy field, M(b, c, t), is calculated. The latter, which represents a motion energy distribution surface, is approximated by a 2D polynomial function of the following form:

    f(p, q) = Σ_{k,l} a_kl · (p − p0)^k · (q − q0)^l ,   0 ≤ k, l ≤ T and 0 ≤ k + l ≤ T    (1)

The approximation is performed using the least-squares method. Subsequently, the estimated polynomial coefficients, a_kl, are used to form the respective shot's motion observation sequence. The latter is provided as input to the developed HMM structure for performing the association of each shot with the supported events based solely on motion information [2]. Similarly to the color and audio analysis cases, a degree of confidence, h^M_ij, is calculated to indicate the corresponding association.

4.2.2. Accumulated Motion Energy Field Computation

In order to distinguish between events that may present similar motion patterns over a period of time during their occurrence, an accumulated motion energy field is estimated with respect to every computed M(b, c, t) [2], according to the following equation:
[Figure: four panels — a selected frame; M_acc(b, c, t, τ) for τ = 0; M_acc(b, c, t, τ) for τ = 2; and its polynomial approximation M̂_acc(x, y, t, τ) for τ = 2.]

Fig. 1. Example of accumulated motion energy field estimation and polynomial approximation for the reporting event in a news video.
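The least-squares fit of Eq. (1) can be sketched with NumPy as follows. This is a minimal illustration, not the authors' implementation: the grid size, the choice T = 2, and the centering of (p0, q0) at the field's midpoint are assumptions made here for the example.

```python
import numpy as np

def fit_motion_energy_poly(M, T=2):
    """Least-squares fit of the 2D polynomial of Eq. (1) to a motion
    energy field M (a 2D array); returns the coefficients a_kl keyed
    by (k, l). The field is centered at its midpoint (p0, q0)."""
    P, Q = M.shape
    p0, q0 = (P - 1) / 2.0, (Q - 1) / 2.0
    p, q = np.meshgrid(np.arange(P), np.arange(Q), indexing="ij")
    # One design-matrix column per term (p - p0)^k * (q - q0)^l, k + l <= T.
    terms = [(k, l) for k in range(T + 1) for l in range(T + 1) if k + l <= T]
    A = np.stack([(p - p0).ravel() ** k * (q - q0).ravel() ** l
                  for k, l in terms], axis=1)
    coeffs, *_ = np.linalg.lstsq(A, M.ravel(), rcond=None)
    return dict(zip(terms, coeffs))

# Toy check: a field that is exactly quadratic is recovered.
pp, qq = np.meshgrid(np.arange(32) - 15.5, np.arange(32) - 15.5, indexing="ij")
M = 0.5 + 0.01 * pp**2 - 0.02 * qq**2
a = fit_motion_energy_poly(M, T=2)
```

The coefficient dictionary plays the role of the a_kl values that are concatenated into the shot's motion observation sequence.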


[Figure: a binary (True/False) Event_j node with Audio (A_j1, ..., A_jQ), Color (C_j1, ..., C_jQ) and Motion (M_j1, ..., M_jQ) observation nodes as its children, fed by h^A_ij, h^C_ij and h^M_ij.]

Fig. 2. Developed BN for modality fusion.

[Figure: binary (True/False) Event nodes for events 1, ..., J of shots i−TW, ..., i−1, i, i+1, ..., i+TW, connected to the Event^F_ij nodes of shot i.]

Fig. 3. Integrated BN for joint modality fusion and temporal context modeling.
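Because the Fig. 2 topology makes the event node the parent of the per-modality observation nodes, the belief for the event factorizes as P(e) · Π_m P(o_m | e), renormalized over the two event states. The sketch below illustrates this inference step only; the paper's actual conditional probability tables and the number Q of quantization bins are not given here, so the three-bin tables and all numbers are illustrative assumptions.

```python
import numpy as np

def fuse_modalities(prior, cpts, observations):
    """Posterior P(event = True | audio, color, motion) for a two-state
    (True/False) event node whose children are the modality observation
    nodes, as in the Fig. 2 topology: multiply the prior by each
    modality's likelihood column, then renormalize."""
    post = np.array([prior, 1.0 - prior])  # [P(True), P(False)]
    for modality, obs in observations.items():
        post = post * cpts[modality][:, obs]  # P(o_m = obs | event)
    return post[0] / post.sum()

# Illustrative CPTs: rows = event True/False, columns = quantized
# confidence bins (low, mid, high) for each single-modality result.
cpts = {
    "audio":  np.array([[0.1, 0.3, 0.6], [0.6, 0.3, 0.1]]),
    "color":  np.array([[0.2, 0.3, 0.5], [0.5, 0.3, 0.2]]),
    "motion": np.array([[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]),
}
belief = fuse_modalities(0.5, cpts, {"audio": 2, "color": 2, "motion": 1})
```

With two high and one mid confidence bin observed, the fused belief is close to 1; observing all-low bins drives it towards 0.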

    M_acc(b, c, t, τ) = ( Σ_{τ'=0}^{τ} w(τ') · M(b, c, t − τ') ) / ( Σ_{τ'=0}^{τ} w(τ') ),   τ = 0, 1, ... ,    (2)

where w(τ) is modeled by the following time-descending function:

    w(τ) = 1 / v^(f·τ) ,   v > 1 .    (3)

Table 1. Event detection results.

    Actual Event | Detected Event:  Anchor   Reporting  Reportage  Graphics
    Anchor                          77.97%   10.17%     11.86%      0.00%
    Reporting                        0.00%   60.71%     39.29%      0.00%
    Reportage                        4.60%    1.15%     94.25%      0.00%
    Graphics                         9.38%    0.00%      2.30%     88.32%
    Overall accuracy: 86.01%
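The accumulation of Eq. (2) with the weights of Eq. (3) can be sketched as a weighted average over the current and the τ preceding energy fields. The values v = 2 and f = 1 below are illustrative assumptions; the paper does not fix them here.

```python
import numpy as np

def accumulated_energy_field(fields, t, tau, v=2.0, f=1.0):
    """Accumulated motion energy field of Eq. (2): a weighted average of
    the fields M(., ., t - tau') for tau' = 0, ..., tau, using the
    time-descending weights w(tau') = 1 / v**(f * tau') of Eq. (3)."""
    w = np.array([1.0 / v ** (f * k) for k in range(tau + 1)])
    stack = np.stack([fields[t - k] for k in range(tau + 1)])  # M(b, c, t - tau')
    return np.tensordot(w, stack, axes=1) / w.sum()

# Toy sequence of constant 4x4 energy fields with values 0, 1, ..., 4.
fields = [np.full((4, 4), float(i)) for i in range(5)]
Macc = accumulated_energy_field(fields, t=4, tau=2)
```

For τ = 0 the accumulated field reduces to the current field M(b, c, t), matching the τ = 0 panel of Fig. 1; larger τ blends in more motion history with exponentially decreasing weight.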
Following their extraction, a procedure similar to the one described in Section 4.2.1 is followed to provide motion information to the respective HMM structure, where now the computed M_acc(b, c, t, τ) are used during the polynomial approximation process instead of M(b, c, t). In Fig. 1, an indicative example of energy field polynomial approximation (M̂_acc(x, y, t, τ)) is presented.

5. FUSION AND CONTEXT EXPLOITATION

5.1. Information Fusion

Under the proposed approach, BNs are employed for fusing the computed single-modality analysis results. In particular, a set of J BNs is introduced, one for every defined event e_j. In Fig. 2 the network structure of every utilized BN is illustrated. This network topology explicitly defines the causal relationships between the respective variables, i.e. the event that is depicted in a shot determines the features observed with respect to every modality. Every BN estimates a degree of belief for the parent node, which constitutes a quantitative indication of the association between each shot s_i and the respective event e_j based on multi-modal information.

5.2. Context Exploitation

In order to overcome the inherent ambiguity of the visual medium, an integrated BN model is introduced for acquiring and exploiting the appropriate contextual information, i.e. the supported events' temporal occurrence order. Specifically, the developed BN, whose topology is illustrated in Fig. 3, receives as input the shot-event associations based on multi-modal information of every shot s_i (Section 5.1), as well as of all its neighboring shots that lie within a certain time window. It must be noted that all network nodes except Event^F_ij correspond to the appropriate parent nodes of the BNs that have been developed for performing information fusion (Section 5.1). At the evaluation stage, the integrated BN estimates a degree of belief for every Event^F_ij node, denoted by h^F_ij, which indicates the degree of confidence with which event e_j is eventually associated with shot s_i.

6. EXPERIMENTAL RESULTS AND CONCLUSIONS

The proposed framework was tested on videos belonging to the news broadcast domain [2]. The results presented in Table 1 demonstrate the efficiency of the proposed approach.

7. REFERENCES

[1] G. Th. Papadopoulos, V. Mezaris, I. Kompatsiaris and M. G. Strintzis, "Accumulated Motion Energy Fields Estimation and Representation for Semantic Event Detection," in Proc. of CIVR, Niagara Falls, Canada, July 2008.

[2] G. Th. Papadopoulos, V. Mezaris, I. Kompatsiaris and M. G. Strintzis, "Estimation and Representation of Accumulated Motion Characteristics for Semantic Event Detection," in Proc. of IEEE ICIP-MIR 2008, San Diego, USA, October 2008.