=Paper=
{{Paper
|id=Vol-379/paper-17
|storemode=property
|title=A Unified Framework for Semantic Event Detection
|pdfUrl=https://ceur-ws.org/Vol-379/paper7.pdf
|volume=Vol-379
}}
==A Unified Framework for Semantic Event Detection==
G. Th. Papadopoulos¹,², V. Mezaris¹, I. Kompatsiaris¹ and M. G. Strintzis¹,²
¹ Information Processing Lab., Electrical & Comp. Eng. Dep., Aristotle Univ. of Thessaloniki, Greece
² Informatics and Telematics Institute, Centre for Research and Technology Hellas, Greece
ABSTRACT

In this poster, a generic multi-modal context-aware framework for detecting high-level semantic events in video sequences is presented. Initially, Hidden Markov Models (HMMs) are employed for performing an initial association of the examined video with the events of interest separately for every modality. Then, an integrated Bayesian Network (BN) is introduced for simultaneously performing information fusion and contextual knowledge exploitation.
1. INTRODUCTION

In recent years, intense research efforts have concentrated on the development of sophisticated and user-friendly systems for the skilful management of video sequences. Most of these have adopted the fundamental principle of shifting video manipulation techniques towards the processing of the visual content at a semantic level. Moreover, the usage of multi-modal as well as contextual information has emerged as a common practice for overcoming the ambiguity that is inherent in the visual medium.

In this poster, a generic multi-modal context-aware framework for detecting high-level semantic events in video sequences, making use of Machine Learning algorithms for implicit knowledge acquisition, is presented. Initially, HMMs are employed for performing an initial association of the examined video with the events of concern separately for every utilized modality. Then, an integrated BN is introduced for simultaneously performing information fusion of the individual modality analysis results and contextual knowledge exploitation.
2. OBJECTIVE OF WORK

The objective of the proposed approach is the detection of a set of predefined semantic events, denoted by E = {e_j, j = 1, ..., J}, for a particular domain. The latter represent semantically meaningful incidents that are of interest in a possible application case and have a temporal duration. Their accurate and efficient detection can facilitate tasks like video indexing, search and retrieval with respect to semantic criteria [1].
3. VIDEO PRE-PROCESSING

At the signal level, the examined video sequence is initially segmented into shots, denoted by S = {s_i, i = 1, ..., I}, which constitute the elementary image sequences of the video. For every shot, a global-level color histogram is calculated at equally spaced time intervals. Similarly, a set of dense motion fields is estimated with respect to the motion modality. Moreover, the widely used Mel Frequency Cepstral Coefficients (MFCCs) are utilized for the audio information processing.

(The work presented in this paper was supported by the European Commission under contracts FP6-001765 aceMedia, FP6-027685 MESH and FP6-027026 K-Space.)
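As an illustration of this pre-processing step, the following is a minimal sketch of how a shot's color observation sequence could be formed, assuming OpenCV for frame access and histogram computation; the helper name, sampling step and bin count are illustrative choices, not the paper's actual implementation. The audio observation sequence could be built analogously, e.g. from librosa.feature.mfcc(y=audio, sr=sr) output.

```python
import cv2          # assumed: OpenCV for decoding and histogram computation
import numpy as np

def shot_color_observations(video_path, shot_start, shot_end, step=10, bins=8):
    """Sample a shot's frames at equally spaced intervals and return one
    global-level color histogram per sampled frame, i.e. the shot's color
    observation sequence (all parameter values are illustrative)."""
    cap = cv2.VideoCapture(video_path)
    observations = []
    for idx in range(shot_start, shot_end, step):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            break
        # Joint histogram over the three color channels, L1-normalized.
        hist = cv2.calcHist([frame], [0, 1, 2], None,
                            [bins] * 3, [0, 256] * 3).flatten()
        observations.append(hist / (hist.sum() + 1e-9))
    cap.release()
    return np.array(observations)
```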
4. HMM-BASED ANALYSIS

4.1. Color- and Audio-based Analysis

After a set of color histograms is estimated for each shot, as described in Section 3, they are utilized to form the corresponding shot's color observation sequence. The latter is provided as input to an HMM structure, which performs the association of every shot with the supported events based solely on color information. In particular, an individual degree of confidence, h^C_ij, is calculated for denoting the degree with which shot s_i is associated with every event e_j. With respect to the audio information processing, the computed MFCCs are used to form the shot's audio observation sequence. Similarly to the color analysis case, a degree of confidence, h^A_ij, is calculated to indicate the corresponding association.
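The paper does not name an HMM implementation; the sketch below uses the hmmlearn package as a stand-in, training one Gaussian-emission HMM per event and mapping per-model log-likelihoods to normalized confidences h_ij via a softmax, which is one plausible normalization choice rather than the authors' own.

```python
import numpy as np
from hmmlearn import hmm   # assumed HMM library; not specified in the paper

def train_event_hmms(train_seqs, n_states=4):
    """Train one HMM per supported event e_j from that event's
    observation sequences (lists of 2D arrays: time x feature)."""
    models = {}
    for event, seqs in train_seqs.items():
        m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag")
        m.fit(np.concatenate(seqs), lengths=[len(s) for s in seqs])
        models[event] = m
    return models

def event_confidences(models, obs_seq):
    """Degrees of confidence h_ij for one shot: per-event log-likelihoods
    mapped to [0, 1] with a softmax (an illustrative normalization)."""
    events = list(models)
    loglik = np.array([models[e].score(obs_seq) for e in events])
    scores = np.exp(loglik - loglik.max())
    return dict(zip(events, scores / scores.sum()))
```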
4.2. Motion-based Analysis

4.2.1. Polynomial Approximation

For every estimated dense motion field, which is computed as described in Section 3, a corresponding motion energy field, M(b, c, t), is calculated. The latter, which actually represents a motion energy distribution surface, is approximated by a 2D polynomial function of the following form:

f(p, q) = \sum_{k,l} a_{kl} (p − p_0)^k (q − q_0)^l ,   0 ≤ k, l ≤ T and 0 ≤ k + l ≤ T   (1)

The approximation is performed using the least-squares method. Subsequently, the estimated polynomial coefficients, a_{kl}, are used to form the respective shot's motion observation sequence. The latter is provided as input to the developed HMM structure for performing the association of each shot with the supported events based solely on motion information [2]. Similarly to the color and audio analysis cases, a degree of confidence, h^M_ij, is calculated to indicate the corresponding association.
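Since f(p, q) is linear in the coefficients a_{kl}, Eq. (1) amounts to an ordinary linear least-squares problem. A minimal NumPy sketch, assuming the field center as the expansion point (p_0, q_0) and an illustrative order T:

```python
import numpy as np

def fit_energy_polynomial(M, T=4):
    """Least-squares fit of the 2D polynomial of Eq. (1) to a motion energy
    field M (2D array); returns the coefficients a_kl, which form the shot's
    motion observation vector. T and the expansion point are illustrative."""
    rows, cols = M.shape
    p, q = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    p0, q0 = rows / 2.0, cols / 2.0
    # One design-matrix column per monomial (p - p0)^k (q - q0)^l, k + l <= T.
    exps = [(k, l) for k in range(T + 1) for l in range(T + 1) if k + l <= T]
    A = np.stack([((p - p0) ** k * (q - q0) ** l).ravel() for k, l in exps],
                 axis=1)
    a_kl, *_ = np.linalg.lstsq(A, M.ravel(), rcond=None)
    return a_kl, exps
```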
4.2.2. Accumulated Motion Energy Field Computation

In order to overcome the problem of distinguishing between events that may present similar motion patterns over a period of time during their occurrence, an accumulated motion energy field is estimated with respect to every computed M(b, c, t) [2], according to the following equation:
[Figure: four panels showing a selected frame, M_acc(b, c, t, τ) for τ = 0, M_acc(b, c, t, τ) for τ = 2, and the polynomial approximation M̂_acc(x, y, t, τ) for τ = 2.]
Fig. 1. Example of accumulated motion energy field estimation and polynomial approximation for the reporting event in a news video.
[Figure: BN with parent node Event_j and child nodes Audio (h^A_ij), Color (h^C_ij) and Motion (h^M_ij), whose discretized observation states are A_j1, ..., A_jQ, C_j1, ..., C_jQ and M_j1, ..., M_jQ, respectively.]
Fig. 2. Developed BN for modality fusion.

[Figure: BN with binary (True/False) nodes Event^{i-TW}_j, ..., Event^{i-1}_j, Event^i_j, Event^{i+1}_j, ..., Event^{i+TW}_j for every event e_j, j = 1, ..., J, connected to the Event^F_ij nodes.]
Fig. 3. Integrated BN for joint modality fusion and temporal context modeling.
M_acc(b, c, t, τ) = ( \sum_{τ'=0}^{τ} w(τ') M(b, c, t − τ') ) / ( \sum_{τ'=0}^{τ} w(τ') ) ,   τ = 0, 1, ...   (2)

where w(τ) is modeled by the following time-descending function:

w(τ) = 1 / v^{f·τ} ,   v > 1 .   (3)
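Equations (2) and (3) describe a normalized, exponentially down-weighted average over the τ + 1 most recent fields. A direct NumPy transcription (v and f are free parameters; the default values below are placeholders):

```python
import numpy as np

def accumulated_energy_field(M_seq, tau, v=2.0, f=1.0):
    """Accumulated motion energy field of Eqs. (2)-(3): a weighted average
    of the current and the tau preceding fields, with weights w(k) = 1/v^(f*k)
    decreasing in time (v > 1; the defaults are illustrative)."""
    t = len(M_seq) - 1                                           # current time
    w = np.array([1.0 / v ** (f * k) for k in range(tau + 1)])   # Eq. (3)
    fields = np.stack([M_seq[t - k] for k in range(tau + 1)])
    return np.tensordot(w, fields, axes=1) / w.sum()             # Eq. (2)
```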
Following their extraction, a procedure similar to the one described in Section 4.2.1 is followed for providing motion information to the respective HMM structure, where now the computed M_acc(b, c, t, τ) are used during the polynomial approximation process, instead of the M(b, c, t). In Fig. 1, an indicative example of energy field polynomial approximation (M̂_acc(x, y, t, τ)) is presented.
5. FUSION AND CONTEXT EXPLOITATION

5.1. Information Fusion
Under the proposed approach, BNs are employed for fusing the computed single-modality analysis results. In particular, a set of J BNs is introduced, one for every defined event e_j. In Fig. 2 the network structure of every utilized BN is illustrated. This network topology defines explicitly the causal relationships between the respective variables, i.e. the event that is depicted in a shot determines the features observed with respect to every modality. Every BN estimates a degree of belief for the parent node, which constitutes a quantitative indication of the association between each shot s_i and the respective event e_j based on multi-modal information.
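As a concrete (if simplified) rendering of Fig. 2, the sketch below builds one such fusion BN with the pgmpy library, which is an assumed choice; the CPD values are placeholders for parameters that would be learned from annotated training data, and each modality's confidence is discretized into just Q = 2 states:

```python
from pgmpy.models import BayesianNetwork           # assumed library choice
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

# The event depicted in a shot causes the per-modality observations.
model = BayesianNetwork([("Event", "Audio"), ("Event", "Color"),
                         ("Event", "Motion")])

cpds = [TabularCPD("Event", 2, [[0.5], [0.5]])]    # uninformative prior
for mod in ("Audio", "Color", "Motion"):
    cpds.append(TabularCPD(mod, 2,
                           [[0.8, 0.3],            # P(state | Event=False/True)
                            [0.2, 0.7]],           # placeholder, not learned
                           evidence=["Event"], evidence_card=[2]))
model.add_cpds(*cpds)

# Degree of belief for the parent node, given the discretized h_ij evidence.
belief = VariableElimination(model).query(
    ["Event"], evidence={"Audio": 1, "Color": 1, "Motion": 0})
print(belief.values[1])     # P(Event_j = True | multi-modal observations)
```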
5.2. Context Exploitation

In order to overcome the inherent ambiguity of the visual medium, an integrated BN model is introduced for acquiring and exploiting the appropriate contextual information, i.e. the supported events' temporal occurrence order. Specifically, the developed BN, whose topology is illustrated in Fig. 3, receives as input the shot-event associations based on multi-modal information of every shot s_i (Section 5.1), as well as of all its neighboring shots that lie within a certain time window. It must be noted that all network nodes except Event^F_ij correspond to the appropriate parent nodes of the BNs that have been developed for performing information fusion (Section 5.1). At the evaluation stage, the integrated BN estimates a degree of belief for every Event^F_ij node, denoted by h^F_ij, which indicates the degree of confidence with which event e_j is eventually associated with shot s_i.
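The sketch below caricatures the integrated BN of Fig. 3 for a single event and a time window TW = 1, again assuming pgmpy, with a hand-written CPD standing in for the learned temporal context; the paper's actual network spans all J events and plugs in the fusion BNs of Section 5.1.

```python
from pgmpy.models import BayesianNetwork           # assumed library choice
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

model = BayesianNetwork([("Event_prev", "EventF"), ("Event_cur", "EventF"),
                         ("Event_next", "EventF")])
for node in ("Event_prev", "Event_cur", "Event_next"):
    model.add_cpds(TabularCPD(node, 2, [[0.5], [0.5]]))

# P(EventF | window): placeholder numbers that reward temporal agreement,
# standing in for contextual knowledge learned from training data.
model.add_cpds(TabularCPD(
    "EventF", 2,
    [[0.9, 0.8, 0.5, 0.3, 0.7, 0.5, 0.2, 0.1],     # P(False | prev, cur, next)
     [0.1, 0.2, 0.5, 0.7, 0.3, 0.5, 0.8, 0.9]],    # P(True  | prev, cur, next)
    evidence=["Event_prev", "Event_cur", "Event_next"],
    evidence_card=[2, 2, 2]))

# h^F_ij: final confidence for the event at shot s_i, given its neighborhood.
posterior = VariableElimination(model).query(
    ["EventF"], evidence={"Event_prev": 1, "Event_cur": 1, "Event_next": 0})
print(posterior.values[1])
```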
6. EXPERIMENTAL RESULTS AND CONCLUSIONS

The proposed framework was tested on videos belonging to the news broadcast domain [2]. The results presented in Table 1 demonstrate the efficiency of the proposed approach.

Table 1. Event detection results (rows: actual event; columns: detected event).

Actual \ Detected   Anchor    Reporting   Reportage   Graphics
Anchor              77.97%    10.17%      11.86%      0.00%
Reporting           0.00%     60.71%      39.29%      0.00%
Reportage           4.60%     1.15%       94.25%      0.00%
Graphics            9.38%     0.00%       2.30%       88.32%
Overall accuracy: 86.01%

7. REFERENCES

[1] G. Th. Papadopoulos, V. Mezaris, I. Kompatsiaris and M. G. Strintzis, "Accumulated Motion Energy Fields Estimation and Representation for Semantic Event Detection", in Proc. of CIVR, Niagara Falls, Canada, July 2008.

[2] G. Th. Papadopoulos, V. Mezaris, I. Kompatsiaris and M. G. Strintzis, "Estimation and Representation of Accumulated Motion Characteristics for Semantic Event Detection", in Proc. of IEEE ICIP-MIR 2008, San Diego, USA, October 2008.