A UNIFIED FRAMEWORK FOR SEMANTIC EVENT DETECTION

G. Th. Papadopoulos

0 1

V. Mezaris

I. Kompatsiaris

M. G. Strintzis

0 1

0 Informatics and Telematics Institute, Centre for Research and Technology Hellas, Greece
1 Information Processing Lab., Electrical & Comp. Eng. Dep., Aristotle Univ. of Thessaloniki, Greece

In this poster, a generic multi-modal context-aware framework for detecting high-level semantic events in video sequences is presented. First, Hidden Markov Models (HMMs) are employed to associate the examined video with the events of interest, separately for every modality. Then, an integrated Bayesian Network (BN) is introduced to simultaneously perform information fusion and exploit contextual knowledge.

1. INTRODUCTION

In recent years, intense research efforts have concentrated on the development of sophisticated and user-friendly systems for the effective management of video sequences. Most of them have adopted the fundamental principle of shifting video manipulation techniques towards processing the visual content at a semantic level. Moreover, the use of multi-modal as well as contextual information has emerged as a common practice for overcoming the ambiguity that is inherent in the visual medium.

In this poster, a generic multi-modal context-aware framework for detecting high-level semantic events in video sequences is presented, which makes use of Machine Learning algorithms for implicit knowledge acquisition. First, HMMs are employed to associate the examined video with the events of concern, separately for every utilized modality. Then, an integrated BN is introduced that simultaneously fuses the individual modality analysis results and exploits contextual knowledge.

2. OBJECTIVE OF WORK

The objective of the proposed approach is the detection of a set of predefined semantic events, denoted by $E = \{e_j,\ j = 1, \ldots, J\}$, for a particular domain. These events represent semantically meaningful incidents that are of interest in a possible application case and have a temporal duration. Their accurate and efficient detection can facilitate tasks like video indexing, search and retrieval with respect to semantic criteria [1].

3. VIDEO PRE-PROCESSING

At the signal level, the examined video sequence is initially segmented into shots, denoted by $S = \{s_i,\ i = 1, \ldots, I\}$, which constitute the elementary image sequences of video. For every shot, a global-level color histogram is calculated at equally spaced time intervals. Similarly, a set of dense motion fields is estimated with respect to the motion modality. Moreover, the widely used Mel Frequency Cepstral Coefficients (MFCCs) are utilized for the audio information processing.

The work presented in this paper was supported by the European Commission under contracts FP6-001765 aceMedia, FP6-027685 MESH and FP6-027026 K-Space.
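The color part of the pre-processing step can be illustrated with a minimal sketch. The poster does not specify the color space, the number of histogram bins or the sampling interval, so `bins`, `step` and both function names below are assumptions made purely for illustration.

```python
import numpy as np

def global_color_histogram(frame, bins=8):
    """Normalised per-channel histogram of an H x W x 3 uint8 frame.

    `bins` is a hypothetical choice; the source does not state one.
    """
    hists = [np.histogram(frame[..., ch], bins=bins, range=(0, 256))[0]
             for ch in range(3)]
    h = np.concatenate(hists).astype(float)
    return h / h.sum()

def shot_color_observations(frames, step=5, bins=8):
    """Sample every `step`-th frame of a shot and stack the histograms.

    Returns an array of shape (number of sampled frames, 3 * bins),
    i.e. one observation vector per equally spaced time instant.
    """
    return np.stack([global_color_histogram(f, bins) for f in frames[::step]])
```

Each row of the returned array is one observation vector, so the array as a whole can serve as the color observation sequence of a shot.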

4. HMM-BASED ANALYSIS

4.1. Color- and Audio-based Analysis

After a set of color histograms is estimated for each shot, as described in Section 3, they are utilized to form the corresponding shot's color observation sequence. The latter is provided as input to an HMM structure, which performs the association of every shot with the supported events based solely on color information. In particular, an individual degree of confidence, $h^C_{ij}$, is calculated to denote the degree to which shot $s_i$ is associated with every event $e_j$. With respect to the audio information processing, the computed MFCCs are used to form the shot's audio observation sequence. Similarly to the color analysis case, a degree of confidence, $h^A_{ij}$, is calculated to indicate the corresponding association.
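How an observation sequence could be turned into per-event confidences is sketched below. The poster does not state the HMM topology, the emission model or the way likelihoods are mapped to confidence values. The sketch therefore assumes one HMM per event with diagonal-Gaussian emissions, scores the sequence with the forward algorithm, and normalises the length-averaged log-likelihoods with a softmax. All function names and the normalisation are assumptions.

```python
import numpy as np

def log_gaussian(obs, means, variances):
    """Log density of every observation under every state's diagonal Gaussian.

    obs: (T, D); means, variances: (N, D). Returns an array of shape (T, N).
    """
    diff = obs[:, None, :] - means[None, :, :]
    return -0.5 * np.sum(np.log(2 * np.pi * variances)[None]
                         + diff ** 2 / variances[None], axis=2)

def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def hmm_log_likelihood(obs, log_pi, log_A, means, variances):
    """Forward algorithm in the log domain: log P(obs | HMM)."""
    log_b = log_gaussian(obs, means, variances)
    alpha = log_pi + log_b[0]
    for t in range(1, len(obs)):
        # alpha_j(t) = logsum_i(alpha_i(t-1) + log A_ij) + log b_j(o_t)
        alpha = logsumexp(alpha[:, None] + log_A, axis=0) + log_b[t]
    return logsumexp(alpha, axis=0)

def event_confidences(obs, event_models):
    """One HMM per event; softmax over length-normalised log-likelihoods.

    The softmax is an assumed normalisation, not taken from the source.
    """
    ll = np.array([hmm_log_likelihood(obs, *m) for m in event_models]) / len(obs)
    e = np.exp(ll - ll.max())
    return e / e.sum()
```

Under these assumptions, the same routine would apply unchanged to the audio modality, with MFCC vectors in place of color histograms.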

4.2. Motion-based Analysis

4.2.1. Polynomial Approximation

For every estimated dense motion field, which is computed as described in Section 3, a corresponding motion energy field, $M(b, c, t)$, is calculated. The latter, which represents a motion energy distribution surface, is approximated by a 2D polynomial function of the following form:

$$f(p, q) = \sum_{k,l} a_{kl} \cdot \left((p - p_0)^k \cdot (q - q_0)^l\right), \quad 0 \le k, l \le T \ \text{and} \ 0 \le k + l \le T \qquad (1)$$

The approximation is performed using the least-squares method. Subsequently, the estimated polynomial coefficients, $a_{kl}$, are used to form the respective shot's motion observation sequence. The latter is provided as input to the developed HMM structure for performing the association of each shot with the supported events based solely on motion information [2]. Similarly to the color and audio analysis cases, a degree of confidence, $h^M_{ij}$, is calculated to indicate the corresponding association.

4.2.2. Accumulated Motion Energy Field Computation

In order to overcome the problem of distinguishing between events that may present similar motion patterns over a period of time during their occurrence, an accumulated motion energy field is estimated with respect to every computed $M(b, c, t)$ [2], according to the following equation:

$$M_{acc}(b, c, t, \tau) = \frac{\sum_{\tau'=0}^{\tau} w(\tau') \cdot M(b, c, t - \tau')}{\sum_{\tau'=0}^{\tau} w(\tau')} \qquad (2)$$

where $w(\tau)$ is modeled by the following time-descending function:

$$w(\tau) = \frac{1}{v^{f \cdot \tau}}, \quad v > 1 \qquad (3)$$

Following their extraction, a procedure similar to the one described in Section 4.2.1 is followed for providing motion information to the respective HMM structure, where now the computed $M_{acc}(b, c, t, \tau)$ are used during the polynomial approximation process, instead of the $M(b, c, t)$. In Fig. 1, an indicative example of energy field polynomial approximation ($\hat{M}_{acc}(x, y, t, \tau)$) is presented.

[Fig. 1: selected frame; $M_{acc}(b, c, t, \tau)$ for $\tau = 0$; $M_{acc}(b, c, t, \tau)$ for $\tau = 2$; $\hat{M}_{acc}(x, y, t, \tau)$ for $\tau = 2$.]

[Fig. 2: BN structure for information fusion, with nodes Audio ($h^A_{ij}$), Color ($h^C_{ij}$) and Motion ($h^M_{ij}$), having states $A_{j1}, \ldots, A_{jQ}$, $C_{j1}, \ldots, C_{jQ}$ and $M_{j1}, \ldots, M_{jQ}$, respectively.]
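The two motion-processing steps of Section 4.2, namely the weighted accumulation of energy fields and the least-squares polynomial approximation of Eq. (1), can be sketched as follows. The default values of `v`, `f` and `T`, the choice of the field centre as $(p_0, q_0)$ and the function names are assumptions. The poster fixes none of them, and it does not state how the energy field itself is derived from the motion vectors, so the sketch takes the energy fields as given.

```python
import numpy as np

def accumulated_energy(M, t, tau, v=2.0, f=1.0):
    """Weighted mean of M[t], M[t-1], ..., M[t-tau] with w(lag) = 1 / v**(f*lag).

    M: array of shape (frames, rows, cols); requires t - tau >= 0.
    v and f defaults are hypothetical; the source only requires v > 1.
    """
    lags = np.arange(tau + 1)
    w = 1.0 / v ** (f * lags)
    stack = M[t - lags]                      # shape (tau + 1, rows, cols)
    return np.tensordot(w, stack, axes=1) / w.sum()

def fit_polynomial(field, T=3):
    """Least-squares coefficients a_kl of the 2D polynomial of Eq. (1).

    (p0, q0) is taken as the field centre, which is an assumption.
    Returns a dict mapping (k, l) to a_kl for all 0 <= k + l <= T.
    """
    rows, cols = field.shape
    p, q = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    p0, q0 = (rows - 1) / 2.0, (cols - 1) / 2.0
    exps = [(k, l) for k in range(T + 1) for l in range(T + 1 - k)]
    # One design-matrix column per monomial (p - p0)^k * (q - q0)^l.
    X = np.stack([((p - p0) ** k * (q - q0) ** l).ravel() for k, l in exps],
                 axis=1)
    a, *_ = np.linalg.lstsq(X, field.ravel(), rcond=None)
    return dict(zip(exps, a))
```

Stacking the coefficient vectors obtained for successive time instants of a shot would then give a motion observation sequence in the sense of Section 4.2.1.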

5. FUSION AND CONTEXT EXPLOITATION

5.1. Information Fusion

Under the proposed approach, BNs are employed for fusing the computed single-modality analysis results. In particular, a set of $J$ BNs is introduced, one for every defined event $e_j$. In Fig. 2 the network structure of every utilized BN is illustrated. This network topology defines explicitly the causal relationships between the respective variables, i.e. the event that is depicted in a shot determines the features observed with respect to every modality. Every BN estimates a degree of belief for the parent node, which constitutes a quantitative indication of the association between each shot $s_i$ and the respective event $e_j$ based on multi-modal information.
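A per-event fusion network with this causal structure, a binary event node as the parent of one node per modality, can be sketched as follows. The sketch assumes that each modality confidence is discretised into $Q$ equal-width states and that the conditional probability tables are estimated from labelled shots with Laplace smoothing. The poster specifies neither the discretisation nor the training procedure, so the function names, `Q` and `alpha` are all assumptions.

```python
import numpy as np

def discretise(h, Q):
    """Map a confidence in [0, 1] to one of Q equal-width states (assumed)."""
    return min(int(h * Q), Q - 1)

def fit_event_bn(states, labels, Q, alpha=1.0):
    """Estimate the parameters of one event's fusion BN.

    states: (n, 3) integer states for audio, color, motion per training shot.
    labels: (n,) booleans, True when the event occurs in the shot.
    Returns the prior over the event node, shape (2,), and the CPTs
    P(state | event) with shape (2, 3, Q), both Laplace-smoothed.
    """
    labels = np.asarray(labels, dtype=int)
    prior = (np.bincount(labels, minlength=2) + alpha) / (len(labels) + 2 * alpha)
    cpt = np.full((2, 3, Q), alpha)
    for row, e in zip(states, labels):
        for m, s in enumerate(row):
            cpt[e, m, s] += 1
    cpt /= cpt.sum(axis=2, keepdims=True)
    return prior, cpt

def event_belief(prior, cpt, observed):
    """Degree of belief P(event = True | observed audio/color/motion states)."""
    joint = prior * np.prod(cpt[:, np.arange(3), observed], axis=1)
    return joint[1] / joint.sum()
```

Because the modality nodes are conditionally independent given the event node, the posterior reduces to a product of three table look-ups, so evaluating all $J$ networks for a shot is inexpensive.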

5.2. Context Exploitation

In order to overcome the inherent ambiguity of the visual medium, an integrated BN model is introduced for acquiring and exploiting the appropriate contextual information, i.e. the temporal order in which the supported events occur. Specifically, the developed BN, whose topology is illustrated in Fig. 3, receives as input the shot-event associations based on multi-modal information for every shot $s_i$ (Section 5.1), as well as for all its neighboring shots that lie within a certain time window. It must be noted that all network nodes except $Event^F_{ij}$ correspond to the appropriate parent nodes of the BNs that have been developed for performing information fusion (Section 5.1). At the evaluation stage, the integrated BN estimates a degree of belief for every $Event^F_{ij}$ node, denoted by $h^F_{ij}$, which indicates the degree of confidence with which event $e_j$ is eventually associated with shot $s_i$.

[Fig. 3 (partial node labels): $Event^1_{i+1}$, $Event^2_{i+1}$, ..., $Event^J_{i+1}$, each with states True/False.]

6. EXPERIMENTAL RESULTS AND CONCLUSIONS

The proposed framework was tested on videos belonging to the news broadcast domain [2]. The results presented in Table 1 demonstrate the efficiency of the proposed approach.

[Fig. 3 (partial node labels): $Event^1_{i-TW}$, $Event^2_{i-TW}$, ..., $Event^J_{i-TW}$; $Event^1_{i-1}$, $Event^2_{i-1}$, ..., $Event^J_{i-1}$; $Event^1_i$, $Event^2_i$, ..., $Event^J_i$, each with states True/False.]

7. REFERENCES