<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A UNIFIED FRAMEWORK FOR SEMANTIC EVENT DETECTION</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>G. Th. Papadopoulos</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>V. Mezaris</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>I. Kompatsiaris</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>M. G. Strintzis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Informatics and Telematics Institute, Centre for Research and Technology Hellas</institution>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Information Processing Lab., Electrical &amp; Comp. Eng. Dep., Aristotle Univ. of Thessaloniki</institution>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this poster, a generic multi-modal context-aware framework for detecting high-level semantic events in video sequences is presented. First, Hidden Markov Models (HMMs) perform an initial association of the examined video with the events of interest, separately for every modality. Then, an integrated Bayesian Network (BN) is introduced that simultaneously fuses the single-modality results and exploits contextual knowledge.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>In recent years, intense research efforts have concentrated on
the development of sophisticated and user-friendly systems for the
skilful management of video sequences. Most of them have adopted the
fundamental principle of shifting video manipulation techniques
towards the processing of the visual content at a semantic level.
Moreover, the use of multi-modal as well as contextual information has
emerged as a common practice for overcoming the ambiguity that is
inherent in the visual medium.</p>
      <p>In this poster, a generic multi-modal context-aware framework
for detecting high-level semantic events in video sequences, making
use of Machine Learning algorithms for implicit knowledge
acquisition, is presented. First, HMMs perform an initial association
of the examined video with the events of interest, separately for
every utilized modality. Then, an integrated BN is introduced that
simultaneously fuses the individual modality analysis results and
exploits contextual knowledge.</p>
    </sec>
    <sec id="sec-2">
      <title>2. OBJECTIVE OF WORK</title>
      <p>The objective of the proposed approach is the detection of a set of
predefined semantic events, denoted by <inline-formula><tex-math>E = \{e_j,\ j = 1, \ldots, J\}</tex-math></inline-formula>, for
a particular domain. These events represent semantically meaningful
incidents that are of interest in a possible application case and have a
temporal duration. Their accurate and efficient detection can
facilitate tasks like video indexing, search and retrieval with respect
to semantic criteria [1].</p>
    </sec>
    <sec id="sec-3">
      <title>3. VIDEO PRE-PROCESSING</title>
      <p>At the signal level, the examined video sequence is initially
segmented into shots, denoted by <inline-formula><tex-math>S = \{s_i,\ i = 1, \ldots, I\}</tex-math></inline-formula>, which
constitute the elementary image sequences of video. For every shot, a
global-level color histogram is calculated at equally spaced time
intervals. Similarly, a set of dense motion fields is estimated with
respect to the motion modality. Moreover, the widely used Mel
Frequency Cepstral Coefficients (MFCCs) are utilized for the audio
information processing.</p>
      <p>The work presented in this paper was supported by the European
Commission under contracts FP6-001765 aceMedia, FP6-027685 MESH and
FP6-027026 K-Space.</p>
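      <p>The color pre-processing step of this section (one global color histogram per shot at equally spaced time instants) could be sketched as follows. This is only an illustrative sketch: the pixel format, the number of bins per channel and the sampling step are assumptions, and the names <monospace>color_histogram</monospace> and <monospace>shot_color_observations</monospace> are hypothetical, not taken from the paper.</p>
      <preformat><![CDATA[
```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # 8-bit (R, G, B); an assumed pixel format


def color_histogram(frame: Sequence[Pixel], bins: int = 4) -> List[float]:
    """Global, normalized RGB histogram with `bins` levels per channel."""
    hist = [0.0] * (bins ** 3)
    for r, g, b in frame:
        # Quantize each 8-bit channel into `bins` equal-width levels.
        ri, gi, bi = r * bins // 256, g * bins // 256, b * bins // 256
        hist[(ri * bins + gi) * bins + bi] += 1.0
    total = float(len(frame)) or 1.0  # avoid dividing by zero on empty frames
    return [count / total for count in hist]


def shot_color_observations(frames: Sequence[Sequence[Pixel]],
                            step: int = 5, bins: int = 4) -> List[List[float]]:
    """One histogram every `step` frames: the shot's color observation sequence."""
    return [color_histogram(frames[t], bins) for t in range(0, len(frames), step)]


if __name__ == "__main__":
    red, blue = (255, 0, 0), (0, 0, 255)
    shot = [[red] * 4, [red, red, blue, blue], [blue] * 4]  # three tiny frames
    for histogram in shot_color_observations(shot, step=1, bins=2):
        print(histogram)
```
]]></preformat>
      <p>Each returned histogram is one observation vector; the list of vectors for a shot plays the role of the color observation sequence that is later passed to the HMM.</p>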
    </sec>
    <sec id="sec-4">
      <title>4. HMM-BASED ANALYSIS</title>
    </sec>
    <sec id="sec-5">
      <title>4.1. Color- and Audio-based Analysis</title>
      <p>After a set of color histograms is estimated for each shot, as
described in Section 3, they are utilized to form the corresponding
shot’s color observation sequence. The latter is provided as input to
an HMM structure, which performs the association of every shot with
the supported events based solely on color information. In particular,
an individual degree of confidence, <inline-formula><tex-math>h^{C}_{ij}</tex-math></inline-formula>, is calculated to denote the
degree to which shot <inline-formula><tex-math>s_i</tex-math></inline-formula> is associated with every event <inline-formula><tex-math>e_j</tex-math></inline-formula>. With
respect to the audio information processing, the computed MFCCs
are used to form the shot’s audio observation sequence. Similarly to
the color analysis case, a degree of confidence, <inline-formula><tex-math>h^{A}_{ij}</tex-math></inline-formula>, is calculated to
indicate the corresponding association.</p>
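      <p>One common way to obtain per-event degrees of confidence from HMMs is to train one model per event and compare the likelihoods that the models assign to a shot’s observation sequence. The sketch below follows that idea under several assumptions that are not stated in the paper: observations are vector-quantized to discrete symbols, one HMM per event is used, and the likelihoods are normalized to sum to one. All names and numbers are hypothetical.</p>
      <preformat><![CDATA[
```python
import math
from typing import Dict, List, Sequence, Tuple

# (start, transition, emission) probabilities of one discrete HMM
Hmm = Tuple[List[float], List[List[float]], List[List[float]]]


def log_likelihood(obs: Sequence[int], hmm: Hmm) -> float:
    """Scaled forward algorithm: returns log P(obs | hmm) for non-empty obs."""
    start, trans, emit = hmm
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    log_p = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate through the transitions, then weight by the emission.
            alpha = [sum(alpha[r] * trans[r][s] for r in range(n)) * emit[s][obs[t]]
                     for s in range(n)]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence is impossible under this model
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
        log_p += math.log(scale)
    return log_p


def event_confidences(obs: Sequence[int], models: Dict[str, Hmm]) -> Dict[str, float]:
    """Turn per-event log-likelihoods into confidences that sum to 1."""
    logs = {event: log_likelihood(obs, hmm) for event, hmm in models.items()}
    top = max(logs.values())
    if top == float("-inf"):
        return {event: 1.0 / len(models) for event in models}
    weights = {event: math.exp(value - top) for event, value in logs.items()}
    total = sum(weights.values())
    return {event: w / total for event, w in weights.items()}


if __name__ == "__main__":
    # Two toy 2-state models over a 2-symbol alphabet (all numbers made up).
    models = {
        "event_1": ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [[0.9, 0.1], [0.8, 0.2]]),
        "event_2": ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [[0.2, 0.8], [0.1, 0.9]]),
    }
    print(event_confidences([0, 0, 1, 0], models))
```
]]></preformat>
      <p>The same routine would be run once per modality (color, audio, motion), each with its own observation sequence and its own set of models.</p>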
    </sec>
    <sec id="sec-6">
      <title>4.2. Motion-based Analysis</title>
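      <p>4.2.1. Polynomial Approximation</p>
      <p>For every estimated dense motion field, which is computed as
described in Section 3, a corresponding motion energy field, <inline-formula><tex-math>M(b, c, t)</tex-math></inline-formula>,
is calculated. The latter, which actually represents a motion energy
distribution surface, is approximated by a 2D polynomial function
of the following form:</p>
      <p><disp-formula><label>(1)</label><tex-math>f(p, q) = \sum_{k,l} a_{kl} \cdot \left( (p - p_0)^{k} \cdot (q - q_0)^{l} \right), \quad 0 \le k, l \le T \ \text{and} \ 0 \le k + l \le T</tex-math></disp-formula></p>
      <p>The approximation is performed using the least-squares method.
Subsequently, the estimated polynomial coefficients, <inline-formula><tex-math>a_{kl}</tex-math></inline-formula>, are used to
form the respective shot’s motion observation sequence. The latter
is provided as input to the developed HMM structure for performing
the association of each shot with the supported events based solely
on motion information [2]. Similarly to the color and audio
analysis cases, a degree of confidence, <inline-formula><tex-math>h^{M}_{ij}</tex-math></inline-formula>, is calculated to indicate the
corresponding association.</p>
      <p>4.2.2. Accumulated Motion Energy Field Computation</p>
      <p>In order to overcome the problem of distinguishing between events
that may present similar motion patterns over a period of time during
their occurrence, an accumulated motion energy field is estimated
with respect to every computed <inline-formula><tex-math>M(b, c, t)</tex-math></inline-formula> [2], according to the
following equation:</p>
      <p><disp-formula><label>(2)</label><tex-math>M_{acc}(b, c, t, \tau) = \frac{\sum_{\tau' = 0}^{\tau} w(\tau') \cdot M(b, c, t - \tau')}{\sum_{\tau' = 0}^{\tau} w(\tau')}</tex-math></disp-formula></p>
      <p>where <inline-formula><tex-math>w(\tau)</tex-math></inline-formula> is modeled by the following time-descending function:</p>
      <p><disp-formula><label>(3)</label><tex-math>w(\tau) = \frac{1}{v^{f \cdot \tau}}, \quad v &gt; 1</tex-math></disp-formula></p>
      <p>[Fig. 1: panels showing the selected frame, <inline-formula><tex-math>M_{acc}(b, c, t, \tau)</tex-math></inline-formula> for <inline-formula><tex-math>\tau = 0</tex-math></inline-formula>, <inline-formula><tex-math>M_{acc}(b, c, t, \tau)</tex-math></inline-formula> for <inline-formula><tex-math>\tau = 2</tex-math></inline-formula>, and <inline-formula><tex-math>\hat{M}_{acc}(x, y, t, \tau)</tex-math></inline-formula> for <inline-formula><tex-math>\tau = 2</tex-math></inline-formula>.]</p>
      <p>[Fig. 2: BN structure, with the nodes Audio (<inline-formula><tex-math>h^{A}_{ij}</tex-math></inline-formula>), Color (<inline-formula><tex-math>h^{C}_{ij}</tex-math></inline-formula>) and Motion (<inline-formula><tex-math>h^{M}_{ij}</tex-math></inline-formula>), whose states are <inline-formula><tex-math>A_{j1}, \ldots, A_{jQ}</tex-math></inline-formula>, <inline-formula><tex-math>C_{j1}, \ldots, C_{jQ}</tex-math></inline-formula> and <inline-formula><tex-math>M_{j1}, \ldots, M_{jQ}</tex-math></inline-formula>, respectively.]</p>
      <p>Following their extraction, a procedure similar to the one described
in Section 4.2.1 is followed for providing motion information to the
respective HMM structure, where now the computed <inline-formula><tex-math>M_{acc}(b, c, t, \tau)</tex-math></inline-formula>
are used during the polynomial approximation process, instead of the
<inline-formula><tex-math>M(b, c, t)</tex-math></inline-formula>. In Fig. 1, an indicative example of energy field
polynomial approximation (<inline-formula><tex-math>\hat{M}_{acc}(x, y, t, \tau)</tex-math></inline-formula>) is presented.</p>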
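      <p>The two motion-analysis steps of Section 4.2 (accumulating motion energy fields with time-descending weights, then fitting a 2D polynomial by least squares) could be sketched as follows. This is a minimal illustration, not the authors’ implementation. It assumes that the weight is read as 1 / v^(f·τ), that fields are small dense grids indexed as <monospace>field[b][c]</monospace>, and that the least-squares problem is solved through the normal equations. All function names and default parameter values are hypothetical.</p>
      <preformat><![CDATA[
```python
from typing import List, Tuple

Field = List[List[float]]  # motion energy samples indexed as field[b][c]


def weight(tau: int, v: float = 2.0, f: float = 1.0) -> float:
    """Time-descending weight, read here as 1 / v**(f * tau) with v > 1."""
    return 1.0 / v ** (f * tau)


def accumulated_field(fields: List[Field], t: int, tau: int,
                      v: float = 2.0, f: float = 1.0) -> Field:
    """Weighted average of the fields at times t, t-1, ..., t-tau."""
    if tau < 0 or t - tau < 0:
        raise ValueError("need 0 <= tau <= t")
    ws = [weight(k, v, f) for k in range(tau + 1)]
    norm = sum(ws)
    rows, cols = len(fields[t]), len(fields[t][0])
    return [[sum(w * fields[t - k][b][c] for k, w in enumerate(ws)) / norm
             for c in range(cols)] for b in range(rows)]


def poly_terms(T: int) -> List[Tuple[int, int]]:
    """Exponent pairs (k, l) with 0 <= k, l <= T and k + l <= T."""
    return [(k, l) for k in range(T + 1) for l in range(T + 1 - k)]


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Gaussian elimination with partial pivoting (non-singular systems only)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_polynomial(field: Field, T: int, p0: float, q0: float) -> List[float]:
    """Least-squares coefficients a_kl, ordered as in poly_terms(T)."""
    terms = poly_terms(T)
    n = len(terms)
    ata = [[0.0] * n for _ in range(n)]
    atz = [0.0] * n
    for p, row in enumerate(field):
        for q, z in enumerate(row):
            phi = [(p - p0) ** k * (q - q0) ** l for k, l in terms]
            for i in range(n):
                atz[i] += phi[i] * z
                for j in range(n):
                    ata[i][j] += phi[i] * phi[j]
    return solve(ata, atz)


if __name__ == "__main__":
    # Three made-up 3x3 fields; accumulate over tau = 2, then fit with T = 1.
    fields = [[[float(t + p + 2 * q) for q in range(3)] for p in range(3)]
              for t in range(3)]
    acc = accumulated_field(fields, t=2, tau=2)
    print(dict(zip(poly_terms(1), fit_polynomial(acc, T=1, p0=1.0, q0=1.0))))
```
]]></preformat>
      <p>The coefficient vector returned by <monospace>fit_polynomial</monospace> for each field would form one element of the shot’s motion observation sequence. The grid must contain at least as many well-spread samples as there are polynomial terms, otherwise the normal equations are singular.</p>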
    </sec>
    <sec id="sec-7">
      <title>5. FUSION AND CONTEXT EXPLOITATION</title>
    </sec>
    <sec id="sec-8">
      <title>5.1. Information Fusion</title>
      <p>Under the proposed approach, BNs are employed for fusing the
computed single-modality analysis results. In particular, a set of <inline-formula><tex-math>J</tex-math></inline-formula> BNs
is introduced, one for every defined event <inline-formula><tex-math>e_j</tex-math></inline-formula>. Fig. 2 illustrates the network
structure of every utilized BN. This network
topology explicitly defines the causal relationships between the respective
variables, i.e. the event that is depicted in a shot determines the
features observed with respect to every modality. Every BN estimates a
degree of belief for the parent node, which constitutes a quantitative
indication of the association between each shot <inline-formula><tex-math>s_i</tex-math></inline-formula> and the respective
event <inline-formula><tex-math>e_j</tex-math></inline-formula> based on multi-modal information.</p>
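      <p>For a network in which a binary event node is the single parent of the audio, color and motion nodes, the degree of belief for the parent follows directly from Bayes’ rule. The sketch below illustrates this computation. It assumes that each single-modality confidence lies in [0, 1] and is quantized into Q equal-width states, and it uses made-up conditional probability tables; in practice the tables would be learned from training data. The names <monospace>quantize</monospace> and <monospace>fuse</monospace> are hypothetical.</p>
      <preformat><![CDATA[
```python
from typing import Dict, List

MODALITIES = ("audio", "color", "motion")


def quantize(h: float, q: int) -> int:
    """Map a confidence in [0, 1] to one of q equal-width states (assumed)."""
    return min(int(h * q), q - 1)


def fuse(h: Dict[str, float], prior_true: float,
         cpt: Dict[str, Dict[bool, List[float]]]) -> float:
    """Posterior P(event = True | quantized audio, color, motion evidence)."""
    joint = {True: prior_true, False: 1.0 - prior_true}
    for name in MODALITIES:
        state = quantize(h[name], len(cpt[name][True]))
        for value in (True, False):
            # Children are conditionally independent given the event node.
            joint[value] *= cpt[name][value][state]
    total = joint[True] + joint[False]
    return joint[True] / total if total > 0.0 else prior_true


if __name__ == "__main__":
    # Q = 2 states per modality; P(state | event) values are made up.
    cpt = {name: {True: [0.2, 0.8], False: [0.7, 0.3]} for name in MODALITIES}
    belief = fuse({"audio": 0.9, "color": 0.6, "motion": 0.2}, 0.5, cpt)
    print(round(belief, 4))
```
]]></preformat>
      <p>One such computation would be carried out per event and per shot, giving the multi-modal shot-event association that the context-exploitation stage receives as input.</p>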
    </sec>
    <sec id="sec-9">
      <title>5.2. Context Exploitation</title>
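      <p>In order to overcome the inherent ambiguity of the visual medium,
an integrated BN model is introduced for acquiring and exploiting
the appropriate contextual information, i.e. the supported events’
temporal occurrence order. Specifically, the developed BN, whose
topology is illustrated in Fig. 3, receives as input the shot-event
associations based on multi-modal information of every shot <inline-formula><tex-math>s_i</tex-math></inline-formula>
(Section 5.1), as well as of all its neighboring shots that lie within a
certain time window. It must be noted that all network nodes
except <inline-formula><tex-math>\mathrm{Event}^{F}_{ij}</tex-math></inline-formula> correspond to the appropriate parent nodes of the BNs
that have been developed for performing information fusion (Section
5.1). At the evaluation stage, the integrated BN estimates a degree of
belief for every <inline-formula><tex-math>\mathrm{Event}^{F}_{ij}</tex-math></inline-formula> node, denoted by <inline-formula><tex-math>h^{F}_{ij}</tex-math></inline-formula>, which indicates the
degree of confidence with which event <inline-formula><tex-math>e_j</tex-math></inline-formula> is eventually associated
with shot <inline-formula><tex-math>s_i</tex-math></inline-formula>.</p>
      <p>[Fig. 3: topology of the integrated BN. Recoverable node labels: <inline-formula><tex-math>\mathrm{Event}^{i-TW}_{j}</tex-math></inline-formula>, <inline-formula><tex-math>\mathrm{Event}^{i-1}_{j}</tex-math></inline-formula>, <inline-formula><tex-math>\mathrm{Event}^{i}_{j}</tex-math></inline-formula> and <inline-formula><tex-math>\mathrm{Event}^{i+1}_{j}</tex-math></inline-formula>, for <inline-formula><tex-math>j = 1, \ldots, J</tex-math></inline-formula>; every node has the states True and False.]</p>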
    </sec>
    <sec id="sec-10">
      <title>6. EXPERIMENTAL RESULTS AND CONCLUSIONS</title>
      <p>The proposed framework was tested on videos belonging to the news
broadcast domain [2]. The results presented in Table 1 demonstrate
the efficiency of the proposed approach.</p>
    </sec>
    <sec id="sec-11">
      <title>7. REFERENCES</title>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>