<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Real-time event recognition from video via a “bag-of-activities”</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rolf H. Baxter</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Neil M. Robertson</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David M. Lane</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Heriot-Watt University Edinburgh</institution>
          ,
          <addr-line>Scotland, EH14 4AS</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we present a new method for high-level event recognition, demonstrated in real time on video. Human behaviours have underlying activities that can be used as salient features. We do not assume that the exact temporal ordering of such features is necessary, so we can represent behaviours using an unordered “bag-of-activities”. A weak temporal ordering is imposed during inference, so fewer training exemplars are necessary compared to other methods. Our three-tier architecture comprises low-level tracking, event analysis and high-level recognition. High-level inference is performed using a new extension of the Rao-Blackwellised Particle Filter. We validate our approach using the PETS 2006 video surveillance dataset and our own sequences. Further, we simulate temporal disruption and increased levels of sensor noise.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Considerable attention has been given to the detection of
events in video. These can be considered low-level events
and include agents entering and exiting areas
        <xref ref-type="bibr" rid="ref7">(Fusier et al.,
2007)</xref>
        , and object abandonment
        <xref ref-type="bibr" rid="ref9">(Grabner et al., 2006)</xref>
        .
High-level goals have been recognised from non-visual
data sources with reasonable success
        <xref ref-type="bibr" rid="ref12">(Liao et al., 2007)</xref>
        .
However, there has been far less progress towards
recognising high level goals from low-level video.
      </p>
      <p>
        Detecting events from surveillance video is particularly
challenging due to occlusions and lighting changes. False
detections are frequent, leading to a high degree of noise
for high-level inference. Although complex events can be
specified using semantic models, they are largely
deterministic and treat events as facts (e.g.
        <xref ref-type="bibr" rid="ref20">(Robertson et al.,
2008)</xref>
        ). Mechanisms for dealing with observation
uncertainty are unavailable in these models
        <xref ref-type="bibr" rid="ref10">(Lavee et al., 2009)</xref>
        .
On the other hand, probabilistic models are very successful
in noisy environments, and are at the core of our approach.
Plan recognition researchers such as
        <xref ref-type="bibr" rid="ref17 ref3">(Bui and Venkatesh,
2002; Nguyen et al., 2005)</xref>
        used hierarchical structures to
model human behaviour. By decomposing a goal into states
at different levels of abstraction (e.g. sub-goals, actions), a
training corpus can be used to learn the probability of
transitioning between the states. Although this work does
consider video, a major shortfall is the necessity for training
data, which is often unavailable in surveillance domains.
A common way to avoid this issue is to model “normal”
behaviours for which training data is easier to obtain
        <xref ref-type="bibr" rid="ref11 ref18 ref2 ref20 ref21">(Boiman
and Irani, 2007; Xiang and Gong, 2008)</xref>
        . Activities with a
low probability can then be identified as abnormal.
Because semantic meanings cannot be attached to the
abnormal activities, they cannot be automatically reasoned about
at a higher level, nor explained to an operator.
      </p>
      <p>
        Another alternative to learning temporal structure is to have
it defined by an expert. For simple events this is trivial,
but the specification effort grows at least proportionally with
the complexity of the event. In
        <xref ref-type="bibr" rid="ref11">(Laxton et al., 2007)</xref>
        the Dynamic Belief
Network for making French Toast was manually specified.
Their approach only considers a single goal.
      </p>
      <p>
        Dee and Hogg showed that interesting behaviour can be
identified using motion trajectories
        <xref ref-type="bibr" rid="ref4">(Dee and Hogg, 2004)</xref>
        .
Their model identified regions of the scene that were
visible or obstructed from the agent’s location, and produced
a set of goal locations that were consistent with the agent’s
direction of travel. Goal transitions were penalised so
irregular behaviours were identified via their high cost.
In
        <xref ref-type="bibr" rid="ref1">(Baxter et al., 2010)</xref>
        a simulated proof of concept
suggested behaviours could be identified using temporally
unordered features. This has the advantage that training
exemplars are not required. Our work furthers the idea that
complex behaviour can be semantically recognised using a
feature-based approach. We present methods for
representing behaviours, performing efficient inference, and
demonstrate validity and scalability on real, multi-person video.
      </p>
      <sec id="sec-1-1">
        <title>Enter Agent</title>
      </sec>
      <sec id="sec-1-2">
        <title>Place Item</title>
      </sec>
      <sec id="sec-1-3">
        <title>Form Group</title>
      </sec>
      <sec id="sec-1-4">
        <title>Exit Agent</title>
      </sec>
      <sec id="sec-1-5">
        <title>Part Group</title>
        <p>Leave Item
(1) (2)
(2) (3)
(3)(4)
(4)
((54))</p>
        <sec id="sec-1-5-1">
          <title>a) Watched</title>
        </sec>
        <sec id="sec-1-5-2">
          <title>Item</title>
          <p>LeaveItem
EnterAgent</p>
          <p>ExitAgent
PartGroup
PlaceItem</p>
        </sec>
        <sec id="sec-1-5-3">
          <title>b) Abandoned</title>
        </sec>
        <sec id="sec-1-5-4">
          <title>Item</title>
        </sec>
      </sec>
      <sec id="sec-1-6">
        <title>Leave Item</title>
      </sec>
      <sec id="sec-1-7">
        <title>PlaceItem</title>
      </sec>
      <sec id="sec-1-8">
        <title>ExitAgent</title>
      </sec>
      <sec id="sec-1-9">
        <title>FormGroup</title>
      </sec>
      <sec id="sec-1-10">
        <title>EnterAgent</title>
      </sec>
      <sec id="sec-1-11">
        <title>PartGroup</title>
        <p>(1)
(1)</p>
      </sec>
      <sec id="sec-1-12">
        <title>Enter Agent</title>
      </sec>
      <sec id="sec-1-13">
        <title>Place Item Part Group</title>
      </sec>
      <sec id="sec-1-14">
        <title>Leave Item</title>
      </sec>
      <sec id="sec-1-15">
        <title>Exit Agent</title>
        <p>
          (1)
(2)
(2)
This paper presents a framework with three major
components: (1) low-level object detection and tracking from
video; (2) detecting and labelling simple visual events (e.g.
object placed on floor), and (3) detecting and labelling
high-level, complex events, typically involving multiple
people/objects and lasting several minutes. Our
high-level inference algorithm is based upon the
Rao-Blackwellised Particle Filter
          <xref ref-type="bibr" rid="ref5 ref6">(Doucet et al., 2000a)</xref>
          , and can
recognise both concatenated and switched behaviour. Our
entire framework is capable of real-time inference.
We validate our approach chiefly on real, benchmarked
surveillance data: the PETS 2006 video surveillance
dataset. We report classification accuracy and speed on
four of the original scenarios, and one additional scenario.
The fifth scenario was acquired by merging frames from
different videos to provide a complex, yet commonly
observed behaviour. Further evaluation is conducted by
simulating sensor noise and temporal disruption, and on
additional video recorded in our own vision laboratory.
Throughout this paper the term activity is used to refer to a
specific short-term behaviour that achieves a purpose. An
activity is comprised of any number of atomic actions.
Activities are recognised as simple events. These terms are
interchanged depending upon context. Similarly,
collections of activities construct goals, and will be referred to as
features of that goal. Goals are detected as complex events.
        </p>
    </sec>
    <sec id="sec-2">
      <title>RECOGNITION FRAMEWORK</title>
      <p>Figure 1 illustrates two complex behaviours: Watched Item
and Abandoned Item. Watched Item involves two persons
who enter the scene together. One person places an item
of luggage on the floor and leaves, while the other person
remains in close proximity to the luggage. This scenario
is representative of a person being helped with their bags.
Abandoned Item is subtly different: the two people do not
enter the scene together (Frames 1 and 3 in Figure 1b).</p>
      <p>Traditionally, the proximity of people to their luggage is
used to detect abandonment. This would generate an alert
for both of the above scenarios. To distinguish between
them, we integrate low-level image processing with
highlevel reasoning (Figure 2). We use a hierarchical, modular
framework to provide an extendible system that can be
easily updated with new techniques. Video data is provided
as the source of observations and is processed at three
different levels: Object Detection and Tracking, Simple Event
Recognition, and Complex Event Recognition. Image
processing techniques provide information about objects in
the scene, allowing simple semantic events to be detected.
These then form observations for high-level recognition.
</p>
      <p>[Figure 2: Three-tier architecture. Visual data is processed by Image Processing (Object Detection &amp; Tracking), then Simple Event Recognition, then Complex Event Recognition.]</p>
      <sec id="sec-2-1">
        <title>OBJECT DETECTION AND TRACKING</title>
        <p>
          Static cameras allow foreground pixels to be identified
using background subtraction. This technique compares the
current frame with a known background frame. Pixels that
are different are classed as the foreground. Connected
foreground pixels give foreground blobs, and are collectively
referred to as Bt. The size/location of each blob can be
projected onto real-world coordinates using the camera
calibration information. Two trackers operate on Bt.
Person Tracker: Our person tracker consists of a set of
SIR filters
          <xref ref-type="bibr" rid="ref8">(Gordon et al., 1993)</xref>
          . SIR filters are similar to
Hidden Markov Models (HMMs) in that they determine the
probability of a set of latent variables given a sequence of
observations
          <xref ref-type="bibr" rid="ref19">(Rabiner, 1989)</xref>
          . However, when latent
variables are continuous, exact approaches to inference become
intractable. The SIR filter is an approximation technique
that uses random sampling to reduce the state space.
Our filters consist of one hundred particles representing the
person’s position on the ground plane, velocity, and
direction of travel
          <xref ref-type="bibr" rid="ref13">(Limprasert, 2010)</xref>
          . For each video frame, the
blobs (groups of foreground pixels) that contain people are
quickly identified from Bt using ellipsoid detection. We
denote these blobs Et. For each ellipsoid that cannot be
explained by an existing filter, a new filter is instantiated to
track the person.
        </p>
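        <p>To make the predict-weight-resample recursion concrete, the following minimal Python sketch tracks one person on the ground plane. It is illustrative only: it assumes a constant-velocity motion model and a Gaussian likelihood around the observed blob centroid, and it omits the direction-of-travel and visibility components of the filter described here.</p>
        <preformat><![CDATA[
# Minimal SIR particle filter sketch for ground-plane person tracking.
# Assumptions: constant-velocity motion, Gaussian observation noise.
import numpy as np

rng = np.random.default_rng(0)
N = 100                                   # particles per tracked person
particles = rng.normal(0.0, 0.5, (N, 4))  # state: [x, y, vx, vy]

def sir_step(particles, z, dt=0.04, q=0.05, r=0.2):
    """One predict-weight-resample cycle for observation z = (x, y)."""
    # Predict: propagate each particle with its velocity plus process noise.
    particles[:, :2] += particles[:, 2:] * dt
    particles += rng.normal(0.0, q, particles.shape)
    # Weight: Gaussian likelihood of the observed blob position.
    d2 = ((particles[:, :2] - z) ** 2).sum(axis=1)
    w = np.exp(-0.5 * d2 / r ** 2)
    w /= w.sum()
    # Resample: draw N particles in proportion to their weights.
    return particles[rng.choice(N, size=N, p=w)]

particles = sir_step(particles, z=np.array([0.3, 0.1]))
print(particles[:, :2].mean(axis=0))      # posterior position estimate
]]></preformat>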
        <p>
          In order to address the temporary occlusion of a person
(e.g. people crossing paths), particles also contain a
visibility variable (0/1) to indicate the person’s disappearance.
This variable applies to all particles in the filter. By
combining this variable with a time limit, the filter continues to
predict the person’s location for short occlusions, while
longer occlusions will cause the track to be terminated.
Object Tracker: Our second tracking component
consists of an object detector. In the video sequences this
detects luggage and is similarly heuristic to other
successful approaches
          <xref ref-type="bibr" rid="ref15">(Lv et al., 2006)</xref>
          . To remove person blobs
and counteract the effect of lighting changes, which
spuriously create small foreground blobs, the tracker eliminates
blobs that are not within the heuristically defined range:
0.3m ≤ width, height ≤ 1m. Each remaining blob is
classified as a stationary luggage item if the blob centroid
remains within 30cm of its original position, and is present
for at least 2 continuous seconds. The red rectangle
identifies a tracked luggage item in Figures 1a&amp;b, frame 2.
Inversely, if the blob matching a tracked luggage object
cannot be identified for 1 second, the luggage is classed as
“removed”. To prevent incorrect object removal (e.g. when
a person is occluding the object), the maximum object size
constraint is suspended once an object is recognised.
        </p>
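        <p>The stationary/removed rules above reduce to a small per-candidate state machine. The following Python sketch uses the thresholds from the text (30cm drift, 2s to confirm, 1s to declare removal); the 25fps frame rate and the data layout are illustrative assumptions.</p>
        <preformat><![CDATA[
# Sketch of the stationary-luggage heuristic. The 30 cm / 2 s / 1 s
# thresholds follow the text; the 25 fps frame rate is assumed.
from dataclasses import dataclass

FPS = 25

@dataclass
class Candidate:
    origin: tuple               # centroid (x, y) in metres when first seen
    frames_static: int = 0      # consecutive frames within 30 cm of origin
    frames_missing: int = 0     # consecutive frames with no matching blob
    confirmed: bool = False     # becomes True once static for 2 s

def update(c, centroid):
    """Advance one candidate given this frame's matching centroid (or None)."""
    if centroid is None:
        c.frames_missing += 1
        if c.confirmed and c.frames_missing >= 1 * FPS:
            return "removed"                 # unmatched for 1 s
        return "pending"
    c.frames_missing = 0
    dx, dy = centroid[0] - c.origin[0], centroid[1] - c.origin[1]
    if (dx * dx + dy * dy) ** 0.5 <= 0.30:   # within 30 cm of first position
        c.frames_static += 1
        if c.frames_static >= 2 * FPS:
            c.confirmed = True
            return "stationary"              # static for 2 continuous seconds
    else:                                    # blob moved: restart the clock
        c.origin, c.frames_static = centroid, 0
    return "pending"
]]></preformat>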
      </sec>
      <sec id="sec-2-2">
        <title>SIMPLE EVENT RECOGNITION</title>
        <p>
Simple events can be generated by combining foreground
detection/tracking with basic rules. Table 1 specifies the
set of heuristic modules used in our architecture to encode
these rules. It should be highlighted that the GroupTracker
only uses proximity rules to determine group membership
(we suggest improvements in Future Work). Group Formed
events are triggered when two people approach and remain
within close proximity of each other. Inversely, GroupSplit
events are triggered when two previously “grouped” people
cease being in close proximity.</p>
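        <p>A proximity-only rule pair of this kind can be sketched as follows; the 2m radius and the 1s dwell time are illustrative values, not taken from our configuration.</p>
        <preformat><![CDATA[
# Proximity-only GroupFormed / GroupSplit rules, per pair of tracked people.
# The 2 m radius and 1 s dwell time are illustrative assumptions.
RADIUS, FPS = 2.0, 25

def group_event(dist_m, grouped, frames_close):
    """Return (event or None, grouped, frames_close) for one pair."""
    if not grouped:
        frames_close = frames_close + 1 if dist_m <= RADIUS else 0
        if frames_close >= FPS:              # close for ~1 s
            return "GroupFormed", True, frames_close
    elif dist_m > RADIUS:                    # previously grouped, now apart
        return "GroupSplit", False, 0
    return None, grouped, frames_close
]]></preformat>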
        <p>Although these naive modules achieve reasonable accuracy
on the PETS dataset, it is important to acknowledge that
they would be insufficient for more complex video. The
focus of our work is high-level inference, and thus
state-of-the-art video processing techniques have not been
used. The modularity of our framework allows any
component to be swapped, and thus readily supports the adoption
of improved video processing algorithms. Furthermore,
we demonstrate via simulation that high-level inference
remains robust to increased noise.
</p>
      </sec>
      <sec id="sec-2-3">
        <title>COMPLEX EVENT RECOGNITION</title>
        <p>
Human behaviour involves sequential activities, so it is
natural to model them using directed graphs as in Figure 1.
Dynamic Bayesian Networks (Figure 3) are frequently
chosen for this task, where nodes represent an agent’s state,
solid edges denote dependence, and dashed edges denote
state transitions between time steps
          <xref ref-type="bibr" rid="ref16">(Murphy, 2002)</xref>
          . Each
edge has an associated probability which can be used to
model the inherent variability of human behaviour (Figure 3 is fully explained in section 3).
Like many others,
          <xref ref-type="bibr" rid="ref3">(Bui and Venkatesh, 2002)</xref>
          learnt model
probabilities from a large dataset. However, annotated
libraries of video surveillance do not exist for many
interesting behaviours, leaving no clear path for training
high-level probabilistic models. Similar problems occur
when dealing with military or counter-terrorism
applications, where data is restricted by operational factors.
Alternative approaches include manually specifying the
probabilities, and using a distribution that determines when
transitions are likely to occur
          <xref ref-type="bibr" rid="ref11">(Laxton et al., 2007)</xref>
          .
We hypothesise that many human behaviours can be
recognised without modelling the exact temporal order of
activities. This means that model parameters do not need to
be defined by either an expert or training exemplars. We
consider activities as salient features that characterise a
behaviour. Goals can be recognised by combining a
collection (bag) of activities with a weak temporal ordering.
Feature based recognition algorithms have primarily been
developed for object detection applications. To identify
features that are invariant to scale and rotation, object
images are often transformed into the frequency or scale
domains, where invariant salient features can be more readily
identified
          <xref ref-type="bibr" rid="ref14">(Lowe, 1999)</xref>
          . The similarities between
recognising objects and human behaviours have previously been
noted
          <xref ref-type="bibr" rid="ref1 ref18">(Baxter et al., 2010; Patron et al., 2008)</xref>
          , and it is this
similarity upon which we draw our inspiration.
The agent’s progress towards a target event can be
monitored by tracking the simple events generated.
Fundamentally, the simple events should be consistent with the
target feature set T if T
correctly represents the agent’s behaviour. For instance, if
simple event i is observed but i ∉ T, then i must be a
false detection, or T is not the agent’s true behaviour.
As time increases, more events from T should be generated.
If we make the assumption that each element of a behaviour
is only performed once, then the set of expected simple
events reduces to the elements of T not yet observed. If
T = ⟨α, β, γ⟩ and α has already been observed, then the set
of expected events is ⟨β, γ⟩. In this way a weak temporal
ordering can be applied to the elements of T without learning
their absolute ordering from exemplars.
        </p>
        <p>
          If C is defined as the set of currently observed simple
events, T \ C is the set of expected events. At each time
step, events in T \ C have equal probability, while all other
events have 0 probability. This probability distribution
encapsulates the assumption that each simple event is only
truthfully generated once per behaviour, and is consistent
with other work in the field
          <xref ref-type="bibr" rid="ref11">(Laxton et al., 2007)</xref>
          . We
discuss the implications and limitations of this assumption in
section 5.
        </p>
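        <p>This distribution is trivial to compute. The following Python sketch reproduces the worked example below, using the Watched Item feature set from Figure 1.</p>
        <preformat><![CDATA[
# Desire distribution over the expected set T \ C: uniform over the
# not-yet-observed features of the target behaviour, zero elsewhere.
WATCHED_ITEM = {"LeaveItem", "EnterAgent", "ExitAgent",
                "PartGroup", "PlaceItem"}

def desire_distribution(T, C):
    expected = T - C                        # T \ C
    p = 1.0 / len(expected) if expected else 0.0
    return {event: p for event in expected}

# At t = 0 each of the 5 features has p = 0.2; once EnterAgent has been
# observed, each of the remaining 4 has p = 0.25 (see the worked example).
print(desire_distribution(WATCHED_ITEM, set()))
print(desire_distribution(WATCHED_ITEM, {"EnterAgent"}))
]]></preformat>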
        <p>Worked Example: Using Figure 1’s Watched Item
behaviour as an example, at time step t=0 each of the
5 events (LeaveItem, EnterAgent, ExitAgent, PartGroup,
PlaceItem) has equal probability. In frame 1 (t = 1),
p(EnterAgent) = 0.2. At t = 2, p(EnterAgent) =
0, while ∀i ∈ T \ C : p(i) = 0.25. Note that
p(FormGroup | T = WatchedItem) = 0 at all time
steps, because FormGroup ∉ WatchedItem.
This approach can be captured by the Dynamic Bayesian
Network (DBN) in Figure 3. Nodes within the top two
layers represent elements of the person’s state and can be
collectively referred to as x. The bottom layer represents the
simple event that is observed. The vertical dashed line
distinguishes the boundary between time slices t-1 and t.
Activity observations: Recognition commences at the
bottom of the DBN using the simple-event detection
modules. Ours are described in section 3. Each detection
must be attributed to a tracked object or person.
Desire: Moving up the DBN hierarchy the middle layer
represents the agent’s current desire. A desire is
instantiated with a simple-event (activity) that supports the
complex-event (goal). Given the previous definitions of T
and C the conditional probability for D (desire) is:
p(d_i) = p(d_j), ∀i,j : d_i, d_j ∈ T \ C   (2)</p>
        <p>[Figure 3: The DBN unrolled over time slices t-1 and t. Each slice contains the nodes Interruption I, Target feature set T, Current feature set C, Desire D, and Activity A.]</p>
        <p>Define TP_i as the true positive detection probability
of simple event i. Having now defined A and D, the
emission probabilities can also be defined by the function
E(A_t, D_t):</p>
        <p>E(A_t, D_t) = p(A_t = i | D_t) = TP_i</p>
        <p>
Goal Representation: The top layer in the DBN
represents the agent’s top-level goal and tracks the features that
have been observed. The final node, I, removes an
important limitation in
          <xref ref-type="bibr" rid="ref1">(Baxter et al., 2010)</xref>
          . I represents
behaviour interruption, which indicates that observation A_t
cannot be explained by the state x_t (the top two layers of
the DBN). It implies one of two conditions: 1) A person
has switched their complex behaviour (i.e. goal) and thus
T_{t-1} ≠ T_t. Although humans frequently switch between
behaviours, this condition breaks the assumptions made by
          <xref ref-type="bibr" rid="ref1">(Baxter et al., 2010)</xref>
          , causing catastrophic failure. 2) A_t is
a false detection. In this case, the elements of T and C are
temporarily ignored.
        </p>
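        <p>A simple instantiation of E(A_t, D_t) spreads the residual false-detection mass uniformly over the other events, as sketched below; the event list and the per-event rates TP_i are placeholders, to be obtained by evaluating the simple-event detectors.</p>
        <preformat><![CDATA[
# Illustrative emission function E(A_t, D_t) built from true-positive
# rates TP_i. The uniform spread of the residual mass is an assumption.
EVENTS = ["EnterAgent", "PlaceItem", "FormGroup",
          "PartGroup", "LeaveItem", "ExitAgent"]
TP = {e: 0.8 for e in EVENTS}        # placeholder detector rates

def emission(a, d):
    """p(A_t = a | D_t = d)."""
    if a == d:
        return TP[d]
    return (1.0 - TP[d]) / (len(EVENTS) - 1)
]]></preformat>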
        <p>Variables: Σ is the set of detectable simple events. T
represents a single behaviour (complex event) and ∀t ∈ T :
t ∈ Σ. C represents the elements of T that have been
observed and thus ∀c ∈ C : c ∈ T. D is a prediction of the
next simple event and is drawn from T \ C. Finally, A is the
observed simple event and is drawn from Σ.</p>
        <p>Probabilities: Define Beh(β) as the target feature set for
behaviour β, and Pr(β) as the prior probability of β. The
transition probabilities for latent variables C and T can
then be defined as per Table 2.</p>
        <p>The distribution on values of D is defined by equation 2,
and the emission probabilities by the function E(A_t, D_t).
It should be noted that of all these parameters, only the
functions Beh(β) and E(A_t, D_t) need to be defined by the
user. It is expected that Beh(β) (the set of features
representing behaviour β) can be easily defined by an expert,
while E(A_t, D_t) may be readily obtained by evaluating
the simple-event detectors on a sample dataset. All other
parameters are calculated at run-time, eliminating learning.
</p>
        <sec id="sec-2-3-1">
          <title>RAO-BLACKWELLISED INFERENCE</title>
          <p>
The DBN in Figure 3 is a finite state Markov chain and
could be computed analytically. However, given our target
application of visual surveillance, which has the
requirement of near real-time processing, we adopt a particle
filtering approach to reduce the execution time. In Particle
Filtering the aim is to recursively estimate p(x_{0:t} | y_{0:t}), in
which the state sequence {x_0, ..., x_t} is assumed to be a
hidden Markov process and each element of the observation
sequence {y_0, ..., y_t} is assumed to be independent given
the state (i.e. p(y_t | x_t))
          <xref ref-type="bibr" rid="ref5 ref6">(Doucet et al., 2000b)</xref>
          .
        </p>
          <p>We utilise a Rao-Blackwellised Particle Filter (RBPF) so
that the inherent structure of a DBN can be exploited. We
wish to recursively estimate p(x_t | y_{1:t-1}), for which the
RBPF partitions x_t into two components x_t : (x_t^1, x_t^2)
(Doucet et al., 2000a). This paper will denote the sampled
component by the variable r_t, and the marginalised
component as z_t. In the DBN in Figure 3, r_t : ⟨C_t, T_t, I_t⟩
and z_t : D_t. This leads to the following factorisations:
p(x_t | y_{1:t-1}) = p(z_t | r_t, y_{1:t-1}) p(r_t | y_{1:t-1})   (6)
= p(D_t | C_t, T_t, I_t, y_{1:t-1}) p(C_t, T_t, I_t | y_{1:t-1})   (7)
The factorisation in (7) utilises the inherent structure of
the Bayesian network to perform exact inference on D,
which can be efficiently performed once ⟨C_t, T_t, I_t⟩
has been sampled. Each particle i in the RBPF
represents a posterior estimate (hypothesis) of the form h_t^i :
⟨C_t^i, T_t^i, I_t^i, D_t^i, W_t^i⟩, where W_t^i is the weight of the
particle calculated as p(y_t^i | x_t^i).</p>
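          <p>Concretely, each particle can be represented as a small record holding the sampled components alongside the marginalised desire and the weight; a minimal sketch:</p>
          <preformat><![CDATA[
# One RBPF particle: sampled component r_t = <C_t, T_t, I_t>, plus the
# marginalised desire D_t and importance weight W_t.
from dataclasses import dataclass
from typing import FrozenSet, Optional

@dataclass
class Particle:
    C: FrozenSet[str]        # features of T observed so far
    T: FrozenSet[str]        # hypothesised target feature set (behaviour)
    I: bool                  # interruption flag
    D: Optional[str] = None  # predicted next desire, drawn from T \ C
    W: float = 1.0           # weight p(y_t | x_t)
]]></preformat>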
        <p>
          For brevity we will focus on the application of the RBPF
to our work, but refer the interested reader to
          <xref ref-type="bibr" rid="ref3 ref5 ref6">(Bui and
Venkatesh, 2002; Doucet et al., 2000a)</xref>
          for a generic
introduction to the approach.
        </p>
          <sec id="sec-2-3-1-1">
            <title>Algorithm</title>
            <p>
At time-step 0, T is sampled from the prior and C = ∅
for all N particles. For all other time steps, N particles
are sampled from the weighted distribution from t-1 and
each particle predicts the new state ⟨C_t^i, T_t^i, I_t^i⟩ using the
transition probabilities in Table 2.</p>
        <p>After sampling is complete, the particle set is partitioned
into those where p(y_t | C_t, T_t, I_t) is non-zero, and zero. The
first partition is termed the Eligible set because the
particle states are consistent with the new observation, while
the second partition is termed the Rebirth set. Particles in
the Rebirth set represent those where an interruption has
occurred. For each particle in this set, T and C are
reinitialised according to the prior distribution with a
probability of p(TP), indicating the true positive rate of the
observation. With a probability of 1 - p(TP), particles are
flagged as “FP” (False Positive), and are not re-initialised.
At the next step, the Eligible and Rebirth sets are
recombined and the Rao-Blackwellised posterior is calculated:
p(z_t^i | r_t^i, y_{1:t-1}) = p(D_t^i | C_t^i, T_t^i, I_t^i, y_{1:t-1}). The value of
D_t^i (the agent’s next desire) is then predicted according to
the Rao-Blackwellised posterior. At this point each particle
has a complete state estimate x_t^i, and can be weighted
according to equation 8.
The final step in the algorithm is to calculate the
transition probabilities. This step ensures that the algorithm is
robust to activity recognition errors. The transition
probability encapsulates the probability that the agent really has
performed the predicted feature D_t^i, observed via A_t.</p>
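            <p>The following Python sketch walks one observation through this loop: partition into Eligible and Rebirth sets, re-initialise interrupted particles from the prior, Rao-Blackwellise the desire, then weight and resample. The two-behaviour library, the uniform prior and the single true-positive rate are placeholder assumptions, not our configuration.</p>
            <preformat><![CDATA[
# One step of the RBPF loop described above (illustrative sketch).
import random

BEHAVIOURS = {
    "WatchedItem": frozenset({"EnterAgent", "PlaceItem", "PartGroup",
                              "LeaveItem", "ExitAgent"}),
    "PassingThrough": frozenset({"EnterAgent", "ExitAgent"}),
}
P_TP = 0.8  # assumed detector true-positive rate

def rbpf_step(particles, y):
    """particles: dicts with keys T, C, I. y: the observed simple event."""
    stepped = []
    for p in particles:
        if y in (p["T"] - p["C"]):                    # Eligible set
            p = {"T": p["T"], "C": p["C"] | {y}, "I": False}
        elif random.random() < P_TP:                  # Rebirth: re-initialise
            name = random.choice(list(BEHAVIOURS))    # from the uniform prior
            p = {"T": BEHAVIOURS[name], "C": frozenset(), "I": True}
        else:                                         # flag as false positive
            p = {"T": p["T"], "C": p["C"], "I": True}
        expected = sorted(p["T"] - p["C"])
        p["D"] = random.choice(expected) if expected else None  # RB posterior
        p["W"] = P_TP if not p["I"] else 1.0 - P_TP   # weight p(y_t | x_t)
        stepped.append(p)
    total = sum(q["W"] for q in stepped)              # normalise and resample
    return random.choices(stepped, weights=[q["W"] / total for q in stepped],
                          k=len(stepped))
]]></preformat>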
          </sec>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>RESULTS AND DISCUSSION</title>
      <p>Two datasets were used to evaluate our framework. Five
complex behaviours were extracted from four PETS 2006
scenarios, and our own video dataset contains the same
behaviours but encompasses more variability than PETS in
terms of luggage items and the ordering of events.
Experiments were run on a Dual Core 2.4GHz PC with 4GB RAM.
Figure 4 shows the average F-Scores for the low-level
detectors (trackers, event modules). An F-score is the harmonic
mean of a classifier’s precision and recall, F1 = 2PR/(P+R), with range [0, 1],
where 1 is optimal. Our person tracker performs well
(F-Score 0.92), but occasionally misclassifies non-persons
(e.g. a trolley), instantiates multiple trackers for a single
person, or does not detect all persons entering in close
proximity. The object tracker has an F-Score 0.83, and is
limited by partial obstructions from the body and shadows.
The naivety of our simple event modules makes them
reliant on good tracker performance. Although the average
score is 0.83, the “Group Formed” module is particularly
unreliable (F-Score: 0.6).
</p>
      <sec id="sec-3-1">
        <title>COMPLEX EVENT RECOGNITION</title>
        <p>The five complex behaviours used in our evaluation are:
Passing Through 1 (PT-1): Person enters and leaves,
Passing Through 2 (PT-2): Person enters, places luggage, picks
it up and leaves, Abandoned Object 1 (AO-1): Person meets
with a second person, places luggage and leaves,
Abandoned Object 2 (AO-2): Person enters, places luggage and
leaves, and Watched Item (WI): Two people enter together,
one places luggage and leaves while the other remains in
close proximity to it.</p>
      <p>In section 3 we highlighted that our naive simple-event
modules would perform poorly on more complex video.
To simulate these conditions, we artificially inserted noise
into the observation stream to lower the true positive rate
to 60%. Figure 5 shows that even with this high degree of
noise, complex events can be detected with 0.65 F-Score.
We proposed that the exact temporal order of
observations does not need to be modelled to recognise
human behaviour. Figure 6 supports this thesis by showing
complex-event likelihood for three different activity
permutations of the AO-1 behaviour. In all three cases AO-1
is highly probable, although there are differences in
probability. These differences are because some activity
subsequences are shared between multiple behaviours. For
instance, ⟨PlaceItem, LeaveItem, Exit⟩ matches both
AO-1 and AO-2. There is a low probability that
observations ⟨FormGroup, PartGroup⟩ were false detections,
and thus some probability is removed from AO-1 in
support of AO-2, which can also explain the subsequence
⟨PlaceItem, LeaveItem, Exit⟩. Although observation
order can have an impact on goal probability, it is clear that
our thesis holds for these behaviours.</p>
        <p>In observation 1 the agent enters. The distributions on
the features within each behaviour cause PT-1 to be most
probable because it has the fewest features. The second
observation can only be explained by two behaviours and is
reflected in the figure. At observation six “EnterAgent”
cannot be explained by any of the behaviours, triggering
behaviour interruption. Observation seven can only be
explained by PT-2 and this is reflected in the figure. As a
result, the behaviours that best explain the observations are
WI and PT-2, which matches the ground truth.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>CONCLUSION AND FUTURE WORK</title>
      <p>This paper has argued that data scarcity prevents the
advancement of high-level automated visual surveillance
using probabilistic techniques, and that anomaly detection
side-steps the issue for low-level events. We proposed that
simple visual events can be considered as salient features
and used to recognise more complex events by imposing
a weak temporal ordering. We developed a framework
for end-to-end recognition of complex events from
surveillance video, and demonstrated that our “bag-of-activities”
approach is robust and scalable.</p>
      <p>Section 2.3 made the assumption that for a set of features
defining a behaviour, each feature is only performed once.
This assumption limits our approach but is not as strong as
it may at first appear. An agent who enters and exits the
scene can still re-enter, as this is simply the concatenation
of two behaviours. Each individual behaviour has only
involved one ‘EnterAgent’ event so the assumption is not in
conflict. Furthermore, it is also possible to consider actions
that are opposites. For instance, placing and removing a
bag, or entering and exiting the scene, can both be
considered action pairs that ‘roll back’ the state. Although not
implemented in this paper, further work has shown that this
is an effective means of allowing some action repetition.
The only behaviours prevented by the assumption are those
that require performing action A twice (e.g. placing two
individual bags).</p>
      <p>Clearly, improving the sophistication of the simple event
detection modules is a priority in extending our approach to
more complicated data. The Group Tracker module could
be improved by estimating each person’s velocity and
direction using a Kalman filter. These attributes could then
be merged with the proximity based approach to more
accurately detect the forming and splitting of groups.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>This work was partially funded by the UK Ministry of
Defence under the Competition of Ideas initiative.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name><given-names>Rolf H.</given-names> <surname>Baxter</surname></string-name>,
          <string-name><given-names>Neil M.</given-names> <surname>Robertson</surname></string-name>, and
          <string-name><given-names>David M.</given-names> <surname>Lane</surname></string-name>.
          <article-title>Probabilistic behaviour signatures: Feature-based behaviour recognition in data-scarce domains</article-title>.
          <source>In Proceedings of the 13th International Conference on Information Fusion</source>,
          <year>2010</year>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Oren</given-names>
            <surname>Boiman</surname>
          </string-name>
          and
          <string-name>
            <given-names>Michal</given-names>
            <surname>Irani</surname>
          </string-name>
          .
          <article-title>Detecting irregularities in images and in video</article-title>
          .
          <source>International Journal of Computer Vision</source>
          ,
          <volume>74</volume>
          (
          <issue>1</issue>
          ):
          <fpage>17</fpage>
          -
          <lpage>31</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Hung H.</given-names>
            <surname>Bui</surname>
          </string-name>
          and
          <string-name>
            <given-names>Svetha</given-names>
            <surname>Venkatesh</surname>
          </string-name>
          .
          <article-title>Policy recognition in the abstract hidden Markov model</article-title>
          .
          <source>Journal of Artificial Intelligence Research</source>
          ,
          <volume>17</volume>
          :
          <fpage>451</fpage>
          -
          <lpage>499</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Hannah</given-names>
            <surname>Dee</surname>
          </string-name>
          and
          <string-name>
            <given-names>David</given-names>
            <surname>Hogg</surname>
          </string-name>
          .
          <article-title>Detecting inexplicable behaviour</article-title>
          .
          <source>In British Machine Vision Conference</source>
          , volume
          <volume>477</volume>
          , page 486,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Arnaud</given-names>
            <surname>Doucet</surname>
          </string-name>
          , Nando de Freitas, Kevin Murphy, and Stuart Russell.
          <article-title>Rao-Blackwellised particle filtering for dynamic Bayesian networks</article-title>
          .
          <source>In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence</source>
          , pages
          <fpage>176</fpage>
          -
          <lpage>183</lpage>
          ,
          <year>2000a</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Arnaud</given-names>
            <surname>Doucet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Simon J.</given-names>
            <surname>Godsill</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Christophe</given-names>
            <surname>Andrieu</surname>
          </string-name>
          .
          <article-title>On sequential Monte Carlo sampling methods for Bayesian filtering</article-title>
          .
          <source>Statistics and computing</source>
          ,
          <volume>10</volume>
          (
          <issue>3</issue>
          ):
          <fpage>197</fpage>
          -
          <lpage>208</lpage>
          ,
          <year>2000b</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name><given-names>Florent</given-names> <surname>Fusier</surname></string-name>,
          <string-name><given-names>Valéry</given-names> <surname>Valentin</surname></string-name>,
          <string-name><given-names>François</given-names> <surname>Brémond</surname></string-name>,
          <string-name><given-names>Monique</given-names> <surname>Thonnat</surname></string-name>,
          <string-name><given-names>Mark</given-names> <surname>Borg</surname></string-name>,
          <string-name><given-names>David</given-names> <surname>Thirde</surname></string-name>, and
          <string-name><given-names>James</given-names> <surname>Ferryman</surname></string-name>.
          <article-title>Video understanding for complex activity recognition</article-title>.
          <source>Machine Vision and Applications</source>,
          <volume>18</volume>:<fpage>167</fpage>-<lpage>188</lpage>,
          <year>2007</year>. ISSN 0932-8092.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name><given-names>Neil J.</given-names> <surname>Gordon</surname></string-name>,
          <string-name><given-names>David J.</given-names> <surname>Salmond</surname></string-name>, and
          <string-name><given-names>A. F. M.</given-names> <surname>Smith</surname></string-name>.
          <article-title>Novel approach to nonlinear/non-Gaussian Bayesian state estimation</article-title>.
          <source>Radar and Signal Processing, IEE Proceedings F</source>,
          <volume>140</volume>(<issue>2</issue>):<fpage>107</fpage>-<lpage>113</lpage>,
          <year>April 1993</year>. ISSN 0956-375X.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name><given-names>Helmut</given-names> <surname>Grabner</surname></string-name>,
          <string-name><given-names>Peter M.</given-names> <surname>Roth</surname></string-name>,
          <string-name><given-names>Michael</given-names> <surname>Grabner</surname></string-name>, and
          <string-name><given-names>Horst</given-names> <surname>Bischof</surname></string-name>.
          <article-title>Autonomous learning of a robust background model for change detection</article-title>.
          <source>In Workshop on Performance Evaluation of Tracking and Surveillance</source>, pages
          <fpage>39</fpage>-<lpage>46</lpage>,
          <year>2006</year>.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Gal</given-names>
            <surname>Lavee</surname>
          </string-name>
          , Ehud Rivlin, and
          <string-name>
            <given-names>Michael</given-names>
            <surname>Rudzsky</surname>
          </string-name>
          .
          <article-title>Understanding video events: a survey of methods for automatic interpretation of semantic occurrences in video</article-title>
          .
          <source>IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews</source>,
          <volume>39</volume>
          (
          <issue>5</issue>
          ):
          <fpage>489</fpage>
          -
          <lpage>504</lpage>
          ,
          <year>2009</year>
          . ISSN 1094-6977.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Benjamin</given-names>
            <surname>Laxton</surname>
          </string-name>
          , Jongwoo Lim, and David Kriegman.
          <article-title>Leveraging temporal, contextual and ordering constraints for recognizing complex activities in video</article-title>
          .
          <source>In IEEE Conference on Computer Vision and Pattern Recognition (CVPR '07)</source>, pages
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name><given-names>Lin</given-names> <surname>Liao</surname></string-name>,
          <string-name><given-names>Donald J.</given-names> <surname>Patterson</surname></string-name>,
          <string-name><given-names>Dieter</given-names> <surname>Fox</surname></string-name>, and
          <string-name><given-names>Henry</given-names> <surname>Kautz</surname></string-name>.
          <article-title>Learning and inferring transportation routines</article-title>
          .
          <source>Artificial Intelligence</source>
          ,
          <volume>171</volume>
          (
          <issue>5-6</issue>
          ):
          <fpage>311</fpage>
          -
          <lpage>331</lpage>
          ,
          <year>2007</year>
          . ISSN 0004-3702.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Wasit</given-names>
            <surname>Limprasert</surname>
          </string-name>
          .
          <article-title>People detection and tracking with a static camera</article-title>
          .
          <source>Technical report</source>
          , School of Mathematical and Computer Sciences, Heriot-Watt University,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>David G.</given-names>
            <surname>Lowe</surname>
          </string-name>.
          <article-title>Object recognition from local scale-invariant features</article-title>
          .
          <source>In International Conference on Computer Vision</source>
          , volume
          <volume>2</volume>
          , pages
          <fpage>1150</fpage>
          -
          <lpage>1157</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name><given-names>Fengjun</given-names> <surname>Lv</surname></string-name>,
          <string-name><given-names>Xuefeng</given-names> <surname>Song</surname></string-name>,
          <string-name><given-names>Bo</given-names> <surname>Wu</surname></string-name>,
          <string-name><given-names>Vivek Kumar</given-names> <surname>Singh</surname></string-name>, and
          <string-name><given-names>Ramakant</given-names> <surname>Nevatia</surname></string-name>.
          <article-title>Left luggage detection using Bayesian inference</article-title>
          .
          <source>In Proceedings of PETS</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Kevin P.</given-names>
            <surname>Murphy</surname>
          </string-name>
          .
          <article-title>Dynamic Bayesian networks: representation, inference and learning</article-title>
          .
          <source>PhD thesis</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name><given-names>Nam T.</given-names> <surname>Nguyen</surname></string-name>,
          <string-name><given-names>Dinh Q.</given-names> <surname>Phung</surname></string-name>,
          <string-name><given-names>Svetha</given-names> <surname>Venkatesh</surname></string-name>, and
          <string-name><given-names>Hung</given-names> <surname>Bui</surname></string-name>.
          <article-title>Learning and detecting activities from movement trajectories using the hierarchical hidden Markov models</article-title>
          .
          <source>In Computer Vision and Pattern Recognition</source>
          , volume
          <volume>2</volume>
          , pages
          <fpage>955</fpage>
          -
          <lpage>960</lpage>
          ,
          <year>2005</year>
          . ISBN 0-7695-2372-2.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>Alonso</given-names>
            <surname>Patron</surname>
          </string-name>
          , Eric Sommerlade, and
          <string-name>
            <given-names>Ian</given-names>
            <surname>Reid</surname>
          </string-name>
          .
          <article-title>Action recognition using shared motion parts</article-title>
          .
          <source>In Proceedings of the 8th International Workshop on Visual Surveillance</source>
          ,
          <year>October 2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>Lawrence R.</given-names>
            <surname>Rabiner</surname>
          </string-name>
          .
          <article-title>A tutorial on hidden Markov models and selected applications in speech recognition</article-title>
          .
          <source>In Proceedings of the IEEE</source>
          , volume
          <volume>77</volume>
          , pages
          <fpage>257</fpage>
          -
          <lpage>286</lpage>
          , San Francisco, CA, USA,
          <year>1989</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>Neil</given-names>
            <surname>Robertson</surname>
          </string-name>
          , Ian Reid, and
          <string-name>
            <given-names>Michael</given-names>
            <surname>Brady</surname>
          </string-name>
          .
          <article-title>Automatic human behaviour recognition and explanation for CCTV video surveillance</article-title>
          .
          <source>Security Journal</source>
          ,
          <volume>21</volume>
          (
          <issue>3</issue>
          ):
          <fpage>173</fpage>
          -
          <lpage>188</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <given-names>Tao</given-names>
            <surname>Xiang</surname>
          </string-name>
          and
          <string-name>
            <given-names>Shaogang</given-names>
            <surname>Gong</surname>
          </string-name>
          .
          <article-title>Video behavior profiling for anomaly detection</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          ,
          <volume>30</volume>
          (
          <issue>5</issue>
          ):
          <fpage>893</fpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>