Real-time event recognition from video via a “bag-of-activities”

Rolf H. Baxter, Neil M. Robertson, David M. Lane
Heriot-Watt University, Edinburgh, Scotland, EH14 4AS

Abstract

In this paper we present a new method for high-level event recognition, demonstrated in real-time on video. Human behaviours have underlying activities that can be used as salient features. We do not assume that the exact temporal ordering of such features is necessary, so we can represent behaviours using an unordered “bag-of-activities”. A weak temporal ordering is imposed during inference, so fewer training exemplars are necessary compared to other methods. Our three-tier architecture comprises low-level tracking, event analysis and high-level recognition. High-level inference is performed using a new extension of the Rao-Blackwellised Particle Filter. We validate our approach using the PETS 2006 video surveillance dataset and our own sequences. Further, we simulate temporal disruption and increased levels of sensor noise.

1 INTRODUCTION

Considerable attention has been given to the detection of events in video. These can be considered low-level events and include agents entering and exiting areas (Fusier et al., 2007), and object abandonment (Grabner et al., 2006). High-level goals have been recognised from non-visual data sources with reasonable success (Liao et al., 2007). However, there has been far less progress towards recognising high-level goals from low-level video.

Detecting events from surveillance video is particularly challenging due to occlusions and lighting changes. False detections are frequent, leading to a high degree of noise for high-level inference. Although complex events can be specified using semantic models, these models are largely deterministic and treat events as facts (e.g. Robertson et al., 2008). Mechanisms for dealing with observation uncertainty are unavailable in these models (Lavee et al., 2009). On the other hand, probabilistic models are very successful in noisy environments, and are at the core of our approach.

Plan recognition researchers such as Bui and Venkatesh (2002) and Nguyen et al. (2005) used hierarchical structures to model human behaviour. By decomposing a goal into states at different levels of abstraction (e.g. sub-goals, actions), a training corpus can be used to learn the probability of transitioning between the states. Although this work does consider video, a major shortfall is the necessity for training data, which is often unavailable in surveillance domains.

A common way to avoid this issue is to model “normal” behaviours, for which training data is easier to obtain (Boiman and Irani, 2007; Xiang and Gong, 2008). Activities with a low probability can then be identified as abnormal. Because semantic meanings cannot be attached to the abnormal activities, they cannot be automatically reasoned about at a higher level, nor explained to an operator.

Another alternative to learning temporal structure is to have it defined by an expert. For simple events this is trivial, but the effort increases at least proportionally with the complexity of the event. In (Laxton et al., 2007) the Dynamic Belief Network for making French toast was manually specified. Their approach only considers a single goal.

Dee and Hogg showed that interesting behaviour can be identified using motion trajectories (Dee and Hogg, 2004). Their model identified regions of the scene that were visible or obstructed from the agent's location, and produced a set of goal locations that were consistent with the agent's direction of travel. Goal transitions were penalised, so irregular behaviours were identified via their high cost.

In (Baxter et al., 2010) a simulated proof of concept suggested behaviours could be identified using temporally unordered features.
This has the advantage that training exemplars are not required. Our work furthers the idea that complex behaviour can be semantically recognised using a feature-based approach. We present methods for representing behaviours, performing efficient inference, and demonstrate validity and scalability on real, multi-person video.

This paper presents a framework with three major components: (1) low-level object detection and tracking from video; (2) detecting and labelling simple visual events (e.g. object placed on floor); and (3) detecting and labelling high-level, complex events, typically involving multiple people/objects and lasting several minutes. Our high-level inference algorithm is based upon the Rao-Blackwellised Particle Filter (Doucet et al., 2000a), and can recognise both concatenated and switched behaviour. Our entire framework is capable of real-time inference.

We validate our approach chiefly on real, benchmarked surveillance data: the PETS 2006 video surveillance dataset. We report classification accuracy and speed on four of the original scenarios, and one additional scenario. The fifth scenario was acquired by merging frames from different videos to provide a complex, yet commonly observed behaviour. Further evaluation is conducted by simulating sensor noise and temporal disruption, and on additional video recorded in our own vision laboratory.

Throughout this paper the term activity is used to refer to a specific short-term behaviour that achieves a purpose. An activity is comprised of any number of atomic actions. Activities are recognised as simple events; these terms are interchanged depending upon context. Similarly, collections of activities construct goals, and will be referred to as features of that goal. Goals are detected as complex events.

[Figure 1: two annotated video sequences, (a) Watched Item, labelled with the activities EnterAgent, PlaceItem, PartGroup, LeaveItem and ExitAgent, and (b) Abandoned Item, which additionally includes FormGroup.]
Figure 1: (a) When two agents enter together, an item left by one agent is not a threat when the second agent remains close. (b) When two agents enter separately, it cannot be assumed that the item is the responsibility of the remaining agent.

2 RECOGNITION FRAMEWORK

Figure 1 illustrates two complex behaviours: Watched Item and Abandoned Item. Watched Item involves two persons who enter the scene together. One person places an item of luggage on the floor and leaves, while the other person remains in close proximity to the luggage. This scenario is representative of a person being helped with their bags. Abandoned Item is subtly different: the two people do not enter the scene together (frames 1 and 3 in Figure 1b).

Traditionally, the proximity of people to their luggage is used to detect abandonment. This would generate an alert for both of the above scenarios. To distinguish between them, we integrate low-level image processing with high-level reasoning (Figure 2). We use a hierarchical, modular framework to provide an extendible system that can be easily updated with new techniques. Video data is provided as the source of observations and is processed at three different levels: Object Detection and Tracking, Simple Event Recognition, and Complex Event Recognition. Image processing techniques provide information about objects in the scene, allowing simple semantic events to be detected. These then form observations for high-level recognition.

[Figure 2: block diagram of the architecture, with Visual Data feeding Object Detection & Tracking (image processing), then Simple Event Recognition, then Complex Event Recognition (reasoning).]
Figure 2: Our architecture for complex event recognition

2.1 OBJECT DETECTION AND TRACKING

Static cameras allow foreground pixels to be identified using background subtraction. This technique compares the current frame with a known background frame. Pixels that differ are classed as foreground. Connected foreground pixels give foreground blobs, which are collectively referred to as B_t. The size and location of each blob can be projected onto real-world coordinates using the camera calibration information. Two trackers operate on B_t.
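To make the foreground and blob-extraction step concrete, below is a minimal sketch using frame differencing and connected-component labelling with OpenCV. It illustrates the general technique rather than our implementation; the difference threshold and minimum blob area are assumed values.

import cv2

def extract_blobs(frame, background, diff_thresh=30, min_area=200):
    """Return foreground blobs (B_t) for one frame from a static camera.

    diff_thresh and min_area are illustrative values, not taken from the paper.
    """
    # Compare the current frame with the known background frame.
    diff = cv2.cvtColor(cv2.absdiff(frame, background), cv2.COLOR_BGR2GRAY)

    # Pixels that differ sufficiently are classed as foreground.
    _, mask = cv2.threshold(diff, diff_thresh, 255, cv2.THRESH_BINARY)

    # Connected foreground pixels give foreground blobs.
    n, _, stats, centroids = cv2.connectedComponentsWithStats(mask)
    blobs = []
    for i in range(1, n):  # label 0 is the background component
        x, y, w, h, area = stats[i]
        if area >= min_area:
            blobs.append({"bbox": (int(x), int(y), int(w), int(h)),
                          "centroid": tuple(centroids[i])})
    return blobs

Projecting the size and location of each blob onto ground-plane coordinates via the camera calibration is omitted here.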
Person Tracker: Our person tracker consists of a set of SIR filters (Gordon et al., 1993). SIR filters are similar to Hidden Markov Models (HMMs) in that they determine the probability of a set of latent variables given a sequence of observations (Rabiner, 1989). However, when latent variables are continuous, exact approaches to inference become intractable. The SIR filter is an approximation technique that uses random sampling to reduce the state space.

Our filters consist of one hundred particles representing the person's position on the ground plane, velocity, and direction of travel (Limprasert, 2010). For each video frame, the blobs (groups of foreground pixels) that contain people are quickly identified from B_t using ellipsoid detection. We denote these blobs E_t. For each ellipsoid that cannot be explained by an existing filter, a new filter is instantiated to track the person.

In order to address the temporary occlusion of a person (e.g. people crossing paths), particles also contain a visibility variable (0/1) to indicate the person's disappearance. This variable applies to all particles in the filter. By combining this variable with a time limit, the filter continues to predict the person's location for short occlusions, while longer occlusions cause the track to be terminated.
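A rough sketch of one sampling-importance-resampling (SIR) step for such a tracker is given below. The paper specifies only the state variables (ground-plane position, velocity, direction of travel); the motion and observation noise levels and the Gaussian likelihood are our assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)

def sir_step(particles, obs_xy, dt=0.04, motion_std=0.05, obs_std=0.15):
    """One predict/weight/resample step over N particles.

    particles: (N, 4) array of [x, y, speed, heading] on the ground plane.
    obs_xy:    ground-plane position of the ellipsoid blob matched to this person.
    dt, motion_std, obs_std: assumed values, not from the paper.
    """
    p = particles.copy()

    # Predict: advance each particle along its heading and add process noise.
    p[:, 0] += p[:, 2] * np.cos(p[:, 3]) * dt
    p[:, 1] += p[:, 2] * np.sin(p[:, 3]) * dt
    p[:, :2] += rng.normal(0.0, motion_std, size=(len(p), 2))

    # Weight: Gaussian likelihood of the observed position under each particle.
    d2 = np.sum((p[:, :2] - np.asarray(obs_xy)) ** 2, axis=1)
    w = np.exp(-0.5 * d2 / obs_std ** 2)
    w /= w.sum()

    # Resample: draw particles in proportion to their weights.
    return p[rng.choice(len(p), size=len(p), p=w)]

# Example: one hundred particles initialised around a detection at (2.0, 5.0).
particles = np.column_stack([rng.normal(2.0, 0.2, 100), rng.normal(5.0, 0.2, 100),
                             rng.uniform(0.0, 1.5, 100), rng.uniform(0.0, 2 * np.pi, 100)])
particles = sir_step(particles, (2.1, 5.1))

The visibility flag and the track-termination timer described above are omitted for brevity.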
Object Tracker: Our second tracking component consists of an object detector. In the video sequences this detects luggage and is similarly heuristic to other successful approaches (Lv et al., 2006). To remove person blobs and counteract the effect of lighting changes, which spuriously create small foreground blobs, the tracker eliminates blobs that are not within the heuristically defined range 0.3 m ≤ width/height ≤ 1 m. Each remaining blob is classified as a stationary luggage item if the blob centroid remains within 30 cm of its original position and is present for at least 2 continuous seconds. The red rectangle identifies a tracked luggage item in Figures 1a and 1b, frame 2. Inversely, if the blob matching a tracked luggage object cannot be identified for 1 second, the luggage is classed as “removed”. To prevent incorrect object removal (e.g. when a person is occluding the object), the maximum object size constraint is suspended once an object is recognised.

2.2 SIMPLE EVENT RECOGNITION

Simple events can be generated by combining foreground detection/tracking with basic rules. Table 1 specifies the set of heuristic modules used in our architecture to encode these rules. It should be highlighted that the GroupTracker only uses proximity rules to determine group membership (we suggest improvements in Future Work). GroupFormed events are triggered when two people approach, and remain within close proximity of, each other. Inversely, GroupSplit events are triggered when two previously “grouped” people cease being in close proximity.

Although these naive modules achieve reasonable accuracy on the PETS dataset, it is important to acknowledge that they would be insufficient for more complex video. The focus of our work is high-level inference, and thus state-of-the-art video processing techniques may not have been used. The modularity of our framework allows any component to be swapped, and thus readily supports the adoption of improved video processing algorithms. Furthermore, we demonstrate via simulation that high-level inference remains robust to increased noise.

Table 1: The simple event modules used by our architecture

    Module                      Description
    Agent Tracker               Detects the entry/departure of people from the scene.
    Object Tracker              Upon luggage detection, associates that luggage with the closest person.
    Group Tracker               Identifies when people are in close proximity, and when they split from a single location.
    Abandoned Object Detector   Detects when luggage is ≥ 3 metres from its owner.
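To illustrate the kind of rules Table 1 encodes, the sketch below raises FormGroup/PartGroup events from pairwise ground-plane distances and an abandonment event from the 3-metre owner rule. The 1.5 m grouping radius and the event names used here are assumptions made for the example, not values from our implementation.

from itertools import combinations

GROUP_RADIUS = 1.5      # metres; assumed value, not taken from the paper
ABANDON_RADIUS = 3.0    # metres; luggage >= 3 m from its owner, as in Table 1

def dist(a, b):
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

def simple_events(people, luggage, grouped):
    """Generate simple events for one frame.

    people:  {person_id: (x, y)} ground-plane positions.
    luggage: {item_id: {"pos": (x, y), "owner": person_id}}.
    grouped: set of frozenset({id_a, id_b}) pairs currently grouped (updated in place).
    """
    events = []

    # GroupTracker: proximity-only grouping rule.
    for a, b in combinations(people, 2):
        pair = frozenset((a, b))
        close = dist(people[a], people[b]) < GROUP_RADIUS
        if close and pair not in grouped:
            grouped.add(pair)
            events.append(("FormGroup", pair))
        elif not close and pair in grouped:
            grouped.remove(pair)
            events.append(("PartGroup", pair))

    # Abandoned Object detector: luggage >= 3 m from its associated owner.
    for item_id, item in luggage.items():
        owner = item["owner"]
        if owner in people and dist(item["pos"], people[owner]) >= ABANDON_RADIUS:
            events.append(("ObjectAbandoned", item_id))
    return events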
2.3 COMPLEX EVENT RECOGNITION

Human behaviour involves sequential activities, so it is natural to model them using directed graphs as in Figure 1. Dynamic Bayesian Networks (Figure 3) are frequently chosen for this task, where nodes represent an agent's state, solid edges denote dependence, and dashed edges denote state transitions between time steps (Murphy, 2002). Each edge has an associated probability which can be used to model the inherent variability of human behaviour (Figure 3 is fully explained in section 3).

Like many others, (Bui and Venkatesh, 2002) learnt model probabilities from a large dataset. However, annotated libraries of video surveillance do not exist for many interesting behaviours, leaving no clear path for training high-level probabilistic models. Similar problems occur when dealing with military or counter-terrorism applications, where data is restricted by operational factors. Alternative approaches include manually specifying the probabilities, and using a distribution that determines when transitions are likely to occur (Laxton et al., 2007).

We hypothesise that many human behaviours can be recognised without modelling the exact temporal order of activities. This means that model parameters do not need to be defined by either an expert or training exemplars. We consider activities as salient features that characterise a behaviour. Goals can be recognised by combining a collection (bag) of activities with a weak temporal ordering.

Feature-based recognition algorithms have primarily been developed for object detection applications. To identify features that are invariant to scale and rotation, object images are often transformed into the frequency or scale domains, where invariant salient features can be more readily identified (Lowe, 1999). The similarity between recognising objects and recognising human behaviours has previously been noted (Baxter et al., 2010; Patron et al., 2008), and it is this similarity upon which we draw our inspiration.

Figure 1 helps visualise a behaviour as a set of features. Each ellipse represents a complex event as a bag of activities (cardinality: one). We formally denote a bag by T, the Target event, where each element is drawn from the set of detectable simple events α. Each simple event is a feature. The agent's progress towards a target event can be monitored by tracking the simple events generated. Fundamentally, the simple events should be consistent with T if T correctly represents the agent's behaviour. For instance, if simple event α_i is observed but α_i ∉ T, then α_i must be a false detection, or T is not the agent's true behaviour.

As time increases, more events from T should be generated. If we make the assumption that each element of a behaviour is only performed once, then the set of expected simple events reduces to the elements of T not yet observed. If T = ⟨γ, δ, ε⟩ and γ has already been observed, then the set of expected events is ⟨δ, ε⟩. In this way a weak temporal ordering can be applied to the elements of T without learning their absolute ordering from exemplars.

If C is defined as the set of currently observed simple events, T\C is the set of expected events. At each time step, events in T\C have equal probability, while all other events have 0 probability. This probability distribution encapsulates the assumption that each simple event is only truthfully generated once per behaviour, and is consistent with other work in the field (Laxton et al., 2007). We discuss the implications and limitations of this assumption in section 5.

Worked Example: Using Figure 1's Watched Item behaviour as an example, at time step t = 0 each of the 5 events (LeaveItem, EnterAgent, ExitAgent, PartGroup, PlaceItem) has equal probability. In frame 1 (t = 1), p(EnterAgent) = 0.2. At t = 2, p(EnterAgent) = 0, while ∀ i ∈ T\C : p(i) = 0.25. Note that p(FormGroup | T = WatchedItem) = 0 at all time steps, because FormGroup ∉ WatchedItem.
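A minimal sketch of this expected-event distribution, using the Watched Item bag from the worked example, is given below. The event names follow Figure 1; the code itself is ours.

WATCHED_ITEM = {"LeaveItem", "EnterAgent", "ExitAgent", "PartGroup", "PlaceItem"}

def desire_distribution(target, observed, alphabet):
    """p(D): uniform over the expected events T\\C, zero for every other event."""
    expected = target - observed
    p = 1.0 / len(expected) if expected else 0.0
    return {a: (p if a in expected else 0.0) for a in alphabet}

alphabet = WATCHED_ITEM | {"FormGroup"}

# t = 0: nothing observed, so each of the five Watched Item features has p = 0.2.
print(desire_distribution(WATCHED_ITEM, set(), alphabet))

# After EnterAgent is observed, the four remaining features each have p = 0.25,
# while EnterAgent and FormGroup (not in the bag) have probability 0.
print(desire_distribution(WATCHED_ITEM, {"EnterAgent"}, alphabet))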
[Figure 3: a two-slice Dynamic Bayesian Network with an Interruption node I, a Target feature set node T, a Current feature set node C, a Desire node D and an observed Activity node A, shown at time slices t−1 and t.]
Figure 3: The top two layers of the Dynamic Bayesian Network predict low-level events for a complex event

3 DYNAMIC BAYESIAN NETWORK

This approach can be captured by the Dynamic Bayesian Network (DBN) in Figure 3. Nodes within the top two layers represent elements of the person's state and can be collectively referred to as x. The bottom layer represents the simple event that is observed. The vertical dashed line marks the boundary between time slices t − 1 and t.

Activity observations: Recognition commences at the bottom of the DBN using the simple-event detection modules described in section 2.2. Each detection must be attributed to a tracked object or person.

Desire: Moving up the DBN hierarchy, the middle layer represents the agent's current desire. A desire is instantiated with a simple event (activity) that supports the complex event (goal). Given the previous definitions of T and C, the conditional probability for D (desire) is:

    p(d_i) = p(d_j)    ∀ i, j : d_i, d_j ∈ T\C    (1)
    p(d_k) = 0         ∀ k : d_k ∉ T\C            (2)

Define TP(α_i) as the true positive detection probability of simple event α_i. Having now defined A and D, the emission probabilities can also be defined by the function E(A_t, D_t):

    E(A_t, D_t) = p(A_t = α_i | D_t = α_j)    (3)
                = TP(α_i)        if i = j      (4)
                = 1 − TP(α_i)    if i ≠ j      (5)

Goal Representation: The top layer in the DBN represents the agent's top-level goal and tracks the features that have been observed. The final node, I, removes an important limitation of (Baxter et al., 2010). I represents behaviour interruption, which indicates that observation A_t cannot be explained by the state x_t (the top two layers of the DBN). It implies one of two conditions: (1) the person has switched their complex behaviour (e.g. goal) and thus T_{t−1} ≠ T_t; although humans frequently switch between behaviours, this condition breaks the assumptions made by (Baxter et al., 2010), causing catastrophic failure. (2) A_t is a false detection; in this case, the elements of T and C are temporarily ignored.

3.1 MODEL PARAMETERS

Given the model description above, the DBN parameters can be summarised as follows.

Variables: α is the set of detectable simple events. T represents a single behaviour (complex event) and ∀ t ∈ T : t ∈ α. C represents the elements of T that have been observed, and thus ∀ c ∈ C : c ∈ T. D is a prediction of the next simple event and is drawn from T\C. Finally, A is the observed simple event and is drawn from α.

Probabilities: Define Beh(β) as the target feature set for behaviour β, and Pr(β) as the prior probability of β. The transition probabilities for the latent variables C and T can then be defined as per Table 2. The distribution on values of D is defined by equations 1 and 2, and the emission probabilities by equations 3 to 5.

It should be noted that of all these parameters, only the functions Beh(β) and E(A_t, D_t) need to be defined by the user. It is expected that Beh(β) (the set of features representing behaviour β) can be easily defined by an expert, while E(A_t, D_t) may be readily obtained by evaluating the simple-event detectors on a sample dataset. All other parameters are calculated at run-time, eliminating learning.
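As an illustration of these two user-supplied quantities, the sketch below defines Beh(β) for the two behaviours of Figure 1 and the emission function of equations 3 to 5. The per-event true-positive rates are placeholder values of the kind that would be obtained by evaluating the simple-event detectors on a sample dataset.

# Beh(beta): the target feature set (bag) for each behaviour, read off Figure 1.
BEH = {
    "WatchedItem":   {"EnterAgent", "PlaceItem", "PartGroup", "LeaveItem", "ExitAgent"},
    "AbandonedItem": {"EnterAgent", "FormGroup", "PlaceItem", "PartGroup",
                      "LeaveItem", "ExitAgent"},
}

# True-positive detection rate per simple event; placeholder values only.
TP = {"EnterAgent": 0.9, "ExitAgent": 0.9, "PlaceItem": 0.85, "LeaveItem": 0.85,
      "FormGroup": 0.6, "PartGroup": 0.6}

def emission(a_t, d_t):
    """E(A_t, D_t) = p(A_t = a_i | D_t = a_j): TP(a_i) if i == j, else 1 - TP(a_i)."""
    return TP[a_t] if a_t == d_t else 1.0 - TP[a_t]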
3.2 RAO-BLACKWELLISED INFERENCE

The DBN in Figure 3 is a finite state Markov chain and could be computed analytically. However, given our target application of visual surveillance, which has the requirement of near real-time processing, we adopt a particle filtering approach to reduce the execution time. In particle filtering the aim is to recursively estimate p(x_{0:t} | y_{0:t}), in which the state sequence {x_0, ..., x_t} is assumed to be a hidden Markov process and each element in the observation sequence {y_0, ..., y_t} is assumed to be independent given the state (i.e. p(y_t | x_t)) (Doucet et al., 2000b).

We utilise a Rao-Blackwellised Particle Filter (RBPF) so that the inherent structure of a DBN can be exploited. We wish to recursively estimate p(x_t | y_{1:t−1}), for which the RBPF partitions x_t into two components x_t : (x_t^1, x_t^2) (Doucet et al., 2000a). This paper denotes the sampled component by the variable r_t, and the marginalised component by z_t. In the DBN in Figure 3, r_t : ⟨C_t, T_t, I_t⟩ and z_t : D_t. This leads to the following factorisation:

    p(x_t | y_{1:t−1}) = p(z_t | r_t, y_{1:t−1}) p(r_t | y_{1:t−1})                        (6)
                       = p(D_t | C_t, T_t, I_t, y_{1:t−1}) p(C_t, T_t, I_t | y_{1:t−1})    (7)

The factorisation in (7) utilises the inherent structure of the Bayesian network to perform exact inference on D, which can be carried out efficiently once ⟨C_t, T_t, I_t⟩ has been sampled. Each particle i in the RBPF represents a posterior estimate (hypothesis) of the form h_t^i : ⟨C_t^i, T_t^i, I_t^i, D_t^i, W_t^i⟩, where W_t is the weight of the particle calculated as p(y_t^i | x_t^i).

For brevity we will focus on the application of the RBPF to our work, but refer the interested reader to (Bui and Venkatesh, 2002; Doucet et al., 2000a) for a generic introduction to the approach.

3.2.1 Algorithm

At time step 0, T is sampled from the prior and C = ∅ for all N particles. For all other time steps, N particles are sampled from the weighted distribution from t − 1 and each particle predicts the new state ⟨C_t^i, T_t^i, I_t^i⟩ using the transition probabilities in Table 2.

Table 2: DBN transition probabilities between time steps t − 1 and t

    p(C_t = C_{t−1} ∪ {D_{t−1}} | I_t = 0) = TP(A_{t−1})    when D_{t−1} = A_{t−1}
    p(C_t = C_{t−1} ∪ {D_{t−1}} | I_t = 0) = 0              when D_{t−1} ≠ A_{t−1}
    p(C_t = ∅ | I_t = 1)                   = 1
    p(T_t ≠ T_{t−1} | I_t = 0)             = 0
    p(T_t = Beh(β) | I_t = 1)              = Pr(β)          if A_{t−1} not assumed a false positive
    p(T_t = T_{t−1} | I_t = 1)             = 1              if A_{t−1} assumed a false positive

After sampling is complete, the particle set is partitioned into those where p(y_t | C_t, T_t, I_t) is non-zero and those where it is zero. The first partition is termed the Eligible set because the particle states are consistent with the new observation, while the second partition is termed the Rebirth set. Particles in the Rebirth set represent those where an interruption has occurred. For each particle in this set, T and C are re-initialised according to the prior distribution with a probability of p(TP), the true positive rate of the observation. With a probability of 1 − p(TP), particles are instead flagged as “FP” (False Positive) and are not re-initialised.

At the next step, the Eligible and Rebirth sets are recombined and the Rao-Blackwellised posterior is calculated: p(z_t^i | r_t^i, y_{1:t−1}) = p(D_t^i | C_t^i, T_t^i, I_t^i, y_{1:t−1}). The value of D_t^i (the agent's next desire) is then predicted according to the Rao-Blackwellised posterior. At this point each particle has a complete state estimate x_t^i, and can be weighted according to equation 8. It is important to note that particles flagged as “FP” are weighted with 1 − p(TP).

    p(y_t | x_t^i) = p(A_t | C_t^i, T_t^i, I_t^i, D_t^i)    (8)

The final step in the algorithm is to calculate the transition probabilities. This step ensures that the algorithm is robust to activity recognition errors. The transition probability encapsulates the probability that the agent really has performed the predicted feature D_t^i, observed via A_t.
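A condensed sketch of one update step of this algorithm is given below. It combines the Table 2 transitions with the Eligible/Rebirth partition, but omits the Rao-Blackwellised desire distribution, assumes a uniform prior over behaviours and a single shared true-positive rate, and the weight given to re-initialised particles is our simplification.

import random

def rbpf_step(particles, obs, beh, tp):
    """One simplified RBPF update for a new simple-event observation `obs`.

    particles: list of dicts {"T": behaviour name, "C": set of observed features,
               "FP": bool, "w": weight}.
    beh:       mapping Beh(beta) from behaviour name to its feature set.
    tp:        true-positive rate p(TP) of the simple-event detectors (assumed shared).
    """
    prior = list(beh)  # uniform prior over behaviours (assumption)

    # Resample N particles from the previous weighted set.
    weights = [p["w"] for p in particles]
    particles = [dict(random.choices(particles, weights)[0]) for _ in particles]

    for p in particles:
        if obs in beh[p["T"]] - p["C"]:
            # Eligible: the observation is consistent with the particle's state,
            # so C grows (Table 2, first row) and the particle keeps its behaviour.
            p["C"], p["FP"], p["w"] = p["C"] | {obs}, False, tp
        elif random.random() < tp:
            # Rebirth with behaviour switch: re-initialise T and C from the prior.
            p["T"], p["C"], p["FP"], p["w"] = random.choice(prior), set(), False, tp
        else:
            # Rebirth as false positive: keep the state and down-weight the particle.
            p["FP"], p["w"] = True, 1.0 - tp
    return particles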
4 RESULTS AND DISCUSSION

Two datasets were used to evaluate our framework. Five complex behaviours were extracted from four PETS 2006 scenarios, and our own video dataset contains the same behaviours but encompasses more variability than PETS in terms of luggage items and the ordering of events. Experiments were run on a Dual Core 2.4 GHz PC with 4 GB RAM.

[Figure 4: bar chart of low-level F-scores for people tracking, object tracking and simple events on PETS and Dataset 2.]
Figure 4: The low-level F-scores for object and people tracking, and simple events

Figure 4 shows the average F-scores for the low-level detectors (trackers and event modules). An F-score is a weighted average of a classifier's precision and recall with range [0, 1], where 1 is optimal. Our person tracker performs well (F-score ≥ 0.92), but occasionally misclassifies non-persons (e.g. a trolley), instantiates multiple trackers for a single person, or does not detect all persons entering in close proximity. The object tracker has an F-score ≥ 0.83, and is limited by partial obstructions from the body and by shadows. The naivety of our simple event modules makes them reliant on good tracker performance. Although their average score is 0.83, the “Group Formed” module is particularly unreliable (F-score: 0.6).
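For reference, and assuming the standard F1 measure is intended here, the score is the harmonic mean of precision P and recall R:

    F_1 = 2PR / (P + R),    P = TP / (TP + FP),    R = TP / (TP + FN)

where TP, FP and FN count correct detections, false detections and missed detections respectively.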
[Figure 5: classifier F-score versus number of particles (100 to 500) for PETS, Dataset 2 and a simulated 40% event-noise condition.]
Figure 5: Classifier F-score as the number of particles is increased (reducing speed)

4.1 COMPLEX EVENT RECOGNITION

The five complex behaviours used in our evaluation are: Passing Through 1 (PT-1): person enters and leaves; Passing Through 2 (PT-2): person enters, places luggage, picks it up and leaves; Abandoned Object 1 (AO-1): person meets a second person, places luggage and leaves; Abandoned Object 2 (AO-2): person enters, places luggage and leaves; and Watched Item (WI): two people enter together, one places luggage and leaves, one remains. This last behaviour was synthesised for the PETS dataset by merging track data from scenarios six and four.

Figure 5 compares the average classifier F-scores as the number of particles is increased. Classifications are made after all simple events have been observed by selecting the most likely complex event. A minimum likelihood of 0.3 was imposed to remove extremely weak classifications. As the number of particles increases, accuracy and recall improve. The algorithm remains very efficient with 500 particles, and is capable of processing in excess of 38,000 simple events per second. The classifiers achieve a 0.8 F-score on Dataset 2, and 0.87 on PETS.

In section 2.2 we highlighted that our naive simple-event modules would perform poorly on more complex video. To simulate these conditions, we artificially inserted noise into the observation stream to lower the true positive rate to 60%. Figure 5 shows that even with this high degree of noise, complex events can be detected with a 0.65 F-score.

Table 3 shows classifier confusion across both datasets. Missing object detections cause confusion between PT-1 and PT-2. Behaviours AO-1 and WI differ by only one event (Form Group), and thus absent group detections lead to confusion there.

Table 3: Confusion matrix for the combined video datasets

    Scenario   PT-1   PT-2   WI    AO-1   AO-2
    PT-1       0.92   0      0     0.08   0
    PT-2       0.33   0.58   0     0      0.08
    WI         0      0      0.9   0.1    0
    AO-1       0      0      0.2   0.8    0
    AO-2       0      0      0     0      1

4.2 TEMPORAL ORDER

We proposed that the exact temporal order of observations does not need to be modelled to recognise human behaviour. Figure 6 supports this thesis by showing complex-event likelihood for three different activity permutations of the AO-1 behaviour. In all three cases AO-1 is highly probable, although there are differences in probability. These differences arise because some activity subsequences are shared between multiple behaviours. For instance, ⟨PlaceItem, LeaveItem, Exit⟩ matches both AO-1 and AO-2. There is a low probability that the observations ⟨FormGroup, PartGroup⟩ were false detections, and thus some probability is removed from AO-1 in support of AO-2, which can also explain the subsequence ⟨PlaceItem, LeaveItem, Exit⟩. Although observation order can have an impact on goal probability, it is clear that our thesis holds for these behaviours.

[Figure 6: probability of AO-1 against observation number (1 to 6) for three activity orderings: Ent→FrmG→PrtG→Place→Leave→Ext, Ent→Place→FrmG→PrtG→Leave→Ext, and Ent→FrmG→Place→PrtG→Leave→Ext.]
Figure 6: Observations arriving in different orders still match the correct goal (AO-1). Nomenclature: Enter (Ent), Form Group (FrmG), Part Group (PrtG), Place Item (Place), Leave Item (Leave), Exit (Ext)

4.3 BEHAVIOUR SWITCHING

Our inference algorithm contains components to detect behaviour switching, which occurs when an agent concatenates or otherwise changes their behaviour (see Section 2.3). To demonstrate the effectiveness of these components, Figure 7 plots the probability of each behaviour as observations are received from two concatenated behaviours. The behaviours are WI, followed by PT-2.

[Figure 7: probability of PT-1, PT-2, WI, AO-1 and AO-2 after each observation in the concatenated WI then PT-2 sequence.]
Figure 7: Probability of each behaviour as observations are made. Behaviour switches from WI to PT-2 at observation 6, causing a similar shift in behaviour probability.

In observation 1 the agent enters. The distributions on the features within each behaviour cause PT-1 to be most probable because it has the fewest features. The second observation can only be explained by two behaviours, and this is reflected in the figure. At observation six, “EnterAgent” cannot be explained by any of the behaviours, triggering behaviour interruption. Observation seven can only be explained by PT-2, and this is reflected in the figure. As a result, the behaviours that best explain the observations are WI and PT-2, which matches the ground truth.
5 CONCLUSION AND FUTURE WORK

This paper has argued that data scarcity prevents the advancement of high-level automated visual surveillance using probabilistic techniques, and that anomaly detection side-steps the issue for low-level events. We proposed that simple visual events can be considered as salient features and used to recognise more complex events by imposing a weak temporal ordering. We developed a framework for end-to-end recognition of complex events from surveillance video, and demonstrated that our “bag-of-activities” approach is robust and scalable.

Section 2.3 made the assumption that, for a set of features defining a behaviour, each feature is only performed once. This assumption limits our approach but is not as strong as it may at first appear. An agent who enters and exits the scene can still re-enter, as this is simply the concatenation of two behaviours. Each individual behaviour has only involved one 'EnterAgent' event, so the assumption is not in conflict. Furthermore, it is also possible to consider actions that are opposites. For instance, placing and removing a bag, or entering and exiting the scene, can both be considered action pairs that 'roll back' the state. Although not implemented in this paper, further work has shown that this is an effective means of allowing some action repetition. The only behaviours prevented by the assumption are those that require performing the same action twice (e.g. placing two individual bags).

Clearly, improving the sophistication of the simple event detection modules is a priority in extending our approach to more complicated data. The Group Tracker module could be improved by estimating each person's velocity and direction using a Kalman filter. These attributes could then be merged with the proximity-based approach to more accurately detect the forming and splitting of groups.

Acknowledgements

This work was partially funded by the UK Ministry of Defence under the Competition of Ideas initiative.
References

Rolf H. Baxter, Neil M. Robertson, and David M. Lane. Probabilistic behaviour signatures: Feature-based behaviour recognition in data-scarce domains. In Proceedings of the 13th International Conference on Information Fusion, 2010.

Oren Boiman and Michal Irani. Detecting irregularities in images and in video. International Journal of Computer Vision, 74(1):17–31, 2007.

Hung H. Bui and Svetha Venkatesh. Policy recognition in the abstract hidden Markov model. Journal of Artificial Intelligence Research, 17:451–499, 2002.

Hannah Dee and David Hogg. Detecting inexplicable behaviour. In British Machine Vision Conference, volume 477, page 486, 2004.

Arnaud Doucet, Nando de Freitas, Kevin Murphy, and Stuart Russell. Rao-Blackwellised particle filtering for dynamic Bayesian networks. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pages 176–183, 2000a.

Arnaud Doucet, Simon J. Godsill, and Christophe Andrieu. On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 10(3):197–208, 2000b.

Florent Fusier, Valéry Valentin, François Brémond, Monique Thonnat, Mark Borg, David Thirde, and James Ferryman. Video understanding for complex activity recognition. Machine Vision and Applications, 18:167–188, 2007.

Neil J. Gordon, David J. Salmond, and A. F. M. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F (Radar and Signal Processing), 140(2):107–113, 1993.

Helmut Grabner, Peter M. Roth, Michael Grabner, and Horst Bischof. Autonomous learning of a robust background model for change detection. In Workshop on Performance Evaluation of Tracking and Surveillance, pages 39–46, 2006.

Gal Lavee, Ehud Rivlin, and Michael Rudzsky. Understanding video events: a survey of methods for automatic interpretation of semantic occurrences in video. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 39(5):489–504, 2009.

Benjamin Laxton, Jongwoo Lim, and David Kriegman. Leveraging temporal, contextual and ordering constraints for recognizing complex activities in video. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2007.

Lin Liao, Donald J. Patterson, Dieter Fox, and Henry Kautz. Learning and inferring transportation routines. Artificial Intelligence, 171(5-6):311–331, 2007.

Wasit Limprasert. People detection and tracking with a static camera. Technical report, School of Mathematical and Computer Sciences, Heriot-Watt University, 2010.

David G. Lowe. Object recognition from local scale-invariant features. In International Conference on Computer Vision, volume 2, pages 1150–1157, 1999.

Fengjun Lv, Xuefeng Song, Bo Wu, Vivek Kumar Singh, and Ramakant Nevatia. Left luggage detection using Bayesian inference. In Proceedings of PETS, 2006.

Kevin P. Murphy. Dynamic Bayesian networks: representation, inference and learning. PhD thesis, 2002.

Nam T. Nguyen, Dinh Q. Phung, Svetha Venkatesh, and Hung Bui. Learning and detecting activities from movement trajectories using the hierarchical hidden Markov models. In Computer Vision and Pattern Recognition, volume 2, pages 955–960, 2005.

Alonso Patron, Eric Sommerlade, and Ian Reid. Action recognition using shared motion parts. In Proceedings of the 8th International Workshop on Visual Surveillance, October 2008.

Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.

Neil Robertson, Ian Reid, and Michael Brady. Automatic human behaviour recognition and explanation for CCTV video surveillance. Security Journal, 21(3):173–188, 2008.

Tao Xiang and Shaogang Gong. Video behavior profiling for anomaly detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(5):893, 2008.