Real-time event recognition from video via a “bag-of-activities”

Rolf H. Baxter, Neil M. Robertson, David M. Lane
Heriot-Watt University, Edinburgh, Scotland, EH14 4AS

Abstract

In this paper we present a new method for high-level event recognition, demonstrated in real-time on video. Human behaviours have underlying activities that can be used as salient features. We do not assume that the exact temporal ordering of such features is necessary, so we can represent behaviours using an unordered “bag-of-activities”. A weak temporal ordering is imposed during inference, so fewer training exemplars are necessary compared to other methods. Our three-tier architecture comprises low-level tracking, event analysis and high-level recognition. High-level inference is performed using a new extension of the Rao-Blackwellised Particle Filter. We validate our approach using the PETS 2006 video surveillance dataset and our own sequences. Further, we simulate temporal disruption and increased levels of sensor noise.

1 INTRODUCTION

Considerable attention has been given to the detection of events in video. These can be considered low-level events and include agents entering and exiting areas (Fusier et al., 2007), and object abandonment (Grabner et al., 2006). High-level goals have been recognised from non-visual data sources with reasonable success (Liao et al., 2007). However, there has been far less progress towards recognising high-level goals from low-level video.

Detecting events from surveillance video is particularly challenging due to occlusions and lighting changes. False detections are frequent, leading to a high degree of noise for high-level inference. Although complex events can be specified using semantic models, these models are largely deterministic and treat events as facts (e.g. Robertson et al., 2008). Mechanisms for dealing with observation uncertainty are unavailable in these models (Lavee et al., 2009). On the other hand, probabilistic models are very successful in noisy environments, and are at the core of our approach.

Plan recognition researchers such as Bui and Venkatesh (2002) and Nguyen et al. (2005) used hierarchical structures to model human behaviour. By decomposing a goal into states at different levels of abstraction (e.g. sub-goals, actions), a training corpus can be used to learn the probability of transitioning between the states. Although this work does consider video, a major shortfall is the necessity for training data, which is often unavailable in surveillance domains.

A common way to avoid this issue is to model “normal” behaviours, for which training data is easier to obtain (Boiman and Irani, 2007; Xiang and Gong, 2008). Activities with a low probability can then be identified as abnormal. Because semantic meanings cannot be attached to the abnormal activities, they cannot be automatically reasoned about at a higher level, nor explained to an operator.

Another alternative to learning temporal structure is to have it defined by an expert. For simple events this is trivial, but the effort increases at least proportionally with the complexity of the event. In (Laxton et al., 2007) the Dynamic Belief Network for making French toast was manually specified. Their approach only considers a single goal.

Dee and Hogg showed that interesting behaviour can be identified using motion trajectories (Dee and Hogg, 2004). Their model identified regions of the scene that were visible or obstructed from the agent's location, and produced a set of goal locations that were consistent with the agent's direction of travel. Goal transitions were penalised, so irregular behaviours were identified via their high cost.

In (Baxter et al., 2010) a simulated proof of concept suggested behaviours could be identified using temporally unordered features.
This has the advantage that training exemplars are not required. Our work furthers the idea that complex behaviour can be semantically recognised using a feature-based approach. We present methods for representing behaviours, performing efficient inference, and demonstrate validity and scalability on real, multi-person video.

This paper presents a framework with three major components: (1) low-level object detection and tracking from video; (2) detecting and labelling simple visual events (e.g. object placed on floor); and (3) detecting and labelling high-level, complex events, typically involving multiple people/objects and lasting several minutes. Our high-level inference algorithm is based upon the Rao-Blackwellised Particle Filter (Doucet et al., 2000a), and can recognise both concatenated and switched behaviour. Our entire framework is capable of real-time inference.

We validate our approach chiefly on real, benchmarked surveillance data: the PETS 2006 video surveillance dataset. We report classification accuracy and speed on four of the original scenarios, and one additional scenario. The fifth scenario was acquired by merging frames from different videos to provide a complex, yet commonly observed behaviour. Further evaluation is conducted by simulating sensor noise and temporal disruption, and on additional video recorded in our own vision laboratory.

Throughout this paper the term activity is used to refer to a specific short-term behaviour that achieves a purpose. An activity is comprised of any number of atomic actions. Activities are recognised as simple events; these terms are interchanged depending upon context. Similarly, collections of activities construct goals, and will be referred to as features of that goal. Goals are detected as complex events.

[Figure 1: two annotated video sequences, (a) Watched Item, labelled with the activities EnterAgent, PlaceItem, PartGroup, LeaveItem and ExitAgent, and (b) Abandoned Item, which additionally includes FormGroup.]
Figure 1: (a) When two agents enter together, an item left by one agent is not a threat when the second agent remains close. (b) When two agents enter separately, it cannot be assumed that the item is the responsibility of the remaining agent.

2 RECOGNITION FRAMEWORK

Figure 1 illustrates two complex behaviours: Watched Item and Abandoned Item. Watched Item involves two persons who enter the scene together. One person places an item of luggage on the floor and leaves, while the other person remains in close proximity to the luggage. This scenario is representative of a person being helped with their bags. Abandoned Item is subtly different: the two people do not enter the scene together (frames 1 and 3 in Figure 1b).

Traditionally, the proximity of people to their luggage is used to detect abandonment. This would generate an alert for both of the above scenarios. To distinguish between them, we integrate low-level image processing with high-level reasoning (Figure 2). We use a hierarchical, modular framework to provide an extendible system that can be easily updated with new techniques. Video data is provided as the source of observations and is processed at three different levels: Object Detection and Tracking, Simple Event Recognition, and Complex Event Recognition. Image processing techniques provide information about objects in the scene, allowing simple semantic events to be detected. These then form observations for high-level recognition.

[Figure 2: block diagram of the architecture, with Visual Data feeding Object Detection & Tracking (image processing), then Simple Event Recognition, then Complex Event Recognition (reasoning).]
Figure 2: Our architecture for complex event recognition

2.1 OBJECT DETECTION AND TRACKING

Static cameras allow foreground pixels to be identified using background subtraction. This technique compares the current frame with a known background frame. Pixels that differ are classed as foreground. Connected foreground pixels give foreground blobs, which are collectively referred to as B_t. The size and location of each blob can be projected onto real-world coordinates using the camera calibration information. Two trackers operate on B_t.
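To make the foreground and blob-extraction step concrete, below is a minimal sketch using frame differencing and connected-component labelling with OpenCV. It illustrates the general technique rather than our implementation; the difference threshold and minimum blob area are assumed values.

import cv2

def extract_blobs(frame, background, diff_thresh=30, min_area=200):
    """Return foreground blobs (B_t) for one frame from a static camera.

    diff_thresh and min_area are illustrative values, not taken from the paper.
    """
    # Compare the current frame with the known background frame.
    diff = cv2.cvtColor(cv2.absdiff(frame, background), cv2.COLOR_BGR2GRAY)

    # Pixels that differ sufficiently are classed as foreground.
    _, mask = cv2.threshold(diff, diff_thresh, 255, cv2.THRESH_BINARY)

    # Connected foreground pixels give foreground blobs.
    n, _, stats, centroids = cv2.connectedComponentsWithStats(mask)
    blobs = []
    for i in range(1, n):  # label 0 is the background component
        x, y, w, h, area = stats[i]
        if area >= min_area:
            blobs.append({"bbox": (int(x), int(y), int(w), int(h)),
                          "centroid": tuple(centroids[i])})
    return blobs

Projecting the size and location of each blob onto ground-plane coordinates via the camera calibration is omitted here.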
Person Tracker: Our person tracker consists of a set of SIR filters (Gordon et al., 1993). SIR filters are similar to Hidden Markov Models (HMMs) in that they determine the probability of a set of latent variables given a sequence of observations (Rabiner, 1989). However, when latent variables are continuous, exact approaches to inference become intractable. The SIR filter is an approximation technique that uses random sampling to reduce the state space.

Our filters consist of one hundred particles representing the person's position on the ground plane, velocity, and direction of travel (Limprasert, 2010). For each video frame, the blobs (groups of foreground pixels) that contain people are quickly identified from B_t using ellipsoid detection. We denote these blobs E_t. For each ellipsoid that cannot be explained by an existing filter, a new filter is instantiated to track the person.

In order to address the temporary occlusion of a person (e.g. people crossing paths), particles also contain a visibility variable (0/1) to indicate the person's disappearance. This variable applies to all particles in the filter. By combining this variable with a time limit, the filter continues to predict the person's location for short occlusions, while longer occlusions cause the track to be terminated.
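A rough sketch of one sampling-importance-resampling (SIR) step for such a tracker is given below. The paper specifies only the state variables (ground-plane position, velocity, direction of travel); the motion and observation noise levels and the Gaussian likelihood are our assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)

def sir_step(particles, obs_xy, dt=0.04, motion_std=0.05, obs_std=0.15):
    """One predict/weight/resample step over N particles.

    particles: (N, 4) array of [x, y, speed, heading] on the ground plane.
    obs_xy:    ground-plane position of the ellipsoid blob matched to this person.
    dt, motion_std, obs_std: assumed values, not from the paper.
    """
    p = particles.copy()

    # Predict: advance each particle along its heading and add process noise.
    p[:, 0] += p[:, 2] * np.cos(p[:, 3]) * dt
    p[:, 1] += p[:, 2] * np.sin(p[:, 3]) * dt
    p[:, :2] += rng.normal(0.0, motion_std, size=(len(p), 2))

    # Weight: Gaussian likelihood of the observed position under each particle.
    d2 = np.sum((p[:, :2] - np.asarray(obs_xy)) ** 2, axis=1)
    w = np.exp(-0.5 * d2 / obs_std ** 2)
    w /= w.sum()

    # Resample: draw particles in proportion to their weights.
    return p[rng.choice(len(p), size=len(p), p=w)]

# Example: one hundred particles initialised around a detection at (2.0, 5.0).
particles = np.column_stack([rng.normal(2.0, 0.2, 100), rng.normal(5.0, 0.2, 100),
                             rng.uniform(0.0, 1.5, 100), rng.uniform(0.0, 2 * np.pi, 100)])
particles = sir_step(particles, (2.1, 5.1))

The visibility flag and the track-termination timer described above are omitted for brevity.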
Object Tracker: Our second tracking component consists of an object detector. In the video sequences this detects luggage and is similarly heuristic to other successful approaches (Lv et al., 2006). To remove person blobs and counteract the effect of lighting changes, which spuriously create small foreground blobs, the tracker eliminates blobs that are not within the heuristically defined range 0.3 m ≤ width/height ≤ 1 m. Each remaining blob is classified as a stationary luggage item if the blob centroid remains within 30 cm of its original position and is present for at least 2 continuous seconds. The red rectangle identifies a tracked luggage item in Figures 1a and 1b, frame 2. Inversely, if the blob matching a tracked luggage object cannot be identified for 1 second, the luggage is classed as “removed”. To prevent incorrect object removal (e.g. when a person is occluding the object), the maximum object size constraint is suspended once an object is recognised.

2.2 SIMPLE EVENT RECOGNITION

Simple events can be generated by combining foreground detection/tracking with basic rules. Table 1 specifies the set of heuristic modules used in our architecture to encode these rules. It should be highlighted that the GroupTracker only uses proximity rules to determine group membership (we suggest improvements in Future Work). GroupFormed events are triggered when two people approach, and remain within close proximity of, each other. Inversely, GroupSplit events are triggered when two previously “grouped” people cease being in close proximity.

Although these naive modules achieve reasonable accuracy on the PETS dataset, it is important to acknowledge that they would be insufficient for more complex video. The focus of our work is high-level inference, and thus state-of-the-art video processing techniques may not have been used. The modularity of our framework allows any component to be swapped, and thus readily supports the adoption of improved video processing algorithms. Furthermore, we demonstrate via simulation that high-level inference remains robust to increased noise.

Table 1: The simple event modules used by our architecture

    Module                      Description
    Agent Tracker               Detects the entry/departure of people from the scene.
    Object Tracker              Upon luggage detection, associates that luggage with the closest person.
    Group Tracker               Identifies when people are in close proximity, and when they split from a single location.
    Abandoned Object Detector   Detects when luggage is ≥ 3 metres from its owner.
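To illustrate the kind of rules Table 1 encodes, the sketch below raises FormGroup/PartGroup events from pairwise ground-plane distances and an abandonment event from the 3-metre owner rule. The 1.5 m grouping radius and the event names used here are assumptions made for the example, not values from our implementation.

from itertools import combinations

GROUP_RADIUS = 1.5      # metres; assumed value, not taken from the paper
ABANDON_RADIUS = 3.0    # metres; luggage >= 3 m from its owner, as in Table 1

def dist(a, b):
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

def simple_events(people, luggage, grouped):
    """Generate simple events for one frame.

    people:  {person_id: (x, y)} ground-plane positions.
    luggage: {item_id: {"pos": (x, y), "owner": person_id}}.
    grouped: set of frozenset({id_a, id_b}) pairs currently grouped (updated in place).
    """
    events = []

    # GroupTracker: proximity-only grouping rule.
    for a, b in combinations(people, 2):
        pair = frozenset((a, b))
        close = dist(people[a], people[b]) < GROUP_RADIUS
        if close and pair not in grouped:
            grouped.add(pair)
            events.append(("FormGroup", pair))
        elif not close and pair in grouped:
            grouped.remove(pair)
            events.append(("PartGroup", pair))

    # Abandoned Object detector: luggage >= 3 m from its associated owner.
    for item_id, item in luggage.items():
        owner = item["owner"]
        if owner in people and dist(item["pos"], people[owner]) >= ABANDON_RADIUS:
            events.append(("ObjectAbandoned", item_id))
    return events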
2.3 COMPLEX EVENT RECOGNITION

Human behaviour involves sequential activities, so it is natural to model them using directed graphs as in Figure 1. Dynamic Bayesian Networks (Figure 3) are frequently chosen for this task, where nodes represent an agent's state, solid edges denote dependence, and dashed edges denote state transitions between time steps (Murphy, 2002). Each edge has an associated probability which can be used to model the inherent variability of human behaviour (Figure 3 is fully explained in section 3).

Like many others, (Bui and Venkatesh, 2002) learnt model probabilities from a large dataset. However, annotated libraries of video surveillance do not exist for many interesting behaviours, leaving no clear path for training high-level probabilistic models. Similar problems occur when dealing with military or counter-terrorism applications, where data is restricted by operational factors. Alternative approaches include manually specifying the probabilities, and using a distribution that determines when transitions are likely to occur (Laxton et al., 2007).

We hypothesise that many human behaviours can be recognised without modelling the exact temporal order of activities. This means that model parameters do not need to be defined by either an expert or training exemplars. We consider activities as salient features that characterise a behaviour. Goals can be recognised by combining a collection (bag) of activities with a weak temporal ordering.

Feature-based recognition algorithms have primarily been developed for object detection applications. To identify features that are invariant to scale and rotation, object images are often transformed into the frequency or scale domains, where invariant salient features can be more readily identified (Lowe, 1999). The similarity between recognising objects and recognising human behaviours has previously been noted (Baxter et al., 2010; Patron et al., 2008), and it is this similarity upon which we draw our inspiration.

Figure 1 helps visualise a behaviour as a set of features. Each ellipse represents a complex event as a bag of activities (cardinality: one). We formally denote a bag by T, the Target event, where each element is drawn from the set of detectable simple events α. Each simple event is a feature. The agent's progress towards a target event can be monitored by tracking the simple events generated. Fundamentally, the simple events should be consistent with T if T correctly represents the agent's behaviour. For instance, if simple event α_i is observed but α_i ∉ T, then α_i must be a false detection, or T is not the agent's true behaviour.

As time increases, more events from T should be generated. If we make the assumption that each element of a behaviour is only performed once, then the set of expected simple events reduces to the elements of T not yet observed. If T = ⟨γ, δ, ε⟩ and γ has already been observed, then the set of expected events is ⟨δ, ε⟩. In this way a weak temporal ordering can be applied to the elements of T without learning their absolute ordering from exemplars.

If C is defined as the set of currently observed simple events, T\C is the set of expected events. At each time step, events in T\C have equal probability, while all other events have 0 probability. This probability distribution encapsulates the assumption that each simple event is only truthfully generated once per behaviour, and is consistent with other work in the field (Laxton et al., 2007). We discuss the implications and limitations of this assumption in section 5.

Worked Example: Using Figure 1's Watched Item behaviour as an example, at time step t = 0 each of the 5 events (LeaveItem, EnterAgent, ExitAgent, PartGroup, PlaceItem) has equal probability. In frame 1 (t = 1), p(EnterAgent) = 0.2. At t = 2, p(EnterAgent) = 0, while ∀ i ∈ T\C : p(i) = 0.25. Note that p(FormGroup | T = WatchedItem) = 0 at all time steps, because FormGroup ∉ WatchedItem.
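A minimal sketch of this expected-event distribution, using the Watched Item bag from the worked example, is given below. The event names follow Figure 1; the code itself is ours.

WATCHED_ITEM = {"LeaveItem", "EnterAgent", "ExitAgent", "PartGroup", "PlaceItem"}

def desire_distribution(target, observed, alphabet):
    """p(D): uniform over the expected events T\\C, zero for every other event."""
    expected = target - observed
    p = 1.0 / len(expected) if expected else 0.0
    return {a: (p if a in expected else 0.0) for a in alphabet}

alphabet = WATCHED_ITEM | {"FormGroup"}

# t = 0: nothing observed, so each of the five Watched Item features has p = 0.2.
print(desire_distribution(WATCHED_ITEM, set(), alphabet))

# After EnterAgent is observed, the four remaining features each have p = 0.25,
# while EnterAgent and FormGroup (not in the bag) have probability 0.
print(desire_distribution(WATCHED_ITEM, {"EnterAgent"}, alphabet))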
[Figure 3: a two-slice Dynamic Bayesian Network with an Interruption node I, a Target feature set node T, a Current feature set node C, a Desire node D and an observed Activity node A, shown at time slices t−1 and t.]
Figure 3: The top two layers of the Dynamic Bayesian Network predict low-level events for a complex event

3 DYNAMIC BAYESIAN NETWORK

This approach can be captured by the Dynamic Bayesian Network (DBN) in Figure 3. Nodes within the top two layers represent elements of the person's state and can be collectively referred to as x. The bottom layer represents the simple event that is observed. The vertical dashed line marks the boundary between time slices t − 1 and t.

Activity observations: Recognition commences at the bottom of the DBN using the simple-event detection modules described in section 2.2. Each detection must be attributed to a tracked object or person.

Desire: Moving up the DBN hierarchy, the middle layer represents the agent's current desire. A desire is instantiated with a simple event (activity) that supports the complex event (goal). Given the previous definitions of T and C, the conditional probability for D (desire) is:

    p(d_i) = p(d_j)    ∀ i, j : d_i, d_j ∈ T\C    (1)
    p(d_k) = 0         ∀ k : d_k ∉ T\C            (2)

Define TP(α_i) as the true positive detection probability of simple event α_i. Having now defined A and D, the emission probabilities can also be defined by the function E(A_t, D_t):

    E(A_t, D_t) = p(A_t = α_i | D_t = α_j)    (3)
                = TP(α_i)        if i = j      (4)
                = 1 − TP(α_i)    if i ≠ j      (5)

Goal Representation: The top layer in the DBN represents the agent's top-level goal and tracks the features that have been observed. The final node, I, removes an important limitation of (Baxter et al., 2010). I represents behaviour interruption, which indicates that observation A_t cannot be explained by the state x_t (the top two layers of the DBN). It implies one of two conditions: (1) the person has switched their complex behaviour (e.g. goal) and thus T_{t−1} ≠ T_t; although humans frequently switch between behaviours, this condition breaks the assumptions made by (Baxter et al., 2010), causing catastrophic failure. (2) A_t is a false detection; in this case, the elements of T and C are temporarily ignored.

3.1 MODEL PARAMETERS

Given the model description above, the DBN parameters can be summarised as follows.

Variables: α is the set of detectable simple events. T represents a single behaviour (complex event) and ∀ t ∈ T : t ∈ α. C represents the elements of T that have been observed, and thus ∀ c ∈ C : c ∈ T. D is a prediction of the next simple event and is drawn from T\C. Finally, A is the observed simple event and is drawn from α.

Probabilities: Define Beh(β) as the target feature set for behaviour β, and Pr(β) as the prior probability of β. The transition probabilities for the latent variables C and T can then be defined as per Table 2. The distribution on values of D is defined by equations 1 and 2, and the emission probabilities by equations 3 to 5.

It should be noted that of all these parameters, only the functions Beh(β) and E(A_t, D_t) need to be defined by the user. It is expected that Beh(β) (the set of features representing behaviour β) can be easily defined by an expert, while E(A_t, D_t) may be readily obtained by evaluating the simple-event detectors on a sample dataset. All other parameters are calculated at run-time, eliminating learning.
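As an illustration of these two user-supplied quantities, the sketch below defines Beh(β) for the two behaviours of Figure 1 and the emission function of equations 3 to 5. The per-event true-positive rates are placeholder values of the kind that would be obtained by evaluating the simple-event detectors on a sample dataset.

# Beh(beta): the target feature set (bag) for each behaviour, read off Figure 1.
BEH = {
    "WatchedItem":   {"EnterAgent", "PlaceItem", "PartGroup", "LeaveItem", "ExitAgent"},
    "AbandonedItem": {"EnterAgent", "FormGroup", "PlaceItem", "PartGroup",
                      "LeaveItem", "ExitAgent"},
}

# True-positive detection rate per simple event; placeholder values only.
TP = {"EnterAgent": 0.9, "ExitAgent": 0.9, "PlaceItem": 0.85, "LeaveItem": 0.85,
      "FormGroup": 0.6, "PartGroup": 0.6}

def emission(a_t, d_t):
    """E(A_t, D_t) = p(A_t = a_i | D_t = a_j): TP(a_i) if i == j, else 1 - TP(a_i)."""
    return TP[a_t] if a_t == d_t else 1.0 - TP[a_t]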
3.2 RAO-BLACKWELLISED INFERENCE

The DBN in Figure 3 is a finite state Markov chain and could be computed analytically. However, given our target application of visual surveillance, which has the requirement of near real-time processing, we adopt a particle filtering approach to reduce the execution time. In particle filtering the aim is to recursively estimate p(x_{0:t} | y_{0:t}), in which the state sequence {x_0, ..., x_t} is assumed to be a hidden Markov process and each element in the observation sequence {y_0, ..., y_t} is assumed to be independent given the state (i.e. p(y_t | x_t)) (Doucet et al., 2000b).

We utilise a Rao-Blackwellised Particle Filter (RBPF) so that the inherent structure of a DBN can be exploited. We wish to recursively estimate p(x_t | y_{1:t−1}), for which the RBPF partitions x_t into two components x_t : (x_t^1, x_t^2) (Doucet et al., 2000a). This paper denotes the sampled component by the variable r_t, and the marginalised component by z_t. In the DBN in Figure 3, r_t : ⟨C_t, T_t, I_t⟩ and z_t : D_t. This leads to the following factorisation:

    p(x_t | y_{1:t−1}) = p(z_t | r_t, y_{1:t−1}) p(r_t | y_{1:t−1})                        (6)
                       = p(D_t | C_t, T_t, I_t, y_{1:t−1}) p(C_t, T_t, I_t | y_{1:t−1})    (7)

The factorisation in (7) utilises the inherent structure of the Bayesian network to perform exact inference on D, which can be carried out efficiently once ⟨C_t, T_t, I_t⟩ has been sampled. Each particle i in the RBPF represents a posterior estimate (hypothesis) of the form h_t^i : ⟨C_t^i, T_t^i, I_t^i, D_t^i, W_t^i⟩, where W_t is the weight of the particle calculated as p(y_t^i | x_t^i).

For brevity we will focus on the application of the RBPF to our work, but refer the interested reader to (Bui and Venkatesh, 2002; Doucet et al., 2000a) for a generic introduction to the approach.

3.2.1 Algorithm

At time step 0, T is sampled from the prior and C = ∅ for all N particles. For all other time steps, N particles are sampled from the weighted distribution from t − 1 and each particle predicts the new state ⟨C_t^i, T_t^i, I_t^i⟩ using the transition probabilities in Table 2.

Table 2: DBN transition probabilities between time steps t − 1 and t

    p(C_t = C_{t−1} ∪ {D_{t−1}} | I_t = 0) = TP(A_{t−1})    when D_{t−1} = A_{t−1}
    p(C_t = C_{t−1} ∪ {D_{t−1}} | I_t = 0) = 0              when D_{t−1} ≠ A_{t−1}
    p(C_t = ∅ | I_t = 1)                   = 1
    p(T_t ≠ T_{t−1} | I_t = 0)             = 0
    p(T_t = Beh(β) | I_t = 1)              = Pr(β)          if A_{t−1} not assumed a false positive
    p(T_t = T_{t−1} | I_t = 1)             = 1              if A_{t−1} assumed a false positive

After sampling is complete, the particle set is partitioned into those where p(y_t | C_t, T_t, I_t) is non-zero and those where it is zero. The first partition is termed the Eligible set because the particle states are consistent with the new observation, while the second partition is termed the Rebirth set. Particles in the Rebirth set represent those where an interruption has occurred. For each particle in this set, T and C are re-initialised according to the prior distribution with a probability of p(TP), the true positive rate of the observation. With a probability of 1 − p(TP), particles are instead flagged as “FP” (False Positive) and are not re-initialised.

At the next step, the Eligible and Rebirth sets are recombined and the Rao-Blackwellised posterior is calculated: p(z_t^i | r_t^i, y_{1:t−1}) = p(D_t^i | C_t^i, T_t^i, I_t^i, y_{1:t−1}). The value of D_t^i (the agent's next desire) is then predicted according to the Rao-Blackwellised posterior. At this point each particle has a complete state estimate x_t^i, and can be weighted according to equation 8. It is important to note that particles flagged as “FP” are weighted with 1 − p(TP).

    p(y_t | x_t^i) = p(A_t | C_t^i, T_t^i, I_t^i, D_t^i)    (8)

The final step in the algorithm is to calculate the transition probabilities. This step ensures that the algorithm is robust to activity recognition errors. The transition probability encapsulates the probability that the agent really has performed the predicted feature D_t^i, observed via A_t.
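A condensed sketch of one update step of this algorithm is given below. It combines the Table 2 transitions with the Eligible/Rebirth partition, but omits the Rao-Blackwellised desire distribution, assumes a uniform prior over behaviours and a single shared true-positive rate, and the weight given to re-initialised particles is our simplification.

import random

def rbpf_step(particles, obs, beh, tp):
    """One simplified RBPF update for a new simple-event observation `obs`.

    particles: list of dicts {"T": behaviour name, "C": set of observed features,
               "FP": bool, "w": weight}.
    beh:       mapping Beh(beta) from behaviour name to its feature set.
    tp:        true-positive rate p(TP) of the simple-event detectors (assumed shared).
    """
    prior = list(beh)  # uniform prior over behaviours (assumption)

    # Resample N particles from the previous weighted set.
    weights = [p["w"] for p in particles]
    particles = [dict(random.choices(particles, weights)[0]) for _ in particles]

    for p in particles:
        if obs in beh[p["T"]] - p["C"]:
            # Eligible: the observation is consistent with the particle's state,
            # so C grows (Table 2, first row) and the particle keeps its behaviour.
            p["C"], p["FP"], p["w"] = p["C"] | {obs}, False, tp
        elif random.random() < tp:
            # Rebirth with behaviour switch: re-initialise T and C from the prior.
            p["T"], p["C"], p["FP"], p["w"] = random.choice(prior), set(), False, tp
        else:
            # Rebirth as false positive: keep the state and down-weight the particle.
            p["FP"], p["w"] = True, 1.0 - tp
    return particles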
4 RESULTS AND DISCUSSION

Two datasets were used to evaluate our framework. Five complex behaviours were extracted from four PETS 2006 scenarios, and our own video dataset contains the same behaviours but encompasses more variability than PETS in terms of luggage items and the ordering of events. Experiments were run on a Dual Core 2.4 GHz PC with 4 GB RAM.

[Figure 4: bar chart of low-level F-scores for people tracking, object tracking and simple events on PETS and Dataset 2.]
Figure 4: The low-level F-scores for object and people tracking, and simple events

Figure 4 shows the average F-scores for the low-level detectors (trackers and event modules). An F-score is a weighted average of a classifier's precision and recall with range [0, 1], where 1 is optimal. Our person tracker performs well (F-score ≥ 0.92), but occasionally misclassifies non-persons (e.g. a trolley), instantiates multiple trackers for a single person, or does not detect all persons entering in close proximity. The object tracker has an F-score ≥ 0.83, and is limited by partial obstructions from the body and by shadows. The naivety of our simple event modules makes them reliant on good tracker performance. Although their average score is 0.83, the “Group Formed” module is particularly unreliable (F-score: 0.6).
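For reference, and assuming the standard F1 measure is intended here, the score is the harmonic mean of precision P and recall R:

    F_1 = 2PR / (P + R),    P = TP / (TP + FP),    R = TP / (TP + FN)

where TP, FP and FN count correct detections, false detections and missed detections respectively.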
[Figure 5: classifier F-score versus number of particles (100 to 500) for PETS, Dataset 2 and a simulated 40% event-noise condition.]
Figure 5: Classifier F-score as the number of particles is increased (reducing speed)

4.1 COMPLEX EVENT RECOGNITION

The five complex behaviours used in our evaluation are: Passing Through 1 (PT-1): person enters and leaves; Passing Through 2 (PT-2): person enters, places luggage, picks it up and leaves; Abandoned Object 1 (AO-1): person meets a second person, places luggage and leaves; Abandoned Object 2 (AO-2): person enters, places luggage and leaves; and Watched Item (WI): two people enter together, one places luggage and leaves, one remains. This last behaviour was synthesised for the PETS dataset by merging track data from scenarios six and four.

Figure 5 compares the average classifier F-scores as the number of particles is increased. Classifications are made after all simple events have been observed by selecting the most likely complex event. A minimum likelihood of 0.3 was imposed to remove extremely weak classifications. As the number of particles increases, accuracy and recall improve. The algorithm remains very efficient with 500 particles, and is capable of processing in excess of 38,000 simple events per second. The classifiers achieve a 0.8 F-score on Dataset 2, and 0.87 on PETS.

In section 2.2 we highlighted that our naive simple-event modules would perform poorly on more complex video. To simulate these conditions, we artificially inserted noise into the observation stream to lower the true positive rate to 60%. Figure 5 shows that even with this high degree of noise, complex events can be detected with a 0.65 F-score.

Table 3 shows classifier confusion across both datasets. Missing object detections cause confusion between PT-1 and PT-2. Behaviours AO-1 and WI differ by only one event (Form Group), and thus absent group detections lead to confusion there.

Table 3: Confusion matrix for the combined video datasets

    Scenario   PT-1   PT-2   WI    AO-1   AO-2
    PT-1       0.92   0      0     0.08   0
    PT-2       0.33   0.58   0     0      0.08
    WI         0      0      0.9   0.1    0
    AO-1       0      0      0.2   0.8    0
    AO-2       0      0      0     0      1

4.2 TEMPORAL ORDER

We proposed that the exact temporal order of observations does not need to be modelled to recognise human behaviour. Figure 6 supports this thesis by showing complex-event likelihood for three different activity permutations of the AO-1 behaviour. In all three cases AO-1 is highly probable, although there are differences in probability. These differences arise because some activity subsequences are shared between multiple behaviours. For instance, ⟨PlaceItem, LeaveItem, Exit⟩ matches both AO-1 and AO-2. There is a low probability that the observations ⟨FormGroup, PartGroup⟩ were false detections, and thus some probability is removed from AO-1 in support of AO-2, which can also explain the subsequence ⟨PlaceItem, LeaveItem, Exit⟩. Although observation order can have an impact on goal probability, it is clear that our thesis holds for these behaviours.

[Figure 6: probability of AO-1 against observation number (1 to 6) for three activity orderings: Ent→FrmG→PrtG→Place→Leave→Ext, Ent→Place→FrmG→PrtG→Leave→Ext, and Ent→FrmG→Place→PrtG→Leave→Ext.]
Figure 6: Observations arriving in different orders still match the correct goal (AO-1). Nomenclature: Enter (Ent), Form Group (FrmG), Part Group (PrtG), Place Item (Place), Leave Item (Leave), Exit (Ext)

4.3 BEHAVIOUR SWITCHING

Our inference algorithm contains components to detect behaviour switching, which occurs when an agent concatenates or otherwise changes their behaviour (see Section 2.3). To demonstrate the effectiveness of these components, Figure 7 plots the probability of each behaviour as observations are received from two concatenated behaviours. The behaviours are WI, followed by PT-2.

[Figure 7: probability of PT-1, PT-2, WI, AO-1 and AO-2 after each observation in the concatenated WI then PT-2 sequence.]
Figure 7: Probability of each behaviour as observations are made. Behaviour switches from WI to PT-2 at observation 6, causing a similar shift in behaviour probability.

In observation 1 the agent enters. The distributions on the features within each behaviour cause PT-1 to be most probable because it has the fewest features. The second observation can only be explained by two behaviours, and this is reflected in the figure. At observation six, “EnterAgent” cannot be explained by any of the behaviours, triggering behaviour interruption. Observation seven can only be explained by PT-2, and this is reflected in the figure. As a result, the behaviours that best explain the observations are WI and PT-2, which matches the ground truth.
5 CONCLUSION AND FUTURE WORK

This paper has argued that data scarcity prevents the advancement of high-level automated visual surveillance using probabilistic techniques, and that anomaly detection side-steps the issue for low-level events. We proposed that simple visual events can be considered as salient features and used to recognise more complex events by imposing a weak temporal ordering. We developed a framework for end-to-end recognition of complex events from surveillance video, and demonstrated that our “bag-of-activities” approach is robust and scalable.

Section 2.3 made the assumption that, for a set of features defining a behaviour, each feature is only performed once. This assumption limits our approach but is not as strong as it may at first appear. An agent who enters and exits the scene can still re-enter, as this is simply the concatenation of two behaviours. Each individual behaviour has only involved one 'EnterAgent' event, so the assumption is not in conflict. Furthermore, it is also possible to consider actions that are opposites. For instance, placing and removing a bag, or entering and exiting the scene, can both be considered action pairs that 'roll back' the state. Although not implemented in this paper, further work has shown that this is an effective means of allowing some action repetition. The only behaviours prevented by the assumption are those that require performing the same action twice (e.g. placing two individual bags).

Clearly, improving the sophistication of the simple event detection modules is a priority in extending our approach to more complicated data. The Group Tracker module could be improved by estimating each person's velocity and direction using a Kalman filter. These attributes could then be merged with the proximity-based approach to more accurately detect the forming and splitting of groups.

Acknowledgements

This work was partially funded by the UK Ministry of Defence under the Competition of Ideas initiative.
References

Rolf H. Baxter, Neil M. Robertson, and David M. Lane. Probabilistic behaviour signatures: Feature-based behaviour recognition in data-scarce domains. In Proceedings of the 13th International Conference on Information Fusion, 2010.

Oren Boiman and Michal Irani. Detecting irregularities in images and in video. International Journal of Computer Vision, 74(1):17–31, 2007.

Hung H. Bui and Svetha Venkatesh. Policy recognition in the abstract hidden Markov model. Journal of Artificial Intelligence Research, 17:451–499, 2002.

Hannah Dee and David Hogg. Detecting inexplicable behaviour. In British Machine Vision Conference, volume 477, page 486, 2004.

Arnaud Doucet, Nando de Freitas, Kevin Murphy, and Stuart Russell. Rao-Blackwellised particle filtering for dynamic Bayesian networks. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pages 176–183, 2000a.

Arnaud Doucet, Simon J. Godsill, and Christophe Andrieu. On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 10(3):197–208, 2000b.

Florent Fusier, Valéry Valentin, François Brémond, Monique Thonnat, Mark Borg, David Thirde, and James Ferryman. Video understanding for complex activity recognition. Machine Vision and Applications, 18:167–188, 2007.

Neil J. Gordon, David J. Salmond, and A. F. M. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F (Radar and Signal Processing), 140(2):107–113, 1993.

Helmut Grabner, Peter M. Roth, Michael Grabner, and Horst Bischof. Autonomous learning of a robust background model for change detection. In Workshop on Performance Evaluation of Tracking and Surveillance, pages 39–46, 2006.

Gal Lavee, Ehud Rivlin, and Michael Rudzsky. Understanding video events: a survey of methods for automatic interpretation of semantic occurrences in video. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 39(5):489–504, 2009.

Benjamin Laxton, Jongwoo Lim, and David Kriegman. Leveraging temporal, contextual and ordering constraints for recognizing complex activities in video. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2007.

Lin Liao, Donald J. Patterson, Dieter Fox, and Henry Kautz. Learning and inferring transportation routines. Artificial Intelligence, 171(5-6):311–331, 2007.

Wasit Limprasert. People detection and tracking with a static camera. Technical report, School of Mathematical and Computer Sciences, Heriot-Watt University, 2010.

David G. Lowe. Object recognition from local scale-invariant features. In International Conference on Computer Vision, volume 2, pages 1150–1157, 1999.

Fengjun Lv, Xuefeng Song, Bo Wu, Vivek Kumar Singh, and Ramakant Nevatia. Left luggage detection using Bayesian inference. In Proceedings of PETS, 2006.

Kevin P. Murphy. Dynamic Bayesian networks: representation, inference and learning. PhD thesis, 2002.

Nam T. Nguyen, Dinh Q. Phung, Svetha Venkatesh, and Hung Bui. Learning and detecting activities from movement trajectories using the hierarchical hidden Markov models. In Computer Vision and Pattern Recognition, volume 2, pages 955–960, 2005.

Alonso Patron, Eric Sommerlade, and Ian Reid. Action recognition using shared motion parts. In Proceedings of the 8th International Workshop on Visual Surveillance, October 2008.

Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.

Neil Robertson, Ian Reid, and Michael Brady. Automatic human behaviour recognition and explanation for CCTV video surveillance. Security Journal, 21(3):173–188, 2008.

Tao Xiang and Shaogang Gong. Video behavior profiling for anomaly detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(5):893, 2008.