    Video Surveillance Framework for Crime Prevention
                   and Event Indexing

          Levente Kovács1, Zoltán Szlávik1, Csaba Benedek1, László Havasi1,
             István Petrás1, Dávid Losteiner1, Ákos Utasi2, Attila Licsár2,
                         László Czúni2, and Tamás Szirányi1
                         1
                          Distributed Events Analysis Research Group,
                                   MTA SZTAKI, Hungary
                                  levente.kovacs@sztaki.hu
                                     http://www.sztaki.hu
                      2
                        Dept. of Image Processing and Neurocomputing,
                              University of Pannonia, Hungary
                                     utasi@uni-pannon.hu
                                  http://www.uni-pannon.hu



       Abstract. This paper presents a video surveillance, event detection and
       annotation framework for semi-supervised surveillance use. The system is
       intended to run in automatic mode on camera feeds that are not actively
       watched by surveillance personnel, raising alarms and enrolling annotation
       data when unusual events occur. We present the current detector filters and
       the easily extendable modular interface. Current filters include local and
       global unusual motion detectors, a left object detector, a motion detector,
       a tampering/failure detector, and others. The system stores the events and
       associated data, which can be organized, searched, annotated and (re)viewed.
       It has been tested in real-life police street surveillance, and we are
       working towards developing a deployable version.




1 Introduction

While the literature on content-based image search contains quite a large number of
proposed and, in some cases, practical systems, the open literature on real-time
surveillance feed analysis is much more limited. Moreover, processing surveillance feeds
for event detection is mostly done as a post-processing step on archived footage [1,2],
rather than aiding the operators' work in real time. [3] is a method based on tracking
moving regions from an aerial view. A system with similar goals is that of [4], with many
architectural and procedural differences; most importantly, our system provides a wider
range of detectors, multiple event alarms, and an easy modular interface. The presented
system is not based on later content-based processing and indexing of archived footage;
it runs real-time filters on live footage and signals unusual events, so as to reduce the
need to actively and constantly watch the live feeds. This is important, since in many
existing systems a few operators need to survey hundreds of cameras. At the same time,
the system retains the classical archive functionalities, which also makes it suitable
for later content-based data mining. The filters we present all run in real time and are
based on pixel-level approaches complemented with robust statistical evaluation and
learning steps. The framework has been developed as part of a project started
specifically to produce such a system for real-life use by the local police stations in
the city's districts.


2 System Architecture

In simple terms, the system consists of a user interface and a database backend. The
interface runs on a workstation with dual display, which also contains the multihead
frame grabber cards that accept the camera feeds. The feeds come from large camera
matrix multiplexer stations. The users monitor and use the system through this
interface, while the database backend can be anywhere, given the proper internet
connections. Figure 1 shows a simplified diagram of the general system architecture,
while Figure 2 shows a snapshot of the main interface, handling three feeds.
   The main application can handle a maximum of four live camera feeds
simultaneously, and each feed can be assigned a chain of selected detection filters
(which we call filter chains). The feed handlers, the filter chains, and the filters
themselves were all written with real-time operation, SMP, and heavy multithreading in
mind; all filters and chains run separately and concurrently. Increasing the number of
available processors makes it possible to run more filters on more feeds. Currently all
filters run in real time, but a combination of multiple filters on multiple feeds can
quickly run into processing and memory bottlenecks; thus a careful selection of hardware
and filter combinations is necessary.


                         Fig. 1. Main parts of the system architecture.
    Modules/filters can be added easily, either as internal filters coded against a
provided class template, or as plugins built from a provided library template. Either
way, the coder need only focus on developing the core algorithm; interfacing is seamless.
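
To make the plugin contract concrete, the following minimal Python sketch outlines the
kind of interface such a class/library template implies. The names (DetectorFilter,
FilterChain, AlarmEvent) and the per-frame process() call are hypothetical illustrations,
not the framework's actual API.

# Hypothetical sketch of a detector-filter plugin interface; the names are
# illustrative only and do not correspond to the framework's real templates.
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Optional
import numpy as np

@dataclass
class AlarmEvent:
    filter_name: str       # which detector raised the alarm
    frame_index: int       # frame where the event was detected
    description: str       # short annotation enrolled with the event

class DetectorFilter(ABC):
    """Base class a new detector would derive from; only the core algorithm
    inside process() has to be written by the filter author."""
    name = "unnamed"

    @abstractmethod
    def process(self, frame: np.ndarray, frame_index: int) -> Optional[AlarmEvent]:
        ...

class FilterChain:
    """Runs the filters assigned to one camera feed, frame by frame."""
    def __init__(self, filters: List[DetectorFilter]):
        self.filters = filters

    def process(self, frame: np.ndarray, frame_index: int) -> List[AlarmEvent]:
        events = []
        for f in self.filters:
            event = f.process(frame, frame_index)
            if event is not None:
                events.append(event)   # an event would be stored in the database
        return events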
                             Fig. 2. Main interface window.




3 Functions, Modules

In this section we describe some of the more important modules/filters currently
deployed in the framework. The system also contains classical surveillance functions
like image and video archiving, large display of the feeds on a secondary monitor and
so on, which we will not detail here. The functions described here are all automatic
and consist of a panorama image creator from panning camera feeds, maskable
motion detector, camera jump detector for cameras that iterate among different
stationary positions, unusual global and local motion detector, fight detector, left
object detector, camera fail/tampering detector, annotation, search and review of
events. Any order of filter combinations can be assigned to each camera feed
separately, and they will run concurrently and independently of each other. All filters
run automatically, in real time, and need no manual intervention.


3.1 Panorama/Mosaic Image

The need for constructing and displaying a panorama image of the scene arose because
many of the surveyed cameras are panning cameras that cover a large field of view. This
module allows us to construct and display the full field of view of a camera for the
operator, and also to identify the current camera position. The method continuously
registers the incoming frames and builds a mosaic. The properties of these cameras are
unknown and differ from one camera source to another. There are some articles (e.g. [5])
dealing with moving cameras, but they are not based on a statistical approach to
segmenting background and foreground. Our approach computes the transformation matrices
from the extracted optical flow vectors. The stable points (good features to track) are
determined by the Harris corner detector [6]. The corresponding points between frames
are verified with the motion vectors of the flow field. Instead of using RANSAC [7] to
compute the homography, we implemented a simpler hit-and-miss iterative algorithm: every
iteration drops the worst points from the dataset, and the remaining ones serve as base
points for computing the transformation between frames. Figure 3 shows examples of built
panoramas.
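
A minimal sketch of this estimation step is given below, assuming OpenCV primitives for
the Harris corners and the pyramidal Lucas-Kanade flow; the iteration count and the drop
ratio are illustrative values, not those used in the deployed module.

import cv2
import numpy as np

def frame_to_frame_homography(prev_gray, curr_gray, iterations=5, drop_ratio=0.2):
    # Stable points ("good features to track") selected with the Harris corner measure
    pts_prev = cv2.goodFeaturesToTrack(prev_gray, maxCorners=400, qualityLevel=0.01,
                                       minDistance=8, useHarrisDetector=True, k=0.04)
    if pts_prev is None:
        return None
    # Corresponding points on the next frame, taken from the optical flow field
    pts_curr, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts_prev, None)
    ok = status.ravel() == 1
    src = pts_prev[ok].reshape(-1, 2).astype(np.float64)
    dst = pts_curr[ok].reshape(-1, 2).astype(np.float64)

    H = None
    for _ in range(iterations):                       # "hit-and-miss" refinement
        if len(src) < 8:
            break
        H, _ = cv2.findHomography(src, dst, 0)        # plain least-squares fit, no RANSAC
        if H is None:
            break
        proj = cv2.perspectiveTransform(src.reshape(-1, 1, 2), H).reshape(-1, 2)
        err = np.linalg.norm(proj - dst, axis=1)      # per-point reprojection error
        keep = np.argsort(err)[: int(len(err) * (1.0 - drop_ratio))]
        src, dst = src[keep], dst[keep]               # drop the worst points and refit
    return H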




        Fig. 3. Sample panorama images. The red rectangle shows the current camera position.




3.2 Motion Detector

Motion detection and optical flow field extraction are required by many of the system's
filters as a base for higher processing stages. The extracted flow can also be used
directly, e.g. for the simple task of raising alarms whenever any motion occurs in a
surveyed area. In this case, the interface's panorama image pane lets the operator mark
an area of interest with the mouse, and alarms are raised when motion occurs over the
masked area. This function can be used to automatically signal e.g. the departure or
arrival of a car, the opening/closing of a door or gate, or any activity over a security
area (e.g. Figure 4).
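
A possible realization of such a maskable alarm is sketched below; the MOG2 background
subtractor and the 2% activity threshold are assumptions made for the example, not
necessarily the technique used by the deployed filter.

import cv2
import numpy as np

class MaskedMotionDetector:
    """Raises an alarm when the amount of foreground motion inside a
    user-drawn area of interest exceeds a threshold."""
    def __init__(self, mask, min_active_ratio=0.02):
        self.mask = (mask > 0).astype(np.uint8)          # area of interest (1 = watched)
        self.min_active = min_active_ratio * int(self.mask.sum())
        self.bg = cv2.createBackgroundSubtractorMOG2(detectShadows=False)

    def process(self, frame):
        fg = self.bg.apply(frame)                        # foreground/motion mask
        fg = cv2.medianBlur(fg, 5)                       # suppress isolated noise pixels
        active = cv2.countNonZero(cv2.bitwise_and(fg, fg, mask=self.mask))
        return active > self.min_active                  # True -> raise a motion alarm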




         Fig. 4. Motion detector on masked area of interest (blue: mask, red: motion).




3.3 Unusual Global Motion Detector

For this filter, the goal was to detect unusual large motion patterns. Intended use cases
include:
        • Someone goes against the traffic in a one-way street: long-term statistics
          show one major motion direction, and then a different motion occurs.
        • One lane jams in a two-way street: long-term statistics show two typical
          motion directions, and then one of them disappears.
        • Accident or traffic jam: statistics show intensive and varied motion, which
          then considerably slows, stops, or drops in variance.
   To achieve this goal, long-term global statistics are built in a learning phase from
direction distributions and typical motion types, based on the motion fields extracted
from the image sequence. Then, in the detection phase, we try to fit the current motion
data to one of the learned statistics, and raise an alarm when no good fit can be found.
   The motion field extracted from the image sequence is cut every t1 seconds into
t2-long segments, and we take the mean of each segment as a sample. If t1 < t2, we get
overlapping segments, which helps smooth the blockiness at segment borders. We keep
collecting samples for a time period t3, which produces N samples. From the samples,
directional distributions are constructed, and directional histograms are built from the
motion fields. The histograms are quantized into ε-degree bins between 0 and 360 degrees.
These directional histograms represent the typical motion forms of the scene.
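
As an illustration, a single histogram sample could be derived from a dense flow field
as in the sketch below; the 10-degree bin width (ε) and the magnitude threshold are
assumed values.

import numpy as np

def directional_histogram(flow, eps_deg=10.0, min_magnitude=0.5):
    """Quantize the motion directions of a dense flow field (H x W x 2)
    into eps_deg-wide bins between 0 and 360 degrees."""
    dx, dy = flow[..., 0].ravel(), flow[..., 1].ravel()
    mag = np.hypot(dx, dy)
    moving = mag > min_magnitude                         # ignore near-static vectors
    angles = (np.degrees(np.arctan2(dy[moving], dx[moving])) + 360.0) % 360.0
    bins = np.arange(0.0, 360.0 + eps_deg, eps_deg)
    hist, _ = np.histogram(angles, bins=bins)
    total = hist.sum()
    return hist / total if total > 0 else hist.astype(float)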
    In the learning phase, the N samples are classified into k classes by K-means
clustering. The distance between samples is calculated with the L2 norm. K-means requires
k to be given a priori, which we overcome by starting with a large class number and then
performing a class consolidation step: if the means A0 and B0 of classes A and B are
closer than n0, then B is merged into A, with a new mean A0*. In the end we will have k*
classes, where k* ≤ k.
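
A compact sketch of this learning step is given below; the scikit-learn K-means call,
the initial class count, and the merge threshold n0 are illustrative assumptions (a
weighted merge by cluster size would be a natural refinement).

import numpy as np
from sklearn.cluster import KMeans

def learn_motion_classes(samples, k_init=10, n0=0.15):
    """Cluster the N directional-histogram samples with a deliberately large k,
    then merge class means that are closer than n0 in L2 distance."""
    km = KMeans(n_clusters=min(k_init, len(samples)), n_init=10).fit(samples)
    means = list(km.cluster_centers_)
    merged = True
    while merged:
        merged = False
        for i in range(len(means)):
            for j in range(i + 1, len(means)):
                if np.linalg.norm(means[i] - means[j]) < n0:
                    means[i] = (means[i] + means[j]) / 2.0   # merge B into A -> new mean A0*
                    del means[j]
                    merged = True
                    break
            if merged:
                break
    return np.array(means)               # the k* <= k remaining class means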


3.5 Fight Detector

The fight detector evaluates the shape of the tracked motion trajectories: a trajectory
is counted towards alarm if min Tlength < N < max Tlength and
min Cvar < norm_curved_k < max Cvar hold simultaneously, and if alarm > min alarm then an
alarm is raised,
where: min Tlength and max Tlength are the minimal and maximal trajectory lengths, N is
the actual trajectory length, min Cvar and max Cvar are the minimal and maximal curvature
variances, curved_k is the actual trajectory curvature, and
       • norm_curved_k = curved_k / length_k is the normalized curvature measure,
       • curved_k = ∑ (m_k − t_k)² / N is the deviation from the trajectory's mean,
       • length_k = ‖t_k(0) − t_k(N)‖ is the distance between the first and last
         trajectory points, and
       • alarm is the number of trajectories where the above shape constraints are
         fulfilled simultaneously.
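
The rule above can be transcribed directly, as in the sketch below; the trajectory
representation (an array of 2-D points) and the threshold values are assumptions chosen
only to make the example concrete. An alarm signal would then be raised whenever the
returned count exceeds min alarm.

import numpy as np

def count_shape_alarms(trajectories, min_tlen=10, max_tlen=200,
                       min_cvar=0.5, max_cvar=5.0):
    """Count the trajectories whose length and normalized curvature fall
    inside the configured bounds (the shape constraints above)."""
    alarm = 0
    for t in trajectories:                   # t: array of N 2-D trajectory points
        t = np.asarray(t, dtype=float)
        N = len(t)
        if not (min_tlen < N < max_tlen):
            continue
        m = t.mean(axis=0)
        curved = np.sum(np.linalg.norm(t - m, axis=1) ** 2) / N   # deviation from the mean
        length = np.linalg.norm(t[0] - t[-1])                     # first-to-last distance
        if length == 0:
            continue
        norm_curved = curved / length                             # normalized curvature
        if min_cvar < norm_curved < max_cvar:
            alarm += 1
    return alarm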
   Figure 7 shows two excerpts from real feeds where people were fighting, with the red
overlay marking the frames where the filter raised an alarm signal.


3.6 Left and Removed Object Detector

In conventional video surveillance applications, the aims of background modeling and
background subtraction modules are usually limited to moving object detection and
analysis. However, relevant information can be exploited by following the changes in
the background as well. We implemented a filter which not only detects objects moving in
front of the camera, but also detects changes in the static background: it signals the
appearance of new objects (i.e. objects that are brought into the field of view and left
there, see Figure 8) as well as the removal of objects from the field of view. The method
can be used to detect abandoned or stolen objects, which is an important surveillance
task.




    Fig. 7. Fight detector feeds (a, b). Red overlay shows frames where fight alarms were raised.

   The proposed method (building on [15]) extends the widely used Gaussian mixture
background modeling approach of [11]. Each pixel s is considered as a separate
process, which generates an observed pixel value sequence over time (t is the time
index):
                       \[ \{\, x^{[1]}(s),\ x^{[2]}(s),\ \ldots,\ x^{[t]}(s) \,\} \]

   To model the recent history of the pixels, [11] suggested a mixture of K Gaussian
distributions:

      \[ P\big(x^{[t]}(s)\big) = \sum_{k=1}^{K} w_k^{[t]}(s) \cdot
         \eta\big(x^{[t]}(s),\ \mu_k^{[t]}(s),\ \sigma_k^{[t]}(s)\big) \]
where k = 1, …, K are unique, time-invariant identifiers of the mixture components, and
η(·) is a Gaussian density function with mean µ and deviation σ. We ignore multi-modal
background processes, and consider the background Gaussian term to be equivalent to the
component of the mixture with the largest weight.
   The mixture parameters are iteratively refreshed. The weight is updated as follows:

      \[ w_k^{[t+1]}(s) = (1-\alpha)\, w_k^{[t]}(s) + \alpha\, M^{[t]}\big(k,\ x^{[t]}(s)\big) \]
where the following matching operator is used:

      \[ M^{[t]}\big(k,\ x^{[t]}(s)\big) = 1 \ \text{if}\ \big| x^{[t]}(s) - \mu_k^{[t]}(s) \big| \]
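
For illustration, a single-pixel sketch of this update is given below; since the matching
operator's definition is cut short above, the 2.5σ matching rule and the weight
renormalization used here follow the conventional formulation of [11] and should be read
as assumptions rather than the exact rule of the deployed filter.

import numpy as np

def update_pixel_mixture(x, weights, means, sigmas, alpha=0.01, match_sigma=2.5):
    """One weight-update step of the per-pixel mixture: each w_k is pulled
    towards 1 if the new value x matched component k, towards 0 otherwise."""
    matched = np.abs(x - means) < match_sigma * sigmas    # assumed 2.5-sigma matching rule
    weights = (1.0 - alpha) * weights + alpha * matched.astype(float)
    weights /= weights.sum()                              # conventional renormalization
    background_mean = means[np.argmax(weights)]           # largest-weight component = background
    return weights, background_mean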