<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Video Visual Analytics of Tracked Moving Objects</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Markus HÖFERLIN</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benjamin HÖFERLIN</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniel WEISKOPF</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Intelligent Systems Group, Universität Stuttgart</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Visualization Research Center, Universität Stuttgart</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Exploring video data by simply watching it does not scale to large databases. This problem becomes especially obvious in the field of video surveillance. Motivated by a mini challenge of the IEEE Symposium on Visual Analytics Science and Technology 2009 contest (detecting the encounter of persons in a provided video stream utilizing the techniques of visual analytics), we propose an approach for fast identification of relevant objects based on the properties of their trajectories. We present a novel visual and interactive filter process for fast video exploration that yields good results even with challenging video data. The video material includes changing illumination and was captured with low temporal resolution by a camera panning between different views.</p>
      </abstract>
      <kwd-group>
        <kwd>Visual analytics</kwd>
        <kwd>video surveillance</kwd>
        <kwd>video visualization</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Over the last few years, the number of CCTV cameras has been increasing rapidly,
especially but not solely in the field of video surveillance. For example, the human rights
group Liberty estimated about 4.5 million CCTV cameras for the UK in 2009 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. That is one
CCTV camera for every 14 citizens. On the one hand, directly watching video footage
does not scale with the growing number of CCTV cameras. Even if we assume that an operator
watches multiple video streams in fast-forward mode, this remains a costly
process: Haering et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] put the cost of monitoring 25 cameras by human observers at
$150k per annum. Additionally, the attention of an operator decreases within 20 minutes
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. On the other hand, fully automated computer vision systems that process video data
up to a high semantic level are not yet reliable [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>A solution to these problems is provided by the visual analytics (VA) process, which
is situated between the two extreme cases mentioned above: fully automated video analysis and
manual video analysis. To this end, VA combines automated video analysis on lower
semantic levels with an appropriate visualization. For the classification of the extracted features,
VA relies on human recognition abilities, linked to the system by interaction. This
enables an accelerated exploration of video data relying on the specialized capabilities of
each: human and computer.</p>
      <p>
        Contributions: Based on the VA methodology, we propose a novel framework for video
analysis and evaluate it using the IEEE VAST Challenge 2009 video data set [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The structure of the proposed VA framework is depicted in Fig. 1(a). In a preprocessing step,
we analyze the moving objects of a video and apply established approaches such as optical
flow computation, background subtraction, and object tracking by a Kalman filter.
Furthermore, we visualize the video as a VideoPerpetuoGram (VPG) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. As a second contribution,
we enable the users to interact with the VPG by defining filters. Thus, parts of low
interest can be neglected, while users can focus on relevant periods. By this interactive
visualization and filtering process, users are enabled to refine their hypotheses in an iterative
manner, and thus the VA process is completed.
      </p>
      <p>[Fig. 1: (a) structure of the proposed VA framework; (b) example VPG for one camera view.]</p>
      <p>
        Prior work on visual video abstraction condensed actions temporally by
showing several actions at the same time, even if they occur chronologically in succession
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Caspi et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] selected informative poses of objects and merged them into images
or short video clips. Their approach dealt with occlusion by rotating and translating the
poses in the video volume.
      </p>
      <p>
        Other works introduced video browsing techniques for better video exploration. For
example, Dragicevic et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] proposed to browse interactively through videos by direct
object manipulation, i.e., the selection of a temporal video position by dragging an
object of the video to the desired spatial position.
      </p>
      <p>
        The foundations of the VPG were developed over the last years. Fels and Mase [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] were the first to render a video as a 3D volume. Daniel and Chen [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] mapped
the 3D volume to a horseshoe and additionally displayed image changes with the aim of
video summarization. Chen et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] deployed flow visualization techniques to video
volumes and identified relevant visual signatures in them. Finally, the VPG was
introduced as a seismograph-like visualization technique for continuous video streams
by Botchen et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>1. Video Vision</title>
      <p>The part of video analysis depicted in Fig. 1(a) applies established computer vision
techniques. We segment temporally changing regions by combining background subtraction and
optical flow computation. Both methods on their own are not reliable enough to analyze
challenging data sets like the one provided with the VAST Challenge 2009, which
includes changing illumination and was captured with low temporal resolution by a camera
panning between different views. Thus, we use background subtraction and optical flow
segmentation in a complementary manner. After segmenting the foreground blobs we
associate them with the blobs detected in the previous frames utilizing a Kalman tracking
filter. Finally, different properties of these trajectories are calculated, which are used in
the subsequent steps.</p>
      <p>Background Subtraction: Background subtraction is a specialized case of change
detection, primarily used for foreground segmentation of video sequences captured by static
cameras. Although sophisticated background models exist, we opt for a very basic but
robust method that is able to cope with the lack of training data. The model of
each background (one for each camera position) is calculated as the median of the last 150
frames. Violations of this background model are considered foreground regions; they are
detected where the luminance distance between model and sensed image exceeds
a certain threshold. After the segmentation step we update the background model using
the most recent frame. Since the camera sometimes changes its viewing direction, a
precise overlap of the actual viewing volume with the background model cannot be
assumed. Therefore, we have to realign the sensed image with the model by translating the
image to the maximal cross-correlation response within a region of several pixels from
its initial position. The median is a convenient method to shape an adaptive background
model because it is statistically insensitive to outliers originating from noisy video data.
Due to the rotation of the camera, static changes (i.e., scene changes that are uncorrelated
to any motion) affect the background subtraction. This happens if the scene changes
while the camera points in another direction, resulting in background violations by
non-foreground objects.</p>
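      <p>To make this procedure concrete, the following sketch (Python with OpenCV and NumPy) shows a median background model, threshold-based foreground detection, and cross-correlation realignment. It is a minimal illustration of the described steps rather than the authors' implementation; the search radius and the luminance threshold are assumed values.</p>
      <preformat><![CDATA[
import cv2
import numpy as np

SEARCH = 8        # assumed realignment search radius in pixels
THRESHOLD = 30    # assumed luminance distance threshold

def update_background(frame_buffer):
    """Median of the buffered grayscale frames (the paper uses the last 150)."""
    return np.median(np.stack(frame_buffer, axis=0), axis=0).astype(np.uint8)

def align_to_model(gray, model):
    """Translate the sensed image to the maximal cross-correlation response
    within a few pixels of its initial position."""
    template = gray[SEARCH:-SEARCH, SEARCH:-SEARCH]
    response = cv2.matchTemplate(model, template, cv2.TM_CCORR_NORMED)
    _, _, _, (dx, dy) = cv2.minMaxLoc(response)
    shift = np.float32([[1, 0, dx - SEARCH], [0, 1, dy - SEARCH]])
    return cv2.warpAffine(gray, shift, (gray.shape[1], gray.shape[0]))

def foreground_mask(gray, model):
    """Luminance distances above the threshold are background violations."""
    aligned = align_to_model(gray, model)
    return (cv2.absdiff(aligned, model) > THRESHOLD).astype(np.uint8) * 255
]]></preformat>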
      <p>
        Motion Segmentation: Another common concept to extract relevant objects of a video
sequence assumes that these objects are moving. These objects can be identified by
segmenting regions with homogeneous motion and a velocity above a certain threshold. For
motion analysis we rely on the pyramidal Lucas-Kanade method [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Subsequently, we
segment the motion field based on motion homogeneity. By applying motion
segmentation we are able to reject regions originating from a badly initialized background model
or static changes during camera rotation. Motion segmentation on its own would lack
robustness in cases of strong video noise.
      </p>
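      <p>A rough sketch of this step, again in Python with OpenCV: the pyramidal Lucas-Kanade tracker provides a sparse motion field whose vectors are thresholded by speed before regions of homogeneous motion are formed. The feature-detector parameters and the speed threshold are illustrative assumptions.</p>
      <preformat><![CDATA[
import cv2
import numpy as np

MIN_SPEED = 2.0  # assumed speed threshold in pixels per frame

def sparse_motion_field(prev_gray, next_gray):
    """Pyramidal Lucas-Kanade flow on automatically detected features."""
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500,
                                  qualityLevel=0.01, minDistance=5)
    if pts is None:
        return np.empty((0, 2)), np.empty((0, 2))
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts, None)
    ok = status.ravel() == 1
    return pts[ok].reshape(-1, 2), nxt[ok].reshape(-1, 2)

def moving_points(prev_gray, next_gray):
    """Keep only points whose velocity exceeds the threshold; these seed the
    segmentation of regions with homogeneous motion."""
    p0, p1 = sparse_motion_field(prev_gray, next_gray)
    flow = p1 - p0
    fast = np.linalg.norm(flow, axis=1) > MIN_SPEED
    return p1[fast], flow[fast]
]]></preformat>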
      <p>Tracking of Segmented Regions: By tracing the detected regions over several frames we
build up their trajectories. These trajectories are the principal objects we use for further
visual analysis. A linear Kalman filter up to the third order is used to track the detected
region’s position and size in image space.</p>
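      <p>A minimal sketch of such a tracker, assuming OpenCV's Kalman filter with a constant-velocity state over the blob's centre and size (the third-order model mentioned above would additionally carry acceleration terms); the noise covariances are assumed values.</p>
      <preformat><![CDATA[
import cv2
import numpy as np

def make_blob_tracker(dt=1.0):
    """Kalman filter over the state (x, y, w, h, vx, vy, vw, vh);
    measurements are the detected blob's position and size."""
    kf = cv2.KalmanFilter(8, 4)
    kf.transitionMatrix = np.eye(8, dtype=np.float32)
    for i in range(4):                         # position/size integrate velocity
        kf.transitionMatrix[i, i + 4] = dt
    kf.measurementMatrix = np.zeros((4, 8), np.float32)
    kf.measurementMatrix[:4, :4] = np.eye(4)
    kf.processNoiseCov = np.eye(8, dtype=np.float32) * 1e-2      # assumed
    kf.measurementNoiseCov = np.eye(4, dtype=np.float32) * 1e-1  # assumed
    kf.errorCovPost = np.eye(8, dtype=np.float32)
    return kf

def track_step(kf, blob):
    """Predict, then correct with the blob (x, y, w, h) associated to this track."""
    prediction = kf.predict()
    if blob is not None:
        kf.correct(np.float32(blob).reshape(4, 1))
    return prediction[:4].ravel()              # estimated position and size
]]></preformat>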
      <p>Properties of the Trajectories: As the final step of the video analysis, several properties of
the extracted trajectories are calculated. Among them are the object’s speed, average
direction, and distance to other trajectories at its start and end positions. To obtain this
information, we homographically project the trajectory’s positions onto the top-view plane
and measure the distance in world space.</p>
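      <p>As an illustration of this computation, the sketch below projects image-space trajectory positions onto the ground plane with a homography and derives mean speed and average direction. The image-to-world correspondences are fictitious placeholders; in practice they would be calibrated for each camera view.</p>
      <preformat><![CDATA[
import cv2
import numpy as np

# Assumed correspondences between image points and the top-view ground plane
# (in metres); these values are placeholders for a per-view calibration.
IMAGE_PTS = np.float32([[102, 220], [530, 214], [600, 460], [40, 470]])
WORLD_PTS = np.float32([[0, 0], [12, 0], [12, 8], [0, 8]])
H = cv2.getPerspectiveTransform(IMAGE_PTS, WORLD_PTS)

def to_world(points_xy):
    """Homographically project image positions onto the top-view plane."""
    pts = np.float32(points_xy).reshape(-1, 1, 2)
    return cv2.perspectiveTransform(pts, H).reshape(-1, 2)

def trajectory_properties(track_xy, fps):
    """Mean speed (m/s) and average direction (radians) in world space."""
    world = to_world(track_xy)
    steps = np.diff(world, axis=0)
    mean_speed = np.linalg.norm(steps, axis=1).mean() * fps
    avg_direction = np.arctan2(steps[:, 1].sum(), steps[:, 0].sum())
    return mean_speed, avg_direction
]]></preformat>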
    </sec>
    <sec id="sec-3">
      <title>2. Video Visualization</title>
      <p>
        We propose a video visualization approach based on the VPG by Botchen et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The
VPG is a visualization technique that enables continuous video streams to be displayed
in a manner similar to seismographs and electrocardiograms (ECG). The two spatial axes
of the video are extended by time as third axis, yielding a 3D video volume. Inside the
volume, keyframes are displayed at sparse, equidistant intervals to convey context
information. Trajectories extracted in the preprocessing step are included in the volume
and reveal movement information. The video sequence is split into independent
camera views, each represented by its own VPG, displayed side by side. An example for one camera
direction is illustrated in Fig. 1(b). Additionally, blue bars inside the volume indicate skipped
time intervals. Their durations are plotted onto the bars. Time is skipped at every
scene change. If there is no relevant content within a period or a scene, we omit it. The
relevance of a scene is defined by filters that we will discuss in the next section.
      </p>
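      <p>The layout of a single camera view's VPG can be summarized by the following data-structure sketch, in which relevant periods are laid out along the time axis, keyframes are placed at equidistant intervals, and omitted periods are collapsed into labelled skip bars. The names and the keyframe spacing are illustrative assumptions, not the rendering code of the system.</p>
      <preformat><![CDATA[
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class VPGSegment:
    """One displayed block of a camera view's VPG: its frame range, the
    keyframes shown as context, and the duration skipped before it
    (rendered as a blue bar labelled with that duration)."""
    start_frame: int
    end_frame: int
    keyframes: List[int] = field(default_factory=list)
    skipped_before: int = 0

def layout_vpg(relevant_ranges: List[Tuple[int, int]],
               keyframe_spacing: int = 75) -> List[VPGSegment]:
    """Collapse irrelevant periods into skip bars and place sparse,
    equidistant keyframes inside each relevant period."""
    segments, prev_end = [], 0
    for start, end in relevant_ranges:
        segments.append(VPGSegment(
            start, end,
            keyframes=list(range(start, end, keyframe_spacing)),
            skipped_before=start - prev_end))
        prev_end = end
    return segments
]]></preformat>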
    </sec>
    <sec id="sec-4">
      <title>3. User Interaction</title>
      <p>To cope with the large amount of video data, we enable the users to interact with the
visualization by applying filters and thus achieve scalability. Trajectories are filtered by
their relevance according to properties like camera location, temporal and spatial start
and end positions in image coordinates, mean speed, and average direction. Complex
filters can be created by defining arbitrary numbers of filters concatenated by logical
operators like AND and OR. Beyond that, it is possible to apply trajectory interaction
filters. These filters empower the users to focus on trajectories by specifying their relation
to other trajectories. For example, an interaction filter may require that trajectories have
to begin or end in spatial and temporal vicinity. The aim of filtering is to decrease the
number of trajectories and to focus on regions of interest.</p>
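      <p>The filtering concept can be sketched as composable predicates over trajectory records, with operator overloading mirroring the AND/OR concatenation and an interaction filter relating a trajectory to the start or end of other trajectories. The class and the trajectory representation (a dictionary with positions, frame range, and mean speed) are hypothetical and only serve to illustrate the idea.</p>
      <preformat><![CDATA[
import numpy as np

class Filter:
    """Composable trajectory predicate; & and | mirror AND/OR concatenation."""
    def __init__(self, predicate):
        self.predicate = predicate
    def __call__(self, trajectory):
        return self.predicate(trajectory)
    def __and__(self, other):
        return Filter(lambda t: self(t) and other(t))
    def __or__(self, other):
        return Filter(lambda t: self(t) or other(t))

def mean_speed_above(limit):
    return Filter(lambda t: t["mean_speed"] > limit)

def starts_in(region):                  # region = (x0, y0, x1, y1), image space
    def pred(t):
        x, y = t["positions"][0]
        x0, y0, x1, y1 = region
        return x0 <= x <= x1 and y0 <= y <= y1
    return Filter(pred)

def interacts_with(others, max_dist, max_gap):
    """Keep trajectories that end near (in space and time) another one's start."""
    def pred(t):
        for o in others:
            if o is t:
                continue
            close = np.linalg.norm(
                np.subtract(t["positions"][-1], o["positions"][0])) <= max_dist
            in_time = abs(t["end_frame"] - o["start_frame"]) <= max_gap
            if close and in_time:
                return True
        return False
    return Filter(pred)
]]></preformat>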
      <p>To gain a deeper understanding, we provide an example related to the VAST
Challenge. In the VPG illustrated in Fig. 2(a), many of the extracted trajectories originate
from moving cars. Since we are searching for encounters of people, these trajectories
are not relevant to us. Therefore, we add a spatial start and end position filter in image
coordinates (cf. Fig. 2(b)(bottom)). Another possibility to reject the trajectories of cars is
to filter the average direction as depicted in Fig. 2(b)(top). Since cars are typically faster
than pedestrians, the application of a mean speed filter would be an option, too.</p>
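      <p>Continuing the hypothetical sketch above, the car-rejection example could be expressed roughly as follows; the road region, direction band, and speed threshold are assumed values.</p>
      <preformat><![CDATA[
# `trajectories` is assumed to be the list of trajectory records produced by
# the tracking step. Reject car-like traces: those starting in the road region,
# heading roughly along the road, or moving at car-like speed.
road = (0, 180, 640, 260)                   # assumed image-space road region
along_road = Filter(lambda t: abs(t["avg_direction"]) < 0.2)        # radians
car_like = starts_in(road) | along_road | mean_speed_above(3.0)     # m/s, assumed

remaining = [t for t in trajectories if not car_like(t)]
]]></preformat>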
      <p>In VA, the interaction is typically an iterative process guiding the hypothesis
generation. The users infer hypotheses by inductive and deductive steps based on the
visualization. The typical scenario looks like this: First, the unfiltered video volume is
observed. The users explore the video and identify some uninteresting events, e.g.
pedestrians waiting at a pedestrian crossing. Thus, they define filters to ignore the trajectories
according to their features. By further exploration they detect other events without
relevance, e.g. pedestrians just crossing the scene or trajectories that do not affect each other.
These will also be neglected, decreasing the number of trajectories in the result set. The
filtering process does not necessarily narrow the set of trajectories, but can also widen
it by unifying the result sets of two filters. Uninteresting events can be ignored using
a black list. This way, the users iteratively build hypotheses based on their exploration
of the video. Simultaneously, they verify their hypotheses by defining an appropriate set
of filters. These steps help the users to reduce the amount of video data that remains
for watching. Finally, the users obtain a manageable amount of data, small enough for a
detailed manual analysis. Note that defining filters in this way will not lead to a semantic
gap, since the formulation of filters and the visualization directly depend on low-level
features.</p>
      <p>To gain confidence in the automatic video analysis of the preprocessing step, the
users are able to examine the foundations from which the visualized data is inferred. The
volume slices of the trajectories (cf. Fig 2(c)(top)) or the playback of a part of the video
sequence showing the traced object highlighted (cf. Fig 2(c)(bottom)), are tools to serve
this purpose.</p>
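      <p>A rough sketch of the playback-based verification, assuming OpenCV and a trajectory stored as a mapping from frame index to the tracked bounding box; the representation is an assumption for illustration.</p>
      <preformat><![CDATA[
import cv2

def play_with_highlight(video_path, track, color=(0, 255, 0)):
    """Replay the frames covered by a trajectory with its bounding box drawn,
    so users can check the automatic analysis. `track` maps a frame index to
    an (x, y, w, h) box."""
    cap = cv2.VideoCapture(video_path)
    for idx in sorted(track):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            break
        x, y, w, h = track[idx]
        cv2.rectangle(frame, (x, y), (x + w, y + h), color, 2)
        cv2.imshow("verification", frame)
        if cv2.waitKey(40) & 0xFF == 27:   # Esc aborts playback
            break
    cap.release()
    cv2.destroyAllWindows()
]]></preformat>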
      <p>We point out the advantage of our approach over the common method of watching the
whole video sequence, using the task provided by the VAST Challenge 2009. The task
was to find the encounter of people within 10 hours of video surveillance material. In
contrast to manually inspecting the 10 hours of video footage, we begin the proposed
VA process with 809 trajectories and the initial hypothesis provided by
the VAST Challenge’s task. After a short period of time, an experienced user was able to
reduce the number of remaining trajectories to 22. Similar to the examples described
above, this was achieved by an iterative refinement of the hypotheses and the applied set
of filters. In particular, the application of an interaction filter adjusted to detect the split
and merge of trajectories was able to condense the number of remaining object traces.
Finally, a suspicious encounter of two people could be tracked and validated by the user.</p>
      <p>In this paper we have proposed a method for scalable video analysis based on the
visual analytics methodology. Reliability issues arising with fully automated video
analysis approaches are avoided by involving human recognition abilities. We have proposed
a novel framework that consists of three building blocks providing scalability to large
quantities of video data. First, a video sequence is automatically analyzed on a low
semantic level. Extracted features are then visualized in relation to the original content
using the VPG. As the third and principal concept, the users interact with the system and apply
filters based on iteratively refined hypotheses. Finally, we have illustrated the usefulness
of our approach by an example derived from the VAST Challenge 2009.</p>
      <p>Future work could consider features other than trajectories. Also, additional
confidence information about these features and their relations to each other would increase
reliability and scalability. An important area of future research is a more detailed evaluation by
quantitative user studies.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work was funded by DFG as part of the Priority Program “Scalable Visual
Analytics” (SPP 1335).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Liberty</surname>
          </string-name>
          .
          <article-title>Closed circuit television - CCTV</article-title>
          . [Online]. Available: http://www.liberty-humanrights.org.uk/issues/3-privacy/32-cctv/index.shtml
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N.</given-names>
            <surname>Haering</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Venetianer</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Lipton</surname>
          </string-name>
          , “
          <article-title>The evolution of video surveillance: an overview</article-title>
          ,”
          <source>Machine Vision and Applications</source>
          , vol.
          <volume>19</volume>
          , no.
          <issue>5</issue>
          , pp.
          <fpage>279</fpage>
          -
          <lpage>290</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dick</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Brooks</surname>
          </string-name>
          , “
          <article-title>Issues in automated visual surveillance</article-title>
          ,”
          <source>New Scientist</source>
          , pp.
          <fpage>195</fpage>
          -
          <lpage>204</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          IEEE VAST Challenge 2009
          .
          <article-title>IEEE VAST 2009 Symposium</article-title>
          . [Online]. Available: http://hcil.cs.umd.edu/localphp/hcil/vast/index.php
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R. P.</given-names>
            <surname>Botchen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bachthaler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Schick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Mori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weiskopf</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Ertl</surname>
          </string-name>
          , “
          <article-title>Action-based multifield video visualization</article-title>
          ,”
          <source>IEEE Transactions on Visualization and Computer Graphics</source>
          , vol.
          <volume>14</volume>
          , no.
          <issue>4</issue>
          , pp.
          <fpage>885</fpage>
          -
          <lpage>899</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Thomas</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Cook</surname>
          </string-name>
          ,
          <article-title>Illuminating the path: The research and development agenda for visual analytics</article-title>
          .
          <source>IEEE Computer Society</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Pritch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rav-Acha</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Peleg</surname>
          </string-name>
          , “
          <article-title>Nonchronological video synopsis and indexing</article-title>
          ,”
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          , vol.
          <volume>30</volume>
          , no.
          <issue>11</issue>
          , pp.
          <fpage>1971</fpage>
          -
          <lpage>1984</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Caspi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Axelrod</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Matsushita</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Gamliel</surname>
          </string-name>
          , “
          <article-title>Dynamic stills and clip trailers</article-title>
          ,”
          <source>The Visual Computer</source>
          , vol.
          <volume>22</volume>
          , no.
          <issue>9</issue>
          , pp.
          <fpage>642</fpage>
          -
          <lpage>652</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P.</given-names>
            <surname>Dragicevic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ramos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bibliowitcz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nowrouzezahrai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Balakrishnan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Singh</surname>
          </string-name>
          , “
          <article-title>Video browsing by direct manipulation</article-title>
          ,” in
          <source>Proceedings of the 26th Annual SIGCHI Conference on Human Factors in Computing Systems</source>
          . Florence, Italy: ACM,
          <year>2008</year>
          , pp.
          <fpage>237</fpage>
          -
          <lpage>246</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Fels</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Mase</surname>
          </string-name>
          , “
          <article-title>Interactive video cubism</article-title>
          ,” in
          <source>Proceedings of the 1999 Workshop on New Paradigms in Information Visualization and Manipulation (NPIVM)</source>
          .
          <source>ACM NY, USA</source>
          ,
          <year>1999</year>
          , pp.
          <fpage>78</fpage>
          -
          <lpage>82</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>G.</given-names>
            <surname>Daniel</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          , “Video visualization,”
          in
          <source>Proceedings of the 14th IEEE Visualization Conference (VIS'03)</source>
          . IEEE Computer Society, Washington, DC, USA,
          <year>2003</year>
          , pp.
          <fpage>409</fpage>
          -
          <lpage>416</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Botchen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hashim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weiskopf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ertl</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>I.</given-names>
            <surname>Thornton</surname>
          </string-name>
          , “
          <article-title>Visual signatures in video visualization</article-title>
          ,”
          <source>IEEE Transactions on Visualization and Computer Graphics</source>
          , vol.
          <volume>12</volume>
          , no.
          <issue>5</issue>
          , pp.
          <fpage>1093</fpage>
          -
          <lpage>1100</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bouguet</surname>
          </string-name>
          , “
          <article-title>Pyramidal implementation of the Lucas Kanade feature tracker: description of the algorithm</article-title>
          ,”
          <source>OpenCV Documentation, Intel Corp., Microprocessor Research Labs</source>
          , pp.
          <fpage>593</fpage>
          -
          <lpage>600</lpage>
          ,
          <year>Jun 2000</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>