<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>From Distributed Vision Networks to Human Behavior Interpretation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hamid Aghajan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chen Wu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Electrical Engineering Stanford University</institution>
          ,
          <addr-line>Stanford CA, 94305</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <fpage>129</fpage>
      <lpage>143</lpage>
      <abstract>
        <p>Analyzing human behavior is a key step in smart home applications. Many reasoning approaches utilize information about the location and posture of the occupant in a qualitative assessment of the user's status and events. In this paper, we propose a vision-based framework that provides quantitative information about the user's posture, which can be used to deduce qualitative representations for high-level reasoning. Furthermore, our approach is motivated by the potential introduced by interactions between the vision module and the high-level reasoning module. While quantitative knowledge from the vision network can either complement or provide specific qualitative distinctions for AI-based problems, the resulting qualitative representations can offer clues that direct the vision network to adjust its processing operation according to the interpretation state. The paper outlines the potential for such interactions and describes two vision-based fusion mechanisms. The first employs an opportunistic approach to recover the fully parameterized human model from the vision network, while the second uses directed deductions from vision to address a particular smart home application in fall detection.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        The increasing interest in understanding human behaviors and events in a
camera context has heightened the need for gesture analysis of image sequences.
Gesture recognition problems have been extensively studied in Human-Computer
Interaction (HCI), where a set of pre-defined gestures is often used for
delivering instructions to machines [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. However, “passive gestures”
predominate in behavior descriptions in many applications. Traditional
examples include surveillance and security applications, while more novel
applications arise in emergency detection in clinical environments [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], video
conferencing [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ], and multimedia and gaming applications. Some approaches to
analyzing passive gestures have been investigated in [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ].
      </p>
      <p>In a multi-camera network, access to multiple sources of visual data often
allows for making more comprehensive interpretations of events and gestures. It
also creates a pervasive sensing environment for applications where it is
impractical for the users to wear sensors. Having access to interpretations of posture
and gesture elements obtained from visual data over time enables higher-level
reasoning modules to deduce the user's actions, context, and behavior models,
and to decide upon suitable actions or responses to the situation.</p>
      <p>[Fig. 1: The relationship between vision networks and high-level AI reasoning, and a
variety of novel applications enabled by both. Enabling technologies: vision processing,
wireless sensor networks, embedded computing, and signal processing. Application areas
include human-computer interaction, multimedia, immersive virtual reality,
non-restrictive interfaces, interactive robotics, scene construction, virtual reality,
gaming, smart environments, agents, response systems, and user interactions.]</p>
      <p>
        Our notion of the role a vision network can play in enabling novel
intelligent applications derives from the potential interactions between the various
disciplines outlined in Fig. 1. The vision network offers access to quantitative
knowledge about the events of interest such as the location and other attributes
of a human subject. Such quantitative knowledge can either complement or
provide specific qualitative distinctions for AI-based problems. On the other hand,
we may not intend to extract all the detailed quantitative knowledge available
in visual data since often a coarse qualitative representation may be sufficient
in addressing the application [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. In turn, qualitative representations can offer
clues to the features of interest to be derived from the visual data allowing the
vision network to adjust its processing operation according to the interpretation
state. Hence, the interaction between the vision processing module and the
reasoning module can in principle enable both sides to function more effectively.
For example, in a human gesture analysis application, the observed elements of
gesture extracted by the vision module can assist the AI-based reasoning module
in its interpretative tasks, while the deductions made by the high-level reasoning
system can provide feedback to the vision system from the available context or
behavior model knowledge.
      </p>
      <p>In this paper we introduce a model-based data fusion framework for human
posture analysis that makes opportunistic use, in a principled way, of the manifold
sources of vision-based information obtained from the camera network. The framework
spans the three dimensions of time (each camera collecting data over time), space
(different camera views), and feature levels (selecting and fusing different feature
subsets). Furthermore, the paper outlines potentials for interaction between the
distributed vision network and the high-level reasoning system.</p>
      <p>[Fig. 2: the human model, carrying kinematics, attributes, and states, acts as the
bridge between the vision network and the reasoning/interpretation module.]</p>
        <p>The structure of the vision-based processing operation has been designed in
such a way that the lower-level functions as well as other in-node processing
operations will utilize feedback from higher levels of processing. While feedback
mechanisms have been studied in active vision areas, our approach aims to
incorporate interactions between the vision and the AI operations as the source
of active vision feedback. To facilitate such interactions, we introduce a human
model as the convergence point and a bridge for the two sides, enabling both to
incorporate the results of their deductions into a single merging entity. For the
vision network, the human model acts as the embodiment of the fused visual data
contributed by the multiple cameras over observation periods. For the AI-based
functions, the human model acts as a carrier of all the sensed data from which
gesture interpretations can be deduced over time through rule-based methods
or mapping to training data sets of interesting gestures. Fig. 2 illustrates this
concept in a concise way.</p>
        <p>In Section 2 we outline the different interactions between the vision and AI
modules as well as the temporal and spatial model-based feedback mechanisms
employed in our vision analysis approach. Section 3 presents details and
examples for our model-based and opportunistic feature fusion mechanisms in human
posture analysis. In Section 4 an example collaborative vision-based scheme for
deriving qualitative assessment for fall detection is described. Section 5 offers
some concluding remarks and the topics of current investigation.</p>
    </sec>
    <sec id="sec-2">
      <title>2 The Framework</title>
        <sec id="sec-27-7-1">
          <title>Interpretation levels</title>
        </sec>
        <sec id="sec-27-7-2">
          <title>Behavior analysis</title>
        </sec>
        <sec id="sec-27-7-3">
          <title>Instantaneous action</title>
        </sec>
        <sec id="sec-27-7-4">
          <title>Low-level features</title>
        </sec>
        <sec id="sec-27-7-5">
          <title>AI reasoning</title>
        </sec>
        <sec id="sec-27-7-6">
          <title>Posture / attributes</title>
        </sec>
        <sec id="sec-27-7-7">
          <title>Model parameters</title>
        </sec>
        <sec id="sec-27-7-8">
          <title>Feedback AI</title>
        </sec>
        <sec id="sec-27-7-9">
          <title>Vision</title>
        </sec>
        <sec id="sec-27-7-10">
          <title>Processing</title>
          <p>Queries
Context
Persistence
Behavior attributes
example enable the camera to examine the persistence of those features, or to
avoid re-initialization of local parameters. In the network of cameras, spatial
fusion of data in any of the forms of merged estimates or a collective decision,
or in our model-based approach in the form of updates from body part tracking,
can provide feedback information to each camera. The feedback can for example
be in the form of indicating the features of interest that need to be tracked by
the camera, or as initialization parameters for the local segmentation functions.
Fig. 4 illustrates the different feedback paths within the vision processing unit.</p>
    </sec>
    <sec id="sec-3">
      <title>3 Collaborative Vision Network</title>
      <p>We introduce a generic opportunistic fusion approach in multi-camera networks
in order to both employ the rich visual information provided by cameras and
incorporate learned knowledge of the subject into active vision analysis. The
opportunistic fusion is composed of three dimensions: space, time, and feature
levels. For human gesture analysis in a multi-camera network, spatial
collaboration between multi-view cameras naturally facilitates resolving occlusions. It is
especially advantageous for gesture analysis since the human body is self-occluding.
Moreover, temporal and feature fusion help to gain subject-specific knowledge,
such as the current gesture and subject appearance. This knowledge is in turn
used for a more actively directed vision analysis.</p>
      <sec id="sec-3-1">
        <title>3.1 The 3D Human Body Model</title>
        <p>Fitting human models to images or videos has been an interesting topic for
which a variety of methods have been developed. Usually assuming a dynamic
model (such as walking)[
            <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
            ] will greatly help us to predict and validate the
posture estimates. But tracking can easily fail in case of sudden motions or other
movements that differ much from the dynamic model.</p>
        <p>[Figure: each camera (CAM 1 … CAM N) performs early vision processing and
temporal fusion to extract features; spatial fusion across the cameras combines
estimate fusion, decision fusion, and model-based fusion into the shared human
model.]</p>
        <p>Therefore we always need
to be aware of the balance between the limited dynamics and the capability
to discover more diversified postures. For multi-view scenarios, a 3D model can
be reconstructed by combining observations from different views [
            <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
            ]. Most
methods start from silhouettes in different cameras, then points occupied by the
subject can be estimated, and finally a 3D model with principal body parts is
fit in the 3D space [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ]. The approach above is relatively “clean” since the only
image components it is based on are the silhouettes. But at the same time the 3D
voxel reconstruction is sensitive to the quality of the silhouettes and the accuracy
of camera calibration. It is not difficult to find situations where background
subtraction for silhouettes suffers in quality or is almost impossible (a cluttered,
complex background, or the subject wearing clothes with colors similar to the
background). Another aspect of the human model fitting problem is the choice
of image features. All human model fitting methods rely on some image
features as targets to fit the model. Most of them are based on generic features
such as silhouettes or edges [
            <xref ref-type="bibr" rid="ref12 ref14">14, 12</xref>
            ]. Some use skin color, but those methods
are prone to failure in some situations since lighting usually has a big influence on
color, and skin color varies from person to person.
          </p>
          <p>In our work, we aim to incorporate appearance attributes adaptively learned
from the network for initialization of segmentation, because usually color or
texture regions are easier to find than generic features such as edges. Another
emphasis of our work is that images from a single camera are first reduced to
short descriptions and then reconstruction of the 3D human model is based on
descriptions collected from multiple cameras. Therefore concise descriptions are
the expected outputs from image segmentation.</p>
        <p>[Fig. 5: the opportunistic fusion spans time (local processing in each camera) and
space (collaboration in the camera network); description layers range from image
features (Layer 2) through gesture elements (Layer 3) to gestures (Layer 4).]</p>
          <p>In our approach a 3D human body model embodies up-to-date information
from both current and historical observations of all cameras in a concise way.
It has the following components: 1. Geometric configuration: body part lengths,
angles. 2. Color or texture of body parts. 3. Motion of body parts. The three
components are all updated from the three dimensions of space, time and features
of the opportunistic fusion.</p>
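          <p>To make these components concrete, the following minimal sketch (in Python) shows
one possible in-memory representation of such a model; the field names and the per-part
parameterization are illustrative assumptions rather than part of the proposed framework.</p>
          <preformat preformat-type="code">
# Minimal sketch of the 3D human body model components listed above.
# Field names and per-part parameters are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class BodyPart:
    length: float                            # geometric configuration: part length
    angles: Tuple[float, float]              # joint angles (elevation, azimuth)
    mean_color: Tuple[float, float, float]   # adaptively learned appearance (mean color)
    velocity: Tuple[float, float, float] = (0.0, 0.0, 0.0)   # motion of the part

@dataclass
class HumanModel3D:
    parts: Dict[str, BodyPart] = field(default_factory=dict)  # e.g. "left_upper_arm"
    confidence: float = 0.0                  # how much the model is trusted as feedback

    def update_part(self, name: str, part: BodyPart, conf: float) -> None:
        """Overwrite a part estimate produced by spatial/temporal/feature fusion."""
        self.parts[name] = part
        self.confidence = conf
</preformat>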
          <p>Apart from providing flexibility in gesture interpretations, the 3D human
model also plays significant roles in the vision analysis process. First, the total
size of parameters needed to reconstruct the model is very small compared to the raw
images, making it affordable to communicate. For each camera, only segment
descriptions are needed for collaboratively reconstructing the 3D model. Second,
the model is a converging point of spatiotemporal and feature fusion. All the
parameters it maintains are updated from the three dimensions of space, time and
features of the opportunistic fusion. At sufficient confidence levels, parameters of
the 3D human body model are in turn used as feedback to aid subsequent vision
analysis. Third, although predefined appearance attributes are generally not
reliable, adaptively learned appearance attributes can be used to identify the person
or body parts. Those attributes are usually more distinguishable than generic
features such as edges once correctly discovered.</p>
          <p>[Fig. 6: color segmentation and ellipse fitting in local processing (background
subtraction, rough segmentation, EM refinement of the color models, watershed
segmentation, ellipse fitting), followed by an update of the 3D human body model
(color/texture, motion) using the previous color distribution and the previous
geometric configuration and motion.]</p>
          <p>
            The 3D model maps to the Gesture Elements layer in the layered architecture
for gesture analysis (lower left part of Fig. 5) we proposed in [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ]. However, here
it not only assumes spatial collaboration between cameras, but also connects
decisions from historical observations with current observations.</p>
        </sec>
        <sec id="sec-3-2">
          <title>3.2 The Opportunistic Fusion Mechanisms</title>
          <p>The opportunistic fusion framework for gesture analysis is shown in Fig. 5. On
the top of Fig. 5 are spatial fusion modules. In parallel is the progression of the
3D human body model. Suppose now it is t0, and we have the model with the
collection of parameters as M0. At the next instant t1, the current model M0
is input to the spatial fusion module for t1, and the output decisions are used to
update M0 from which we get the new 3D model M1.</p>
          <p>Now we look into a specific spatial fusion module (the lower part of Fig. 5)
for the detailed process. In the bottom layer of the layered gesture analysis,
image features are extracted from local processing. Distinct features (e.g. colors)
specific to the subject are registered in the current model M0 and are used
for analysis, which may be much easier than always looking for patterns of the
generic features (arrow 1 in Fig. 5). After local processing, data is shared
between cameras to derive a new estimate of the model. Parameters in M0
specify a smaller space of possible M1’s. Then decisions from spatial fusion of
cameras are used to update M0 to get the new model M1 (arrow 2 in Fig. 5).
Therefore every update of the model M combines space (spatial collaboration
between cameras), time (the previous model M0) and feature levels (choice
of image features in local processing, from both new observations and
subject-specific attributes in M0). Finally the new model M1 is used for high-level
gesture deductions in a certain scenario (arrow 2 in Fig. 5).</p>
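          <p>As an illustration of this update cycle, the sketch below reduces the model to a
parameter vector with a confidence value and performs one fusion step from M0 to M1.
The confidence-weighted averaging rule and the variable names are our own assumptions,
chosen only to show how the space, time, and feature dimensions meet in a single update.</p>
          <preformat preformat-type="code">
import numpy as np

# Sketch of one opportunistic-fusion step (M0 to M1). The model is reduced to a
# parameter vector plus a confidence; the weighted-averaging rule is illustrative.
def fusion_step(camera_estimates, camera_confidences, m0, m0_confidence):
    """camera_estimates: per-camera parameter vectors derived at time t1 (space);
    m0, m0_confidence: the previous model and its confidence (time)."""
    w = np.asarray(camera_confidences, dtype=float)
    spatial = np.average(np.asarray(camera_estimates, dtype=float), axis=0, weights=w)
    spatial_conf = float(w.mean())
    # Temporal dimension: blend the spatially fused estimate with the previous model.
    total = spatial_conf + m0_confidence
    m1 = (spatial_conf * spatial + m0_confidence * np.asarray(m0, dtype=float)) / total
    return m1, max(spatial_conf, m0_confidence)

# Example: three cameras report arm-angle estimates; the previous model M0 pulls the
# fused estimate toward its value in proportion to its confidence.
m1, conf1 = fusion_step([[0.9, 0.1], [1.0, 0.2], [0.8, 0.0]],
                        [0.9, 0.6, 0.7], m0=[0.7, 0.1], m0_confidence=0.5)
</preformat>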
          <p>An implementation for the 3D human body posture estimation is illustrated
in Fig. 6. Local processing in single cameras includes segmentation and ellipse
fitting for a concise parametrization of segments. For spatial collaboration,
ellipses from all cameras are merged to find the geometric configuration of the 3D
skeleton model.</p>
        </sec>
        <sec id="sec-3-3">
          <title>3.3 In-Node Feature Extraction</title>
          <p>The goal of local processing in a single camera is to reduce raw images/videos to
simple descriptions so that they can be efficiently transmitted between cameras.
The output of the algorithm will be ellipses fitted from segments and the mean
color of the segments. As shown in the upper part of Fig. 6, local processing
includes image segmentation for the subject and ellipse fitting to the extracted
segments.</p>
          <p>We assume the subject is characterized by a distinct color distribution.
Foreground area is obtained through background subtraction. Pixels with high or
low illumination are also removed since for those pixels chrominance may not
be reliable. Then a rough segmentation for the foreground is done either based
on K-means on chrominance of the foreground pixels or color distributions from
the known model. In the initialization stage when the model hasn’t been well
established, or when we don’t have a high confidence in the model, we need
to start from the image itself and use a method such as K-means to find color
distribution of the subject. However, when a model with a reliable color
distribution is available, we can directly assign pixels to different segments based on
the existing color distribution. The color distribution maintained by the model
may not be accurate for all cameras, since in different cameras illumination may
change. Also the subject’s appearance may change due to the movement or
lighting conditions. Therefore the color distribution of the model is only used for a
rough segmentation in initialization of the segmentation scheme. Then an EM
(expectation maximization) algorithm is used to refine the color distribution for
the current image. The initial estimated color distribution plays an important
role because it can prevent EM from being trapped in local minima.</p>
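          <p>The sketch below illustrates this rough-segmentation step with OpenCV and NumPy.
The thresholds, the choice of the YCrCb color space, and the number of modes are
illustrative assumptions rather than the exact settings used in our implementation.</p>
          <preformat preformat-type="code">
import cv2
import numpy as np

# Sketch of in-node rough segmentation: background subtraction followed by K-means
# on the chrominance of the foreground pixels. Thresholds and n_modes are assumptions.
def rough_segmentation(frame_bgr, background_bgr, n_modes=3, diff_thresh=30):
    # Foreground mask from simple background subtraction.
    diff = cv2.absdiff(frame_bgr, background_bgr)
    fg_mask = diff.max(axis=2) > diff_thresh

    # Drop very dark and very bright pixels, whose chrominance is unreliable.
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    luma = ycrcb[:, :, 0]
    fg_mask = np.logical_and(fg_mask, np.logical_and(luma > 30, 225 > luma))

    # K-means on the chrominance (Cr, Cb) of the foreground pixels gives the rough
    # segmentation; with a trusted model, pixels would instead be assigned directly
    # to the model's existing color modes.
    chroma = ycrcb[:, :, 1:3][fg_mask].astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
    _, labels, centers = cv2.kmeans(chroma, n_modes, None, criteria, 5,
                                    cv2.KMEANS_PP_CENTERS)

    # Scatter the cluster labels back into image coordinates (-1 marks background).
    label_img = np.full(frame_bgr.shape[:2], -1, dtype=np.int32)
    label_img[fg_mask] = labels.ravel()
    return label_img, centers
</preformat>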
          <p>Suppose the color distribution is a mixture of N Gaussian modes, with
parameters Θ = {θ1, θ2, . . . , θN}, where θl = {μl, Σl} contains the mean and covariance
matrix of mode l. The mixing weights of the modes are A = {α1, α2, . . . , αN}.
The EM algorithm aims to find the probability of each pixel xi belonging to a
certain mode θl: Pr(yi = l | xi).</p>
          <p>
            However, the basic EM algorithm takes each pixel independently, without
considering the fact that pixels belonging to the same mode are usually
spatially close to each other. In [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ] Perceptually Organized EM (POEM) is
introduced. In POEM, the influence of neighbors is incorporated by a weighting measure
w(xi, xj) = exp(−‖xi − xj‖/σ1² − ‖s(xi) − s(xj)‖/σ2²), where s(xi) is the spatial
coordinate of xi. The “votes” for xi from its neighborhood are then given by</p>
          <p>Vl(xi) = Σxj αl(xj) w(xi, xj),  where αl(xj) = Pr(yj = l | xj)    (1)</p>
          <p>Then modifications are made to the EM steps. In the E step, the mixing weight
αl(k) is changed to αl(k)(xi), which means that for every pixel xi the mixing weights of
the different modes are different. This is partially due to the influence of neighbors.
In the M step, the mixing weights are updated by</p>
          <p>αl(k)(xi) = exp(η Vl(xi)) / Σk=1..N exp(η Vk(xi))    (2)</p>
          <p>η controls the “softness” of the neighbors’ votes. If η is as small as 0, the mixing
weights are always uniform. If η approaches infinity, the mixing weight for the
mode with the largest vote will be 1.</p>
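          <p>The following NumPy sketch computes the votes of Eq. (1) and the mixing weights of
Eq. (2). For readability it sums over all pixels instead of a local neighborhood, and the
variable names are ours; a practical implementation would restrict w(xi, xj) to nearby pixels.</p>
          <preformat preformat-type="code">
import numpy as np

# Sketch of the POEM neighborhood vote (Eq. 1) and mixing-weight update (Eq. 2).
# colors: (N, d) feature values; coords: (N, 2) pixel coordinates;
# resp: (N, L) responsibilities Pr(y_j = l | x_j) from the previous E step.
def poem_mixing_weights(colors, coords, resp, sigma1, sigma2, eta):
    # Pairwise weights w(x_i, x_j) from feature and spatial distances.
    d_feat = np.linalg.norm(colors[:, None, :] - colors[None, :, :], axis=2)
    d_spat = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    w = np.exp(-d_feat / sigma1**2 - d_spat / sigma2**2)

    # Eq. (1): votes V_l(x_i) = sum_j alpha_l(x_j) * w(x_i, x_j).
    votes = w @ resp                                   # shape (N, L)

    # Eq. (2): per-pixel mixing weights, a softmax over the votes with temperature eta.
    scaled = eta * votes
    scaled -= scaled.max(axis=1, keepdims=True)        # for numerical stability
    alpha = np.exp(scaled)
    return alpha / alpha.sum(axis=1, keepdims=True)
</preformat>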
          <p>After refinement of the color distribution with POEM, we set pixels with a high
probability (e.g., greater than 99.9%) of belonging to a certain mode as markers
for that mode. Then the watershed segmentation algorithm is applied to assign
labels to the undecided pixels. Finally an ellipse is fitted to every segment in
order to obtain a concise parameterization for the segment.</p>
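          <p>A sketch of this marker-based watershed step and the final ellipse fitting is given
below, again with OpenCV. The 99.9% marker threshold follows the text; everything else
(input layout, returned fields) is an illustrative assumption.</p>
          <preformat preformat-type="code">
import cv2
import numpy as np

# Sketch of marker selection, watershed labeling, and ellipse fitting. prob is an
# (H, W, L) array of per-pixel mode probabilities after POEM refinement.
def segments_to_ellipses(frame_bgr, prob, marker_thresh=0.999):
    markers = np.zeros(prob.shape[:2], dtype=np.int32)
    for l in range(prob.shape[2]):
        markers[prob[:, :, l] > marker_thresh] = l + 1   # confident seeds for mode l

    # Watershed assigns labels to the undecided pixels; boundaries become -1.
    labels = cv2.watershed(frame_bgr, markers.copy())

    ellipses = []
    for l in range(1, prob.shape[2] + 1):
        ys, xs = np.nonzero(labels == l)
        if ys.size >= 5:                                 # fitEllipse needs 5+ points
            pts = np.column_stack([xs, ys]).astype(np.float32)
            (cx, cy), (ax1, ax2), angle = cv2.fitEllipse(pts)
            mean_color = frame_bgr[ys, xs].mean(axis=0)  # concise segment description
            ellipses.append({"center": (cx, cy), "axes": (ax1, ax2),
                             "angle": angle, "color": mean_color.tolist()})
    return ellipses
</preformat>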
        </sec>
        <sec id="sec-3-4">
          <title>3.4 Posture Estimation</title>
          <p>Human posture estimation is essentially an optimization problem, in which we
try to minimize the distance between the posture and ellipses from multi-view
cameras. There can be several different ways to find the 3D skeleton model based
on observations from multi-view images. One method is to directly solve for the
unknown parameters through geometric calculation. In this method we need
to first establish correspondence between points/segments in different cameras,
which is itself a hard problem. Commonly observed point features are rare for
human subjects, and body parts may take on very different appearances from
different views. Therefore it is difficult to resolve ambiguity in 3D space based
on 2D observations. A second method would be to cast a standard optimization
problem, in which we find optimal θi’s and φi’s to minimize an objective function
(e.g., difference between projections due to a certain 3D model and the actual
segments) based on properties of the objective function. However, if the
problem is highly nonlinear or non-convex, it will be very difficult or time-consuming
to solve. Therefore searching strategies which do not explicitly depend on the
objective function formulation are desired.</p>
          <p>
            Motivated by [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ], Particle Swarm Optimization (PSO) is used as the
optimization technique. The lower part of Fig. 6 shows the estimation process.
Ellipses from local processing of single cameras are merged together to
reconstruct the skeleton. Here we consider a simplified problem in which only arms
change in position while other body parts are kept in the default location.
Elevation angles (θi) and azimuth angles (φi) of the left/right upper/lower parts
of the arms are specified as parameters. The assumption is that projection
matrices from 3D skeleton to 2D image planes are known. This can be achieved
either from locations of cameras and the subject, or it can be calculated from
some known projective correspondences between the 3D subject and points in
the images, without knowing exact locations of cameras or the subject.
          </p>
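          <p>For illustration, the sketch below builds the arm joints from the elevation and azimuth
angles, projects them with the known 3×4 projection matrices, and scores a candidate
posture against the observed ellipse centers. The segment lengths, the joint layout, and the
use of ellipse centers only (rather than the full ellipse geometry) are simplifying assumptions.</p>
          <preformat preformat-type="code">
import numpy as np

# Sketch of a posture objective: angles -> 3D joints -> 2D projections -> distance to
# the observed ellipse centers in every camera. All constants are illustrative.
def arm_points(shoulder, angles, l_upper=0.3, l_lower=0.25):
    theta_u, phi_u, theta_l, phi_l = angles
    def direction(theta, phi):
        return np.array([np.cos(theta) * np.cos(phi),
                         np.cos(theta) * np.sin(phi),
                         np.sin(theta)])
    elbow = shoulder + l_upper * direction(theta_u, phi_u)
    wrist = elbow + l_lower * direction(theta_l, phi_l)
    return np.stack([shoulder, elbow, wrist])

def project(P, points_3d):
    homo = np.hstack([points_3d, np.ones((len(points_3d), 1))])
    uvw = homo @ P.T                       # P is the known 3x4 projection matrix
    return uvw[:, :2] / uvw[:, 2:3]

def objective(angles, shoulder, projections, observed_centers):
    """Sum over cameras of distances between projected joints and observed ellipses."""
    pts = arm_points(np.asarray(shoulder, dtype=float), angles)
    cost = 0.0
    for P, obs in zip(projections, observed_centers):
        proj = project(np.asarray(P, dtype=float), pts)
        cost += np.linalg.norm(proj - np.asarray(obs, dtype=float), axis=1).sum()
    return cost
</preformat>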
          <p>PSO is suitable for posture estimation as an evolutionary optimization
mechanism. It starts from a group of initial particles. During the evolution of the
particles towards an optimum, they are drawn toward good positions while keeping
some randomness to explore the search space. Suppose there are N particles
(test configurations) xi, each a vector of θi’s and φi’s. vi is the velocity of xi.
The best position of xi so far is xˆi, and the global best position of all xi’s so far
is g. f(·) is the objective function; we wish to find the position x that minimizes
f(x). The PSO algorithm is as follows:
1. Initialize xi and vi. vi is usually set to 0, and xˆi = xi. Evaluate f(xi) and
set g = argmin f(xi).
2. While the stop criterion is not satisfied, do for every xi:
– vi ← ωvi + c1r1(xˆi − xi) + c2r2(g − xi);
– xi ← xi + vi;
– If f(xi) &lt; f(xˆi), set xˆi = xi; if f(xi) &lt; f(g), set g = xi.</p>
          <p>The stop criterion: after updating all N xi’s once, if the improvement in f(g) falls below
a threshold, the algorithm exits. ω is the “inertial” coefficient, while c1 and
c2 are the “social” coefficients. r1 and r2 are random vectors with each element
uniformly distributed on [0, 1]. The choice of ω, c1 and c2 controls the convergence
process of the evolution. If ω is big, the particles have more inertia and tend
to keep their own directions to explore the search space. This allows for more
chance of finding the “true” global optimum if the group of particles is currently
around a local optimum. If c1 and c2 are big, the particles are more “social”
with the other particles and go quickly to the best positions known by the group.
In our experiment, N = 16, ω = 0.3 and c1 = c2 = 1.</p>
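          <p>A compact NumPy version of this PSO loop is sketched below with the settings reported
above (N = 16, ω = 0.3, c1 = c2 = 1). The bounded initialization and the iteration cap are
our own choices; the objective f would be the posture-fitting cost described earlier.</p>
          <preformat preformat-type="code">
import numpy as np

# Sketch of the PSO loop described above. The stopping rule combines the stated
# improvement threshold with a fixed iteration budget (an illustrative choice).
def pso_minimize(f, lower, upper, n_particles=16, omega=0.3, c1=1.0, c2=1.0,
                 max_iter=100, tol=1e-4, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    lower, upper = np.asarray(lower, dtype=float), np.asarray(upper, dtype=float)
    x = rng.uniform(lower, upper, size=(n_particles, lower.size))   # particle positions
    v = np.zeros_like(x)                                            # velocities start at 0
    best_x = x.copy()
    best_f = np.array([f(p) for p in x])
    g = best_x[best_f.argmin()].copy()                              # global best position
    g_f = float(best_f.min())

    for _ in range(max_iter):
        prev_g_f = g_f
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = omega * v + c1 * r1 * (best_x - x) + c2 * r2 * (g - x)
        x = x + v
        f_x = np.array([f(p) for p in x])
        improved = best_f > f_x                      # update personal bests
        best_x[improved] = x[improved]
        best_f[improved] = f_x[improved]
        if g_f > best_f.min():                       # update the global best
            g = best_x[best_f.argmin()].copy()
            g_f = float(best_f.min())
        if tol > prev_g_f - g_f:                     # improvement fell below threshold
            break
    return g, g_f

# Example: minimize a simple quadratic over 8 angle parameters.
g_best, g_val = pso_minimize(lambda a: float(np.sum((a - 0.5) ** 2)),
                             lower=np.zeros(8), upper=np.ones(8))
</preformat>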
          <p>Examples for in-node segmentation are shown in Fig. 7(a). Some examples
showing images from 3 views and the posture estimates are in Fig. 7(b).</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Towards Behavior Interpretation</title>
      <p>An appropriate classification is essential towards a better understanding of the
variety of passive gestures. Therefore, we propose a categorization of the gestures
as follows:
– Static gestures, such as standing, sitting, lying;
– Dynamic gestures, such as waving arms, jumping;
– Interactions with other people, such as chatting;
– Interactions with the environment, such as dropping or picking up objects.
</p>
      <p>[Fig. 8: each camera performs silhouette-based shape fitting and reports the body
orientation, the aspect ratio, and the goodness of the shape fit, which are turned into
an alert level and a weight; a logic block combines the states across cameras, followed
by multi-camera model fitting and event interpretation, with options to initialize or
validate the model search space from the 3D model or to wait for more observations.
Fig. 9: decision tree over posture orientation (vertical, horizontal, undetermined),
head position (top, side), and arm and leg positions (left, right).]</p>
      <p>In the first step, each camera fits a shape
to the silhouette and estimates the orientation and the aspect ratio of the fitted
(e.g. elliptical) shape. The network’s objective at this stage is to decide on one of
the branches in the top level of a tree structure (see Fig. 9) between the possible
posture values of vertical, horizontal, or undetermined. To this end, each camera
uses the orientation angle and the aspect ratio of the fitted ellipse to produce
an alert level, which ranges from -1 (for safe) to 1 (for danger). Combining the
angle and the aspect ratio is based on the assumption that nearly vertical or
nearly horizontal ellipses with aspect ratios away from one provide a better
basis for choosing one of the vertical and horizontal branches in the decision tree
than when the aspect ratio is close to one or when the ellipse has, for example,
a 45-degree orientation.</p>
          <p>Fig. 10 illustrates an example of the alert level function combining the
orientation and aspect ratio attributes in each camera. The camera broadcasts the
value of this function for the collaborative decision making process. Along with
the alert level, the camera also produces a figure of merit value for the shape
fitted to the human silhouette. The figure of merit is used as a weighting parameter
when the alert level values declared by the cameras are combined.</p>
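          <p>A minimal sketch of this per-camera alert level and the confidence-weighted combination
is shown below. The exact shape of the function in Fig. 10 is not reproduced; the version here
only follows the stated intuition that an elongated near-horizontal ellipse indicates danger
and an elongated near-vertical ellipse indicates a safe posture.</p>
          <preformat preformat-type="code">
import numpy as np

# Sketch of a per-camera alert level and the weighted combination across cameras.
# The specific functional form is an illustrative assumption.
def alert_level(orientation_deg, aspect_ratio):
    """Returns a value in [-1, 1]: -1 safe (vertical), +1 danger (horizontal).
    orientation_deg is the major-axis angle from the horizontal; aspect_ratio is
    major/minor, so values near 1 give a weak (uncertain) alert."""
    elongation = min(1.0, max(0.0, aspect_ratio - 1.0))    # ~0 for a round ellipse
    lean = np.cos(2.0 * np.radians(orientation_deg))       # +1 horizontal, -1 vertical
    return float(elongation * lean)

def combine_alerts(levels, confidences):
    """Confidence-weighted average of the alert levels declared by the cameras."""
    return float(np.average(np.asarray(levels, dtype=float),
                            weights=np.asarray(confidences, dtype=float)))

# Example with three cameras: two confident near-safe votes and one view with zero
# confidence (poor shape fit) yield an overall alert on the safe side.
overall = combine_alerts([-0.94, -0.17, 0.66], [0.84, 0.73, 0.0])
</preformat>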
          <p>Fig. 11 presents cases in which the user is walking, falling and lying down.
The posture detection outcome is superimposed on the silhouette of the person
for each camera. The resulting alert levels and their respective weights are shared
by the cameras, from which the overall alert level shown in the figure is obtained.</p>
    </sec>
    <sec id="sec-5">
      <title>5 Concluding Remarks</title>
      <p>In this paper we explore the interactive framework between vision and AI. While
vision is helpful to derive reasoning building blocks for higher levels, there is more
to the framework. We claim that the feedback between the vision module and
the reasoning module is able to benefit both.</p>
          <p>A framework of data fusion in distributed vision networks is proposed.
Motivated by the concept of opportunistic use of available information across the
different processing and interpretation levels, the proposed framework has been
designed to incorporate interactions between the vision module and the
high-level reasoning module. Such interactions allow the quantitative knowledge from
the vision network to provide specific qualitative distinctions for AI-based
problems, and in turn, allows the qualitative representations to offer clues to direct
the vision network to adjust its processing operation according to the
interpretation state. Two vision-based fusion algorithms were presented, one based
on reconstructing the fully parameterized human model and the other based on
a sequence of direct deductions about the posture elements in a fall detection
application.</p>
          <p>The current work includes incorporation of body part motion into the
fully parameterized human body model, allowing the model to carry the gesture
elements in interactions between the vision network and the high-level reasoning
module. Other extensions of interest include creating a link from the human
model to the reduced qualitative description set for a specific application, and
utilizing deductions made by the AI system as a basis for active vision in
multi-camera settings.</p>
      <p>[Fig. 10: the alert level ranges from safe (−1) through uncertain (0) to danger (+1).
Fig. 11: per-camera alert levels and confidences are shown for each case; after
combining observations from the three cameras a final score is given indicating
whether the person is standing (safe, combined score −0.8075), uncertain (−0.3039),
or lying down (danger, 0.6201).]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Kwolek</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Visual system for tracking and interpreting selected human actions</article-title>
          .
          <source>In: WSCG</source>
          . (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>G.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Corso</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G. D.</given-names>
            <surname>Hager</surname>
          </string-name>
          :
          <article-title>Visual Modeling of Dynamic Gestures Using 3D Appearance and Motion Features</article-title>
          . In:
          <article-title>Real-Time Vision for Human-Computer Interaction</article-title>
          . Springer-Verlag (
          <year>2005</year>
          )
          <fpage>103</fpage>
          -
          <lpage>120</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Aghajan</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Augusto</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCullagh</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Walkden</surname>
          </string-name>
          , J.:
          <article-title>Distributed vision-based accident management for assisted living</article-title>
          .
          <source>In: ICOST</source>
          <year>2007</year>
          , Nara, Japan
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Patil</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rybski</surname>
            ,
            <given-names>P.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kanade</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Veloso</surname>
            ,
            <given-names>M.M.:</given-names>
          </string-name>
          <article-title>People detection and tracking in high resolution panoramic video mosaic</article-title>
          .
          <source>In: Proc. of IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS)</source>
          . Volume
          <volume>1</volume>
          . (Oct.
          <year>2004</year>
          )
          <fpage>1323</fpage>
          -
          <lpage>1328</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Robertson</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Trucco</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Human body posture via hierarchical evolutionary optimization</article-title>
          .
          <source>In: BMVC06</source>
          . (
          <year>2006</year>
          ) III:
          <fpage>999</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Rittscher</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blake</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roberts</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Towards the automatic analysis of complex human body motions</article-title>
          .
          <source>Image and Vision Computing</source>
          (
          <volume>12</volume>
          ) (
          <year>2002</year>
          )
          <fpage>905</fpage>
          -
          <lpage>916</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Cucchiara</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prati</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vezzani</surname>
          </string-name>
          , R.:
          <article-title>Posture classification in a multi-camera indoor environment</article-title>
          .
          <source>In: ICIP05</source>
          . (
          <year>2005</year>
          ) I:
          <fpage>725</fpage>
          -
          <lpage>728</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Björn</given-names>
            <surname>Gottfried</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Hans Werner</given-names>
            <surname>Guesgen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Hübner</surname>
          </string-name>
          :
          <article-title>Spatiotemporal Reasoning for Smart Homes</article-title>
          . In: Designing Smart Homes. Springer (
          <year>2006</year>
          )
          <fpage>16</fpage>
          -
          <lpage>34</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Sidenbladh</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Black</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sigal</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Implicit probabilistic models of human motion for synthesis and tracking</article-title>
          .
          <source>In: ECCV '02: Proceedings of the 7th European Conference on Computer</source>
          <string-name>
            <surname>Vision-Part</surname>
            <given-names>I</given-names>
          </string-name>
          , London, UK, Springer-Verlag (
          <year>2002</year>
          )
          <fpage>784</fpage>
          -
          <lpage>800</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Deutscher</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blake</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reid</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Articulated body motion capture by annealed particle filtering</article-title>
          . (
          <year>2000</year>
          ) II:
          <fpage>126</fpage>
          -
          <lpage>133</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Cheung</surname>
            ,
            <given-names>K.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kanade</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Shape-from-silhouette across time: Part ii: Applications to human modeling and markerless motion tracking</article-title>
          .
          <source>International Journal of Computer Vision</source>
          <volume>63</volume>
          (
          <issue>3</issue>
          ) (
          <year>August 2005</year>
          )
          <fpage>225</fpage>
          -
          <lpage>245</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. Ménier,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Boyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            ,
            <surname>Raffin</surname>
          </string-name>
          ,
          <string-name>
            <surname>B.</surname>
          </string-name>
          :
          <article-title>3d skeleton-based body pose recovery</article-title>
          .
          <source>In: Proceedings of the 3rd International Symposium on 3D Data Processing, Visualization and Transmission</source>
          , Chapel Hill (USA).
          <source>(june</source>
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Mikic</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Trivedi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hunter</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cosman</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Human body model acquisition and tracking using voxel data</article-title>
          .
          <source>Int. J. Comput. Vision</source>
          <volume>53</volume>
          (
          <issue>3</issue>
          ) (
          <year>2003</year>
          )
          <fpage>199</fpage>
          -
          <lpage>223</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Sidenbladh</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Black</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Learning the statistics of people in images and video</article-title>
          .
          <volume>54</volume>
          (
          <issue>1-3</issue>
          ) (
          <year>August 2003</year>
          )
          <fpage>183</fpage>
          -
          <lpage>209</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aghajan</surname>
          </string-name>
          , H.:
          <article-title>Layered and collaborative gesture analysis in multi-camera networks</article-title>
          .
          <source>In: ICASSP. (Apr</source>
          .
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Weiss</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Adelson</surname>
          </string-name>
          , E.:
          <article-title>Perceptually organized EM: A framework for motion segmentation that combines information about form and motion</article-title>
          .
          <source>Technical Report 315</source>
          ,
          <string-name>
            <given-names>M.I.T Media</given-names>
            <surname>Lab</surname>
          </string-name>
          (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Ivecovic</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Trucco</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Human body pose estimation with PSO</article-title>
          .
          <source>In: IEEE Congress on Evolutionary Computation</source>
          . (
          <year>2006</year>
          )
          <fpage>1256</fpage>
          -
          <lpage>1263</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>