<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Enhancement of Low-Level Classifications for Ambient Assisted Living</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rachel GOSHORN</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Deborah GOSHORN</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mathias KÖLSCH</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department and MOVES Institute, Naval Postgraduate School</institution>
          ,
          <country country="US">U.S.A</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Computer Science and Engineering Department, University of California, San Diego</institution>
          ,
          <country country="US">U.S.A</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Systems Engineering Department, Naval Postgraduate School</institution>
          ,
          <country country="US">U.S.A</country>
        </aff>
      </contrib-group>
      <fpage>87</fpage>
      <lpage>101</lpage>
      <abstract>
        <p>Assisted living means providing those being assisted with custom services specific to their needs and capabilities. Computer monitoring can supply some of these services, be it through attached devices or “smart” environments. In this paper, we describe an ambient system that we have built to facilitate non-verbal interaction that is not bound to the traditional input means of a keyboard and mouse. We investigated the reliability of hand gesture behavior recognition, from which computer commands for AAL communications are interpreted. Our findings show that hand gesture behavioral analysis reduces false classifications while at the same time more than doubling the available vocabulary. These results will influence the design of gestural and multimodal user interfaces.</p>
      </abstract>
      <kwd-group>
        <kwd>human-computer interaction</kwd>
        <kwd>hand postures</kwd>
        <kwd>hand gestures</kwd>
        <kwd>user interface</kwd>
        <kwd>ambient intelligence</kwd>
        <kwd>posture recognition</kwd>
        <kwd>gesture recognition</kwd>
        <kwd>smart environments</kwd>
        <kwd>computer vision</kwd>
        <kwd>human body tracking</kwd>
        <kwd>hand gesture behaviors</kwd>
        <kwd>Ambient Assisted Living (AAL)</kwd>
        <kwd>interpreted computer commands</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In a variety of situations gestural communication is either preferable over verbal
communication or advantageous if used in a multimodal combination with voice. For example,
noisy environments might render voice recognition unreliable. People with speech
impairments might have difficulties communicating verbally. And some intentions are best
communicated multimodally, best illustrated in Bolt’s “Put That There” elaboration [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
In addition, people may be bedridden or elderly living alone at home, and may need
assistance in communicating and controlling various devices.
      </p>
      <p>Through behavior analysis of human hand movements over time, observed through
vision sensor data, communication commands can be interpreted and used for
human-computer interaction to enable smart environments. A great need for smart
environments exists among those needing assistance, such as the elderly at home or
people who are bedridden. Here, vision sensor data provides the means to become aware
of the surrounding environment (“ambient intelligence”), and hand gestures provide the
human-computer interaction that enables smart environments. Combining these two worlds
of “ambient intelligence” and “smart environments” assists those needing help at home;
in other words, “ambient assisted living (AAL)” is provided. In AAL, people can for
example be assisted in turning devices on and off, carrying out phone calls (e.g.
emergency calls), changing the television channel, etc. Fig. 1 shows the overall systems
view of AAL discussed in this paper. If these people needing assistance could communicate
commands to devices using their hands, it would remove the need for tools and remote
controls and allow for hands-free communication. If a remote control were required and,
for example, fell under the bed of a bedridden person, the person would still need some
way to communicate.</p>
      <p>This paper will demonstrate the high-level classification of hand gesture behaviors,
based on sequences of hand postures over time. The hand gesture behaviors are then
interpreted as various control commands for various computer devices. The levels of
analysis for AAL are shown in a pyramid process in Fig. 2.</p>
      <p>
        In this paper, we demonstrate the use of hand gestures as a robust input for AAL.
Based on an alphabet of atomic hand postures illustrated in Fig. 3, we compose a small
“gesture” vocabulary composed of posture sequences. The postures are observed with
ceiling-mounted cameras and recognized with the “HandVu” library. We then use a
robust hand gesture behavior classification method [
        <xref ref-type="bibr" rid="ref4 ref5">5,4</xref>
        ] to distinguish the gestures. In an
AAL system, these hand gesture behaviors are then interpreted as computer commands
to control various devices of interest. In an experiment, we demonstrate improved
recognition performance over individual hand postures and show that gestures provide
additional computer commands (versus being limited to the fixed set of postures),
thus making the system robust for application in AAL. An overview of the AAL systems
experiment focus can be seen in Fig. 4.
      </p>
      <p>After reviewing the related work in the following section, Sec. 3 introduces the
posture recognition library, Sec. 4 describes the robust hand gesture classification method
for syntactic analysis, and Sec. 5 presents the experimental design and results. The last
two sections cover future work and the summary.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Smart environments have long attracted people’s interest. In many ways, rooms have
become more aware of their inhabitants: consider motion-activated doors, lights that turn
off automatically if no one is present, and even the thermostat that “senses” the environment.
For active user interaction with the environment, there is the clap switch that turns
electricity to a favorite household appliance on and off at the clap of a hand. However,
networked systems that can continuously monitor and react to a person’s state are still
the dream of researchers.</p>
      <p>
        One of the earliest demonstrations of gestural and multimodal
human-computer interaction was Bolt’s 1980 “Put That There” article [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. An early user
interface implementation using temporal gestures was shown by Pausch and Williams [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]
in 1990, making use of a tracked data glove. Various researchers and now also
commercial products employ handheld devices instead of bare-hand gestures: The XWand [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ],
a stick the size of a remote control, enables natural interaction with consumer devices
through gestures and speech. One can point at, gesture toward, or speak commands to, for
example, one’s TV or stereo. Earlier, Kohtake et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] showed a similar wand-like device
that enabled data transfer between consumer appliances such as a digital camera, a
computer, and a printer by pointing the “InfoPoint” at it. Probably the most popular device
for gestural HCI as of late is Nintendo’s “Wii Remote” which makes use of sensors in its
game controller to estimate position and orientation.
      </p>
      <p>
        Computer vision as sensing technology has advantages over sensors embedded in
handheld devices due to its silent, unobtrusive operation that enables unencumbered
interaction. (Your hands are not likely to get lost between the sofa cushions.) There is a vast
body of research on hand detection, posture recognition, hand tracking, and
trajectory-based gesture analysis. Early work on utilizing computer vision to facilitate
human-computer interaction in relatively unconstrained environments includes work by Freeman
et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In their implementation, a variety of image observations including edges and
optical flow allow distinction of hand gestures and full-body motions. Ong and
Bowden [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] show methods for real-time hand detection and posture classification. 3D
pointing directions as discussed in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] can be determined with methods such as those described
by Nickel et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Recognition of a select vocabulary of the American Sign Language
using temporal hand motion aspects has been demonstrated by Starner and Pentland [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
Wachs et al.’s Gestix [
        <xref ref-type="bibr" rid="ref22 ref23">22,23</xref>
        ] combines many of these technologies to create a user
interface suitable for interaction in sterile surgical environments where keyboards cannot
be used. An analysis of and recommendations for using gesture recognition for user
interaction can be found in a book chapter by Turk [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
      </p>
      <p>
        Behavior classification has grown to be a popular area of research in computer
vision. There are several approaches to classifying high-level behaviors from
vision-based data. If behavioral classification is decomposed into two stages, (1) a
low-level event classifier based on features of raw video data, and (2) a high-level
behavioral classifier based on the sequences of events output by the first stage, then a
behavior classifier can be thought of as a classifier of sequences of symbols (events).
Sequential classification has been of interest in several applications such as genetic
algorithms, natural language processing, speech recognition, and compilers. For computer
vision applications, behavior classification has mostly been carried out
using state-space models such as Hidden Markov Models (HMMs). However, when certain
sequences of low-level data inherently fall into meaningful behaviors, Ivanov and Bobick
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] conclude that using syntactic (grammar-based) structural approaches can outperform
more statistically-based approaches such as HMMs. Ivanov and Bobick [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
use a stochastic context-free grammar for behavior modeling (hand gesture and behavior
modeling in car parking).
      </p>
      <p>
        Others have designed specific deterministic finite state machines for classifying
behaviors such as airborne surveillance scenarios [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], but these are not able to handle noisy
data. To fix this problem, augmented finite state machines can be used, so that noisy data
can still be parsed and accepted by each finite state machine representing behavior. In
[
        <xref ref-type="bibr" rid="ref4 ref5">4,5</xref>
        ], such a novel robust sequential classifier is developed and shown to classify
behaviors on noisy data in various applications, such as modeling behaviors in freeway traffic,
human behavioral patterns in a lab room, and signal processing patterns seen in
communications channels for distinguishing types of transmitted signals. An extended version
of this classifier [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] is used to classify hand gestures based on three hand postures,
assuming the hand postures were classified ahead of time and were heavily mislabeled. The
data was simulated to be heavily noisy to demonstrate the classifier’s robustness and
ability to correct errors from a poor low-level posture classifier. This paper further
improves this extended classifier (as the high-level gesture behavior classifier) as described
in Sec. 4 and runs it on real data (posture labels) output from the real-time posture
recognition classifier discussed in Sec. 3.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Hand Posture Recognition</title>
      <p>This section introduces HandVu, a library and program to recognize a set of six hand
postures in real-time in video streams. Its three main components are described in the
following subsections: hand detection, 2D hand tracking, and posture recognition. HandVu’s
vision processing methods typically require less than 100 milliseconds combined
processing time per frame. They are mostly robust to different environmental conditions such as
lighting changes, color temperature and lens distortion. HandVu performs quickly, accurately,
and robustly enough for a user interface’s quality and usability. HandVu’s output includes
which of the six postures was recognized or, if none was recognized, an “unknown
posture” identifier. This data was fed into the classification method described in Sec. 4.</p>
      <sec id="sec-3-1">
        <title>3.1. Hand Detection</title>
        <p>
          HandVu’s hand detection algorithm detects the hand in the closed posture based on
appearance and color. It uses a customized method based on the Viola-Jones detection
method [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] to find the hand in this posture and view-dependent configuration. This
posture/view combination is advantageous because it can be distinguished rather reliably
from background noise [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
        </p>
        <p>
          Upon detection of a hand area based on gray level texture, the area’s color is
compared against a user-independent histogram-based statistical model of skin color, built
from a large collection of hand-segmented pictures from many imaging sources (similar
to Jones and Rehg’s approach [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]). If the number of skin pixels falls below a threshold,
the detection is rejected. These two image cues combined reduce the number of false
detections to about a dozen per hour of video.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Hand Tracking</title>
        <p>
          Next, the hand’s motion is tracked in the video stream. To that end, the system learns the
observed hand color in a histogram, hence adjusting to user skin-color variation,
lighting differences, and camera color temperature settings. Hand tracking uses the “Flock
of Features” approach [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] which calculates the optical flow for small patches and
occasionally resorts to local color information as backup. This multicue integration of
graylevel texture with textureless color information increases the algorithm’s robustness,
permitting hand tracking despite vast and rapid appearance changes. It further alleviates
interdependency problems seen with staged-cue approaches, improves robustness, and
increases confidence in the tracking results.
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Posture Classification</title>
        <p>
          The algorithm’s last stage attempts to recognize various predefined postures. A posture in
our sense is a combination of a hand/finger configuration and a view direction, allowing
for the possibility to distinguish two different views of the same finger configuration such
as Lback and Lpalm. The focus of the recognition method is on reliability, not
expressiveness. That is, it distinguishes a few postures reliably and does not attempt less
consistent recognition of a larger number of postures. HandVu’s recognition method uses a
texture-based approach to fairly reliably classify image areas into seven classes, six
postures and “no known hand posture.” The confusion matrix of a video-based experiment
is shown in Fig. 7 and described in more detail in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
        <p>
          A two-stage hierarchy achieves both accuracy and good speed performance. In the
first step, a detector looks for any of the six hand postures without distinguishing between
them. This is faster than executing six separate detectors because different postures’
appearances share common features and can thus be eliminated in one classification. In the
second step, only those areas that passed the first step successfully are investigated
further. Each of the second-step detectors for the individual postures was trained on the
result of the combined classifier which had already eliminated 99.999879% of image areas
in a validation set [
          <xref ref-type="bibr" rid="ref12 ref13 ref14">14,12,13</xref>
          ].
        </p>
        <p>After a successful classification, the tracking stage is initialized again (the Flock of
Feature locations and the observed skin color model). HandVu is largely
user-independent and not negatively influenced by different cameras or lenses.</p>
        <p>Results from HandVu are sent as “Gesture Events” to the gesture classification
module in a unidirectional TCP/IP stream of ASCII messages in the following format:
1.2 timestamp obj_id: tracked, recognized,...</p>
        <p>The two identifiers “tracked” and “recognized” are boolean values indicating the
tracking and recognition state of HandVu.</p>
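        <p>For illustration, the following is a minimal sketch of a receiver for these “Gesture
Event” messages. It assumes one ASCII message per line over the unidirectional TCP stream;
the host, port, and any fields beyond “tracked” and “recognized” (such as a posture label)
are assumptions made for this sketch and are not part of the HandVu message specification.</p>
        <preformat># Minimal sketch (Python): read HandVu "Gesture Event" messages of the form
#   "1.2 timestamp obj_id: tracked, recognized,..."
# Host/port and the extra posture field are illustrative assumptions.
import socket

HOST, PORT = "localhost", 7045  # hypothetical address of the HandVu stream

def parse_event(line):
    """Split one ASCII message into version, timestamp, object id, and flags."""
    header, _, payload = line.partition(":")
    version, timestamp, obj_id = header.split()
    fields = [f.strip() for f in payload.split(",") if f.strip()]
    # The message carries the tracking/recognition state first; the exact
    # textual encoding of the booleans is assumed here.
    tracked = len(fields) > 0 and fields[0] not in ("0", "false")
    recognized = len(fields) > 1 and fields[1] not in ("0", "false")
    posture = fields[2] if len(fields) > 2 else None  # assumed extra field
    return {"version": version, "timestamp": float(timestamp), "obj_id": obj_id,
            "tracked": tracked, "recognized": recognized, "posture": posture}

def stream_events():
    """Yield parsed events from the unidirectional TCP/IP stream."""
    with socket.create_connection((HOST, PORT)) as sock:
        for line in sock.makefile("r", encoding="ascii"):
            yield parse_event(line.rstrip())</preformat>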
        <p>
          More detailed descriptions of HandVu’s architecture [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], robust hand
detection [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], and hand tracking [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] are available elsewhere.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Hand Gesture Behavior Recognition Classifier</title>
      <p>Sequences of hand postures, over time, compose a hand gesture behavior, which is then
interpreted as a computer command (as represented in the overall system described in
Sec. 1 and Fig. 4). This section describes the theory, implementation, and cost automation
method used within the hand gesture behavior recognition classifier.</p>
      <sec id="sec-4-1">
        <title>4.1. Hand Gesture Behavior Recognition Theory</title>
        <p>Sequences of events or features, over time and space, compose a behavior. The
events/features are the low-level classifications; in this paper, they are the detected hand
postures (as described in Sec. 3). The detected hand postures are concatenated in a
temporal sequence to form a behavior, which is a hand gesture. As sequences of hand
postures are detected, a method is needed to read these sequences and classify which hand
gesture the sequence is most similar to.</p>
        <p>
          In order to read and classify sequences, we use a syntactical grammar-based
approach [
          <xref ref-type="bibr" rid="ref4 ref5">5,4</xref>
          ]. Before classifying sequences, the various hand gesture behavior structures
need to be defined a priori. Each hand gesture behavior can be seen as an infinite set
of sequences of hand postures of similar temporal structure; this infinite set is known
as a language. Each hand gesture behavior will have different temporal structures of the
hand postures. These sequence structures are defined with syntax rules, specifically
regular grammars [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], where the alphabet consists of the various hand postures (also known as
symbols); the syntax rules then define ways these postures can be combined together to
form the temporal structures of interest. A set of syntax rules, defining a hand gesture
behavior, is also called a grammar. The grammar is implemented through a finite state
machine (FSM). In other words, the FSM reads the sequence of hand postures. If a
sequence of hand postures matches a certain hand gesture behavior, its corresponding FSM
will accept this sequence. Therefore, the sequence of hand postures is classified into the
hand gesture behavior whose corresponding FSM accepts the sequence. Since systems
are not one-hundred percent predictable and reliable, it is likely that a sequence will not
be accepted by any of the predefined hand gesture behaviors. This could be due to
errors in the low-level classification of the hand postures, or a user error by making the
hand gesture with an incorrect hand posture in the sequence of postures. Therefore, the
sequence of hand postures must be classified as the hand gesture behavior to which it is
most similar. In order to do so, a distance metric between a sequence of hand postures
and a hand gesture behavior is defined. To continue in defining the hand gesture behavior
recognition classification, preliminary definitions are given below.
        </p>
        <p>Let an alphabet Σ be the set of predefined hand postures. An example alphabet is
Σ = {a, b, c, d, e, f}, where each letter represents a detected hand posture. More details
on the hand posture symbols are given in Sec. 5. A hand gesture behavior is then a set of
syntax rules combining the elements of Σ.</p>
        <p>If an infinite set of sequences of postures is the following kth language:</p>
        <p>L(B_k) = {ab, aab, . . . , a · · · ab, . . .}</p>
        <p>then the hand gesture behavior (grammar) that generated this language is:</p>
        <p>B_k = { S → a Q_1,  Q_1 → a Q_1,  Q_1 → b F }</p>
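        <p>As a concrete illustration, the following is a minimal sketch of the FSM M_k that
implements the example grammar B_k above: it accepts exactly the sequences consisting of
one or more a postures followed by a single b posture. It is a sketch of this one example,
not the generalized implementation described later in this section.</p>
        <preformat># Minimal sketch (Python) of M_k for B_k = {S -> a Q_1, Q_1 -> a Q_1, Q_1 -> b F}.
def accepts_Bk(sequence):
    """Return True if the posture-symbol string is in L(B_k), i.e. one or more
    'a' symbols followed by a single 'b'."""
    state = "S"
    for symbol in sequence:
        if state == "S" and symbol == "a":
            state = "Q1"      # S -> a Q_1
        elif state == "Q1" and symbol == "a":
            state = "Q1"      # Q_1 -> a Q_1
        elif state == "Q1" and symbol == "b":
            state = "F"       # Q_1 -> b F
        else:
            return False      # no production applies: reject
    return state == "F"

# accepts_Bk("ab") -> True, accepts_Bk("aaab") -> True, accepts_Bk("ba") -> False</preformat>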
        <p>Let a hand gesture, a temporal sequence of detected hand postures, be denoted by
s = s_1 s_2 . . . s_n, where each s_j is a hand posture from Σ. If s matches one of the sequences
in L(B_k), it is the kth hand gesture behavior, and its corresponding FSM M_k will accept
this sequence. Let there be K predefined hand gesture behaviors B_1, B_2, . . . , B_K, with K
corresponding finite state machines M_1, M_2, . . . , M_K that implement each hand gesture
behavior. The sequence s will be classified into the hand gesture behavior whose
corresponding FSM accepts it. If s is not accepted by any M_l, sequence s will then be
classified as the hand gesture behavior to which it is most similar. Therefore, as M_l is
parsing sequence s, it will edit the sequence so that M_l accepts it, but with a cost
per edit and thus a total cost. Therefore, the distance between a sequence s and a hand
gesture behavior B_l, denoted by d(s, B_l), is determined by the cost-weighted number of
edits required to transform s into a sequence in L(B_l). The possible posture symbol edits are
substitution and deletion, where each edit has an a priori cost assigned. As s is being
parsed, M_l will carry out the minimum number of edits required to transform s into a
sequence in L(B_l). In order to allow edits with an associated cost in a hand gesture behavior,
the original set of syntax rules per behavior and corresponding FSM must be augmented.
Let the augmented kth hand gesture behavior and corresponding FSM be denoted by B_k′
and M_k′. With the example B_k, let the augmented set of syntax rules be</p>
        <p>B_k′ = {  S → a Q_1, 0;   Q_1 → a Q_1, 0;   Q_1 → b F, 0;
   S → b Q_1, C_S(b,a);   Q_1 → b Q_1, C_S(b,a);   Q_1 → a F, C_S(a,b);
   S → ε Q_1, C_D(b);   Q_1 → ε Q_1, C_D(b);   Q_1 → ε F, C_D(a)  }</p>
        <p>Let S(a, b) denote substituting the true posture b for the mislabeled posture a, with the
associated cost C_S(b,a), and let D(a) denote deleting a mislabeled posture a with a cost
C_D(a). The corresponding modified FSM M_k′ is shown in Fig. 5.</p>
        <p>In order to calculate the distance between a sequence of hand postures and a hand
gesture behavior, let the possible hand postures be Σ = {r_1, r_2, · · · , r_N}, where N is the
total number of hand postures. With this, the distance between a sequence of hand postures
s and a hand gesture behavior B_l is given by</p>
        <p>d(s, B_l) = ∑_{i=1}^{|Σ|} ∑_{j=1}^{|Σ|} C_S(r_i, r_j) · n_S(r_i, r_j)  +  ∑_{i=1}^{|Σ|} C_D(r_i) · n_D(r_i)</p>
        <p>where n_S(r_i, r_j) is the number of substitutions of true hand posture r_j for mislabeled
hand posture r_i, and n_D(r_i) is the number of deletions of the mislabeled hand posture r_i.</p>
        <p>With a distance metric between a sequence of hand postures and an a priori
defined hand gesture behavior, the classification definition can be elaborated upon.
Assuming K hand gesture behaviors, each behavior and its associated FSM are augmented a
priori so that any sequence of hand postures is accepted by each hand gesture
behavior, but with a total cost. The augmented hand gesture behaviors are then denoted by
B_1′, B_2′, . . . , B_K′, and their K corresponding augmented finite state machines are denoted
by M_1′, M_2′, . . . , M_K′. An unknown sequence s of hand postures is then parsed by each
M_l′, with a cost d(s, B_l′). The sequence s is then classified as the hand gesture B_g, where
B_g = arg min_{l=1,...,K} d(s, B_l′), the hand gesture behavior to which it is most
similar. Therefore, sequences of hand postures are classified based upon Maximum
Similarity Classification (MSC) as seen in Fig. 6.</p>
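        <p>The following sketch illustrates the Maximum Similarity Classification step. Rather
than the full augmented-FSM parser, it approximates d(s, B_l′) with a dynamic-programming
edit distance that allows only the two edit operations used above (substitution and deletion
of observed postures), computed against a representative template sequence per behavior.
The templates and costs below are placeholders for illustration, not the actual behaviors
or cost values of this paper.</p>
        <preformat># Minimal sketch (Python): Maximum Similarity Classification via a cost-weighted
# edit distance with substitutions and deletions only. The behavior "templates"
# stand in for the grammars/FSMs and are placeholders.
import math

def edit_distance(observed, template, sub_cost, del_cost):
    """Minimum cost to turn `observed` into `template` by substituting or
    deleting observed symbols, mirroring d(s, B_l) in the text."""
    n, m = len(observed), len(template)
    dp = [[math.inf] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(n):
        for j in range(m + 1):
            if math.isinf(dp[i][j]):
                continue
            # delete observed[i]
            dp[i + 1][j] = min(dp[i + 1][j], dp[i][j] + del_cost[observed[i]])
            # keep a match, or substitute observed[i] by template[j]
            if j != m:
                step = 0.0 if observed[i] == template[j] else sub_cost[(observed[i], template[j])]
                dp[i + 1][j + 1] = min(dp[i + 1][j + 1], dp[i][j] + step)
    return dp[n][m]

def classify_msc(observed, templates, sub_cost, del_cost):
    """Pick the behavior whose template is reached at minimum total edit cost."""
    return min(templates, key=lambda b: edit_distance(observed, templates[b],
                                                      sub_cost, del_cost))</preformat>
        <p>With hypothetical templates such as {"B1": "ffff", "B2": "abab"}, a heavily
mislabeled observation is still assigned to the closest behavior, which is the effect the
augmented FSMs achieve.</p>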
        <p>The implementation of the hand gesture behavior classifier is generalized so that the
overall classification structure stays the same while operating on any number of hand
gesture behaviors, any number of syntax rules per hand gesture behavior, any number of
sequences of hand postures to classify, and any number of hand postures per sequence to
classify as a hand gesture. The hand gesture classifier is thus easily scalable to additional
hand postures and hand gesture behaviors.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Hand Gesture Recognition Cost Automation</title>
        <p>In low-level classification of hand postures, hand postures can be mislabeled at times.
The probability of misclassifying certain hand postures as other hand postures is known
a priori. For example, the probability of mislabeling the true hand posture b as hand
posture a is known a priori. This knowledge is used in classifying sequences of hand
postures; in other words, these probabilities are used in automating the costs per syntax
rule in a hand gesture behavior. The probabilities of mislabeling certain hand postures as
other hand postures are extracted from the confusion matrix, also known as the recognition
summary, in Sec. 3.</p>
        <p>The probability that the low-level hand posture recognition classifier mislabels
hand postures is calculated from the confusion matrix for the posture recognition
classifier, as seen in Table 1. Let N be the total number of hand postures classified by the
low-level hand posture recognition classifier (also the size |Σ|). Then</p>
        <p>P(labeled posture = a | true posture = b)
= P(labeled posture = a, true posture = b) / P(true posture = b)
= (Count(b, a)/N) / ((Count(b, a) + Count(b, b) + Count(b, c))/N)
= Count(b, a) / (Count(b, a) + Count(b, b) + Count(b, c))</p>
        <p>The conditional probability estimates for all possible mislabeled hand postures can
now be computed. The substitution costs are defined from the inverse of the conditional
probability. This can be seen in an example: let the cost for substituting the mislabeled
posture a with the true posture b be derived from the inverse of the probability that the
low-level hand posture classifier mislabeled posture b as posture a. Denoting the conditional
probability P(labeled posture = a | true posture = b) as P(a|b), the cost for
substituting a mislabeled posture a with the true posture b is:</p>
        <p>C_S(a,b) = 10 log10( 1 / P(a|b) )</p>
        <p>Thinking about this intuitively, if there is a high probability that the low-level hand
posture recognition classifier mislabels the true posture b as posture a, i.e. P(a|b)
is high, then the cost for substituting posture a with b, C_S(a,b), is low.</p>
        <p>Additionally, if this probability P(a|b) is nonzero, then P(b|b) is less than one:
P(b|b) = 1 − P(a|b) − P(c|b).</p>
        <p>In order to get an understanding of the range of potential costs per edit, let x
be an entry in the confusion matrix (a probability such as P(a|b)); the cost is then
normalized by taking 10 log10(1/x). In addition, to avoid infinite costs when a
probability is zero, ε is added to zero probabilities, 0 + ε, where ε = 2 × 10^−16. Table 2
lists the range of costs obtained from the probabilities of misclassifying hand postures.</p>
        <p>For this data set, we set the cost for deleting a posture instance, for example C_D(b),
to 20 log10(1/(0 + ε)). This constrains the hand gesture classifier to choose substitutions
as the minimum-cost edits, while leaving the infrastructure in place to scale deletions to
other applications in future work. Deletion could be used for cases where an edit is not the
result of a low-level classification error, but the user made the hand gesture with the
wrong hand posture by accident.</p>
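        <p>For illustration, the following is a minimal sketch of this cost automation step. It
computes substitution costs as 10 log10(1/(P(a|b) + ε)) and the deletion cost as
20 log10(1/ε); the confusion counts shown are placeholders, not the values of Table 1.</p>
        <preformat># Minimal sketch (Python): automate edit costs from a posture confusion matrix.
import math

EPS = 2e-16  # added to zero probabilities to avoid infinite costs, per the text

def substitution_costs(confusion):
    """confusion[true][label] holds the count of `true` postures labeled `label`.
    Returns C_S(label, true) = 10*log10(1 / (P(label | true) + EPS))."""
    costs = {}
    for true_label, row in confusion.items():
        total = sum(row.values())
        for observed_label, count in row.items():
            p = count / total if total else 0.0
            costs[(observed_label, true_label)] = 10.0 * math.log10(1.0 / (p + EPS))
    return costs

# Deletion cost as set for this data set: 20*log10(1/(0 + eps)), high enough
# that the classifier prefers substitutions.
DELETION_COST = 20.0 * math.log10(1.0 / EPS)

# Placeholder confusion counts for a three-posture example (a, b, c):
confusion = {"a": {"a": 90, "b": 5, "c": 5},
             "b": {"a": 12, "b": 80, "c": 8},
             "c": {"a": 3, "b": 7, "c": 90}}
sub_cost = substitution_costs(confusion)
# A frequent confusion (true b labeled a) yields a comparatively low C_S(a, b).</preformat>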
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Design and Results</title>
      <p>This section will demonstrate hand gesture behaviors enhancing low-level classifications
of hand postures and scaling the possible number of computer commands available for
AAL. Since low-level hand posture classifications can have errors, sequencing the hand
postures together over time can enhance the low-level classifications for ambient assisted
living environments. In addition, if the computer commands are limited to the fixed set of
hand postures, the number of possible computer commands equals the number of hand
postures; sequencing various hand postures together over time makes additional computer
commands possible. The experimental results of this paper therefore show that sequencing
hand postures together enhances low-level classifications and allows for more possible
computer commands. For each classification, the hand gesture behavior
classification performance accuracy results will also be shown. The overall experiment
design focus can be seen in Fig. 4.</p>
      <sec id="sec-5-1">
        <title>5.1. Experimental Design</title>
        <p>This section describes the experimental design.</p>
        <p>As discussed in Sec. 3, the possible hand postures are Closed, Open, Lback,
Lpalm, Victory, and Sidepoint. In order to define the various hand gesture behaviors, a
symbol is assigned to each hand posture as seen in Table 3, with the alphabet of hand
postures Σ = {a, b, c, d, e, f}.</p>
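        <p>A small sketch of this symbol assignment and of how a detected posture sequence is
mapped to the symbol string parsed by the gesture classifier is shown below. Lpalm = d and
Sidepoint = f are stated in Sec. 5.2; the remaining letter assignments follow the order in
which the postures are listed and are an assumption about Table 3.</p>
        <preformat># Minimal sketch (Python): posture alphabet and symbol-string conversion.
# Lpalm = d and Sidepoint = f are stated in the text; the rest follow the
# listed order and are an assumption about Table 3.
POSTURE_SYMBOL = {"Closed": "a", "Open": "b", "Lback": "c",
                  "Lpalm": "d", "Victory": "e", "Sidepoint": "f"}

def to_symbol_string(posture_labels):
    """Map a temporal sequence of detected posture labels to the symbol string
    read by the hand gesture behavior classifier. Unknown-posture outputs are
    skipped here, which is an assumption, not stated in the text."""
    return "".join(POSTURE_SYMBOL[p] for p in posture_labels if p in POSTURE_SYMBOL)

# e.g. a Data Set 1 style gesture (the same posture repeated over time):
print(to_symbol_string(["Sidepoint", "Sidepoint", "Lpalm", "Sidepoint"]))  # -> "ffdf"</preformat>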
        <p>Various data sets were created, where each data set has six hand gesture behaviors, in
order to compare six hand gesture behaviors with six individual hand postures, as seen in
Table 4. Data Set 1 sequences the same hand posture together, and therefore classifying
this hand gesture is similar to a weighted average over the detected hand posture over
time (e.g. a low-pass filter to remove and smooth out the noise of posture mislabeling).
In addition, combining various hand postures extends the number of potential computer
commands usable in an AAL environment. Therefore, various combinations of the hand
postures were defined per hand gesture behavior: Data Set 2 uses the most reliable pairs
of hand postures, Data Set 3 uses the least reliable pairs of hand postures, and Data Set
4 uses a combination of the most reliable and least reliable hand posture labels.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Experimental Results</title>
        <p>This section shows the results of hand gesture behavior classification, based upon
sequences of lower-level classifications of hand postures.</p>
        <p>The low-level hand posture detection is shown in Fig. 7. In this figure, the probability
of classifying a certain hand posture, given the true hand posture, is shown. The x-axis
is the true posture label, and the y-axis is the probability of classifying an observed label
(shown in the various colors, represented in the legend) given the true label. The six hand
postures from Table 3 are shown. Note the inherently high probability of the low-level posture
classifier mislabeling the true Sidepoint, f, hand posture as the Lpalm, d, hand posture.</p>
        <p>The hand gesture behavior detection is shown in Fig. 8. In this figure, the probability
of classifying a certain hand gesture behavior, given the true hand gesture behavior, is
shown. The x-axis is the true hand gesture behavior label, and the y-axis is the probability
of classifying an observed label (shown in the various colors, represented in the legend)
given the true label. The six hand gesture behaviors per data set, from Table 4, are shown.
The robustness of the high-level hand gesture classifier is portrayed in these results. The
first data set, Data Set 1, whose gestures are made up of single postures, was designed to
show the error-correcting capability of the high-level behavior classifier. Notice the 100%
accuracy in classifying the sequences of Sidepoint postures, even though the underlying raw
posture symbols were mostly mislabeled. In addition, the other data sets show various
hand gesture behavior classifications, enhancing the low-level hand posture recognitions
and increasing the vocabulary size, thereby increasing the number of computer
commands that can be interpreted for AAL.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Future Work</title>
      <p>Future work falls into two areas: improved algorithm development for usability, and
scaling to additional applications. For improved algorithm development, we plan to
incorporate usability factors into the costs (fusing these factors with the costs from the
low-level classifier). For example, “how often does a user misuse a hand posture in a hand
gesture?” should additionally be incorporated into the hand gesture behavior costs. In
addition, this work can scale to various applications, such as increased-awareness,
environment-enabled AAL, where a house would have a network of cameras throughout that can
interpret hand gestures as commands. Another potential application is enabling smart
environments and communications using a surveillance system infrastructure. In addition to
surveillance, people in certain roles (e.g. airport security) could communicate signals to a
central control station, reporting information such as unusual behavior they observed or the
status of certain suspicious persons. In this case, the start and end of a hand gesture would
be defined, e.g. specific hand postures would be defined to initiate and terminate a hand
gesture communication. In addition, the structure of hand gesture behavior classification
can be scaled to classifying sequences of body postures for human gesture behavior
classification in surveillance applications.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Summary</title>
      <p>We built a vision-based user interface to support intentional interaction with a smart
assisted-living environment. Our experiments show that the use of temporal hand
postures, creating hand gesture behaviors, improves the recognition rates and increases the
available vocabulary, two very important considerations for a user interface. The
low-level classifications of hand postures are therefore enhanced through hand gesture
behaviors for more robust human-computer interaction; in addition, the increased
vocabulary enlarges the number of computer commands that can be interpreted from hand
gesture behaviors. Overall, ambient assisted living
is enabled through computer command interpretations from hand gesture behaviors, as a
result of low-level posture recognitions over time. The vision-based intentional interface
is robust in that it can be scaled to a range of hand posture types and hand gesture
behavior types, and therefore additional applications as discussed in Sec. 6 can be carried
out.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Bolt</surname>
          </string-name>
          .
          <article-title>Put-That-There: Voice and Gesture in the Graphics Interface</article-title>
          .
          <source>Computer Graphics</source>
          , ACM SIGGRAPH,
          <volume>14</volume>
          (
          <issue>3</issue>
          ):
          <fpage>262</fpage>
          -
          <lpage>270</lpage>
          ,
          <year>1980</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>Bremond</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Medioni</surname>
          </string-name>
          .
          <article-title>Scenario recognition in airborne video imagery</article-title>
          .
          <source>In DARPA98</source>
          , pages
          <fpage>211</fpage>
          -
          <lpage>216</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>W. T.</given-names>
            <surname>Freeman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. B.</given-names>
            <surname>Anderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Beardsley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. N.</given-names>
            <surname>Dodge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Roth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Weissman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W. S.</given-names>
            <surname>Yerazunis</surname>
          </string-name>
          .
          <article-title>Computer Vision for Interactive Computer Graphics</article-title>
          .
          <source>IEEE Computer Graphics and Applications</source>
          , pages
          <fpage>42</fpage>
          -
          <lpage>53</lpage>
          , May-June
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Goshorn</surname>
          </string-name>
          .
          <article-title>Sequential Behavior Classification Using Augmented Grammars</article-title>
          .
          <source>Master's thesis</source>
          , University of California, San Diego,
          <year>June 2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Goshorn</surname>
          </string-name>
          .
          <article-title>Syntactical Classification of Extracted Sequential Spectral Features Adapted to Priming Selected Interference Cancelers</article-title>
          .
          <source>PhD thesis</source>
          , University of California, San Diego,
          <year>June 2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Goshorn</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Goshorn</surname>
          </string-name>
          .
          <article-title>Vision-based syntactical classification of hand gestures to enable robust human computer interaction</article-title>
          .
          <source>In 3rd Workshop on AI Techniques for Ambient Intelligence, co-located with European Conference on Ambient Intelligence (ECAI08)</source>
          ., pages
          <fpage>211</fpage>
          -
          <lpage>216</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ivanov</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Bobick</surname>
          </string-name>
          .
          <article-title>Probabilistic parsing in action recognition</article-title>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y. A.</given-names>
            <surname>Ivanov</surname>
          </string-name>
          and
          <string-name>
            <given-names>A. F.</given-names>
            <surname>Bobick</surname>
          </string-name>
          .
          <article-title>Recognition of visual activities and interactions by stochastic parsing</article-title>
          .
          <source>IEEE Trans. Pattern Anal. Mach</source>
          . Intell.,
          <volume>22</volume>
          (
          <issue>8</issue>
          ):
          <fpage>852</fpage>
          -
          <lpage>872</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Jones</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Rehg</surname>
          </string-name>
          .
          <article-title>Statistical Color Models with Application to Skin Detection</article-title>
          .
          <source>Int. Journal of Computer Vision</source>
          ,
          <volume>46</volume>
          (
          <issue>1</issue>
          ):
          <fpage>81</fpage>
          -
          <lpage>96</lpage>
          ,
          <year>Jan 2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>N.</given-names>
            <surname>Kohtake</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rekimoto</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Anzai</surname>
          </string-name>
          .
          <article-title>InfoPoint: A Device that Provides a Uniform User Interface to Allow Appliances to Work Together over a Network</article-title>
          .
          <source>Personal and Ubiquitous Computing</source>
          ,
          <volume>5</volume>
          (
          <issue>4</issue>
          ):
          <fpage>264</fpage>
          -
          <lpage>274</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kölsch</surname>
          </string-name>
          .
          <article-title>Vision Based Hand Gesture Interfaces for Wearable Computing and Virtual Environments</article-title>
          .
          <source>PhD thesis</source>
          , Computer Science Department, University of California, Santa Barbara,
          <year>September 2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kölsch</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Turk</surname>
          </string-name>
          .
          <article-title>Robust Hand Detection</article-title>
          .
          <source>In Proc. IEEE Intl. Conference on Automatic Face and Gesture Recognition</source>
          , May
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kölsch</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Turk</surname>
          </string-name>
          .
          <article-title>Hand Tracking with Flocks of Features</article-title>
          .
          <source>In Video Proc. CVPR IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kölsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Turk</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Höllerer</surname>
          </string-name>
          .
          <article-title>Vision-Based Interfaces for Mobility</article-title>
          .
          <source>In Intl. Conference on Mobile and Ubiquitous Systems (MobiQuitous)</source>
          ,
          <year>August 2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>K.</given-names>
            <surname>Nickel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Seemann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Stiefelhagen</surname>
          </string-name>
          .
          <article-title>3D-tracking of Head and Hands for Pointing Gesture Recognition in a Human-Robot Interaction Scenario</article-title>
          .
          <source>In Proc. IEEE Intl. Conference on Automatic Face and Gesture Recognition</source>
          , May
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Ong</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Bowden</surname>
          </string-name>
          .
          <article-title>A Boosted Classifier Tree for Hand Shape Detection</article-title>
          .
          <source>In Proc. IEEE Intl. Conference on Automatic Face and Gesture Recognition</source>
          , pages
          <fpage>889</fpage>
          -
          <lpage>894</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>R.</given-names>
            <surname>Pausch</surname>
          </string-name>
          and
          <string-name>
            <given-names>R. D.</given-names>
            <surname>Williams</surname>
          </string-name>
          .
          <article-title>Tailor: creating custom user interfaces based on gesture</article-title>
          .
          <source>In Proceedings of the the third annual ACM SIGGRAPH symposium on User interface software and technology</source>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sipser</surname>
          </string-name>
          .
          <source>Theory of Computation</source>
          . PWS Publishing Company, Massachusetts,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>T. E.</given-names>
            <surname>Starner</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Pentland</surname>
          </string-name>
          .
          <article-title>Visual Recognition of American Sign Language Using Hidden Markov Models</article-title>
          .
          <source>In AFGR, Zurich</source>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>M.</given-names>
            <surname>Turk</surname>
          </string-name>
          .
          <article-title>Gesture recognition</article-title>
          . In K. Stanney, editor,
          <source>Handbook of Virtual Environments: Design</source>
          ,
          <article-title>Implementation and Applications</article-title>
          . Lawrence Erlbaum Associates Inc.,
          <year>December 2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>P.</given-names>
            <surname>Viola</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Jones</surname>
          </string-name>
          .
          <article-title>Robust Real-time Object Detection</article-title>
          .
          <source>Int. Journal of Computer Vision</source>
          , May
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wachs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Stern</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Edan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gillam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Feied</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Smith</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Handler</surname>
          </string-name>
          .
          <article-title>Gestix: a doctor-computer sterile gesture interface for dynamic environments</article-title>
          .
          <source>In Soft Computing in Industrial Applications: Recent and Emerging Methods and Techniques</source>
          , pages
          <fpage>30</fpage>
          -
          <lpage>39</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Wachs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Stern</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Edan</surname>
          </string-name>
          .
          <article-title>Cluster labeling and parameter estimation for the automated setup of a hand-gesture recognition system</article-title>
          .
          <source>IEEE Transactions on Systems, Man, and Cybernetics</source>
          ,
          <volume>35</volume>
          (
          <issue>6</issue>
          ):
          <fpage>932</fpage>
          -
          <lpage>944</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>A.</given-names>
            <surname>Wilson</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Shafer</surname>
          </string-name>
          .
          <article-title>XWand: UI for Intelligent Spaces</article-title>
          .
          <source>In ACM CHI</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>