<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Simultaneous segmentation and recognition of gestures for human-machine interaction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Harold Vasquez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>L. Enrique Sucar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hugo Jair Escalante</string-name>
          <email>hugojairg@inaoep.mx</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computational Sciences, Instituto Nacional de Astrofísica, Óptica y Electrónica</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Human-activity and gesture recognition are two problems lying at the core of human-centric and ubiquitous systems: knowing what activities/gestures users are performing allows systems to execute actions accordingly. State-of-the-art technology from computer vision and machine intelligence allows us to recognize gestures at acceptable rates when gestures are segmented (i.e., each video contains a single gesture). In ubiquitous environments, however, continuous video is available and thus systems must be capable of detecting when a gesture is being performed and recognizing it. This paper describes a new method for the simultaneous segmentation and recognition of gestures from continuous videos. A multi-window approach is proposed in which the predictions of several recognition models are combined, where each model is evaluated on a different segment of the continuous video. The proposed method is evaluated on the problem of recognizing gestures to command a robot. Preliminary results show the proposed method is very effective at recognizing the considered gestures when they are correctly segmented, although there is still room for improvement in terms of its segmentation capabilities. The proposed method is highly efficient and does not require learning a model for no-gesture, as opposed to related methods.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Human-computer interaction technology plays a key role in
ubiquitous data mining (i.e., the extraction of interesting
patterns from data generated in human-centric environments),
see [Eunju, 2010]. From all of the alternative forms of
interaction, gestures are among the most natural and
intuitive for users. In fact, gestures are widely used to
complement verbal communication between humans. Research
advances in computer vision and machine learning have led
to the development of gesture recognition technology that
is able to recognize gestures at very acceptable rates
[Aggarwal and Ryoo, 2011; Mitra, 2007]. However, most of
the available methods for gesture recognition require
gestures to be segmented before the recognition process
begins [Aviles et al., 2011]. Clearly, such methods are
not well suited for ubiquitous systems (and real
applications in general), where the recognition of gestures must
be done from a continuous video in real time [Eunju, 2010;
Huynh et al., 2008].</p>
      <p>This paper introduces a new approach for the
simultaneous segmentation and recognition of gestures in continuous
video. The proposed method implements a voting strategy
using the predictions obtained from multiple gesture models
evaluated at different time-windows, see Figure 1. Windows
are dynamically created by incrementally scanning the
continuous video. When the votes from the multiple models favor
a particular gesture, we segment the video and make a
prediction: we predict the gesture corresponding to the model that
obtained the majority of votes across windows.</p>
      <p>We use as features the body-part positions obtained by
a Kinect™ sensor. As predictive models we use Hidden
Markov Models (HMMs), among the most widely used models for gesture
recognition [Aviles et al., 2011; Aggarwal and Ryoo, 2011;
Mitra, 2007]. The proposed method is evaluated on the
problem of recognizing gestures to command a robot.
Preliminary results show the proposed method is very effective at
recognizing the considered gestures when they are correctly
segmented. However, there is still room for improvement in
terms of its segmentation capabilities. The proposed method
is highly efficient and does not require learning a model for
no-gesture, as opposed to related methods.</p>
      <p>The rest of this paper is organized as follows. The next
section briefly reviews related works on gesture spotting.
Section 3 describes the proposed approach. Section 4 reports
experimental results that show evidence of the performance
of the proposed technique. Section 5 outlines preliminary
conclusions and discusses future work directions.</p>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>Several methods for the simultaneous segmentation and
recognition of gestures (a task also known as gesture
spotting) have been proposed so far [Derpanis et al., 2010;
Yuan et al., 2009; Malgireddy et al., 2012; Kim et al., 2007;
Yang et al., 2007]. Some methods work directly with
spatiotemporal patterns extracted from video [Derpanis et al., 2010;
Yuan et al., 2009]. Although effective, these methods
are very sensitive to changes in illumination, scale,
appearance, and viewpoint.</p>
      <p>On the other hand, there are model-based techniques that
use the position of body-parts to train probabilistic models
(e.g., HMMs) [Aggarwal and Ryoo, 2011; Mitra, 2007]. In
the past, these types of methods were limited by the
need for specialized sensors to obtain body-part positions.
Nowadays, the availability of the Kinect™ (which can extract
skeleton information in real time) has partially circumvented
this limitation [Webb and Ashley, 2012].</p>
      <p>
        Besides the data acquisition process, some of these
methods require the construction of a no-gesture model
        <xref ref-type="bibr" rid="ref6">(e.g., [Kim et al., 2007])</xref>
        or transition-gesture model
        <xref ref-type="bibr" rid="ref11">(e.g., [Yang et al., 2007])</xref>
        . The goal of such models is to determine, within a
video, when the user (if any) is not performing a gesture,
or to identify the transition between different gestures. Building a model
for no-gesture is a complicated and subjective task that
depends on the particular application in which the gesture
recognition system is to be deployed [Kim et al., 2007]. In
ubiquitous systems, however, we want gesture recognition
methods to work under very general conditions and in highly
dynamic environments; a model for no-gesture is much
more complicated to build under these conditions.
      </p>
      <p>Finally, it is worth mentioning that many of the available
techniques for gesture spotting can be very complex to
implement. This is a particularly important aspect for
some domains, for example mobile devices and
human-robot interaction, where resources are limited
and the programming tools available for implementing
algorithms are restricted. Thus, in these domains simplicity is
sometimes preferred at the expense of losing a little precision.</p>
      <p>The method we propose in this paper performs
segmentation and recognition of gestures simultaneously and attempts
to address the limitations of most of the available techniques.
Specifically, our proposal is efficient and very simple to
implement; it is robust, to some extent, to the problems that affect
appearance-based methods; and, most importantly, it does not
require the specification of a no-gesture model.</p>
    </sec>
    <sec id="sec-3">
      <title>Multiple-windows approach</title>
      <p>We face the problem of simultaneously segmenting and
recognizing gestures in continuous video. That is, given a
sequence of images (a video), we want to determine where a
gesture is being performed (independently of the type of gesture)
and then to recognize which gesture it is. We propose a solution
based on multiple windows that are incrementally and
dynamically created. Each window is passed to a set of predictive
models, each trained to recognize a particular gesture. The
predictions of the models for the different windows are
accumulated; when the model for a particular gesture obtains a
majority of votes, we segment the video and make a prediction,
cf. Figure 1.</p>
      <p>The underlying hypothesis of our work is that when a
window covers a large portion of a particular gesture, the
confidence in the prediction of the correct model will be high
and those of the other models will be low. Accumulating
predictions allows us to be more confident that the gesture is
being performed within a neighborhood of temporal windows.
(Although we use processed body-part positions as features,
we refer to the sequence of these features as video in order
to simplify the explanation.)</p>
      <p>The rest of this section describes the proposed technique
in detail. First we describe the considered features, next the
predictive models, and finally the approach to simultaneous
segmentation and recognition of gestures.</p>
      <p>We use the information obtained through a Kinect™ as
input for our gesture spotting method. The Kinect™ is
capable of capturing RGB and depth video, as well as the
positions of certain body-parts, at rates of up to 30 frames-per-second
(fps). In this work we considered gestures to command a
robot that are performed with the hands; therefore, we used
the positions of the hands as given by the Kinect™ as features.
For each frame we obtain a sextuple indicating the
position of both hands in the x, y, and z coordinates. Since we
consider standard hidden Markov models (HMMs) for
classification, we had to preprocess the continuous data provided
by the sensor. Our preprocessing consists of
estimating tendencies: we compute the difference between the
positions in consecutive frames and codify each coordinate into two
values: 1 when the difference is positive and 0 when the
difference is zero or negative. Thus, the observations are
sextuples of zeros and ones (the number of distinct observations
is 2<sup>6</sup> = 64). These are the inputs for the HMMs.</p>
      <p>As classification model we consider an HMM (we used the
HMM implementation from Matlab®'s Statistics Toolbox), one of the
most popular models for gesture recognition [Aviles et al.,
2011; Aggarwal and Ryoo, 2011; Mitra, 2007]. For each
gesture i to be recognized we trained a separate HMM.</p>
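      <p>As an illustration, the tendency-coding step described above can be sketched as follows (a minimal sketch; the function name and the frame layout, six coordinates ordered as x, y, z per hand, are our own assumptions, not the original implementation):</p>

```python
def encode_tendency(prev_frame, curr_frame):
    """Map the change between two consecutive frames to one of 2**6 = 64
    discrete observation symbols for the HMMs.

    Each frame is a sextuple (x, y, z of one hand, then of the other).
    A coordinate contributes a 1 if it increased with respect to the
    previous frame, and a 0 if it stayed equal or decreased.
    """
    bits = [1 if c - p > 0 else 0 for p, c in zip(prev_frame, curr_frame)]
    symbol = 0
    for b in bits:                     # interpret the six bits as an integer
        symbol = symbol * 2 + b
    return symbol

# toy example: one hand moves up and to the right, the other is still
prev = [0.10, 0.50, 1.20, -0.10, 0.48, 1.22]
curr = [0.12, 0.55, 1.20, -0.10, 0.48, 1.22]
# bits are [1, 1, 0, 0, 0, 0], i.e. symbol 48
```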
      <p>HMM for the ith gesture, where i = f1; : : : Kg when
considering K different gestures. The models are trained with
the Baum-Welch algorithm using complete sequences
depicting (only) the gestures of interest. Each HMM was trained
for a maximum of 200 iterations and a tolerance of 0:00001
(the training process stops when changes between
probabilities of successive transition/emission matrices do not exceed
this value); the number of states in the HMM was fixed to 3,
after some preliminary experimentation.</p>
      <p>To make predictions we evaluate the different HMMs
over the test sequence using the Forward algorithm; see
[Rabiner, 1990] for details. We use the probability returned by
each HMM as its confidence on the gesture class for a
particular window.</p>
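      <p>For concreteness, this window-scoring step can be sketched with a plain implementation of the Forward algorithm for a discrete-output HMM (a sketch only; variable names and the unscaled formulation are our own, and a practical implementation would work in log-space or with scaling to avoid underflow on long windows):</p>

```python
def forward_prob(A, B, pi, obs):
    """P(obs | HMM) via the Forward algorithm for a discrete-output HMM.

    A[i][j] : probability of moving from state i to state j
    B[i][k] : probability of emitting symbol k in state i
    pi[i]   : initial probability of state i
    obs     : list of observation-symbol indices
    """
    n = len(pi)
    # initialization: alpha_1(i) = pi_i * b_i(o_1)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    # induction: alpha_t(j) = b_j(o_t) * sum over i of alpha_(t-1)(i) * a_ij
    for t in range(1, len(obs)):
        alpha = [B[j][obs[t]] * sum(alpha[i] * A[i][j] for i in range(n))
                 for j in range(n)]
    # termination: P(obs | model) = sum over i of alpha_T(i)
    return sum(alpha)
```

      <p>Evaluating such a function for each gesture model on a window yields the per-window confidences used by the voting scheme described next.</p>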
      <p><bold>Simultaneous segmentation and recognition.</bold>
The multi-windows approach to gesture segmentation and
recognition is as follows (see Figure 1). To process a
continuous video we trigger windows incrementally: at time t0
a temporal window W0 of length ℓ is triggered and all of
the (trained) HMMs are evaluated on this window. At time
t1 we trigger another window W1 of length ℓ and increase
window W0 by ℓ frames; the HMMs are evaluated on these
two windows too. This process is repeated until a certain
condition is met (see below) or until the number of windows
surpasses q, the maximum number of
allowed simultaneous windows.</p>
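      <p>The window-triggering scheme can be sketched as follows (a sketch under our own naming, with the window-length parameter written as ell; the actual parameter values used are given in Section 4):</p>

```python
def active_windows(step, ell=10, q=60):
    """Windows alive after `step` triggering events.

    A new window of length ell starts every ell frames; every existing
    window grows by ell frames at each step; at most q windows coexist.
    Returns (start_frame, current_length) pairs, oldest first.
    """
    window_count = min(step + 1, q)
    oldest = step + 1 - window_count          # drop windows beyond the cap
    return [(j * ell, (step - j + 1) * ell)
            for j in range(oldest, step + 1)]

# after the third trigger, three windows of lengths 30, 20 and 10 frames
# are evaluated by every HMM: active_windows(2) == [(0, 30), (10, 20), (20, 10)]
```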
      <p>In this way, at time tg we have g windows of varying
lengths and the outputs of the K HMMs for each window
(i.e., a total of g × K probabilities, where K is the number of
gestures or activities the system can recognize). The
outputs of the HMMs are given in the form of probabilities. To
obtain a prediction for window i we simply keep the
label/gesture corresponding to the model that obtains the
highest probability in window i, that is, arg max_k P(Mk, Wi).</p>
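      <p>The per-window prediction and the vote count can be sketched as follows (the names are our own; probs[w][k] stands for the Forward probability of model k on window w):</p>

```python
def vote_fractions(probs):
    """probs[w][k]: probability of gesture model k on window w.

    Each window votes for its most probable gesture (the arg max over k);
    the result is the fraction of windows voting for each gesture.
    """
    num_gestures = len(probs[0])
    votes = [max(range(num_gestures), key=lambda k: window[k])
             for window in probs]
    return [votes.count(k) / len(votes) for k in range(num_gestures)]

# three windows, two gesture models: two windows vote for gesture 1
# vote_fractions([[0.1, 0.9], [0.2, 0.8], [0.6, 0.4]]) == [1/3, 2/3]
```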
      <p>To detect the presence of a gesture in the continuous
video, we estimate at each time tj the percentage of votes
that each of the K gestures obtains, considering the
predictions for the j windows. If the percentage of votes exceeds
a threshold τ, we trigger a flag indicating that a gesture has
been recognized. While the flag is on, we keep increasing
and generating windows and storing predictions until there is
a decrement in the percentage of votes for the dominant
gesture; that is, the end of the gesture is placed at the frame where
the number of votes decreases. Alternatively, we
also experimented with varying the window in which we
segment the gesture: we segmented the gesture 10 frames before
and 10 frames after we detect the decrement in the
percentage of votes; we report experimental results under the three
settings in Section 4. At that instant the votes for each type
of gesture are counted, and the gesture with the maximum
number of votes is selected as the recognized gesture. Once a
gesture is recognized, the system is reset; that is, all ongoing
windows are discarded and the process starts again with a
single window.</p>
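      <p>The detection logic described above can be sketched as follows (a simplified sketch: we track only the dominant gesture's vote fraction per step, the threshold name tau is our own, and the reset and the ±10-frame variants are omitted):</p>

```python
def find_segmentation_step(dominant_fractions, tau=0.5):
    """Spot the end of a gesture from the vote fractions of the dominant
    gesture at successive time steps.

    A flag is raised once the fraction exceeds tau; the gesture is
    segmented at the first step where the fraction decreases afterwards.
    Returns the step index, or None if no gesture was spotted.
    """
    flag = False
    previous = 0.0
    for step, fraction in enumerate(dominant_fractions):
        if fraction > tau:
            flag = True                       # a gesture seems to be present
        if flag and previous > fraction:      # first decrement: segment here
            return step
        previous = fraction
    return None

# votes build up, peak, then drop: the drop marks the end of the gesture
# find_segmentation_step([0.2, 0.4, 0.6, 0.7, 0.65]) == 4
```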
      <p>One should note that the fewer windows we consider for
making a decision, the higher the chance of making a mistake.
Therefore, we prevent the proposed technique from making
predictions before it has analyzed at least p windows. Under
these settings, our proposal will try to segment and recognize
gestures only when the number of windows/predictions is
between p and q.</p>
      <p>Figure 2 illustrates the process for simultaneous
segmentation and recognition for a particular test sequence containing
one gesture. The first three plots show the probabilities
returned by the HMMs for three gestures; we show the
probabilities for windows starting at different frames of the
continuous sequence. The fourth plot shows the percentage of votes
for a particular gesture at different segments of the video.
For this particular example, the proposed approach is able to
correctly segment the gesture (the boundaries of the gesture
present in the sequence are shown in gray). In the next
section we report experimental results obtained with our method
for simultaneous segmentation and recognition of gestures.</p>
    </sec>
    <sec id="sec-4">
      <title>Experimental results</title>
      <p>We performed experiments with the multi-windows approach
by trying to recognize gestures to command a robot.
Specifically, we consider three gestures: move-right (MR),
attention (ATTN), and move-left (ML); these are illustrated in Figure 3.
For evaluation we generated sequences of gestures of varying
lengths and applied our method. The number of training and
testing gestures are shown in Table 1. Training gestures were
manually segmented. Test sequences are not segmented; they
contain a single gesture, but the gesture is surrounded by large
portions of continuous video without a gesture, see Figure 2.</p>
      <p>Three different subjects recorded the training videos. The
test sequences were recorded by six subjects (three of which
were different from those that recorded the training ones).
The skeleton information was recorded with the NUI Capture
software at a rate of 30 fps. The average duration of the training
gestures was 35.33 frames, whereas the average duration
of the test sequences was 94 frames (the maximum and minimum
durations were 189 and 55 frames, respectively).</p>
      <p>All of the parameters of our model were fixed after
preliminary experimentation. The best values we found for
them are as follows: ℓ = 10, p = 30, q = 60, τ = 100.
After training the HMMs individually, we applied the
multi-windows approach to each of the test sequences.</p>
      <p>We evaluate the segmentation and recognition performance
as follows. We say the proposed method correctly segments
a video when the predicted segmentation point is at a distance of
ε frames (or less) from the final frame of the gesture; we
report results for ε = 5, 10, 15, 20. On the other hand, we say
the proposed method correctly recognizes a gesture when the
gesture predicted by our method (once segmented) is
the correct one.</p>
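      <p>This segmentation criterion can be stated compactly (a sketch; all names, including the tolerance parameter eps, are our own):</p>

```python
def correctly_segmented(predicted_end, true_end, eps):
    """A segmentation is correct when the predicted boundary lies within
    eps frames of the gesture's true final frame."""
    return eps >= abs(predicted_end - true_end)

def segmentation_rate(predictions, truths, eps):
    """Fraction of test sequences segmented correctly at tolerance eps."""
    correct = sum(1 for p, t in zip(predictions, truths)
                  if correctly_segmented(p, t, eps))
    return correct / len(truths)

# e.g. a boundary predicted at frame 93 for a gesture ending at frame 90
# counts as correct for eps = 5 but not for eps = 2
```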
      <p>Table 2 shows the segmentation and recognition
performance obtained by the multi-windows approach. We report
results when segmenting the gesture before, at, and after the
point where the decrement in the percentage of votes is detected; see Section 3.</p>
      <p>From Table 2 it can be observed that segmentation
performance is low under a hard criterion (i.e., ε = 5 frames of
distance); the highest performance in this setting was 29.82%.
However, the recognition performance is quite high for the
same configuration, achieving recognition rates of 82.35%.
Thus, the method offers a good tradeoff between
segmentation and recognition performance. (Despite the fact that
segmentation performance may seem low, for the considered
application it is not too bad for a user to repeat a gesture up to
three times in order for the robot to correctly identify the
intended command; what is required is an accurate recognition
system, so that the robot clearly understands the command
even when the user has to repeat the gesture a couple of times.)</p>
      <p>In order to determine how good our recognition results
were, we performed an experiment in which we classified all
of the gestures in the test sequences after manually segmenting
them (top-line). The average recognition performance for that
experiment was 85.96%. This represents the
best recognition performance we could obtain with these
features and trained models. Looking at our best recognition
result (column Before, row 1), we can see that the
recognition performance of the multi-windows approach is very close
to what we would obtain when classifying segmented gestures.</p>
      <p>As expected, segmentation performance improves when
we relax the allowed distance to the boundaries of the gesture (i.e.,
for increasing ε). When the allowed distance is ε = 20
frames, we were able to segment up to 80% of the gestures;
recognition rates decreased accordingly. When we compared
the segmentation performance obtained when segmenting the
gesture before, at, or after the decrement of votes, we found
that the performance was very similar, although
segmenting the gesture 10 frames before the detected decrement
seems to be a slightly better option. This makes sense, as we would
expect to see a decrement of votes once the gesture
has already finished.</p>
      <p>Regarding efficiency, in preliminary experiments we have
found that the proposed method can run in near real-time: on a
state-of-the-art workstation, it can process data at a rate of
30 fps, which is enough for many human-computer interaction
tasks. Nevertheless, we still have to perform a comprehensive
evaluation of our proposal in terms of efficiency, taking
into account that high-performance computers are not
available in some scenarios.</p>
      <p>From the experimental study presented in this section we
can conclude that the proposed method is a promising
solution to the problem of simultaneous gesture segmentation and
recognition. The simplicity of implementation and the
efficiency of our approach are beneficial for the development of
ubiquitous and human-centric systems.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions and future work directions</title>
      <p>We proposed a new method for the simultaneous segmentation
and recognition of gestures in continuous video. The
proposed approach combines the outputs of classification
models evaluated on multiple temporal windows. These windows
are dynamically and incrementally created as the video is
scanned. We reported preliminary results obtained with the
proposed technique for segmenting and recognizing gestures to
command a robot. Experimental results reveal that the
recognition performance of our method is very close to that
obtained when using manually segmented gestures. The
segmentation performance of our proposal is still low, yet the current
performance is acceptable for the considered application. The
following conclusions can be drawn so far:</p>
      <p>The proposed method is capable of segmenting gestures
(with an error of 5 frames) at low-to-mild rates.
Nevertheless, these rates are accurate enough for some
applications. Recall that we are analyzing a continuous
video sequence and that we do not require a model
for no-gesture, as related methods do.</p>
      <p>Recognition rates achieved by the method are acceptable
for a number of applications and domains. In fact,
recognition results were very close to what we would obtain
when classifying manually-segmented gestures.</p>
      <p>The proposed method is very easy to implement and can
work in near real-time; hence it is readily applicable in
ubiquitous data mining and human-centric applications.</p>
      <p>The proposed method can be improved in several ways, but
it remains to be compared to alternative techniques. In this
aspect, we have already implemented the method from [Kim
et al., 2007], but its results are considerably worse than those of our
proposal. We are looking for alternative methods against which to compare
our proposal.</p>
      <p>Current and future work includes extending the number
of gestures considered in this study and implementing the
method on the robot of our laboratory
(http://ccc.inaoep.mx/~markovito/). Additionally, we
are working on different ways to improve the segmentation
performance of our method, including different voting
schemes to combine the outputs of the different windows.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>[Aggarwal and Ryoo</source>
          , 2011]
          <string-name>
            <given-names>J. K.</given-names>
            <surname>Aggarwal</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Ryoo</surname>
          </string-name>
          .
          <article-title>Human activity analysis: a review</article-title>
          .
          <source>ACM Computing Surveys</source>
          ,
          <volume>43</volume>
          (
          <issue>3</issue>
          ):
          <fpage>16</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [Aviles et al.,
          <year>2011</year>
          ]
          <string-name>
            <given-names>H.H.</given-names>
            <surname>Aviles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.E.</given-names>
            <surname>Sucar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.E.</given-names>
            <surname>Mendoza</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.A.</given-names>
            <surname>Pineda</surname>
          </string-name>
          .
          <article-title>A comparison of dynamic naive Bayesian classifiers and hidden Markov models for gesture recognition</article-title>
          .
          <source>Journal of Applied Research and Technology</source>
          ,
          <volume>9</volume>
          (
          <issue>1</issue>
          ):
          <fpage>81</fpage>
          -
          <lpage>102</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [Derpanis et al.,
          <year>2010</year>
          ]
          <string-name>
            <given-names>K. G.</given-names>
            <surname>Derpanis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sizintsev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cannons</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R. P.</given-names>
            <surname>Wildes</surname>
          </string-name>
          .
          <article-title>Efficient action spotting based on a spacetime oriented structure representation</article-title>
          .
          <source>In Proc. of CVPR</source>
          , pages
          <fpage>1990</fpage>
          -
          <lpage>1997</lpage>
          . IEEE,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <source>[Eunju</source>
          , 2010]
          <string-name>
            <given-names>K.</given-names>
            <surname>Eunju</surname>
          </string-name>
          .
          <article-title>Human activity recognition and pattern discovery</article-title>
          .
          <source>Pervasive Computing</source>
          ,
          <volume>9</volume>
          (
          <issue>1</issue>
          ):
          <fpage>48</fpage>
          -
          <lpage>53</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [Huynh et al.,
          <year>2008</year>
          ]
          <string-name>
            <given-names>T.</given-names>
            <surname>Huynh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fritz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Schiele</surname>
          </string-name>
          .
          <article-title>Discovery of activity patterns using topic models</article-title>
          .
          <source>In Proc. of UbiComp'08</source>
          , pages
          <fpage>10</fpage>
          -
          <lpage>19</lpage>
          . ACM Press,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [Kim et al.,
          <year>2007</year>
          ]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Song</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Kim</surname>
          </string-name>
          .
          <article-title>Simultaneous gesture segmentation and recognition based on forward spotting accumulative hmms</article-title>
          .
          <source>Pattern recognition</source>
          ,
          <volume>40</volume>
          (
          <issue>11</issue>
          ):
          <fpage>3012</fpage>
          -
          <lpage>3026</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [Malgireddy et al.,
          <year>2012</year>
          ]
          <string-name>
            <given-names>M.R.</given-names>
            <surname>Malgireddy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Nwogu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Govindaraju</surname>
          </string-name>
          .
          <article-title>A temporal bayesian model for classifying, detecting and localizing activities in video sequences</article-title>
          .
          <source>In Proc. of CVPRW</source>
          , pages
          <fpage>43</fpage>
          -
          <lpage>48</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <source>[Mitra</source>
          , 2007]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mitra</surname>
          </string-name>
          .
          <article-title>Gesture recognition: a survey</article-title>
          .
          <source>IEEE Transactions on Systems, Man, and Cybernetics</source>
          , Part C,
          <volume>37</volume>
          (
          <issue>3</issue>
          ):
          <fpage>311</fpage>
          -
          <lpage>324</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <source>[Rabiner</source>
          , 1990]
          <string-name>
            <given-names>L. E.</given-names>
            <surname>Rabiner</surname>
          </string-name>
          .
          <article-title>Readings in speech recognition, chapter A tutorial on hidden Markov models and selected applications in speech recognition</article-title>
          , pages
          <fpage>267</fpage>
          -
          <lpage>296</lpage>
          . Morgan Kaufmann,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <source>[Webb and Ashley</source>
          , 2012]
          <string-name>
            <given-names>J.</given-names>
            <surname>Webb</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Ashley</surname>
          </string-name>
          .
          <article-title>Beginning Kinect Programming with the Microsoft Kinect SDK</article-title>
          . Apress,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [Yang et al.,
          <year>2007</year>
          ]
          <string-name>
            <given-names>H.D.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Park</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S. W.</given-names>
            <surname>Lee</surname>
          </string-name>
          .
          <article-title>Gesture spotting and recognition for human-robot interaction</article-title>
          .
          <source>IEEE Transactions on robotics</source>
          ,
          <volume>23</volume>
          (
          <issue>2</issue>
          ):
          <fpage>256</fpage>
          -
          <lpage>270</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [Yuan et al.,
          <year>2009</year>
          ]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          .
          <article-title>Discriminative subvolume search for efficient action detection</article-title>
          .
          <source>In Proc. of CVPR. IEEE</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>