Simultaneous segmentation and recognition of gestures for human-machine interaction

Harold Vasquez, L. Enrique Sucar, Hugo Jair Escalante
Department of Computational Sciences
Instituto Nacional de Astrofísica, Óptica y Electrónica, Tonantzintla, 72840, Puebla, Mexico.
{hvasquez,esucar,hugojair}@inaoep.mx

Abstract

Human-activity and gesture recognition are two problems lying at the core of human-centric and ubiquitous systems: knowing what activities/gestures users are performing allows systems to execute actions accordingly. State-of-the-art technology from computer vision and machine intelligence allows us to recognize gestures at acceptable rates when gestures are segmented (i.e., each video contains a single gesture). In ubiquitous environments, however, continuous video is available and thus systems must be capable of detecting when a gesture is being performed and recognizing it. This paper describes a new method for the simultaneous segmentation and recognition of gestures from continuous video. A multi-window approach is proposed in which the predictions of several recognition models are combined, where each model is evaluated on a different segment of the continuous video. The proposed method is evaluated on the problem of recognizing gestures to command a robot. Preliminary results show that the proposed method is very effective for recognizing the considered gestures when they are correctly segmented, although there is still room for improvement in terms of its segmentation capabilities. The proposed method is highly efficient and does not require learning a model for no-gesture, in contrast to related methods.

1 Introduction

Human-computer interaction technology plays a key role in ubiquitous data mining (i.e., the extraction of interesting patterns from data generated in human-centric environments), see [Eunju, 2010]. Among all the alternative forms of interaction, gestures are among the most natural and intuitive for users. In fact, gestures are widely used to complement verbal communication between humans. Research advances in computer vision and machine learning have led to the development of gesture recognition technology that is able to recognize gestures at very acceptable rates [Aggarwal and Ryoo, 2011; Mitra, 2007]. However, most of the available methods for gesture recognition require gestures to be segmented before the recognition process begins [Aviles et al., 2011]. Clearly, this type of method is not well suited for ubiquitous systems (and real applications in general), where the recognition of gestures must be done from continuous video in real time [Eunju, 2010; Huynh et al., 2008].

This paper introduces a new approach for the simultaneous segmentation and recognition of gestures in continuous video. The proposed method implements a voting strategy using the predictions obtained from multiple gesture models evaluated at different time windows, see Figure 1. Windows are dynamically created by incrementally scanning the continuous video. When the votes from the multiple models favor a particular gesture, we segment the video and make a prediction: we predict the gesture corresponding to the model that obtained the majority of votes across windows.

Figure 1: Graphical illustration of the proposed approach. On the top we show a video sequence that can be divided into sections of no gesture (NG) and gesture, identified by the class of gesture (G1, G2, G3). Below we illustrate a series of windows that are dynamically created and extended every ∆ time units. That is, at the beginning W1 is created, then at t1, W2 is created and W1 is extended by ∆, and so on. At t5 there are 5 windows of different sizes; for each window we estimate the probability of all gestures using HMMs.

We use as features the body-part positions obtained by a Kinect sensor. As predictive model we use hidden Markov models (HMMs), one of the most widely used models for gesture recognition [Aviles et al., 2011; Aggarwal and Ryoo, 2011; Mitra, 2007]. The proposed method is evaluated on the problem of recognizing gestures to command a robot. Preliminary results show that the proposed method is very effective for recognizing the considered gestures when they are correctly segmented. However, there is still room for improvement in terms of its segmentation capabilities. The proposed method is highly efficient and does not require learning a model for no-gesture, in contrast to related works.

The rest of this paper is organized as follows. The next section briefly reviews related work on gesture spotting. Section 3 describes the proposed approach. Section 4 reports experimental results that show evidence of the performance of the proposed technique. Section 5 outlines preliminary conclusions and discusses future work directions.

2 Related work

Several methods for the simultaneous segmentation and recognition of gestures (a task also known as gesture spotting) have been proposed so far [Derpanis et al., 2010; Yuan et al., 2009; Malgireddy et al., 2012; Kim et al., 2007; Yang et al., 2007]. Some methods work directly with spatio-temporal patterns extracted from video [Derpanis et al., 2010; Yuan et al., 2009]. Although effective, these methods are very sensitive to changes in illumination, scale, appearance, and viewpoint. On the other hand, there are model-based techniques that use the position of body parts to train probabilistic models (e.g., HMMs) [Aggarwal and Ryoo, 2011; Mitra, 2007]. In the past, these types of methods were limited by the need for specialized sensors to obtain body-part positions. Nowadays, the availability of the Kinect (which can extract skeleton information in real time) has partially circumvented this limitation [Webb and Ashley, 2012].

Besides the data acquisition process, some of these methods require the construction of a no-gesture model (e.g., [Kim et al., 2007]) or a transition-gesture model (e.g., [Yang et al., 2007]). The goal of such models is to determine within a video when the user (if any) is not performing any gesture, or the transition between different gestures. Building a model for no-gesture is a complicated and subjective task that depends on the particular application where the gesture recognition system is to be implemented [Kim et al., 2007]. In ubiquitous systems, however, we want gesture recognition methods to work in very general conditions and under highly dynamic environments. Hence, a model for no-gesture is much more complicated to generate in these conditions.

Finally, it is worth mentioning that many of the available techniques for gesture spotting can be very complex to implement. This is a particularly important aspect to consider in some domains, for example mobile devices and/or human-robot interaction, where there are limited resources and restricted programming tools for the implementation of algorithms. Thus, in these domains simplicity is sometimes preferred at the expense of losing a little precision.

The method we propose in this paper performs segmentation and recognition of gestures simultaneously and attempts to address the limitations of most of the available techniques. Specifically, our proposal is efficient and very simple to implement; it is robust, to some extent, to problems present in appearance-based methods; and, more importantly, it does not require the specification of a no-gesture model.
3 Multiple-windows approach

We face the problem of simultaneously segmenting and recognizing gestures in continuous video. (Although we use processed body-part positions as features, we refer to the sequence of these features as video in order to simplify the explanation.) That is, given a sequence of images (video) we want to determine where a gesture is being performed (independently of the type of gesture) and then recognize which gesture is actually being performed. We propose a solution based on multiple windows that are incrementally and dynamically created. Each window is passed through predictive models, each trained to recognize a particular gesture. The predictions of the models for the different windows are accumulated; when the model for a particular gesture obtains a majority of votes, we segment the video and make a prediction, cf. Figure 1.

The underlying hypothesis of our work is that when a window covers a large portion of a particular gesture, the confidence in the prediction of the correct model will be high, while those of the other models will be low. Accumulating predictions allows us to be more confident that the gesture is being performed within a neighborhood of temporal windows.

The rest of this section describes the proposed technique in detail. First we describe the considered features, next the predictive models, and finally the approach to simultaneous segmentation and recognition of gestures.

3.1 Features

We use the information obtained through a Kinect as input to our gesture spotting method. The Kinect is capable of capturing RGB and depth video, as well as the positions of certain body parts, at rates of up to 30 frames per second (fps). In this work we consider gestures to command a robot that are performed with the hands. Therefore, we use the positions of the hands, as given by the Kinect, as features. For each frame we obtain a sextuple indicating the position of both hands in the x, y, and z coordinates. Since we consider standard hidden Markov models (HMMs) for classification, we had to preprocess the continuous data provided by the sensor. Our preprocessing consists in estimating tendencies: we compute the difference between the positions obtained in consecutive frames and codify it into two values, +1 when the difference is positive and 0 when the difference is zero or negative. Thus, the observations are sextuples of zeros and ones (the number of different observations is 2^6). These are the inputs to the HMMs.
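The preprocessing described above is simple to reproduce. The following sketch (our own illustration in Python with NumPy, not the authors' code; the column order of the input array is an assumption, since the paper only states that a sextuple of hand coordinates is used) turns the per-frame hand positions into the discrete observation symbols fed to the HMMs.

```python
import numpy as np

def encode_observations(hand_positions):
    """Convert hand trajectories into discrete HMM observation symbols.

    hand_positions: float array of shape (T, 6) with the x, y, z
    coordinates of the left and right hand for T consecutive frames
    (the column order is our assumption).
    Returns an int array of length T - 1 with symbols in {0, ..., 63}.
    """
    # Frame-to-frame differences (tendencies) for each coordinate.
    diffs = np.diff(hand_positions, axis=0)
    # +1 when the difference is positive, 0 when zero or negative.
    bits = (diffs > 0).astype(int)          # shape (T - 1, 6)
    # Pack the six binary values into one symbol: 2**6 = 64 observations.
    return bits @ (2 ** np.arange(6))
```

Each sequence thus becomes a string over a 64-symbol alphabet, matching the discrete observation space of the standard HMMs described next.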
3.2 Gesture recognition models

As classification model we consider an HMM (we used the HMM implementation from Matlab's Statistics Toolbox), one of the most popular models for gesture recognition [Aviles et al., 2011; Aggarwal and Ryoo, 2011; Mitra, 2007]. For each gesture i to be recognized we train an HMM; let M_i denote the HMM for the i-th gesture, with i ∈ {1, ..., K} when considering K different gestures. The models are trained with the Baum-Welch algorithm using complete sequences depicting (only) the gestures of interest. Each HMM was trained for a maximum of 200 iterations and a tolerance of 0.00001 (the training process stops when changes between the probabilities of successive transition/emission matrices do not exceed this value); the number of states in each HMM was fixed to 3, after some preliminary experimentation.

For making predictions we evaluate the different HMMs over the test sequence using the Forward algorithm, see [Rabiner, 1990] for details. We use the probability returned by each HMM as its confidence on the gesture class for a particular window.
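The paper relies on Matlab's HMM routines for training and scoring; as a rough, self-contained illustration of the scoring step only, the sketch below implements the scaled Forward algorithm for a discrete HMM (the parameter names pi, A, and B are ours). Its output, the log-likelihood of an observation window, plays the role of the per-window confidence P(M_k, W_i).

```python
import numpy as np

def forward_loglik(obs, pi, A, B):
    """Log-likelihood of a discrete observation sequence under an HMM.

    obs: int array of observation symbols (e.g., the encoded sextuples).
    pi:  initial state distribution, shape (S,).
    A:   state transition matrix, shape (S, S).
    B:   emission matrix, shape (S, V), with V = 64 symbols here.
    Uses per-step scaling to avoid numerical underflow.
    """
    alpha = pi * B[:, obs[0]]
    scale = alpha.sum()
    alpha = alpha / scale
    loglik = np.log(scale)
    for symbol in obs[1:]:
        alpha = (alpha @ A) * B[:, symbol]
        scale = alpha.sum()
        alpha = alpha / scale
        loglik += np.log(scale)
    return loglik
```

In the multi-window scheme, every trained model is scored on every active window with such a routine, and each window votes for the model with the highest score.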
3.3 Simultaneous segmentation and recognition

The multi-windows approach to gesture segmentation and recognition is as follows, see Figure 1. To process a continuous video we trigger windows incrementally: at time t0 a temporal window W0 of length ∆ is triggered and all of the (trained) HMMs are evaluated on this window. At time t1 we trigger another window W1 of length ∆ and extend window W0 by ∆ frames; the HMMs are evaluated on these two windows as well. This process is repeated until a certain condition is met (see below) or until the oldest window surpasses a maximum length, which corresponds to the maximum number of allowed simultaneous windows, q.

In this way, at time tg we have g windows of varying lengths and the outputs of the K HMMs for each window (i.e., a total of g × K probabilities, where K is the number of gestures or activities that the system can recognize). The outputs of the HMMs are given in the form of probabilities. To obtain a prediction for window i we simply keep the label/gesture corresponding to the model that obtains the highest probability in window i, that is, argmax_k P(M_k, W_i).

In order to detect the presence of a gesture in the continuous video, we estimate at each time t_j the percentage of votes that each of the K gestures obtains, considering the predictions for the j windows. If the percentage of votes exceeds a threshold, τ, we trigger a flag indicating that a gesture has been recognized. When the flag is on, we keep extending and generating windows and storing predictions until there is a decrement in the percentage of votes for the dominant gesture. That is, the end of the gesture is placed at the frame where the decrement in the number of votes occurs. Alternatively, we also experimented with varying the point at which we segment the gesture: we segmented the gesture 10 frames before and 10 frames after detecting the decrement in the percentage of votes; we report experimental results under these three settings in Section 4. At this instant the votes for each type of gesture are counted, and the gesture with the maximum number of votes is selected as the recognized gesture. Once a gesture is recognized, the system is reset; that is, all ongoing windows are discarded and the process starts again with a single window.

One should note that the fewer windows we consider for making a decision, the higher the chance of making a mistake. Therefore, we prevent the proposed technique from making predictions before having analyzed at least p windows. Under these settings, our proposal only tries to segment and recognize gestures when the number of windows/predictions is between p and q.
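To make the window bookkeeping concrete, the sketch below is our reading of the procedure just described (it reuses forward_loglik and the encoded observation stream from the previous sketches; the exact semantics of the vote threshold τ and of the window limit q are our interpretation, and the before/after segmentation variants are omitted). Every ∆ frames a new window is opened and all open windows are extended and re-scored; once at least p windows agree on a gesture with a vote share of at least τ percent, a drop in that share is taken as the end of the gesture.

```python
import numpy as np

def spot_gestures(obs, models, delta=10, p=30, q=60, tau=100.0):
    """Simultaneous segmentation and recognition over an observation stream.

    obs:    encoded observation symbols for the whole continuous video.
    models: list of K tuples (pi, A, B), one trained HMM per gesture.
    Returns a list of (end_frame, gesture_index) detections.
    """
    detections = []
    windows = []            # start frames of the currently open windows
    prev_share, flag = 0.0, False
    t = 0
    while t + delta <= len(obs):
        t += delta
        windows.append(t - delta)      # open a new window every delta frames
        windows = windows[-q:]         # keep at most q simultaneous windows
        # Each window votes for the model with the highest Forward score.
        votes = [int(np.argmax([forward_loglik(obs[s:t], pi, A, B)
                                for (pi, A, B) in models]))
                 for s in windows]
        counts = np.bincount(votes, minlength=len(models))
        dominant = int(np.argmax(counts))
        share = 100.0 * counts[dominant] / len(votes)
        if len(windows) >= p and share >= tau:
            flag = True                # a gesture is likely being performed
        elif flag and share < prev_share:
            # Decrement in the vote share: taken as the end of the gesture.
            detections.append((t, dominant))
            windows, flag, prev_share = [], False, 0.0
            continue
        prev_share = share
    return detections
```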
Figure 2 illustrates the process of simultaneous segmentation and recognition for a particular test sequence containing one gesture. The first three plots show the probabilities returned by the HMMs for the three gestures; we show the probabilities for windows starting at different frames of the continuous sequence. The fourth plot shows the percentage of votes for a particular gesture at different segments of the video. For this particular example, the proposed approach is able to segment the gesture correctly (the boundaries of the gesture present in the sequence are shown in gray). In the next section we report experimental results obtained with our method for simultaneous segmentation and recognition of gestures.

Figure 2: Multi-windows technique in action. The first three plots show the probabilities obtained by each HMM for windows starting at different times. In the bottom-right plot we show the number of votes obtained by the dominant HMM; note that the number of votes starts to diminish, which is taken as an indication of the end of the gesture (best viewed in color).

4 Experimental results

We performed experiments with the multi-windows approach by trying to recognize gestures to command a robot. Specifically, we consider three gestures: move-right (MR), attention (ATTN), and move-left (ML); these are illustrated in Figure 3. For evaluation we generated sequences of gestures of varying lengths and applied our method. The numbers of training and testing gestures are shown in Table 1. Training gestures were manually segmented. Test sequences are not segmented; they contain a single gesture, but the gesture is surrounded by large portions of continuous video without a gesture, see Figure 2.

Figure 3: The three gestures considered for experimentation. From left to right: move-right, attention, move-left.

Table 1: Characteristics of the data set considered for experimentation. We show the number of training videos per gesture and, in row two, the number of gestures present in the test sequences.

                   MR    ATTN    ML
  Training vids.   30     30     30
  Testing vids.    18     18     21

Three different subjects recorded the training videos. The test sequences were recorded by six subjects (three of whom were different from those who recorded the training ones). The skeleton information was recorded with the NUI Capture software (http://nuicapture.com/) at a rate of 30 fps. The average duration of the training gestures was 35.33 frames, whereas the average duration of the test sequences was 94 frames (maximum and minimum durations were 189 and 55 frames, respectively).

All of the parameters of our model were fixed after preliminary experimentation. The best values we found are as follows: ∆ = 10, p = 30, q = 60, τ = 100. After training the HMMs individually, we applied the multi-windows approach to each of the test sequences.

We evaluate the segmentation and recognition performance as follows. We say the proposed method correctly segments a video when the segmentation prediction is at a distance of δ frames (or less) from the final frame of the gesture; we report results for δ = 5, 10, 15, 20. On the other hand, we say the proposed method correctly recognizes a gesture when the gesture predicted by our method (for a previously segmented gesture) was the correct one.
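The evaluation criterion just defined can be stated compactly. The helper below (our formulation; in particular, measuring recognition over the correctly segmented cases is our reading of the text) counts a prediction as a correct segmentation when its end frame lies within δ frames of the annotated end of the gesture.

```python
def evaluate(detections, ground_truth, delta=5):
    """Segmentation and recognition rates under the delta-frame criterion.

    detections:   list of (end_frame, gesture_index), one per test sequence.
    ground_truth: list of (true_end_frame, true_gesture_index), same order.
    """
    segmented = recognized = 0
    for (end, label), (true_end, true_label) in zip(detections, ground_truth):
        if abs(end - true_end) <= delta:
            segmented += 1
            recognized += int(label == true_label)
    seg_rate = segmented / len(ground_truth)
    rec_rate = recognized / segmented if segmented else 0.0
    return seg_rate, rec_rate
```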
Table 2: Segmentation (Seg.) and recognition (Rec.) performance of the multi-windows technique.

          Before             In               After
  δ     Seg.     Rec.     Seg.     Rec.     Seg.     Rec.
  5    29.82%   82.35%   26.32%   60.00%   26.32%   80.00%
 10    54.39%   67.74%   63.16%   66.67%   50.88%   68.97%
 15    59.65%   64.71%   70.18%   67.50%   56.14%   68.75%
 20    78.95%   62.22%   80.70%   63.04%   73.68%   66.67%

Table 2 shows the segmentation and recognition performance obtained by the multi-windows approach. We report results when segmenting the gesture before, in, and after the point where the decrement in the percentage of votes is detected, see Section 3. From Table 2 it can be observed that segmentation performance is low under a hard criterion (i.e., δ = 5 frames of distance); the highest performance in this setting was 29.82%. However, the recognition performance is quite high for the same configuration, achieving recognition rates of 82.35%. Thus, the method offers a good tradeoff between segmentation and recognition performance. (Although segmentation performance may seem low, one should note that for the considered application it is not too bad for a user to repeat a gesture up to three times in order for the robot to correctly identify the intended command. What is required instead is an accurate recognizer, so that the robot clearly understands the command even when the user has to repeat the gesture a couple of times.)

In order to determine how good or bad our recognition results were, we performed an experiment in which we classified all of the gestures in the test sequences after manually segmenting them (top line). The average recognition performance for that experiment was 85.96%. This represents the best recognition performance we could obtain with the considered features and trained models. Looking at our best recognition result (column Before, row 1), we can see that the recognition performance of the multi-windows approach is very close to what we would obtain when classifying segmented gestures.

As expected, segmentation performance improves when we relax the distance to the boundaries of the gesture (i.e., for increasing δ). When the allowed distance is δ = 20 frames, we were able to segment up to 80% of the gestures. Recognition rates decreased accordingly. When we compare the segmentation performance obtained when segmenting the gesture before, in, or after the decrement of votes, we find that the performance is very similar, although segmenting the gesture 10 frames before the detected decrement seems to be a slightly better option. This makes sense, as we would expect to see a decrement of votes when the gesture has already finished.

Regarding efficiency, in preliminary experiments we have found that the proposed method can run in near real time. On a state-of-the-art workstation it can process data at a rate of 30 fps, which is enough for many human-computer interaction tasks. Nevertheless, we still have to perform a comprehensive evaluation of our proposal in terms of efficiency, taking into account that in some scenarios high-performance computers are not available.

From the experimental study presented in this section we can conclude that the proposed method is a promising solution to the problem of simultaneous gesture segmentation and recognition. The simplicity of implementation and the efficiency of our approach are beneficial for the development of ubiquitous and human-centric systems.

5 Conclusions and future work directions

We proposed a new method for the simultaneous segmentation and recognition of gestures in continuous video. The proposed approach combines the outputs of classification models evaluated in multiple temporal windows. These windows are dynamically and incrementally created as the video is scanned. We report preliminary results obtained with the proposed technique for segmenting and recognizing gestures to command a robot. Experimental results reveal that the recognition performance of our method is very close to that obtained when using manually segmented gestures. The segmentation performance of our proposal is still low, yet the current performance is acceptable for the considered application. The following conclusions can be drawn so far:

• The proposed method is capable of segmenting gestures (with an error of 5 frames) at low-to-mild rates. Nevertheless, these rates are accurate enough for some applications. Recall that we analyze a continuous sequence of video and that we do not require a model for no-gesture, as related methods do.

• The recognition rates achieved by the method are acceptable for a number of applications and domains. In fact, recognition results were very close to what we would obtain when classifying manually segmented gestures.

• The proposed method is very easy to implement and can work in near real time; hence it is readily applicable in ubiquitous data mining and human-centric applications.

The proposed method can be improved in several ways, but it remains to be compared to alternative techniques. In this respect we have already implemented the method from [Kim et al., 2007], but its results are poor in comparison with our proposal. We are looking for alternative methods against which to compare our proposal.

Current and future work includes extending the number of gestures considered in this study and implementing the method in the robot of our laboratory (http://ccc.inaoep.mx/~markovito/). Additionally, we are working on different ways to improve the segmentation performance of our method, including using different voting schemes to combine the outputs of the different windows.

References

[Aggarwal and Ryoo, 2011] J. K. Aggarwal and M. S. Ryoo. Human activity analysis: a review. ACM Computing Surveys, 43(3):16, 2011.

[Aviles et al., 2011] H. H. Aviles, L. E. Sucar, C. E. Mendoza, and L. A. Pineda. A comparison of dynamic naive Bayesian classifiers and hidden Markov models for gesture recognition. Journal of Applied Research and Technology, 9(1):81–102, 2011.

[Derpanis et al., 2010] K. G. Derpanis, M. Sizintsev, K. Cannons, and R. P. Wildes. Efficient action spotting based on a spacetime oriented structure representation. In Proc. of CVPR, pages 1990–1997. IEEE, 2010.

[Eunju, 2010] K. Eunju. Human activity recognition and pattern discovery. Pervasive Computing, 9(1):48–53, 2010.

[Huynh et al., 2008] T. Huynh, M. Fritz, and B. Schiele. Discovery of activity patterns using topic models. In Proc. of UbiComp'08, pages 10–19. ACM Press, 2008.

[Kim et al., 2007] D. Kim, J. Song, and D. Kim. Simultaneous gesture segmentation and recognition based on forward spotting accumulative HMMs. Pattern Recognition, 40(11):3012–3026, 2007.

[Malgireddy et al., 2012] M. R. Malgireddy, I. Nwogu, and V. Govindaraju. A temporal Bayesian model for classifying, detecting and localizing activities in video sequences. In Proc. of CVPRW, pages 43–48, 2012.

[Mitra, 2007] S. Mitra. Gesture recognition: a survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C, 37(3):311–324, 2007.

[Rabiner, 1990] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. In Readings in Speech Recognition, pages 267–296. Morgan Kaufmann, 1990.

[Webb and Ashley, 2012] J. Webb and J. Ashley. Beginning Kinect Programming with the Microsoft Kinect SDK. Apress, 2012.

[Yang et al., 2007] H. D. Yang, A. Y. Park, and S. W. Lee. Gesture spotting and recognition for human-robot interaction. IEEE Transactions on Robotics, 23(2):256–270, 2007.
[Yuan et al., 2009] J. Yuan, Z. Liu, and Y. Wu. Discriminative subvolume search for efficient action detection. In Proc. of CVPR. IEEE, 2009.