Simultaneous segmentation and recognition of gestures for human-machine interaction

Harold Vasquez, L. Enrique Sucar, Hugo Jair Escalante
Department of Computational Sciences
Instituto Nacional de Astrofísica, Óptica y Electrónica, Tonantzintla, 72840, Puebla, Mexico.
{hvasquez,esucar,hugojair}@inaoep.mx

Abstract

Human-activity and gesture recognition are two problems lying at the core of human-centric and ubiquitous systems: knowing what activities/gestures users are performing allows systems to execute actions accordingly. State-of-the-art technology from computer vision and machine intelligence allows us to recognize gestures at acceptable rates when gestures are segmented (i.e., each video contains a single gesture). In ubiquitous environments, however, continuous video is available and thus systems must be capable of detecting when a gesture is being performed and recognizing it. This paper describes a new method for the simultaneous segmentation and recognition of gestures from continuous video. A multi-window approach is proposed in which the predictions of several recognition models are combined, where each model is evaluated on a different segment of the continuous video. The proposed method is evaluated on the problem of recognizing gestures to command a robot. Preliminary results show that the proposed method is very effective for recognizing the considered gestures when they are correctly segmented, although there is still room for improvement in terms of its segmentation capabilities. The proposed method is highly efficient and does not require learning a model for no-gesture, in contrast to related methods.

1 Introduction

Human-computer interaction technology plays a key role in ubiquitous data mining (i.e., the extraction of interesting patterns from data generated in human-centric environments), see [Eunju, 2010]. Among all the alternative forms of interaction, gestures are among the most natural and intuitive for users. In fact, gestures are widely used to complement verbal communication between humans. Research advances in computer vision and machine learning have led to the development of gesture recognition technology that is able to recognize gestures at very acceptable rates [Aggarwal and Ryoo, 2011; Mitra, 2007]. However, most of the available methods for gesture recognition require gestures to be segmented before the recognition process begins [Aviles et al., 2011]. Clearly, this type of method is not well suited for ubiquitous systems (and real applications in general), where the recognition of gestures must be done from continuous video in real time [Eunju, 2010; Huynh et al., 2008].

This paper introduces a new approach for the simultaneous segmentation and recognition of gestures in continuous video. The proposed method implements a voting strategy using the predictions obtained from multiple gesture models evaluated at different time windows, see Figure 1. Windows are dynamically created by incrementally scanning the continuous video. When the votes from the multiple models favor a particular gesture, we segment the video and make a prediction: we predict the gesture corresponding to the model that obtained the majority of votes across windows.

Figure 1: Graphical illustration of the proposed approach. On the top we show a video sequence that can be divided into sections of no gesture (NG) and gesture, identified by the class of gesture (G1, G2, G3). Below we illustrate a series of windows that are dynamically created and extended every ∆ time units. That is, at the beginning W1 is created, then at t1, W2 is created and W1 is extended by ∆, and so on. At t5 there are 5 windows of different sizes; for each window we estimate the probability of all gestures using HMMs.

We use as features the body-part positions obtained by a Kinect sensor. As predictive model we use hidden Markov models (HMMs), one of the most widely used models for gesture recognition [Aviles et al., 2011; Aggarwal and Ryoo, 2011; Mitra, 2007]. The proposed method is evaluated on the problem of recognizing gestures to command a robot. Preliminary results show that the proposed method is very effective for recognizing the considered gestures when they are correctly segmented. However, there is still room for improvement in terms of its segmentation capabilities. The proposed method is highly efficient and does not require learning a model for no-gesture, in contrast to related works.

The rest of this paper is organized as follows. The next section briefly reviews related work on gesture spotting. Section 3 describes the proposed approach. Section 4 reports experimental results that show evidence of the performance of the proposed technique. Section 5 outlines preliminary conclusions and discusses future work directions.

2 Related work

Several methods for the simultaneous segmentation and recognition of gestures (a task also known as gesture spotting) have been proposed so far [Derpanis et al., 2010; Yuan et al., 2009; Malgireddy et al., 2012; Kim et al., 2007; Yang et al., 2007]. Some methods work directly with spatio-temporal patterns extracted from video [Derpanis et al., 2010; Yuan et al., 2009]. Although effective, these methods are very sensitive to changes in illumination, scale, appearance, and viewpoint. On the other hand, there are model-based techniques that use the position of body parts to train probabilistic models (e.g., HMMs) [Aggarwal and Ryoo, 2011; Mitra, 2007]. In the past, these types of methods were limited by the need for specialized sensors to obtain body-part positions. Nowadays, the availability of the Kinect (which can extract skeleton information in real time) has partially circumvented this limitation [Webb and Ashley, 2012].

Besides the data acquisition process, some of these methods require the construction of a no-gesture model (e.g., [Kim et al., 2007]) or a transition-gesture model (e.g., [Yang et al., 2007]). The goal of such models is to determine within a video when the user (if any) is not performing any gesture, or the transition between different gestures. Building a model for no-gesture is a complicated and subjective task that depends on the particular application where the gesture recognition system is to be implemented [Kim et al., 2007]. In ubiquitous systems, however, we want gesture recognition methods to work in very general conditions and under highly dynamic environments. Hence, a model for no-gesture is much more complicated to generate in these conditions.

Finally, it is worth mentioning that many of the available techniques for gesture spotting can be very complex to implement. This is a particularly important aspect to consider in some domains, for example mobile devices and/or human-robot interaction, where there are limited resources and restricted programming tools for the implementation of algorithms. Thus, in these domains simplicity is sometimes preferred at the expense of losing a little precision.

The method we propose in this paper performs segmentation and recognition of gestures simultaneously and attempts to address the limitations of most of the available techniques. Specifically, our proposal is efficient and very simple to implement; it is robust, to some extent, to problems present in appearance-based methods; and, more importantly, it does not require the specification of a no-gesture model.
3 Multiple-windows approach

We face the problem of simultaneously segmenting and recognizing gestures in continuous video. (Although we use processed body-part positions as features, we refer to the sequence of these features as video in order to simplify the explanation.) That is, given a sequence of images (video) we want to determine where a gesture is being performed (independently of the type of gesture) and then recognize which gesture is actually being performed. We propose a solution based on multiple windows that are incrementally and dynamically created. Each window is passed through predictive models, each trained to recognize a particular gesture. The predictions of the models for the different windows are accumulated; when the model for a particular gesture obtains a majority of votes, we segment the video and make a prediction, cf. Figure 1.

The underlying hypothesis of our work is that when a window covers a large portion of a particular gesture, the confidence in the prediction of the correct model will be high, while those of the other models will be low. Accumulating predictions allows us to be more confident that the gesture is being performed within a neighborhood of temporal windows.

The rest of this section describes the proposed technique in detail. First we describe the considered features, next the predictive models, and finally the approach to simultaneous segmentation and recognition of gestures.

3.1 Features

We use the information obtained through a Kinect as input to our gesture spotting method. The Kinect is capable of capturing RGB and depth video, as well as the positions of certain body parts, at rates of up to 30 frames per second (fps). In this work we consider gestures to command a robot that are performed with the hands. Therefore, we use the positions of the hands, as given by the Kinect, as features. For each frame we obtain a sextuple indicating the position of both hands in the x, y, and z coordinates. Since we consider standard hidden Markov models (HMMs) for classification, we had to preprocess the continuous data provided by the sensor. Our preprocessing consists in estimating tendencies: we compute the difference between the positions obtained in consecutive frames and codify it into two values, +1 when the difference is positive and 0 when the difference is zero or negative. Thus, the observations are sextuples of zeros and ones (the number of different observations is 2^6). These are the inputs to the HMMs.
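The preprocessing described above is simple to reproduce. The following sketch (our own illustration in Python with NumPy, not the authors' code; the column order of the input array is an assumption, since the paper only states that a sextuple of hand coordinates is used) turns the per-frame hand positions into the discrete observation symbols fed to the HMMs.

```python
import numpy as np

def encode_observations(hand_positions):
    """Convert hand trajectories into discrete HMM observation symbols.

    hand_positions: float array of shape (T, 6) with the x, y, z
    coordinates of the left and right hand for T consecutive frames
    (the column order is our assumption).
    Returns an int array of length T - 1 with symbols in {0, ..., 63}.
    """
    # Frame-to-frame differences (tendencies) for each coordinate.
    diffs = np.diff(hand_positions, axis=0)
    # +1 when the difference is positive, 0 when zero or negative.
    bits = (diffs > 0).astype(int)          # shape (T - 1, 6)
    # Pack the six binary values into one symbol: 2**6 = 64 observations.
    return bits @ (2 ** np.arange(6))
```

Each sequence thus becomes a string over a 64-symbol alphabet, matching the discrete observation space of the standard HMMs described next.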
3.2 Gesture recognition models

As classification model we consider an HMM (we used the HMM implementation from Matlab's Statistics Toolbox), one of the most popular models for gesture recognition [Aviles et al., 2011; Aggarwal and Ryoo, 2011; Mitra, 2007]. For each gesture i to be recognized we train an HMM; let M_i denote the HMM for the i-th gesture, with i ∈ {1, ..., K} when considering K different gestures. The models are trained with the Baum-Welch algorithm using complete sequences depicting (only) the gestures of interest. Each HMM was trained for a maximum of 200 iterations and a tolerance of 0.00001 (the training process stops when changes between the probabilities of successive transition/emission matrices do not exceed this value); the number of states in each HMM was fixed to 3, after some preliminary experimentation.

For making predictions we evaluate the different HMMs over the test sequence using the Forward algorithm, see [Rabiner, 1990] for details. We use the probability returned by each HMM as its confidence on the gesture class for a particular window.
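The paper relies on Matlab's HMM routines for training and scoring; as a rough, self-contained illustration of the scoring step only, the sketch below implements the scaled Forward algorithm for a discrete HMM (the parameter names pi, A, and B are ours). Its output, the log-likelihood of an observation window, plays the role of the per-window confidence P(M_k, W_i).

```python
import numpy as np

def forward_loglik(obs, pi, A, B):
    """Log-likelihood of a discrete observation sequence under an HMM.

    obs: int array of observation symbols (e.g., the encoded sextuples).
    pi:  initial state distribution, shape (S,).
    A:   state transition matrix, shape (S, S).
    B:   emission matrix, shape (S, V), with V = 64 symbols here.
    Uses per-step scaling to avoid numerical underflow.
    """
    alpha = pi * B[:, obs[0]]
    scale = alpha.sum()
    alpha = alpha / scale
    loglik = np.log(scale)
    for symbol in obs[1:]:
        alpha = (alpha @ A) * B[:, symbol]
        scale = alpha.sum()
        alpha = alpha / scale
        loglik += np.log(scale)
    return loglik
```

In the multi-window scheme, every trained model is scored on every active window with such a routine, and each window votes for the model with the highest score.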
3.3 Simultaneous segmentation and recognition

The multi-windows approach to gesture segmentation and recognition is as follows, see Figure 1. To process a continuous video we trigger windows incrementally: at time t0 a temporal window W0 of length ∆ is triggered and all of the (trained) HMMs are evaluated on this window. At time t1 we trigger another window W1 of length ∆ and extend window W0 by ∆ frames; the HMMs are evaluated on these two windows as well. This process is repeated until a certain condition is met (see below) or until the oldest window surpasses a maximum length, which corresponds to the maximum number of allowed simultaneous windows, q.

In this way, at time tg we have g windows of varying lengths and the outputs of the K HMMs for each window (i.e., a total of g × K probabilities, where K is the number of gestures or activities that the system can recognize). The outputs of the HMMs are given in the form of probabilities. To obtain a prediction for window i we simply keep the label/gesture corresponding to the model that obtains the highest probability in window i, that is, argmax_k P(M_k, W_i).

In order to detect the presence of a gesture in the continuous video, we estimate at each time t_j the percentage of votes that each of the K gestures obtains, considering the predictions for the j windows. If the percentage of votes exceeds a threshold, τ, we trigger a flag indicating that a gesture has been recognized. When the flag is on, we keep extending and generating windows and storing predictions until there is a decrement in the percentage of votes for the dominant gesture. That is, the end of the gesture is placed at the frame where the decrement in the number of votes occurs. Alternatively, we also experimented with varying the point at which we segment the gesture: we segmented the gesture 10 frames before and 10 frames after detecting the decrement in the percentage of votes; we report experimental results under these three settings in Section 4. At this instant the votes for each type of gesture are counted, and the gesture with the maximum number of votes is selected as the recognized gesture. Once a gesture is recognized, the system is reset; that is, all ongoing windows are discarded and the process starts again with a single window.

One should note that the fewer windows we consider for making a decision, the higher the chance of making a mistake. Therefore, we prevent the proposed technique from making predictions before having analyzed at least p windows. Under these settings, our proposal only tries to segment and recognize gestures when the number of windows/predictions is between p and q.
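To make the window bookkeeping concrete, the sketch below is our reading of the procedure just described (it reuses forward_loglik and the encoded observation stream from the previous sketches; the exact semantics of the vote threshold τ and of the window limit q are our interpretation, and the before/after segmentation variants are omitted). Every ∆ frames a new window is opened and all open windows are extended and re-scored; once at least p windows agree on a gesture with a vote share of at least τ percent, a drop in that share is taken as the end of the gesture.

```python
import numpy as np

def spot_gestures(obs, models, delta=10, p=30, q=60, tau=100.0):
    """Simultaneous segmentation and recognition over an observation stream.

    obs:    encoded observation symbols for the whole continuous video.
    models: list of K tuples (pi, A, B), one trained HMM per gesture.
    Returns a list of (end_frame, gesture_index) detections.
    """
    detections = []
    windows = []            # start frames of the currently open windows
    prev_share, flag = 0.0, False
    t = 0
    while t + delta <= len(obs):
        t += delta
        windows.append(t - delta)      # open a new window every delta frames
        windows = windows[-q:]         # keep at most q simultaneous windows
        # Each window votes for the model with the highest Forward score.
        votes = [int(np.argmax([forward_loglik(obs[s:t], pi, A, B)
                                for (pi, A, B) in models]))
                 for s in windows]
        counts = np.bincount(votes, minlength=len(models))
        dominant = int(np.argmax(counts))
        share = 100.0 * counts[dominant] / len(votes)
        if len(windows) >= p and share >= tau:
            flag = True                # a gesture is likely being performed
        elif flag and share < prev_share:
            # Decrement in the vote share: taken as the end of the gesture.
            detections.append((t, dominant))
            windows, flag, prev_share = [], False, 0.0
            continue
        prev_share = share
    return detections
```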
Figure 2 illustrates the process of simultaneous segmentation and recognition for a particular test sequence containing one gesture. The first three plots show the probabilities returned by the HMMs for the three gestures; we show the probabilities for windows starting at different frames of the continuous sequence. The fourth plot shows the percentage of votes for a particular gesture at different segments of the video. For this particular example, the proposed approach is able to segment the gesture correctly (the boundaries of the gesture present in the sequence are shown in gray). In the next section we report experimental results obtained with our method for simultaneous segmentation and recognition of gestures.

Figure 2: Multi-windows technique in action. The first three plots show the probabilities obtained by each HMM for windows starting at different times. In the bottom-right plot we show the number of votes obtained by the dominant HMM; note that the number of votes starts to diminish, which is taken as an indication of the end of the gesture (best viewed in color).

4 Experimental results

We performed experiments with the multi-windows approach by trying to recognize gestures to command a robot. Specifically, we consider three gestures: move-right (MR), attention (ATTN), and move-left (ML); these are illustrated in Figure 3. For evaluation we generated sequences of gestures of varying lengths and applied our method. The numbers of training and testing gestures are shown in Table 1. Training gestures were manually segmented. Test sequences are not segmented; they contain a single gesture, but the gesture is surrounded by large portions of continuous video without a gesture, see Figure 2.

Figure 3: The three gestures considered for experimentation. From left to right: move-right, attention, move-left.

Table 1: Characteristics of the data set considered for experimentation. We show the number of training videos per gesture and, in row two, the number of gestures present in the test sequences.

                   MR    ATTN    ML
  Training vids.   30     30     30
  Testing vids.    18     18     21

Three different subjects recorded the training videos. The test sequences were recorded by six subjects (three of whom were different from those who recorded the training ones). The skeleton information was recorded with the NUI Capture software (http://nuicapture.com/) at a rate of 30 fps. The average duration of the training gestures was 35.33 frames, whereas the average duration of the test sequences was 94 frames (maximum and minimum durations were 189 and 55 frames, respectively).

All of the parameters of our model were fixed after preliminary experimentation. The best values we found are as follows: ∆ = 10, p = 30, q = 60, τ = 100. After training the HMMs individually, we applied the multi-windows approach to each of the test sequences.

We evaluate the segmentation and recognition performance as follows. We say the proposed method correctly segments a video when the segmentation prediction is at a distance of δ frames (or less) from the final frame of the gesture; we report results for δ = 5, 10, 15, 20. On the other hand, we say the proposed method correctly recognizes a gesture when the gesture predicted by our method (for a previously segmented gesture) was the correct one.
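The evaluation criterion just defined can be stated compactly. The helper below (our formulation; in particular, measuring recognition over the correctly segmented cases is our reading of the text) counts a prediction as a correct segmentation when its end frame lies within δ frames of the annotated end of the gesture.

```python
def evaluate(detections, ground_truth, delta=5):
    """Segmentation and recognition rates under the delta-frame criterion.

    detections:   list of (end_frame, gesture_index), one per test sequence.
    ground_truth: list of (true_end_frame, true_gesture_index), same order.
    """
    segmented = recognized = 0
    for (end, label), (true_end, true_label) in zip(detections, ground_truth):
        if abs(end - true_end) <= delta:
            segmented += 1
            recognized += int(label == true_label)
    seg_rate = segmented / len(ground_truth)
    rec_rate = recognized / segmented if segmented else 0.0
    return seg_rate, rec_rate
```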
Table 2: Segmentation (Seg.) and recognition (Rec.) performance of the multi-windows technique.

          Before             In               After
  δ     Seg.     Rec.     Seg.     Rec.     Seg.     Rec.
  5    29.82%   82.35%   26.32%   60.00%   26.32%   80.00%
 10    54.39%   67.74%   63.16%   66.67%   50.88%   68.97%
 15    59.65%   64.71%   70.18%   67.50%   56.14%   68.75%
 20    78.95%   62.22%   80.70%   63.04%   73.68%   66.67%

Table 2 shows the segmentation and recognition performance obtained by the multi-windows approach. We report results when segmenting the gesture before, in, and after the point where the decrement in the percentage of votes is detected, see Section 3. From Table 2 it can be observed that segmentation performance is low under a hard criterion (i.e., δ = 5 frames of distance); the highest performance in this setting was 29.82%. However, the recognition performance is quite high for the same configuration, achieving recognition rates of 82.35%. Thus, the method offers a good tradeoff between segmentation and recognition performance. (Although segmentation performance may seem low, one should note that for the considered application it is not too bad for a user to repeat a gesture up to three times in order for the robot to correctly identify the intended command. What is required instead is an accurate recognizer, so that the robot clearly understands the command even when the user has to repeat the gesture a couple of times.)

In order to determine how good or bad our recognition results were, we performed an experiment in which we classified all of the gestures in the test sequences after manually segmenting them (top line). The average recognition performance for that experiment was 85.96%. This represents the best recognition performance we could obtain with the considered features and trained models. Looking at our best recognition result (column Before, row 1), we can see that the recognition performance of the multi-windows approach is very close to what we would obtain when classifying segmented gestures.

As expected, segmentation performance improves when we relax the distance to the boundaries of the gesture (i.e., for increasing δ). When the allowed distance is δ = 20 frames, we were able to segment up to 80% of the gestures. Recognition rates decreased accordingly. When we compare the segmentation performance obtained when segmenting the gesture before, in, or after the decrement of votes, we find that the performance is very similar, although segmenting the gesture 10 frames before the detected decrement seems to be a slightly better option. This makes sense, as we would expect to see a decrement of votes when the gesture has already finished.

Regarding efficiency, in preliminary experiments we have found that the proposed method can run in near real time. On a state-of-the-art workstation it can process data at a rate of 30 fps, which is enough for many human-computer interaction tasks. Nevertheless, we still have to perform a comprehensive evaluation of our proposal in terms of efficiency, taking into account that in some scenarios high-performance computers are not available.

From the experimental study presented in this section we can conclude that the proposed method is a promising solution to the problem of simultaneous gesture segmentation and recognition. The simplicity of implementation and the efficiency of our approach are beneficial for the development of ubiquitous and human-centric systems.

5 Conclusions and future work directions

We proposed a new method for the simultaneous segmentation and recognition of gestures in continuous video. The proposed approach combines the outputs of classification models evaluated in multiple temporal windows. These windows are dynamically and incrementally created as the video is scanned. We report preliminary results obtained with the proposed technique for segmenting and recognizing gestures to command a robot. Experimental results reveal that the recognition performance of our method is very close to that obtained when using manually segmented gestures. The segmentation performance of our proposal is still low, yet the current performance is acceptable for the considered application. The following conclusions can be drawn so far:

• The proposed method is capable of segmenting gestures (with an error of 5 frames) at low-to-mild rates. Nevertheless, these rates are accurate enough for some applications. Recall that we analyze a continuous sequence of video and that we do not require a model for no-gesture, as related methods do.

• The recognition rates achieved by the method are acceptable for a number of applications and domains. In fact, recognition results were very close to what we would obtain when classifying manually segmented gestures.

• The proposed method is very easy to implement and can work in near real time; hence it is readily applicable in ubiquitous data mining and human-centric applications.

The proposed method can be improved in several ways, but it remains to be compared to alternative techniques. In this respect we have already implemented the method from [Kim et al., 2007], but its results are poor in comparison with our proposal. We are looking for alternative methods against which to compare our proposal.

Current and future work includes extending the number of gestures considered in this study and implementing the method in the robot of our laboratory (http://ccc.inaoep.mx/~markovito/). Additionally, we are working on different ways to improve the segmentation performance of our method, including using different voting schemes to combine the outputs of the different windows.

References

[Aggarwal and Ryoo, 2011] J. K. Aggarwal and M. S. Ryoo. Human activity analysis: a review. ACM Computing Surveys, 43(3):16, 2011.

[Aviles et al., 2011] H. H. Aviles, L. E. Sucar, C. E. Mendoza, and L. A. Pineda. A comparison of dynamic naive Bayesian classifiers and hidden Markov models for gesture recognition. Journal of Applied Research and Technology, 9(1):81–102, 2011.

[Derpanis et al., 2010] K. G. Derpanis, M. Sizintsev, K. Cannons, and R. P. Wildes. Efficient action spotting based on a spacetime oriented structure representation. In Proc. of CVPR, pages 1990–1997. IEEE, 2010.

[Eunju, 2010] K. Eunju. Human activity recognition and pattern discovery. Pervasive Computing, 9(1):48–53, 2010.

[Huynh et al., 2008] T. Huynh, M. Fritz, and B. Schiele. Discovery of activity patterns using topic models. In Proc. of UbiComp'08, pages 10–19. ACM Press, 2008.

[Kim et al., 2007] D. Kim, J. Song, and D. Kim. Simultaneous gesture segmentation and recognition based on forward spotting accumulative HMMs. Pattern Recognition, 40(11):3012–3026, 2007.

[Malgireddy et al., 2012] M. R. Malgireddy, I. Nwogu, and V. Govindaraju. A temporal Bayesian model for classifying, detecting and localizing activities in video sequences. In Proc. of CVPRW, pages 43–48, 2012.

[Mitra, 2007] S. Mitra. Gesture recognition: a survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C, 37(3):311–324, 2007.

[Rabiner, 1990] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. In Readings in Speech Recognition, pages 267–296. Morgan Kaufmann, 1990.

[Webb and Ashley, 2012] J. Webb and J. Ashley. Beginning Kinect Programming with the Microsoft Kinect SDK. Apress, 2012.

[Yang et al., 2007] H. D. Yang, A. Y. Park, and S. W. Lee. Gesture spotting and recognition for human-robot interaction. IEEE Transactions on Robotics, 23(2):256–270, 2007.
[Yuan et al., 2009] J. Yuan, Z. Liu, and Y. Wu. Discriminative subvolume search for efficient action detection. In Proc. of CVPR. IEEE, 2009.