Classification of Movement Quality in a Weight-Shifting Exercise

Vonstad, Elise Klæbo1, Su, Xiaomeng1, Vereijken, Beatrix2, Nilsen, Jan Harald1, Bach, Kerstin1
1 Norwegian University of Science and Technology, Department of Computer Science
2 Norwegian University of Science and Technology, Department of Neuromedicine and Movement Science

Abstract

In exercise games, it is often possible to gain rewards, i.e. points, by only partly completing an intended movement, which can undermine the effect of using such games for exercise. To ensure usability and reliability of exergames, correct movements must be accurately identified. The aim of the current study was to evaluate the performance of machine learning models in classifying weight-shifting movements as correct or incorrect. Eleven healthy elderly (6 F) performed a stepping exercise in a correct (with weight shift) and an incorrect (without weight shift) version. A 3D Motion Capture (3DMoCap) system calculated joint center positions (JCPs); 2270 repetitions (1133 correct) were included in the analysis. Random Forest (RF), k-Nearest Neighbor (k-NN) and Support Vector Machine (SVM) classification models were built. Models were evaluated with 10-fold leave-one-group-out cross-validation (CV), repeated for all persons. Results showed high accuracy and recall in all classifiers. Average accuracy and recall were RF = 0.989, k-NN = 0.949, SVM = 0.958. Highest was RF on shoulder JCPs, and SVM on all JCPs (both 0.996). Lowest was k-NN on ankle JCPs (0.879). This study shows that all three models can distinguish correct and incorrect repetitions with high accuracy and recall, also when using selected JCPs. RF consistently outperformed the other models.

1 Introduction

Exercise games, or exergames, are games played on a computer screen that use bodily movements as input to interact with the game. This form of exercising is gaining popularity and attention from both researchers and therapists. In recent years, it has been shown that doing exercises elicited by games is a more motivating and fun way of exercising than conventional exercise programs, while being as effective as conventional exercise when used in cooperation with therapists [Nicholson et al., 2015], [Skjaeret et al., 2016]. This is encouraging with respect to the increasing number of elderly in the population, as we might utilize exergames as a tool to promote self-management of exercise in people of older age.

Exergames for elderly might decrease the load on the health care system in the coming years in two ways: by preventing or reducing loss of independence due to reduced physical function, and by empowering elderly to exercise effectively without having to travel to a therapist or training center for supervised exercise. Exergames are fun and motivating partially because they provide additional, extrinsic motivation to complete a movement: points or a score in the game. Because people differ in body shape and size, the game system needs to accept a wide variety of movements to allow different players to play the game. This also means that in many situations, it is possible to gain points without doing the complete exercise movement intended, or by doing just a small version of the movement, as reported in e.g. [Pasch et al., 2009]. People quickly catch on that this is possible: they learn how to cheat. Such incorrectly performed exercise repetitions undermine the effect of exergaming, as they might make the quality of the exercise poorer and give lower gains in skill or function than could be expected if the exercise was performed correctly. Apart from being less effective, this can also be dangerous, as over-estimation of one’s own skill is related to increased fall risk in elderly [Sakurai et al., 2013]. For exergames to be effective and useful, it is vital that they can accurately identify the performance of an exercise repetition as being correct or incorrect.

To enable such classification, accurate tracking of movement while exergaming is a prerequisite. As the usability and accuracy of different measurement devices vary, finding a trade-off that gives good enough measurement accuracy while being user friendly is especially challenging. The gold standard for motion capture accuracy, marker-based 3D Motion Capture (e.g. Vicon Motion Systems Ltd) camera systems, gives very accurate measurements of body movements, but such systems are expensive and require a fixed (laboratory) setting and expert users. Currently, the most promising alternative measurement methods are the marker-less time-of-flight (ToF)/depth camera systems such as the Kinect v2 (Microsoft Inc), and inertial measurement unit (IMU) systems such as the Xsens (Xsens Technologies B.V.). These are easy to use, portable and low-cost, but do not give as accurate full-body measurements as the 3DMoCap systems, especially when measuring hands and feet [van Diest et al., 2014]. ToF camera systems usually utilize a skeleton model based on the 3D cloud mapping of a person to analyze movements, where joint center positions (JCPs) are calculated and used in analyses. Using JCPs, it is possible to represent the person being tracked with enough information to identify different activities [Gaglio et al., 2015], analyze postural stability [Dehbandi et al., 2017] or use the positions as input to a video-based game [Shih et al., 2016]. The ToF-based systems show promising results regarding accuracy of measuring torso/upper body movements, as their discrepancy from a 3DMoCap system is reported to be within acceptable ranges [Bonnechère et al., 2014], [Matsen et al., 2016]. Still, others warn about limitations in measurements of shoulder movements when comparing to goniometers [Huber et al., 2015].

The aim of the current study was to assess the performance of ML classifiers. In order to capture the participants’ full-body movements as accurately as possible, we used a 3DMoCap system to measure high-quality movement data, to ensure that the classification was performed on the actual movements the participants performed. Furthermore, as JCPs are commonly used in more user-friendly measurement devices, we chose to use them as input to the classification models in the current study, possibly allowing insight into whether data from ToF/depth cameras could be used as input to classification models in the future.

As there are several ways to successfully classify the type of movement being performed using machine learning, we hypothesized that it is feasible to use learning algorithms to analyze whole-body movement patterns to classify whether a detected movement was performed correctly or not. Thus, this paper aims to investigate the classification performance of three common classification algorithms on JCP 3DMoCap data from a weight-shifting balance task performed correctly or incorrectly.

2 Related Work

In movement analysis, machine learning has been used mostly on data from sensors that track persons outside of the lab, as data from e.g. inertial measurement units is challenging to analyze with traditional methods. ML analysis methods have been used in, for example, activity recognition [Mukhopadhyay, 2014], [Lara and Labrador, 2013], and in identification of falls [Aziz et al., 2012] using data from IMUs. Furthermore, IMUs have been used in classification of movement performance in adults [Giggins et al., 2014], although that study only reached medium-to-good classification accuracy. In [Yurtman and Barshan, 2014] a complete system of movement detection and error classification concerning movement amplitude was implemented using wired IMUs to record movement during physiotherapy exercises, with good results. One study used machine learning to evaluate movement quality in exercises performed by children, using smart-phone IMU sensors to measure movements and using natural fatigue as a mechanism to produce wrong performances [Carvalho and Furtado, 2016]. Lo Presti et al. [Lo Presti and La Cascia, 2016] showed a wide range of ML methods being used on identification of human actions using ToF/depth cameras, with good results, but did not report any studies that aimed to classify the quality of detected movements. The use of ML methods on data from 3DMoCap measurement systems has also increased in recent years, but these are mostly used to identify human actions and not to assess the quality of movements. For example, ML models were successfully used to discriminate between e.g. jumping and walking in a continuous stream of MoCap data [Kapsouras and Nikolaidis, 2014]. To our knowledge, research is scarce on automatic classification of movement quality measured using high-quality JCP data obtained from 3DMoCap systems.

3 Approach

3.1 Data set

As there are no open data sets containing labelled weight-shifting balance exercises, we conducted a data collection to obtain a labelled training data set. Collection of time series data was conducted in November 2017 using a 10-camera, 100 Hz 3DMoCap system (Vicon Motion Systems Ltd). Simultaneous ground reaction force (GRF) data was collected using a 1000 Hz force plate (Kistler Inc) embedded in the floor, and digital video in sagittal view was recorded for quality control purposes. Reflective markers were placed according to the Plug-in-Gait full-body biomechanical model, with head and hand markers excluded. Eleven participants were recruited from local exercise groups for elderly. There were 6 females and 5 males, and mean age was 69.3 years (SD 4.0).

Participants performed two versions of a balance exercise movement common in stroke rehabilitation (as seen in e.g. [Okubo et al., 2016]). Both versions had the same starting position (Figure 1a), with both feet placed on the force plate. The red arrow originating at the feet of the participant represents the 3D ground reaction force (GRF). In the “correct” performance of the movement, the right foot was placed in front of the person, off the force plate, and body weight was shifted over to the right foot while keeping the left foot in contact with the force plate (as seen in Figure 1b, where the remaining GRF on the left foot is small), before moving the right foot back to the force plate. In the “incorrect” version of the movement, the same step was performed, but the person did not shift body weight over to the right foot when they took the step (as seen in Figure 1c, where the GRF on the left foot is large). These movement patterns were chosen as they are typical ways of performing this weight-shifting exercise correctly and incorrectly, as described and demonstrated by a physical therapist experienced in stroke rehabilitation. Participants were instructed orally on how to perform these movements with and without weight shift, but were encouraged to move in a way that was natural to them.

Figure 1: (a) Start and end position of the movement. (b) Correct performance, with weight shift. (c) An example of an incorrect performance, without weight shift.

One repetition was one completion of such a movement: from the moment the person was standing in the starting position, through taking the step, until the person had the right foot back in the starting position. During one trial, 10 repetitions were completed in sequence. Each round of 10 repetitions was performed three times, producing a 3x10 block of repetitions to mimic a normal sequence of exercising. To reduce the risk of fatigue from repeating the same movement many times during the test session, test persons first performed two 3x10 blocks of repetitions in the correct version of the movement, then had a 5-minute break and completed two 3x10 blocks of the incorrect version. This was then repeated so that each person completed 240 repetitions in total: 120 repetitions of each version of the movement. Data from 11 persons were collected, with one person only completing half of the test protocol. This resulted in 2520 recorded repetitions.

3.2 Pre-processing and feature extraction

Figure 2 shows the data processing model used to analyze the data. Marker data was first quality checked in the Vicon Nexus software, and missing position data from markers were gap-filled using the built-in algorithms. JCP time series data was extracted from the Plug-in-Gait biomechanical model. Some repetitions were not included due to participants doing a different movement (e.g. loss of balance, side-stepping), or due to partial capture of repetitions at the beginning or end of a trial. This resulted in JCP time series data from 2270 repetitions being included for further analysis, 1133 correct and 1137 incorrect. Statistical features from each JCP time series were computed: these included mean, median, standard deviation, sum, variance, minimum and maximum values.

Figure 2: Data flow model

3.3 Test-train-split

Using the scikit-learn library [Pedregosa et al., 2012], the data was split into training and test sets, where the Leave-One-Group-Out Cross-Validation (LOGOCV) method was used to exclude data from one person and use it as the test set in each iteration. This is a suitable method in the exercise domain, where it is likely that a model would be trained on data from people other than the current player being evaluated for correct/incorrect repetitions.

3.4 Classification models

A random forest (RF, n_estimators = 10) classifier, a k-nearest neighbor (k-NN, k = 10) classifier and a support vector machine (SVM, kernel = polynomial) classifier were trained and tested, using the scikit-learn library, in each iteration of the train-test-split. Hyperparameters were not tuned due to the success of the initial parameter settings. Results were obtained as confusion matrices, from which accuracy and recall were reported. Alongside overall accuracy, recall was chosen as a primary outcome measure as it is vital in this setting.

4 Results

Table 1 shows average accuracy from all LOGOCV iterations for classification of incorrect and correct repetitions by the three classifiers. Overall, results show that all three classification models achieved high accuracy, around 95 % or above, in almost all classifications. The RF and SVM models achieved the highest accuracies, with 99.6 % on shoulder JCPs and all JCPs, respectively. Lowest accuracy was reached by the k-NN model on data from ankle JCPs, 87.9 %.

        Random Forest   k-NN     SVM      Avg
All     99.0 %          96.8 %   99.6 %   98.5 %
SHO     99.6 %          96.4 %   96.2 %   97.4 %
HIP     99.2 %          96.8 %   92.1 %   96.0 %
KNE     97.5 %          96.6 %   94.1 %   96.1 %
ANK     99.3 %          87.9 %   96.8 %   94.7 %
Avg     98.9 %          94.9 %   95.8 %   96.5 %

Table 1: Accuracy of classifiers for the different joint centre positions.

Recall results (Figures 3 and 4) showed that all three models achieved more than 90 % recall in most JCP selections, for both correct and incorrect repetitions. Figure 3 shows recall for correct repetitions by all classifiers, in each of the JCP selections. RF consistently achieved >95 % recall, being the most consistent of the three models across the different JCP selections. Average recall of correct repetitions was 98.9 % for RF, 94.4 % for k-NN and 96.0 % for SVM. The SVM model performed best of the three on recall of correct repetitions on data from all JCPs, but also had the most variable performance in the other JCP selections. k-NN reached around 95 % on all JCP selections except ankle JCPs, where it was the overall worst performing model of the three.

Figure 3: Recall for correct repetitions by all classifiers on all JCPs, shoulder (SHO), hip (HIP), knee (KNE) and ankle (ANK) JCPs.

Figure 4 shows recall for incorrect repetitions by all classifiers, in each of the JCP selections. Again, RF was most consistent with an average of 99.0 %, while k-NN and SVM achieved 95.2 % and 95.6 %, respectively. k-NN had the lowest recall for incorrect repetitions of any model and JCP selection, with 85.8 % on data from ankle JCPs. All three models had the highest recall when using data from all JCPs, although recall from using JCP selections, especially shoulder JCPs, was also high.

Figure 4: Recall for incorrect repetitions by all classifiers on all JCPs, shoulder (SHO), hip (HIP), knee (KNE) and ankle (ANK) JCPs.

5 Discussion

This paper aimed to evaluate the performance of three ML classification models in classifying correctly and incorrectly performed repetitions of a weight-shifting exercise, using JCPs measured with a 3DMoCap system. Performance of a Random Forest, a k-Nearest Neighbor and a Support Vector Machine model was evaluated. Results indicated that all three models are able to distinguish between incorrect and correct repetitions with high accuracy and recall (with an average accuracy of 98.9 %, 94.9 % and 95.8 %, respectively). Results from the current study are similar to those seen in [Gaglio et al., 2015] and in [Liu et al., 2017], where novel methods were used to classify activities using JCPs from Kinect, outperforming other approaches on the same data set. However, these results are not directly comparable to results in the current study, as the mentioned studies are not concerned with movement quality but with movement type. Compared to other studies on movement quality (e.g. [Giggins et al., 2014], [Yurtman and Barshan, 2014]), which are based on data from IMUs, the achieved accuracy in the current study is higher. This is possibly an effect of the movements in this study being instructed, and of the movements in these other studies being more complex and varied. Also, the IMU data might not represent the movements as accurately as the 3DMoCap data does.

Using all JCPs in the classification reached, on average, marginally higher accuracy than using any of the JCP selections, as seen in Table 1. The RF model was, on average, slightly more accurate than the other two models, for both accuracy and recall. In light of the issue of avoiding in-game rewards for incorrect performance, recall of incorrect repetitions is a vital score here. The RF model achieved >95 % recall in all JCP selections. The k-NN and SVM models also achieved high recall, but were not as consistent across JCP selections as the RF model. Other studies using JCPs typically use all joints, or only joints that are tracked with good accuracy during the whole capture, as seen in [Gaglio et al., 2015]. Therefore, the results from classification of movement quality using JCP selections in the current study might not be comparable to results from selected JCPs in other studies.

Results also reflect that the data from incorrect and correct repetitions were very different, as all three models accurately distinguished between them. The oral instructions might have contributed to this, as the instructions probably influenced the movement patterns. Spontaneous, natural movements might be more variable than what was seen in this data set. Also, the correct movements were performed with more upper-body movement towards the stepping foot, and the heel of the stance foot was also lifted from the force plate. Furthermore, data from only the ankle JCPs were also classified with >80 % accuracy and recall by all models, which was not expected as both movements include similar stepping movements in the feet. The movements of the feet alone were different enough in the correct and incorrect repetitions to enable accurate classification, which might be a result of the aforementioned heel-lifts seen only in the correct trials. This probably resulted in more variable JCPs during correct repetitions, enabling the ML models to accurately identify them.

Using ML models for the purpose of evaluating movement quality using data from ToF/depth cameras seems feasible given the very good performance achieved here. Furthermore, the good performance achieved in this study indicates that the models can possibly reach acceptable accuracy and recall also with lower-quality data. This can facilitate implementation of ML models into more user-friendly exergaming contexts. Recall results in classification of both correct and incorrect repetitions are very encouraging for applying ML in analysis of movements during exergaming, as this could make it harder for the player to receive rewards without performing the intended movement correctly. However, as the current movements were not elicited by an actual exergame, it remains to be determined whether a similar level of accuracy can be achieved in more realistic exergaming movements. Furthermore, the high accuracy in all JCP selections suggests that it might be feasible to use only the more accurate measurements of shoulder or hip JCPs from ToF/depth cameras, and still accurately identify correct and incorrect repetitions of a weight-shifting exercise. This could provide a way of using ML in exergames to more accurately reward movements during play, thus ensuring movement quality to a greater extent than the existing systems do.

Future work will focus on the use of ML models in actual exergame situations, as this possibly elicits movements that are noisier than in the current study, hence making the repetitions more difficult to classify as being incorrect or correct. Using motion capture systems with lower accuracy, and only using e.g. shoulder JCPs as input to the classification models, would also be interesting to test in an actual exergaming setting, to see if the movements are still different enough to be classified as being correctly or incorrectly performed with similar accuracy to this study.

6 Conclusion

In order to use exergames effectively as a training and rehabilitation tool, it is crucial that the exergame system can identify correct and incorrect exercise repetitions accurately. This paper shows that it is feasible to use ML models in the automatic classification of correctly and incorrectly performed weight-shifts in balance exercises. Applying ML models on high-quality JCP movement data from a weight-shifting exercise yielded accurate classification of correct and incorrect exercise repetitions. Results encourage the testing of such models on JCP data obtained while elderly are playing actual exergames, to investigate whether the models are equally accurate in a more natural and possibly noisier setting. However, the current study was done in a setting where the performance of repetitions was instructed, and the movements performed during actual exergaming (for example the movement pattern of an incorrectly performed repetition) might differ from the movements performed here. The study also shows that using only selected JCPs yields accurate results as well, which is promising with regard to possible use of ML models on data from data capture methods that are lower cost and more user friendly.

References

[Aziz et al., 2012] O. Aziz, E. J. Park, G. Mori, and S. N. Robinovitch. Distinguishing near-falls from daily activities with wearable accelerometers and gyroscopes using Support Vector Machines. Conference proceedings: IEEE Engineering in Medicine and Biology Society. Annual Conference, 2012:5837–5840, 2012.
[Bonnechère et al., 2014] B. Bonnechère, B. Jansen, P. Salvia, H. Bouzahouene, L. Omelina, F. Moiseev, V. Sholukha, J. Cornelis, M. Rooze, and S. Van Sint Jan. Validity and reliability of the Kinect within functional assessment activities: Comparison with standard stereophotogrammetry. Gait and Posture, 39(1):593–598, 2014.
[Carvalho and Furtado, 2016] L. D. Carvalho and V. Furtado. Using machine learning for evaluating the quality of exercises in a mobile exergame for tackling obesity in children. Proceedings of SAI Intelligent Systems Conference (IntelliSys), 15, 2016.
[Dehbandi et al., 2017] B. Dehbandi, A. Barachant, A. H. Smeragliuolo, J. D. Long, S. J. Bumanlag, V. He, A. Lampe, and D. Putrino. Using data from the Microsoft Kinect 2 to determine postural stability in healthy subjects: A feasibility trial. PloS one, 12(2):e0170890, 2017.
[Gaglio et al., 2015] S. Gaglio, G. Lo Re, and M. Morana. Human Activity Recognition Process Using 3-D Posture Data. IEEE Transactions on Human-Machine Systems, 45(5):586–597, 2015.
[Giggins et al., 2014] O. M. Giggins, K. T. Sweeney, and B. Caulfield. Rehabilitation exercise assessment using inertial sensors: a cross-sectional analytical study. Journal of NeuroEngineering and Rehabilitation, pages 1–10, 2014.
[Huber et al., 2015] M. E. Huber, A. L. Seitz, M. Leeser, and D. Sternad. Validity and reliability of Kinect skeleton for measuring shoulder joint angles: A feasibility study. Physiotherapy (United Kingdom), 101(4):389–393, 2015.
[Kapsouras and Nikolaidis, 2014] I. Kapsouras and N. Nikolaidis. Action recognition on motion capture data using a dynemes and forward difference representation. Proceedings - International Conference on Pattern Recognition, 25:2649–2654, 2014.
[Lara and Labrador, 2013] Oscar D. Lara and Miguel A. Labrador. A Survey on Human Activity Recognition using Wearable Sensors. IEEE Communications Surveys & Tutorials, 15(3):1192–1209, 2013.
[Liu et al., 2017] Jun Liu, Amir Shahroudy, Dong Xu, Alex Kot Chichung, and Gang Wang. Skeleton-Based Action Recognition Using Spatio-Temporal LSTM Network with Trust Gates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[Lo Presti and La Cascia, 2016] Liliana Lo Presti and Marco La Cascia. 3D skeleton-based human action classification: A survey. Pattern Recognition, 53:130–147, 2016.
[Matsen et al., 2016] F. A. Matsen, Al. Lauder, K. Rector, P. Keeling, and A. L. Cherones. Measurement of active shoulder motion using the Kinect, a commercially available infrared position detection system. Journal of Shoulder and Elbow Surgery, 25(2):216–223, 2016.
[Mukhopadhyay, 2014] S. C. Mukhopadhyay. Wearable sensors for human activity monitoring: A review. IEEE Sensors Journal, 15(3):1321–1330, 2014.
[Nicholson et al., 2015] V. P. Nicholson, M. McKean, J. Lowe, C. Fawcett, and B. Burkett. Six weeks of unsupervised Nintendo Wii Fit gaming is effective at improving balance in independent older adults. Journal of Aging and Physical Activity, 23(1):153–158, 2015.
[Okubo et al., 2016] Y. Okubo, D. Schoene, and S. R. Lord. Step training improves reaction time, gait and balance and reduces falls in older people: a systematic review and meta-analysis. British Journal of Sports Medicine, 2016.
[Pasch et al., 2009] Marco Pasch, Nadia Bianchi-Berthouze, Betsy van Dijk, and Anton Nijholt. Movement-based sports video games: Investigating motivation and gaming experience. Entertainment Computing, 1(2):49–61, 2009.
[Pedregosa et al., 2012] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and É. Duchesnay. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2012.
[Sakurai et al., 2013] R. Sakurai, Y. Fujiwara, M. Ishihara, T. Higuchi, H. Uchida, and K. Imanaka. Age-related self-overestimation of step-over ability in healthy older adults and its relationship to fall risk. BMC Geriatrics, 13(1):15–17, 2013.
[Shih et al., 2016] Meng Che Shih, Ray Yau Wang, Shih Jung Cheng, and Yea Ru Yang. Effects of a balance-based exergaming intervention using the Kinect sensor on posture stability in individuals with Parkinson’s disease: A single-blinded randomized controlled trial. Journal of NeuroEngineering and Rehabilitation, 13(1):1–9, 2016.
[Skjaeret et al., 2016] Nina Skjaeret, Ather Nawaz, Tobias Morat, Daniel Schoene, Jorunn Laegdheim, and Beatrix Vereijken. Exercise and rehabilitation delivered through exergames in older adults: An integrative review of technologies, safety and efficacy. International Journal of Medical Informatics, 85(1):1–16, 2016.
[van Diest et al., 2014] Mike van Diest, Jan Stegenga, Heinrich J. Wörtche, Klaas Postema, Gijsbertus J. Verkerke, and Claudine J.C. Lamoth. Suitability of Kinect for measuring whole body movement patterns during exergaming. Journal of Biomechanics, 47(12):2925–2932, 2014.
[Yurtman and Barshan, 2014] A. Yurtman and B. Barshan. Automated evaluation of physical therapy exercises using multi-template dynamic time warping on wearable sensor signals. Computer Methods and Programs in Biomedicine, 117(2):189–207, 2014.
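
The evaluation protocol described in Sections 3.2 to 3.4 (seven statistical features per time series, leave-one-person-out splitting, and accuracy and recall as outcome measures) can be illustrated with a minimal sketch. It is not the code used in the study: it relies only on the Python standard library, replaces the scikit-learn classifiers with a hand-written k-NN (k = 10), and runs on synthetic one-dimensional trajectories rather than JCP data. All function names, the synthetic data generator, and the choice of population (rather than sample) standard deviation and variance are assumptions made for illustration.

```python
import math
import random
import statistics
from collections import Counter


def statistical_features(series):
    """Seven summary statistics of one time series (e.g. one JCP coordinate)."""
    return [
        statistics.fmean(series),
        statistics.median(series),
        statistics.pstdev(series),     # population SD: an assumption
        math.fsum(series),
        statistics.pvariance(series),  # population variance: an assumption
        min(series),
        max(series),
    ]


def leave_one_group_out(groups):
    """Yield (held_out, train_idx, test_idx), holding out one person at a time."""
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, test


def knn_predict(train_X, train_y, x, k=10):
    """Majority vote among the k nearest training samples (Euclidean distance)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def accuracy_and_recall(y_true, y_pred, positive):
    """Return (accuracy, recall of the class named by `positive`)."""
    pairs = list(zip(y_true, y_pred))
    n_correct = sum(t == p for t, p in pairs)
    true_pos = sum(t == positive and p == positive for t, p in pairs)
    n_pos = sum(t == positive for t in y_true)
    return n_correct / len(y_true), true_pos / n_pos


def synthetic_repetition(rng, correct):
    """Hypothetical trajectory: a 'correct' repetition has a larger excursion."""
    n_frames = rng.randint(80, 120)
    amplitude = 3.0 if correct else 1.0
    return [
        amplitude * math.sin(math.pi * i / (n_frames - 1)) + rng.gauss(0, 0.2)
        for i in range(n_frames)
    ]


def evaluate(n_persons=4, reps_per_class=10, seed=0):
    """Run leave-one-person-out evaluation on synthetic data.

    Returns {person: (accuracy, recall of incorrect repetitions)}.
    """
    rng = random.Random(seed)
    X, y, groups = [], [], []
    for person in range(n_persons):
        for label in ("incorrect", "correct"):
            for _ in range(reps_per_class):
                series = synthetic_repetition(rng, label == "correct")
                X.append(statistical_features(series))
                y.append(label)
                groups.append(person)

    per_person = {}
    for held_out, train, test in leave_one_group_out(groups):
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        predicted = [knn_predict(train_X, train_y, X[i]) for i in test]
        truth = [y[i] for i in test]
        per_person[held_out] = accuracy_and_recall(truth, predicted, "incorrect")
    return per_person


if __name__ == "__main__":
    for person, (acc, rec) in evaluate().items():
        print(f"held-out person {person}: accuracy={acc:.3f}, recall(incorrect)={rec:.3f}")
```

Splitting by person rather than by repetition means that no repetitions from the held-out person are seen during training, which mirrors the situation of a new player using a model trained on other people's data. The recall of the "incorrect" class is reported separately because it measures how often an incorrectly performed repetition would be caught rather than rewarded.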