Automatic Nursing Care Trainer Based on Machine Learning

Ankita Agrawal, Wolfgang Ertel
Institute for Artificial Intelligence
University of Applied Sciences Ravensburg-Weingarten
agrawala@hs-weingarten.de, ertel@hs-weingarten.de

Abstract

Nursing care is a challenging occupation. The ergonomically correct execution of physically strenuous care activities is very important in order to avoid secondary health problems such as backache for the nursing staff. However, there is a scarcity of ergonomics experts to facilitate the education of caregivers. In the project ERTRAG (Virtual Ergonomics Trainer in the Nursing Care Education), we aim to develop a virtual trainer that supports the learning of ergonomically correct movements, thus avoiding serious health risks. The virtual trainer itself is trained by means of machine learning techniques while it observes a human expert. The project is funded by the German Federal Ministry of Education and Research.

1 Introduction

The need to deliver nursing care has increased over recent years due to the challenges brought by societal demographic changes and treatment advancements. Stagnating birth rates and continuously increasing life expectancy have led to long-term changes in the age structure of Germany [Birg, 2003]. Hospital employees are confronted with growing physical strain in addition to the known mental stress. Furthermore, the increase in overweight patients is a major challenge for clinical professionals, which often leads to excessive demand. However, there is a lack of trained nursing staff in comparison to the increasing demand for health care services. Usually the nursing care students have a chance to attend seminars from the experts only two or three times during their entire apprenticeship. While taking care of patients and elderly people, their health is at constant risk. The caregivers often suffer from work-related musculoskeletal disorders (MSD) [Serranheira et al., 2014], especially back disorders and shoulder-arm complaints, as they have to transfer heavy loads when working with patients. This partly results in significant occupational impairments and loss of quality of life [Engels et al., 1996; Kusma et al., 2015; Freitag et al., 2013]. Hence, the employees either go into premature retirement due to unfavorable working conditions and prolonged illness or have to take frequent sick leaves [Meyer and Meschede, 2016; Grobe, 2014], thereby increasing the urgent need for trained personnel. Here the virtual trainer supports the caregivers in learning ergonomically correct practices. It is also suitable for training family members who take care of a patient at home. The system can be used to practice the basic care movements with a Kinect camera at home without straining the back muscles.

2 Problem Definition

In the project ERTRAG (Virtueller ERgonomieTRainer in der PflegeAusbildunG / Virtual Ergonomics Trainer in the Nursing Care Education), our goal is to develop a training system for students and employees in the nursing profession that assists them with the training of basic daily care activities. The activities performed by students are recorded using cameras and shoe soles. A skeleton model is generated using the point clouds delivered by the cameras. Sensors attached to the shoe soles are used to measure the force carried by a person to find out whether the caregiver is lifting a heavy load. Machine learning is applied to the skeleton and force data to recognize the correct execution of an activity. Later, while the nursing care activities are practiced in front of the cameras, the error stances will be detected by the learned algorithm and immediate real-time feedback in the form of audio messages, visual animation or haptic sensors will be provided to the students. Possible individual improvements will be suggested or the expert video will be shown depending upon the severity and frequency of a particular mistake. In this way, the system will not only help maintain the working ability of older employees, but also help in gaining young and skilled workers, thereby contributing to improving the quality and performance of a hospital. The project involves two research institutes and two companies from Baden-Württemberg, namely, University of Applied Sciences Ravensburg-Weingarten, University of Konstanz, TWT GmbH Science & Innovation and Sarissa GmbH, that bring different areas of expertise to the system.

To get an overview of the various care activities and the problems associated with non-ergonomic movements, the first step is to consult kinesthetic and physiotherapy experts. After consulting experts and observing students in the skills lab, it became apparent that there is no standard movement sequence for performing an activity. The nursing staff adapts the movements depending on factors such as the weight of the patient, the kind of health problem and the treatment prescribed to the patient. However, there are certain incorrect postures that should be avoided by the caregivers so as to maintain their health. Therefore, we dropped our earlier premise of recognizing one correct movement sequence and instead apply machine learning to classify the movements into correct ones and various error categories that should be avoided in any case. This makes the problem more challenging because an incorrect movement for a tall person may not necessarily be wrong for a small person. Also, it is not harmful if the back of a caregiver is bent normally, as opposed to when the person is lifting a patient with the back bent in a wrong way. The classification task is described in detail in Section 3.3.

3 Technical Approach

For training the machine learning algorithm, a large labeled dataset is required. State-of-the-art datasets for pose, activity and gesture recognition are publicly available. Some of these datasets are the MSR Action 3D Dataset [Li et al., 2010], the MSR Daily Activity 3D Dataset [Wang et al., 2012] and the MSR Gesture 3D Dataset [Kurakin et al., 2012]. These datasets are available for specific tasks and actions such as day-to-day tasks involving brushing teeth, chopping vegetables, hand gestures, playing badminton, working on a computer and other human activities. However, due to the specific nature of the human posture data required by the care activities, along with the shoe soles data, these datasets are not suitable for the ERTRAG system, which gives rise to the need for our own data generation. The dataset should comprise the correct motion sequences along with the motion sequences containing incorrect stances of the caregiver for the three scenarios mentioned in Section 3.1.

Figure 1: Setup for data acquisition with single-view camera with the adjustable bed and wheelchair.

3.1 Experiment Setup

In the project we observe three basic caregiving activities that are performed by the nursing staff. The frequently performed scenarios in a care facility are (a) moving a patient up in the bed towards the head, as they often slide down in the bed, (b) bringing a patient from the lying position in the bed to a sitting position on the edge of the bed, and (c) transferring the patient from the sitting position on the bed edge to the wheelchair and vice versa. In the first batch of data acquisition in 2017, the scenarios performed by a kinesthetic expert and two students were recorded using Microsoft Kinect v2 as shown in Figure 1. The second batch of data is currently being recorded with the help of a kinesthetic expert and about ten nursing students in different semesters. The students playing the role of patients are selected to differ in height, weight and gender so as to obtain a diverse dataset for applying machine learning. A wheeled hospital bed with the ability to elevate head/feet and adjust the bed height, along with a wheelchair, is also arranged for recording the nursing care activities in order to procure a genuine database for the problem scenarios. The movements are recorded in two-hour sessions with 50 videos recorded for the three activities per session.

3.2 Dataset

The data was recorded with the help of an acquisition tool built using the API (Application Programming Interface) provided by the Kinect SDK (Software Development Kit). The recorded sample images for the scenario in which the expert transfers the patient from the wheelchair to the bed are shown in Figure 2. The tool captures the RGB images, depth images, skeleton images and skeleton joint data for each scenario performed by the expert/students at the frame rate of 22 frames per second made available by Kinect. The skeleton joint data at each frame consists of the three-dimensional absolute position with respect to the camera and the orientation in the form of a quaternion for each joint. The tool can also be used to convert the image frames of a particular recording into a video sequence.

For each activity, about 20 videos are recorded, making a total of 60 videos. The recorded data is then prepared for labeling. Performing one scenario takes on average about 20 to 30 seconds. One RGB image per second is extracted from the recorded data using a Python script. In total, there are 1454 images and 60 videos that have to be labeled.

To facilitate the data labeling by the experts and remove the need for local software installation, the author developed a user-friendly web-based labeling tool using the Google Web Toolkit as shown in Figure 3.

Figure 3: Web-based tool programmed for labeling the images and videos by the experts.

The tool is developed for the experts to label images and videos. The comparison of the labeling of images and videos will show whether static image data is adequate for the posture assessment or dynamic video data is essential. The tool takes an image or a video as input on the left side. The images are shown in a random order so that the data can be labeled based on the posture, independent of the chronological order of the images in the execution of an activity. This ensures that the pose errors are accurately identified and the previous frames do not affect the labeling of a particular frame. Moreover, an error in a single frame does not mark the whole sequence as incorrect; only the posture in this particular frame is marked as incorrect. If the image shows a wrong pose of the caregiver, the expert can assign an error category from the ones already available below the image and an error severity in a range from 1 to 4. It is necessary to assign both error category and severity when an incorrect stance has been detected. If the desired error category is not available, a new category can be added that is then available for all the subsequent images and videos. If multiple errors in the pose of the caregiver are identified, multiple error categories along with their respective severities can be assigned to an image. If there is no mistake in the posture of the caregiver, that is, the expert has assigned no error to an image, the label for that image is automatically set to “correct”. The error categories correspond only to the unergonomic postures of the caregiver. The relative motion of the patient is not taken into account in the current analysis. Similarly, for labeling a video, when a pose error is identified, the video is paused and one or multiple error categories and their severities are assigned to this particular frame in the video. All other frames are labeled as “correct”.
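
The labeling rules described above can be summarized in a small data structure. The following Python sketch is illustrative only and is not the project's labeling tool (which is built with the Google Web Toolkit); the names `ErrorAnnotation`, `FrameLabel`, `is_correct` and the frame identifiers are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class ErrorAnnotation:
    """One error assigned by an expert: a category plus a severity from 1 to 4."""
    category: str
    severity: int

    def __post_init__(self) -> None:
        # Both category and severity are required for an incorrect stance.
        if not self.category:
            raise ValueError("an error category is required")
        if not 1 <= self.severity <= 4:
            raise ValueError("severity must be in the range 1 to 4")


@dataclass
class FrameLabel:
    """Label of a single image or video frame (hypothetical structure)."""
    frame_id: str
    errors: List[ErrorAnnotation] = field(default_factory=list)

    @property
    def is_correct(self) -> bool:
        # A frame without any assigned error is automatically "correct".
        return not self.errors

    @property
    def label(self) -> str:
        if self.is_correct:
            return "correct"
        return ", ".join(f"{e.category} (severity {e.severity})" for e in self.errors)


# Usage: one frame without errors and one frame with two errors.
ok = FrameLabel("img_0001")
bad = FrameLabel("img_0002", [ErrorAnnotation("Bed too low", 2),
                              ErrorAnnotation("Strong bending of the spine", 4)])
print(ok.label)
print(bad.label)
```

Keeping the errors as a list mirrors the rule that several categories, each with its own severity, may be attached to one frame, while an empty list corresponds to the automatic “correct” label.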

It can happen that the errors at a particular frame are a result of the movement performed in the previous frames. Therefore, a fixed number of frames before the error frame would have to be observed by the learning algorithm while processing an error frame. The labeling can be carried out using either a mouse or the keyboard, depending upon the choice of the person using the tool. The data is labeled by two kinesthetic experts and one physiotherapy expert. After the completion of labeling, the skeleton joint data corresponding to the time stamps of the RGB images that were extracted for labeling is assigned the respective labels, resulting in a labeled set of skeleton data.

Figure 2: The sample RGB, Depth and Skeleton images recorded with Data Acquisition Tool while the expert transfers the patient to bed.

3.3 Feature Engineering and Classification

Since labeling is done by the experts independently, many of the error categories provided by them are different. The final set of error categories to be considered in the project is determined in a meeting with the experts. Some of the categories are combined and the irrelevant ones are removed. The data labeled with the rejected error categories is relabeled as “correct”. The categories that are combined are renamed appropriately and the data is relabeled accordingly. The final eight error categories are:

1. Bed too low
2. Bed too high
3. The arms are bent
4. Movement in the wrong direction (the apprentice does not face the correct way while performing a movement)
5. Stride position is too narrow
6. There is no stride position present
7. Strong bending of the spine (while lifting the patients, the back should not be bent)
8. Patient being too heavily lifted (includes the cases when the plenum region such as back of neck or back of knee is grasped).

These final categories are in accordance with the fundamental ergonomically incorrect postures defined in the health care profession [Weißert-Horn et al., 2014].

The results shown in this paper are obtained using the skeleton data recorded from Kinect to finalize the pose/motion analysis strategy. Kinect provides data for 25 joints, namely, SpineBase, SpineMid, Neck, Head, ShoulderLeft, ElbowLeft, WristLeft, HandLeft, ShoulderRight, ElbowRight, WristRight, HandRight, HipLeft, KneeLeft, AnkleLeft, FootLeft, HipRight, KneeRight, AnkleRight, FootRight, SpineShoulder, HandTipLeft, ThumbLeft, HandTipRight and ThumbRight. With the data acquisition tool, the absolute position and the orientation in the form of a quaternion provided for each joint at each time stamp are saved. Since the absolute position of a joint can vary for the same pose depending upon the position of the camera, the relative coordinates of each joint with respect to the joint SpineBase, along with their orientation quaternions, are used as features. That is, the three-dimensional relative coordinates and the four-dimensional orientation quaternions of all the joints at a particular time stamp form one feature vector.

In the ERTRAG project we are dealing with the recognition of incorrect human postures while performing a nursing care task. Usually, skeleton or silhouette data is used for motion analysis and pose detection [Ye et al., 2013; Elgammal and Lee, 2004]. However, due to the inherent task complexity, the classical methods of software problem solving are not applicable here. Therefore, supervised machine learning with automated feature generation is applied to learn the different error classes. After the labeled data captured from Kinect v2 has been obtained, this data is used to train different machine learning algorithms. The classification algorithms, namely a K-Means [Lloyd, 1982] variant for classification with k-means++ [Arthur and Vassilvitskii, 2007] initialization, k-Nearest Neighbors (kNN) [Cover and Hart, 1967], Support Vector Machines (SVM) [Cortes and Vapnik, 1995] and Extreme Gradient Boosting (XGBoost) [Chen et al., 2015], are implemented and evaluated.

Owing to the small amount of data, and also to ascertain whether the static data is sufficient, we first apply the algorithms as binary classifiers. The positive data or the correct class (label = 1) consists of the data that has been labeled “correct” in the labeling tool. All the data containing non-ergonomic postures that have been assigned any of the error categories form the negative data and belong to the incorrect class (label = 0). If the results prove to be good enough, the error categories will be used as individual labels to further train a multi-class classifier; otherwise the dynamic data or the movement sequences will be used. The skeleton data is normalized using standardization, which subtracts the feature mean and scales each feature to unit variance. The data is then randomly divided into 67% training and 33% test data containing feature vectors from both classes. Each algorithm is trained on the training data using cross-validation [Kohavi, 1995] over a range of parameter values specific to that algorithm. For K-Means, the number of clusters is chosen between 2 and 9, representing the total available classes, and k-means++ is used for initial cluster center calculation. The parameter ranges for kNN are:

• Number of neighbors - 1 to 26
• Weight function for prediction - Uniform, Distance

The parameters for SVM are varied as follows:

• Kernel - Linear, RBF, Polynomial
• Penalty term, C - between −2 and 10
• Kernel coefficient, gamma - between −9 and 3

The following parameter ranges are used for XGBoost:

• Number of estimators - 2 to 140
• Maximum tree depth - 2 to 6
• Learning rate - 0.05 to 0.8
• Minimum loss reduction, gamma - 0 to 10
• L1 regularization term, alpha - 0 to 50
• Minimum sum of weights of all observations - 0 to 50

The model with the best parameter combination is saved for each classifier. The learned models are applied to the test skeleton data to evaluate their performance and find the best fitting algorithm for the pose detection problem. Finally, the learned model of the best classifier will be used for real-time recognition of the incorrect movements.

4 Results

In this section, the results obtained for various machine learning algorithms on the labeling done by individual experts are discussed. Figure 4 shows the mean classification accuracy of the binary classifiers for the labels obtained from the two kinesthetic experts. As we can see, SVM performs almost equally on the labeling of both experts, with 80 ± 3% and 83 ± 4% accuracy; however, it performs better, with a mean accuracy of 90 ± 3%, when the labels of the two experts are mixed (a feature vector is labeled as positive data and belongs to the correct class only if neither expert has found any error in the corresponding RGB image). This is because in the beginning, the experts used different error categories to label the data. One expert focused on certain types of errors while the other expert assigned error categories such that some of them were slightly different. Therefore, the annotated data from both kinesthetic experts taken together yields improved results. XGBoost and kNN both give better results when the labels are mixed, with 90 ± 2% and 88 ± 2% accuracy respectively. K-Means classification results are not shown as it performs very poorly, with a mean accuracy below 35%.

In general, we can see that the classifiers work better on the Expert 2 labels, which indicates that the labels assigned by Expert 1 are slightly inconsistent.
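
The feature construction, the standardization step and the mixed binary labels described in Section 3.3 and in the results above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's code: the dictionary layout of a frame, the function names `feature_vector`, `standardize` and `mixed_binary_label`, and the toy joint values are all hypothetical, and only three of the 25 Kinect joints are used to keep the example short.

```python
from typing import Dict, List, Sequence, Tuple

# Assumed frame layout: joint name -> (absolute xyz position, orientation quaternion).
Joint = Tuple[Tuple[float, float, float], Tuple[float, float, float, float]]


def feature_vector(frame: Dict[str, Joint], joint_order: Sequence[str],
                   root: str = "SpineBase") -> List[float]:
    """Concatenate, per joint, the position relative to the root joint and the quaternion."""
    root_pos = frame[root][0]
    features: List[float] = []
    for name in joint_order:
        pos, quat = frame[name]
        # Relative coordinates remove the dependence on the camera position.
        features.extend(p - r for p, r in zip(pos, root_pos))
        features.extend(quat)
    return features


def standardize(rows: List[List[float]]) -> List[List[float]]:
    """Subtract each feature's mean and scale it to unit variance."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(c) / n for c in columns]
    stds = [(sum((v - m) ** 2 for v in c) / n) ** 0.5 for c, m in zip(columns, means)]
    # Constant features have zero variance; divide by 1 so they stay at 0.
    stds = [s if s > 0 else 1.0 for s in stds]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def mixed_binary_label(errors_expert_1: Sequence[str],
                       errors_expert_2: Sequence[str]) -> int:
    """Return 1 (correct) only if neither expert assigned an error, else 0."""
    return 1 if not errors_expert_1 and not errors_expert_2 else 0


joints = ["SpineBase", "SpineMid", "Head"]  # subset of the 25 joints, for brevity
frame = {
    "SpineBase": ((0.5, 1.0, 2.0), (0.0, 0.0, 0.0, 1.0)),
    "SpineMid": ((0.5, 1.25, 2.0), (0.0, 0.0, 0.0, 1.0)),
    "Head": ((0.5, 1.75, 2.25), (0.0, 0.0, 0.0, 1.0)),
}
x = feature_vector(frame, joints)            # 7 values per joint
y = mixed_binary_label([], ["Bed too low"])  # one expert found an error
scaled = standardize([[1.0, 5.0], [3.0, 5.0]])
print(len(x), y, scaled)
```

With all 25 joints, seven values per joint give a feature vector of 175 entries. The mixed label is deliberately strict: a single error from either expert moves the frame into the incorrect class.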

Here we can also see that the classification accuracy does not vary significantly between SVM, kNN and XGBoost.

Figure 4: Results for labeling done by individual experts. Mean classification accuracy with lower and upper bound accuracy in percent.

The confusion matrices without and with normalization for XGBoost with mixed labels are shown in Figure 5 and Figure 6, respectively. In the figures, “correctPose” is the positive class and “incorrectPose” represents the error classes. Out of the 480 test data points, 414 are classified correctly, as depicted in the diagonal elements. The off-diagonal elements represent the 66 data points that were misclassified.

Figure 5: Confusion matrix without normalization on mixed data for XGBoost.

Figure 6: Normalized confusion matrix on mixed data for XGBoost.

To evaluate the current performance of the classifiers on multiple classes, we executed them on the data with eight error categories and one correct category as mentioned in Section 3.3. The mean classification accuracy of the algorithms is shown in Table 1. As expected, the results are not good, but the renewed evaluation in the coming months with a much larger dataset should give better results. The corresponding confusion matrices are shown in Figure 7 and Figure 8. The error classes E1 to E8 correspond to the final eight error categories. The data contains no label corresponding to the error category “Bed too high”. Therefore, E2 is not present in the confusion matrix. We can also see in the normalized confusion matrix that data belonging to E7 is misclassified as E6 (no stride position present). This may be because a data point labeled as E6 is often labeled as E7 as well by the experts.

Table 1: Mean Classification Accuracy (%) on Multi-class Classifier

Classifier     SVM      K-Means   kNN      XGBoost
Accuracy (%)   68 ± 4   4 ± 0     67 ± 3   68 ± 5

Figure 7: Confusion matrix without normalization on mixed data for XGBoost.

Figure 8: Normalized confusion matrix on mixed data for XGBoost.

5 Conclusion and Future Work

As can be seen in the results, the SVM, XGBoost and kNN binary classifiers perform well on the static skeleton data, producing 90 ± 3%, 90 ± 2% and 88 ± 2% classification accuracy, respectively. The results also show that the multi-class classifier does not work as well as the binary classification. However, they indicate that the approach of using the static data should work and that a much larger database should improve the results. Had the binary classifier not given satisfactory results, it would be unlikely that the multi-class classifier would provide similar or better results. In that case, we would switch to the dynamic data approach, which involves observing the time series and applying relevant machine learning algorithms such as Markov Models [Lee and Nevatia, 2009; Lv and Nevatia, 2006] and Recurrent Neural Networks [Du et al., 2015; Gers et al., 1999] to find the incorrect postures and movements. Furthermore, in addition to the current setup where the training and test samples contain data from all the demonstrators, another setup will be analyzed. In this second setup, one demonstrator is left out of the training samples and used only as test data, so that this test subject has not been seen previously by the machine learning algorithm.

As already mentioned in the paper, a large dataset is favorable for obtaining better results. Currently we are collecting and labeling more data and we plan to optimize the current algorithms and evaluate the results. The recording is carried out using two cameras and force-measuring shoe soles. A regression algorithm will be applied to predict the error severity in addition to the error class. Other features such as Euler angles, depending upon the degrees of freedom of each joint, will also be evaluated. If necessary, the dynamic data will be taken into account and machine learning will be applied to obtain better results. We will perform field tests in a health care institute to test the system. Feedback will be collected from the participating nursing care students and the results will be used to further improve our virtual ergonomics trainer.

References

[Arthur and Vassilvitskii, 2007] David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1027–1035. Society for Industrial and Applied Mathematics, 2007.

[Birg, 2003] Herwig Birg. Dynamik der demographischen alterung, bevölkerungsschrumpfung und zuwanderung in deutschland: Prognosen und auswirkungen. Aus Politik und Zeitgeschichte, 53, 2003.

[Chen et al., 2015] Tianqi Chen, Tong He, and Michael Benesty. Xgboost: extreme gradient boosting. R package version 0.4-2, pages 1–4, 2015.

[Cortes and Vapnik, 1995] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 20(3):273–297, 1995.

[Cover and Hart, 1967] Thomas Cover and Peter Hart. Nearest neighbor pattern classification. IEEE transactions on information theory, 13(1):21–27, 1967.

[Du et al., 2015] Yong Du, Wei Wang, and Liang Wang. Hierarchical recurrent neural network for skeleton based action recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1110–1118, 2015.

[Elgammal and Lee, 2004] Ahmed Elgammal and Chan-Su Lee. Inferring 3d body pose from silhouettes using activity manifold learning. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 2, pages II–II. IEEE, 2004.

[Engels et al., 1996] Josephine A Engels, JW Van Der Gulden, Theo F Senden, and Bep van’t Hof. Work related risk factors for musculoskeletal complaints in the nursing profession: results of a questionnaire survey. Occupational and environmental medicine, 53(9):636–641, 1996.

[Freitag et al., 2013] Sonja Freitag, Rachida Seddouki, Madeleine Dulon, Jan Felix Kersten, Tore J Larsson, and Albert Nienhaus. The effect of working position on trunk posture and exertion for routine nursing tasks: an experimental study. Annals of occupational hygiene, 58(3):317–325, 2013.

[Gers et al., 1999] Felix A Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with lstm. 1999.

[Grobe, 2014] T Grobe. Gesundheitsreport 2014. risiko rücken. Gesundheitsreport-Veröffentlichung zum Betrieblichen Gesundheitsmanagement der TK, Hamburg, 29(S 76), 2014.

[Kohavi, 1995] Ron Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Ijcai, volume 14, pages 1137–1145. Montreal, Canada, 1995.

[Kurakin et al., 2012] Alexey Kurakin, Zhengyou Zhang, and Zicheng Liu. A real time system for dynamic hand gesture recognition with a depth sensor. In Signal Processing Conference (EUSIPCO), 2012 Proceedings of the 20th European, pages 1975–1979. IEEE, 2012.

[Kusma et al., 2015] B Kusma, J-J Glaesener, S Brandenburg, A Pietsch, K Fischer, J Schmidt, S Behl-Schön, and U Pohrt. Der pflege das kreuz stärken-individualprävention “Rücken” bei der berufsgenossenschaft für gesundheitsdienst und wohlfahrtspflege. Trauma und Berufskrankheit, 17(4):244–249, 2015.

[Lee and Nevatia, 2009] Mun Wai Lee and Ramakant Nevatia. Human pose tracking in monocular sequence using multilevel structured models. IEEE transactions on pattern analysis and machine intelligence, 31(1):27–38, 2009.

[Li et al., 2010] Wanqing Li, Zhengyou Zhang, and Zicheng Liu. Action recognition based on a bag of 3d points. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on, pages 9–14. IEEE, 2010.

[Lloyd, 1982] Stuart Lloyd. Least squares quantization in pcm. IEEE transactions on information theory, 28(2):129–137, 1982.

[Lv and Nevatia, 2006] Fengjun Lv and Ramakant Nevatia. Recognition and segmentation of 3-d human action using hmm and multi-class adaboost. In European conference on computer vision, pages 359–372. Springer, 2006.

[Meyer and Meschede, 2016] Markus Meyer and M Meschede. Krankheitsbedingte fehlzeiten in der deutschen wirtschaft im jahr 2015. In Fehlzeiten-Report 2016, pages 251–454. Springer, 2016.

[Serranheira et al., 2014] F Serranheira, A Sousa-Uva, and M Sousa-Uva. Importance of occupational hazards in nurses msd symptoms. In Bridging Research and Good Practices towards Patients Welfare: Proceedings of the 4th International Conference on Healthcare Ergonomics and Patient Safety (HEPS), Taipei, Taiwan, 23-26 June 2014, page 133. CRC Press, 2014.

[Wang et al., 2012] Jiang Wang, Zicheng Liu, Ying Wu, and Junsong Yuan. Mining actionlet ensemble for action recognition with depth cameras. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 1290–1297. IEEE, 2012.

[Weißert-Horn et al., 2014] Margit Weißert-Horn, Marianela Diaz Meyer, Michael Jacobs, Hartmut Stern, Heinz-Werner Raske, and Kurt Landau. “Ergonomisch richtige” arbeitsweise beim transfer von schwerstpflegebedürftigen. ‘Ergonomically correct’ methods of transferring intensive care patients. Zeitschrift für Arbeitswissenschaft, 68(3):175–184, 2014.

[Ye et al., 2013] Mao Ye, Qing Zhang, Liang Wang, Jiejie Zhu, Ruigang Yang, and Juergen Gall. A survey on human motion analysis from depth data. In Time-of-flight and depth imaging. sensors, algorithms, and applications, pages 149–187. Springer, 2013.