Human Activity Recognition Using Pose Estimation and Machine Learning Algorithm

Abhay Gupta, Kuldeep Gupta, Kshama Gupta and Kapil Gupta
National Institute of Technology, Kurukshetra, Haryana, India

Abstract
Human Activity Recognition has become a popular field of research over the last two decades. Understanding human behavior in images gives useful information for a large number of computer vision problems and has many applications, such as scene recognition and pose estimation. Various methods exist for activity recognition, and every technique has its advantages and disadvantages. Despite a large body of research, recognizing activity is still a complex and challenging task. In this work, we propose an approach for human activity recognition and classification using a person's pose skeleton in images. The work is divided into two parts: single-person pose estimation and activity classification using the pose. Pose estimation consists of recognizing the locations of 18 body keypoints and joints. We use the OpenPose library for the pose estimation task, and the activity classification task is performed with multiple logistic regression. We also compare the accuracy of several other regression and classification algorithms on our dataset. We prepared our own dataset and divided it into two parts: one is used to train the model, and the other is used to validate the proposed model's performance.

Keywords
Human Activity Recognition, Pose Estimation, Body Keypoints, Logistic Regression, OpenPose

1. Introduction

The goal of a Human Activity Recognition (HAR) system is to predict the label of a person's action from an image or video. This interesting topic is motivated by many useful real-world applications, such as simulation, visual surveillance, and understanding human behavior. Action recognition from videos is a well-known and established research problem. In contrast, image-based action recognition is a comparably less explored problem, but it has gained the community's attention in recent years. Because motion cannot be estimated from a still image, recognizing actions in images remains a tedious and challenging problem. It requires a lot of work, as the methods that have been applied to video-based systems are not applicable here. The approach, however, is not the only difficulty in this task. There are many other challenges too, especially changes in clothing and body shape that affect the appearance of body parts, various illumination effects, the difficulty of estimating the pose when the person is not facing the camera, and the definition and diversity of activities.

Activity recognition through smartphones and wearable sensors is very common, and various benchmarks are available. But these systems rely on collecting data from sensors installed on the devices, and the user needs to wear these devices, which is uncomfortable in practice. Vision-based systems are a better alternative for this kind of problem because the user does not need to carry or wear any device. Instead, tools like cameras are installed in the surrounding environment to capture data [1].
One popular family of vision-based HAR systems uses pose information. Poses have had remarkable success in human activity recognition, and researchers now widely use them for this problem. Poses provide useful information about human behavior, and the concept is beneficial in various tasks such as HAR, content extraction, and semantic understanding. Pose estimation uses convolutional neural networks (CNNs) because they are very efficient at dealing with images; they are similar to traditional neural networks in that they consist of neurons with biases and learnable weights. In this study, we propose a pose-based HAR system that overcomes the issues discussed above for the smartphone and wearable-sensor approaches. We extract the human pose (the locations of 18 body keypoints in the two-dimensional plane) from images using the OpenPose library, which internally uses CNNs. Finally, the activity is classified from the pose information using a supervised machine learning algorithm.

The rest of the paper is structured as follows. Section 2 surveys selected research papers in the area. Section 3 contains the methodology and architecture of the proposed approach. A brief description of the dataset and of the evaluation metrics used in this work (precision, recall, and F1-score) is given in Sections 4 and 5, respectively. Section 6 contains the experiments and results of the various classification algorithms applied in this work. Section 7 concludes the work with some future directions.

2. Related Work

Research has recently begun to recognize the behavior of humans from images. Compared to video-based action classification, the number of research papers and journals is smaller. We outline some techniques used for HAR. Four types of approaches address the classification of actions: image structure-based methods, pose-based systems, model-based approaches, and example-based methods. The pose-based method trains each pose using an annotated 3D image [2]. The model-based method uses a parametric body model to match posture variables [3]. The example-based model uses classical machine learning algorithms to find actions in some image properties [4]. In the method based on image structure, the representation of the posture is used as the feature set for classifying the action [5].

[6] detected daily-living activities by preprocessing the data collected from the Microsoft Kinect motion-sensing device to minimize the error produced by the system and the subject. [7] proposed a new approach to activity recognition by simultaneously extracting features from the objects used to perform the activity and from the human posture. [8] applied OpenPose and a Kalman filter to track the target body, and then a one-dimensional fully convolutional network (FCN) to classify the activity.

Moreover, single-person activity can also be recognized using smartphone sensors and wearable sensors. The smartphone-based approach uses sensors built into the device, such as the accelerometer and gyroscope, to identify the activity, whereas the wearable-sensor-based approach requires the sensors to be attached to the subject's body to collect action information. [9] used several machine learning algorithms (SVM, KNN, and Bagging) on data collected from smartphones' accelerometer and gyroscope sensors and detected six different activities. [10] recognized human activity using an accelerometer and a gyroscope sensor mounted on the human body and detected three different activities with various machine learning algorithms such as KNN, Random Forest, and Naïve Bayes. [11] collected data from a smartphone and a smartwatch and used five-fold cross-validation to detect five upper-limb motions. [12] used wearable and smartphone-embedded sensors to detect six dynamic and six static activities using machine learning algorithms. [13] applied deep learning and a convolutional neural network to recognize the body's actions from data retrieved from smartphone sensors.
3. Proposed Approach

Our approach to activity recognition and classification consists of two sequential tasks: pose estimation from images, and then classification of the activities using the extracted pose keypoints as input to well-known classification algorithms such as logistic regression, support vector machines, decision trees, etc. Figure 1 shows the architecture of the proposed approach.

Figure 1: Proposed Architecture

3.1. Human Pose Estimation

Human pose estimation is the task of extracting the skeletal keypoints and joint locations corresponding to the parts of the human body. All those keypoints and joints are used to assemble the two-dimensional structure of the human body. In this work, we have used the OpenPose framework for estimating the pose from the input image.

In OpenPose, the image is passed through a CNN backbone to extract features from the input. The feature map is then processed by sequential multi-stage CNN layers to generate Part Affinity Fields (PAFs) and confidence maps. The part affinity fields and confidence maps generated above are passed through a bipartite graph matching algorithm to recover the human posture in the image. Figure 2 shows the OpenPose pipeline.

Figure 2: OpenPose Pipeline

3.1.1. Part Affinity Field Maps (L)

A part affinity field contains two-dimensional vectors that encode the positions and orientations of body parts in an image; it encodes the data as pairwise links between body parts.

L = (L1, L2, ..., LC), Lc ∈ R^(w×h×2), c ∈ {1, ..., C}   (1)

where C is the total number of limbs, R is the set of real numbers, L is the set of part affinity field maps, and w × h is the dimension of each map in the set L.

3.1.2. Confidence Map

A confidence map is a two-dimensional representation of the belief that a particular body part is located at a specific pixel.

S = (S1, S2, ..., SJ), Sj ∈ R^(w×h), j ∈ {1, ..., J}   (2)

where J is the total number of body parts, R is the set of real numbers, and S is the set of confidence maps.

The number of keypoints detected by OpenPose depends on the dataset on which it has been trained. In this work, the COCO model with 18 body keypoints (see Figure 3) is used: R_Ankle, R_Knee, R_Wrist, L_Wrist, R_Shoulder, L_Shoulder, L_Ankle, L_Ear, R_Ear, R_Elbow, L_Elbow, L_Knee, L_Eye, R_Eye, R_Hip, L_Hip, Nose, and Neck.

Figure 3: OpenPose Keypoints

3.2. Activity Classification

We formulate activity classification as a multiclass classification problem, which can be modeled using various machine learning regression and classification algorithms. The classification algorithm takes the 18 body keypoints (the x- and y-coordinates of each point) as input for training and testing our model. We used a supervised learning approach, as our dataset contains body keypoints paired with an activity label. Among all the algorithms, multiple logistic regression and random forest provide significantly greater accuracy.
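To make the handoff from pose estimation to classification concrete, here is a minimal Python sketch of the feature construction, assuming each image yields one (18, 2) keypoint array. The `estimate_pose` wrapper named in the final comment is hypothetical: it stands in for whichever OpenPose binding is used to obtain that array and is not part of the paper.

```python
import numpy as np

NUM_KEYPOINTS = 18  # COCO body model used by OpenPose (see Figure 3)

def pose_to_features(keypoints):
    """Flatten one detected pose into the 36-value classifier input.

    `keypoints` is an (18, 2) array holding the (x, y) pixel position of
    each body part, in the keypoint order listed above.
    """
    kp = np.asarray(keypoints, dtype=float).reshape(NUM_KEYPOINTS, 2)
    return kp.ravel()  # [x_0, y_0, x_1, y_1, ..., x_17, y_17]

# `estimate_pose` is a hypothetical wrapper around the OpenPose API that
# returns the (18, 2) keypoint array for the person in a given image:
# X = np.stack([pose_to_features(estimate_pose(img)) for img in images])
```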
4. Dataset

OpenPose uses the COCO keypoint detection dataset for the pose estimation task, which contains more than 200K images labeled with keypoints [14]. For the classification task, we collected images from Google, and some photos were taken with a smartphone camera. We prepared our dataset from approximately 1000 images covering five activity categories: sitting, standing, running, dancing, and laying. Each activity category has more than 170 images. We divided the dataset into training and testing sets in the ratio 90:10.

The data collected from different sources contains images of unequal width and height, while our model requires a uniform size. We therefore resized all images to a fixed size of 432×368 pixels before extracting the keypoints from them.
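A minimal sketch of this preparation step, assuming OpenCV for the resizing and scikit-learn for the 90:10 split; the fixed random seed is our addition, not a detail given in the paper.

```python
import cv2
from sklearn.model_selection import train_test_split

TARGET_W, TARGET_H = 432, 368  # fixed input size used in this work

def load_resized(path):
    """Read one image from disk and resize it to 432x368 pixels."""
    img = cv2.imread(path)
    if img is None:
        raise IOError("could not read " + path)
    return cv2.resize(img, (TARGET_W, TARGET_H))

# With X the (n_samples, 36) keypoint matrix built from the resized
# images and y the activity labels, the 90:10 split described above is:
# X_train, X_test, y_train, y_test = train_test_split(
#     X, y, test_size=0.10, random_state=42)
```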
5. Evaluation Metrics

For performance evaluation, recall, precision, and F1-score are used in this experiment. We also show the confusion matrices of some classifiers.

5.1. Precision

Precision (P) is the ratio of the number of true positives (Tp) to the sum of false positives (Fp) and true positives. It can also be understood as: of the images classified into a class, how many actually belong to that class.

P = Tp / (Tp + Fp)   (3)

5.2. Recall

Recall (R) is the ratio of the number of true positives (Tp) to the sum of false negatives (Fn) and true positives. It can also be understood as: of the images that belong to a class, how many are classified into that class.

R = Tp / (Tp + Fn)   (4)

5.3. F1-Score

The F1-score is the harmonic mean of recall and precision, calculated by Eq. 5.

F1-Score = 2(P × R) / (P + R)   (5)

5.4. Confusion Matrix

The confusion matrix is a two-dimensional matrix used to measure the overall performance of a classification algorithm. In the matrix, each row is associated with the predicted activity class, and each column is associated with the actual activity class. The matrix compares the target activity with the activity predicted by the model, which gives a better idea of the types of errors the classifier makes.

6. Experiments and Result

The following five activities are considered for pose estimation and activity recognition and classification: sitting, standing, dancing, laying, and running. The experiments are conducted with scikit-learn (0.23.1) and Python (3.6.6) on a Windows 10 system with an Intel i5 processor at 3.40 GHz and 8 GB RAM, using five classification algorithms for activity classification. These algorithms are described below with their confusion matrices. The performance results are provided in Table 1, which shows the recall, precision, and F1-score of the various classifiers used in the proposed approach.

Table 1
Performance Evaluation on Different Classifier Algorithms

                 Precision (%)   Recall (%)   F1-Measure (%)
Logistic             80.72          81.47          80.95
KNN                  77.90          77.89          77.12
SVM                  80.43          81.14          80.46
Decision Tree        74.49          75.80          73.50
Random Forest        80.75          80.34          79.43

6.1. Classification Algorithm

6.1.1. Logistic Regression

This algorithm is based on supervised learning and is used for classification problems. In this work, multiple logistic regression is used for classifying activities. 'sag' is used as the solver; it supports only L2 regularization (in the primal formulation) or no regularization. Dummy variables are used to represent the categorical outcome. The confusion matrix is shown in Figure 4.

Figure 4: Confusion Matrix (Logistic Regression)

6.1.2. K-Nearest Neighbors

K-nearest neighbors (KNN) is a supervised machine learning algorithm used for classification, and it is a non-parametric, lazy algorithm. Despite this simplicity, we obtained very competitive results, which is one reason for using this algorithm in our work. We tried different values of k and obtained the highest accuracy at k = 5. The distance function d used in this algorithm is given in Eq. 6, and the confusion matrix is shown in Figure 5.

d(p, q) = √(∑i (qi − pi)²)   (6)

where p and q are vectors containing the keypoints of two different images and i = 1, ..., n.

Figure 5: Confusion Matrix (KNN)

6.1.3. Support Vector Machine

The support vector machine (SVM) also comes under supervised learning algorithms and is mainly used for classification and regression problems. All available data points are plotted in the feature space, and classification is done by finding a hyperplane that best separates the two classes. The confusion matrix is provided in Figure 6.

Figure 6: Confusion Matrix (SVM)

6.1.4. Decision Tree

The decision tree comes under supervised learning. It is a powerful and widely accepted tool for prediction and classification. The algorithm learns to predict a target pose's activity by making decisions from previously seen training data. Predictions for activities are made starting from the root of the tree, where a record's attribute value is compared with the root attribute value. The confusion matrix is given in Figure 7.

Figure 7: Confusion Matrix (Decision Tree)

6.1.5. Random Forest

Random decision forest is a supervised learning algorithm and an ensemble learning method for classification and regression. It is also one of the most used and popular algorithms because it gives good results without hyper-parameter tuning. It creates multiple decision trees and selects the best solution by voting. We use a random forest because it predicts activities with good accuracy and runs efficiently even on big datasets. The confusion matrix is shown in Figure 8.

Figure 8: Confusion Matrix (Random Forest)
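Given this setup, the comparison reduces to a short scikit-learn loop. The sketch below is ours, not the authors' released code: it assumes X_train, X_test, y_train, y_test come from the 90:10 split of Section 4, sets only the hyper-parameters named in the text (the 'sag' solver, k = 5, plus a raised max_iter as an assumption so 'sag' converges), and uses weighted averaging of the per-class scores, since the paper does not state how Table 1's figures were aggregated.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support, confusion_matrix

# The five classifiers of Table 1; anything not named in the text is
# left at the scikit-learn defaults.
classifiers = {
    "Logistic":      LogisticRegression(solver="sag", max_iter=1000),
    "KNN":           KNeighborsClassifier(n_neighbors=5),  # Euclidean distance, Eq. 6
    "SVM":           SVC(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)        # 36-value pose vectors as input
    y_pred = clf.predict(X_test)
    p, r, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, average="weighted")  # Eqs. 3-5 per class, then averaged
    print("%-14s P=%.2f%% R=%.2f%% F1=%.2f%%"
          % (name, 100 * p, 100 * r, 100 * f1))
    print(confusion_matrix(y_test, y_pred))  # as in Figures 4-8
```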
7. Conclusion

In this study, we proposed an approach for human activity recognition from still images by extracting the skeletal coordinate information (pose) using the OpenPose API and then utilizing this pose information to classify the activity with the help of a supervised machine learning algorithm. We prepared our own dataset for this work, which contains five different activities, viz. sitting, standing, laying, dancing, and running. We used five algorithms (Logistic Regression, SVM, KNN, Random Forest, and Decision Tree) to find the best results for our model. From our experimental results, we observed that multiple logistic regression, SVM, and random forest show the highest accuracies of 80.72%, 80.43%, and 80.75%, respectively, while the other two algorithms, KNN and decision tree, underperform. We show the accuracies of some recent research on HAR in Table 2.

Table 2
Comparative Study

S.No.  Authors and Year          Dataset                    Activities                      Model Used                   Accuracy (%)
1      Nandy et al., 2019 [12]   Accelerometer and heart    Walking, climbing stairs,       Multilayer Perceptron        77.0
                                 rate sensor                sitting, running                Linear Regression            53.92
                                                                                            Gaussian Naïve Bayes         73.73
                                                                                            Decision Tree                93.54
2      Ghazal et al., 2018 [15]  Images from the internet   Sitting on the chair or         Decision-making algorithm    95.2
                                                            ground                          with feedforward CNN
3      Gatt et al., 2019 [16]    COCO keypoints             Abnormal activity such as       Pre-trained models of        93
                                                            fall detection                  PoseNet and OpenPose

Although much research has already been done to deal with the activity recognition problem, more convincing progress must still be made. In practice, there are many different activities that humans perform in everyday life. Detecting all of them is not an easy task because it requires a very large dataset to train the model. The dataset is not the only problem: the definition and diversity of activities also make it more complicated for machines to understand. More activities can be added to extend the scope and usefulness of the work in the future. Besides adding activities, data preprocessing techniques can be applied to handle missing keypoints of the body, as sketched below. We can also experiment with other machine learning algorithms that may provide better results.
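As one illustration of the missing-keypoint preprocessing suggested above (an option we sketch here, not something evaluated in this paper), body parts that OpenPose fails to detect come back as zero coordinates; these can be treated as missing values and imputed, for example with scikit-learn's SimpleImputer:

```python
from sklearn.impute import SimpleImputer

# OpenPose reports a body part it could not detect as (0, 0).  Treating
# those zeros as missing values, each absent coordinate is replaced by
# the per-feature mean observed over the training poses.
imputer = SimpleImputer(missing_values=0.0, strategy="mean")
X_train_filled = imputer.fit_transform(X_train)  # fit on training data only
X_test_filled = imputer.transform(X_test)
```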
8. References

[1] A. Gupta, K. Gupta, K. Gupta and K. Gupta, "A Survey on Human Activity Recognition and Classification," 2020 International Conference on Communication and Signal Processing (ICCSP), Chennai, India, 2020, pp. 0915-0919, doi: 10.1109/ICCSP48568.2020.9182416.
[2] G. Sharma, F. Jurie and C. Schmid, "Expanded Parts Model for Human Attribute and Action Recognition in Still Images," 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, 2013, pp. 652-659, doi: 10.1109/CVPR.2013.90.
[3] A. Gupta, A. Kembhavi and L. S. Davis, "Observing Human Object Interactions: Using Spatial and Functional Compatibility for Recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 10, pp. 1775-1789, Oct. 2009, doi: 10.1109/TPAMI.2009.83.
[4] Y. Wang, H. Jiang, M. S. Drew, Z.-N. Li and G. Mori, "Unsupervised Discovery of Action Classes," 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), New York, NY, USA, 2006, pp. 1654-1661, doi: 10.1109/CVPR.2006.321.
[5] J. Shotton et al., "Real-time human pose recognition in parts from single depth images," CVPR 2011, Providence, RI, 2011, pp. 1297-1304, doi: 10.1109/CVPR.2011.5995316.
[6] B. M. V. Guerra, S. Ramat, R. Gandolfi, G. Beltrami and M. Schmid, "Skeleton data preprocessing for human pose recognition using Neural Network," 2020 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Montreal, QC, Canada, 2020, pp. 4265-4268, doi: 10.1109/EMBC44109.2020.9175588.
[7] B. Reily, Q. Zhu, C. Reardon and H. Zhang, "Simultaneous Learning from Human Pose and Object Cues for Real-Time Activity Recognition," 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 2020, pp. 8006-8012, doi: 10.1109/ICRA40945.2020.9196632.
[8] H. Yan, B. Hu, G. Chen and E. Zhengyuan, "Real-Time Continuous Human Rehabilitation Action Recognition using OpenPose and FCN," 2020 3rd International Conference on Advanced Electronic Materials, Computers and Software Engineering (AEMCSE), Shenzhen, China, 2020, pp. 239-242, doi: 10.1109/AEMCSE50948.2020.00058.
[9] E. Bulbul, A. Cetin and I. A. Dogru, "Human Activity Recognition Using Smartphones," 2018 2nd International Symposium on Multidisciplinary Studies and Innovative Technologies (ISMSIT), Ankara, 2018, pp. 1-6, doi: 10.1109/ISMSIT.2018.8567275.
[10] R. Liu, T. Chen and L. Huang, "Research on human activity recognition based on active learning," 2010 International Conference on Machine Learning and Cybernetics, Qingdao, 2010, pp. 285-290, doi: 10.1109/ICMLC.2010.5581050.
[11] K.-S. Lee, S. Chae and H.-S. Park, "Optimal Time-Window Derivation for Human-Activity Recognition Based on Convolutional Neural Networks of Repeated Rehabilitation Motions," 2019 IEEE 16th International Conference on Rehabilitation Robotics (ICORR), Toronto, ON, Canada, 2019, pp. 583-586, doi: 10.1109/ICORR.2019.8779475.
[12] A. Nandy, J. Saha, C. Chowdhury and K. P. D. Singh, "Detailed Human Activity Recognition using Wearable Sensor and Smartphones," 2019 International Conference on Opto-Electronics and Applied Optics (Optronix), Kolkata, India, 2019, pp. 1-6, doi: 10.1109/OPTRONIX.2019.8862427.
[13] R. Saini and V. Maan, "Human Activity and Gesture Recognition: A Review," 2020 International Conference on Emerging Trends in Communication, Control and Computing (ICONC3), Lakshmangarh, Sikar, India, 2020, pp. 1-2, doi: 10.1109/ICONC345789.2020.9117535.
[14] Z. Cao, T. Simon, S. Wei and Y. Sheikh, "Realtime Multi-person 2D Pose Estimation Using Part Affinity Fields," 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, 2017, pp. 1302-1310, doi: 10.1109/CVPR.2017.143.