=Paper=
{{Paper
|id=Vol-2786/Paper40
|storemode=property
|title=Human Activity Recognition Using Pose Estimation and Machine Learning Algorithm
|pdfUrl=https://ceur-ws.org/Vol-2786/Paper40.pdf
|volume=Vol-2786
|authors=Abhay Gupta,Kuldeep Gupta,Kshama Gupta,Kapil Gupta
|dblpUrl=https://dblp.org/rec/conf/isic2/GuptaGGG21
}}
==Human Activity Recognition Using Pose Estimation and Machine Learning Algorithm==
Abhay Gupta, Kuldeep Gupta, Kshama Gupta and Kapil Gupta
National Institute of Technology, Kurukshetra, Haryana, India
Abstract
Human Activity Recognition has become a popular field of research over the last two decades. Understanding human behavior in images provides useful information for a large number of computer vision problems and has many applications, such as scene recognition and pose estimation. Various methods exist for activity recognition, and every technique has its advantages and disadvantages. Despite a large body of research, recognizing activity remains a complex and challenging task. In this work, we propose an approach for human activity recognition and classification using a person's pose skeleton in images. The work is divided into two parts: single-person pose estimation, and activity classification using the estimated pose. Pose estimation consists of detecting the locations of 18 body key points and joints; we use the OpenPose library for this task. The activity classification task is performed using multiple logistic regression. We also compare the accuracy of several other regression and classification algorithms on our dataset. We prepared our own dataset and divided it into two parts: one is used to train the model, and the other is used to validate the proposed model's performance.
Keywords:
Human Activity Recognition, Pose Estimation, Body Keypoints, Logistic Regression, OpenPose
1. Introduction

The goal of a Human Activity Recognition (HAR) system is to predict the label of a person's action from an image or video. This interesting topic is inspired by many useful real-world applications, such as simulation, visual surveillance, and understanding human behavior. Action recognition from videos is a well-known and established research problem. In contrast, image-based action recognition is a comparatively less explored problem, but it has gained the community's attention in recent years. Because motion cannot be estimated from a still image, recognizing actions in images remains a tedious and challenging problem. It requires a lot of work, as the methods that have been applied to video-based systems are not applicable here. The approach, however, is not the only problem faced in this task. There are many other challenges too, especially changes in clothing and body shape that affect the appearance of the body parts, varying illumination effects, the difficulty of estimating the pose when the person is not facing the camera, and the definition and diversity of activities.

Activity recognition through smartphones and wearable sensors is very common, and various benchmarks are available. But these systems rely on collecting data from sensors installed on the devices, and the user needs to wear these devices, which is uncomfortable in practice. Vision-based systems are a better alternative for this kind of problem because the user does not need to carry or wear any device. Instead, tools like cameras are installed in the surrounding environment to capture data [1].

ISIC'21: International Semantic Intelligence Conference, February 25–27, 2021, New Delhi, India
EMAIL: abhaygupta190@gmail.com (A. Gupta)
ORCID: 0000-0001-8529-8085 (A. Gupta)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)
One of the popular vision-based HAR systems uses pose information. Poses have had remarkable success in human activity recognition, and researchers now widely use them for this problem. Poses provide useful information about human behavior, and the concept is beneficial in various tasks such as HAR, content extraction, and semantic understanding. Pose-based systems typically use convolutional neural networks (CNNs), because CNNs are very efficient at dealing with images. They are similar to traditional neural networks in that they consist of neurons with biases and learnable weights. In this study, we propose a pose-based HAR system that overcomes the issues discussed above for the smartphone and wearable sensor approaches. We extract the human pose (the locations of 18 body keypoints in the two-dimensional plane) from images using the OpenPose library, which internally uses CNNs. Finally, the activity is classified from the pose information using a supervised machine learning algorithm.

The rest of the paper is structured as follows: Section 2 presents a literature survey of selected research papers in the area. Section 3 contains the methodology and architecture of the proposed approach. A brief description of the dataset and the evaluation metrics (precision, recall, and F1-score) used in this work is given in Sections 4 and 5, respectively. Section 6 contains the experiments and results of the various classification algorithms applied in this work. Section 7 concludes the work with some future directions.

2. Related Work

Research has recently begun to recognize human behavior from images. Compared to video-based action classification, the number of research papers and journals is smaller. We state some techniques used for HAR below. Four types of approaches address the classification of actions: image structure-based methods, pose-based systems, model-based approaches, and example-based methods. The pose-based method trains each pose using an annotated 3D image [2]. The model-based method uses a known parametric body model to match posture variables [3]. The example-based model uses classical machine learning algorithms to find actions in some image properties [4]. In the method based on image structure, the posture's representation is considered as features for the classification of the action [5].

[6] detected daily living activities by preprocessing the data collected from the Microsoft Kinect motion-sensing device to minimize the error produced by the system and the subject. [7] proposed a new approach to activity recognition by simultaneously extracting features from the objects used to perform the activity and from the human posture. [8] applied OpenPose and a Kalman filter to track the target body, and then a one-dimensional fully convolutional network is used for the classification of the activity.

Moreover, single-person activity can also be recognized using smartphone sensors and wearable sensors. The smartphone-based approach uses sensors that are built into the device, such as the accelerometer and gyroscope, to identify activity, whereas the wearable sensor-based approach requires the sensors to be attached to the subject's body to collect action information. [9] used several machine learning algorithms (SVM, KNN, and Bagging) on data collected from smartphone accelerometer and gyroscope sensors and detected six different activities. [10] recognized human activity using accelerometer and gyroscope sensors mounted on humans and used various machine learning algorithms, such as KNN, Random Forest, and Naïve Bayes, to detect three different activities. [11] collected data from a smartphone and a smartwatch and used a five-fold cross-validation technique to detect five upper limb motions. [12] used wearable and smartphone-embedded sensors to detect six dynamic and six static activities using machine learning algorithms. [13] applied deep learning and convolutional neural networks to recognize the body's actions on data retrieved from smartphone sensors.

3. Proposed Approach

Our approach to activity recognition and classification consists of two sequential tasks: pose estimation from images, and then classification of the activities using the extracted pose key points as input with the help of classification algorithms such as logistic regression, support vector machine, decision tree, etc. Figure 1 shows the architecture of the proposed approach.
Figure 1: Proposed Architecture
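To make the two-stage architecture of Figure 1 concrete, the following is a minimal Python sketch of the interface between the stages. The function estimate_keypoints is a hypothetical stand-in for the OpenPose call (not an actual OpenPose API); the rest only assumes it returns 18 (x, y) pairs per image.

<pre>
def estimate_keypoints(image):
    """Stage 1 -- pose estimation. Hypothetical stand-in for the
    OpenPose call: returns an (18, 2) array of (x, y) locations,
    one row per COCO body keypoint."""
    raise NotImplementedError  # backed by OpenPose in this work

def classify_activity(classifier, image):
    """Stage 2 -- activity classification. Flattens the pose into a
    36-dimensional feature vector and feeds it to a trained
    scikit-learn style classifier."""
    features = estimate_keypoints(image).reshape(1, -1)
    return classifier.predict(features)[0]  # e.g. 'sitting'
</pre>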
3.1. Human Pose Estimation

Human pose estimation is the task of extracting the body's skeletal key points and joint locations corresponding to the human body parts. All these key points and joints are then associated to form the two-dimensional structure of the human body. In this work, we have used the OpenPose framework to estimate the pose from the input image.

In OpenPose, the image is first passed through a CNN backbone to extract a feature map from the input. The feature map is then processed by sequential multi-stage CNN layers to generate Part Affinity Fields (PAFs) and confidence maps. The part affinity fields and confidence maps generated above are passed through a bipartite graph matching algorithm to recover the human posture in the image. Figure 2 shows the OpenPose pipeline.

3.1.1. Part Affinity Field Maps (L)

A part affinity field contains two-dimensional vectors that encode the positions and orientations of body parts in an image; it encodes the data as pairwise links between body parts.

L = (L1, L2, L3, …, LC), Lc ∈ R^(w×h×2), c ∈ {1, …, C}   (1)

where C is the total number of limbs, R is the set of real numbers, L is the set of part affinity field maps, and w × h is the dimension of each map in the set L.

3.1.2. Confidence Map

A confidence map is a two-dimensional representation of the belief that a particular body part is located at a specific pixel.

S = (S1, S2, S3, …, SJ), Sj ∈ R^(w×h), j ∈ {1, …, J}   (2)

where J is the total number of body parts, R is the set of real numbers, and S is the set of confidence maps.

The number of keypoints detected by OpenPose depends upon the dataset on which the model has been trained.
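To illustrate the shapes implied by Eqs. (1) and (2), the following numpy sketch allocates the two sets of maps. The value C = 19 limb connections is our assumption for the COCO body model, and w and h are arbitrary example values; this is an illustration of the notation, not OpenPose code.

<pre>
import numpy as np

# Shapes implied by Eqs. (1)-(2) for the COCO body model:
# J = 18 body parts; C = 19 limb connections (our assumption);
# w, h = example output-grid size (depends on input size and stride).
J, C = 18, 19
w, h = 54, 46

# S: one confidence map per body part -> Sj in R^(w x h)
S = np.zeros((J, w, h))

# L: one 2-D vector field per limb -> Lc in R^(w x h x 2)
L = np.zeros((C, w, h, 2))

# Belief that part j lies at pixel (x, y):   S[j, x, y]
# Limb-c direction vector at pixel (x, y):   L[c, x, y, :]
print(S.shape, L.shape)   # (18, 54, 46) (19, 54, 46, 2)
</pre>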
Figure 2: OpenPose Pipeline
In this work, the COCO model, having 18 different body key points (see Figure 3), is used: R_Ankle, R_Knee, R_Wrist, L_Wrist, R_Shoulder, L_Shoulder, L_Ankle, L_Ear, R_Ear, R_Elbow, L_Elbow, L_Knee, L_Eye, R_Eye, R_Hip, L_Hip, Nose, and Neck.

Figure 3: OpenPose Keypoints

3.2. Activity Classification

We formulate the activity classification problem as a multiclass classification problem, which can be modeled using various machine learning regression and classification algorithms. The classification algorithm takes the 18 body keypoints (the x-axis and y-axis values of each point) as input for our model's training and testing. We used a supervised learning approach, as our dataset contains body keypoints with an activity label. Among all the algorithms we use, multiple logistic regression and random forest provide significantly greater accuracy.

4. Dataset

OpenPose uses the COCO keypoint detection dataset for the pose estimation task, which contains more than 200K images labeled with keypoints [14]. For classification purposes, we have collected images from Google, and some photos were taken with a smartphone camera. We prepared our dataset from approximately 1000 images (five activity categories, namely sitting, standing, running, dancing, and laying). Each activity category has more than 170 images. We divided our dataset into training and testing sets in the ratio 90:10. Data collected from different sources contain images of unequal width and height, while our model requires the same width and height. We have therefore resized all images to a fixed size of 432×368 pixels, and then the key points are extracted from them.
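A minimal sketch of this preparation step (resizing to 432×368 and a stratified 90:10 split) might look as follows; extract_keypoints is again a hypothetical stand-in for the OpenPose call, and the helper names are ours.

<pre>
import cv2
import numpy as np
from sklearn.model_selection import train_test_split

ACTIVITIES = ['sitting', 'standing', 'running', 'dancing', 'laying']
TARGET_SIZE = (432, 368)   # (width, height), as used in this work

def image_to_features(path, extract_keypoints):
    """Resize one image to 432x368 and flatten its 18 (x, y)
    keypoints into a 36-dimensional feature vector.
    `extract_keypoints` is a hypothetical OpenPose stand-in that
    must return an (18, 2) array."""
    image = cv2.resize(cv2.imread(path), TARGET_SIZE)
    return extract_keypoints(image).reshape(-1)

def split_dataset(features, labels):
    """90:10 train/test split, stratified so that all five activity
    classes appear in both parts."""
    return train_test_split(np.asarray(features), np.asarray(labels),
                            test_size=0.10, stratify=labels,
                            random_state=42)
</pre>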
5. Evaluation Metrics

For performance evaluation, recall, precision, and F1-score are used in this experiment. We have also shown the confusion matrices of some classifiers.

5.1. Precision

Precision (P) is the ratio of the number of true positives (Tp) to the sum of false positives (Fp) and true positives. It can also be defined as: of the images classified into this class, how many actually belong to it.

P = Tp / (Tp + Fp)   (3)

5.2. Recall

Recall (R) is the ratio of the number of true positives (Tp) to the sum of false negatives (Fn) and true positives. It can also be defined as: of the images that belong to this class, how many are classified into it.

R = Tp / (Tp + Fn)   (4)

5.3. F1-Score

The F1-score is calculated as the harmonic mean of recall and precision; Eq. 5 calculates it.

F1-Score = 2 (P × R) / (P + R)   (5)

5.4. Confusion Matrix

A confusion matrix is a two-dimensional matrix used to measure the overall performance of a machine learning classification algorithm. In the matrix, each row is associated with the predicted activity class, and each column is associated with the actual activity class. The matrix compares the target activity with the activity predicted by the model. This gives a better idea of what types of errors our classifier has made.

6. Experiments and Results

The following five activities are considered for pose estimation and activity recognition and classification: sitting, standing, dancing, laying, and running. The experiments are conducted with Scikit-learn (0.23.1) and Python (3.6.6) on a Windows 10 operating system with an Intel i5 processor at 3.40 GHz and 8 GB RAM, using five classification algorithms for activity classification. These algorithms are described below with their confusion matrices. The performance results are provided in Table 1, which shows the recall, precision, and F1-score of the various classifiers used in the proposed approach.

6.1. Classification Algorithms

6.1.1. Logistic Regression

This algorithm is based on supervised learning, and it is used in classification problems. In this work, multiple logistic regression is used for classifying activities, and 'sag' is used as the solver because it supports only L2 regularization with the primal formulation (or no regularization). Dummy variables are used to represent the categorical outcome.
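Continuing from the dataset sketch above, this classifier and the metrics of Section 5 can be wired together with Scikit-learn roughly as follows. The 'weighted' averaging over the five classes is an assumption, since the averaging scheme is not stated in the text, and X_train/X_test are the assumed names from the earlier sketch.

<pre>
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (precision_score, recall_score,
                             f1_score, confusion_matrix)

# Multinomial logistic regression with the 'sag' solver and an L2
# penalty, as described above. X_*/y_* hold the 36-dim keypoint
# vectors and activity labels from the Section 4 sketch (assumed).
clf = LogisticRegression(solver='sag', penalty='l2', max_iter=1000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Eqs. (3)-(5); 'weighted' averaging over the five classes is an
# assumption on our part.
print('Precision:', precision_score(y_test, y_pred, average='weighted'))
print('Recall:   ', recall_score(y_test, y_pred, average='weighted'))
print('F1-score: ', f1_score(y_test, y_pred, average='weighted'))

# Note: scikit-learn puts the *true* class on the rows, i.e. the
# transpose of the row/column convention described in Section 5.4.
print(confusion_matrix(y_test, y_pred))
</pre>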
Table 1
Performance Evaluation of Different Classifiers

  Algorithm             Precision (%)   Recall (%)   F1-Measure (%)
  Logistic Regression       80.72          81.47          80.95
  KNN                       77.90          77.89          77.12
  SVM                       80.43          81.14          80.46
  Decision Tree             74.49          75.80          73.50
  Random Forest             80.75          80.34          79.43
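A comparison in the style of Table 1 can be produced with a loop of this shape (a sketch: apart from the hyperparameters stated in the text, scikit-learn defaults are assumed, which may not match the settings used for Table 1).

<pre>
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support

classifiers = {
    'Logistic Regression': LogisticRegression(solver='sag', max_iter=1000),
    'KNN': KNeighborsClassifier(n_neighbors=5),   # k = 5, per Section 6.1.2
    'SVM': SVC(),                                 # defaults assumed
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    p, r, f1, _ = precision_recall_fscore_support(
        y_test, clf.predict(X_test), average='weighted')
    print(f'{name:20s} P={p:.4f}  R={r:.4f}  F1={f1:.4f}')
</pre>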
Figure 4: Confusion Matrix (Logistic Regression)

6.1.2. K-Nearest Neighbors

K-nearest neighbors (KNN) is a supervised machine learning algorithm used for classification, and it is a non-parametric, lazy algorithm. Despite its simplicity, we got very competitive results, which is one reason for using this algorithm in our work. We tried different values of k and got the highest accuracy with k = 5. The distance function (d) used in this algorithm is given in Eq. 6, and the confusion matrix is shown in Figure 5.

d(p, q) = √( Σ (qi − pi)² )   (6)

where p and q are vectors containing the keypoints of two different images and i = 1, …, n.

Figure 5: Confusion Matrix (KNN)

6.1.3. Support Vector Machine

This algorithm also comes under supervised learning and is mainly used for classification and regression problems. All available data are plotted as points in the feature space, and classification is done by finding a hyperplane that best separates the two classes. The confusion matrix is provided in Figure 6.

Figure 6: Confusion Matrix (SVM)

6.1.4. Decision Tree

The decision tree comes under supervised learning. It is a powerful and widely accepted tool for prediction and classification. This algorithm learns to predict a target pose's activity and makes decisions from previously trained data. Predictions for activities are made starting from the root of the tree, where the record's attribute value is compared to the root attribute value. The confusion matrix is given in Figure 7.

Figure 7: Confusion Matrix (Decision Tree)

6.1.5. Random Forest

Random decision forest is a supervised learning algorithm, and it is an ensemble learning method for classification and regression. It is also one of the most used and popular algorithms because it gives good results even without hyper-parameter tuning. It creates multiple decision trees and selects the best solution by voting. We use a random forest because it predicts activity with good accuracy and runs efficiently even for big datasets. The confusion matrix is shown in Figure 8.

Figure 8: Confusion Matrix (Random Forest)
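As an implementation note on the classifiers above: the distance in Eq. (6) is the ordinary Euclidean metric, which is also the default in KNeighborsClassifier. The sketch below makes that correspondence explicit, reusing the assumed variable names from the earlier sketches.

<pre>
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def keypoint_distance(p, q):
    """Eq. (6): Euclidean distance between the flattened keypoint
    vectors of two images."""
    p, q = np.asarray(p), np.asarray(q)
    return np.sqrt(np.sum((q - p) ** 2))

# k = 5 gave the highest accuracy in our experiments (Section 6.1.2);
# metric='euclidean' matches Eq. (6) explicitly.
knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn.fit(X_train, y_train)          # assumed names from earlier sketches
print(knn.score(X_test, y_test))   # mean accuracy on the held-out 10%
</pre>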
7. Conclusion

In this study, we proposed our approach for human activity recognition from still images by extracting the skeletal coordinate information (pose) using the OpenPose API and then utilizing this pose information to classify the activity with the help of a supervised machine learning algorithm. We prepared our own dataset for this work, which contains five different activities, viz. sitting, standing, laying, dancing, and running. We have used five algorithms (Logistic Regression, SVM, KNN, Random Forest, and Decision Tree) to find the best results for our model. From our experimental results, we observed that multiple logistic regression, SVM, and random forest show the highest accuracies of 80.72%, 80.43%, and 80.75%, respectively, while the other two algorithms, KNN and Decision Tree, underperform. We show the accuracies of some recent research on HAR in Table 2.

Although much research has already been done to deal with the activity recognition problem, more convincing work is still needed. In practice, there are a lot of different activities that humans perform in everyday life. Detecting all of them is not an easy task because it requires a very large dataset to train the model. The dataset is not the only problem: the definition and diversity of activities also make it more complicated for machines to understand. More activities can be added in the future to extend the scope and usefulness of the work. Besides adding activities, we can apply data preprocessing techniques to handle missing keypoints of the body. We can also experiment with other machine learning algorithms that may provide better results.
Table 2
Comparative Study

1. Nandy et al., 2019 [12] — Dataset: accelerometer and heart rate sensor. Activities: walking, climbing stairs, sitting, running. Models used (accuracy %): Multilayer Perceptron (77.0), Linear Regression (53.92), Gaussian Naïve Bayes (73.73), Decision Tree (93.54).
2. Ghazal et al., 2018 [15] — Dataset: images from the internet. Activities: sitting on the chair or ground. Model used (accuracy %): decision-making algorithm with feedforward CNN (95.2).
3. Gatt et al., 2019 [16] — Dataset: COCO keypoints. Activities: abnormal activity such as fall detection. Models used (accuracy %): pre-trained models of PoseNet and OpenPose (93).
8. References
[1] A. Gupta, K. Gupta, K. Gupta and K. Gupta, "A Survey on Human Activity Recognition and Classification," 2020 International Conference on Communication and Signal Processing (ICCSP), Chennai, India, 2020, pp. 0915-0919, doi: 10.1109/ICCSP48568.2020.9182416.
[2] G. Sharma, F. Jurie and C. Schmid, "Expanded Parts Model for Human Attribute and Action Recognition in Still Images," 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, 2013, pp. 652-659, doi: 10.1109/CVPR.2013.90.
[3] A. Gupta, A. Kembhavi and L. S. Davis, "Observing Human-Object Interactions: Using Spatial and Functional Compatibility for Recognition," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 10, pp. 1775-1789, Oct. 2009, doi: 10.1109/TPAMI.2009.83.
[4] Y. Wang, H. Jiang, M. S. Drew, Z.-N. Li and G. Mori, "Unsupervised Discovery of Action Classes," 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), New York, NY, USA, 2006, pp. 1654-1661, doi: 10.1109/CVPR.2006.321.
[5] J. Shotton et al., "Real-time human pose recognition in parts from single depth images," CVPR 2011, Providence, RI, 2011, pp. 1297-1304, doi: 10.1109/CVPR.2011.5995316.
[6] B. M. V. Guerra, S. Ramat, R. Gandolfi, G. Beltrami and M. Schmid, "Skeleton data preprocessing for human pose recognition using Neural Network," 2020 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Montreal, QC, Canada, 2020, pp. 4265-4268, doi: 10.1109/EMBC44109.2020.9175588.
[7] B. Reily, Q. Zhu, C. Reardon and H. Zhang, "Simultaneous Learning from Human Pose and Object Cues for Real-Time Activity Recognition," 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 2020, pp. 8006-8012, doi: 10.1109/ICRA40945.2020.9196632.
[8] H. Yan, B. Hu, G. Chen and E. Zhengyuan, "Real-Time Continuous Human Rehabilitation Action Recognition using OpenPose and FCN," 2020 3rd International Conference on Advanced Electronic Materials, Computers and Software Engineering (AEMCSE), Shenzhen, China, 2020, pp. 239-242, doi: 10.1109/AEMCSE50948.2020.00058.
[9] E. Bulbul, A. Cetin and I. A. Dogru, "Human Activity Recognition Using Smartphones," 2018 2nd International Symposium on Multidisciplinary Studies and Innovative Technologies (ISMSIT), Ankara, 2018, pp. 1-6, doi: 10.1109/ISMSIT.2018.8567275.
[10] R. Liu, T. Chen and L. Huang, "Research on human activity recognition based on active learning," 2010 International Conference on Machine Learning and Cybernetics, Qingdao, 2010, pp. 285-290, doi: 10.1109/ICMLC.2010.5581050.
[11] K.-S. Lee, S. Chae and H.-S. Park, "Optimal Time-Window Derivation for Human-Activity Recognition Based on Convolutional Neural Networks of Repeated Rehabilitation Motions," 2019 IEEE 16th International Conference on Rehabilitation Robotics (ICORR), Toronto, ON, Canada, 2019, pp. 583-586, doi: 10.1109/ICORR.2019.8779475.
[12] A. Nandy, J. Saha, C. Chowdhury and K. P. D. Singh, "Detailed Human Activity Recognition using Wearable Sensor and Smartphones," 2019 International Conference on Opto-Electronics and Applied Optics (Optronix), Kolkata, India, 2019, pp. 1-6, doi: 10.1109/OPTRONIX.2019.8862427.
[13] R. Saini and V. Maan, "Human Activity and Gesture Recognition: A Review," 2020 International Conference on Emerging Trends in Communication, Control and Computing (ICONC3), Lakshmangarh, Sikar, India, 2020, pp. 1-2, doi: 10.1109/ICONC345789.2020.9117535.
[14] Z. Cao, T. Simon, S. Wei and Y. Sheikh, "Realtime Multi-person 2D Pose Estimation Using Part Affinity Fields," 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, 2017, pp. 1302-1310, doi: 10.1109/CVPR.2017.143.