Human Activity Recognition Using Pose Estimation and Machine Learning Algorithm

Abhay Gupta, Kuldeep Gupta, Kshama Gupta and Kapil Gupta
National Institute of Technology, Kurukshetra, Haryana, India

Abstract
Human Activity Recognition has become a popular field of research over the last two decades. Understanding human behavior in images gives useful information for a large number of computer vision problems and has many applications, such as scene recognition and pose estimation. Various methods exist for activity recognition, and every technique has its advantages and disadvantages. Despite a large body of research, recognizing activity is still a complex and challenging task. In this work, we propose an approach for human activity recognition and classification using a person's pose skeleton in images. The work is divided into two parts: single-person pose estimation and activity classification using the pose. Pose estimation consists of recognizing the locations of 18 body keypoints and joints. We use the OpenPose library for the pose estimation task, and the activity classification task is performed with multiple logistic regression. We also compare the accuracy of several other regression and classification algorithms on our dataset. We prepared our own dataset and divided it into two parts: one is used to train the model, and the other is used to validate the proposed model's performance.

Keywords
Human Activity Recognition, Pose Estimation, Body Keypoints, Logistic Regression, OpenPose

1. Introduction

The goal of a Human Activity Recognition (HAR) system is to predict the label of a person's action from an image or video. This interesting topic is motivated by many useful real-world applications, such as simulation, visual surveillance, and understanding human behavior. Action recognition from videos is a well-known and established research problem. In contrast, image-based action recognition is a comparably less explored problem, but it has gained the community's attention in recent years. Because motion cannot be estimated from a still image, recognizing actions in images remains a tedious and challenging problem. It requires a lot of work, as the methods that have been applied to video-based systems are not applicable here. The approach, however, is not the only difficulty in this task. There are many other challenges too, especially changes in clothing and body shape that affect the appearance of body parts, various illumination effects, the difficulty of estimating the pose when the person is not facing the camera, and the definition and diversity of activities.

Activity recognition through smartphones and wearable sensors is very common, and various benchmarks are available. But these systems rely on collecting data from sensors installed on the devices, and the user needs to wear these devices, which is uncomfortable in practice. Vision-based systems are a better alternative for this kind of problem because the user does not need to carry or wear any device. Instead, tools like cameras are installed in the surrounding environment to capture data [1].
One popular family of vision-based HAR systems uses pose information. Poses have had remarkable success in human activity recognition, and researchers now widely use them for this problem. Poses provide useful information about human behavior, and the concept is beneficial in various tasks such as HAR, content extraction, and semantic understanding. Pose estimation uses convolutional neural networks (CNNs) because they are very efficient at dealing with images; they are similar to traditional neural networks in that they consist of neurons with biases and learnable weights. In this study, we propose a pose-based HAR system that overcomes the issues discussed above for the smartphone and wearable-sensor approaches. We extract the human pose (the locations of 18 body keypoints in the two-dimensional plane) from images using the OpenPose library, which internally uses CNNs. Finally, the activity is classified from the pose information using a supervised machine learning algorithm.

The rest of the paper is structured as follows. Section 2 surveys selected research papers in the area. Section 3 contains the methodology and architecture of the proposed approach. A brief description of the dataset and of the evaluation metrics used in this work (precision, recall, and F1-score) is given in Sections 4 and 5, respectively. Section 6 contains the experiments and results of the various classification algorithms applied in this work. Section 7 concludes the work with some future directions.

2. Related Work

Research has recently begun to recognize the behavior of humans from images. Compared to video-based action classification, the number of research papers and journals is smaller. We outline some techniques used for HAR. Four types of approaches address the classification of actions: image structure-based methods, pose-based systems, model-based approaches, and example-based methods. The pose-based method trains each pose using an annotated 3D image [2]. The model-based method uses a parametric body model to match posture variables [3]. The example-based model uses classical machine learning algorithms to find actions in some image properties [4]. In the method based on image structure, the representation of the posture is used as the feature set for classifying the action [5].

[6] detected daily-living activities by preprocessing the data collected from the Microsoft Kinect motion-sensing device to minimize the error produced by the system and the subject. [7] proposed a new approach to activity recognition by simultaneously extracting features from the objects used to perform the activity and from the human posture. [8] applied OpenPose and a Kalman filter to track the target body, and then a one-dimensional fully convolutional network (FCN) to classify the activity.

Moreover, single-person activity can also be recognized using smartphone sensors and wearable sensors. The smartphone-based approach uses sensors built into the device, such as the accelerometer and gyroscope, to identify the activity, whereas the wearable-sensor-based approach requires the sensors to be attached to the subject's body to collect action information. [9] used several machine learning algorithms (SVM, KNN, and Bagging) on data collected from smartphones' accelerometer and gyroscope sensors and detected six different activities. [10] recognized human activity using an accelerometer and a gyroscope sensor mounted on the human body and detected three different activities with various machine learning algorithms such as KNN, Random Forest, and Naïve Bayes. [11] collected data from a smartphone and a smartwatch and used five-fold cross-validation to detect five upper-limb motions. [12] used wearable and smartphone-embedded sensors to detect six dynamic and six static activities using machine learning algorithms. [13] applied deep learning and a convolutional neural network to recognize the body's actions from data retrieved from smartphone sensors.
3. Proposed Approach

Our approach to activity recognition and classification consists of two sequential tasks: pose estimation from images, and then classification of the activities using the extracted pose keypoints as input to well-known classification algorithms such as logistic regression, support vector machines, decision trees, etc. Figure 1 shows the architecture of the proposed approach.

Figure 1: Proposed Architecture

3.1. Human Pose Estimation

Human pose estimation is the task of extracting the skeletal keypoints and joint locations corresponding to the parts of the human body. All those keypoints and joints are used to assemble the two-dimensional structure of the human body. In this work, we have used the OpenPose framework for estimating the pose from the input image.

In OpenPose, the image is passed through a CNN backbone to extract features from the input. The feature map is then processed by sequential multi-stage CNN layers to generate Part Affinity Fields (PAFs) and confidence maps. The part affinity fields and confidence maps generated above are passed through a bipartite graph matching algorithm to recover the human posture in the image. Figure 2 shows the OpenPose pipeline.

Figure 2: OpenPose Pipeline

3.1.1. Part Affinity Field Maps (L)

A part affinity field contains two-dimensional vectors that encode the positions and orientations of body parts in an image; it encodes the data as pairwise links between body parts.

L = (L1, L2, ..., LC), Lc ∈ R^(w×h×2), c ∈ {1, ..., C}   (1)

where C is the total number of limbs, R is the set of real numbers, L is the set of part affinity field maps, and w × h is the dimension of each map in the set L.

3.1.2. Confidence Map

A confidence map is a two-dimensional representation of the belief that a particular body part is located at a specific pixel.

S = (S1, S2, ..., SJ), Sj ∈ R^(w×h), j ∈ {1, ..., J}   (2)

where J is the total number of body parts, R is the set of real numbers, and S is the set of confidence maps.

The number of keypoints detected by OpenPose depends on the dataset on which it has been trained. In this work, the COCO model with 18 body keypoints (see Figure 3) is used: R_Ankle, R_Knee, R_Wrist, L_Wrist, R_Shoulder, L_Shoulder, L_Ankle, L_Ear, R_Ear, R_Elbow, L_Elbow, L_Knee, L_Eye, R_Eye, R_Hip, L_Hip, Nose, and Neck.

Figure 3: OpenPose Keypoints

3.2. Activity Classification

We formulate activity classification as a multiclass classification problem, which can be modeled using various machine learning regression and classification algorithms. The classification algorithm takes the 18 body keypoints (the x- and y-coordinates of each point) as input for training and testing our model. We used a supervised learning approach, as our dataset contains body keypoints paired with an activity label. Among all the algorithms, multiple logistic regression and random forest provide significantly greater accuracy.
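To make the handoff from pose estimation to classification concrete, here is a minimal Python sketch of the feature construction, assuming each image yields one (18, 2) keypoint array. The `estimate_pose` wrapper named in the final comment is hypothetical: it stands in for whichever OpenPose binding is used to obtain that array and is not part of the paper.

```python
import numpy as np

NUM_KEYPOINTS = 18  # COCO body model used by OpenPose (see Figure 3)

def pose_to_features(keypoints):
    """Flatten one detected pose into the 36-value classifier input.

    `keypoints` is an (18, 2) array holding the (x, y) pixel position of
    each body part, in the keypoint order listed above.
    """
    kp = np.asarray(keypoints, dtype=float).reshape(NUM_KEYPOINTS, 2)
    return kp.ravel()  # [x_0, y_0, x_1, y_1, ..., x_17, y_17]

# `estimate_pose` is a hypothetical wrapper around the OpenPose API that
# returns the (18, 2) keypoint array for the person in a given image:
# X = np.stack([pose_to_features(estimate_pose(img)) for img in images])
```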
4. Dataset

OpenPose uses the COCO keypoint detection dataset for the pose estimation task, which contains more than 200K images labeled with keypoints [14]. For the classification task, we collected images from Google, and some photos were taken with a smartphone camera. We prepared our dataset from approximately 1000 images covering five activity categories: sitting, standing, running, dancing, and laying. Each activity category has more than 170 images. We divided the dataset into training and testing sets in the ratio 90:10.

The data collected from different sources contains images of unequal width and height, while our model requires a uniform size. We therefore resized all images to a fixed size of 432×368 pixels before extracting the keypoints from them.
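A minimal sketch of this preparation step, assuming OpenCV for the resizing and scikit-learn for the 90:10 split; the fixed random seed is our addition, not a detail given in the paper.

```python
import cv2
from sklearn.model_selection import train_test_split

TARGET_W, TARGET_H = 432, 368  # fixed input size used in this work

def load_resized(path):
    """Read one image from disk and resize it to 432x368 pixels."""
    img = cv2.imread(path)
    if img is None:
        raise IOError("could not read " + path)
    return cv2.resize(img, (TARGET_W, TARGET_H))

# With X the (n_samples, 36) keypoint matrix built from the resized
# images and y the activity labels, the 90:10 split described above is:
# X_train, X_test, y_train, y_test = train_test_split(
#     X, y, test_size=0.10, random_state=42)
```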
5. Evaluation Metrics

For performance evaluation, recall, precision, and F1-score are used in this experiment. We also show the confusion matrices of some classifiers.

5.1. Precision

Precision (P) is the ratio of the number of true positives (Tp) to the sum of false positives (Fp) and true positives. It can also be understood as: of the images classified into a class, how many actually belong to that class.

P = Tp / (Tp + Fp)   (3)

5.2. Recall

Recall (R) is the ratio of the number of true positives (Tp) to the sum of false negatives (Fn) and true positives. It can also be understood as: of the images that belong to a class, how many are classified into that class.

R = Tp / (Tp + Fn)   (4)

5.3. F1-Score

The F1-score is the harmonic mean of recall and precision, calculated by Eq. 5.

F1-Score = 2(P × R) / (P + R)   (5)

5.4. Confusion Matrix

The confusion matrix is a two-dimensional matrix used to measure the overall performance of a classification algorithm. In the matrix, each row is associated with the predicted activity class, and each column is associated with the actual activity class. The matrix compares the target activity with the activity predicted by the model, which gives a better idea of the types of errors the classifier makes.

6. Experiments and Result

The following five activities are considered for pose estimation and activity recognition and classification: sitting, standing, dancing, laying, and running. The experiments are conducted with scikit-learn (0.23.1) and Python (3.6.6) on a Windows 10 system with an Intel i5 processor at 3.40 GHz and 8 GB RAM, using five classification algorithms for activity classification. These algorithms are described below with their confusion matrices. The performance results are provided in Table 1, which shows the recall, precision, and F1-score of the various classifiers used in the proposed approach.

Table 1
Performance Evaluation on Different Classifier Algorithms

                 Precision (%)   Recall (%)   F1-Measure (%)
Logistic             80.72          81.47          80.95
KNN                  77.90          77.89          77.12
SVM                  80.43          81.14          80.46
Decision Tree        74.49          75.80          73.50
Random Forest        80.75          80.34          79.43

6.1. Classification Algorithm

6.1.1. Logistic Regression

This algorithm is based on supervised learning and is used for classification problems. In this work, multiple logistic regression is used for classifying activities. 'sag' is used as the solver; it supports only L2 regularization (in the primal formulation) or no regularization. Dummy variables are used to represent the categorical outcome. The confusion matrix is shown in Figure 4.

Figure 4: Confusion Matrix (Logistic Regression)

6.1.2. K-Nearest Neighbors

K-nearest neighbors (KNN) is a supervised machine learning algorithm used for classification, and it is a non-parametric, lazy algorithm. Despite this simplicity, we obtained very competitive results, which is one reason for using this algorithm in our work. We tried different values of k and obtained the highest accuracy at k = 5. The distance function d used in this algorithm is given in Eq. 6, and the confusion matrix is shown in Figure 5.

d(p, q) = √(∑i (qi − pi)²)   (6)

where p and q are vectors containing the keypoints of two different images and i = 1, ..., n.

Figure 5: Confusion Matrix (KNN)

6.1.3. Support Vector Machine

The support vector machine (SVM) also comes under supervised learning algorithms and is mainly used for classification and regression problems. All available data points are plotted in the feature space, and classification is done by finding a hyperplane that best separates the two classes. The confusion matrix is provided in Figure 6.

Figure 6: Confusion Matrix (SVM)

6.1.4. Decision Tree

The decision tree comes under supervised learning. It is a powerful and widely accepted tool for prediction and classification. The algorithm learns to predict a target pose's activity by making decisions from previously seen training data. Predictions for activities are made starting from the root of the tree, where a record's attribute value is compared with the root attribute value. The confusion matrix is given in Figure 7.

Figure 7: Confusion Matrix (Decision Tree)

6.1.5. Random Forest

Random decision forest is a supervised learning algorithm and an ensemble learning method for classification and regression. It is also one of the most used and popular algorithms because it gives good results without hyper-parameter tuning. It creates multiple decision trees and selects the best solution by voting. We use a random forest because it predicts activities with good accuracy and runs efficiently even on big datasets. The confusion matrix is shown in Figure 8.

Figure 8: Confusion Matrix (Random Forest)
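Given this setup, the comparison reduces to a short scikit-learn loop. The sketch below is ours, not the authors' released code: it assumes X_train, X_test, y_train, y_test come from the 90:10 split of Section 4, sets only the hyper-parameters named in the text (the 'sag' solver, k = 5, plus a raised max_iter as an assumption so 'sag' converges), and uses weighted averaging of the per-class scores, since the paper does not state how Table 1's figures were aggregated.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support, confusion_matrix

# The five classifiers of Table 1; anything not named in the text is
# left at the scikit-learn defaults.
classifiers = {
    "Logistic":      LogisticRegression(solver="sag", max_iter=1000),
    "KNN":           KNeighborsClassifier(n_neighbors=5),  # Euclidean distance, Eq. 6
    "SVM":           SVC(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)        # 36-value pose vectors as input
    y_pred = clf.predict(X_test)
    p, r, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, average="weighted")  # Eqs. 3-5 per class, then averaged
    print("%-14s P=%.2f%% R=%.2f%% F1=%.2f%%"
          % (name, 100 * p, 100 * r, 100 * f1))
    print(confusion_matrix(y_test, y_pred))  # as in Figures 4-8
```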
7. Conclusion

In this study, we proposed an approach for human activity recognition from still images by extracting the skeletal coordinate information (pose) using the OpenPose API and then utilizing this pose information to classify the activity with the help of a supervised machine learning algorithm. We prepared our own dataset for this work, which contains five different activities, viz. sitting, standing, laying, dancing, and running. We used five algorithms (Logistic Regression, SVM, KNN, Random Forest, and Decision Tree) to find the best results for our model. From our experimental results, we observed that multiple logistic regression, SVM, and random forest show the highest accuracies of 80.72%, 80.43%, and 80.75%, respectively, while the other two algorithms, KNN and decision tree, underperform. We show the accuracies of some recent research on HAR in Table 2.

Table 2
Comparative Study

S.No.  Authors and Year          Dataset                    Activities                      Model Used                   Accuracy (%)
1      Nandy et al., 2019 [12]   Accelerometer and heart    Walking, climbing stairs,       Multilayer Perceptron        77.0
                                 rate sensor                sitting, running                Linear Regression            53.92
                                                                                            Gaussian Naïve Bayes         73.73
                                                                                            Decision Tree                93.54
2      Ghazal et al., 2018 [15]  Images from the internet   Sitting on the chair or         Decision-making algorithm    95.2
                                                            ground                          with feedforward CNN
3      Gatt et al., 2019 [16]    COCO keypoints             Abnormal activity such as       Pre-trained models of        93
                                                            fall detection                  PoseNet and OpenPose

Although much research has already been done to deal with the activity recognition problem, more convincing progress must still be made. In practice, there are many different activities that humans perform in everyday life. Detecting all of them is not an easy task because it requires a very large dataset to train the model. The dataset is not the only problem: the definition and diversity of activities also make it more complicated for machines to understand. More activities can be added to extend the scope and usefulness of the work in the future. Besides adding activities, data preprocessing techniques can be applied to handle missing keypoints of the body, as sketched below. We can also experiment with other machine learning algorithms that may provide better results.
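As one illustration of the missing-keypoint preprocessing suggested above (an option we sketch here, not something evaluated in this paper), body parts that OpenPose fails to detect come back as zero coordinates; these can be treated as missing values and imputed, for example with scikit-learn's SimpleImputer:

```python
from sklearn.impute import SimpleImputer

# OpenPose reports a body part it could not detect as (0, 0).  Treating
# those zeros as missing values, each absent coordinate is replaced by
# the per-feature mean observed over the training poses.
imputer = SimpleImputer(missing_values=0.0, strategy="mean")
X_train_filled = imputer.fit_transform(X_train)  # fit on training data only
X_test_filled = imputer.transform(X_test)
```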
8. References

[1] A. Gupta, K. Gupta, K. Gupta and K. Gupta, "A Survey on Human Activity Recognition and Classification," 2020 International Conference on Communication and Signal Processing (ICCSP), Chennai, India, 2020, pp. 0915-0919, doi: 10.1109/ICCSP48568.2020.9182416.
[2] G. Sharma, F. Jurie and C. Schmid, "Expanded Parts Model for Human Attribute and Action Recognition in Still Images," 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, 2013, pp. 652-659, doi: 10.1109/CVPR.2013.90.
[3] A. Gupta, A. Kembhavi and L. S. Davis, "Observing Human Object Interactions: Using Spatial and Functional Compatibility for Recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 10, pp. 1775-1789, Oct. 2009, doi: 10.1109/TPAMI.2009.83.
[4] Y. Wang, H. Jiang, M. S. Drew, Z.-N. Li and G. Mori, "Unsupervised Discovery of Action Classes," 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), New York, NY, USA, 2006, pp. 1654-1661, doi: 10.1109/CVPR.2006.321.
[5] J. Shotton et al., "Real-time human pose recognition in parts from single depth images," CVPR 2011, Providence, RI, 2011, pp. 1297-1304, doi: 10.1109/CVPR.2011.5995316.
[6] B. M. V. Guerra, S. Ramat, R. Gandolfi, G. Beltrami and M. Schmid, "Skeleton data preprocessing for human pose recognition using Neural Network," 2020 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Montreal, QC, Canada, 2020, pp. 4265-4268, doi: 10.1109/EMBC44109.2020.9175588.
[7] B. Reily, Q. Zhu, C. Reardon and H. Zhang, "Simultaneous Learning from Human Pose and Object Cues for Real-Time Activity Recognition," 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 2020, pp. 8006-8012, doi: 10.1109/ICRA40945.2020.9196632.
[8] H. Yan, B. Hu, G. Chen and E. Zhengyuan, "Real-Time Continuous Human Rehabilitation Action Recognition using OpenPose and FCN," 2020 3rd International Conference on Advanced Electronic Materials, Computers and Software Engineering (AEMCSE), Shenzhen, China, 2020, pp. 239-242, doi: 10.1109/AEMCSE50948.2020.00058.
[9] E. Bulbul, A. Cetin and I. A. Dogru, "Human Activity Recognition Using Smartphones," 2018 2nd International Symposium on Multidisciplinary Studies and Innovative Technologies (ISMSIT), Ankara, 2018, pp. 1-6, doi: 10.1109/ISMSIT.2018.8567275.
[10] R. Liu, T. Chen and L. Huang, "Research on human activity recognition based on active learning," 2010 International Conference on Machine Learning and Cybernetics, Qingdao, 2010, pp. 285-290, doi: 10.1109/ICMLC.2010.5581050.
[11] K.-S. Lee, S. Chae and H.-S. Park, "Optimal Time-Window Derivation for Human-Activity Recognition Based on Convolutional Neural Networks of Repeated Rehabilitation Motions," 2019 IEEE 16th International Conference on Rehabilitation Robotics (ICORR), Toronto, ON, Canada, 2019, pp. 583-586, doi: 10.1109/ICORR.2019.8779475.
[12] A. Nandy, J. Saha, C. Chowdhury and K. P. D. Singh, "Detailed Human Activity Recognition using Wearable Sensor and Smartphones," 2019 International Conference on Opto-Electronics and Applied Optics (Optronix), Kolkata, India, 2019, pp. 1-6, doi: 10.1109/OPTRONIX.2019.8862427.
[13] R. Saini and V. Maan, "Human Activity and Gesture Recognition: A Review," 2020 International Conference on Emerging Trends in Communication, Control and Computing (ICONC3), Lakshmangarh, Sikar, India, 2020, pp. 1-2, doi: 10.1109/ICONC345789.2020.9117535.
[14] Z. Cao, T. Simon, S. Wei and Y. Sheikh, "Realtime Multi-person 2D Pose Estimation Using Part Affinity Fields," 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, 2017, pp. 1302-1310, doi: 10.1109/CVPR.2017.143.