Research on human activity recognition based on image classification methods

Aistė Štulienė
Faculty of Informatics
Kaunas University of Technology
Kaunas, Lithuania
e-mail: aiste.stuliene@ktu.edu

Agnė Paulauskaitė-Tarasevičienė
Department of Applied Informatics, Faculty of Informatics
Kaunas University of Technology
Kaunas, Lithuania
e-mail: agne.paulauskaite-taraseviciene@ktu.lt

Abstract–Human activity recognition is a significant component of many innovative systems based on human behavior. The ability to recognize various human activities enables the development of intelligent control systems. Usually the task of human activity recognition is mapped to the task of classifying images that represent a person's actions. This paper addresses the problem of classifying human activities using various machine learning methods such as Convolutional Neural Networks, the Bag of Features model, Support Vector Machine and K-Nearest Neighbors. It provides a comparative study of these methods applied to the human activity recognition task using a set of images representing five different categories of daily life activities. The usage of wearable sensors that could improve the classification results of human activity recognition is beyond the scope of this research.

Keywords–activity recognition; machine learning; CNN; BoF; KNN; SVM

I. INTRODUCTION

Recently the human activity recognition problem has become a significant matter of research. In most cases it has a very explicit practical applicability: human activity recognition is an integral part of human behavior-based systems. Nowadays, smart home technologies are getting a lot of attention because of the better care of residents they provide, which is extremely important for elderly, children or disabled people [1]. Smart home solutions, health monitoring equipment and surveillance systems can be indicated as typical examples of such systems [2], [3], [4]. Nevertheless, there is a huge variety of specific application areas, namely anomalous behaviour detection, unhealthy habits prevention or condition tracking [5].

Nowadays, the primitive partition of human activity into static postures and dynamic motions is not sufficient. One of the key requirements of smart system technologies for human activity recognition is the ability to identify the current activity from the wide range of possible indoor activities. Fully autonomous and barely noticeable assisting systems are becoming more appropriate for daily use than equipment based on wearable sensors or appliances [6], [3]. Accelerometers, gyroscopes and magnetometers have been substantiated as the most informative sensors in sensor-based recognition systems [7], [8]. Such techniques as radar, I/R or microwave, and depth cameras have been widely used to obtain images [9], [10]. Commercial products such as Nintendo's Wii or Microsoft's Kinect are good examples of such devices [11]. Although these products have been partially successful, their deployment is not practical, as it limits the mobility area of the human (e.g., public areas are excluded). Furthermore, wearable motion sensors make the human's movement cumbersome. Additionally, the installation and maintenance of the sensors usually cause high costs. For these reasons, the more practical solutions rely on the combination of video monitoring devices and image classification methods.

Various machine learning technologies are applied to image recognition tasks. Therefore, a major challenge in human activity recognition is to evaluate the reliability of the selected technologies. Considering this fact, it is necessary to compare the experimental results obtained using different machine learning approaches. In this paper, four different methods have been chosen for the experiments: Convolutional Neural Networks (CNNs), Bag of Features (BoF), Support Vector Machine (SVM) and K-Nearest Neighbors (KNN). Using the same set of images representing human daily life activities, these methods have been applied for image classification into five categories.

II. IMAGE CLASSIFICATION

The general schema of human activity classification using all four methods mentioned above is presented in Fig. 1.

Fig. 1. The general architecture of image classification using machine learning methods: a set of images is selected per category (I..N), preprocessed (resize, B&W, RGB to LAB), passed to the applied machine learning method (CNN, BoF, SVM or k-NN) and classified into the output categories.

Depending on the machine learning method, different requirements are imposed on the images. For example, using a CNN, all images must be of the same size, which is usually relatively small (e.g., 224×224×3).
                                                                     8
Copyright © 2017 held by the authors
A KNN classifier may be enhanced by converting images from the RGB color model to the LAB color model, which makes visual differences between colors quantifiable and may lead to better results. The SVM algorithm is used for image classification after RGB images have been converted to grayscale and then to binary images.
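For illustration, a minimal MATLAB sketch of this per-method image preparation follows; the file name, target size and thresholding choice are illustrative assumptions, not the exact settings used in the experiments:

    % Per-method image preparation (a sketch; names and sizes are assumptions)
    rgb   = imread('activity.jpg');            % an input image (assumed file name)
    cnnIn = imresize(rgb, [224 224]);          % CNN: fixed input size, e.g. 224x224x3
    labIn = rgb2lab(rgb);                      % KNN variant: RGB -> LAB color model
    gray  = rgb2gray(rgb);                     % SVM variant: RGB -> grayscale ...
    bw    = im2bw(gray, graythresh(gray));     % ... -> binary image (Otsu threshold)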
A. Convolutional Neural Networks

CNN is a deep learning model that obtains complicated hierarchical features via convolution operations alternating with sub-sampling operations on the raw input images. Convolutional neural networks have become one of the most widespread deep learning models and have shown very high accuracy in various image recognition tasks [12], [13]. For human activity recognition tasks, CNNs are usually tested on a popular group of research activity categories (walking, jogging, running, boxing, waving and clapping) and can achieve more than 90% accuracy [14], [15]. However, in most cases the solutions based on CNNs employ additional sophisticated sensors [16], [17]. Signals received from the accelerometer and gyroscope are transformed into a new activity image that contains hidden relations between any pair of signals. Using CNNs, additional discriminative features suited for human activity recognition are automatically extracted and learned [18].

Fig. 2. A typical architecture of CNN: an input layer, alternating convolution and max-pooling layers, fully connected layers, and a Softmax output layer.

A general CNN architecture consists of several convolutional, pooling and fully connected layers (Fig. 2). A convolutional layer computes the output of neurons that are connected to local regions in the input. A pooling layer reduces the spatial size of the representation in order to reduce the amount of parameters and computation in the network. All these layers are followed by fully connected layers leading into Softmax, which is the final classifier.

Images of the same size $a \times a \times b$ (where $a$ is the height and width of the image and $b$ is the number of channels) are passed as the input to a convolutional layer. When an RGB image is used, $b$ is equal to 3. The convolutional layer has $m$ kernels (or filters) of size $c \times c \times d$, where $c$ is smaller than $a$.

The neurons of the convolutional layer are connected to sub-regions of the input image (for the first convolutional layer) or of the output of the previous layer. A feature map is formed when a filter moves along the input and uses the same set of weights and the same bias for the convolution. If $l$ is a convolutional layer, the $i$-th feature map $Y_i^{(l)}$ is defined by the formula:

$$Y_i^{(l)} = B_i^{(l)} + \sum_{j=1}^{m_1^{(l-1)}} K_{i,j}^{(l)} * Y_j^{(l-1)} \qquad (1)$$

where $B_i^{(l)}$ is a bias matrix, $K_{i,j}^{(l)}$ is the filter connecting the $j$-th feature map in layer $(l-1)$ with the $i$-th feature map in layer $l$, and $m_1^{(l-1)}$ is the number of feature maps in layer $l-1$.

The convolutional layer is followed by an activation function. The rectified linear unit is represented by a ReLU layer. ReLU is the function defined as:

$$Y_i^{(l)} = \max(0, Y_i^{(l-1)}) \qquad (2)$$

that is,

$$Y_i^{(l)} = Y_i^{(l-1)}, \quad \text{when } Y_i^{(l-1)} \ge 0 \qquad (3)$$

$$Y_i^{(l)} = 0, \quad \text{when } Y_i^{(l-1)} < 0 \qquad (4)$$

A cross-channel normalization (local response normalization) layer follows the ReLU layer. This layer replaces every element with a normalized value. The normalized value $x'$ for each element $x$ is defined as:

$$x' = \frac{x}{\left(K + \frac{\alpha \cdot s}{windowChannelSize}\right)^{\beta}} \qquad (5)$$

where $K$, $\alpha$ and $\beta$ are hyper-parameters of the normalization and $s$ is the sum of squares of the elements in the normalization window [19]. The expression can be detailed as:

$$b_{x,y}^{(i)} = \frac{a_{x,y}^{(i)}}{\left(K + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \big(a_{x,y}^{(j)}\big)^2\right)^{\beta}} \qquad (6)$$

where $b_{x,y}^{(i)}$ is the response-normalized activity, $a_{x,y}^{(i)}$ is the activity of a neuron computed by applying kernel $i$ at position $(x,y)$ and then applying the ReLU nonlinearity, $n$ is the number of adjacent kernel maps at the same spatial position, and $N$ is the total number of kernels in the layer.

Pooling layers follow convolutional layers and summarize the outputs of neighboring groups of neurons in the same kernel map. The neighborhoods summarized by adjacent pooling units do not overlap. A max-pooling layer returns the maximum values of rectangular regions of the input and, respectively, an average-pooling layer returns the average values.

The convolutional layers are followed by a certain number of fully connected layers, whose aim is to detect larger patterns using combinations of the features learned in the previous layers. In order to classify the images, the last fully connected layer combines the identified patterns. The final fully connected layer is followed by a Softmax layer and a classification (output) layer. In the classification layer, the network takes the values from the Softmax function and assigns each input to one of the classes.
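To make formulas (1)–(6) concrete, a minimal MATLAB sketch follows; the sizes and hyper-parameter values are illustrative assumptions, not those of any particular architecture used later:

    % Sketch of formulas (1)-(6); all sizes and hyper-parameters are
    % illustrative assumptions, not values from AlexNet, CaffeRef or VGG.
    a = 8; b = 3; c = 3; m = 4;                 % input size/channels, kernel size, kernel count
    X = rand(a, a, b);                          % input image of size a-by-a-by-b
    K = rand(c, c, b, m);                       % m kernels of size c-by-c-by-b
    B = rand(1, m);                             % one bias per feature map (scalar here)

    Y = zeros(a - c + 1, a - c + 1, m);         % feature maps ('valid' convolution)
    for i = 1:m                                 % formula (1): bias + sum over input maps
        Yi = B(i);
        for j = 1:b
            Yi = Yi + conv2(X(:,:,j), K(:,:,j,i), 'valid');
        end
        Y(:,:,i) = max(0, Yi);                  % formulas (2)-(4): ReLU
    end

    kpar = 2; alpha = 1e-4; beta = 0.75; n = 2; % assumed K, alpha, beta, window size n
    N = size(Y, 3);
    Ynorm = zeros(size(Y));
    for i = 1:N                                 % formula (6), with 1-based channel indices
        win = max(1, i - floor(n/2)) : min(N, i + floor(n/2));
        s = sum(Y(:,:,win).^2, 3);              % sum of squares over the channel window
        Ynorm(:,:,i) = Y(:,:,i) ./ (kpar + alpha * s).^beta;
    end

The pooling and fully connected stages are omitted here for brevity; they reduce the spatial size of Ynorm and map the resulting features to class scores as described above.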

Three CNN architectures have been selected for the experiments in this paper: AlexNet [19], CaffeRef [20] and VGG [21]. These architectures have the same number of layers but different input requirements for image size: AlexNet and CaffeRef require the size of 227×227×3, while VGG accepts the size of 224×224×3. The first convolutional layer filters the input image with 96 kernels of size 11×11×3 when AlexNet or CaffeRef is used and with 64 kernels of size 11×11×3 when VGG is used. The second convolutional layer uses kernels of size 5×5×d, where d is equal to 48 for AlexNet and CaffeRef and to 64 for the VGG architecture. Further layers filter their inputs with m kernels of size 3×3×d, where d increases; the exact values of d and m depend on the selected architecture.
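As a brief illustration of how such a pretrained architecture can be applied in the MATLAB environment used in this paper, the following MatConvNet [33] sketch classifies one image; the model file name and the meta fields assume one of the pretrained models distributed with MatConvNet, and the image name is an assumption:

    % Applying a pretrained CNN with MatConvNet (a sketch; the model file
    % is an assumption and must be downloaded separately).
    net = load('imagenet-caffe-alex.mat');            % pretrained AlexNet-style model
    im  = imread('activity.jpg');
    im_ = single(imresize(im, net.meta.normalization.imageSize(1:2)));
    im_ = im_ - net.meta.normalization.averageImage;  % subtract the mean image
    res = vl_simplenn(net, im_);                      % forward pass through all layers
    scores = squeeze(res(end).x);                     % Softmax outputs
    [~, label] = max(scores);                         % predicted class index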
B. Bag of Features

Bag of Features encodes image features into a representation suitable for image classification. This technique is also often referred to as Bag of Words, because it treats image features as visual words. The features (which sometimes can be general, such as color, texture or shape) are used to find the similarities between images (Fig. 3).

Fig. 3. Bag of Features for image recognition: features are extracted from the images, encoded as histograms of visual words and passed to the classification step.

BoF has shown promising results (over 80% accuracy) in action recognition tasks on video sequences [22], [23]. The typical group of sport-type activities (jumping, walking, running) is used to evaluate the performance of BoF, and the results prove that better accuracy can be achieved in combination with other classification methods or additional techniques [24], [25].
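A minimal sketch of such a BoF pipeline in MATLAB, using Computer Vision System Toolbox functions; the folder layout (one sub-folder per category) and folder names are assumptions:

    % BoF pipeline (a sketch; folder names are assumptions)
    trainSets = imageSet('trainingImages', 'recursive');  % one imageSet per category
    bag = bagOfFeatures(trainSets);                       % build the visual vocabulary
    classifier = trainImageCategoryClassifier(trainSets, bag);
    testSets = imageSet('testImages', 'recursive');
    confMat = evaluate(classifier, testSets);             % per-category confusion matrix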
                                                                                         equal to the amount of categories). Confusion matrix is such
     C. Support Vector Machine                                                           that the element 𝑀𝑖𝑗 is the amount of instances from category i
     Support Vector Machine (SVM) belongs to the class of                                that were actually classified as category j [6].
machine learning algorithms called kernel methods. It is one of                                   TABLE I. The confusion matrix for binary classification
the best known methods in pattern classification and image
                                                                                                                                 Predicted category
classification. The SVM method was designed to be applied
only for two-class problems. In the context of human activity                                                                     NO            YES
classification problem, usually there are more than two possible                                  Actual          NO              TN             FP
classes (categories). Depending on this fact, it is extremely                                    category         YES             FN             TP
important to use modified SVM, which can be applied for
multiclass classification. Two main approaches have been                                      The confusion matrix for binary classification contains four
suggested to solve this problem [26]. The first one is called “one                       elements (TABLE I): True Positives (TP) represent the amount
against all”. In this approach, a set of binary classifiers is trained                   of positive instances that were classified as positive; True
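Both multiclass strategies can be sketched in MATLAB as follows, assuming feature vectors X with labels Y and test features Xtest are given:

    % Multiclass SVM via error-correcting output codes (a sketch;
    % X, Y and Xtest are assumed to be given feature data).
    mdlOVA = fitcecoc(X, Y, 'Coding', 'onevsall');   % "one against all"
    mdlOVO = fitcecoc(X, Y, 'Coding', 'onevsone');   % "one against one"
    predicted = predict(mdlOVA, Xtest);              % resulting class per test image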
D. K-Nearest Neighbors

K-Nearest Neighbors is a machine learning algorithm that is often used for classifying objects based on the most similar training samples in the feature space. The classification is based on the distance between a set of input data points and the training points. Various metrics can be used to determine the distance (Euclidean distance, Mahalanobis distance, Spearman distance, etc.). Given a set A of n points and a distance function, KNN search finds the k points in A closest to a set of query points. This algorithm is widely used in image processing and classification tasks.

The objects are classified according to the features of their k nearest neighbors by majority vote. The training process consists of storing the feature vectors and labels of the training images. During classification, an unlabelled query point is simply assigned the label shared by the majority of its k nearest neighbors.

The performance of KNN for the classification of human activities has been examined in particular using uni-axial sensors (sternum, wrist, thigh and lower leg) [29]. Other studies based on KNN for human activity recognition have also shown rather good accuracy results (> 90%). However, it can be concluded that high accuracy of human activity recognition based on this method is achieved when additional equipment (i.e., wearable sensors) is used [30].
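A minimal MATLAB sketch of this classification scheme, assuming feature vectors X, labels Y and test features Xtest are given; k = 5 and the Euclidean metric are illustrative choices:

    % KNN classification on image feature vectors (a sketch; X, Y, Xtest
    % and the parameter values are assumptions).
    mdl = fitcknn(X, Y, 'NumNeighbors', 5, 'Distance', 'euclidean');
    predicted = predict(mdl, Xtest);    % majority vote of the 5 nearest neighbors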
E. Accuracy Evaluation

Human activity classification results for a particular method are often represented as a confusion matrix $M_{n \times n}$ (n is equal to the number of categories). The confusion matrix is such that the element $M_{ij}$ is the number of instances from category i that were actually classified as category j [6].

TABLE I. THE CONFUSION MATRIX FOR BINARY CLASSIFICATION

                              Predicted category
                              NO        YES
    Actual category     NO    TN        FP
                        YES   FN        TP

The confusion matrix for binary classification contains four elements (TABLE I): True Positives (TP) represent the number of positive instances that were classified as positive; True Negatives (TN) represent the number of negative instances that were classified as negative; False Positives (FP) represent the number of negative instances that were classified as positive; False Negatives (FN) represent the number of positive instances that were classified as negative.

Accuracy is a widely used metric for the generalization of classification results. This metric is defined by the formula:

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \qquad (7)$$

The confusion matrix and the accuracy can also be used for n categories, where n should be more than 1 (see TABLE II – TABLE VII). In this case, an instance counts as positive or negative with respect to the particular category, e.g., the positives might be all instances of category II (e.g., sleeping) while the negatives would be all instances other than category II (e.g., other than sleeping).
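A short MATLAB sketch of the confusion matrix, formula (7) and the one-vs-rest view described above; the label vectors are illustrative:

    % Confusion matrix and accuracy (a sketch with illustrative labels).
    actual    = [1 1 2 2 3 3 3 1]';          % ground-truth category per instance
    predicted = [1 2 2 2 3 1 3 1]';          % classifier output per instance
    M = confusionmat(actual, predicted);     % n-by-n confusion matrix M
    accuracy = sum(diag(M)) / sum(M(:));     % overall accuracy over all categories
    % One-vs-rest elements for category i, then formula (7):
    i = 2;
    TP = M(i,i);            FN = sum(M(i,:)) - TP;
    FP = sum(M(:,i)) - TP;  TN = sum(M(:)) - TP - FN - FP;
    acc_i = (TP + TN) / (TP + TN + FP + FN);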
III. EXPERIMENTS

A. The Categories of Human Activities

Despite a considerable amount of scientific research, human activity recognition from images alone is still a very challenging task due to background clutter, viewpoint, lighting, appearance and a wide range of other aspects. Moreover, the similarities between different human actions make the classification even more challenging. The same activity may be performed by people who have completely different appearance, body movements, postures and habits [31]. These criteria affect the way people perform a particular action; consequently, it becomes quite complicated to define the activity. Changing lifestyles and small or modern accommodations affect how home areas are used: the rooms are often used beyond their primary purpose (e.g., the resident can work with a computer in the kitchen or eat in the bedroom). Home appliances, computers, mobile devices and other objects around the person are in most cases not connected with the current activity. Even if the resident uses them at the moment, due to changing technologies and trends they can be barely noticeable or recognizable. Considering the unsolved problems of human activity recognition based on image classification methods, further theoretical and practical studies need to be carried out in order to improve the results or reject inadequate solutions.

In this paper an experimental scenario including five possible categories depending on the type of human activity has been created (Fig. 4). The activities are supposed to be performed in home or office areas. Category I relates to the situation when people are communicating. Category II is assigned to the situation when people are sleeping or having a rest. Category III represents empty spaces (a human staying only temporarily in the selected area). Human work at a computer, reading, writing or studying is assigned to category IV. Any type of eating or drinking activity is assigned to category V. Differently from the images of common activities in various recognition tasks, the images representing these activities include all the other objects naturally appearing while the particular activity is performed. Therefore, the accuracy of the expected results may not be as high as reported in previous research (especially where additional techniques or methods are included).

Fig. 4. Examples of five different categories of human activities: a) represents category I, b) category II, c) category III, d) category IV and e) category V.

B. Experimental Results

Image datasets containing 502 images for each category of human activity have been collected. Each dataset has been split into a training set (which contains 400 images for each category) and a test set (which contains 102 images for each category). The data for training and testing have been chosen randomly from the primary datasets.

The experiments on the classification of human activities have been implemented using MATLAB software and Add-Ons [32]. The implementation of CNNs has been accomplished using MatConvNet [33], which is an open source implementation of CNNs in the MATLAB environment. There are also specific software and hardware requirements for the implementation of CNNs, such as MATLAB 2015a (or a later version), a C/C++ compiler, and a computer with a CUDA-enabled NVIDIA GPU with compute capability 2.0 or above.

The estimated classification accuracy of the human activity recognition task using the different image classification methods is presented in TABLE II – TABLE VII. The average accuracy of KNN is the worst one and approaches 40.98% (although it is more than twice the probability of choosing the correct class randomly). The difference between the average accuracy of SVM and that of Bag of Features is less than 9%: the values are 59.61% and 68.24%, respectively (the probability of choosing the correct class using one of these methods is more than 0.5). The CNN architectures (AlexNet, CaffeRef and VGG) provide very similar results; nevertheless, the average accuracy of AlexNet is the best one and approaches 90.78%.
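The random per-category split described above could be reproduced with a sketch like the following; the folder name is an assumption:

    % Random 400/102 split for one category (a sketch; the folder name
    % 'categoryI' is an assumption).
    files = dir(fullfile('categoryI', '*.jpg'));   % 502 images expected
    idx   = randperm(numel(files));
    trainFiles = files(idx(1:400));                % 400 training images
    testFiles  = files(idx(401:end));              % remaining 102 test images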


TABLE II. CONFUSION MATRIX OF ALEXNET ARCHITECTURE (AVERAGE ACCURACY: 90.78%)

    Actual \ Predicted     I.    II.   III.   IV.    V.   Total
    I.                     93     4     2      3     0     102
    II.                     7    87     2      2     4     102
    III.                    1     1   100      0     0     102
    IV.                     3     3     0     93     3     102
    V.                      7     4     0      1    90     102

TABLE III. CONFUSION MATRIX OF CAFFEREF ARCHITECTURE (AVERAGE ACCURACY: 88.04%)

    Actual \ Predicted     I.    II.   III.   IV.    V.   Total
    I.                     86     9     1      3     3     102
    II.                     7    91     2      2     0     102
    III.                    0     1   101      0     0     102
    IV.                     8     6     1     82     5     102
    V.                      1     4     2      6    89     102

TABLE IV. CONFUSION MATRIX OF VGG ARCHITECTURE (AVERAGE ACCURACY: 88.43%)

    Actual \ Predicted     I.    II.   III.   IV.    V.   Total
    I.                     91     6     1      1     2     102
    II.                     8    89     3      2     1     102
    III.                    1     2    99      0     1     102
    IV.                     4     5     1     91     1     102
    V.                      5     5     0     10    81     102

TABLE V. CONFUSION MATRIX OF BOF (AVERAGE ACCURACY: 68.24%)

    Actual \ Predicted     I.    II.   III.   IV.    V.   Total
    I.                     71     9     8      2    12     102
    II.                    17    75     7      1     2     102
    III.                    6     9    82      1     4     102
    IV.                     7    16    19     47    13     102
    V.                      8     8     4      9    73     102

TABLE VI. CONFUSION MATRIX OF SVM (AVERAGE ACCURACY: 59.61%)

    Actual \ Predicted     I.    II.   III.   IV.    V.   Total
    I.                     59    23     6     10     4     102
    II.                    19    57    10      5    11     102
    III.                    6    12    67     12     5     102
    IV.                    10     8    12     58    14     102
    V.                      8     9     4     18    63     102

TABLE VII. CONFUSION MATRIX OF KNN (AVERAGE ACCURACY: 40.98%)

    Actual \ Predicted     I.    II.   III.   IV.    V.   Total
    I.                     56    13    10     13    10     102
    II.                    27    34    13     14    14     102
    III.                   18    13    39     16    16     102
    IV.                    12    11     9     44    26     102
    V.                     12    15    13     26    36     102

The experimental results have shown that the activities of category III yield the best classification results for all methods except KNN (Fig. 5).

Fig. 5. Comparison of image classification methods: per-category accuracy (in %) of AlexNet, CaffeRef, VGG, BoF, SVM and KNN for categories I–V.

The recognition of activities belonging to categories II, IV and V is the most complicated and therefore yields the worst classification results.

IV. CONCLUDING REMARKS

In this paper, research on different machine learning methods used to recognize human activities has been performed. Four classical machine learning methods have been selected for this research: CNNs, the BoF model, SVM and KNN. This paper provides a comparative study of the mentioned methods for human activity recognition from images alone, using five different categories of daily life activities. Issues related to wearable sensors or other additional techniques have not been considered. The obtained accuracy results satisfy our expectations, especially taking into account that the images representing these activities include all the other objects naturally appearing while the particular activity is performed. The average accuracy of image classification using BoF is 68.24%. The average accuracy using SVM is lower and approaches 59.61%. Based on the experimental results, we can conclude that KNN is not an appropriate method for human activity classification when such complicated pictures of activities are used and the classical KNN approach is applied without any improvements or technological supplements. The application of the different CNN architectures has revealed very similar high accuracy results, although AlexNet has reached more than 90% average accuracy, which is the best score of all the applied methods. Considering the obtained results, further studies are needed to analyze the eligibility of different and newly created CNN architectures for the solution of the image-based human activity classification problem.

REFERENCES

[1] T. van Kasteren, G. Englebienne, B. Krose, "An activity monitoring system for elderly care using generative and discriminative models," Personal and Ubiquitous Computing, vol. 14(6), pp. 489-498, 2010.
[2] Y. Liang, X. Zhou, Z. Yu, B. Guo, "Energy-efficient motion related activity recognition on mobile devices for pervasive healthcare," Mobile Networks and Applications, vol. 19(3), pp. 303-317, 2014.
[3] J. Iglesias, J. Cano, A. M. Bernardos, J. R. Casar, "A ubiquitous activity-monitor to prevent sedentariness," IEEE Conference on Pervasive Computing and Communications, pp. 667-680, 2011.
[4] S. Thomas, M. Bourobou, Y. Yoo, "User Activity Recognition in Smart Homes Using Pattern Clustering Applied to Temporal ANN Algorithm," Sensors, vol. 15(5), pp. 11953-11971, 2015.
[5] X. Zhu, Z. Liu, J. Zhang, "Human Activity Clustering for Online Anomaly Detection," Journal of Computers, vol. 6(6), pp. 1071-1079, 2011.
[6] O. D. Lara, M. A. Labrador, "A survey on human activity recognition using wearable sensors," IEEE Communications Surveys & Tutorials, vol. 15(3), pp. 1192-1209, 2013.
[7] P. Gupta, T. Dallas, "Feature Selection and Activity Recognition System using a Single Tri-axial Accelerometer," IEEE Trans. Biomed. Eng., pp. 1780-1786, 2014.
[8] L. Atallah, B. Lo, R. C. King, G.-Z. Yang, "Sensor positioning for activity recognition using wearable accelerometers," IEEE Transactions on Biomedical Circuits and Systems, vol. 5(4), pp. 320-329, 2011.
[9] M. A. A. H. Khan, et al., "RAM: Radar-based activity monitor," IEEE INFOCOM 2016 (Computer Communications), pp. 1-9, 2016.
[10] A. Dubois, F. Charpillet, "Human activities recognition with RGB-Depth camera using HMM," Conf. Proc. IEEE Eng. Med. Biol. Soc., 2013.
[11] J. Shotton, et al., "Real-time human pose recognition in parts from single depth images," IEEE Conference on Computer Vision and Pattern Recognition, 2011.
[12] K. He, X. Zhang, S. Ren, J. Sun, "Deep Residual Learning for Image Recognition," Computer Vision Foundation, pp. 770-778, 2015.
[13] M. D. Zeiler, R. Fergus, "Visualizing and understanding convolutional networks," Proceedings of ECCV, 2014.
[14] S. Ravimaran, R. Anuradha, "Survey of Action Recognition Methods for Human Activity Recognition," International Journal of Advanced Research in Computer Science and Software Engineering, vol. 6, pp. 284-284, 2016.
[15] M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, A. Baskurt, "Sequential Deep Learning for Human Action Recognition," International Workshop on Human Behavior Understanding, vol. 7065, Lecture Notes in Computer Science, pp. 29-39, 2011.
[16] J. B. Yang, M. N. Nguyen, P. P. San, X. L. Li, S. Krishnaswamy, "Deep Convolutional Neural Networks on Multichannel Time Series for Human Activity Recognition," Proceedings of the 24th International Conference on Artificial Intelligence, pp. 3995-4001, 2015.
[17] N. Y. Hammerla, S. Halloran, T. Plotz, "Deep, Convolutional, and Recurrent Models for Human Activity Recognition Using Wearables," Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pp. 1533-1540, 2016.
[18] W. Jiang, Z. Yin, "Human Activity Recognition Using Wearable Sensors by Deep Convolutional Neural Networks," Proceedings of the 23rd ACM International Conference on Multimedia, pp. 1307-1310, 2015.
[19] A. Krizhevsky, I. Sutskever, G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, pp. 1106-1114, 2012.
[20] Y. Jia, et al., "Caffe: Convolutional architecture for fast feature embedding," Proceedings of the 22nd ACM International Conference on Multimedia, pp. 675-678, 2014.
[21] K. Simonyan, A. Zisserman, "Very deep convolutional networks for large-scale image recognition," ICLR, 2014.
[22] M. Zhang, A. A. Sawchuk, "Motion primitive-based human activity recognition using a bag-of-features approach," ACM International Health Informatics Symposium (IHI), pp. 631-640, 2012.
[23] J. C. Niebles, H. Wang, "Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words," International Journal of Computer Vision, vol. 79(3), pp. 299-318, 2008.
[24] T. D. Campos, et al., "An evaluation of bags-of-words and spatio-temporal shapes for action recognition," IEEE Workshop on Applications of Computer Vision (WACV), 2011.
[25] M. M. Ullah, S. N. Parizi, I. Laptev, "Improving Bag-of-Features Action Recognition with Non-Local Cues," Proceedings of the British Machine Vision Conference, pp. 1-11, 2010.
[26] C. W. Hsu, C. J. Lin, "A comparison of methods for multiclass support vector machines," IEEE Transactions on Neural Networks, vol. 13(2), pp. 415-425, 2002.
[27] C. Schuldt, I. Laptev, B. Caputo, "Recognizing Human Actions: A Local SVM Approach," Proceedings of the 17th International Conference on Pattern Recognition, 2004.
[28] M. A. Bagheri, Q. Gao, S. Escalera, "Support vector machines with time series distance kernels for action classification," IEEE Winter Conference on Applications of Computer Vision, 2016.
[29] F. Foerster, J. Fahrenberg, "Motion pattern and posture: Correctly assessed by calibrated accelerometers," Behavior Research Methods, Instruments, and Computers, vol. 32(3), pp. 450-457, 2000.
[30] F. Chamroukhi, S. Mohammed, D. Trabelsi, L. Oukhellou, Y. Amirat, "Joint segmentation of multivariate time series with hidden process regression for human activity recognition," Neurocomputing, vol. 120, pp. 633-644, 2013.
[31] M. Vrigkas, C. Nikou, I. A. Kakadiaris, "A Review of Human Activity Recognition Methods," Frontiers in Robotics and AI, vol. 2, Article 28, 2015.
[32] R. Collobert, K. Kavukcuoglu, C. Farabet, "Torch7: A MATLAB-like environment for machine learning," BigLearn, NIPS Workshop, 2011.
[33] A. Vedaldi, K. Lenc, "MatConvNet: Convolutional Neural Networks for MATLAB," Proceedings of the 23rd ACM International Conference on Multimedia, pp. 689-692, 2015.


