Research on human activity recognition based on image classification methods

Aistė Štulienė
Faculty of Informatics
Kaunas University of Technology
Kaunas, Lithuania
e-mail: aiste.stuliene@ktu.edu

Agnė Paulauskaitė-Tarasevičienė
Department of Applied Informatics, Faculty of Informatics
Kaunas University of Technology
Kaunas, Lithuania
e-mail: agne.paulauskaite-taraseviciene@ktu.lt

Abstract–Human activity recognition is a significant component of many innovative systems based on human behavior. The ability to recognize various human activities enables the development of intelligent control systems. Usually the task of human activity recognition is mapped to the task of classifying images that represent a person's actions. This paper addresses the problem of classifying human activities using various machine learning methods such as Convolutional Neural Networks, the Bag of Features model, Support Vector Machine and K-Nearest Neighbors. It provides a comparative study of these methods applied to the human activity recognition task using a set of images representing five different categories of daily life activities. The usage of wearable sensors that could improve the classification results of human activity recognition is beyond the scope of this research.

Keywords–activity recognition; machine learning; CNN; BoF; KNN; SVM

I. INTRODUCTION

Recently the human activity recognition problem has become a significant matter of research. In most cases it has a very explicit practical applicability: human activity recognition is an integral part of human behavior-based systems. Nowadays, smart home technologies are getting a lot of attention because of the better care of residents they provide, which is extremely important for elderly, children or disabled people [1]. Smart home solutions, health monitoring equipment and surveillance systems can be indicated as typical examples of such systems [2], [3], [4]. Nevertheless, there is a huge variety of specific application areas, namely anomalous behaviour detection, unhealthy habits prevention or condition tracking [5].

Nowadays, the primitive partition of human activity into static postures and dynamic motions is not sufficient. One of the key requirements of smart system technologies for human activity recognition is the ability to identify the current activity from the wide range of possible indoor activities. Fully autonomous and barely noticeable assisting systems are becoming more appropriate for daily use than equipment based on wearable sensors or appliances [6], [3]. Accelerometers, gyroscopes and magnetometers have been substantiated as the most informative sensors in sensor-based recognition systems [7], [8]. Such techniques as radar, I/R or microwave, and depth cameras have been widely used to obtain images [9], [10]. Commercial products such as Nintendo's Wii or Microsoft's Kinect are good examples of such devices [11]. Although these products have been partially successful, their deployment is not practical, as it limits the mobility area of the human (e.g., public areas are excluded). Furthermore, wearable motion sensors make the human's movement cumbersome. Additionally, the installation and maintenance of the sensors usually cause high costs. For these reasons, the more practical solutions rely on the combination of video monitoring devices and image classification methods.

Various machine learning technologies are applied to image recognition tasks. Therefore, a major challenge in human activity recognition is to evaluate the reliability of the selected technologies. Considering this fact, it is necessary to compare the experimental results obtained using different machine learning approaches. In this paper, four different methods have been chosen for the experiments: Convolutional Neural Networks (CNNs), Bag of Features (BoF), Support Vector Machine (SVM) and K-Nearest Neighbors (KNN). Using the same set of images representing human daily life activities, these methods have been applied for image classification into five categories.

II. IMAGE CLASSIFICATION

The general schema of human activity classification using all four methods mentioned above is presented in Fig. 1.

Fig. 1. The general architecture of image classification using machine learning methods: a set of images is selected per category (I..N), preprocessed (resize, B&W, RGB to LAB), passed to the applied machine learning method (CNN, BoF, SVM or k-NN) and classified into the output categories.

Depending on the machine learning method, different requirements are imposed on the images. For example, using a CNN, all images must be of the same size, which is usually relatively small (e.g., 224×224×3).
                                                                     8
Copyright © 2017 held by the authors
A KNN classifier may be enhanced by converting images from the RGB color model to the LAB color model, which makes visual differences between colors quantifiable and may lead to better results. The SVM algorithm is used for image classification after RGB images have been converted to grayscale and then to binary images.
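For illustration, a minimal MATLAB sketch of this per-method image preparation follows; the file name, target size and thresholding choice are illustrative assumptions, not the exact settings used in the experiments:

    % Per-method image preparation (a sketch; names and sizes are assumptions)
    rgb   = imread('activity.jpg');            % an input image (assumed file name)
    cnnIn = imresize(rgb, [224 224]);          % CNN: fixed input size, e.g. 224x224x3
    labIn = rgb2lab(rgb);                      % KNN variant: RGB -> LAB color model
    gray  = rgb2gray(rgb);                     % SVM variant: RGB -> grayscale ...
    bw    = im2bw(gray, graythresh(gray));     % ... -> binary image (Otsu threshold)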
A. Convolutional Neural Networks

CNN is a deep learning model that obtains complicated hierarchical features via convolution operations alternating with sub-sampling operations on the raw input images. Convolutional neural networks have become one of the most widespread deep learning models and have shown very high accuracy in various image recognition tasks [12], [13]. For human activity recognition tasks, CNNs are usually tested on a popular group of research activity categories (walking, jogging, running, boxing, waving and clapping) and can achieve more than 90% accuracy [14], [15]. However, in most cases the solutions based on CNNs employ additional sophisticated sensors [16], [17]. Signals received from the accelerometer and gyroscope are transformed into a new activity image that contains hidden relations between any pair of signals. Using CNNs, additional discriminative features suited for human activity recognition are automatically extracted and learned [18].

Fig. 2. A typical architecture of CNN: an input layer, alternating convolution and max-pooling layers, fully connected layers, and a Softmax output layer.

A general CNN architecture consists of several convolutional, pooling and fully connected layers (Fig. 2). A convolutional layer computes the output of neurons that are connected to local regions in the input. A pooling layer reduces the spatial size of the representation in order to reduce the amount of parameters and computation in the network. All these layers are followed by fully connected layers leading into Softmax, which is the final classifier.

Images of the same size $a \times a \times b$ (where $a$ is the height and width of the image and $b$ is the number of channels) are passed as the input to a convolutional layer. When an RGB image is used, $b$ is equal to 3. The convolutional layer has $m$ kernels (or filters) of size $c \times c \times d$, where $c$ is smaller than $a$.

The neurons of the convolutional layer are connected to sub-regions of the input image (for the first convolutional layer) or of the output of the previous layer. A feature map is formed when a filter moves along the input and uses the same set of weights and the same bias for the convolution. If $l$ is a convolutional layer, the $i$-th feature map $Y_i^{(l)}$ is defined by the formula:

$$Y_i^{(l)} = B_i^{(l)} + \sum_{j=1}^{m_1^{(l-1)}} K_{i,j}^{(l)} * Y_j^{(l-1)} \qquad (1)$$

where $B_i^{(l)}$ is a bias matrix, $K_{i,j}^{(l)}$ is the filter connecting the $j$-th feature map in layer $(l-1)$ with the $i$-th feature map in layer $l$, and $m_1^{(l-1)}$ is the number of feature maps in layer $l-1$.

The convolutional layer is followed by an activation function. The rectified linear unit is represented by a ReLU layer. ReLU is the function defined as:

$$Y_i^{(l)} = \max(0, Y_i^{(l-1)}) \qquad (2)$$

that is,

$$Y_i^{(l)} = Y_i^{(l-1)}, \quad \text{when } Y_i^{(l-1)} \ge 0 \qquad (3)$$

$$Y_i^{(l)} = 0, \quad \text{when } Y_i^{(l-1)} < 0 \qquad (4)$$

A cross-channel normalization (local response normalization) layer follows the ReLU layer. This layer replaces every element with a normalized value. The normalized value $x'$ for each element $x$ is defined as:

$$x' = \frac{x}{\left(K + \frac{\alpha \cdot s}{windowChannelSize}\right)^{\beta}} \qquad (5)$$

where $K$, $\alpha$ and $\beta$ are hyper-parameters of the normalization and $s$ is the sum of squares of the elements in the normalization window [19]. The expression can be detailed as:

$$b_{x,y}^{(i)} = \frac{a_{x,y}^{(i)}}{\left(K + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \big(a_{x,y}^{(j)}\big)^2\right)^{\beta}} \qquad (6)$$

where $b_{x,y}^{(i)}$ is the response-normalized activity, $a_{x,y}^{(i)}$ is the activity of a neuron computed by applying kernel $i$ at position $(x,y)$ and then applying the ReLU nonlinearity, $n$ is the number of adjacent kernel maps at the same spatial position, and $N$ is the total number of kernels in the layer.

Pooling layers follow convolutional layers and summarize the outputs of neighboring groups of neurons in the same kernel map. The neighborhoods summarized by adjacent pooling units do not overlap. A max-pooling layer returns the maximum values of rectangular regions of the input and, respectively, an average-pooling layer returns the average values.

The convolutional layers are followed by a certain number of fully connected layers, whose aim is to detect larger patterns using combinations of the features learned in the previous layers. In order to classify the images, the last fully connected layer combines the identified patterns. The final fully connected layer is followed by a Softmax layer and a classification (output) layer. In the classification layer, the network takes the values from the Softmax function and assigns each input to one of the classes.
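To make formulas (1)–(6) concrete, a minimal MATLAB sketch follows; the sizes and hyper-parameter values are illustrative assumptions, not those of any particular architecture used later:

    % Sketch of formulas (1)-(6); all sizes and hyper-parameters are
    % illustrative assumptions, not values from AlexNet, CaffeRef or VGG.
    a = 8; b = 3; c = 3; m = 4;                 % input size/channels, kernel size, kernel count
    X = rand(a, a, b);                          % input image of size a-by-a-by-b
    K = rand(c, c, b, m);                       % m kernels of size c-by-c-by-b
    B = rand(1, m);                             % one bias per feature map (scalar here)

    Y = zeros(a - c + 1, a - c + 1, m);         % feature maps ('valid' convolution)
    for i = 1:m                                 % formula (1): bias + sum over input maps
        Yi = B(i);
        for j = 1:b
            Yi = Yi + conv2(X(:,:,j), K(:,:,j,i), 'valid');
        end
        Y(:,:,i) = max(0, Yi);                  % formulas (2)-(4): ReLU
    end

    kpar = 2; alpha = 1e-4; beta = 0.75; n = 2; % assumed K, alpha, beta, window size n
    N = size(Y, 3);
    Ynorm = zeros(size(Y));
    for i = 1:N                                 % formula (6), with 1-based channel indices
        win = max(1, i - floor(n/2)) : min(N, i + floor(n/2));
        s = sum(Y(:,:,win).^2, 3);              % sum of squares over the channel window
        Ynorm(:,:,i) = Y(:,:,i) ./ (kpar + alpha * s).^beta;
    end

The pooling and fully connected stages are omitted here for brevity; they reduce the spatial size of Ynorm and map the resulting features to class scores as described above.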

Three CNN architectures have been selected for the experiments in this paper: AlexNet [19], CaffeRef [20] and VGG [21]. These architectures have the same number of layers but different input requirements for image size: AlexNet and CaffeRef require the size of 227×227×3, while VGG accepts the size of 224×224×3. The first convolutional layer filters the input image with 96 kernels of size 11×11×3 when AlexNet or CaffeRef is used and with 64 kernels of size 11×11×3 when VGG is used. The second convolutional layer uses kernels of size 5×5×d, where d is equal to 48 for AlexNet and CaffeRef and to 64 for the VGG architecture. Further layers filter their inputs with m kernels of size 3×3×d, where d increases; the exact values of d and m depend on the selected architecture.
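As a brief illustration of how such a pretrained architecture can be applied in the MATLAB environment used in this paper, the following MatConvNet [33] sketch classifies one image; the model file name and the meta fields assume one of the pretrained models distributed with MatConvNet, and the image name is an assumption:

    % Applying a pretrained CNN with MatConvNet (a sketch; the model file
    % is an assumption and must be downloaded separately).
    net = load('imagenet-caffe-alex.mat');            % pretrained AlexNet-style model
    im  = imread('activity.jpg');
    im_ = single(imresize(im, net.meta.normalization.imageSize(1:2)));
    im_ = im_ - net.meta.normalization.averageImage;  % subtract the mean image
    res = vl_simplenn(net, im_);                      % forward pass through all layers
    scores = squeeze(res(end).x);                     % Softmax outputs
    [~, label] = max(scores);                         % predicted class index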
B. Bag of Features

Bag of Features encodes image features into a representation suitable for image classification. This technique is also often referred to as Bag of Words, because it treats image features as visual words. The features (which sometimes can be general, such as color, texture or shape) are used to find the similarities between images (Fig. 3).

Fig. 3. Bag of Features for image recognition: features are extracted from the images, encoded as histograms of visual words and passed to the classification step.

BoF has shown promising results (over 80% accuracy) in action recognition tasks on video sequences [22], [23]. The typical group of sport-type activities (jumping, walking, running) is used to evaluate the performance of BoF, and the results prove that better accuracy can be achieved in combination with other classification methods or additional techniques [24], [25].
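A minimal sketch of such a BoF pipeline in MATLAB, using Computer Vision System Toolbox functions; the folder layout (one sub-folder per category) and folder names are assumptions:

    % BoF pipeline (a sketch; folder names are assumptions)
    trainSets = imageSet('trainingImages', 'recursive');  % one imageSet per category
    bag = bagOfFeatures(trainSets);                       % build the visual vocabulary
    classifier = trainImageCategoryClassifier(trainSets, bag);
    testSets = imageSet('testImages', 'recursive');
    confMat = evaluate(classifier, testSets);             % per-category confusion matrix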
                                                                                         equal to the amount of categories). Confusion matrix is such
     C. Support Vector Machine                                                           that the element 𝑀𝑖𝑗 is the amount of instances from category i
     Support Vector Machine (SVM) belongs to the class of                                that were actually classified as category j [6].
machine learning algorithms called kernel methods. It is one of                                   TABLE I. The confusion matrix for binary classification
the best known methods in pattern classification and image
                                                                                                                                 Predicted category
classification. The SVM method was designed to be applied
only for two-class problems. In the context of human activity                                                                     NO            YES
classification problem, usually there are more than two possible                                  Actual          NO              TN             FP
classes (categories). Depending on this fact, it is extremely                                    category         YES             FN             TP
important to use modified SVM, which can be applied for
multiclass classification. Two main approaches have been                                      The confusion matrix for binary classification contains four
suggested to solve this problem [26]. The first one is called “one                       elements (TABLE I): True Positives (TP) represent the amount
against all”. In this approach, a set of binary classifiers is trained                   of positive instances that were classified as positive; True
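Both multiclass strategies can be sketched in MATLAB as follows, assuming feature vectors X with labels Y and test features Xtest are given:

    % Multiclass SVM via error-correcting output codes (a sketch;
    % X, Y and Xtest are assumed to be given feature data).
    mdlOVA = fitcecoc(X, Y, 'Coding', 'onevsall');   % "one against all"
    mdlOVO = fitcecoc(X, Y, 'Coding', 'onevsone');   % "one against one"
    predicted = predict(mdlOVA, Xtest);              % resulting class per test image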
D. K-Nearest Neighbors

K-Nearest Neighbors is a machine learning algorithm that is often used for classifying objects based on the most similar training samples in the feature space. The classification is based on the distance between a set of input data points and the training points. Various metrics can be used to determine the distance (Euclidean distance, Mahalanobis distance, Spearman distance, etc.). Given a set A of n points and a distance function, KNN search finds the k points in A closest to a set of query points. This algorithm is widely used in image processing and classification tasks.

The objects are classified according to the features of their k nearest neighbors by majority vote. The training process consists of storing the feature vectors and labels of the training images. During classification, an unlabelled query point is simply assigned the label shared by the majority of its k nearest neighbors.

The performance of KNN for the classification of human activities has been examined in particular using uni-axial sensors (sternum, wrist, thigh and lower leg) [29]. Other studies based on KNN for human activity recognition have also shown rather good accuracy results (> 90%). However, it can be concluded that high accuracy of human activity recognition based on this method is achieved when additional equipment (i.e., wearable sensors) is used [30].
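A minimal MATLAB sketch of this classification scheme, assuming feature vectors X, labels Y and test features Xtest are given; k = 5 and the Euclidean metric are illustrative choices:

    % KNN classification on image feature vectors (a sketch; X, Y, Xtest
    % and the parameter values are assumptions).
    mdl = fitcknn(X, Y, 'NumNeighbors', 5, 'Distance', 'euclidean');
    predicted = predict(mdl, Xtest);    % majority vote of the 5 nearest neighbors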
E. Accuracy Evaluation

Human activity classification results for a particular method are often represented as a confusion matrix $M_{n \times n}$ (n is equal to the number of categories). The confusion matrix is such that the element $M_{ij}$ is the number of instances from category i that were actually classified as category j [6].

TABLE I. THE CONFUSION MATRIX FOR BINARY CLASSIFICATION

                              Predicted category
                              NO        YES
    Actual category     NO    TN        FP
                        YES   FN        TP

The confusion matrix for binary classification contains four elements (TABLE I): True Positives (TP) represent the number of positive instances that were classified as positive; True Negatives (TN) represent the number of negative instances that were classified as negative; False Positives (FP) represent the number of negative instances that were classified as positive; False Negatives (FN) represent the number of positive instances that were classified as negative.

Accuracy is a widely used metric for the generalization of classification results. This metric is defined by the formula:

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \qquad (7)$$

The confusion matrix and the accuracy can also be used for n categories, where n should be more than 1 (see TABLE II – TABLE VII). In this case, an instance counts as positive or negative with respect to the particular category, e.g., the positives might be all instances of category II (e.g., sleeping) while the negatives would be all instances other than category II (e.g., other than sleeping).
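A short MATLAB sketch of the confusion matrix, formula (7) and the one-vs-rest view described above; the label vectors are illustrative:

    % Confusion matrix and accuracy (a sketch with illustrative labels).
    actual    = [1 1 2 2 3 3 3 1]';          % ground-truth category per instance
    predicted = [1 2 2 2 3 1 3 1]';          % classifier output per instance
    M = confusionmat(actual, predicted);     % n-by-n confusion matrix M
    accuracy = sum(diag(M)) / sum(M(:));     % overall accuracy over all categories
    % One-vs-rest elements for category i, then formula (7):
    i = 2;
    TP = M(i,i);            FN = sum(M(i,:)) - TP;
    FP = sum(M(:,i)) - TP;  TN = sum(M(:)) - TP - FN - FP;
    acc_i = (TP + TN) / (TP + TN + FP + FN);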
III. EXPERIMENTS

A. The Categories of Human Activities

Despite a considerable amount of scientific research, human activity recognition from images alone is still a very challenging task due to background clutter, viewpoint, lighting, appearance and a wide range of other aspects. Moreover, the similarities between different human actions make the classification even more challenging. The same activity may be performed by people who have completely different appearance, body movements, postures and habits [31]. These criteria affect the way people perform a particular action; consequently, it becomes quite complicated to define the activity. Changing lifestyles and small or modern accommodations affect how home areas are used: the rooms are often used beyond their primary purpose (e.g., the resident can work with a computer in the kitchen or eat in the bedroom). Home appliances, computers, mobile devices and other objects around the person are in most cases not connected with the current activity. Even if the resident uses them at the moment, due to changing technologies and trends they can be barely noticeable or recognizable. Considering the unsolved problems of human activity recognition based on image classification methods, further theoretical and practical studies need to be carried out in order to improve the results or reject inadequate solutions.

In this paper an experimental scenario including five possible categories depending on the type of human activity has been created (Fig. 4). The activities are supposed to be performed in home or office areas. Category I relates to the situation when people are communicating. Category II is assigned to the situation when people are sleeping or having a rest. Category III represents empty spaces (a human staying only temporarily in the selected area). Human work at a computer, reading, writing or studying is assigned to category IV. Any type of eating or drinking activity is assigned to category V. Differently from the images of common activities in various recognition tasks, the images representing these activities include all the other objects naturally appearing while the particular activity is performed. Therefore, the accuracy of the expected results may not be as high as reported in previous research (especially where additional techniques or methods are included).

Fig. 4. Examples of five different categories of human activities: a) represents category I, b) category II, c) category III, d) category IV and e) category V.

B. Experimental Results

Image datasets containing 502 images for each category of human activity have been collected. Each dataset has been split into a training set (which contains 400 images for each category) and a test set (which contains 102 images for each category). The data for training and testing have been chosen randomly from the primary datasets.

The experiments on the classification of human activities have been implemented using MATLAB software and Add-Ons [32]. The implementation of CNNs has been accomplished using MatConvNet [33], which is an open source implementation of CNNs in the MATLAB environment. There are also specific software and hardware requirements for the implementation of CNNs, such as MATLAB 2015a (or a later version), a C/C++ compiler, and a computer with a CUDA-enabled NVIDIA GPU with compute capability 2.0 or above.

The estimated classification accuracy of the human activity recognition task using the different image classification methods is presented in TABLE II – TABLE VII. The average accuracy of KNN is the worst one and approaches 40.98% (although it is more than twice the probability of choosing the correct class randomly). The difference between the average accuracy of SVM and that of Bag of Features is less than 9%: the values are 59.61% and 68.24%, respectively (the probability of choosing the correct class using one of these methods is more than 0.5). The CNN architectures (AlexNet, CaffeRef and VGG) provide very similar results; nevertheless, the average accuracy of AlexNet is the best one and approaches 90.78%.
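The random per-category split described above could be reproduced with a sketch like the following; the folder name is an assumption:

    % Random 400/102 split for one category (a sketch; the folder name
    % 'categoryI' is an assumption).
    files = dir(fullfile('categoryI', '*.jpg'));   % 502 images expected
    idx   = randperm(numel(files));
    trainFiles = files(idx(1:400));                % 400 training images
    testFiles  = files(idx(401:end));              % remaining 102 test images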


TABLE II. CONFUSION MATRIX OF ALEXNET ARCHITECTURE (AVERAGE ACCURACY: 90.78%)

    Actual \ Predicted     I.    II.   III.   IV.    V.   Total
    I.                     93     4     2      3     0     102
    II.                     7    87     2      2     4     102
    III.                    1     1   100      0     0     102
    IV.                     3     3     0     93     3     102
    V.                      7     4     0      1    90     102

TABLE III. CONFUSION MATRIX OF CAFFEREF ARCHITECTURE (AVERAGE ACCURACY: 88.04%)

    Actual \ Predicted     I.    II.   III.   IV.    V.   Total
    I.                     86     9     1      3     3     102
    II.                     7    91     2      2     0     102
    III.                    0     1   101      0     0     102
    IV.                     8     6     1     82     5     102
    V.                      1     4     2      6    89     102

TABLE IV. CONFUSION MATRIX OF VGG ARCHITECTURE (AVERAGE ACCURACY: 88.43%)

    Actual \ Predicted     I.    II.   III.   IV.    V.   Total
    I.                     91     6     1      1     2     102
    II.                     8    89     3      2     1     102
    III.                    1     2    99      0     1     102
    IV.                     4     5     1     91     1     102
    V.                      5     5     0     10    81     102

TABLE V. CONFUSION MATRIX OF BOF (AVERAGE ACCURACY: 68.24%)

    Actual \ Predicted     I.    II.   III.   IV.    V.   Total
    I.                     71     9     8      2    12     102
    II.                    17    75     7      1     2     102
    III.                    6     9    82      1     4     102
    IV.                     7    16    19     47    13     102
    V.                      8     8     4      9    73     102

TABLE VI. CONFUSION MATRIX OF SVM (AVERAGE ACCURACY: 59.61%)

    Actual \ Predicted     I.    II.   III.   IV.    V.   Total
    I.                     59    23     6     10     4     102
    II.                    19    57    10      5    11     102
    III.                    6    12    67     12     5     102
    IV.                    10     8    12     58    14     102
    V.                      8     9     4     18    63     102

TABLE VII. CONFUSION MATRIX OF KNN (AVERAGE ACCURACY: 40.98%)

    Actual \ Predicted     I.    II.   III.   IV.    V.   Total
    I.                     56    13    10     13    10     102
    II.                    27    34    13     14    14     102
    III.                   18    13    39     16    16     102
    IV.                    12    11     9     44    26     102
    V.                     12    15    13     26    36     102

The experimental results have shown that the activities of category III yield the best classification results for all methods except KNN (Fig. 5).

Fig. 5. Comparison of image classification methods: per-category accuracy (in %) of AlexNet, CaffeRef, VGG, BoF, SVM and KNN for categories I–V.

The recognition of activities belonging to categories II, IV and V is the most complicated and therefore yields the worst classification results.

IV. CONCLUDING REMARKS

In this paper, research on different machine learning methods used to recognize human activities has been performed. Four classical machine learning methods have been selected for this research: CNNs, the BoF model, SVM and KNN. This paper provides a comparative study of the mentioned methods for human activity recognition from images alone, using five different categories of daily life activities. Issues related to wearable sensors or other additional techniques have not been considered. The obtained accuracy results satisfy our expectations, especially taking into account that the images representing these activities include all the other objects naturally appearing while the particular activity is performed. The average accuracy of image classification using BoF is 68.24%. The average accuracy using SVM is lower and approaches 59.61%. Based on the experimental results, we can conclude that KNN is not an appropriate method for human activity classification when such complicated pictures of activities are used and the classical KNN approach is applied without any improvements or technological supplements. The application of the different CNN architectures has revealed very similar high accuracy results, although AlexNet has reached more than 90% average accuracy, which is the best score of all the applied methods. Considering the obtained results, further studies are needed to analyze the eligibility of different and newly created CNN architectures for the solution of the image-based human activity classification problem.

REFERENCES

[1] T. van Kasteren, G. Englebienne, B. Krose, "An activity monitoring system for elderly care using generative and discriminative models," Personal and Ubiquitous Computing, vol. 14(6), pp. 489-498, 2010.
[2] Y. Liang, X. Zhou, Z. Yu, B. Guo, "Energy-efficient motion related activity recognition on mobile devices for pervasive healthcare," Mobile Networks and Applications, vol. 19(3), pp. 303-317, 2014.
[3] J. Iglesias, J. Cano, A. M. Bernardos, J. R. Casar, "A ubiquitous activity-monitor to prevent sedentariness," IEEE Conference on Pervasive Computing and Communications, pp. 667-680, 2011.
[4] S. Thomas, M. Bourobou, Y. Yoo, "User Activity Recognition in Smart Homes Using Pattern Clustering Applied to Temporal ANN Algorithm," Sensors, vol. 15(5), pp. 11953-11971, 2015.
[5] X. Zhu, Z. Liu, J. Zhang, "Human Activity Clustering for Online Anomaly Detection," Journal of Computers, vol. 6(6), pp. 1071-1079, 2011.
[6] O. D. Lara, M. A. Labrador, "A survey on human activity recognition using wearable sensors," IEEE Communications Surveys & Tutorials, vol. 15(3), pp. 1192-1209, 2013.
[7] P. Gupta, T. Dallas, "Feature Selection and Activity Recognition System using a Single Tri-axial Accelerometer," IEEE Trans. Biomed. Eng., pp. 1780-1786, 2014.
[8] L. Atallah, B. Lo, R. C. King, G.-Z. Yang, "Sensor positioning for activity recognition using wearable accelerometers," IEEE Transactions on Biomedical Circuits and Systems, vol. 5(4), pp. 320-329, 2011.
[9] M. A. A. H. Khan, et al., "RAM: Radar-based activity monitor," IEEE INFOCOM 2016 (Computer Communications), pp. 1-9, 2016.
[10] A. Dubois, F. Charpillet, "Human activities recognition with RGB-Depth camera using HMM," Conf. Proc. IEEE Eng. Med. Biol. Soc., 2013.
[11] J. Shotton, et al., "Real-time human pose recognition in parts from single depth images," IEEE Conference on Computer Vision and Pattern Recognition, 2011.
[12] K. He, X. Zhang, S. Ren, J. Sun, "Deep Residual Learning for Image Recognition," Computer Vision Foundation, pp. 770-778, 2015.
[13] M. D. Zeiler, R. Fergus, "Visualizing and understanding convolutional networks," Proceedings of ECCV, 2014.
[14] S. Ravimaran, R. Anuradha, "Survey of Action Recognition Methods for Human Activity Recognition," International Journal of Advanced Research in Computer Science and Software Engineering, vol. 6, pp. 284-284, 2016.
[15] M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, A. Baskurt, "Sequential Deep Learning for Human Action Recognition," International Workshop on Human Behavior Understanding, vol. 7065, Lecture Notes in Computer Science, pp. 29-39, 2011.
[16] J. B. Yang, M. N. Nguyen, P. P. San, X. L. Li, S. Krishnaswamy, "Deep Convolutional Neural Networks on Multichannel Time Series for Human Activity Recognition," Proceedings of the 24th International Conference on Artificial Intelligence, pp. 3995-4001, 2015.
[17] N. Y. Hammerla, S. Halloran, T. Plotz, "Deep, Convolutional, and Recurrent Models for Human Activity Recognition Using Wearables," Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pp. 1533-1540, 2016.
[18] W. Jiang, Z. Yin, "Human Activity Recognition Using Wearable Sensors by Deep Convolutional Neural Networks," Proceedings of the 23rd ACM International Conference on Multimedia, pp. 1307-1310, 2015.
[19] A. Krizhevsky, I. Sutskever, G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, pp. 1106-1114, 2012.
[20] Y. Jia, et al., "Caffe: Convolutional architecture for fast feature embedding," Proceedings of the 22nd ACM International Conference on Multimedia, pp. 675-678, 2014.
[21] K. Simonyan, A. Zisserman, "Very deep convolutional networks for large-scale image recognition," ICLR, 2014.
[22] M. Zhang, A. A. Sawchuk, "Motion primitive-based human activity recognition using a bag-of-features approach," ACM International Health Informatics Symposium (IHI), pp. 631-640, 2012.
[23] J. C. Niebles, H. Wang, "Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words," International Journal of Computer Vision, vol. 79(3), pp. 299-318, 2008.
[24] T. D. Campos, et al., "An evaluation of bags-of-words and spatio-temporal shapes for action recognition," IEEE Workshop on Applications of Computer Vision (WACV), 2011.
[25] M. M. Ullah, S. N. Parizi, I. Laptev, "Improving Bag-of-Features Action Recognition with Non-Local Cues," Proceedings of the British Machine Vision Conference, pp. 1-11, 2010.
[26] C. W. Hsu, C. J. Lin, "A comparison of methods for multiclass support vector machines," IEEE Transactions on Neural Networks, vol. 13(2), pp. 415-425, 2002.
[27] C. Schuldt, I. Laptev, B. Caputo, "Recognizing Human Actions: A Local SVM Approach," Proceedings of the 17th International Conference on Pattern Recognition, 2004.
[28] M. A. Bagheri, Q. Gao, S. Escalera, "Support vector machines with time series distance kernels for action classification," IEEE Winter Conference on Applications of Computer Vision, 2016.
[29] F. Foerster, J. Fahrenberg, "Motion pattern and posture: Correctly assessed by calibrated accelerometers," Behavior Research Methods, Instruments, and Computers, vol. 32(3), pp. 450-457, 2000.
[30] F. Chamroukhi, S. Mohammed, D. Trabelsi, L. Oukhellou, Y. Amirat, "Joint segmentation of multivariate time series with hidden process regression for human activity recognition," Neurocomputing, vol. 120, pp. 633-644, 2013.
[31] M. Vrigkas, C. Nikou, I. A. Kakadiaris, "A Review of Human Activity Recognition Methods," Frontiers in Robotics and AI, vol. 2, Article 28, 2015.
[32] R. Collobert, K. Kavukcuoglu, C. Farabet, "Torch7: A MATLAB-like environment for machine learning," BigLearn, NIPS Workshop, 2011.
[33] A. Vedaldi, K. Lenc, "MatConvNet: Convolutional Neural Networks for MATLAB," Proceedings of the 23rd ACM International Conference on Multimedia, pp. 689-692, 2015.


