Learning Deep Features for kNN-Based Human Activity Recognition

Sadiq Sani, Nirmalie Wiratunga, and Stewart Massie

School of Computing Science and Digital Media, Robert Gordon University, Aberdeen AB25 1HG, Scotland, UK
{s.sani|n.wiratunga|s.massie}@rgu.ac.uk

Copyright © 2017 for this paper by its authors. Copying permitted for private and academic purposes. In Proceedings of the ICCBR 2017 Workshops, Trondheim, Norway.

Abstract. A CBR approach to Human Activity Recognition (HAR) uses the kNN algorithm to classify sensor data into different activity classes. Different feature representation approaches have been proposed for sensor data for the purpose of HAR. These include shallow features, which can either be hand-crafted from the time and frequency domains, or be the coefficients of frequency transformations. Alternatively, deep features can be extracted using deep learning approaches. These different representation approaches have been compared in previous works without a consistent best approach being identified. In this paper, we explore the question of which representation approach is best for kNN. Accordingly, we compare 5 different feature representation approaches (ranging from shallow to deep) on accelerometer data collected from two body locations, wrist and thigh. Results show deep features to produce the best results for kNN, compared to both hand-crafted and frequency transform features, by a margin of up to 6.5% on the wrist and over 2.2% on the thigh. In addition, kNN produces very good results with as little as a single epoch of training for the deep features.

Keywords: human activity recognition, feature representation, deep learning

1 Introduction

Human activity recognition (HAR) is the computational discovery of human activity from sensor data and is receiving increasing interest in the areas of health care and fitness [3]. This is mainly driven by the need to find innovative ways to encourage physical activity. An example of a health application of HAR is SelfBack (http://www.selfback.eu/) [1], an EU-funded project that is developing a self-management system for patients with Lower Back Pain. The motivation for this work is driven by the need for an effective HAR component for SelfBack, which is required to accurately measure adherence to physical activity targets.

HAR is generally considered as a classification problem where a classifier is trained to identify user activity from sensor data. A CBR approach to this problem makes use of a kNN classifier in order to facilitate similarity-based reasoning and explanation. However, the effectiveness of a kNN classifier depends on the quality of the feature representation used. Different feature representation approaches have been proposed for HAR, from shallow hand-crafted features, to frequency transform features, e.g. Fast Fourier Transform (FFT) and Discrete Cosine Transform (DCT) coefficients, and, more recently, deep learning approaches. All of these approaches have had some degree of success, as well as some setbacks in performance [6]. It is our view that none of the previous works provides a clear answer to which feature extraction approach is best. Also, previous works have evaluated these feature representation approaches on combinations of different types of datasets with different mixes of sensor locations and classifiers. In this work, we focus on feature representation for the kNN classifier using data from two popular body locations, wrist and thigh.

The main contribution of this work is an empirical evaluation of 5 different feature representation approaches across three different classes of features, i.e. shallow hand-crafted features, shallow frequency transformation features and deep CNN-derived features, for kNN, using sensor data collected from two common body locations, the wrist and the thigh. Wrist data is more prone to random noise compared to data collected at other body locations (e.g. the thigh) due to the increased variations in movement and posture possible with the hand while undertaking activities. Our goal in this work is to understand which of these feature representations is better suited to the kNN classifier, and to analyse any differences in feature performance that may exist between the wrist and thigh.

The rest of this paper is organised as follows: in Section 2, we highlight important related work on feature representation for HAR. Our dataset is described in Section 3. Evaluation is presented in Section 4 and conclusions in Section 5.

2 Related Work on Feature Representation for HAR

Many different feature extraction approaches have been proposed for accelerometer data for the purpose of activity recognition [3]. We broadly classify these into hand-crafted, frequency-transform and deep features.

2.1 Hand-crafted Features

This is the most common approach to HAR and involves the computation of a number of defined measures on either the raw accelerometer data (time domain) or the frequency transformation of the data (frequency domain) [5]. These measures are designed to capture the characteristics of the signal that are useful for distinguishing different classes of activities. In the case of both time and frequency domains, the input is a vector of real values v = (v_1, v_2, ..., v_n) for each axis x, y and z. A feature function θ_i is then applied to each vector to compute a single feature value. Typical time domain features include mean, standard deviation and percentiles [10]; typical frequency domain features include energy, spectral entropy and dominant frequency [2]. The time domain and frequency domain features used in this work are presented in Table 1.

  Time Domain Features              Frequency Domain Features
  Mean                              Dominant frequency
  Standard deviation                Spectral centroid
  Inter-quartile range              Maximum
  Lag-one-autocorrelation           Mean
  Percentiles (10, 25, 50, 75, 90)  Median
  Peak-to-peak amplitude            Standard deviation
  Power
  Skewness
  Kurtosis
  Log-energy
  Zero crossings
  Root squared mean

Table 1. Hand-crafted features for both time and frequency domains.

While hand-crafted features have worked well for HAR, a significant disadvantage is that they are domain specific. A different set of features needs to be defined for each different type of input data, i.e. accelerometer, gyroscope, time domain and frequency domain. Hence, some understanding of the characteristics of the data is required. Also, it is not always clear which features are likely to work best [5]. The choice of features is usually made through empirical evaluation of different combinations of features, or with the aid of feature selection algorithms [9].
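To make this approach concrete, the following is a minimal sketch of how a subset of the Table 1 measures might be computed with NumPy and SciPy. The window length (300 samples, i.e. 3 seconds at 100 Hz) and the exact definitions of individual measures are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def time_domain_features(v):
    """A subset of the Table 1 time domain measures for one axis vector v."""
    return np.array([
        np.mean(v),
        np.std(v),
        np.percentile(v, 75) - np.percentile(v, 25),   # inter-quartile range
        *np.percentile(v, [10, 25, 50, 75, 90]),       # percentiles
        np.ptp(v),                                     # peak-to-peak amplitude
        np.mean(v ** 2),                               # power
        skew(v),
        kurtosis(v),
        np.log(np.sum(v ** 2) + 1e-12),                # log-energy
        np.sum(np.diff(np.sign(v)) != 0),              # zero crossings
        np.sqrt(np.mean(v ** 2)),                      # root squared mean (RMS)
    ])

def frequency_domain_features(v, fs=100):
    """A subset of the Table 1 frequency domain measures for one axis vector v."""
    spectrum = np.abs(np.fft.rfft(v))
    freqs = np.fft.rfftfreq(len(v), d=1.0 / fs)
    return np.array([
        freqs[np.argmax(spectrum)],                    # dominant frequency
        np.sum(freqs * spectrum) / np.sum(spectrum),   # spectral centroid
        np.max(spectrum),
        np.mean(spectrum),
        np.median(spectrum),
        np.std(spectrum),
    ])

# One instance: concatenate the feature values of all three axes.
window = np.random.randn(300, 3)   # assumed: a 3-second window at 100 Hz
features = np.concatenate([
    np.concatenate([time_domain_features(window[:, a]),
                    frequency_domain_features(window[:, a])])
    for a in range(3)
])
```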
2.2 Frequency Transform Features

Frequency transform feature extraction involves applying a single function φ to the raw accelerometer data to transform it into the frequency domain, where it is expected that distinctions between different activities are more emphasised. The main difference between frequency transform and hand-crafted features is that the coefficients of the transformation are used directly for feature representation, without taking further measurements. Common transformations that have been applied include Fast Fourier Transforms (FFTs) and Discrete Cosine Transforms (DCTs). FFT is an efficient algorithm optimised for computing the discrete Fourier transform of a digital input. Fourier transforms decompose an input signal into its constituent sine waves. In contrast, DCT, a similar algorithm to FFT, decomposes a given signal into its constituent cosine waves. Also, DCT returns an ordered sequence of coefficients such that the most significant information is concentrated at the lower indices of the sequence. This means that higher DCT coefficients can be discarded without losing significant information, making DCT better suited for compression.

For frequency transform feature extraction, a transformation function (DCT or FFT) φ is applied to the time-series accelerometer vector v of each axis. The output of φ is a vector of coefficients which describe the sinusoidal wave forms that constitute the original signal. Accordingly, the transformed vector representations x′ = φ(x), y′ = φ(y) and z′ = φ(z) are obtained for each axis of a given instance. Additionally, we derive a further magnitude vector m = (m_i1, ..., m_il) of the accelerometer data for each instance as a separate axis, where m_ij is defined as m_ij = √(x_ij² + y_ij² + z_ij²). As with x′, y′ and z′, we also apply φ to m to obtain m′ = φ(m). The final feature representation is obtained by concatenating the absolute values of the first l coefficients of x′, y′, z′ and m′ to produce a single feature vector of length 4 × l. The value l = 80 is used in this work, which was determined empirically. Further information on feature representation using DCT and FFT can be found in [7].
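A minimal sketch of this procedure using SciPy's DCT is given below. The window length is an assumption, while the magnitude axis, the absolute values and l = 80 follow the description above.

```python
import numpy as np
from scipy.fftpack import dct

def dct_features(window, l=80):
    """DCT feature vector for one tri-axial accelerometer window.

    window: array of shape (n, 3) holding the x, y, z axes of one instance.
    Returns a vector of length 4 * l (coefficients of x', y', z' and m').
    """
    x, y, z = window[:, 0], window[:, 1], window[:, 2]
    m = np.sqrt(x ** 2 + y ** 2 + z ** 2)      # magnitude as a fourth axis
    parts = []
    for axis in (x, y, z, m):
        coeffs = dct(axis, norm='ortho')       # phi: the DCT of the axis
        parts.append(np.abs(coeffs[:l]))       # keep the first l coefficients
    return np.concatenate(parts)

window = np.random.randn(300, 3)               # assumed window length
fv = dct_features(window)                      # shape (320,) for l = 80
```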
2.3 CNN Feature Extraction

Convolutional Neural Networks (CNNs) have been applied for feature extraction in HAR due to their ability to model local dependencies that may exist between adjacent data points in the accelerometer data [8]. CNNs are a type of Deep Neural Network that is able to extract increasingly more abstract feature representations by passing the input data through a stack of multiple convolutional operators [4], where each layer in the stack takes as input the output of the previous layer of convolutional operators. An example of a CNN is shown in Figure 1.

Fig. 1. Illustration of CNN

The input into the CNN in Figure 1 is a 3-dimensional matrix representation with dimensions 1 × 28 × 3, representing the width, length and depth respectively. Tri-axial accelerometer data typically has a width of 1, a length l and a depth of 3, representing the x, y and z axes. A convolution operation is then applied by passing a convolution filter over the input, which exploits local relationships between adjacent data points. This operation is defined by two parameters: D, representing the number of convolution filters to apply, and C, the dimensions of each filter. For this example, D = 6 and C = 1 × 5. The output of the convolution operation is a matrix with dimensions 1 × 24 × 6, these dimensions being determined by the dimensions of the input and the parameters of the convolution operation applied. This output is then passed through a pooling operation, which performs dimensionality reduction. The parameter P determines the dimensions of the pooling operator, which in this example is 1 × 2; this results in a reduction of the length of its input by half. The output of the pooling layer can be passed through additional convolution and pooling layers. The output of the final pooling layer is then flattened into a 1-dimensional representation and fed into a fully connected neural network. The entire network (including the convolution layers) is trained through back-propagation over a number of epochs until some convergence criterion is reached. A detailed description of CNNs can be found in [4].

Note that once the CNN is fully trained, it can be used to provide feature representations for use with other types of classifiers, e.g. kNN. This is achieved by cutting off the trained network after the final pooling layer, just before the fully-connected neural network. Each training example is then passed through the convolutional network in order to obtain an abstract representation which is used to train the kNN classifier. A similar operation is performed for each test example to obtain an abstract representation which is passed to kNN for classification.
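A minimal sketch of this cut-off step is shown below, assuming a Keras model `cnn` that has already been trained as described above; the layer name `flatten` is a placeholder for whatever layer follows the final pooling layer.

```python
from tensorflow.keras.models import Model
from sklearn.neighbors import KNeighborsClassifier

def cnn_knn(cnn, X_train, y_train, X_test, feature_layer='flatten'):
    """Cut a trained CNN just before its fully connected layers and
    classify with kNN on the resulting deep features.

    cnn: a trained Keras model; feature_layer names the Flatten layer
    that follows the final pooling layer (the name is an assumption).
    """
    extractor = Model(inputs=cnn.input,
                      outputs=cnn.get_layer(feature_layer).output)
    train_feats = extractor.predict(X_train)   # deep features for training
    test_feats = extractor.predict(X_test)     # deep features for testing
    knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
    knn.fit(train_feats, y_train)
    return knn.predict(test_feats)
```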
3 Dataset

A group of 34 volunteer participants was used for data collection. The age range of participants is 18-54 years and the gender distribution is 52% female and 48% male. Data collection concentrated on the activities listed in Table 2.

  Activity      Description
  Walking       Walking at normal pace
  Jogging       Jogging on a treadmill at moderate speed
  Up Stairs     Walking up 4-6 flights of stairs
  Down Stairs   Walking down 4-6 flights of stairs
  Standing      Standing relatively still
  Sitting       Sitting still with hands on desk or thighs

Table 2. Details of activity classes in our dataset.

This set of activities was chosen because it represents the range of normal daily activities typically performed by most people. Data was collected using the Axivity Ax3 tri-axial accelerometer (http://axivity.com/product/ax3) at a sampling rate of 100Hz. Accelerometers were mounted on the right-hand wrists and right thighs of the participants. Activities are evenly distributed between classes, as participants were asked to do each activity for the same period of time (3 minutes).

4 Evaluation

Evaluations are conducted using a leave-one-person-out methodology where each user's data is held out for testing in turn, while the remaining 33 are used for training. In this way, we are testing the general applicability of the system to users whose data is not included in the trained model. Performance is reported using macro-averaged F1, and kNN is used for classification with Euclidean distance and the parameter k = 5; a minimal sketch of this protocol is given after the list below. The representations included in our comparison are as follows:

- Time: Time domain hand-crafted features
- Freq: Frequency domain hand-crafted features
- DCT: DCT frequency features
- FFT: FFT frequency features
- CNN: CNN deep features with soft-max classifier
- CNN-kNN: CNN deep features with kNN classifier
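The leave-one-person-out protocol can be made concrete with scikit-learn; this sketch assumes `X` holds one feature vector per window, `y` the activity labels and `groups` the participant ID of each window.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

def leave_one_person_out(X, y, groups):
    """Mean macro-F1 of kNN (k=5, Euclidean) with each participant held out in turn."""
    scores = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
        knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
        knn.fit(X[train_idx], y[train_idx])
        pred = knn.predict(X[test_idx])
        scores.append(f1_score(y[test_idx], pred, average='macro'))
    return np.mean(scores)
```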
For the CNN, after experimenting with different parameter settings, the final configuration used for thigh data had 3 convolution layers with 150, 100 and 80 convolution filters respectively. The configuration used for wrist data had 5 convolution layers, with the same numbers of convolution filters as the thigh data in the first 3 layers, and 60 and 40 convolution filters in the fourth and fifth layers respectively. Each convolution layer was followed by a max pooling layer. A convolution filter of size 10 and a pooling size of 2 were used for all convolution and pooling layers respectively. The last pooling layer is connected to a fully connected network with 2 hidden layers, where the first layer has 900 units and the second layer has 200 units. A dropout probability of 0.5 was used for each hidden layer. The final output layer has 6 units, representing the 6 activity classes in our dataset, and uses soft-max regression. Loss is computed using cross-entropy and the network is trained using back-propagation for 200 epochs.
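As an illustration, the thigh configuration might be expressed in Keras as sketched below. The filter counts, filter size, pooling size, hidden layer sizes, dropout and soft-max output follow the description above; the ReLU activations, the optimiser and the use of 1-D convolutions (equivalent to the 1 × l × 3 representation of Section 2.3) are assumptions.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (Conv1D, MaxPooling1D, Flatten,
                                     Dense, Dropout)

def build_thigh_cnn(window_length, n_classes=6):
    """Thigh configuration: 3 convolution layers (150, 100, 80 filters of
    size 10), each followed by max pooling of size 2, then hidden layers of
    900 and 200 units with dropout 0.5, and a 6-way soft-max output."""
    model = Sequential([
        Conv1D(150, 10, activation='relu',
               input_shape=(window_length, 3)),   # x, y, z as 3 channels
        MaxPooling1D(2),
        Conv1D(100, 10, activation='relu'),
        MaxPooling1D(2),
        Conv1D(80, 10, activation='relu'),
        MaxPooling1D(2),
        Flatten(),
        Dense(900, activation='relu'),
        Dropout(0.5),
        Dense(200, activation='relu'),
        Dropout(0.5),
        Dense(n_classes, activation='softmax'),
    ])
    model.compile(optimizer='adam',               # optimiser is an assumption
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

# model = build_thigh_cnn(window_length=300)      # window length assumed
# model.fit(X_train, y_train_onehot, epochs=200)
```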
4.1 Results

Results of our comparative evaluation are shown in Figure 2. The best results for both thigh and wrist are achieved using deep features (CNN and CNN-kNN). This highlights the fact that kNN, using deep features, can rival the performance of state-of-the-art deep learners, while still providing the ability for similarity-based reasoning and explainability that makes kNN desirable. In general, HAR performance is higher using thigh data compared to wrist data, by a margin of up to 14.7% (for DCT). This indicates that the thigh is a much better position for HAR compared to the wrist. However, the benefit from deep feature representations is consistent on both wrist and thigh.

Out of the shallow features (Time, Freq, DCT and FFT), the best performance is achieved using DCT. This is consistent with our previous findings [7]. However, in comparison with DCT, CNN-kNN produces 6.5% and 2.2% improvements on the wrist and thigh respectively. Both improvements are statistically significant at 95% using a paired t-test.

Fig. 2. Results of different representations (Time, Freq, DCT, FFT, CNN, CNN-kNN)

It is known that one of the major bottlenecks in applying deep learning is the amount of time required for training. Hence, it is important to understand the effect of training time on the performance of both CNN and CNN-kNN. In particular, we would like to see the level of performance that can be achieved with minimum training time. Figure 3 presents the results of CNN and CNN-kNN at between 1 and 5 epochs of training for the wrist (left) and thigh (right).

Fig. 3. Results for CNN and CNN-kNN after training for between 1-5 epochs.

Note that CNN-kNN outperforms CNN on both wrist and thigh at all 5 training epochs. Also, the performance of CNN-kNN is good (on par with Freq) even after a single epoch of training. These results are an important finding of this work and demonstrate the robustness of kNN in effectively using deep features, irrespective of the amount of time spent on training.

Finally, we analyse the effect of the depth of our network on the quality of the deep features we are able to extract for kNN. Figure 4 shows the performance of CNN-kNN with different numbers of convolution layers, between 3 and 5. Note that the best performance for the thigh is achieved using 3 convolution layers (0.949) and performance gradually decreases with the addition of more convolution layers (0.947 for 4 and 0.937 for 5). In contrast, performance on the wrist shows a significant increase (at 95% using a paired t-test) with additional layers, from 0.73 for 3 layers to 0.84 for 5 layers. This indicates that deeper networks are required for effective feature extraction on more difficult datasets. However, a relatively shallow architecture seems sufficient for easier datasets.

Fig. 4. Results for CNN-kNN at different depths between 3-5 convolution layers.

5 Conclusion

In this paper, we have presented an analysis of different feature representation approaches for the purpose of human activity recognition using kNN. These feature representation approaches can be broadly categorised into three classes: hand-crafted, frequency transform and deep features. Evaluation is conducted using accelerometer data collected from two different body locations: wrist and thigh. Results show deep features to significantly outperform the other representation types on both wrist and thigh, by a margin of over 6.5% on the wrist and 2.2% on the thigh. In addition, our evaluation shows kNN to be very effective at using deep features, even when a minimal amount of time is spent training these deep features. Future work will investigate the use of RNNs for feature extraction, due to their ability to model the sequential relationships inherent in time-series accelerometer data.

References

1. K. Bach, T. Szczepanski, A. Aamodt, O. E. Gundersen, and P. J. Mork. Case representation and similarity assessment in the SelfBack decision support system. In Proceedings of the 24th International Conference on Case-Based Reasoning, ICCBR 2016, pages 32-46. Springer International Publishing, 2016.
2. D. Figo, P. C. Diniz, D. R. Ferreira, and J. M. Cardoso. Preprocessing techniques for context recognition from accelerometer data. Personal and Ubiquitous Computing, 14(7):645-662, 2010.
3. O. D. Lara and M. A. Labrador. A survey on human activity recognition using wearable sensors. Communications Surveys & Tutorials, IEEE, 15(3):1192-1209, 2013.
4. Y. LeCun and Y. Bengio. Convolutional networks for images, speech, and time series. In M. A. Arbib, editor, The Handbook of Brain Theory and Neural Networks, pages 255-258. MIT Press, 1998.
5. T. Plötz, N. Y. Hammerla, and P. Olivier. Feature learning for activity recognition in ubiquitous computing. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, IJCAI'11, pages 1729-1734. AAAI Press, 2011.
6. D. Ravi, C. Wong, B. Lo, and G. Z. Yang. A deep learning approach to on-node sensor data analytics for mobile or wearable devices. IEEE Journal of Biomedical and Health Informatics, 21(1):56-64, Jan 2017.
7. S. Sani, N. Wiratunga, S. Massie, and K. Cooper. SelfBack: activity recognition for self-management of low back pain. In Research and Development in Intelligent Systems XXXIII: Incorporating Applications and Innovations in Intelligent Systems XXIV, pages 281-294. Springer, 2016.
8. M. Zeng, L. T. Nguyen, B. Yu, O. J. Mengshoel, J. Zhu, P. Wu, and J. Zhang. Convolutional neural networks for human activity recognition using mobile sensors. In Proceedings of the 6th International Conference on Mobile Computing, Applications and Services, pages 197-205, 2014.
9. S. Zhang, P. Mccullagh, and V. Callaghan. An efficient feature selection method for activity classification. In Proceedings of the IEEE International Conference on Intelligent Environments, pages 16-22, 2014.
10. Y. Zheng, W.-K. Wong, X. Guan, and S. Trost. Physical activity recognition from accelerometer data using a multi-scale ensemble method. In IAAI, 2013.