Human Walking Behavior Detection with an RGB-D Sensor Network for Ambient Assisted Living Applications

Nicola Mosca, Vito Renò, Roberto Marani, Massimiliano Nitti, Tiziana D’Orazio, and Ettore Stella

National Research Council of Italy, Institute of Intelligent Systems for Automation, Via Amendola 122/DO, 70126 Bari, Italy
mosca@ba.issia.cnr.it

Abstract. Automatically detecting anomalies in human behavior is an important tool in ambient assisted living, especially for elderly people who, for several reasons, cannot be continuously monitored and assisted by a caregiver or a family member. This paper proposes a network of low-cost RGB-D sensors with non-overlapping fields of view, capable of identifying behaviors that are anomalous with respect to a pre-learned normal one. A 3D trajectory analysis is carried out by comparing three different classifiers (SVM, neural network and k-nearest neighbors). Results on real experiments prove the effectiveness of the proposed approach, both in terms of performance and of real-time applicability. Moreover, since depth information can be extracted and used without considering color information, the system can operate while respecting user privacy.

1 Introduction

Ageing of the population is predicted to be a serious global concern in the near future and is already an important challenge in some countries. It will have important implications for many aspects of our lives and will dramatically affect policies around the world, including those governing health care expenditure [13]. In this context, extending the time in which elderly people can act independently, without the need to be housed in assisted living facilities or hospitalized, is not only humanly desirable but also required for economic sustainability [24].

Cameras have proven to be an important technology in different surveillance applications, used in a rapidly growing industry. Many factors contribute to this trend, such as escalating safety and security concerns, decreasing hardware costs and advances in processing and storage capabilities. These advances have enabled automatic tools for monitoring both small and vast areas. Initially focused on security applications [8, 11], their usage has rapidly grown in other sectors as well, including ambient assisted living [18]. Traditionally, video surveillance systems have employed networks of passive cameras with fixed position and orientation, sometimes with pan-tilt-zoom (PTZ) capabilities. Passive camera images often require preprocessing steps, such as automatic gain and white-balance compensation, that reduce issues in subsequent operations. These operations are often indispensable for addressing the challenging illumination conditions found in real situations.

Video surveillance applications employ object detection algorithms, along with higher-level processing such as tracking or event analysis, to extract meaningful data from the captured scenes. Detection algorithms vary in relation to the task to be performed and the particular context. Most of the time their focus is on moving objects in an otherwise static environment: a background is dynamically modelled and updated, and moving objects are then obtained through background subtraction techniques. However, these techniques are influenced by illumination conditions, performing poorly both when images appear overly bright and saturated and when captured scenes are dimly lit. Artificial lights can also prove challenging, since light bulbs flicker due to alternating current, with consequences on the background modelling, although some solutions can be employed to counter this effect [19].
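As a minimal illustration of this classical pipeline (and, implicitly, of its sensitivity to the background model), the following sketch uses OpenCV's MOG2 background subtractor; the input file name and the thresholds are hypothetical choices, not part of any specific system.

```python
import cv2

# Minimal classical pipeline: model the background, subtract it, and keep
# the resulting foreground blobs as moving-object candidates.
cap = cv2.VideoCapture("corridor.avi")          # hypothetical input video
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)              # background model is updated here
    # Morphological opening removes speckle noise from the foreground mask.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    # Small blobs are discarded; sudden illumination changes or flicker can
    # still flood the mask, which is the failure mode discussed above.
    blobs = [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) > 500]
cap.release()
```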
Part-based human body detectors based on color information have been proposed in [22], where SVM classifiers are used to learn specific parts of the human body on a variety of poses and backgrounds. This approach is able to handle partial occlusions, enabling robust identification and tracking in crowded scenes. Nie et al. [14] developed an algorithm for tracklet association with overlapping objects in crowded scenes: occlusions are handled using a part-based similarity algorithm, while the tracklet association is formulated as a maximum a posteriori problem using a Markov chain with spatiotemporal context constraints. In [5] Bouma et al. propose a system for the video surveillance of a shopping mall. In this case, the researchers employ several pedestrian detector algorithms instead of an object detector based on background subtraction, citing the limits of this technique in providing a reliable segmentation in crowded environments.

The challenges related to the employment of RGB sensors, even when used with stereo algorithms, have led researchers to investigate other sensors, such as time-of-flight cameras [2], capable of directly providing depth information. In recent times, novel camera systems such as the Microsoft Kinect, pushed by research advances and economies of scale, have enabled a widespread development of 3D vision algorithms operating on RGB-D data [7]. Furthermore, by offloading the depth computation from the CPU to a dedicated peripheral, these systems have enabled the development of more complex techniques capable of real-time performance. Last but not least, this has important and beneficial implications for addressing (or at least reducing) privacy issues in the AAL context: since depth is computed externally and 3D information can be obtained by other means, the system can extract useful information without relying on color data at all.

In [1] researchers proposed a multi-Kinect system designed to monitor indoor environments, looking for a camera placement with minimal overlap between the fields of view, in order to minimize sensor interference, a common issue in active camera systems. Positional data are expressed in a common coordinate system, enabling the whole solution to work with the combination of mean-shift and Kalman-based algorithms proposed in the pipeline of [2]. Human action recognition has also benefited from this trend through techniques based on skeletal tracking. In [12] researchers used a circular array of Kinect sensors surrounding a central treadmill; human actions are then classified by a support vector machine operating on the extracted three-dimensional skeletal data. The enhanced tracking, segmentation and pose estimation provided by the Kinect libraries are used in [21] for accurate people segmentation. This information is then fed to a particular implementation of Multiple Component Dissimilarity (MCD) for person re-identification through features extracted from the color data.

In addition to tracking, event analysis is another major requirement in most surveillance applications.
It can be approached either with high-level semantic interpretation of video sequences or by performing anomaly detection, i.e. by subdividing sequences into normal and abnormal sets and employing classification techniques to learn a model able to discriminate between them. In [17] Piciarelli et al. follow this approach, using a single-class support vector machine to identify anomalous trajectories.

The work presented in this paper approaches the event analysis problem by learning a model. In this case, walking behaviors are split between normal and anomalous ones by relying on trajectory evaluation, focusing on the detection of wandering situations. Wandering is indeed an important hint in the assisted living context, since it is one of the symptoms shown by people affected by dementia [23].

The system, designed for an indoor environment, uses multiple Kinect cameras, suitably placed in a corridor for maximum coverage without overlap. Skeletal features are extracted from each RGB-D sensor by exploiting the OpenNI framework and considering the extracted torso feature. A properly designed Kalman filter is used for the prediction step and allows robust people tracking, both inter-camera and intra-camera. Trajectories are assembled in a common reference system, extrapolating the path between cameras using splines. Finally, anomalous behavior detection is performed by comparing multiple classification algorithms: in addition to an SVM classifier, we use a k-NN algorithm and a feed-forward neural network trained with backpropagation. Additional information about the methodology is reported in Section 2, while experimental results follow in Section 3. Conclusions and considerations on future research are drawn in Section 4.

2 Methodology

The methodology proposed in this paper can be summarized in three main blocks, namely:

1. 3D data acquisition and preprocessing;
2. Feature extraction;
3. Behavior classification.

Data coming from one or multiple RGB-D sensors is initially acquired and pre-processed to obtain three-dimensional trajectories of a moving subject. Then, a specific set of features is extracted from each trajectory in order to perform the classification task and understand whether it corresponds to an anomalous behavior or not.

2.1 3D Data Acquisition and Preprocessing

In the first step, several RGB-D sensors with non-overlapping fields of view are employed to acquire depth data from the observed scene. In order to refer the depth maps produced by each sensor, or equivalently the corresponding point clouds, to a global reference system, it is necessary to perform a preliminary calibration phase. Several reference points, whose coordinates in a global reference system are already known, are observed by each camera and are used to determine the transformation matrices between the local reference systems and the global one. In particular, by considering the position of every sample point in the local reference systems $C_{K_i}$, $i = 1, 2, \ldots, n$, of the $n$ RGB-D sensors, it is possible to find the $4 \times 4$ matrices $M_i$ which transform every point $p^{K_i} = [x_p^{K_i}, y_p^{K_i}, z_p^{K_i}, 1]^T$, defined in $C_{K_i}$, into the global reference system $C$, since $p^C = [x_p^C, y_p^C, z_p^C, 1]^T = M_i \, p^{K_i}$. Solutions are obtained in the least-squares (LS) sense through the application of a standard registration algorithm based on the singular value decomposition [6].
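As a sketch of how each matrix $M_i$ can be obtained, the following function implements an SVD-based least-squares rigid registration of matched point pairs; the function name is ours, and exact one-to-one correspondences between the locally observed reference points and their surveyed global coordinates are assumed.

```python
import numpy as np

def registration_matrix(local_pts, global_pts):
    """Least-squares rigid transform mapping local_pts onto global_pts.

    Both inputs are N x 3 arrays with matching rows: the reference points
    seen by one sensor and their known global coordinates."""
    cl, cg = local_pts.mean(axis=0), global_pts.mean(axis=0)
    H = (local_pts - cl).T @ (global_pts - cg)   # 3 x 3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                     # guard against reflections
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    M = np.eye(4)                                # homogeneous 4 x 4 matrix M_i
    M[:3, :3] = R
    M[:3, 3] = cg - R @ cl                       # translation term
    return M

# A local point [x, y, z] is then mapped by p_C = M @ [x, y, z, 1].
```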
Once all cameras refer to the same coordinate system, people have to be detected and then tracked over time. As described in the next section, in this paper we use the OpenNI framework to detect people and recover their 3D positions in each frame. Since people can move in an extended region, performing complex movements that are not completely within the field of view of a single camera, it is necessary to merge into a unique trajectory the 3D points generated by a user in each camera. For this reason a Kalman filter [3] has been designed to predict, at each frame, the positions of the users detected at the previous frame and to further filter measurement noise.

Following the diagram in Fig. 1, at a specific time $t$ the user detection procedure segments $Q$ new users ($Q \ge 0$) in the fields of view of the $n$ sensors placed in the environment, while $N_{t-1}$ users ($N_{t-1} \ge 0$) were computed at the previous discrete time instant $t-1$. The user tracking task aims to associate, if possible, users detected at time $t$ with those identified at time $t-1$. We suppose that each user detected at $t-1$ moves with constant velocity; its position is thus predicted by a Kalman filter, which operates on a state vector defined by the position and the speed of the user. The predicted positions of the $N_{t-1}$ users are then compared with those observed in the environment. This comparison is mediated by a cost, easily defined as the Euclidean distance between the position of every current user and those of the $N_{t-1}$ previous ones: users with close positions, i.e. with small cost values, are associated. Finally, for each reassigned user, the state of the Kalman filter is updated in order to reduce the contribution of measurement noise.

Fig. 1. Data processing scheme for user tracking with the Kalman filter.

This strategy is applied between every pair of consecutive frames, where in general three different events can arise:

– Users are still visible in the field of view of the specific sensor and are correctly assigned to the corresponding new instances observed in the scene. In this case the Kalman filter corrects the measurement, in accordance with the previous estimation;
– New users enter the scene and are detected in the current frame $t$. New instances are then initialized with the states of the detected users;
– Users are lost and no longer visible in the fields of view of the sensors. The states of the lost users are still kept in the analysis and evolve following the model of the Kalman filter, i.e. at constant velocity.

As a result of this processing, each user is tracked within the whole environment, leading to the generation of a trajectory $\Theta_j = [\theta_1, \theta_2, \ldots, \theta_{N_j}]$ that contains $N_j$ 3D points, where $\theta_k$ represents the 3D information associated with time $t_k$. The number of points $N_j$ depends on the duration of the time interval in which the specific user $U_j$ is tracked by the proposed algorithm. Moreover, each trajectory is fit with a smoothing spline $\sigma$ to obtain a single continuous trajectory starting from the multiple sub-trajectories acquired by each sensor. $\sigma$ is the curve, defined starting from a smoothing parameter $s$ (in our experiments $s = 0.99$), that minimizes the quantity

$$ s \sum_{k} \left( \Theta(t_k) - \sigma(t_k) \right)^2 + (1 - s) \int \left( \frac{d^2 \sigma}{dt^2} \right)^2 dt $$

where $t_k$ represents the time at which a point is observed or interpolated. Both $\Theta(\cdot)$ and $\sigma(\cdot)$ refer to the same time basis.
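A minimal sketch of this predict/associate/update cycle is given below, assuming detections already expressed in the global reference system; the frame rate, noise covariances, gating threshold and the greedy nearest-neighbour matching are illustrative choices of ours, not the exact parameters of the deployed system.

```python
import numpy as np

DT = 1 / 30.0                        # assumed frame period
F = np.eye(6)                        # constant-velocity state transition,
F[:3, 3:] = DT * np.eye(3)           # state: [x, y, z, vx, vy, vz]
H = np.hstack([np.eye(3), np.zeros((3, 3))])   # only position is observed
Q = 1e-3 * np.eye(6)                 # assumed process noise
R = 1e-2 * np.eye(3)                 # assumed measurement noise

class Track:
    def __init__(self, pos):
        self.x = np.hstack([pos, np.zeros(3)])   # initial state, zero velocity
        self.P = np.eye(6)

    def predict(self):
        self.x = F @ self.x
        self.P = F @ self.P @ F.T + Q
        return self.x[:3]                        # predicted 3D position

    def update(self, z):
        S = H @ self.P @ H.T + R
        K = self.P @ H.T @ np.linalg.inv(S)      # Kalman gain
        self.x = self.x + K @ (z - H @ self.x)
        self.P = (np.eye(6) - K @ H) @ self.P

def associate(tracks, detections, max_cost=0.5):
    """Greedy nearest-neighbour assignment on Euclidean distance,
    standing in for the cost-based matching described above."""
    preds = [t.predict() for t in tracks]
    unmatched = set(range(len(detections)))
    for i, p in enumerate(preds):
        if not unmatched:
            break
        j = min(unmatched, key=lambda j: np.linalg.norm(detections[j] - p))
        if np.linalg.norm(detections[j] - p) < max_cost:
            tracks[i].update(detections[j])      # matched: correct the filter
            unmatched.discard(j)
    for j in unmatched:                          # new users entering the scene
        tracks.append(Track(detections[j]))
    # Lost users are simply left evolving at constant velocity.
```

The subsequent smoothing step can then be reproduced with any cubic smoothing-spline implementation exposing the parameter $s$ of the objective above (the third-party csaps Python package is one such option).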
2.2 Feature Extraction

Trajectories can be seen as raw data that need to be turned into a more discriminative representation, i.e. feature vectors, before a classifier can decide whether the behavior of the selected user is anomalous. In this work, eleven features have been identified for each trajectory and used to define the feature vector $x = [x_1, x_2, \ldots, x_{11}]$ that is the input of the subsequent classifier. $x$ is populated in the following manner:

– the first five elements are, respectively, the mean, median, standard deviation, median absolute deviation (MAD) and maximum value of the velocity computed on $\Theta_i$ (defined as the ratio between the displacement on the XY plane and the temporal difference between subsequent frames);
– the next five elements are the mean, median, standard deviation, MAD and maximum value of the curvatures evaluated on the spline trajectory $\sigma_i$; each curvature is defined as the reciprocal of the radius of the circle passing through three consecutive trajectory points;
– the last element of the feature vector is the number of intersections of the trajectory with itself.

It is worth pointing out why the last feature is important for classification tasks involving wandering or loitering detection: moving without any aim or purpose often results in trajectories where loops (or self-intersections) are more frequent than when a person travels from one point to another with a specific goal in mind.
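A compact sketch of these eleven features is shown below, assuming the raw trajectory and its spline-smoothed counterpart are available as NumPy arrays; the helper names and the choice to count self-intersections on the XY projection of the raw path are our assumptions.

```python
import numpy as np

def _stats(v):
    """Mean, median, standard deviation, MAD and maximum, as listed above."""
    v = np.asarray(v)
    return [v.mean(), np.median(v), v.std(),
            np.median(np.abs(v - np.median(v))), v.max()]

def _curvature(p0, p1, p2):
    """Reciprocal of the circumradius of the circle through three points."""
    a, b, c = (np.linalg.norm(p1 - p0), np.linalg.norm(p2 - p1),
               np.linalg.norm(p2 - p0))
    area = 0.5 * np.linalg.norm(np.cross(p1 - p0, p2 - p0))
    return 0.0 if area == 0 else 4.0 * area / (a * b * c)

def _cross(o, a, b):
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def _segments_intersect(p, q, r, s):
    """True if the 2D segments pq and rs properly cross each other."""
    return (_cross(p, q, r) * _cross(p, q, s) < 0 and
            _cross(r, s, p) * _cross(r, s, q) < 0)

def feature_vector(points, times, spline_pts):
    """points: N x 3 raw trajectory, times: N timestamps,
    spline_pts: M x 3 smoothed trajectory sigma."""
    xy = points[:, :2]
    speeds = np.linalg.norm(np.diff(xy, axis=0), axis=1) / np.diff(times)
    curvs = [_curvature(*spline_pts[i:i + 3])
             for i in range(len(spline_pts) - 2)]
    loops = sum(_segments_intersect(xy[i], xy[i + 1], xy[j], xy[j + 1])
                for i in range(len(xy) - 1) for j in range(i + 2, len(xy) - 1))
    return np.array(_stats(speeds) + _stats(curvs) + [loops])
```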
2.3 Behavior Classification

Three different supervised classifiers have been employed in this experiment: a support vector machine (SVM) [4], a k-nearest neighbor (k-NN) classifier [9] and a neural network (NN) [10]. The first is a binary classifier that estimates the boundary that best divides two different clusters in a multidimensional space; in other words, it looks for the hyperplane that maximizes the margin with respect to the training data by solving an optimization problem. The k-nearest neighbor classifier assigns a new incoming sample by evaluating its k nearest samples among the training data through a voting procedure (in this case, k has been set to 1). Finally, the neural network approximates a model of an unknown function by using artificial neurons arranged in several layers and adjusting the weights of the connections between them. In this work, the 11 input features are mapped onto two classes (normal, anomalous), represented by two different output nodes; this way, ambiguous cases can be detected and dealt with accordingly.

3 Experiments and Discussion

The next subsections introduce the actual setup used in our experiments, describing the employed sensors and the system architecture. The input dataset is presented together with the classification results obtained by the SVM, k-NN and neural network.

3.1 Experimental Setup

Unfortunately, to the best of our knowledge, no public dataset exhibiting the walking behaviour of dementia patients has been published yet. The proposed methodology has therefore been applied to the analysis of videos produced by a set of RGB-D cameras placed within an indoor environment, namely a corridor of a research lab, where a group of researchers agreed to take part in the experiment. They were sometimes tasked with achieving a particular goal (reaching a location, collecting a print, talking, etc.), while in other cases they were instructed to just loiter freely, in order to simulate the abnormal behavior.

With reference to the sketch map in Fig. 2a, three Microsoft Kinect sensors K1, K2 and K3 are arranged within the corridor. Specifically, K2 and K3 focus on the ends of the corridor, while K1 covers the central area (Fig. 2b). Each sensor is locally connected to a node for data storage, whereas the whole system is remotely controlled by a server unit via a UDP protocol. The server sends a start signal to every node, which enables video recording; each video, which lasts 30 seconds, is finally downloaded by the server. Since the start signal is sent to the nodes through the network, it is received with slight delays; however, these are negligible and do not affect the whole system pipeline. As an example, a frame captured by the Kinect K2 is displayed in Figs. 2c-d, where a depth map and the corresponding RGB image are shown, respectively.

The positions of the three Kinect cameras have been chosen to cover the largest area without overlap between their cones of sight (red regions in the sketch). This ensures the best working conditions for the sensors, since no interference phenomena can alter the depth maps. However, it produces shadow areas, e.g. the regions between K1 and K2 (narrow shadow) or between K1 and K3 (wide shadow).

As previously stated, the calibration phase needs a few points of known position in the reference systems of both the Kinect cameras and an external surveying instrument. In Figs. 2c-d the red circles enclose corresponding objects between the depth and the RGB images, which are used to calibrate the sensors and transform data into a global reference system. In order to measure the position and attitude of each camera, a theodolite (Nikon Total Station D50 [15]) has been used.

In this paper, in order to detect and segment people silhouettes, we have used the well-known OpenNI framework together with the PrimeSense NiTE library [16], which is able to recognize and track up to 15 user skeletons. Although this framework also integrates a robust tracking algorithm, it has been used only for the extraction of the skeletal joints, specifically the torso joint, which is assumed as the center of mass of the detected user. Additionally, users can perform complex movements spanning different cameras: since each Kinect works independently from the others, it is necessary to address people re-identification at a higher level, which is not provided by OpenNI.

Fig. 2. (a) Map of the corridor and position of the three Microsoft Kinect cameras used in the proposed experiments. (b) Picture of the actual environment. (c)-(d) Depth map and corresponding RGB image. Red circles highlight corresponding objects.

Two examples of acquired trajectories belonging to the two different classes of behavior (normal and anomalous) are reported in Fig. 3, with a single user moving within the environment. Here, blue lines display the actual trajectories captured by the RGB-D cameras, whereas the red ones are generated by spline interpolation, which is also able to reconstruct the user movements outside the fields of view of the three Kinect sensors.

Fig. 3. Comparison of trajectories belonging to (a) normal and (b) anomalous behavior classes. The inset of (b) highlights the final part of an anomalous trajectory.

3.2 Classification Results

The preliminary task of trajectory extraction has been used to create a dataset of 60 user paths within the corridor under inspection. Each path refers to the observation of a single individual that has been recognized across the three Kinect sensors.
It should be noted that user paths are extracted even when several people move simultaneously in the scene, as the tracking procedure based on the Kalman filter prediction is able to disambiguate the great majority of people intersections. In the whole dataset, 33 trajectories are associated with a normal behavior and are labeled with 0 (55%), while the remaining 27 anomalous ones are associated with the value 1 (45%).

A k-fold cross-validation method (with k = 5) has been employed to evaluate the capabilities of all the classifiers on the entire available data, since training and test sets change and span the whole dataset. Data has been randomly partitioned into 5 subsets, building the training set with 80% of the data and the test set with the remaining 20%; training and testing are then repeated 5 times per run, iteratively using a different subset as test set. Moreover, in order to better evaluate the accuracy of the tested classifiers, the experiment has been repeated three times per classifier, changing the initial condition (random seed) used for partitioning.

Table 1. Confusion matrices and average accuracy for the experiment. Each entry is the percentage of the dataset with actual class a predicted as class b (written a→b, with 0 = normal and 1 = anomalous); the pairs 0→0 and 1→1 are correct predictions, while 0→1 and 1→0 are classification errors. The accuracy of each run follows its matrix, and the average accuracy over the three runs is reported next to each classifier. The best average accuracy is achieved by the neural network: 93.9%.

SVM (average accuracy 90.5%)
  Run 1: 0→0 53.3%, 0→1 1.7%, 1→0 6.7%, 1→1 38.3% (accuracy 91.6%)
  Run 2: 0→0 53.3%, 0→1 1.7%, 1→0 8.3%, 1→1 36.7% (accuracy 90.0%)
  Run 3: 0→0 53.3%, 0→1 1.7%, 1→0 8.3%, 1→1 36.7% (accuracy 90.0%)

k-NN (average accuracy 87.7%)
  Run 1: 0→0 50.0%, 0→1 5.0%, 1→0 6.7%, 1→1 38.3% (accuracy 88.3%)
  Run 2: 0→0 48.3%, 0→1 6.7%, 1→0 6.7%, 1→1 38.3% (accuracy 86.6%)
  Run 3: 0→0 50.0%, 0→1 5.0%, 1→0 6.7%, 1→1 38.3% (accuracy 88.3%)

NN (average accuracy 93.9%)
  Run 1: 0→0 53.3%, 0→1 1.7%, 1→0 1.7%, 1→1 43.3% (accuracy 96.6%)
  Run 2: 0→0 55.0%, 0→1 0.0%, 1→0 6.7%, 1→1 38.3% (accuracy 93.3%)
  Run 3: 0→0 55.0%, 0→1 0.0%, 1→0 8.3%, 1→1 36.7% (accuracy 91.7%)

Results are reported in Table 1: three runs are performed for each classifier, for a total of 9 confusion matrices, together with the average accuracy over the three runs. The first thing to notice is that both the SVM and the neural network exceed 90% accuracy, implying that the chosen features show acceptable discriminating capabilities when used with such classifiers. On the contrary, k-NN has the worst performance among the considered classifiers. In particular, the neural network appears to be the best classifier among those considered, with an average accuracy of 93.9%.

Some examples of the classified paths are reported in Fig. 4: on the left, three normal behaviors correctly classified; on the right, three anomalous behaviors, characterized by repeated changes of direction or long periods of standing still.

Fig. 4. Examples of classified trajectories: normal behavior on the left, anomalous behavior on the right.
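A sketch of this evaluation protocol using scikit-learn is shown below; the feature and label files are hypothetical placeholders, the feature scaling step is our addition, and the paper's feed-forward network with two output nodes is approximated here by a standard MLPClassifier.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# X: 60 x 11 feature matrix from Section 2.2, y: labels (0 normal, 1 anomalous).
# Hypothetical loader; the paper's dataset is not public.
X, y = np.load("features.npy"), np.load("labels.npy")

classifiers = {
    "SVM": SVC(kernel="rbf"),
    "k-NN": KNeighborsClassifier(n_neighbors=1),       # k = 1 as in the paper
    "NN": MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000),
}

for seed in range(3):                                  # three runs, new seed each
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    for name, clf in classifiers.items():
        pipe = make_pipeline(StandardScaler(), clf)    # scaling is our assumption
        scores = cross_val_score(pipe, X, y, cv=folds)
        print(f"run {seed}  {name}: mean accuracy {scores.mean():.3f}")
```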
4 Conclusions

In this paper we propose the use of multiple Kinect cameras to develop a low-cost system able to recognize anomalous human behaviors for both surveillance and AAL applications. The torso node is extracted from the skeleton features provided by the OpenNI framework. A properly designed Kalman filter is used for the prediction step and allows robust people tracking, both inter-camera and intra-camera. Anomalous behavior detection is then performed by comparing multiple classification techniques: ANN, SVM and k-NN.

Experimental results demonstrate that the proposed architecture and the developed methodologies are able to recognize anomalous behaviors in the great majority of the observed paths. However, it should be noted that the initial association of the paths in the dataset to normal or anomalous behaviors has been made by a human operator observing each path performed by the users. In future research, more paths will be considered simultaneously when making a decision about a behavior, since some of the paths considered as anomalies could simply be due to interactions among people.

In its current form, the system could fail when multiple people enter the camera field of view simultaneously and do not maintain the same walking direction while crossing occluded areas. In this case a people re-identification procedure [20], based on color features, could be used to avoid false associations and correctly perform the trajectory reconstruction.

Another aspect that is deemed to be important, especially in AAL applications, is the possibility to compare observations over time, for example to detect medium/long-term behavior changes and provide early warnings to a caregiver, enabling a prompt and more effective response. Moreover, the compared methodologies have been tested on a dataset with observation windows limited to a fixed length of 30 seconds; it will be worth investigating how solutions exploiting an internal memory, such as recurrent neural networks, can help in detecting anomalies in sequences of arbitrary length.

Last but not least, more joints of the skeletal reconstruction will be considered, both to enhance system robustness and to extend the applicability of the methodology to gait analysis and other applications in the AAL context.

References

1. Almazan, E., Jones, G.: Tracking people across multiple non-overlapping RGB-D sensors. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 831–837 (2013)
2. Bevilacqua, A., Di Stefano, L., Azzari, P.: People tracking using a time-of-flight depth sensor. In: 2006 IEEE International Conference on Video and Signal Based Surveillance, pp. 89–89. IEEE (2006)
3. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer (2006)
4. Boser, B.E., Guyon, I.M., Vapnik, V.N.: A training algorithm for optimal margin classifiers. In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 144–152. ACM (1992)
5. Bouma, H., Baan, J., Landsmeer, S., Kruszynski, C., van Antwerpen, G., Dijk, J.: Real-time tracking and fast retrieval of persons in multiple surveillance cameras of a shopping mall. In: SPIE Defense, Security, and Sensing, pp. 87560A–87560A. International Society for Optics and Photonics (2013)
6. Chen, Y., Medioni, G.: Object modelling by registration of multiple range images. Image and Vision Computing 10(3), 145–155 (1992)
7. D’Orazio, T., Marani, R., Renò, V., Cicirelli, G.: Recent trends in gesture recognition: how depth data has improved classical approaches. Image and Vision Computing (2016)
8. D’Orazio, T., Guaragnella, C.: A survey of automatic event detection in multi-camera third generation surveillance systems. International Journal of Pattern Recognition and Artificial Intelligence 29(01), 1555001 (2015)
9. Fix, E., Hodges Jr., J.L.: Discriminatory analysis, nonparametric discrimination: consistency properties. Tech. rep., DTIC Document (1951)
10. Haykin, S.: Neural Networks: A Comprehensive Foundation (2004)
11. Hu, W., Tan, T., Wang, L., Maybank, S.: A survey on visual surveillance of object motion and behaviors. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 34(3), 334–352 (2004)
12. Kwon, B., Kim, D., Kim, J., Lee, I., Kim, J., Oh, H., Kim, H., Lee, S.: Implementation of Human Action Recognition System Using Multiple Kinect Sensors, pp. 334–343. Springer International Publishing, Cham (2015)
13. Lutz, W., Sanderson, W., Scherbov, S.: The coming acceleration of global population ageing. Nature 451(7179), 716 (2008)
14. Nie, W., Liu, A., Su, Y.: Multiple person tracking by spatiotemporal tracklet association. In: Advanced Video and Signal-Based Surveillance (AVSS), 2012 IEEE Ninth International Conference on, pp. 481–486. IEEE (2012)
15. Nikon: Total Station, http://www.nikon.com/about/technology/life/others/surveying/
16. OpenNI: OpenNI website, http://openni.ru/
17. Piciarelli, C., Micheloni, C., Foresti, G.L.: Trajectory-based anomalous event detection. IEEE Transactions on Circuits and Systems for Video Technology 18(11), 1544–1554 (2008)
18. Rashidi, P., Mihailidis, A.: A survey on ambient-assisted living tools for older adults. IEEE Journal of Biomedical and Health Informatics 17(3), 579–590 (2013)
19. Renò, V., Marani, R., Nitti, M., Mosca, N., D’Orazio, T., Stella, E.: A powerline-tuned camera trigger for AC illumination flickering reduction. IEEE Embedded Systems Letters PP(99), 1–1 (2017)
20. Renò, V., Politi, T., D’Orazio, T., Cardellicchio, A.: An human perceptive model for person re-identification. In: VISAPP 2015, pp. 638–643. SCITEPRESS (2015)
21. Satta, R., Pala, F., Fumera, G., Roli, F.: Real-time appearance-based person re-identification over multiple Kinect™ cameras. In: VISAPP (2), pp. 407–410 (2013)
22. Shu, G., Dehghan, A., Oreifej, O., Hand, E., Shah, M.: Part-based multiple-person tracking with partial occlusion handling. In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 1815–1821. IEEE (2012)
23. Warren, A., Rosenblatt, A., Lyketsos, C.G.: Wandering behaviour in community-residing persons with dementia. Int. J. Geriat. Psychiatry 14, 272–279 (1999)
24. Zweifel, P., Felder, S., Meiers, M.: Ageing of population and health care expenditure: a red herring? Health Economics 8(6), 485–496 (1999)