An Efficient Framework for the Clustering of Human Activity Data using Kernelized Robust Covariance Descriptors

Guntru Prasanth Kumar*, M. S. Subodh Raj and Sudhish N. George

National Institute of Technology Calicut, India

Abstract

In this paper, a new method for the efficient clustering of human activity data is proposed. Unlike the traditional human activity clustering approaches, our method relies on the skeletal data recorded with the help of motion capture (mocap) systems to achieve the goal. The proposed method is structured around the kernel-based robust covariance descriptor. By introducing a data re-framing technique that efficiently utilizes the temporal properties of the human activity data, we have alleviated the data redundancy and insufficiency issues associated with action sequences. The optimization model developed encompasses the combined benefits of low-rank representation and least square regression. The formulation is strengthened by incorporating the temporal dependency of the human activity sequences with the help of a temporal Laplacian regularizer. With the proposed algorithm, a representation matrix is learned from the raw data, which is then used to perform subspace clustering. Experiments conducted on multiple human activity datasets reveal the ability of the proposed method to achieve better clustering results compared to state-of-the-art counterparts.

Keywords

Human activity data, Kernelized covariance descriptors, Temporal Laplacian regularization, Temporal subspace clustering.

1. Introduction

Human activity recognition (HAR) from action sequences remains a challenging research topic in computer vision due to its multifaceted applications [1, 2]. HAR finds application in visual surveillance, healthcare, human-machine interfaces, video retrieval, and the entertainment industry, to name a few [3, 4]. Traditional approaches to HAR use RGB video sequences as the input.
Handcrafted features are later extracted from the video sequences for the purpose of activity recognition [3, 4]. Because of the high-dimensional nature of the video sequences, such HAR approaches often incur high computational complexity [1]. Later, sensor-based HAR gained popularity. Such methods focus on the data obtained from sensors such as accelerometers and gyroscopes. In such cases, the subject must carry the sensor so that the movements of the human body can be recorded. This requirement, along with the influence of noise, acts as a limiting factor in sensor-based HAR approaches [5, 1, 6].

The 11th Colour and Visual Computing Symposium, September 08-09, 2022, Gjøvik, Norway
* Corresponding author.
prasanthkumarguntru7@gmail.com (G. P. Kumar); subodhrajms@gmail.com (M. S. S. Raj); sudhish@nitc.ac.in (S. N. George)
ORCID: 0000-0002-1111-9520 (M. S. S. Raj); 0000-0002-0886-9478 (S. N. George)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

With the evolution of mocap systems, new modalities were introduced to represent the human activity information. Such modalities include the motion depth maps and the skeletal representations [7, 8]. With skeletal representations, time series of 3D joint positions of the human body are recorded by the mocap systems [9]. The spatio-temporal quality of such recordings is high enough that they find application in multiple domains including gait analysis, medical rehabilitation, and computer animation [7, 3]. The common approaches to HAR with mocap data involve supervised learning methods. Though they guarantee good results, such methods are often susceptible to missing sample issues.
Further, the requirement of a huge clean dataset for the initial learning of the system poses a serious bottleneck to supervised HAR approaches [8, 10]. This motivates the need for unsupervised HAR strategies. Although unsupervised methods alleviate the aforementioned challenges faced by supervised methods, they encounter other limitations. The main challenge in unsupervised HAR is that the task needs to be performed in a robust and accurate manner without any prior knowledge about the data samples [11].

Recent studies showcase the ability of subspace clustering algorithms in dealing with high-dimensional data clustering problems [12]. The key idea in the subspace clustering approach is to identify multiple low-dimensional subspaces from which the data originates [13]. Each subspace so identified houses a cluster of data. This approach is commonly termed the Union of Subspaces (UoS) model [14]. The popularity of the subspace clustering approach has increased with the introduction of sparse subspace clustering (SSC), in which a sparsity constraint is imposed on the coefficients in order to learn a sparse representation of the raw data [14]. Low-rank representation learning (LRR) [15] is another technique used with subspace clustering, wherein the global structure of the data is considered while learning the coefficients. In LRR-based approaches, a given dictionary is utilized to learn a low-rank representation of the data samples. The clustering results obtained with such low-rank representations are usually better [16]. Least square regression (LSR) [17] based subspace clustering, in which the grouping of data samples is performed with the help of the Frobenius norm operator, is another promising approach. The aforementioned approaches, when used independently, are not suitable for HAR as they do not consider the time series information associated with the skeletal data.
As a solution to this problem, we have developed an approach called time series activity clustering (TSAC), which utilizes the combined advantages of LRR and LSR. We have incorporated a kernel-based robust covariance descriptor to extract features out of the raw input data with the aim of exploring the non-linearity present in the data. Further, the temporal Laplacian regularizer is employed to capture the temporal dependency among the data samples.

Figure 1: Overview of the proposed time series activity clustering. [Block labels recoverable from the figure: Input action sequences; Data re-framing; Kernelized covariance descriptors; Vectorization and stacking; Feature matrix; Representation matrix; Affinity matrix; Subspace clustering. Stage labels: Data pre-processing and feature matrix generation; Time series activity clustering.]

The following are the main contributions of this work:

1. An unsupervised optimization model with improved performance is formulated for the clustering of human activity data. To effectively utilize the temporal dependencies of human activity sequences, a temporal Laplacian regularizer is introduced in the proposed model.
2. We have blended the LRR and LSR based subspace clustering approaches to achieve better clustering results. A clean dictionary is learned along with the representation matrix in order to facilitate the efficient use of the LRR model.
3. Data re-framing techniques are introduced to deal with data redundancy and insufficiency issues associated with the human activity timestamps. With the help of a robust kernel-based covariance descriptor, the underlying non-linear dependencies of the human activity timestamps are well utilized.
4. A representation matrix is generated by solving the formulated optimization problem using the Alternating Direction Method of Multipliers (ADMM) approach in an iterative manner. The representation matrix so obtained is later used to perform subspace clustering of the action sequences.
The performance of the proposed model is verified against the state-of-the-art counterparts using multiple human activity datasets.

The rest of the work is outlined as follows. The proposed method, the problem formulation, and the solutions obtained are presented in Section 2. Section 3 explains the experimental validation done using the proposed method. Finally, conclusions are drawn in Section 4.

2. Proposed Method

In this section, we give a detailed outline of the proposed TSAC approach. The workflow of the proposed approach is shown in Fig. 1.

2.1. Data Re-framing

A sequence of human action is a collection of action timestamps evolved over a period of time. Two important observations can be made with reference to a given collection of action sequences:

1. The number of timestamps corresponding to each of the action sequences may not necessarily be the same. This disparity often appears as a bottleneck in generalizing any algorithm dealing with human activity recognition.
2. As the number of timestamps in an action sequence increases, it introduces additional computation overhead. That leads to a rise in demand for resource utilization.

The aforementioned challenges can be addressed by standardizing the number of timestamps for all the action sequences under consideration. If the number of timestamps is standardized to be $N$, then data pruning methods need to be employed on action sequences having more than $N$ timestamps, and data augmentation needs to be performed on action sequences having fewer than $N$ timestamps. The temporal smoothness property of human action sequences can be conveniently utilized to achieve this goal.

Human action sequences are temporally highly correlated. As far as activity recognition is concerned, this correlation leads to redundant information. Often, the complete set of frames of a recorded action sequence is not essential to perform activity recognition.
Thus, we introduce a pruning technique, termed succession pruning, in which alternate frames of the action sequence are removed to eliminate redundancy while maintaining the temporal properties of the action sequence. That drastically reduces the amount of data to be processed and also results in reduced computation overhead and resource utilization. For action sequences with an insufficient number of timestamps, in contrast, we perform timestamp augmentation. In this process, the trailing end of the action sequence is augmented with the terminal timestamps of the same action sequence. That is in line with the temporal smoothness property of the human action data.

2.2. Feature Matrix Generation using Kernel-based Robust Covariance Descriptor

Human actions are represented in the form of skeletal structures with the help of modern motion capture systems. Each timestamp of an action sequence can be represented as a collection of $n$ joints. Thus, the timestamp $t$ of an action sequence can be represented as $\mathbf{e}(t) \in \mathbb{R}^{3 \times n}$, a collection of 3D positions $\{\mathbf{e}_1(t), \ldots, \mathbf{e}_n(t)\}$. The 3D coordinates of the $i$-th joint of the skeletal structure corresponding to the $t$-th timestamp can be denoted as $\mathbf{e}_i(t) = [x_i(t), y_i(t), z_i(t)]^\top$.

Once the data re-framing process is completed, the raw action sequences with a fixed number of timestamps are represented in the form of a feature matrix. This involves a two-step process. In the first step, we use the concept of covariance to obtain the covariance descriptor of each action sequence. The use of covariance helps us to capture the changes pertaining to each joint of the skeletal structure [18]. If $\mu$ represents the temporal average of the timestamps of an action sequence, then the corresponding action sequence can be represented in the form of a covariance matrix as shown below.
$$\Psi = \frac{1}{N-1} \sum_{t=1}^{N} [\mathbf{e}(t) - \mu][\mathbf{e}(t) - \mu]^\top \tag{1}$$

This process is repeated for each action sequence, resulting in a unique covariance descriptor for each input sequence. Although covariance descriptors find application in multiple domains, they cannot capture the non-linearity present in the data. To make the feature matrix robust, different approaches are adopted to incorporate additional statistical information along with the covariance descriptor. These include the entropy-based approaches, the mutual information based approaches, and the kernel-based approaches. Among others, the kernel-based approaches have been used to simulate more complex models. The use of kernels improves the descriptive power of the covariance matrices. The work proposed by Cavazza et al. [18] showcases the benefits of using kernels in works related to human activity recognition. Motivated by this observation, we modify the expression given in Eq. (1) to incorporate the kernel function and obtain the following robust covariance descriptor.

$$\Psi = \frac{1}{N-1} \sum_{t=1}^{N} \big[\mathcal{K}(\mathbf{e}(t)) - \mu_\kappa\big]\big[\mathcal{K}(\mathbf{e}(t)) - \mu_\kappa\big]^\top \tag{2}$$

Here, $\mathcal{K}(\cdot)$ represents the kernel function and $\mu_\kappa$ is the temporal average of the kernel entries. The choice of kernel function is application specific. We have used two kernel functions, namely the polynomial kernel and the exponential kernel, out of which the latter has produced promising results. The exponential kernel is defined as:

$$\mathcal{K}(\mathbf{e}(t)) = \exp\left\{ \frac{\mathbf{e}(t)}{(\sigma + b)^2} \right\} \tag{3}$$

where $b > 0$ and $\sigma$ is the kernel bandwidth.

After obtaining the robust covariance descriptors for each action sequence, in the second step we generate the feature matrix. The covariance descriptor $\Psi$ contains redundant information as it is symmetric about the main diagonal. In order to reduce the amount of data to be processed, we vectorize each covariance descriptor by retaining the upper triangular values alone. Later, we stack (as columns) each of the vectors so obtained to form the feature matrix $\mathbf{X} \in \mathbb{R}^{p \times k}$.
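The pre-processing stage of Sections 2.1 and 2.2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the text does not say how exactly $N$ timestamps are reached, so the sketch halves the sequence until it has at most $N$ frames and then repeats the final frame; each frame is flattened to a $3n$-vector before the kernel map; and the function names, `sigma = 1`, and `b = 1` are hypothetical choices.

```python
import numpy as np

def reframe(seq, n_frames):
    """Standardize a (T, n_joints, 3) sequence to exactly n_frames timestamps."""
    seq = np.asarray(seq, dtype=float)
    # Succession pruning: drop alternate frames while the sequence is too long.
    while seq.shape[0] > n_frames:
        seq = seq[::2]
    # Timestamp augmentation (assumption: the terminal frame is repeated
    # at the trailing end until n_frames is reached).
    missing = n_frames - seq.shape[0]
    if missing > 0:
        seq = np.concatenate([seq, np.repeat(seq[-1:], missing, axis=0)], axis=0)
    return seq

def exponential_kernel(e, sigma, b):
    """Element-wise exponential kernel of Eq. (3)."""
    return np.exp(e / (sigma + b) ** 2)

def kernel_covariance(seq, sigma=1.0, b=1.0):
    """Kernelized covariance descriptor of Eq. (2) for one re-framed sequence."""
    n_frames = seq.shape[0]
    # Assumption: each frame is flattened to a 3n-vector before the kernel map.
    feats = exponential_kernel(seq.reshape(n_frames, -1), sigma, b)
    centered = feats - feats.mean(axis=0)          # subtract the temporal average
    return centered.T @ centered / (n_frames - 1)  # (3n, 3n), symmetric

def feature_matrix(sequences, n_frames, sigma=1.0, b=1.0):
    """Stack the upper-triangular part of each descriptor as a column of X."""
    cols = []
    for seq in sequences:
        psi = kernel_covariance(reframe(seq, n_frames), sigma, b)
        cols.append(psi[np.triu_indices(psi.shape[0])])
    return np.column_stack(cols)                   # shape (p, k)

# Toy input: three random sequences of different lengths, 4 joints each.
rng = np.random.default_rng(0)
toy = [rng.standard_normal((T, 4, 3)) for T in (100, 30, 12)]
X = feature_matrix(toy, n_frames=30)
print(X.shape)  # (78, 3)
```

With 4 joints each descriptor is 12 x 12, so the upper triangle has 12 * 13 / 2 = 78 entries and the toy feature matrix has shape (p, k) = (78, 3).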
Here, $p$ is the length of the individual vectors and $k$ is the total number of action sequences.

2.3. Temporal Subspace Clustering (TSC) with Laplacian Regularization

In state-of-the-art approaches, subspace clustering is performed by generating a representation matrix out of the feature matrix. This is accomplished by using the self-representation property of the feature matrix, resulting in the computation of a set of unique coefficients $\mathbf{Y} \in \mathbb{R}^{k \times k}$ for the feature matrix $\mathbf{X} \in \mathbb{R}^{p \times k}$. This can be mathematically expressed as $\mathbf{X} = \mathbf{X}\mathbf{Y}$. The limitation of such an approach is that it tends to produce sub-optimal results if the sampling done is not sufficient. If, instead of using the self-representation property, we learn an efficient dictionary $\mathbf{W} \in \mathbb{R}^{p \times k}$, we can overcome the aforementioned problem. Given a dictionary $\mathbf{W}$, the set of data samples $\mathbf{X}$ can be expressed as $\mathbf{X} \approx \mathbf{W}\mathbf{Y}$. The set of coefficients $\mathbf{Y}$ can be efficiently obtained in an iterative manner by utilizing the underlying low-rank nature of $\mathbf{Y}$. The low-rank property of $\mathbf{Y}$ can be captured with the help of LRR [15]. Since the rank minimization problem is inherently NP-hard, the nuclear norm can be used as a substitute for rank minimization [17]. It is also important to capture the intra-cluster correlation among the data samples to obtain better clustering results. To this end, we can use the principle of LSR [17]. By combining the concepts of LRR and LSR, we formulate an optimization problem as follows.

$$\min_{\mathbf{Y}, \mathbf{W}} \; \frac{1}{2} \|\mathbf{X} - \mathbf{W}\mathbf{Y}\|_F^2 + \frac{\lambda_1}{2} \|\mathbf{Y}\|_F^2 + \lambda_2 \|\mathbf{Y}\|_* \quad \text{s.t. } \mathbf{Y} \ge 0, \; \mathbf{W} \ge 0 \tag{4}$$

where the term $\frac{1}{2}\|\mathbf{X} - \mathbf{W}\mathbf{Y}\|_F^2$ captures the reconstruction error, $\|\cdot\|_F$ represents the Frobenius norm operator, $\|\cdot\|_*$ denotes the nuclear norm operator, and $\lambda_1$ and $\lambda_2$ are the balancing parameters.

Manifold regularization methods are efficient in incorporating the temporal dependency of data in the problem formulation [19].
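Thus, by modifying the general Laplacian regularizer, which captures the spatial dependency of data, we have developed a temporal Laplacian regularizer $\mathcal{L}(\cdot)$, as our interest is in the temporal information of the action sequences. For a representation matrix $\mathbf{Y}$, the temporal Laplacian regularization function can be defined as [20]:

$$\mathcal{L}(\mathbf{Y}) = \frac{1}{2} \sum_{i} \sum_{j} z_{ij} \|y_i - y_j\|_2^2 = \mathrm{tr}(\mathbf{Y} \mathbf{L}_T \mathbf{Y}^\top), \tag{5}$$

where $\mathbf{L}_T = \widetilde{\mathbf{W}} - \mathbf{Z}$ is a temporal Laplacian matrix, $\widetilde{\mathbf{W}}_{ii} = \sum_{j=1}^{m} z_{ij}$, and $\mathbf{Z}$ is a weight matrix that finds the successive similarities in $\mathbf{X}$. Each element of $\mathbf{Z}$ is found as [21],

$$z_{ij} = \begin{cases} 1 & \text{for } |i - j| \le \frac{\gamma}{2} \\ 0 & \text{otherwise} \end{cases} \tag{6}$$

where $\gamma$ denotes an empirically defined threshold value. With the introduction of $\mathcal{L}(\mathbf{Y})$, Eq. (4) is modified as,

$$\min_{\mathbf{Y}, \mathbf{W}} \; \frac{1}{2} \|\mathbf{X} - \mathbf{W}\mathbf{Y}\|_F^2 + \frac{\lambda_1}{2} \|\mathbf{Y}\|_F^2 + \lambda_2 \|\mathbf{Y}\|_* + \lambda_3 \mathcal{L}(\mathbf{Y}) \quad \text{s.t. } \mathbf{Y} \ge 0, \; \mathbf{W} \ge 0 \tag{7}$$

2.4. Solution

The optimization problem given in Eq. (7) can be solved using the ADMM approach under the ALM framework. ADMM finds the solution of the unconstrained optimization problem by splitting the problem into multiple sub-problems. As a first step, we introduce three auxiliary variables $\mathbf{E}$, $\mathbf{F}$, and $\mathbf{G}$ to decouple the terms present in the formulation. This results in the formulation shown below.

$$\min_{\mathbf{Y}, \mathbf{W}, \mathbf{E}, \mathbf{F}, \mathbf{G}} \; \frac{1}{2} \|\mathbf{X} - \mathbf{W}\mathbf{Y}\|_F^2 + \frac{\lambda_1}{2} \|\mathbf{E}\|_F^2 + \lambda_2 \|\mathbf{F}\|_* + \lambda_3 \mathcal{L}(\mathbf{G}) \quad \text{s.t. } \mathbf{Y} = \mathbf{E}, \; \mathbf{Y} = \mathbf{F}, \; \mathbf{Y} = \mathbf{G}, \; \mathbf{Y} \ge 0, \; \mathbf{W} \ge 0 \tag{8}$$

The augmented Lagrangian corresponding to Eq. (8) is given as:

$$\begin{aligned} L(\mathbf{E}, \mathbf{F}, \mathbf{G}, \mathbf{Y}, \mathbf{W}) = {} & \frac{1}{2} \|\mathbf{X} - \mathbf{W}\mathbf{Y}\|_F^2 + \frac{\lambda_1}{2} \|\mathbf{E}\|_F^2 + \lambda_2 \|\mathbf{F}\|_* + \lambda_3 \mathrm{tr}(\mathbf{G} \mathbf{L}_T \mathbf{G}^\top) \\ & + \langle \Phi_1, \mathbf{Y} - \mathbf{E} \rangle + \langle \Phi_2, \mathbf{Y} - \mathbf{F} \rangle + \langle \Phi_3, \mathbf{Y} - \mathbf{G} \rangle \\ & + \frac{\beta}{2} \left( \|\mathbf{Y} - \mathbf{E}\|_F^2 + \|\mathbf{Y} - \mathbf{F}\|_F^2 + \|\mathbf{Y} - \mathbf{G}\|_F^2 \right) \end{aligned} \tag{9}$$

2.4.1. Updating E:

The update expression for $\mathbf{E}$ is obtained by solving the following sub-problem.

$$\mathbf{E}^{[l+1]} = \underset{\mathbf{E}}{\mathrm{argmin}} \; \frac{\lambda_1}{2} \|\mathbf{E}\|_F^2 + \langle \Phi_1, \mathbf{Y} - \mathbf{E} \rangle + \frac{\beta}{2} \|\mathbf{Y} - \mathbf{E}\|_F^2 \tag{10}$$

By differentiating Eq. (10) with respect to $\mathbf{E}$ and equating it to zero, the $\mathbf{E}$ update is given as,

$$\mathbf{E}^{[l+1]} = \frac{1}{\lambda_1 + \beta} \left( \Phi_1^{[l]} + \beta \mathbf{Y}^{[l]} \right) \tag{11}$$

2.4.2. Updating F:

The $\mathbf{F}$ sub-problem is given as,

$$\mathbf{F}^{[l+1]} = \underset{\mathbf{F}}{\mathrm{argmin}} \; \lambda_2 \|\mathbf{F}\|_* + \langle \Phi_2, \mathbf{Y} - \mathbf{F} \rangle + \frac{\beta}{2} \|\mathbf{Y} - \mathbf{F}\|_F^2 \tag{12}$$

The update expression for $\mathbf{F}$ is found using the singular value thresholding (SVT) operator as follows [16],

$$\mathbf{F}^{[l+1]} = \mathrm{SVT}_{\frac{\lambda_2}{\beta}} \left[ \mathbf{Y}^{[l]} + \frac{\Phi_2^{[l]}}{\beta} \right] \tag{13}$$

2.4.3. Updating G:

The update expression for $\mathbf{G}$ is obtained by solving the following sub-problem.

$$\mathbf{G}^{[l+1]} = \underset{\mathbf{G}}{\mathrm{argmin}} \; \lambda_3 \mathrm{tr}(\mathbf{G} \mathbf{L}_T \mathbf{G}^\top) + \langle \Phi_3, \mathbf{Y} - \mathbf{G} \rangle + \frac{\beta}{2} \|\mathbf{Y} - \mathbf{G}\|_F^2 \tag{14}$$

By differentiating Eq. (14) with respect to $\mathbf{G}$ and equating it to zero, the $\mathbf{G}$ update is given as,

$$\mathbf{G}^{[l+1]} = \left( \Phi_3^{[l]} + \beta \mathbf{Y}^{[l]} \right) \left( \lambda_3 (\mathbf{L}_T + \mathbf{L}_T^\top) + \beta \mathbf{I} \right)^{-1} \tag{15}$$

2.4.4. Updating Y:

By solving the following sub-problem, the update expression for $\mathbf{Y}$ can be obtained.

$$\begin{aligned} \mathbf{Y}^{[l+1]} = \underset{\mathbf{Y}}{\mathrm{argmin}} \; & \frac{1}{2} \|\mathbf{X} - \mathbf{W}\mathbf{Y}\|_F^2 + \langle \Phi_1, \mathbf{Y} - \mathbf{E} \rangle + \langle \Phi_2, \mathbf{Y} - \mathbf{F} \rangle + \langle \Phi_3, \mathbf{Y} - \mathbf{G} \rangle \\ & + \frac{\beta}{2} \left( \|\mathbf{Y} - \mathbf{E}\|_F^2 + \|\mathbf{Y} - \mathbf{F}\|_F^2 + \|\mathbf{Y} - \mathbf{G}\|_F^2 \right) \end{aligned} \tag{16}$$

By equating the gradient of Eq. (16) to zero, the update expression for $\mathbf{Y}$ is given as,

$$\mathbf{Y}^{[l+1]} = \left[ (\mathbf{W}^{[l]})^\top \mathbf{W}^{[l]} + 3\beta \mathbf{I} \right]^{-1} \left[ (\mathbf{W}^{[l]})^\top \mathbf{X} + \beta \left( \mathbf{E}^{[l+1]} + \mathbf{F}^{[l+1]} + \mathbf{G}^{[l+1]} \right) - \left( \Phi_1^{[l]} + \Phi_2^{[l]} + \Phi_3^{[l]} \right) \right] \tag{17}$$

2.4.5. Updating W:

The $\mathbf{W}$ sub-problem is given as follows.

$$\mathbf{W}^{[l+1]} = \underset{\mathbf{W}}{\mathrm{argmin}} \; \frac{1}{2} \|\mathbf{X} - \mathbf{W}\mathbf{Y}\|_F^2 \tag{18}$$

The solution of the above equation can be found as,

$$\mathbf{W}^{[l+1]} = \left[ \mathbf{X} (\mathbf{Y}^{[l+1]})^\top \right] \left[ \mathbf{Y}^{[l+1]} (\mathbf{Y}^{[l+1]})^\top \right]^{-1} \tag{19}$$

Finally, the Lagrange multipliers are updated as follows:

$$\Phi_1^{[l+1]} = \Phi_1^{[l]} + \beta \left( \mathbf{Y}^{[l+1]} - \mathbf{E}^{[l+1]} \right) \tag{20}$$

$$\Phi_2^{[l+1]} = \Phi_2^{[l]} + \beta \left( \mathbf{Y}^{[l+1]} - \mathbf{F}^{[l+1]} \right) \tag{21}$$

$$\Phi_3^{[l+1]} = \Phi_3^{[l]} + \beta \left( \mathbf{Y}^{[l+1]} - \mathbf{G}^{[l+1]} \right) \tag{22}$$

Convergence of the algorithm is ensured if,

$$\max \left\{ \begin{array}{ll} \|\mathbf{Y}^{[l+1]} - \mathbf{E}^{[l]}\|_\infty, & \|\mathbf{E}^{[l+1]} - \mathbf{E}^{[l]}\|_\infty, \\ \|\mathbf{Y}^{[l+1]} - \mathbf{F}^{[l]}\|_\infty, & \|\mathbf{F}^{[l+1]} - \mathbf{F}^{[l]}\|_\infty, \\ \|\mathbf{Y}^{[l+1]} - \mathbf{G}^{[l]}\|_\infty, & \|\mathbf{G}^{[l+1]} - \mathbf{G}^{[l]}\|_\infty \end{array} \right\} < \epsilon \tag{23}$$

The overall process involved in the proposed time series activity clustering algorithm is summarized in Algorithm 1. Once the representation matrix $\mathbf{Y}$ is obtained, an affinity matrix $\mathbf{Q} \in \mathbb{R}^{k \times k}$ is calculated. The accuracy of clustering is highly dependent on the affinity matrix $\mathbf{Q}$.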
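The update rules in Eqs. (11)-(22) and the stopping test in Eq. (23) can be sketched in NumPy as follows. This is a hedged illustration rather than the authors' code. Several choices are assumptions that the text does not fix: the dictionary is initialized to the data ($\mathbf{W} = \mathbf{X}$), all other variables start at zero, $\|\cdot\|_\infty$ is read as the largest absolute entry, a small ridge term stabilizes the inverse in Eq. (19), $\beta$ is kept constant by default (`eta=1.0`), and the non-negativity constraints are not enforced because the closed-form updates above contain no projection step. The function names and the parameter values in the usage lines are hypothetical.

```python
import numpy as np

def temporal_laplacian(k, gamma):
    """Temporal Laplacian L_T = W_tilde - Z with the weights of Eq. (6)."""
    idx = np.arange(k)
    Z = (np.abs(idx[:, None] - idx[None, :]) <= gamma / 2).astype(float)
    return np.diag(Z.sum(axis=1)) - Z

def svt(M, tau):
    """Singular value thresholding: soft-threshold the singular values by tau."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def tsac_admm(X, lam1, lam2, lam3, gamma, beta=1.0, eta=1.0,
              max_iter=100, tol=1e-6, ridge=1e-8):
    """Iterate the closed-form updates; returns the pair (Y, W)."""
    p, k = X.shape
    L = temporal_laplacian(k, gamma)
    I = np.eye(k)
    W = X.copy()                      # assumption: dictionary starts at the data
    Y = np.zeros((k, k))
    E, F, G = Y.copy(), Y.copy(), Y.copy()
    Phi1, Phi2, Phi3 = Y.copy(), Y.copy(), Y.copy()
    for _ in range(max_iter):
        E_old, F_old, G_old = E, F, G
        E = (Phi1 + beta * Y) / (lam1 + beta)                      # Eq. (11)
        F = svt(Y + Phi2 / beta, lam2 / beta)                      # Eq. (13)
        A = lam3 * (L + L.T) + beta * I
        G = np.linalg.solve(A.T, (Phi3 + beta * Y).T).T            # Eq. (15)
        rhs = W.T @ X + beta * (E + F + G) - (Phi1 + Phi2 + Phi3)
        Y = np.linalg.solve(W.T @ W + 3 * beta * I, rhs)           # Eq. (17)
        # Eq. (19); the ridge term is an added assumption for numerical safety.
        W = np.linalg.solve(Y @ Y.T + ridge * I, Y @ X.T).T
        Phi1 = Phi1 + beta * (Y - E)                               # Eq. (20)
        Phi2 = Phi2 + beta * (Y - F)                               # Eq. (21)
        Phi3 = Phi3 + beta * (Y - G)                               # Eq. (22)
        beta *= eta                                                # penalty update
        resid = max(np.abs(Y - E_old).max(), np.abs(E - E_old).max(),
                    np.abs(Y - F_old).max(), np.abs(F - F_old).max(),
                    np.abs(Y - G_old).max(), np.abs(G - G_old).max())
        if resid < tol:                                            # Eq. (23)
            break
    return Y, W

# Toy usage with illustrative (not the paper's) parameter values.
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((12, 8))   # p = 12 features, k = 8 sequences
Y_demo, W_demo = tsac_admm(X_demo, lam1=0.03, lam2=0.1, lam3=0.1, gamma=4)
print(Y_demo.shape, W_demo.shape)
```

Linear systems are solved with `np.linalg.solve` instead of forming explicit inverses, which is usually the more stable choice for the matrix inverses that appear in Eqs. (15), (17), and (19).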
A usual approach to obtaining the affinity matrix is shown below [14, 15].

$$\mathbf{Q} = \frac{|\mathbf{Y}| + |\mathbf{Y}^\top|}{2} \tag{24}$$

However, the graph so constructed does not take into account the intrinsic relationships of the within-cluster data points. For data containing temporal information, the within-cluster data points are highly correlated. In order to take advantage of this information, the affinity matrix $\mathbf{Q}$ is calculated as follows.

$$\mathbf{Q}(i, j) = \frac{y_i^\top y_j}{\|y_i\|_2 \|y_j\|_2} \tag{25}$$

where $\|\cdot\|_2$ represents the $\ell_2$ norm operator.

Algorithm 1 Time Series Activity Clustering
Require: Skeletal data and parameters $\lambda_1$, $\lambda_2$, $\lambda_3$, $\eta$, $\gamma$, and $\beta$
Ensure: $\mathbf{Y} \in \mathbb{R}^{k \times k}$
 1: Find $\Psi$ using Eq. (1)
 2: Find $\mathbf{X}$ using $\Psi$
 3: Generate matrices $\mathbf{Z}$, $\widetilde{\mathbf{W}}$, and $\mathbf{L}_T$
 4: while not converged do
 5:   Update $\mathbf{E}^{[l+1]}$ with Eq. (11)
 6:   Update $\mathbf{F}^{[l+1]}$ with Eq. (13)
 7:   Update $\mathbf{G}^{[l+1]}$ with Eq. (15)
 8:   Update $\mathbf{Y}^{[l+1]}$ with Eq. (17)
 9:   Update $\mathbf{W}^{[l+1]}$ with Eq. (19)
10:   Update $\Phi_1^{[l+1]}$ with Eq. (20)
11:   Update $\Phi_2^{[l+1]}$ with Eq. (21)
12:   Update $\Phi_3^{[l+1]}$ with Eq. (22)
13:   Update $\beta^{[l+1]} = \eta \beta^{[l]}$
14:   $l = l + 1$
15:   Use Eq. (23) to check the convergence
16: end while

3. Experimental Results and Analysis

3.1. Dataset and Parameter Settings

To verify the performance of the proposed algorithm, it was tested on multiple human activity datasets. The datasets considered include the Gaming 3D (G3D) [22], Florence 3D (F3D) [23], UTKinect-Action 3D (UTK) [24], MSRC-Kinect12 (MSRC) [25], MSR Action 3D (MSRA) [26], and HDM14 [27] datasets. The parameters $\gamma$, $\lambda_1$, $\lambda_2$, and $\eta$ were set empirically to 5.2, 0.03, 18, and 0.7, respectively. The program was developed on a system with a 2.90 GHz Intel Core i7 processor and 16 GB of RAM, running a 64-bit Windows operating system.

3.2. Experimental Results

Experimental validation was done on the following methods.

Figure 2: Affinity graphs of SSC, Percentage Temporal SSC, Threshold Temporal SSC, TSC-Cov, and Kernel and LRR imposed TSC (Proposed) approaches on UTK Dataset.
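The affinity graphs compared in Fig. 2 are built from a learned representation matrix. As a small illustration, the two constructions in Eqs. (24) and (25) can be sketched as below, treating $y_i$ as the $i$-th column of $\mathbf{Y}$. The function names and the `eps` guard against zero-norm columns are assumptions added for this sketch.

```python
import numpy as np

def symmetric_affinity(Y):
    """Eq. (24): Q = (|Y| + |Y^T|) / 2."""
    A = np.abs(Y)
    return (A + A.T) / 2.0

def cosine_affinity(Y, eps=1e-12):
    """Eq. (25): cosine similarity between the columns y_i and y_j of Y."""
    norms = np.maximum(np.linalg.norm(Y, axis=0), eps)  # guard (assumption)
    Yn = Y / norms                                      # unit-norm columns
    return Yn.T @ Yn

Y_small = np.array([[1.0, 1.0],
                    [0.0, 1.0]])
print(cosine_affinity(Y_small))
```

For this toy matrix the two columns are (1, 0) and (1, 1), so the off-diagonal cosine affinity is 1/sqrt(2).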
Figure 3: Clustering Results on UTK Dataset

Subspace clustering approaches based on self representation: In this method, clustering using state-of-the-art clustering approaches is performed on the generated affinity matrix. The clustering methods considered include Spectral Clustering (SC) [28], Orthogonal Matching Pursuit (OMP) [29], K-means (Km) [28], SSC [14], and Elastic Net Subspace Clustering (EnSC) [29]. The results obtained using SSC are found to be superior to those of the counterparts [14].

SSC approaches with Data Pruning: In this method, the input skeletal data is pruned with strategies including min Φ [29], Temporal SSC [29], Threshold Temporal SSC [29], and Percentage Temporal SSC [29]. Later, a feature matrix is generated from the pruned data sequences, followed by the application of the SSC [14] approach. This method converges quickly. Among these, the Percentage Temporal SSC approach gives better results.

TSC-Cov: This is a method that we had developed in one of our previous works. In this method, data re-framing was not performed and the kernel-based features were not incorporated in the covariance descriptor. However, we had used a new clustering approach named TSC-Cov for subspace clustering.

The performance of the proposed algorithm was evaluated against the above-mentioned methods. The metrics used for quantitative evaluation include the accuracy, the Normalized Mutual Information (NMI), and the Adjusted Rand Index (ARI). The results obtained are tabulated in Tables 1-3. For qualitative evaluation, the affinity matrix and clustering results obtained using the proposed method and the state-of-the-art methods are presented. Although the experiments were conducted on all six datasets mentioned earlier, for the purpose of illustration, only the results obtained with the UTK dataset [24] are shown in this paper.
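For readers unfamiliar with the three metrics, the sketch below shows one common way of computing them from ground-truth and predicted labels. It is an illustration under assumptions, not the evaluation code used for the tables: accuracy is taken under the best one-to-one matching between clusters and classes (found here by brute force, which is only practical for a handful of clusters; the Hungarian algorithm is the usual choice for more), and NMI is normalized by the arithmetic mean of the two entropies, since the text does not state which normalization was used.

```python
import numpy as np
from itertools import permutations
from math import comb

def contingency(y_true, y_pred):
    """Counts of samples for every (true class, predicted cluster) pair."""
    _, t = np.unique(y_true, return_inverse=True)
    _, c = np.unique(y_pred, return_inverse=True)
    C = np.zeros((t.max() + 1, c.max() + 1), dtype=np.int64)
    np.add.at(C, (t, c), 1)
    return C

def clustering_accuracy(y_true, y_pred):
    """Accuracy under the best one-to-one label matching (brute force)."""
    C = contingency(y_true, y_pred)
    n = max(C.shape)
    P = np.zeros((n, n), dtype=np.int64)
    P[:C.shape[0], :C.shape[1]] = C          # pad to square with empty labels
    best = max(sum(P[i, j] for i, j in enumerate(perm))
               for perm in permutations(range(n)))
    return best / C.sum()

def nmi(y_true, y_pred):
    """Mutual information divided by the mean of the two label entropies."""
    P = contingency(y_true, y_pred) / len(y_true)
    pt, pc = P.sum(axis=1), P.sum(axis=0)
    nz = P > 0
    mi = np.sum(P[nz] * np.log(P[nz] / np.outer(pt, pc)[nz]))
    h_t = -np.sum(pt * np.log(pt))
    h_c = -np.sum(pc * np.log(pc))
    return 1.0 if h_t + h_c == 0 else float(2 * mi / (h_t + h_c))

def ari(y_true, y_pred):
    """Adjusted Rand Index computed from pair counts."""
    C = contingency(y_true, y_pred)
    index = sum(comb(int(v), 2) for v in C.ravel())
    a = sum(comb(int(v), 2) for v in C.sum(axis=1))
    b = sum(comb(int(v), 2) for v in C.sum(axis=0))
    expected = a * b / comb(int(C.sum()), 2)
    max_index = (a + b) / 2
    if max_index == expected:
        return 1.0
    return (index - expected) / (max_index - expected)

truth = [0, 0, 1, 1, 2, 2]
pred = [1, 1, 0, 0, 2, 2]    # same partition, permuted cluster names
print(clustering_accuracy(truth, pred), nmi(truth, pred), ari(truth, pred))
```

All three metrics are invariant to how the clusters are named, which is why the permuted labelling in the usage lines still counts as a perfect clustering.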
Table 1
Comparison of clustering accuracy (%)

Dataset       SSC [14]   Percentage Temporal SSC [29]   TSC-Cov   TSAC (Proposed)
G3D [22]      65.16      66.04                          90.04     92.65
F3D [23]      61.86      63.72                          81.39     81.58
UTK [24]      74.37      72.36                          89.44     90.01
MSRC [25]     73.42      80.60                          81.09     82.00
MSRA [26]     58.35      60.86                          89.76     91.54
HDM14 [27]    53.06      57.29                          76.23     79.51

Table 2
Comparison of NMI

Dataset       SSC [14]   Percentage Temporal SSC [29]   TSC-Cov   TSAC (Proposed)
G3D [22]      0.719      0.708                          0.953     0.961
F3D [23]      0.716      0.709                          0.872     0.875
UTK [24]      0.709      0.672                          0.899     0.911
MSRC [25]     0.720      0.762                          0.887     0.890
MSRA [26]     0.700      0.720                          0.958     0.972
HDM14 [27]    0.754      0.771                          0.893     0.901

Table 3
Comparison of ARI

Dataset       SSC [14]   Percentage Temporal SSC [29]   TSC-Cov   TSAC (Proposed)
G3D [22]      0.499      0.479                          0.847     0.851
F3D [23]      0.548      0.539                          0.781     0.787
UTK [24]      0.547      0.499                          0.804     0.819
MSRC [25]     0.551      0.714                          0.773     0.781
MSRA [26]     0.435      0.456                          0.881     0.895
HDM14 [27]    0.439      0.484                          0.753     0.772

Fig. 2 visualizes the affinity matrices generated using SSC [14], Percentage Temporal SSC [29], Threshold Temporal SSC [29], TSC-Cov, and the proposed method on the UTK dataset. We can observe that the affinity matrix generated using the proposed method has a much denser block diagonal structure than the others. This is an indication of the quality of the clustering process. Fig. 3 shows the clustering results obtained on the UTK dataset. For visual analysis, each cluster is assigned a unique color. The figure shows a comparison between SSC [14], Percentage Temporal SSC [29], Threshold Temporal SSC [29], TSC-Cov, and the proposed method with reference to the true labels. From Fig. 3 we can observe that the clustering results obtained with the proposed method are closer to the true labels than those of the other methods.

4. Conclusions

The paper proposes a new method for clustering of human activity sequences in an efficient way. The proposed method involves the extraction of features from raw input data with the help of a kernel-based robust covariance descriptor.
The optimization model developed uses the combined advantage of LRR and LSR based subspace clustering approaches. The concept of temporal Laplacian regularized dictionary learning is introduced in order to learn an effective representation matrix from the extracted data features. With the help of the ADMM approach, the solution for the optimization problem is obtained. The performance of the proposed approach is compared with that of the state-of-the-art approaches in terms of accuracy, NMI, and ARI. Experimental results validate the superiority of the proposed method in obtaining better clustering results as compared to the counterparts. Motion capture data often suffers from corruptions in the recorded information. To address this problem, robust human activity clustering algorithms can be developed in the future. Also, mocap information can be combined with other modalities of human activity data to achieve improved clustering results in challenging scenarios.

References

[1] P. Pareek, A. Thakkar, A survey on video-based human action recognition: recent updates, datasets, challenges, and applications, Artificial Intelligence Review 54 (2021) 2259–2322.
[2] H.-B. Zhang, Y.-X. Zhang, B. Zhong, Q. Lei, L. Yang, J.-X. Du, D.-S. Chen, A comprehensive survey of vision-based human action recognition methods, Sensors 19 (2019) 1005.
[3] C. Jobanputra, J. Bavishi, N. Doshi, Human activity recognition: A survey, Procedia Computer Science 155 (2019) 698–703.
[4] Y. Kong, Y. Fu, Human action recognition and prediction: A survey, International Journal of Computer Vision 130 (2022) 1366–1401.
[5] L. M. Dang, K. Min, H. Wang, M. J. Piran, C. H. Lee, H. Moon, Sensor-based and vision-based human activity recognition: A comprehensive survey, Pattern Recognition 108 (2020) 107561.
[6] S. K. Yadav, K. Tiwari, H. M. Pandey, S. A. Akbar, Skeleton-based human activity recognition using convlstm and guided feature learning, Soft Computing 26 (2022) 877–890.
[7] M. Barnachon, S. Bouakaz, B. Boufama, E. Guillou, Ongoing human action recognition with motion capture, Pattern Recognition 47 (2014) 238–247.
[8] S. Park, J. Park, M. Al-Masni, M. Al-Antari, M. Z. Uddin, T.-S. Kim, A depth camera-based human activity recognition via deep learning recurrent neural network for health and social care services, Procedia Computer Science 100 (2016) 78–84.
[9] J. K. Aggarwal, L. Xia, Human activity recognition from 3d data: A review, Pattern Recognition Letters 48 (2014) 70–80.
[10] M. Li, S. Chen, X. Chen, Y. Zhang, Y. Wang, Q. Tian, Symbiotic graph neural networks for 3d skeleton-based human action recognition and motion prediction, IEEE Tran. on PAMI (2021).
[11] A. Bagnall, J. Lines, A. Bostrom, J. Large, E. Keogh, The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances, Data Mining and Knowledge Discovery 31 (2017) 606–660.
[12] L. Parsons, E. Haque, H. Liu, Subspace clustering for high dimensional data: a review, ACM SIGKDD Explorations Newsletter 6 (2004) 90–105.
[13] R. Vidal, P. Favaro, Low rank subspace clustering (lrsc), Pattern Recognition Letters 43 (2014) 47–61.
[14] E. Elhamifar, R. Vidal, Sparse subspace clustering: Algorithm, theory, and applications, IEEE Tran. on PAMI 35 (2013) 2765–2781.
[15] J. Xue, Y.-Q. Zhao, Y. Bu, W. Liao, J. C.-W. Chan, W. Philips, Spatial-spectral structured sparse low-rank representation for hyperspectral image super-resolution, IEEE Tran. on Image Processing 30 (2021) 3084–3097.
[16] J. Francis, A. Johnson, B. Madathil, S. N. George, A joint sparse and correlation induced subspace clustering method for segmentation of natural images, in: 2020 IEEE 17th India Council Int. Conf. (INDICON), 2020, pp. 1–7.
[17] Z. Wu, M. Yin, Y. Zhou, X. Fang, S. Xie, Robust spectral subspace clustering based on least square regression, Neural Processing Letters 48 (2018) 1359–1372.
[18] J. Cavazza, A. Zunino, M. S. Biagio, V. Murino, Kernelized covariance for action recognition, in: 2016 23rd Int. Conf. on Pattern Recognition (ICPR), 2016, pp. 408–413.
[19] Z. Zhang, K. Zhao, Low-rank matrix approximation with manifold regularization, IEEE Tran. on PAMI 35 (2013) 1717–1729.
[20] W. Liu, X. Ma, Y. Zhou, D. Tao, J. Cheng, p-laplacian regularization for scene recognition, IEEE Tran. on Cybernetics 49 (2019) 2927–2940.
[21] G. Casalino, N. D. Buono, C. Mencar, Part-based data analysis with masked non-negative matrix factorization, in: Int. Conf. on Computational Science and Its Applications, Springer, 2014, pp. 440–454.
[22] V. Bloom, V. Argyriou, D. Makris, Hierarchical transfer learning for online recognition of compound actions, Computer Vision and Image Understanding 144 (2015).
[23] L. Seidenari, V. Varano, S. Berretti, A. Del Bimbo, P. Pala, Recognizing actions from depth cameras as weakly aligned multi-part bag-of-poses, in: 2013 IEEE Conf. on CVPR - Workshops, 2013, pp. 479–485.
[24] L. Xia, C.-C. Chen, J. K. Aggarwal, View invariant human action recognition using histograms of 3d joints, in: 2012 IEEE Computer Society Conf. on CVPR - Workshops, 2012, pp. 20–27.
[25] S. Fothergill, H. M. Mentis, P. Kohli, S. Nowozin, Instructing people for training gestural interactive systems, in: CHI '12 Proc. of the SIGCHI Conf. on Human Factors in Computing Systems, ACM, 2012, pp. 1737–1746.
[26] W. Li, Z. Zhang, Z. Liu, Action recognition based on a bag of 3d points, in: 2010 IEEE Computer Society Conf. on CVPR - Workshops, 2010, pp. 9–14.
[27] M. Müller, T. Röder, M. Clausen, B. Eberhardt, B. Krüger, A. Weber, Documentation Mocap Database HDM05, Technical Report CG-2007-2, Universität Bonn, 2007.
[28] Y. Lee, S. Choi, Minimum entropy, k-means, spectral clustering, in: 2004 IEEE Int. Joint Conf. on Neural Networks (IEEE Cat. No.04CH37541), volume 1, 2004, pp. 117–122.
[29] G. Paoletti, J. Cavazza, C. Beyan, A. Del Bue, Subspace clustering for action recognition with covariance representations and temporal pruning, in: 2020 25th Int. Conf. on Pattern Recognition (ICPR), 2021, pp. 6035–6042.