<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>TAEC: Unsupervised Action Segmentation with Temporal-Aware Embedding and Clustering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Wei Lin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anna Kukleva</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Horst Possegger</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hilde Kuehne</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Horst Bischof</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Christian Doppler Laboratory for Semantic 3D Computer Vision</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Goethe University Frankfurt</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Institute of Computer Graphics and Vision, Graz University of Technology</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Max-Planck-Institute for Informatics</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Temporal action segmentation in untrimmed videos has gained increased attention recently. However, annotating action classes and frame-wise boundaries is extremely time consuming and cost intensive, especially on large-scale datasets. To address this issue, we propose an unsupervised approach for learning action classes from untrimmed video sequences. In particular, we propose a temporal embedding network that combines relative time prediction, feature reconstruction, and sequence-to-sequence learning, to preserve the spatial layout and sequential nature of the video features. A two-step clustering pipeline on these embedded feature representations then allows us to enforce temporal consistency within, as well as across videos. Based on the identified clusters, we decode the video into coherent temporal segments that correspond to semantically meaningful action classes. Our evaluation on three challenging datasets shows the impact of each component and, furthermore, demonstrates our state-of-the-art unsupervised action segmentation results.</p>
      </abstract>
      <kwd-group>
        <kwd>Unsupervised learning</kwd>
        <kwd>unsupervised clustering</kwd>
        <kwd>action segmentation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>Feature Embedding</title>
        <p>3 videos of Making coffee
Feature
embedding
frames of video 1</p>
        <p>Feature
embedding
frames of video 2</p>
        <p>Feature
embedding
frames of video 3
(a)</p>
      </sec>
      <sec id="sec-1-2">
        <title>Within-Video</title>
      </sec>
      <sec id="sec-1-3">
        <title>Clustering</title>
      </sec>
      <sec id="sec-1-4">
        <title>Cross-Video</title>
      </sec>
      <sec id="sec-1-5">
        <title>Global Cluster Assignment</title>
      </sec>
      <sec id="sec-1-6">
        <title>Viterbi Decoding</title>
        <p>Ck,n : the k-th within-video cluster
in video n
(b)
2
1
1
Global
cluster 1
1
2
2
Global
cluster 2
(c)
3
3
3
Global
cluster 3
2 → 1 → 3
1 → 2 → 3
1 → 2 → 3
(d)
sists of a within-video clustering and a cross-video global and recognition of frame orders [27, 28, 29, 30, 31]. For
cluster assignment. Specifically, we perform cluster- instance, Srivastava et al. [24] exploit an LSTM-based
auing within each video, with a spatio-temporal similarity toencoder for learning video representations. Villegas et
among frames. Then we conduct global cluster assign- al. [26] and Denton and Birodkar [25] employed two
enment to group the clusters across videos. The global coders to generate feature representations of content and
cluster assignment defines the ordering of the clusters motion. The temporal order of frames or small chunks
for each video. In this way, we overcome the unrealis- is utilized as a self-supervision signal for representation
tic assumption that actions of an activity always follow learning on short video clips in [27] and [28]. Inspired by
the same temporal order. Such an assumption is com- these approaches, we employ two self-supervision tasks:
monly used in related works, e.g. [21, 22]. For instance, feature reconstruction and relative time prediction.
in the activity of making cofee, a unified temporal order Clustering of temporal sequences has been explored
between actions such as adding milk and adding sugar for parsing human motions [32, 33, 34, 35]. While Zhang
is assumed for all videos of making cofee, whereas our et al. [35] proposed a hierarchical dynamic clustering
approach can handle changes of the action order in dif- framework, Li et al. [33] and Tierney et al. [34] explored
ferent videos. After assigning all within-video clusters to temporal subspace clustering to segment human motion
a set of global clusters, we perform Viterbi decoding to data. In contrast to unsupervised action segmentation,
obtain a segmentation of temporally coherent segments. these methods are applied on each temporal sequence
Our contributions can summarized as following: individually and do not consider association among
se• We design a sequence-to-sequence temporal em- quences. Instead, we propose a cross-video global cluster
bedding network (SSTEN), which combines rel- assignment to group within-video clusters across
diferative timestamp prediction, autoencoder recon- ent videos into global clusters.</p>
        <p>struction and sequence-to-sequence learning. Unsupervised action segmentation on fine-grained
• We propose a within-video clustering with a activities has recent work that either focus on the
reprenovel spatio-temporal similarity formulation sentation learning [20, 22, 36] or the clustering step [23].
among frames. However, the temporal information is neglected in at
least one of these two steps. For representation learning,
• We propose a cross-video global cluster
assign</p>
        <p>Sener and Yao [20] construct a feature embedding by
ment to group within-video clusters across videos</p>
        <p>learning a linear mapping from visual features to a latent
into global clusters, which also overcomes the
as</p>
        <p>space with a ranking loss. However, the linear model
sumption that in all videos of an activity, actions</p>
        <p>trained with individual frames does not consider the
temfollow the same temporal order.</p>
        <p>poral association between frames. VidalMata et al. [22]
employ a U-Net trained on individual frames for future
2. Related Work frame prediction. Predicting for one or a few steps ahead
only requires temporal relations within a small temporal
Unsupervised learning of video representations is window. Instead, we propose to learn a representation by
commonly performed via pretext tasks, such as recon- predicting the complete sequence of relative timestamps
struction [23, 24], future frame prediction [22, 25, 26], to encode the long-range temporal information.</p>
        <p>For the clustering step, related works [22, 23] neglect
temporal consistency of frames within a video. Instead,
we apply within-video clustering on each video with
a proposed similarity formulation that considers both
spatial and temporal distances.</p>
        <p>Two recent approaches perform clustering [37] or
cluster-agnostic boundary detection [38] on each video
separately, without identifying clusters or segments
across videos. [37] solves a task similar to human motion
parsing and evaluates the segmentation for each video
individually. [38] only detects boundaries of
categoryagnostic segments, and does not identify if some
segments within a video or across videos are of the same
category. On the contrary, our segments on all videos
are category-aware as they are aligned globally across
videos by our global cluster assignment.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Temporal-Aware Embedding and Clustering (TAEC)</title>
      <p>We address unsupervised action segmentation as illustrated in Fig. 1. First, we learn a suitable feature embedding (Sec. 3.1). We then perform within-video clustering on each video (Sec. 3.2.1) and group the within-video clusters into global clusters (Sec. 3.2.2). Finally, we compute temporally coherent segments on each video using Viterbi decoding (Sec. 3.3).</p>
      <sec id="sec-2-1">
        <title>3.1. SSTEN: Sequence-to-Sequence Temporal Embedding Network</title>
        <p>To learn a latent representation for temporal sequences, we adopt a sequence-to-sequence autoencoder. Inspired by the multi-stage temporal convolutional network [7], we use a concatenation of two stages for both encoder and decoder, as shown in Fig. 2. Given a set {X_n}_{n=1}^{N} of N videos, where each video X_n = {x_{t,n}}_{t=1}^{T_n} has T_n frames, the outputs are the reconstructed frame features {x̂_{t,n}}_{t=1}^{T_n}. The embedded features are the hidden representation {e_{t,n}}_{t=1}^{T_n}.</p>
        <p>Every encoder and decoder stage consists of 1×1 convolution layers for dimension adjustment (Fig. 2, blue) and L dilated residual layers (green), each containing a dilated temporal 1D convolution. Since no fully connected layers are employed, sequences of variable length can be processed seamlessly. The dilation rate at the l-th layer is 2^{l-1}, so the temporal receptive field grows exponentially when stacking dilated residual layers: the receptive field of the l-th layer is 1 + (k - 1)(2^l - 1), where k is the kernel size. Therefore, each frame in the hidden representation has a long temporal dependency on the input video. In each encoder stage, we use a 1×1 convolution layer (in red) to predict the frame-wise relative timestamps τ_{t,n} = t / T_n. At the end of each encoder stage, the hidden representation is a concatenation (in yellow) of the features from the dilated residual layers and the predicted relative timestamps. The training loss is

\mathcal{L} = \lambda \sum_{n=1}^{N} \sum_{t=1}^{T_n} \lVert x_{t,n} - \hat{x}_{t,n} \rVert_2^2 + \sum_{s \in \{1,2\}} \sum_{n=1}^{N} \sum_{t=1}^{T_n} (\tau_{t,n} - \hat{\tau}_{t,n,s})^2, \qquad (1)

where the coefficient λ balances the two terms. The pretext tasks of reconstruction and relative timestamp prediction encode both the spatial distribution and the global temporal information into the embedded features.</p>
        <p>We compare SSTEN with several baseline embedding networks in the supplementary.</p>
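        <p>For concreteness, the following is a minimal PyTorch-style sketch of the loss in Eq. (1) for a single video; the function name, tensor shapes and the toy data are illustrative assumptions rather than the original implementation.</p>
        <preformat>
import torch

def ssten_loss(x, x_hat, tau, tau_hat_stages, lam):
    """Sketch of Eq. (1): reconstruction term weighted by lam plus the
    relative-timestamp prediction terms of the two encoder stages.

    x, x_hat        : (T, D) input and reconstructed frame features of one video
    tau             : (T,) ground-truth relative timestamps t / T
    tau_hat_stages  : list of (T,) predictions, one per encoder stage (here two)
    lam             : coefficient balancing the reconstruction term
    """
    recon = ((x - x_hat) ** 2).sum(dim=1).sum()            # sum_t ||x_t - x_hat_t||_2^2
    time = sum(((tau - tau_hat) ** 2).sum() for tau_hat in tau_hat_stages)
    return lam * recon + time

# toy usage for a video with T = 100 frames and D = 64 feature dimensions
T, D = 100, 64
x = torch.randn(T, D)
x_hat = x + 0.1 * torch.randn(T, D)
tau = torch.arange(1, T + 1, dtype=torch.float32) / T
tau_hat_stages = [tau + 0.05 * torch.randn(T), tau + 0.05 * torch.randn(T)]
loss = ssten_loss(x, x_hat, tau, tau_hat_stages, lam=0.002)
        </preformat>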
      </sec>
      <sec id="sec-2-3">
        <title>3.2. Two-Step Clustering</title>
        <p>To learn a latent representation for temporal sequences, After learning the feature embedding, we group the
emwe adopt a sequence-to-sequence autoencoder. Inspired bedded features into  clusters by a within-video
clusby the multi-stage temporal convolutional network [7], tering and a cross-video global cluster assignment.
we use a concatenation of two stages for both encoder</p>
        <p>and decoder, as shown in Fig. 2. Given a set {X}=1
of  videos, where each video X = {x,}=1 has 3.2.1. Within-Video Clustering
 frames, the outputs are reconstructed frame features We perform spectral clustering on frames within each
{xˆ,}=1. The embedded features are the hidden repre- video (detailed description in the supplementary). Given
sentation {e,}=1. the embedded feature sequence1 [e1, e2, ..., e ], we build</p>
        <p>Every encoder and decoder stage consist of 1 × 1 con- a frame-to-frame similarity matrix  ∈ R × . The
volution layers for dimension adjustment (Fig. 2 blue) and entries (, ), ,  ∈ {1, ...,  }, represent the similarity
 dilated residual layers (green), each containing a di- between frame  and frame . To consider both the spatial
lated temporal 1D convolution. Since no fully connected and temporal distance of features, we propose to measure
layers are employed, sequences of variable lengths can the similarity by the product of two Gaussian kernels
be processed seamlessly. The dilation rate at the -th
ltaeymeproirsal2re−c1e.ptBivyesfietladckinincrgeadsielasteexdproenseidnutiaallllya.yTehrse, rteh-e (, ) = exp(︃− ‖e−s2paet ‖22 )︃ · exp(︂− ( −t2mp )2 )︂,
ceptive field of the -th layer is 1 + ( − 1) × (2 − 1), (2)
where  is the kernel size. Therefore, each frame in the where ,  are the corresponding relative timestamps
hidden representation has a long temporal dependency of frame ,  and spat, tmp are the scaling factors for the
on the input video. In each encoder stage, we use a 1 × 1
convolution layer (in red) to predict the frame-wise
rela</p>
        <p>. At the end of each encoder 1For ease of notation, we omit the video index .
tive timestamps , =</p>
        <p>Wei Lin et al. CEUR Workshop Proceedings
1–10
spatial and temporal Gaussian kernels. To avoid
manually tuning spat, we use local scaling [39] to estimate spat
dynamically. To this end, we replace s2pat by  , where
 is the distance from e to its -th nearest neighbor in
the embedding space. We provide an ablation study on
scaling of the spatio-temporal similarity in the
supplementary. Consequently, frames of similar visual content
and relative timestamps are encouraged to be grouped
into the same cluster.
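        <p>A small NumPy sketch of the similarity in Eq. (2) with local scaling is given below; the function name, the neighbor index m and the temporal scaling choice (taken from the supplementary, Sec. 3.2) are assumptions for illustration.</p>
        <preformat>
import numpy as np

def spatio_temporal_similarity(e, m=9, sigma_prime=1.0 / 6.0):
    """Sketch of Eq. (2): product of a spatial and a temporal Gaussian kernel.

    e           : (T, D) embedded features of one video
    m           : neighbor index used for local scaling of the spatial kernel
    sigma_prime : temporal scale, with sigma_tmp^2 = 2 * sigma_prime^2
    """
    T = e.shape[0]
    tau = np.arange(1, T + 1) / T                                  # relative timestamps
    d2 = ((e[:, None, :] - e[None, :, :]) ** 2).sum(-1)            # squared spatial distances
    # local scaling: sigma_i is the distance from e_i to its m-th nearest neighbor
    sigma = np.sqrt(np.sort(d2, axis=1)[:, m])
    spatial = np.exp(-d2 / (sigma[:, None] * sigma[None, :]))
    temporal = np.exp(-((tau[:, None] - tau[None, :]) ** 2) / (2.0 * sigma_prime ** 2))
    return spatial * temporal
        </preformat>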
        <p>3.2.2. Cross-Video Global Cluster Assignment. After within-video clustering, we assign the N × K within-video clusters across videos into K global clusters. Every global cluster should contain N within-video clusters, each coming from a different video (c.f. Fig. 1). This can be interpreted as an N-dimensional assignment problem [40].</p>
        <p>We regard the n-th video V_n = {c_{n,k} | k = 1,..,K} as a vertex set, where each k-th within-video cluster c_{n,k} is a vertex. We construct an N-partite graph G = (V_1 ∪ V_2 ∪ ... ∪ V_N, E), where E = ⋃_{m&lt;n, m,n∈{1,...,N}} {(c, c') | c ∈ V_m, c' ∈ V_n} is the set of edges between within-video clusters across videos. The edge weight w(c, c') is the distance between the centroids of the two within-video clusters c, c'. The solution to the N-dimensional assignment is a partition obtained by dividing the graph G into K cliques Q_1, Q_2, ..., Q_K. A clique Q_k, which is a subset of N vertices from N different vertex sets, defines the k-th global cluster. The induced sub-graphs of the cliques Q_1, Q_2, ..., Q_K are complete and disjoint. We denote the edge set of the induced sub-graph of Q_k by E_k. The cost of a clique is the sum of the pairwise edge weights between the contained vertices, and the cost of an assignment solution is the sum of the costs of all K cliques, i.e.,

\mathcal{L}(Q_1, Q_2, ..., Q_K) = \sum_{k=1}^{K} \sum_{(c,c') \in E_k} w(c, c'). \qquad (3)</p>
        <p>In order to solve this NP-hard problem, we employ an iterative multiple-hub heuristic [41]. In each iteration, we choose a hub vertex set V_h = {c_{h,k} | k = 1,..,K}, and there are (N - 1) non-hub vertex sets. We compute an assignment solution in each iteration in two steps, as shown in Fig. 3: (1) We first perform (N - 1) bipartite matchings between V_h and each of the remaining non-hub vertex sets. (2) Secondly, we determine the edge connections between pairs of non-hub vertex sets: on two non-hub vertex sets V_m, V_{m'}, we connect two vertices c ∈ V_m and c' ∈ V_{m'} if c and c' are connected to the same vertex of V_h. After the two steps, every hub vertex c_{h,k}, with k ∈ {1,..,K}, and all the non-hub vertices connected to c_{h,k} form the k-th clique Q_k. Therefore, the N-partite graph G is partitioned into K complete and disjoint subgraphs.</p>
        <p>By iterating over all possible initial hub vertex sets h ∈ {1, ..., N}, we choose the assignment solution ĥ which minimizes the assignment cost

\hat{h} = \arg\min_{h \in \{1,...,N\}} \sum_{(c,c') \in E} \delta_h(c, c') \cdot w(c, c'), \qquad (4)

where δ_h(c, c'), ∀(c, c') ∈ E, is a binary indicator function that describes the edge connections: δ_h(c, c') equals 1 when the two vertices c, c' are connected. The assignment solution ĥ describes the partition which leads to the K global clusters.</p>
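        <p>The multi-hub heuristic can be sketched with an off-the-shelf Hungarian solver for the bipartite matchings in step (1); the following NumPy/SciPy snippet is only an illustration of Eqs. (3) and (4), and its function and variable names are not taken from the original code.</p>
        <preformat>
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def global_cluster_assignment(centroids):
    """Sketch of the iterative multiple-hub heuristic.

    centroids : list of N arrays, each (K, D), holding the within-video cluster
                centroids of one video.  Returns one permutation per video that
                maps its K within-video clusters to the K global clusters.
    """
    N, K = len(centroids), centroids[0].shape[0]
    candidates = []
    for h in range(N):                                   # iterate over all hub choices
        perms = [None] * N
        perms[h] = np.arange(K)                          # hub clusters define the global ids
        for n in range(N):
            if n == h:
                continue
            # step (1): bipartite matching between the hub and one non-hub vertex set
            row, col = linear_sum_assignment(cdist(centroids[h], centroids[n]))
            perm = np.empty(K, dtype=int)
            perm[col] = row                              # cluster `col` of video n joins clique `row`
            perms[n] = perm
        # step (2): vertices matched to the same hub vertex form a clique; the cost of
        # the solution is the sum of pairwise edge weights inside all K cliques
        cost = 0.0
        for k in range(K):
            members = np.stack([centroids[n][int(np.where(perms[n] == k)[0][0])]
                                for n in range(N)])
            cost += cdist(members, members).sum() / 2.0
        candidates.append((cost, perms))
    best = int(np.argmin([c for c, _ in candidates]))    # Eq. (4): keep the cheapest hub
    return candidates[best][1]
        </preformat>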
      </sec>
      <sec id="sec-2-4">
        <title>3.3. Frame Labeling by Viterbi Decoding</title>
        <p>Given the embedded feature sequence e_{1∼T_n,n} of video n, we determine the optimal label sequence ĉ_{1∼T_n,n}. The posterior probability can be factorized into the product of likelihoods and the probability of a given temporal order, i.e.,

\hat{c}_{1 \sim T_n, n} = \arg\max_{c_{1 \sim T_n, n}} p(c_{1 \sim T_n, n} | e_{1 \sim T_n, n}) = \arg\max_{c_{1 \sim T_n, n}} \Big\{ \prod_{t=1}^{T_n} p(e_{t,n} | c_{t,n}) \cdot \prod_{t=1}^{T_n} p(c_{t,n} | c_{1 \sim (t-1), n}) \Big\}.

We fit a Gaussian model on each global cluster and compute the frame-wise likelihoods, i.e., p(x | k) = N(x; μ_k, Σ_k), k ∈ {1, ..., K}. The temporal order constraint is used to limit the search space for the optimal label sequence by filtering out the sequences that do not follow the temporal order.</p>
        <p>The related works [21, 22] apply K-means on the frames of all the videos. From the unified clustering, they derive only a single temporal order of clusters for all the videos. However, this is an unrealistic assumption due to interchangeable steps in the activities, e.g., pour milk and pour sugar in making coffee. Instead, we can easily derive the temporal order for each video separately. We do so by sorting the within-video clusters according to the average timestamp of the frames in each cluster. The output of the Viterbi decoding is the optimal cluster label sequence ĉ_{1∼T_n,n}. More details are given in the supplementary.</p>
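        <p>A compact sketch of this decoding step is shown below, assuming the per-video cluster order has already been derived by sorting average timestamps; the Gaussian fitting, the forced full sequence and all names are illustrative choices, not the reference implementation.</p>
        <preformat>
import numpy as np
from scipy.stats import multivariate_normal

def viterbi_decode(frames, clusters, order):
    """Sketch of Sec. 3.3: Gaussian likelihoods per global cluster plus a binary
    transition model that only allows staying in the current cluster or moving
    to the next one in the video-specific temporal order.

    frames   : (T, D) embedded features of one video
    clusters : list of K arrays (N_k, D) with the frames of each global cluster
    order    : cluster indices sorted by the average timestamp of their frames
    """
    T, K = frames.shape[0], len(order)
    loglik = np.stack([
        multivariate_normal(clusters[k].mean(0),
                            np.cov(clusters[k].T) + 1e-6 * np.eye(frames.shape[1]),
                            allow_singular=True).logpdf(frames)
        for k in order], axis=1)                          # (T, K), columns follow `order`
    score = np.full((T, K), -np.inf)
    back = np.zeros((T, K), dtype=int)
    score[0, 0] = loglik[0, 0]                            # the path starts in the first cluster
    for t in range(1, T):
        for j in range(K):
            prev = [j] if j == 0 else [j - 1, j]          # stay, or advance by one cluster
            best = prev[int(np.argmax(score[t - 1, prev]))]
            score[t, j] = score[t - 1, best] + loglik[t, j]
            back[t, j] = best
    labels = np.zeros(T, dtype=int)
    labels[-1] = K - 1                                    # force the full ordered sequence
    for t in range(T - 2, -1, -1):
        labels[t] = back[t + 1, labels[t + 1]]
    return [order[j] for j in labels]                     # map back to global cluster ids
        </preformat>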
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Experiments</title>
      <p>4.1. Datasets &amp; Evaluation Metrics. We evaluate on Breakfast [42], the YouTube Instructions dataset (YTI) [36] and 50 Salads [43]. Breakfast is comprised of 1712 videos recorded in various kitchens. There are 10 composite activities of breakfast preparation. YTI is composed of 150 videos of 5 activities collected from YouTube. 50 Salads contains 50 videos of people preparing salads. Following [20, 21, 22], we use the dense trajectory Fisher vector features (DTFV) [44] for Breakfast and 50 Salads, and the features provided by Alayrac et al. [36] for YTI. We use the evaluation protocol in [21] and report the performance in three metrics: (1) Mean over Frames (MoF) is the frame-level accuracy over the frames of all the videos. More frequent or longer action instances have a higher impact on the result. (2) Class-wise mean Intersection over Union (cIoU) is the average over the IoU performance for each class and penalizes segmentation results with dominating segments. (3) The F1-score penalizes results with oversegmentation.</p>
      <p>4.2. Implementation Details. For our SSTEN, we adapt the number of dilated residual layers L according to the dataset size: we set L = 5 for YTI (15k frames per activity subset on average) and L = 10 for Breakfast (360k) and 50 Salads (577k). The dimension of the hidden representation is set to 32. We set λ in Eq. (1) to 0.002 (Breakfast), 0.01 (YTI) and 0.005 (50 Salads). For clustering, we follow the protocol of [20, 36] and define the number of clusters K separately for each activity as the maximum number of ground truth classes. The values of K for the three datasets are provided in the supplementary material.</p>
      <p>4.3. Comparison with the State-of-the-Art. We compare with unsupervised learning methods, as well as weakly and fully supervised approaches, on Breakfast (Table 1), YTI (Table 2) and 50 Salads (Table 3). Most unsupervised segmentation approaches yield cluster-aware segments that are aligned across all the videos [20, 21, 22, 36, 45]. These approaches are evaluated with global Hungarian matching on all videos, where the mapping between ground truth classes and clusters is performed on all the videos of an activity, which results in one mapping for each activity. The number of clusters K is set to the maximum number of ground truth classes for each activity (i.e., K=max.#gt). We focus on the performance comparison in this setting and follow this setting in all the ablation studies.</p>
      <p>Two recent approaches perform clustering (i.e., TW-FINCH [37]) or category-agnostic boundary detection (i.e., LSTM+AL [38]) on each video individually, without solving the alignment among different clusters or segments across videos. For a fair comparison, these are evaluated by local Hungarian matching on individual videos, where a per-video best ground-truth-to-cluster-label mapping is determined using the ground truth of each video separately. This results in a separate label mapping for each video. Following [37], we also report results with K set to the average number of actions for each activity (i.e., K=avg.#gt) for a complete comparison.</p>
      <p>In Table 1, TAEC achieves strong results in comparison to the unsupervised state-of-the-art and is even comparable to weakly supervised approaches. Although approaches without solving the alignment of clusters across videos inherently lead to better scores in the evaluation settings of the local Hungarian matching, our approach still compares favorably.</p>
      <p>We compare qualitative results (with global Hungarian matching) of TAEC and MLP+kmeans [21] on 3 Breakfast activities in Fig. 4. We see that our two-step clustering (the 2nd rows in all clustering result plots) already leads to temporally consistent segments with relatively accurate boundaries of action instances, while K-means (the 4th rows in all clustering result plots) results in serious oversegmentation. The Viterbi decoding further improves the segmentation by suppressing the oversegmentation and the domination of incorrect clusters (the 2nd rows in all final result plots). Moreover, MLP+kmeans [21] follows the constraint of a fixed temporal order of segments on the videos of each activity (the 4th rows in all final result plots). In contrast, TAEC yields an individual temporal order for each video (the 2nd rows in all final result plots). Additional qualitative results and evaluation scores are included in the supplemental material.</p>
      <p>For the YouTube Instructions dataset, we follow the protocol of [20, 21, 36] and report the results with and without considering background frames. Here, our TAEC outperforms all recent works in almost all of the metrics under all three settings.</p>
      <p>50 Salads is a particularly challenging dataset for unsupervised approaches, as each video has a different order of actions and additionally includes many repetitive action instances. In the eval-level of 12 classes, TAEC outperforms all approaches under the global Hungarian matching evaluation and achieves competitive results under the local Hungarian matching. In the challenging mid-level evaluation of 19 classes, the sequential nature of frames is less advantageous. Therefore, MLP+kmeans [21] outperforms TAEC. Generally, in the local matching case, approaches without alignment across videos compare favorably.</p>
      <p>Figure 4: Qualitative clustering and final segmentation results for three Breakfast activities (making cereals, making juice and making fried egg), three videos each, with ground truth action classes such as take bowl, pour cereals, pour milk, stir milk; cut orange, squeeze orange, take squeezer, take knife, take plate, take glass, pour juice; pour oil, butter pan, take eggs, crack egg, fry egg, add salt, put egg2plate.</p>
      <p>Comparison of raw features without embedding.</p>
      <p>Among the three types of features without temporal
embedding, I3D achieves the best performance, while
AlexNet features lead to the worst results. AlexNet
features are computed from individual spatial frames. On
the contrary, each frame feature of DTFV and I3D is
computed from a chunk of temporally neighboring frames.</p>
      <p>Therefore, the features already carry intrinsic temporal
consistency. Furthermore, the two-stream I3D model can
leverage both RGB and optical flow. Therefore, I3D
features achieve a better performance than DTFV, which
rely on handcrafted dense trajectories.</p>
      <p>Comparison of SSTEN embeddings learned on
different features. When comparing the SSTEN
embeddings to the performance of the raw features, we see
that SSTEN leads to a significant performance gain for
both clustering methods. For DTFV, the performance
improvements by SSTEN are MoF 8.5%, IoU 6.0%, F1 8.9%
with K-means and MoF 15.8%, IoU 6.9%, F1 11.6% with
two-step clustering.</p>
      <p>Among the three types of SSTEN embedded features,
I3D has slightly better IoU and F1 scores while DTFV
leads to the best MoF scores for both K-means and the
two-step clustering. Overall, the SSTEN embeddings
learned from these two features perform comparably. We
conduct the following experiments using DTFV, which
is also used in related works.</p>
      <sec id="sec-3-1">
        <title>4.5. Impact of Loss Terms on Clustering</title>
        <p>To evaluate the impact of the two loss terms in Eq. (1), we plot the quantitative segmentation results of SSTEN with both K-means and the two-step clustering w.r.t. different reconstruction loss coefficients λ in Fig. 5. In general, two-step clustering leads to a better performance than K-means for almost all λ values (except for the case of only the reconstruction loss). With decreasing λ, the relative time prediction loss has an increasing impact and the embedded features have better global temporal consistency, which explains the increasing IoU and F1 scores. However, at extremely small λ values, the embedded features overfit to the relative time prediction task, which results in saturated IoU and F1 scores, and a significant drop in MoF for both K-means and two-step clustering.</p>
        <p>Figure 5: MoF, IoU and F1 of K-means and two-step clustering on SSTEN embeddings trained with different reconstruction loss coefficients λ, ranging from only relative time prediction to only reconstruction.</p>
        <p>To intuitively illustrate the impact of the loss terms on the two-step clustering, we plot the similarity matrices for SSTEN embeddings trained with three different λ values in Fig. 6. Here, we look at the similarity matrices with the temporal Gaussian kernel (bottom row). Intuitively, the similarity matrix with a clear diagonal block structure (Fig. 6(a2)), which is the result of an appropriate ratio between the reconstruction loss and the relative time prediction loss (λ = 0.002), leads to the best segmentation performance. When λ becomes larger (e.g., λ = 0.01), the reconstruction loss has a larger impact and the diagonal block structure (Fig. 6(b2)) becomes pale. Therefore, the performances of the embedded features with λ = 0.005, λ = 0.01 and only the reconstruction loss degrade successively. On the other hand, for extremely small λ values (e.g., λ = 0.0005), the block diagonal structure (Fig. 6(c2)) becomes noisy due to overfitting on the relative time prediction.</p>
        <p>Figure 6: Frame-to-frame similarity matrices of SSTEN embeddings for the same Breakfast video. Columns show the similarity matrices for different λ, while the rows show results without (top) and with (bottom) the temporal Gaussian kernel.</p>
        <p>Therefore, both the reconstruction and the relative timestamp prediction loss, when combined in an appropriate ratio, are indispensable to learn an effective representation that preserves both the spatial layout and the temporal information.</p>
        <p>4.6. Impact of Cluster Assignment. In this ablation study, we evaluate the efficacy of the global cluster assignment. For two-step clustering, we evaluate two strategies of grouping within-video clusters into global clusters: (1) the naïve assignment, for which we order the sub-clusters according to the average timestamp and simply group the k-th sub-clusters of all videos into a global cluster, i.e., the global cluster Q_k = {c_{n,k} | n = 1, .., N}, and (2) the global cluster assignment, as detailed in Sec. 3.2.2.</p>
        <p>In order to show how the different cluster assignment strategies affect the clustering result, we report both the results of the two-step clustering (before Viterbi decoding) and the final segmentation performance (after Viterbi decoding) on Breakfast and 50 Salads in Table 5. The global cluster assignment outperforms the naïve assignment by a large margin, for both the clustering results and the final segmentation results, on both datasets. The advantage of the global cluster assignment is even more evident on 50 Salads.</p>
        <p>We illustrate exemplary qualitative results of the clustering and the final segmentation for 3 activities (with 3 videos each) on Breakfast in Fig. 4. For each video, the plots display the ground truth (1st row), the result with the global cluster assignment (2nd row), the result with the naïve assignment (3rd row) and the result of MLP+kmeans [21] (4th row). By comparing them, we see that the naïve assignment simply assumes that the sub-clusters at the same temporal position in each video belong to the same global cluster, while they might not be close to each other in the feature space. On the contrary, the global cluster assignment (the 2nd rows of the final result plots) yields an optimal assignment solution with respect to the pairwise distances between sub-clusters, resulting in different orderings of sub-clusters on each video. Note that on some videos, the global cluster assignment can lead to the same assignment result as the naïve assignment.</p>
        <p>5. Conclusion. We proposed a new pipeline for the unsupervised learning of action segmentation. For the feature embedding, we propose a temporal-aware embedding network that performs sequence-to-sequence learning with the pretext tasks of relative timestamp prediction and feature reconstruction. For clustering, we propose a two-step clustering schema, consisting of within-video clustering and cross-video global cluster assignment. The temporal embedding of sequence-to-sequence learning together with two-step clustering is proven to be a well-suited combination that considers the sequential nature of frames in both processing steps. Ultimately, we combine the temporal embedding with a frame-to-cluster assignment based on Viterbi decoding, which achieves the unsupervised state-of-the-art on three challenging benchmarks.</p>
        <p>TAEC: Unsupervised Action Segmentation with Temporal-Aware Embedding and Clustering</p>
        <p>Supplementary</p>
        <p>1. Introduction. For additional insights into TAEC, we introduce the background of spectral clustering in Sec. 2.1 and give details of the Viterbi decoding in Sec. 2.2. We perform more ablation studies comparing baseline embeddings and clustering methods (Sec. 3.1), the scaling of the spatio-temporal similarity (Sec. 3.2), cluster ordering (Sec. 3.3) and decoding strategies (Sec. 3.4). Finally, we provide more quantitative (Sec. 3.5) and qualitative segmentation results (Sec. 3.6) on the three datasets.</p>
        <p>2. Method</p>
        <p>2.1. Spectral Clustering. Background information related to Sec. 3.2.1 in the main manuscript: Given the embedded feature sequence e_1, e_2, ..., e_T, we build a frame-to-frame similarity graph A ∈ R^{T×T}, whose edge weight A(i,j), i,j ∈ {1,...,T}, represents the similarity between frame i and frame j. Grouping the frames into K clusters can be interpreted as a graph partition problem by cutting edges on A, resulting in K subgraphs A_1, A_2, ..., A_K. The normalized cut (Ncut) problem [1] is employed to compute a balanced partition by minimizing the energy

\mathcal{L}(A_1, A_2, ..., A_K) = \frac{1}{2} \sum_{k=1}^{K} \frac{W(A_k, \bar{A}_k)}{vol(A_k)}, \qquad (1)

where W(A_k, \bar{A}_k) represents the sum of edge weights between elements in the subgraph A_k and elements of all the other subgraphs, i.e., the sum of the weights of the edges to be cut, and vol(A_k) is the sum of the weights of the edges within the resulting subgraph A_k. Spectral clustering [2] is a relaxed solution to this NP-hard minimization problem in Eq. (1) and has shown good performance on many graph-based clustering problems, e.g., [3, 4, 5]. Note that while K-means operates on the Euclidean distance in the feature space and assumes convex and isotropic clusters, spectral clustering can find clusters with non-convex boundaries.</p>
        <p>2.2. Frame Labeling by Viterbi Decoding. Additional explanations to Sec. 3.3 in the main manuscript: The global cluster assignment delivers the ordered clusters on each video, which are aligned across all videos. To compute the final segmentation, we use the resulting ordering and decode each video into a sequence of K temporally consistent segments. That is, we determine the optimal label sequence ĉ_{1∼T_n,n} = {c_{1,n}, ..., c_{T_n,n}} by re-assigning each frame to one of the temporally ordered clusters.</p>
        <p>Given the embedded feature sequence e_{1∼T_n,n} = {e_{1,n}, ..., e_{T_n,n}} and the temporal order of the clusters, we search for the optimal label sequence that maximizes the probability p(c_{1∼T_n,n} | e_{1∼T_n,n}). Following [6], this posterior probability can be factorized into the product of likelihoods and the probability of a given temporal order, i.e.,

\hat{c}_{1 \sim T_n, n} = \arg\max_{c_{1 \sim T_n, n}} p(c_{1 \sim T_n, n} | e_{1 \sim T_n, n}) = \arg\max_{c_{1 \sim T_n, n}} \Big\{ \prod_{t=1}^{T_n} p(e_{t,n} | c_{t,n}) \cdot \prod_{t=1}^{T_n} p(c_{t,n} | c_{1 \sim (t-1), n}) \Big\} = \arg\max_{c_{1 \sim T_n, n}} \Big\{ \prod_{t=1}^{T_n} p(e_{t,n} | c_{t,n}) \cdot p(c_{t,n} | c_{t-1,n}) \Big\}. \qquad (2)

Here the likelihood p(e_{t,n} | c_{t,n}) is the probability of a frame embedding e_{t,n} from the video n belonging to a cluster. Therefore, we fit a Gaussian distribution on each global cluster and compute the frame-wise likelihoods with the Gaussian model, i.e.,

p(x | k) = \mathcal{N}(x; \mu_k, \Sigma_k), \quad k \in \{1, ..., K\}. \qquad (3)

p(c_{t,n} | c_{t-1,n}) is the transition probability from label c_{t-1,n} at frame t-1 to label c_{t,n} at frame t, which is defined by the temporal order of clusters. We denote the set of frame transitions defined by the temporal order of clusters on the n-th video by O_n; e.g., for the temporal order a → b → c → d, O_n = {a → b, b → c, c → d}. The transition probability is binary, i.e.,

p(c_{t,n} | c_{t-1,n}) = \mathbb{1}(c_{t,n} = c_{t-1,n} \vee c_{t-1,n} \rightarrow c_{t,n} \in O_n). \qquad (4)

This means that we allow either a transition to the next cluster according to the temporal order, or we keep the cluster assignment of the previous frame.</p>
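        <p>The relaxed Ncut solution of Sec. 2.1 can be sketched as the standard normalized-Laplacian spectral embedding followed by K-means; the snippet below illustrates that common recipe and is not the authors' exact implementation.</p>
        <preformat>
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(A, K):
    """Relaxed Ncut of Eq. (1): embed frames with the eigenvectors of the
    normalized graph Laplacian of the similarity matrix A, then run K-means.

    A : (T, T) frame-to-frame similarity matrix, e.g. from Eq. (2) of the main paper
    K : number of clusters
    """
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    # symmetric normalized Laplacian  L = I - D^{-1/2} A D^{-1/2}
    L = np.eye(A.shape[0]) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L)                        # eigenvalues in ascending order
    U = eigvecs[:, :K]                                    # K smallest eigenvalues
    U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    return KMeans(n_clusters=K, n_init=10).fit_predict(U)
        </preformat>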
        <p>Figure 1: Baseline embedding architectures: (a) MLP, (b) AEMLP, (c) TCN with dilated residual layers.</p>
        <p>Note that in two-step clustering, we derive the temporal order of clusters on each video separately, by sorting the clusters of the video according to the average timestamp. Therefore, we have an individual O_n for each video n. On the contrary, in K-means, there is a uniform order of global clusters for all the videos, and O_n is thus the same for each video n.</p>
        <p>The Viterbi algorithm for solving Eq. (2) is performed in an iterative process using dynamic programming, i.e.,

p(c_{1 \sim t, n} | e_{1 \sim t, n}) = \max_{c_{t-1,n}} \big\{ p(c_{1 \sim t-1, n} | e_{1 \sim t-1, n}) \cdot p(e_{t,n} | c_{t,n}) \cdot p(c_{t,n} | c_{t-1,n}) \big\}. \qquad (5)

The sequences that do not follow the temporal order are filtered out at an early stage to narrow down the search range for the optimal label sequence. The output of the Viterbi decoding is the optimal cluster label sequence, i.e., ĉ_{1∼T_n,n}.</p>
        <p>3. Additional Results</p>
        <p>3.1. Embedding and Clustering. Further, we compare our SSTEN embedding with three baseline variants (shown in Fig. 1): MLP temporal embedding, autoencoder with MLP (AEMLP) and temporal convolutional network (TCN), in combination with the two clustering methods. MLP uses three FC layers for relative timestamp prediction. AEMLP uses an MLP-based autoencoder for both relative timestamp prediction and feature reconstruction. TCN deploys L stacked dilated residual layers only for relative timestamp prediction. Here, we also implement the Rankloss MLP embedding [7] for reference. We report the performance of these five embeddings in Table 1.</p>
        <p>Comparison of the five embeddings. We learn the five embeddings (Rankloss MLP, MLP, AEMLP, TCN and SSTEN) on the DTFV features. Here, the Rankloss MLP (consisting of two FC layers) is trained with a ranking loss. We use the initialization of uniform segmentation as the temporal prior to train the model with only one iteration.</p>
        <p>TCN and SSTEN are both networks for sequence-to-sequence learning, while Rankloss MLP, MLP and AEMLP are trained on individual frames. By comparing the performance between these two groups in Table 1, we see that sequence-to-sequence learning leads to better performance, especially when combined with the two-step clustering, which results in clusters with better temporal consistency.</p>
        <p>For the two-step clustering, we also plot the frame-to-frame similarity matrices (spatial Gaussian kernel) of the five embeddings for the same Breakfast video in Fig. 2. The plots show that Rankloss MLP, MLP and AEMLP, which are trained on individual frames, do not expose an appropriate temporal structure. There are noisy block patterns even in positions far away from the diagonal, which results in noisy clusters and thus leads to erroneous temporal orders and inferior assignment results in the two-step clustering. The least noisy Rankloss MLP has the highest performance among these three. On the contrary, TCN and SSTEN embedded features, which show a clear diagonal block structure in the similarity graph, achieve a better performance in the two-step clustering. This verifies that sequence-to-sequence embedding learning (TCN and SSTEN) and two-step clustering are a well-suited combination to address the sequential nature of frames in both processing steps of feature embedding and clustering.</p>
        <p>Considering K-means clustering, the merit of having a better sequential nature of the embedded features via sequence-to-sequence learning can also be seen from the higher IoU and F1 scores (TCN: IoU 17.8%/F1 31.3% vs. SSTEN: 17.8%/31.9%), as these penalize dominating segments and oversegmentation.</p>
        <p>In contrast to TCN, SSTEN can preserve the spatial
layout of the input features due to the feature
reconstruction via the autoencoder. By comparing TCN and SSTEN,
we see that the SSTEN embedding with feature
reconstruction leads to a boost in the MoF score. The marginal
improvement of AEMLP over MLP is due to the fact that
the MLP structure with only FC layers is not well-suited
for feature reconstruction.</p>
        <p>Comparison between K-means and two-step
clustering. Considering the performance of the five
embeddings with the two clustering methods, we see that
K-means leads to higher scores on the inferior embeddings
(Rankloss MLP, MLP and AEMLP) trained on
individual frames, while two-step clustering performs better on
sequence-to-sequence learning-based embeddings (TCN
and SSTEN). When combined with the proposed SSTEN
embedding, two-step clustering outperforms K-means by
a large margin in terms of the MoF score. We also tried
applying K-means on each video separately. However,
the performance dropped significantly. K-means depends
only on the spatial distance and results in
oversegmentation, which leads to erroneous temporal order on each
video and thus, an inferior global cluster assignment.</p>
        <p>3.2. Impact of Scaling in the Spatio-temporal Similarity. We perform spectral clustering with the proposed spatio-temporal similarity. Here, we analyze the impact of the scaling factors in the spatial and temporal Gaussian kernels, i.e., σ_spat^2 and σ_tmp^2. These adjust the extent to which two frames are considered similar to each other and influence the clustering quality. The experiments are conducted for SSTEN embeddings on Breakfast.</p>
        <p>Impact of the scaling of the spatial Gaussian kernel. For local scaling, we set σ_spat^2 = σ_i σ_j, where σ_i is the distance from e_i to its m-th nearest neighbor in the feature space. The resulting segmentation performance w.r.t. m is shown in Fig. 3. With m varying in the range of 3 to 20, the IoU and F1 scores remain stable. There is a range of m ∈ {8, 9} where the best MoF scores are achieved, whereas for other scaling parameters the MoF score drops. Thus, we set m = 9 for all following evaluations.</p>
        <p>For comparison, we also set σ_spat to fixed values (without local scaling) and report the segmentation performance in Table 2. We achieve good results at smaller σ_spat values (0.5 and 0.7). However, with increasing σ_spat the MoF score drops significantly, while there are only minor fluctuations in IoU and F1. Apparently, σ_spat has a large impact on the clustering quality. The local scaling eases the effort of tuning σ_spat by dynamically determining the scaling factor.</p>
        <p>Impact of the scaling of the temporal Gaussian kernel. The temporal Gaussian kernel operates on the temporal distance between frames in a video. With σ_tmp^2 = 2σ'^2, the term exp(−(τ_i − τ_j)^2 / (2σ'^2)) is in the standard form of a Gaussian kernel. We set σ' = 1/6 so that the 6σ' range of the temporal Gaussian is equal to the video length (since the length of each video is normalized to 1 for the relative timestamp prediction). The segmentation performance with respect to σ' is shown in Table 3. Apparently, σ' = 1/6 leads to the best result. Here, we also evaluate the case without the temporal Gaussian kernel, which leads to a drop in performance. The impact of the temporal Gaussian kernel on the similarity matrices of SSTEN embeddings can also be seen by comparing the top and bottom rows of Fig. 6 in the main manuscript. For example, by adding the temporal Gaussian kernel, we decrease the similarities in Fig. 6(a1) according to the temporal distance between two frames, which leads to a clearer diagonal block structure in Fig. 6(a2). Thus, we set σ' = 1/6 for all following evaluations.</p>
        <p>Table 3: Segmentation performance of two-step clustering on SSTEN embeddings (λ = 0.002) with respect to the temporal scaling factor σ' (σ_tmp^2 = 2σ'^2) on Breakfast (in %).
σ'                    MoF   IoU   F1
∞ (w/o tmp. Gauss)    41.5  16.5  30.6
1/3                   43.5  16.9  31.3
1/6                   50.3  19.0  33.6
1/12                  44.3  18.5  34.1</p>
        <p>3.3. Impact of Cluster Order. One merit of performing within-video clustering is that we can derive the temporal order of sub-clusters for each video separately. The video-wise individual order of clusters is used to guide the Viterbi decoding, which breaks the common assumption that clusters follow the same temporal order in all the videos. In the following, we verify the efficacy of the derived video-wise order of clusters. We use the same within-video clustering result with global cluster assignment and perform Viterbi decoding using two different temporal cluster orders: (1) video-wise order: the temporal order of sub-clusters is determined on each video separately; and (2) uniform order: the uniform order is determined by sorting the average timestamps of the global clusters and is then applied to all the videos. Table 4 reports the segmentation performance (after Viterbi) with these two orders for our SSTEN embeddings on Breakfast and YTI. To measure the correctness of the predicted segment order, we adopt the segmental edit distance (Edit), which is a common metric for supervised action segmentation, e.g., [8, 9, 10, 11]. It penalizes segmentation results that have a different segment order than the ground truth (i.e., it penalizes out-of-order predictions, as well as oversegmentation).</p>
        <p>Table 4: Impact of the cluster order for two-step clustering on SSTEN embeddings (in %).
Order        Breakfast: MoF  IoU   F1    Edit     YTI: MoF  IoU   F1    Edit
video-wise   50.3  19.0  33.6  42.3              46.6  10.7  29.5  25.5
uniform      53.5  15.7  32.2  33.0              40.7   7.7  25.1  20.3</p>
        <p>From Table 4 we see that our video-wise order clearly outperforms the uniform order, except for MoF on Breakfast. Furthermore, the edit score verifies that our derived video-wise temporal orders are valid.</p>
        <p>In our experiments we especially notice that the MoF and IoU scores can act contradictory to each other, e.g., the uniform order results in higher MoF scores (on Breakfast) at the cost of lower IoU scores. MoF tends to overfit on dominant classes (e.g., classes with longer action instances), while IoU is sensitive to underrepresented classes and penalizes segmentation results with dominating segments. Therefore, it is necessary to consider all metrics for evaluation, as a higher MoF score does not always correspond to better performance in practice.</p>
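        <p>The segmental edit score used in Table 4 can be sketched as a normalized Levenshtein distance over the per-video sequences of segment labels; the exact normalization below is one common convention and is assumed here for illustration.</p>
        <preformat>
import numpy as np

def edit_score(gt, pred):
    """Segmental edit score: collapse frame labels into segment label
    sequences and compute a normalized Levenshtein distance (higher is better)."""
    def segments(labels):
        return [labels[i] for i in range(len(labels))
                if i == 0 or labels[i] != labels[i - 1]]

    a, b = segments(list(gt)), segments(list(pred))
    D = np.zeros((len(a) + 1, len(b) + 1))
    D[:, 0] = np.arange(len(a) + 1)
    D[0, :] = np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            D[i, j] = min(D[i - 1, j] + 1, D[i, j - 1] + 1, D[i - 1, j - 1] + cost)
    return 100.0 * (1.0 - D[len(a), len(b)] / max(len(a), len(b), 1))
        </preformat>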
        <p>3.4. Impact of Decoding Strategies. We compare our approach, which uses Viterbi decoding, with the Mallow model decoding that has been proposed in [7]. The authors propose a Rankloss embedding over all video frames from the same activity with respect to a pseudo ground truth action annotation. The embedded frames of the whole activity set are then clustered, and the likelihood for each frame and each cluster is computed. For the decoding, the authors build a histogram of features with respect to their clusters with a hard assignment and set the length of each action with respect to the overall amount of features per bin. After that, they apply a Mallow model to sample different orderings for each video with respect to the sampled distribution. The resulting model is a combination of Mallow model sampling and action length estimation based on the frame distribution.</p>
        <p>For this experiment, we evaluate the impact of the different decoding strategies on two embeddings: the Rankloss embedding [7] and our SSTEN embedding. Table 5 reports the results of these two embeddings in combination with three decodings: the Mallow model, Viterbi decoding with K-means and Viterbi decoding with two-step clustering.</p>
        <p>Table 5: Comparison of combinations of embeddings and decoding strategies on Breakfast (in %).
Decoding            Rankloss [7] embed.: MoF  IoU   F1      SSTEN embed.: MoF  IoU   F1
Mallow [7]          34.2  17.6  31.8                        36.4  18.1  31.5
kmeans+Viterbi      35.7  15.8  28.4                        39.3  17.8  31.9
two-step+Viterbi    34.7  13.4  23.7                        50.3  19.0  33.6</p>
        <p>Following [7], we run 7 iterations for the Rankloss embedding with the Mallow model. In each iteration, the Rankloss embedding is retrained using the segmentation result from the last iteration as pseudo label, and the frame-wise likelihoods and the Mallow model are updated.</p>
        <p>Unlike the Mallow model, our Viterbi decoding is a one-iteration procedure. It operates on the embedding, which is trained only once. When combining with Viterbi, we train the Rankloss model only once, using the initialized uniform segmentation as a prior. For SSTEN with the Mallow model, we only run one iteration, as we do not need to train SSTEN with pseudo labels iteratively.</p>
        <p>Considering the Rankloss results in Table 5, we see that combining it with the Mallow model achieves its highest IoU and F1 scores. This is because, for Viterbi decoding, the Rankloss model trained only once using the uniform initialization as pseudo label lacks a strong temporal prior. Considering SSTEN, our Viterbi decoding with two-step clustering clearly outperforms the Mallow model. With Mallow, the SSTEN embedding has competitive IoU and F1 scores but significantly lower MoF. We also tried running the Mallow model on SSTEN embedded features for multiple iterations. However, this resulted in a reduced number of clusters. Thus, we see that an appropriate combination of embedding and decoding strategy is necessary.</p>
        <p>To have a closer look into the Viterbi decoding, we visualize the likelihood grids computed from the global clusters, as well as the resulting decoding paths over time, for two videos on Breakfast in Fig. 4. It shows that the decoding, which generates a full sequence of actions, is able to marginalize actions that do not occur in the video by assigning only very few frames to them, while the majority of frames are assigned to the clusters that do occur in the video. Even if the given temporal order constrains the resulting K coherent segments to follow the fixed temporal order, the segments that actually do not belong in the sequence will be marginalized, because the Viterbi algorithm decodes a path that maximizes the posterior probability. Overall, it turns out that Viterbi decoding constrained by a temporal order performs better than the Mallow model's iterative re-ordering.</p>
        <p>Figure 4: Viterbi decoding paths on the likelihood grids (prediction vs. ground truth over the frame axis) for a making juice and a making fried egg video from Breakfast.</p>
        <p>3.5. Quantitative Segmentation Results</p>
        <p>3.5.1. Results of Clustering and Final Segmentation. In order to show the advantage of two-step clustering over K-means when combined with the proposed SSTEN embedding, we report both the results of the clustering (before Viterbi decoding) and the final segmentation performance (after Viterbi decoding) on Breakfast in Table 6. We see that the proposed two-step clustering leads to superior performance compared to K-means, in terms of both the clustering (before Viterbi decoding) in most metrics, and the final segmentation (after Viterbi decoding).</p>
        <p>Table 6: Comparison of combinations of SSTEN and different clustering methods in terms of clustering and final segmentation after Viterbi decoding on Breakfast (in %).
Embedding   Clustering        Clustering results: MoF  IoU   F1      Final results: MoF  IoU   F1
SSTEN       K-means           27.2  13.5  26.3                       39.3  17.8  31.9
SSTEN       two-step cluster  38.6  13.7  25.9                       50.3  19.0  33.6</p>
        <p>3.5.2. Segmentation Results on Each Activity. We report the ground truth number of classes and the segmentation performance of MLP with K-means (MLP+kmeans, our reimplementation of [12]) and TAEC for each activity on Breakfast (Table 7), YTI (Table 8) and 50 Salads (Table 9). The evaluation is done with global Hungarian matching on all videos. The number of clusters is set to the maximum number of ground truth classes for each activity (K = max.#gt).</p>
        <p>Table 7: Per-activity segmentation results of MLP+kmeans and TAEC on Breakfast (coffee, cereals, tea, milk, juice, sandwich, scrambled egg, fried egg, salad, pancake, and All), together with the number of ground truth classes per activity.</p>
        <p>3.6. Qualitative Segmentation Results. We show the qualitative results of clustering and final segmentation on 7 composite activities: making cereals (Fig. 5), making milk (Fig. 6), making juice (Fig. 7), making fried egg (Fig. 8) and making pancake (Fig. 9) on Breakfast, changing tire (Fig. 10) on YTI, and making salad (Fig. 11) on 50 Salads (eval, 12 classes). The mapping between cluster labels and ground truth classes is done with global Hungarian matching on all videos. The number of clusters is set to the maximum number of ground truth classes for each activity (K = max.#gt).</p>
        <p>For each activity, we visualize the results of 10 videos.</p>
        <p>For each video, the 3-row group displays the ground truth (1st row), TAEC (2nd row) and MLP+kmeans [12] (3rd row).</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          constraints, in: CVPR,
          <year>2018</year>
          . [1]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>She</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Smolic</surname>
          </string-name>
          , Action-net: Multipath [20]
          <string-name>
            <given-names>F.</given-names>
            <surname>Sener</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <article-title>Unsupervised learning</article-title>
          and segmen-
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <article-title>excitation for action recognition</article-title>
          ,
          <source>in: CVPR</source>
          ,
          <year>2021</year>
          .
          <article-title>tation of complex activities from video</article-title>
          , in: CVPR, [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Feichtenhofer</surname>
          </string-name>
          , X3d: Expanding architectures for
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <article-title>eficient video recognition</article-title>
          ,
          <source>in: CVPR</source>
          ,
          <year>2020</year>
          . [21]
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>