=Paper=
{{Paper
|id=Vol-3349/paper1
|storemode=property
|title=TAEC: Unsupervised Action Segmentation with Temporal-Aware Embedding and Clustering
|pdfUrl=https://ceur-ws.org/Vol-3349/paper1.pdf
|volume=Vol-3349
|authors=Wei Lin,Anna Kukleva,Horst Possegger,Hilde Kuehne,Horst Bischof
|dblpUrl=https://dblp.org/rec/conf/cvww/LinKPKB23
}}
==TAEC: Unsupervised Action Segmentation with Temporal-Aware Embedding and Clustering==
TAEC: Unsupervised Action Segmentation with Temporal-Aware Embedding and Clustering

Wei Lin¹,², Anna Kukleva³, Horst Possegger¹, Hilde Kuehne⁴ and Horst Bischof¹
¹ Institute of Computer Graphics and Vision, Graz University of Technology, Austria
² Christian Doppler Laboratory for Semantic 3D Computer Vision, Austria
³ Max-Planck-Institute for Informatics, Germany
⁴ Goethe University Frankfurt, Germany

26th Computer Vision Winter Workshop, Robert Sablatnig and Florian Kleber (eds.), Krems, Lower Austria, Austria, Feb. 15-17, 2023.
Contact: wei.lin@icg.tugraz.at (W. Lin); akukleva@mpi-inf.mpg.de (A. Kukleva); possegger@icg.tugraz.at (H. Possegger); kuehne@uni-frankfurt.de (H. Kuehne); bischof@icg.tugraz.at (H. Bischof)

Abstract
Temporal action segmentation in untrimmed videos has gained increased attention recently. However, annotating action classes and frame-wise boundaries is extremely time consuming and cost intensive, especially on large-scale datasets. To address this issue, we propose an unsupervised approach for learning action classes from untrimmed video sequences. In particular, we propose a temporal embedding network that combines relative time prediction, feature reconstruction, and sequence-to-sequence learning, to preserve the spatial layout and sequential nature of the video features. A two-step clustering pipeline on these embedded feature representations then allows us to enforce temporal consistency within, as well as across videos. Based on the identified clusters, we decode the video into coherent temporal segments that correspond to semantically meaningful action classes. Our evaluation on three challenging datasets shows the impact of each component and, furthermore, demonstrates our state-of-the-art unsupervised action segmentation results.

Keywords: Unsupervised learning, unsupervised clustering, action segmentation

1. Introduction

Action recognition has seen tremendous success in recent years, especially in the context of short video clip classification [1, 2, 3], action detection [4, 5, 6], and temporal action segmentation [7, 8, 9, 10]. The top-performing methods for temporal action segmentation, however, require frame-wise annotations, which are expensive and impractical for large-scale datasets [7, 8, 9, 11]. Consequently, a large body of research focuses on weakly-supervised approaches where only an ordered list [12, 13, 14, 15, 16] or an unordered set [17, 18, 19] of action labels is needed. These approaches assume that the actions that occur in each training video are known, and sometimes even require their exact ordering. Acquiring such ordered action lists, however, can still be time consuming or even infeasible. For applications like indexing large video datasets or human behavior analysis in neuroscience or medicine, it is often unclear what actions should be annotated. In these cases, it is necessary to automatically discover and identify recurring actions in large video datasets.

To address this problem, Sener and Yao [20] proposed the task of unsupervised action segmentation, i.e., identifying patterns of recurring actions in long, untrimmed video sequences that correspond to semantically meaningful action classes. Recent approaches for unsupervised action segmentation, e.g. [20, 21, 22], focus on three aspects: (1) finding a suitable embedding space for the video data, (2) identifying clusters of temporal segments across a large amount of videos, and (3) parsing the input videos given the respective feature embedding and cluster information. Sener and Yao [20] tackle the action segmentation task with a linear embedding and Mallow decoding, while other approaches follow the pipeline of an MLP [21] or U-Net embedding [22], K-means clustering, and Viterbi decoding. However, these methods do not fully leverage the temporal relationships of frames within a video, either neglecting this information when learning the embedding [20, 22] or when clustering [21, 22].

Since temporal consistency is essential for all steps of the action segmentation pipeline, we propose TAEC, an approach that considers the sequential nature of frames in a video for both embedding learning and clustering. The main steps of our approach are illustrated in Fig. 1. Specifically, we first propose a sequence-to-sequence temporal embedding network (SSTEN) that combines the pretext tasks of relative timestamp prediction and autoencoder feature reconstruction. While the autoencoder reconstruction retains the feature layout, the relative timestamp prediction encodes the relative temporal information within a video. Sequence-to-sequence learning enables the embedding of spatial layout and temporal information on a complete video sequence.

Figure 1: Pipeline of TAEC. We compute the embedded features with the sequence-to-sequence temporal embedding network (SSTEN). Thereupon, we perform a within-video clustering on each video individually and apply a cross-video global cluster assignment to group the within-video clusters into global clusters. The global cluster assignment also defines the ordering of the clusters in each video. Finally, we use Viterbi decoding to estimate temporally coherent segments for each video.

To cluster the embedded features, we propose a temporal-aware two-step clustering approach that consists of a within-video clustering and a cross-video global cluster assignment. Specifically, we perform clustering within each video, based on a spatio-temporal similarity among frames. Then we conduct a global cluster assignment to group the clusters across videos. The global cluster assignment defines the ordering of the clusters for each video. In this way, we overcome the unrealistic assumption that actions of an activity always follow the same temporal order. Such an assumption is commonly used in related works, e.g. [21, 22]. For instance, in the activity of making coffee, a unified temporal order between actions such as adding milk and adding sugar is assumed for all videos of making coffee, whereas our approach can handle changes of the action order in different videos. After assigning all within-video clusters to a set of global clusters, we perform Viterbi decoding to obtain a segmentation into temporally coherent segments.

Our contributions can be summarized as follows:
• We design a sequence-to-sequence temporal embedding network (SSTEN), which combines relative timestamp prediction, autoencoder reconstruction and sequence-to-sequence learning.
• We propose a within-video clustering with a novel spatio-temporal similarity formulation among frames.
• We propose a cross-video global cluster assignment to group within-video clusters across videos into global clusters, which also overcomes the assumption that in all videos of an activity, actions follow the same temporal order.

2. Related Work

Unsupervised learning of video representations is commonly performed via pretext tasks, such as reconstruction [23, 24], future frame prediction [22, 25, 26], and recognition of frame orders [27, 28, 29, 30, 31]. For instance, Srivastava et al. [24] exploit an LSTM-based autoencoder for learning video representations. Villegas et al. [26] and Denton and Birodkar [25] employ two encoders to generate feature representations of content and motion. The temporal order of frames or small chunks is utilized as a self-supervision signal for representation learning on short video clips in [27] and [28]. Inspired by these approaches, we employ two self-supervision tasks: feature reconstruction and relative time prediction.

Clustering of temporal sequences has been explored for parsing human motions [32, 33, 34, 35]. While Zhang et al. [35] proposed a hierarchical dynamic clustering framework, Li et al. [33] and Tierney et al. [34] explored temporal subspace clustering to segment human motion data. In contrast to unsupervised action segmentation, these methods are applied on each temporal sequence individually and do not consider the association among sequences. Instead, we propose a cross-video global cluster assignment to group within-video clusters across different videos into global clusters.

Unsupervised action segmentation of fine-grained activities has been addressed by recent work that focuses either on representation learning [20, 22, 36] or on the clustering step [23]. However, the temporal information is neglected in at least one of these two steps. For representation learning, Sener and Yao [20] construct a feature embedding by learning a linear mapping from visual features to a latent space with a ranking loss. However, the linear model trained on individual frames does not consider the temporal association between frames. VidalMata et al. [22] employ a U-Net trained on individual frames for future frame prediction. Predicting one or a few steps ahead only requires temporal relations within a small temporal window. Instead, we propose to learn a representation by predicting the complete sequence of relative timestamps to encode the long-range temporal information.

For the clustering step, related works [22, 23] neglect the temporal consistency of frames within a video. Instead, we apply within-video clustering on each video with a proposed similarity formulation that considers both spatial and temporal distances.

Two recent approaches perform clustering [37] or cluster-agnostic boundary detection [38] on each video separately, without identifying clusters or segments across videos. [37] solves a task similar to human motion parsing and evaluates the segmentation for each video individually.
[38] only detects boundaries of category-agnostic segments, and does not identify whether segments within a video or across videos belong to the same category. On the contrary, our segments on all videos are category-aware, as they are aligned globally across videos by our global cluster assignment.

3. Temporal-Aware Embedding and Clustering (TAEC)

We address unsupervised action segmentation as illustrated in Fig. 1. First, we learn a suitable feature embedding (Sec. 3.1). We then perform within-video clustering on each video (Sec. 3.2.1) and group the within-video clusters into global clusters (Sec. 3.2.2). Finally, we compute temporally coherent segments on each video using Viterbi decoding (Sec. 3.3).

3.1. SSTEN: Sequence-to-Sequence Temporal Embedding Network

To learn a latent representation for temporal sequences, we adopt a sequence-to-sequence autoencoder. Inspired by the multi-stage temporal convolutional network [7], we use a concatenation of two stages for both encoder and decoder, as shown in Fig. 2. Given a set {X_n}_{n=1}^{N} of N videos, where each video X_n = {x_{t,n}}_{t=1}^{T_n} has T_n frames, the outputs are the reconstructed frame features {x̂_{t,n}}_{t=1}^{T_n}. The embedded features are the hidden representation {e_{t,n}}_{t=1}^{T_n}.

Figure 2: Architecture of SSTEN: We stack 2 stages of encoder and decoder for sequence-to-sequence feature reconstruction. Each stage consists of Q dilated residual layers with dilated temporal convolution. The intermediate representation in the encoder is used for relative time prediction (red block).

Every encoder and decoder stage consists of 1×1 convolution layers for dimension adjustment (Fig. 2, blue) and Q dilated residual layers (green), each containing a dilated temporal 1D convolution. Since no fully connected layers are employed, sequences of variable lengths can be processed seamlessly. The dilation rate at the q-th layer is 2^{q-1}. By stacking dilated residual layers, the temporal receptive field increases exponentially: the receptive field of the q-th layer is 1 + (r-1)(2^q - 1), where r is the kernel size. Therefore, each frame in the hidden representation has a long temporal dependency on the input video. In each encoder stage, we use a 1×1 convolution layer (in red) to predict the frame-wise relative timestamps s_{t,n} = t/T_n. At the end of each encoder stage, the hidden representation is a concatenation (in yellow) of the features from the dilated residual layers and the predicted relative timestamps. The training loss is

\mathcal{L} = \lambda \sum_{n=1}^{N}\sum_{t=1}^{T_n} \left\lVert \mathbf{x}_{t,n} - \hat{\mathbf{x}}_{t,n} \right\rVert_2^2 + \sum_{p\in\{1,2\}}\sum_{n=1}^{N}\sum_{t=1}^{T_n} \left( s_{t,n} - \hat{s}_{p,t,n} \right)^2,   (1)

where the coefficient λ balances the two terms and ŝ_{p,t,n} is the timestamp predicted by the p-th encoder stage. The pretext tasks of reconstruction and relative timestamp prediction encode both the spatial distribution and the global temporal information into the embedded features. We compare SSTEN with several baseline embedding networks in the supplementary.
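To make the stage structure and the loss of Eq. (1) concrete, below is a minimal PyTorch sketch of one SSTEN-style encoder stage. It is an illustration under assumptions (kernel size r = 3, a sigmoid on the timestamp head, and all class and function names are ours), not the authors' implementation; the second encoder stage and the decoder stages, which mirror this structure, are omitted.

```python
# Minimal sketch of an SSTEN-style encoder stage and the loss of Eq. (1).
# Assumptions: PyTorch, kernel size r=3, sigmoid-bounded timestamp head.
import torch
import torch.nn as nn

class DilatedResidualLayer(nn.Module):
    """Dilated temporal 1D convolution with a residual connection."""
    def __init__(self, channels, dilation, kernel_size=3):
        super().__init__()
        padding = (kernel_size - 1) // 2 * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=padding, dilation=dilation)
        self.relu = nn.ReLU()
        self.out = nn.Conv1d(channels, channels, 1)

    def forward(self, x):                        # x: (B, C, T)
        return x + self.out(self.relu(self.conv(x)))

class EncoderStage(nn.Module):
    """1x1 dimension adjustment, Q dilated residual layers (dilation 2^(q-1)),
    a 1x1 head for relative timestamps, and concatenation of features + timestamps."""
    def __init__(self, in_dim, hidden_dim=32, num_layers=10):
        super().__init__()
        self.adjust = nn.Conv1d(in_dim, hidden_dim, 1)
        self.layers = nn.ModuleList(
            DilatedResidualLayer(hidden_dim, 2 ** q) for q in range(num_layers))
        self.time_head = nn.Conv1d(hidden_dim, 1, 1)

    def forward(self, x):                        # x: (B, D_in, T)
        h = self.adjust(x)
        for layer in self.layers:
            h = layer(h)
        s_hat = torch.sigmoid(self.time_head(h))  # predicted relative timestamps in [0, 1]
        return torch.cat([h, s_hat], dim=1), s_hat.squeeze(1)

def ssten_loss(x, x_hat, s, s_hat_stage1, s_hat_stage2, lam=0.002):
    """Eq. (1): weighted reconstruction loss + timestamp loss from both encoder stages."""
    reconstruction = ((x - x_hat) ** 2).sum()
    time_prediction = ((s - s_hat_stage1) ** 2).sum() + ((s - s_hat_stage2) ** 2).sum()
    return lam * reconstruction + time_prediction
```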
3.2. Two-Step Clustering

After learning the feature embedding, we group the embedded features into K clusters by a within-video clustering and a cross-video global cluster assignment.

3.2.1. Within-Video Clustering

We perform spectral clustering on the frames within each video (a detailed description is given in the supplementary). Given the embedded feature sequence [e_1, e_2, ..., e_T] (for ease of notation, we omit the video index n), we build a frame-to-frame similarity matrix G ∈ R^{T×T}. The entries g(i, j), i, j ∈ {1, ..., T}, represent the similarity between frame i and frame j. To consider both the spatial and the temporal distance of features, we propose to measure the similarity by the product of two Gaussian kernels,

g(i,j) = \exp\!\left( -\frac{\lVert \mathbf{e}_i - \mathbf{e}_j \rVert_2^2}{\sigma_{\mathrm{spat}}^2} \right) \cdot \exp\!\left( -\frac{(s_i - s_j)^2}{\sigma_{\mathrm{tmp}}^2} \right),   (2)

where s_i, s_j are the corresponding relative timestamps of frames i, j, and σ_spat, σ_tmp are the scaling factors of the spatial and temporal Gaussian kernels. To avoid manually tuning σ_spat, we use local scaling [39] to estimate σ_spat dynamically. To this end, we replace σ_spat² by σ_i σ_j, where σ_i is the distance from e_i to its m-th nearest neighbor in the embedding space. We provide an ablation study on the scaling of the spatio-temporal similarity in the supplementary. Consequently, frames of similar visual content and similar relative timestamps are encouraged to be grouped into the same cluster.
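As an illustration of this step, the sketch below builds the similarity matrix of Eq. (2) with local scaling and clusters the frames of one video via spectral clustering on the precomputed affinity. The use of scikit-learn, the defaults m = 9 and σ' = 1/6 (values taken from the supplementary ablations), and the function names are assumptions, not the authors' code.

```python
# Sketch of the spatio-temporal similarity (Eq. (2)) with local scaling [39]
# and within-video spectral clustering. Illustrative only.
import numpy as np
from sklearn.cluster import SpectralClustering

def spatiotemporal_similarity(E, m=9, sigma_prime=1.0 / 6.0):
    """E: (T, D) embedded frames of one video. Returns the (T, T) similarity matrix."""
    T = E.shape[0]
    s = np.arange(1, T + 1, dtype=np.float64) / T                 # relative timestamps t/T
    sq_dist = ((E[:, None, :] - E[None, :, :]) ** 2).sum(-1)      # pairwise ||e_i - e_j||^2
    # local scaling: sigma_i = distance of e_i to its m-th nearest neighbor
    sigma = np.sqrt(np.sort(sq_dist, axis=1)[:, m])
    spatial = np.exp(-sq_dist / (sigma[:, None] * sigma[None, :] + 1e-8))
    temporal = np.exp(-((s[:, None] - s[None, :]) ** 2) / (2.0 * sigma_prime ** 2))
    return spatial * temporal

def within_video_clustering(E, K):
    """Spectral clustering of one video's frames on the precomputed affinity."""
    G = spatiotemporal_similarity(E)
    return SpectralClustering(n_clusters=K, affinity='precomputed').fit_predict(G)
```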
3.2.2. Cross-Video Global Cluster Assignment

After the within-video clustering, we assign the N × K within-video clusters across videos to K global clusters. Every global cluster should contain N within-video clusters, each coming from a different video (c.f. Fig. 1). This can be interpreted as an N-dimensional assignment problem [40].

We regard the n-th video V_n = {c_{k,n} | k = 1, ..., K} as a vertex set, where each k-th within-video cluster c_{k,n} is a vertex. We construct an N-partite graph G = (V_1 ∪ V_2 ∪ ... ∪ V_N, E), where E = ∪_{m<n, m,n∈{1,...,N}} {(c, c') | c ∈ V_m, c' ∈ V_n} is the set of edges between within-video clusters across videos. The edge weight w(c, c') is the distance between the centroids of two within-video clusters c, c'. The solution to the N-dimensional assignment is a partition that divides the graph G into K cliques Z_1, Z_2, ..., Z_K. A clique Z_k, which is a subset of N vertices from N different vertex sets, defines the k-th global cluster. The induced sub-graphs of the cliques Z_1, Z_2, ..., Z_K are complete and disjoint. We denote the edge set of the induced sub-graph of Z_k as E_{Z_k}. The cost of a clique is the sum of pairwise edge weights between the contained vertices. The cost of an assignment solution is the sum of the costs of all K cliques, i.e.,

\mathcal{L}(Z_1, Z_2, ..., Z_K) = \sum_{k=1}^{K} \sum_{(c,c') \in E_{Z_k}} w(c, c').   (3)

In order to solve this NP-hard problem, we employ an iterative multiple-hub heuristic [41]. In each iteration, we choose a hub vertex set V_h = {c_{k,h} | k = 1, ..., K}, leaving N-1 non-hub vertex sets. We compute an assignment solution in each iteration in two steps, as shown in Fig. 3: (1) We first perform N-1 bipartite matchings between V_h and each of the non-hub vertex sets. (2) Secondly, we determine the edge connections between pairs of non-hub vertex sets: on two non-hub vertex sets V_m and V_n, we connect two vertices c_{i,m} ∈ V_m and c_{i',n} ∈ V_n if c_{i,m} and c_{i',n} are matched to the same hub vertex c_{k,h} in V_h.

Figure 3: In the h-th iteration, h ∈ {1, ..., N}, the hub vertex set V_h is chosen and an assignment is computed by assigning the vertices between V_h and the N-1 non-hub vertex sets. Solid lines denote bipartite matching results between V_h and non-hub vertex sets (step (1)). Dashed lines denote connections between non-hub vertex sets (step (2)).

After the two steps, every hub vertex c_{k,h}, with k ∈ {1, ..., K}, and all the non-hub vertices connected to c_{k,h} form the k-th clique Z_k. Therefore, the N-partite graph G is partitioned into K complete and disjoint subgraphs.

By iterating over all possible initial hub vertex sets h ∈ {1, ..., N}, we choose the assignment solution f_ĥ which minimizes the assignment cost,

\hat{h} = \arg\min_{h \in \{1,...,N\}} \sum_{(c,c') \in E} f_h(c, c') \cdot w(c, c'),   (4)

where f_h(c, c'), ∀(c, c') ∈ E, is a binary indicator function describing the edge connections of the h-th solution: f_h(c, c') equals 1 when the two vertices c, c' are connected. The chosen solution f_ĥ describes the partition which yields the K global clusters.
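The following sketch illustrates the multiple-hub heuristic under the assumption that each video's within-video clusters are represented by their centroid vectors; SciPy's Hungarian solver stands in for the bipartite matchings of step (1), and all names are illustrative rather than the authors' implementation.

```python
# Sketch of the multiple-hub heuristic for the cross-video global cluster assignment.
# Illustrative only: centroids and names are assumptions, not the authors' code.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def clique_cost(centroids, labels, K):
    """Cost of Eq. (3): sum of pairwise centroid distances inside each global cluster."""
    total = 0.0
    for k in range(K):
        members = np.stack([centroids[n][np.where(labels[n] == k)[0][0]]
                            for n in range(len(centroids))])
        d = cdist(members, members)
        total += d[np.triu_indices(len(members), k=1)].sum()
    return total

def multi_hub_assignment(centroids):
    """centroids: list of N arrays, each (K, D) -- within-video cluster centroids.
    Returns labels[n][k] = global cluster id of the k-th cluster of video n."""
    N, K = len(centroids), centroids[0].shape[0]
    best_cost, best_labels = np.inf, None
    for h in range(N):                           # try every video as the hub vertex set
        labels = [np.empty(K, dtype=int) for _ in range(N)]
        labels[h] = np.arange(K)                 # hub cluster k seeds global cluster k
        for n in range(N):
            if n == h:
                continue
            # step (1): bipartite matching between hub clusters and video n's clusters
            cost = cdist(centroids[h], centroids[n])        # (K, K) centroid distances
            row, col = linear_sum_assignment(cost)
            labels[n][col] = row
        # step (2) is implicit: non-hub clusters matched to the same hub vertex
        # end up in the same clique, i.e. the same global cluster id.
        total = clique_cost(centroids, labels, K)
        if total < best_cost:
            best_cost, best_labels = total, labels
    return best_labels
```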
3.3. Frame Labeling by Viterbi Decoding

Given the embedded feature sequence e_{1∼T_n,n} of video n, we determine the optimal label sequence ŷ_{1∼T_n,n}. The posterior probability can be factorized into the product of the likelihoods and the probability of a given temporal order, i.e.,

\hat{y}_{1\sim T_n,n} = \arg\max_{y_{1\sim T_n,n}} p(y_{1\sim T_n,n} \mid \mathbf{e}_{1\sim T_n,n}) = \arg\max_{y_{1\sim T_n,n}} \left\{ \prod_{i=1}^{T_n} p_n(\mathbf{e}_{i,n} \mid y_{i,n}) \cdot \prod_{i=1}^{T_n} p_n(y_{i,n} \mid y_{1\sim i-1,n}) \right\}.

We fit a Gaussian model on each global cluster and compute the frame-wise likelihoods, i.e., p_n(x | k) = N_k(x; μ_k, Σ_k), k ∈ {1, ..., K}. The temporal order constraint is used to limit the search space for the optimal label sequence by filtering out sequences that do not follow the temporal order.

The related works [21, 22] apply K-means on the frames of all videos. From this unified clustering they derive only a single temporal order of clusters for all videos. However, this is an unrealistic assumption due to interchangeable steps in the activities, e.g., pour milk and pour sugar in making coffee. Instead, we can easily derive the temporal order for each video separately. We do so by sorting the within-video clusters according to the average timestamp of the frames in each cluster. The output of the Viterbi decoding is the optimal cluster label sequence ŷ_{1∼T_n,n}. More details are given in the supplementary.

4. Experiments

4.1. Datasets & Evaluation Metrics

We evaluate on Breakfast [42], the YouTube Instructions dataset (YTI) [36] and 50 Salads [43]. Breakfast comprises 1712 videos recorded in various kitchens, covering 10 composite activities of breakfast preparation. YTI is composed of 150 videos of 5 activities collected from YouTube. 50 Salads contains 50 videos of people preparing salads. Following [20, 21, 22], we use the dense trajectory Fisher vector features (DTFV) [44] for Breakfast and 50 Salads, and the features provided by Alayrac et al. [36] on YTI. We use the evaluation protocol of [21] and report the performance with three metrics: (1) Mean over Frames (MoF) is the frame-level accuracy over the frames of all videos. More frequent or longer action instances have a higher impact on this score. (2) Class-wise mean Intersection over Union (cIoU) is the average of the IoU performance over the classes and penalizes segmentation results with dominating segments. (3) The F1-score penalizes results with oversegmentation.

4.2. Implementation Details

For our SSTEN, we adapt the number of dilated residual layers Q to the dataset size: we set Q = 5 for YTI (15k frames per activity subset on average) and Q = 10 for Breakfast (360k) and 50 Salads (577k). The dimension of the hidden representation is set to 32. We set λ in Eq. (1) to 0.002 (Breakfast), 0.01 (YTI) and 0.005 (50 Salads). For clustering, we follow the protocol of [20, 36] and define the number of clusters K separately for each activity as the maximum number of ground truth classes. The values of K for the three datasets are provided in the supplementary material.

4.3. Comparison with the State-of-the-Art

We compare with unsupervised learning methods, as well as weakly and fully supervised approaches, on Breakfast (Table 1), YTI (Table 2) and 50 Salads (Table 3). Most unsupervised segmentation approaches yield cluster-aware segments that are aligned across all videos [20, 21, 22, 36, 45]. These approaches are evaluated with global Hungarian matching on all videos, where the mapping between ground truth classes and clusters is determined on all videos of an activity, which results in one mapping per activity. The number of clusters K is set to the maximum number of ground truth classes for each activity (i.e., K = max. #gt). We focus on the performance comparison in this setting and follow it in all ablation studies.
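For reference, a small sketch of this evaluation protocol: the global Hungarian matching determines one cluster-to-class mapping per activity from the confusion matrix accumulated over all of its videos; for local matching, the same routine would simply be applied to each video on its own. SciPy's Hungarian solver and the function name are our choices for illustration, not the authors' evaluation code.

```python
# Sketch of global Hungarian matching for evaluation: map predicted cluster ids to
# ground-truth classes over all videos of one activity so that MoF is maximized.
import numpy as np
from scipy.optimize import linear_sum_assignment

def global_hungarian_mof(gt, pred, num_classes, num_clusters):
    """gt, pred: 1-D frame label arrays concatenated over all videos of one activity."""
    # confusion[c, k] = number of frames with ground-truth class c and predicted cluster k
    confusion = np.zeros((num_classes, num_clusters), dtype=np.int64)
    np.add.at(confusion, (gt, pred), 1)
    # Hungarian matching maximizes the matched frames -> minimize the negated counts
    rows, cols = linear_sum_assignment(-confusion)
    matched = confusion[rows, cols].sum()
    return matched / len(gt)          # Mean over Frames under one activity-level mapping
```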
Two recent approaches perform clustering (TW-FINCH [37]) or category-agnostic boundary detection (LSTM+AL [38]) on each video individually, without solving the alignment of clusters or segments across videos. For a fair comparison, these are evaluated by local Hungarian matching on individual videos, where the best ground-truth-to-cluster-label mapping is determined using the ground truth of each video separately. This results in a separate label mapping for each video. Following [37], we also report results with K set to the average number of actions per activity (i.e., K = avg. #gt) for a complete comparison.

Table 1: Comparison with state-of-the-art approaches on Breakfast (in %). * denotes an approach without segment alignment across videos, ‡ denotes our reimplementation, underlined scores are acquired from the authors.
Approach | Supervision | MoF | IoU | F1
MSTCN++ [8] | full | 67.6 | - | -
G-FRNet [46] | full | 67.7 | - | -
DTGRM [10] | full | 68.3 | - | -
SSTDA [47] | full | 70.3 | - | -
BCN [9] | full | 70.4 | - | -
Global2local [48] | full | 70.7 | - | -
ASFormer [49] | full | 73.5 | - | -
NN-Viterbi [15] | weak | 43.0 | - | -
D3TW [12] | weak | 45.7 | - | -
TASL [50] | weak | 47.8 | - | -
CDFL [13] | weak | 50.2 | - | -
Global Hungarian matching on all videos (K = max. #gt)
Mallow [20] | w/o | 34.6 | - | -
UNet+MLP [22] | w/o | 48.1 | - | 29.9
ASAL [45] | w/o | 52.5 | - | 37.9
MLP+kmeans [21] | w/o | 41.8 | - | 26.4
MLP+kmeans ‡ | w/o | 42.9 | 13.1 | 25.5
TAEC | w/o | 50.3 | 19.0 | 33.6
Local Hungarian matching on each video (K = max. #gt)
LSTM+AL [38]* | w/o | 42.9* | 46.9* | -
UNet+MLP [22] | w/o | 52.2 | - | -
TW-FINCH [37]* | w/o | 57.8* | - | -
MLP+kmeans ‡ | w/o | 61.2 | 30.3 | 35.9
TAEC | w/o | 64.3 | 41.2 | 42.6
Local Hungarian matching on each video (K = avg. #gt)
TW-FINCH [37]* | w/o | 62.7* | 42.3* | -
MLP+kmeans ‡ | w/o | 60.6 | 27.8 | 46.4
TAEC | w/o | 62.6 | 32.3 | 49.6

In Table 1, TAEC achieves strong results in comparison to the unsupervised state-of-the-art and is even comparable to weakly supervised approaches. Although approaches that do not solve the alignment of clusters across videos inherently obtain better scores under local Hungarian matching, our approach still compares favorably.

We compare qualitative results (with global Hungarian matching) of TAEC and MLP+kmeans [21] on 3 Breakfast activities in Fig. 4. Our two-step clustering (the 2nd rows in all clustering result plots) already leads to temporally consistent segments with relatively accurate boundaries of action instances, while K-means (the 4th rows in all clustering result plots) results in serious oversegmentation. The Viterbi decoding further improves the segmentation by suppressing the oversegmentation and the domination of incorrect clusters (the 2nd rows in all final result plots). Moreover, MLP+kmeans [21] follows the constraint of a fixed temporal order of segments across the videos of each activity (the 4th rows in all final result plots). In contrast, TAEC yields an individual temporal order for each video (the 2nd rows in all final result plots). Additional qualitative results and evaluation scores are included in the supplemental material.

Figure 4: Qualitative results of clustering and final segmentation (with global Hungarian matching) on 3 activities (Making cereals, Making juice, Making fried egg; 3 videos each) on Breakfast. For each video, the 4-row group displays the ground truth (1st row), TAEC (2nd row), TAEC with naïve assignment (3rd row, quantitative comparison in Sec. 4.6) and MLP+kmeans [21] (4th row). More quantitative and qualitative segmentation results are given in the supplementary.

Table 2: Comparison with the state-of-the-art on YTI (in %). * denotes an approach without segment alignment across videos, ‡ denotes our reimplementation, underlined scores are acquired from the authors.
Approach | Supervision | MoF (w/o bg) | IoU (w/o bg) | F1 (w/o bkg) | MoF (w bg) | IoU (w bg)
Global Hungarian matching on all videos (K = max. #gt)
Frank-Wolfe [36] | w/o | - | - | 24.4 | - | -
Mallow [20] | w/o | 27.8 | - | 27.0 | - | -
UNet+MLP [22] | w/o | 28.9 | 8.3 | 29.9 | - | -
ASAL [45] | w/o | 44.9 | - | 32.1 | - | -
MLP+kmeans [21] | w/o | 39.0 | 9.8 | 28.3 | 14.5 | 9.6
MLP+kmeans ‡ | w/o | 39.4 | 9.9 | 29.6 | 14.4 | 9.7
TAEC | w/o | 46.6 | 10.7 | 29.5 | 17.0 | 10.5
Local Hungarian matching on each video (K = max. #gt)
LSTM+AL [38]* | w/o | - | - | 39.7* | - | -
MLP+kmeans ‡ | w/o | 62.2 | 21.6 | 47.0 | 22.7 | 21.6
TAEC | w/o | 67.9 | 23.7 | 49.4 | 24.8 | 23.6
Local Hungarian matching on each video (K = avg. #gt)
TW-FINCH [37]* | w/o | 56.7* | - | 48.2* | - | -
MLP+kmeans ‡ | w/o | 63.6 | 20.5 | 52.3 | 23.2 | 20.5
TAEC | w/o | 65.3 | 20.9 | 51.0 | 23.9 | 20.8

For the YouTube Instructions dataset, we follow the protocol of [20, 21, 36] and report the results with and without considering background frames. Here, our TAEC outperforms all recent works in almost all metrics under all three settings.

Table 3: Comparison with the state-of-the-art on 50 Salads (in %). * denotes an approach without segment alignment across videos, ‡ denotes our reimplementation.
Eval level (12 classes)
Approach | Supervision | MoF | IoU | F1
STCNN [51] | full | 72.0 | - | -
EDTCN [52] | full | 73.4 | - | -
MA [53] | full | 88.5 | - | -
Global Hungarian matching on all videos (K = max. #gt)
UNet+MLP [22] | w/o | 30.6 | - | -
ASAL [45] | w/o | 39.2 | - | -
MLP+kmeans [21] | w/o | 35.5 | - | -
MLP+kmeans ‡ | w/o | 37.9 | 24.6 | 40.2
TAEC | w/o | 48.4 | 26.0 | 44.8
Local Hungarian matching on each video (K = max. #gt)
LSTM+AL [38]* | w/o | 60.6* | - | -
MLP+kmeans ‡ | w/o | 58.0 | 33.7 | 49.8
TAEC | w/o | 59.7 | 35.0 | 54.4
Local Hungarian matching on each video (K = avg. #gt)
TW-FINCH [37]* | w/o | 71.1* | - | -
MLP+kmeans ‡ | w/o | 51.5 | 22.3 | 43.4
TAEC | w/o | 59.7 | 35.7 | 54.7
Mid level (19 classes)
SSTDA [47] | full | 83.2 | - | -
MSTCN++ [8] | full | 83.7 | - | -
HASR [54] | full | 83.9 | - | -
ASRF [55] | full | 84.5 | - | -
ASFormer [49] | full | 85.6 | - | -
NNViterbi [15] | weak | 49.4 | - | -
CDFL [56] | weak | 54.7 | - | -
Global Hungarian matching on all videos (K = max. #gt)
UNet+MLP [22] | w/o | 24.2 | - | -
ASAL [45] | w/o | 34.4 | - | -
MLP+kmeans [21] | w/o | 30.2 | - | -
MLP+kmeans ‡ | w/o | 29.1 | 15.7 | 23.4
TAEC | w/o | 26.6 | 14.9 | 23.4
Local Hungarian matching on each video (K = max. #gt)
MLP+kmeans ‡ | w/o | 55.6 | 29.6 | 39.6
TAEC | w/o | 50.2 | 29.4 | 40.3
Local Hungarian matching on each video (K = avg. #gt)
TW-FINCH [37]* | w/o | 66.5* | - | -
MLP+kmeans ‡ | w/o | 53.4 | 28.1 | 39.0
TAEC | w/o | 51.9 | 30.2 | 41.9

50 Salads is a particularly challenging dataset for unsupervised approaches, as each video has a different order of actions and additionally includes many repetitive action instances. At the eval level of 12 classes, TAEC outperforms all approaches under the global Hungarian matching evaluation and achieves competitive results under local Hungarian matching. In the challenging mid-level evaluation of 19 classes, the sequential nature of frames is less advantageous; therefore, MLP+kmeans [21] outperforms TAEC. Generally, in the local matching case, approaches without alignment across videos compare favorably.

4.4. Embedding and Clustering

We first evaluate our SSTEN embedding in combination with K-means and two-step clustering on three feature types: AlexNet fc6 features [57] pre-trained on ImageNet [58], I3D features [59] pre-trained on the Kinetics dataset [60], and the precomputed dense trajectory Fisher vectors (DTFV) [44]. We also report results of the raw features without any temporal embedding. For a fair comparison, we reduce the dimensions of the three features without embedding to 32 via PCA. We conduct the experiments on Breakfast and report the results with and without our SSTEN embedding in Table 4.

Table 4: Comparison of features with and without SSTEN embedding, together with the two clustering methods, on Breakfast (in %).
Feature | Embedding | K-means MoF | K-means IoU | K-means F1 | Two-step MoF | Two-step IoU | Two-step F1
AlexNet | w/o | 25.9 | 11.3 | 22.3 | 33.7 | 10.7 | 20.2
I3D | w/o | 33.4 | 14.4 | 25.6 | 37.7 | 14.3 | 26.1
DTFV | w/o | 30.8 | 11.8 | 23.0 | 34.5 | 12.1 | 22.0
AlexNet | SSTEN | 33.0 | 14.2 | 27.0 | 39.1 | 16.1 | 30.7
I3D | SSTEN | 37.9 | 18.7 | 33.3 | 45.2 | 20.5 | 35.1
DTFV | SSTEN | 39.3 | 17.8 | 31.9 | 50.3 | 19.0 | 33.6

Comparison of raw features without embedding. Among the three types of features without temporal embedding, I3D achieves the best performance, while AlexNet features lead to the worst results. AlexNet features are computed from individual spatial frames. On the contrary, each frame feature of DTFV and I3D is computed from a chunk of its temporal neighbor frames; therefore, these features already carry intrinsic temporal consistency. Furthermore, the two-stream I3D model can leverage both RGB and optical flow, so I3D features achieve a better performance than DTFV, which relies on handcrafted dense trajectories.

Comparison of SSTEN embeddings learned on different features. When comparing the SSTEN embeddings to the performance of the raw features, we see that SSTEN leads to a significant performance gain for both clustering methods. For DTFV, the improvements by SSTEN are MoF 8.5%, IoU 6.0%, F1 8.9% with K-means and MoF 15.8%, IoU 6.9%, F1 11.6% with two-step clustering.

Among the three types of SSTEN embedded features, I3D has slightly better IoU and F1 scores while DTFV leads to the best MoF scores for both K-means and the two-step clustering. Overall, the SSTEN embeddings learned from these two features perform comparably. We conduct the following experiments using DTFV, which is also used in related works.

4.5. Impact of Loss Terms on Clustering

To evaluate the impact of the two loss terms in Eq. (1), we plot the quantitative segmentation results of SSTEN with both K-means and the two-step clustering w.r.t. different reconstruction loss coefficients λ in Fig. 5. In general, two-step clustering leads to a better performance than K-means for almost all λ values (except for the case of only the reconstruction loss). With decreasing λ, the relative time prediction loss has an increasing impact and the embedded features have better global temporal consistency, which explains the increasing IoU and F1 scores. However, at extremely small λ values, the embedded features overfit to the relative time prediction task, which results in saturated IoU and F1 scores and a significant drop in MoF for both K-means and two-step clustering.

Figure 5: Segmentation performance of both clustering methods on SSTEN embeddings with different λ on Breakfast.

To intuitively illustrate the impact of the loss terms on the two-step clustering, we plot the similarity matrices for SSTEN embeddings trained with three different λ in Fig. 6. Here, we look at the similarity matrices with the temporal Gaussian kernel (bottom row). Intuitively, the similarity matrix with a clear diagonal block structure (Fig. 6(a2)), which results from an appropriate ratio between the reconstruction loss and the relative time prediction loss (λ = 0.002), leads to the best segmentation performance. When λ becomes larger (e.g., λ = 0.01), the reconstruction loss has a larger impact and the diagonal block structure (Fig. 6(b2)) fades. Therefore, the performance of the embedded features with λ = 0.005, λ = 0.01 and only the reconstruction loss degrades successively. On the other hand, for extremely small λ values (e.g., λ = 0.0005), the block diagonal structure (Fig. 6(c2)) becomes noisy due to overfitting on relative time prediction.

Figure 6: Frame-to-frame similarity matrices of SSTEN embeddings for the same Breakfast video. Columns show the similarity matrices for different λ, while the rows show results without (top) and with (bottom) the temporal Gaussian kernel.

Therefore, both the reconstruction and the relative timestamp prediction loss, when combined in an appropriate ratio, are indispensable for learning an effective representation that preserves both the spatial layout and the temporal information.

4.6. Impact of Cluster Assignment

In this ablation study, we evaluate the efficacy of the global cluster assignment. For two-step clustering, we evaluate two strategies of grouping within-video clusters into global clusters: (1) the naïve assignment, for which we order the sub-clusters according to the average timestamp and simply group the k-th sub-clusters of all videos into one global cluster, i.e., Z_k = {c_{k,n} | n = 1, ..., N}, and (2) the global cluster assignment, as detailed in Sec. 3.2.2.

To show how the different cluster assignment strategies affect the clustering, we report both the results of the two-step clustering (before Viterbi decoding) and the final segmentation performance (after Viterbi decoding) on Breakfast and 50 Salads in Table 5. The global cluster assignment outperforms the naïve assignment by a large margin for both the clustering results and the final segmentation results, on both datasets. The advantage of the global cluster assignment is even more evident on 50 Salads.

Table 5: Impact of cluster assignment strategies for two-step clustering on SSTEN embeddings on Breakfast and 50 Salads (eval level, i.e., 12 classes) (in %).
Dataset | Strategy | Clustering MoF | Clustering IoU | Clustering F1 | Final MoF | Final IoU | Final F1
Breakfast | global cluster | 38.6 | 13.7 | 25.9 | 50.3 | 19.0 | 33.6
Breakfast | naïve | 25.1 | 12.4 | 23.9 | 42.3 | 17.7 | 32.0
50 Salads | global cluster | 45.3 | 23.0 | 43.7 | 48.4 | 26.0 | 44.8
50 Salads | naïve | 28.9 | 16.2 | 32.9 | 35.0 | 22.7 | 38.0

We illustrate exemplary qualitative results of the clustering and the final segmentation for 3 activities (with 3 videos each) on Breakfast in Fig. 4.
For each video, the 4-row group in Fig. 4 displays the ground truth (1st row), the result with global cluster assignment (2nd row) and the result with naïve assignment (3rd row); the 4th row shows the result of MLP+kmeans [21]. Comparing the 3rd rows of the cereals final-result plots in Fig. 4, we see that the naïve assignment simply assumes that the sub-clusters at the same temporal position in each video belong to the same global cluster, even though they might not be close to each other in the feature space. On the contrary, the global cluster assignment (the 2nd rows of the cereals final-result plots) yields an optimal assignment with respect to the pairwise distances between sub-clusters, resulting in different orderings of sub-clusters in each video. Note that on some videos, the global cluster assignment can lead to the same result as the naïve assignment.

5. Conclusion

We proposed a new pipeline for the unsupervised learning of action segmentation. For the feature embedding, we propose a temporal-aware embedding network that performs sequence-to-sequence learning with the pretext tasks of relative timestamp prediction and feature reconstruction. For clustering, we propose a two-step clustering scheme, consisting of within-video clustering and cross-video global cluster assignment. The temporal embedding via sequence-to-sequence learning together with the two-step clustering proves to be a well-suited combination that considers the sequential nature of frames in both processing steps. Ultimately, we combine the temporal embedding with a frame-to-cluster assignment based on Viterbi decoding, which achieves the unsupervised state-of-the-art on three challenging benchmarks.

References

[1] Z. Wang, Q. She, A. Smolic, Action-net: Multipath excitation for action recognition, in: CVPR, 2021.
[2] C. Feichtenhofer, X3D: Expanding architectures for efficient video recognition, in: CVPR, 2020.
[3] C. Yang, Y. Xu, J. Shi, B. Dai, B. Zhou, Temporal pyramid network for action recognition, in: CVPR, 2020.
[4] M. Xu, C. Zhao, D. S. Rojas, A. Thabet, B. Ghanem, G-TAD: Sub-graph localization for temporal action detection, in: CVPR, 2020.
[5] P. Zhao, L. Xie, C. Ju, Y. Zhang, Y. Wang, Q. Tian, Bottom-up temporal action localization with mutual regularization, in: ECCV, 2020.
[6] Y. Bai, Y. Wang, Y. Tong, Y. Yang, Q. Liu, J. Liu, Boundary content graph neural network for temporal action proposal generation, in: ECCV, 2020.
[7] Y. A. Farha, J. Gall, MS-TCN: Multi-stage temporal convolutional network for action segmentation, in: CVPR, 2019.
[8] S. Li, Y. A. Farha, Y. Liu, M.-M. Cheng, J. Gall, MS-TCN++: Multi-stage temporal convolutional network for action segmentation, PAMI (2020).
[9] Z. Wang, Z. Gao, L. Wang, Z. Li, G. Wu, Boundary-aware cascade networks for temporal action segmentation, in: ECCV, 2020.
[10] D. Wang, D. Hu, X. Li, D. Dou, Temporal relational modeling with self-supervision for action segmentation, in: AAAI, 2021.
[11] Y. Huang, Y. Sugano, Y. Sato, Improving action segmentation via graph-based temporal reasoning, in: CVPR, 2020.
[12] C.-Y. Chang, D.-A. Huang, Y. Sui, L. Fei-Fei, J. C. Niebles, D3TW: Discriminative differentiable dynamic time warping for weakly supervised action alignment and segmentation, in: CVPR, 2019.
[13] J. Li, P. Lei, S. Todorovic, Weakly supervised energy-based learning for action segmentation, in: ICCV, 2019.
[14] A. Richard, H. Kuehne, J. Gall, Weakly supervised action learning with RNN based fine-to-coarse modeling, in: CVPR, 2017.
[15] A. Richard, H. Kuehne, A. Iqbal, J. Gall, NeuralNetwork-Viterbi: A framework for weakly supervised video learning, in: CVPR, 2018.
[16] D.-A. Huang, F.-F. Li, J. C. Niebles, Connectionist temporal modeling for weakly supervised action labeling, in: ECCV, 2016.
[17] M. Fayyaz, J. Gall, SCT: Set constrained temporal transformer for set supervised action segmentation, in: CVPR, 2020.
[18] J. Li, S. Todorovic, Set-constrained Viterbi for set-supervised action segmentation, in: CVPR, 2020.
[19] A. Richard, H. Kuehne, J. Gall, Action sets: Weakly supervised action segmentation without ordering constraints, in: CVPR, 2018.
[20] F. Sener, A. Yao, Unsupervised learning and segmentation of complex activities from video, in: CVPR, 2018.
[21] A. Kukleva, H. Kuehne, F. Sener, J. Gall, Unsupervised learning of action classes with continuous temporal embedding, in: CVPR, 2019.
[22] R. G. VidalMata, W. J. Scheirer, H. Kuehne, Joint visual-temporal embedding for unsupervised learning of actions in untrimmed sequences, in: WACV, 2021.
[23] B. L. Bhatnagar, S. Singh, C. Arora, C. V. Jawahar, Unsupervised learning of deep feature representation for clustering egocentric actions, in: IJCAI, 2017.
[24] N. Srivastava, E. Mansimov, R. Salakhudinov, Unsupervised learning of video representations using LSTMs, in: ICML, 2015.
[25] E. L. Denton, V. Birodkar, Unsupervised learning of disentangled representations from video, in: NeurIPS, 2017.
[26] R. Villegas, J. Yang, S. Hong, X. Lin, H. Lee, Decomposing motion and content for natural video sequence prediction, in: ICLR, 2017.
[27] D. Kim, D. Cho, I. S. Kweon, Self-supervised video representation learning with space-time cubic puzzles, in: AAAI, 2019.
[28] H.-Y. Lee, J.-B. Huang, M. Singh, M.-H. Yang, Unsupervised representation learning by sorting sequences, in: ICCV, 2017.
[29] V. Ramanathan, K. Tang, G. Mori, L. Fei-Fei, Learning temporal embeddings for complex video analysis, in: ICCV, 2015.
[30] B. Fernando, E. Gavves, J. M. Oramas, A. Ghodrati, T. Tuytelaars, Modeling video evolution for action recognition, in: CVPR, 2015.
[31] A. Cherian, B. Fernando, M. Harandi, S. Gould, Generalized rank pooling for activity recognition, in: CVPR, 2017.
[32] M. Hoai, F. De la Torre, Maximum margin temporal clustering, in: Artificial Intelligence and Statistics, 2012.
[33] S. Li, K. Li, Y. Fu, Temporal subspace clustering for human motion segmentation, in: CVPR, 2015.
[34] S. Tierney, J. Gao, Y. Guo, Subspace clustering for sequential data, in: CVPR, 2014.
[35] Y. Zhang, S. Tang, H. Sun, H. Neumann, Human motion parsing by hierarchical dynamic clustering, in: BMVC, 2018.
[36] J.-B. Alayrac, P. Bojanowski, N. Agrawal, J. Sivic, I. Laptev, S. Lacoste-Julien, Unsupervised learning from narrated instruction videos, in: CVPR, 2016.
[37] M. S. Sarfraz, N. Murray, V. Sharma, A. Diba, L. Van Gool, R. Stiefelhagen, Temporally-weighted hierarchical clustering for unsupervised action segmentation, in: CVPR, 2021.
[38] S. N. Aakur, S. Sarkar, A perceptual prediction framework for self supervised event segmentation, in: CVPR, 2019.
[39] L. Zelnik-Manor, P. Perona, Self-tuning spectral clustering, in: NeurIPS, 2005.
[40] W. P. Pierskalla, Letter to the editor—The multi-dimensional assignment problem, Operations Research 16 (1968) 422–431.
[41] H.-J. Bandelt, Y. Crama, F. C. Spieksma, Approximation algorithms for multi-dimensional assignment problems with decomposable costs, Discrete Applied Mathematics 49 (1994) 25–50.
[42] H. Kuehne, A. Arslan, T. Serre, The language of actions: Recovering the syntax and semantics of goal-directed human activities, in: CVPR, 2014.
[43] S. Stein, S. J. McKenna, Combining embedded accelerometers with computer vision for recognizing food preparation activities, in: ACM International Joint Conference on Pervasive and Ubiquitous Computing, 2013.
[44] H. Wang, C. Schmid, Action recognition with improved trajectories, in: ICCV, 2013.
[45] J. Li, S. Todorovic, Action shuffle alternating learning for unsupervised action segmentation, in: CVPR, 2021.
[46] D. Wang, Y. Yuan, Q. Wang, Gated forward refinement network for action segmentation, Neurocomputing 407 (2020) 63–71.
[47] M.-H. Chen, B. Li, Y. Bao, G. AlRegib, Z. Kira, Action segmentation with joint self-supervised temporal domain adaptation, in: CVPR, 2020.
[48] S.-H. Gao, Q. Han, Z.-Y. Li, P. Peng, L. Wang, M.-M. Cheng, Global2local: Efficient structure search for video action segmentation, in: CVPR, 2021.
[49] F. Yi, H. Wen, T. Jiang, ASFormer: Transformer for action segmentation, in: BMVC, 2021.
[50] Z. Lu, E. Elhamifar, Weakly-supervised action segmentation and alignment via transcript-aware union-of-subspaces learning, in: ICCV, 2021.
[51] C. Lea, A. Reiter, R. Vidal, G. D. Hager, Segmental spatiotemporal CNNs for fine-grained action segmentation, in: ECCV, 2016.
[52] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, G. D. Hager, Temporal convolutional networks for action segmentation and detection, in: CVPR, 2017.
[53] C. Fermüller, F. Wang, Y. Yang, K. Zampogiannis, Y. Zhang, F. Barranco, M. Pfeiffer, Prediction of manipulation actions, International Journal of Computer Vision 126 (2018) 358–374.
[54] H. Ahn, D. Lee, Refining action segmentation with hierarchical video representations, in: ICCV, 2021.
[55] Y. Ishikawa, S. Kasai, Y. Aoki, H. Kataoka, Alleviating over-segmentation errors by detecting action boundaries, in: WACV, 2021.
[56] J. Liu, A. Shahroudy, M. L. Perez, G. Wang, L.-Y. Duan, A. K. Chichung, NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding, PAMI (2019).
[57] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with deep convolutional neural networks, in: NeurIPS, 2012.
[58] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical image database, in: CVPR, 2009.
[59] J. Carreira, A. Zisserman, Quo vadis, action recognition? A new model and the Kinetics dataset, in: CVPR, 2017.
[60] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al., The Kinetics human action video dataset, arXiv preprint arXiv:1705.06950 (2017).

TAEC: Unsupervised Action Segmentation with Temporal-Aware Embedding and Clustering — Supplementary

1. Introduction

For additional insights into TAEC, we introduce the background of spectral clustering in Sec. 2.1 and give details of the Viterbi decoding in Sec. 2.2. We perform further ablation studies comparing baseline embeddings and clustering methods (Sec. 3.1), the scaling of the spatio-temporal similarity (Sec. 3.2), the cluster ordering (Sec. 3.3) and decoding strategies (Sec. 3.4). Finally, we provide more quantitative (Sec. 3.5) and qualitative segmentation results (Sec. 3.6) on the three datasets.

2. Method

2.1. Spectral Clustering

Background information related to Sec. 3.2.1 of the main manuscript: Given the embedded feature sequence e_1, e_2, ..., e_T, we build a frame-to-frame similarity graph G ∈ R^{T×T}, whose edge weight g(i, j), i, j ∈ {1, ..., T}, represents the similarity between frame i and frame j. Grouping the frames into K clusters can be interpreted as a graph partition problem, obtained by cutting edges of G and resulting in K subgraphs G_1, G_2, ..., G_K. The normalized cut (Ncut) problem [1] is employed to compute a balanced partition by minimizing the energy

\mathcal{L}_{cut}(G_1, G_2, ..., G_K) = \frac{1}{2} \sum_{k=1}^{K} \frac{W(G_k, \bar{G}_k)}{\mathrm{vol}(G_k)},   (1)

where W(G_k, \bar{G}_k) is the sum of edge weights between elements of the subgraph G_k and elements of all other subgraphs, i.e., the sum of the weights of the edges to be cut.
vol(G_k) is the sum of the weights of the edges within the resulting subgraph G_k. Spectral clustering [2] is a relaxed solution to the NP-hard minimization problem in Eq. (1) and has shown good performance on many graph-based clustering problems, e.g. [3, 4, 5]. Note that while K-means operates on Euclidean distances in the feature space and assumes convex and isotropic clusters, spectral clustering can find clusters with non-convex boundaries.

2.2. Frame Labeling by Viterbi Decoding

Additional explanations to Sec. 3.3 of the main manuscript: The global cluster assignment delivers the ordered clusters of each video, which are aligned across all videos. To compute the final segmentation, we use the resulting ordering and decode each video into a sequence of K temporally consistent segments. That is, we determine the optimal label sequence ŷ_{1∼T_n,n} = {y_{1,n}, ..., y_{T_n,n}} by re-assigning each frame to one of the temporally ordered clusters.

Given the embedded feature sequence e_{1∼T_n,n} = {e_{1,n}, ..., e_{T_n,n}} and the temporal order of the clusters, we search for the label sequence that maximizes the probability p(y_{1∼T_n,n} | e_{1∼T_n,n}). Following [6], this posterior probability can be factorized into the product of the likelihoods and the probability of a given temporal order, i.e.,

\hat{y}_{1\sim T_n,n} = \arg\max_{y_{1\sim T_n,n}} p(y_{1\sim T_n,n} \mid \mathbf{e}_{1\sim T_n,n})
= \arg\max_{y_{1\sim T_n,n}} \left\{ \prod_{t=1}^{T_n} p(\mathbf{e}_{t,n} \mid y_{t,n}) \cdot \prod_{t=1}^{T_n} p_n(y_{t,n} \mid y_{1\sim(t-1),n}) \right\}
= \arg\max_{y_{1\sim T_n,n}} \left\{ \prod_{t=1}^{T_n} p(\mathbf{e}_{t,n} \mid y_{t,n}) \cdot p_n(y_{t,n} \mid y_{t-1,n}) \right\}.   (2)

Here the likelihood p(e_{t,n} | y_{t,n}) is the probability of a frame embedding e_{t,n} from video n belonging to a cluster. Therefore, we fit a Gaussian distribution on each global cluster and compute the frame-wise likelihoods with the Gaussian model, i.e.,

p(\mathbf{x} \mid k) = \mathcal{N}_k(\mathbf{x}; \mu_k, \Sigma_k), \quad k \in \{1, ..., K\}.   (3)

p_n(y_{t,n} | y_{t-1,n}) is the transition probability from label y_{t-1,n} at frame t-1 to label y_{t,n} at frame t, which is defined by the temporal order of the clusters. We denote the set of frame transitions defined by the temporal order of the clusters of the n-th video by O_n; e.g., for the temporal order a → b → c → d, O_n = {a → b, b → c, c → d}. The transition probability is binary, i.e.,

p_n(y_{i,n} \mid y_{i-1,n}) = \mathbb{1}\left( y_{i,n} = y_{i-1,n} \;\vee\; y_{i-1,n} \rightarrow y_{i,n} \in O_n \right).   (4)

This means that we allow either a transition to the next cluster according to the temporal order, or we keep the cluster assignment of the previous frame.

Note that in two-step clustering, we derive the temporal order of clusters for each video separately by sorting the clusters of the video according to their average timestamp. Therefore, we have an individual O_n for each video n. On the contrary, with K-means there is one uniform order of the global clusters for all videos, and O_n is thus the same for every video n.

The Viterbi algorithm for solving Eq. (2) proceeds iteratively using dynamic programming, i.e.,

p(y_{1\sim t,n} \mid \mathbf{e}_{1\sim t,n}) = \max_{y_{t,n}} \left\{ p(y_{1\sim t-1,n} \mid \mathbf{e}_{1\sim t-1,n}) \cdot p(\mathbf{e}_{t,n} \mid y_{t,n}) \cdot p(y_{t,n} \mid y_{t-1,n}) \right\}.   (5)

Sequences that do not follow the temporal order are filtered out at an early stage to narrow down the search range for the optimal label sequence. The output of the Viterbi decoding is the optimal cluster label sequence ŷ_{1∼T_n,n}.
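A compact sketch of this dynamic program in log space is given below. It assumes the per-frame Gaussian log-likelihoods and the per-video cluster order are already computed and decodes the full ordered sequence as in Eqs. (2)-(5); the names and the assumption T ≥ number of actions are ours, not the authors' implementation.

```python
# Sketch of the order-constrained Viterbi decoding of Eqs. (2)-(5), in log space.
# Illustrative only; assumes T >= number of actions in the order.
import numpy as np

def viterbi_decode(log_lik, order):
    """log_lik: (T, K) frame-wise log-likelihoods log p(e_t | k).
    order: cluster ids in the temporal order derived for this video.
    Returns frame labels that either stay in a cluster or advance along the order."""
    T, S = log_lik.shape[0], len(order)
    dp = np.full((T, S), -np.inf)        # dp[t, j]: best log score ending at order step j
    ptr = np.zeros((T, S), dtype=int)
    dp[0, 0] = log_lik[0, order[0]]      # decoding starts with the first action of the order
    for t in range(1, T):
        for j in range(S):
            stay = dp[t - 1, j]                            # keep the previous cluster
            move = dp[t - 1, j - 1] if j > 0 else -np.inf  # advance to the next cluster
            if stay >= move:
                dp[t, j], ptr[t, j] = stay + log_lik[t, order[j]], j
            else:
                dp[t, j], ptr[t, j] = move + log_lik[t, order[j]], j - 1
    labels, j = np.empty(T, dtype=int), S - 1              # decode the full ordered sequence
    for t in range(T - 1, -1, -1):
        labels[t] = order[j]
        j = ptr[t, j]
    return labels
```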
3. Additional Results

3.1. Embedding and Clustering

Further, we compare our SSTEN embedding with three baseline variants (shown in Fig. 1): an MLP temporal embedding, an autoencoder with MLP (AEMLP) and a temporal convolutional network (TCN), each in combination with the two clustering methods. MLP uses three FC layers for relative timestamp prediction. AEMLP uses an MLP-based autoencoder for both relative timestamp prediction and feature reconstruction. TCN deploys Q stacked dilated residual layers only for relative timestamp prediction. Here, we also implement the Rankloss MLP embedding [7] for reference. We report the performance of these five embeddings in Table 1.

Figure 1: Three baseline variants of embedding networks: (a) MLP, (b) AEMLP, (c) TCN.

Table 1: Comparison of combinations of embeddings and clustering methods on Breakfast (in %), all learned on DTFV features.
Embedding | K-means MoF | K-means IoU | K-means F1 | Two-step MoF | Two-step IoU | Two-step F1
Rankloss [7] | 35.2 | 15.6 | 28.8 | 34.7 | 13.4 | 23.7
MLP | 42.9 | 13.1 | 25.5 | 32.7 | 10.9 | 21.2
AEMLP | 34.7 | 13.6 | 25.8 | 32.6 | 11.6 | 21.4
TCN | 33.4 | 17.8 | 31.3 | 40.4 | 19.1 | 32.9
SSTEN | 39.3 | 17.8 | 31.9 | 50.3 | 19.0 | 33.6

Comparison of the five embeddings. We learn the five embeddings (Rankloss MLP, MLP, AEMLP, TCN and SSTEN) on the DTFV features. Here, the Rankloss MLP (consisting of two FC layers) is trained with a ranking loss. We use the initialization of uniform segmentation as the temporal prior to train the model with only one iteration.

TCN and SSTEN are both networks for sequence-to-sequence learning, while Rankloss MLP, MLP and AEMLP are trained on individual frames. Comparing the performance of these two groups in Table 1, we see that sequence-to-sequence learning leads to better performance, especially when combined with the two-step clustering, which results in clusters with better temporal consistency.

For the two-step clustering, we also plot the frame-to-frame similarity matrices (spatial Gaussian kernel) of the five embeddings for the same Breakfast video in Fig. 2. The plots show that Rankloss MLP, MLP and AEMLP, which are trained on individual frames, do not expose an appropriate temporal structure. There are noisy block patterns even in positions far away from the diagonal, which results in noisy clusters and thus leads to erroneous temporal orders and inferior assignment results in the two-step clustering. The least noisy of them, Rankloss MLP, has the highest performance among these three. On the contrary, the TCN and SSTEN embedded features, which show a clear diagonal block structure in the similarity matrix, achieve better performance with the two-step clustering. This verifies that sequence-to-sequence embedding learning (TCN and SSTEN) and two-step clustering are a well-suited combination to address the sequential nature of frames in both processing steps of feature embedding and clustering.

Figure 2: Frame-to-frame similarity matrices of the embedded sequences of the same video (computed by the different embedding networks from DTFV features) on Breakfast: (a) Rankloss MLP, (b) MLP, (c) AEMLP, (d) TCN, (e) SSTEN.

Considering K-means clustering, the merit of the better sequential structure of the embedded features obtained via sequence-to-sequence learning can also be seen from the higher IoU and F1 scores (TCN: IoU 17.8%/F1 31.3% vs. SSTEN: 17.8%/31.9%), as these metrics penalize dominating segments and oversegmentation.

In contrast to TCN, SSTEN can preserve the spatial layout of the input features due to the feature reconstruction via the autoencoder. Comparing TCN and SSTEN, we see that the SSTEN embedding with feature reconstruction leads to a boost in the MoF score. The marginal improvement of AEMLP over MLP is due to the fact that the MLP structure with only FC layers is not well-suited for feature reconstruction.

Comparison between K-means and two-step clustering. Considering the performance of the five embeddings with the two clustering methods, we see that K-means yields higher scores on the inferior embeddings (Rankloss MLP, MLP and AEMLP) trained on individual frames, while two-step clustering performs better on the sequence-to-sequence learning-based embeddings (TCN and SSTEN). When combined with the proposed SSTEN embedding, two-step clustering outperforms K-means by a large margin in terms of MoF. We also tried applying K-means on each video separately; however, the performance dropped significantly. K-means depends only on the spatial distance and results in oversegmentation, which leads to erroneous temporal orders on each video and thus an inferior global cluster assignment.

3.2. Impact of Scaling in the Spatio-Temporal Similarity

We perform spectral clustering with the proposed spatio-temporal similarity. Here, we analyze the impact of the scaling factors in the spatial and temporal Gaussian kernels, i.e., σ_spat² and σ_tmp². These adjust the extent to which two frames are considered similar to each other and influence the clustering quality. The experiments are conducted for SSTEN embeddings on Breakfast.

Impact of the scaling of the spatial Gaussian kernel. For local scaling, we set σ_spat² = σ_i σ_j, where σ_i is the distance from e_i to its m-th nearest neighbor in the feature space. The resulting segmentation performance w.r.t. m is shown in Fig. 3. With m varying in the range of 3 to 20, the IoU and F1 scores remain stable. The best MoF scores are achieved for m ∈ {8, 9}, whereas for other scaling parameters the MoF score drops. Thus, we set m = 9 for all following evaluations.

Figure 3: Segmentation performance of the two-step clustering on SSTEN embeddings (λ = 0.002) with different m (for local scaling) on Breakfast.

For comparison, we also set σ_spat to fixed values (without local scaling) and report the segmentation performance in Table 2. We achieve good results for smaller σ_spat values (0.5 and 0.7). However, with increasing σ_spat the MoF score drops significantly, while there are only minor fluctuations in IoU and F1. Apparently, σ_spat has a large impact on the clustering quality. Local scaling eases the effort of tuning σ_spat by determining the scaling factor dynamically.

Table 2: Segmentation performance of two-step clustering on SSTEN embeddings (λ = 0.002) with respect to a fixed spatial scaling factor σ_spat (without local scaling) on Breakfast (in %).
σ_spat | MoF | IoU | F1
0.5 | 47.1 | 18.0 | 33.4
0.7 | 48.0 | 18.5 | 33.9
1 | 41.1 | 19.2 | 32.9
2 | 38.4 | 18.6 | 32.2
10 | 35.0 | 17.9 | 31.2

Impact of the scaling of the temporal Gaussian kernel. The temporal Gaussian kernel operates on the temporal distance between frames in a video. With σ_tmp² = 2σ'², the term exp(-(s_i - s_j)²/(2σ'²)) is the standard form of a Gaussian kernel. We set σ' = 1/6 so that the 6σ' range of the temporal Gaussian equals the video length (since the length of each video is normalized to 1 for the relative timestamp prediction). The segmentation performance with respect to σ' is shown in Table 3; σ' = 1/6 leads to the best result. Here, we also evaluate the case without the temporal Gaussian kernel, which leads to a drop in performance. The impact of the temporal Gaussian kernel on the similarity matrices of SSTEN embeddings can also be seen by comparing the top and bottom rows of Fig. 6 in the main manuscript. For example, by adding the temporal Gaussian kernel, we decrease the similarities in Fig. 6(a1) according to the temporal distance between two frames, which leads to a clearer diagonal block structure in Fig. 6(a2). Thus, we set σ' = 1/6 for all following evaluations.

Table 3: Segmentation performance of two-step clustering on SSTEN embeddings (λ = 0.002) with respect to the temporal scaling factor σ' (σ_tmp² = 2σ'²) on Breakfast (in %).
σ' | MoF | IoU | F1
∞ (w/o tmp. Gauss) | 41.5 | 16.5 | 30.6
1/3 | 43.5 | 16.9 | 31.3
1/6 | 50.3 | 19.0 | 33.6
1/12 | 44.3 | 18.5 | 34.1
3.3. Impact of Cluster Order

One merit of performing within-video clustering is that we can derive the temporal order of sub-clusters for each video separately. This video-wise individual order of clusters is used to guide the Viterbi decoding, which breaks the common assumption that clusters follow the same temporal order in all videos. In the following, we verify the efficacy of the derived video-wise order of clusters. We use the same within-video clustering result with global cluster assignment and perform Viterbi decoding using two different temporal cluster orders: (1) video-wise order: the temporal order of sub-clusters is determined on each video separately; and (2) uniform order: a single order is determined by sorting the average timestamps of the global clusters and is then applied to all videos. Table 4 reports the segmentation performance (after Viterbi) with these two orders for our SSTEN embeddings on Breakfast and YTI. To measure the correctness of the predicted segment order, we adopt the segmental edit distance (Edit), a common metric in supervised action segmentation, e.g., [8, 9, 10, 11]. It penalizes segmentation results whose segment order differs from the ground truth (i.e., it penalizes out-of-order predictions as well as oversegmentation).

Table 4: Impact of cluster order for two-step clustering on SSTEN embeddings (in %).

Order      | Breakfast               | YTI
           | MoF   IoU   F1    Edit  | MoF   IoU   F1    Edit
video-wise | 50.3  19.0  33.6  42.3  | 46.6  10.7  29.5  25.5
uniform    | 53.5  15.7  32.2  33.0  | 40.7   7.7  25.1  20.3

From Table 4 we see that our video-wise order clearly outperforms the uniform order, except for MoF on Breakfast. Furthermore, the edit score verifies that our derived video-wise temporal orders are valid.

In our experiments we especially notice that the MoF and IoU scores can act contradictory to each other; e.g., the uniform order results in a higher MoF score on Breakfast at the cost of a lower IoU score. MoF tends to overfit on dominant classes (e.g., classes with longer action instances), while IoU is sensitive to underrepresented classes and penalizes segmentation results with dominating segments. Therefore, it is necessary to consider all metrics for evaluation, as a higher MoF score does not always correspond to better performance in practice.
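For reference, the segmental edit distance used in Table 4 can be sketched as follows: the prediction and the ground truth are first collapsed into their ordered segment label sequences, and a normalized Levenshtein distance between the two sequences is computed. The normalization by the longer of the two sequences follows the common definition in the supervised action segmentation literature; treat the exact normalization as an assumption here.

```python
import numpy as np

def to_segments(labels):
    """Collapse a frame-wise label sequence into its ordered list of segment labels."""
    return [labels[i] for i in range(len(labels)) if i == 0 or labels[i] != labels[i - 1]]

def edit_score(pred_frames, gt_frames):
    """Segmental edit distance (in %): Levenshtein distance on segment label sequences."""
    p, g = to_segments(pred_frames), to_segments(gt_frames)
    D = np.zeros((len(p) + 1, len(g) + 1))
    D[:, 0] = np.arange(len(p) + 1)
    D[0, :] = np.arange(len(g) + 1)
    for i in range(1, len(p) + 1):
        for j in range(1, len(g) + 1):
            D[i, j] = min(D[i - 1, j] + 1,                            # deletion
                          D[i, j - 1] + 1,                            # insertion
                          D[i - 1, j - 1] + (p[i - 1] != g[j - 1]))   # substitution
    return (1.0 - D[len(p), len(g)] / max(len(p), len(g))) * 100.0
```

Because the distance is computed on segments rather than frames, both oversegmentation (extra segments) and a wrong segment order increase the Levenshtein cost and lower the score.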
3.4. Impact of Decoding Strategies

We compare our approach, which uses Viterbi decoding, with the Mallow model decoding proposed in [7]. The authors of [7] learn a Rankloss embedding over all video frames of the same activity with respect to a pseudo ground truth action annotation. The embedded frames of the whole activity set are then clustered, and the likelihood of each frame for each cluster is computed. For the decoding, the authors build a histogram of features with respect to their clusters using a hard assignment and set the length of each action according to the overall amount of features per bin. After that, they apply a Mallow model to sample different orderings for each video with respect to the sampled distribution. The resulting model is a combination of Mallow model sampling and action length estimation based on the frame distribution.

For this experiment, we evaluate the impact of the different decoding strategies on two embeddings: the Rankloss embedding [7] and our SSTEN embedding. Table 5 reports the results of these two embeddings in combination with three decodings: the Mallow model, Viterbi decoding with K-means, and Viterbi decoding with two-step clustering.

Table 5: Comparison of combinations of embeddings and decoding strategies on Breakfast (in %).

Decoding         | Rankloss [7] embed. | SSTEN embed.
                 | MoF   IoU   F1      | MoF   IoU   F1
Mallow [7]       | 34.7  17.8  31.4    | 36.4  18.1  31.5
kmeans+Viterbi   | 35.2  15.6  28.8    | 39.3  17.8  31.9
two-step+Viterbi | 34.7  13.4  23.7    | 50.3  19.0  33.6

Following [7], we run 7 iterations for the Rankloss embedding with the Mallow model. In each iteration, the Rankloss embedding is retrained using the segmentation result from the previous iteration as pseudo label, and the frame-wise likelihoods and the Mallow model are updated. Unlike the Mallow model, our Viterbi decoding is a one-iteration procedure: it operates on an embedding that is trained only once. When combining the Rankloss embedding with Viterbi decoding, we therefore train the Rankloss model only once, using the initial uniform segmentation as a prior. For SSTEN with the Mallow model, we also run only one iteration, as we do not need to train SSTEN with pseudo labels iteratively.

Considering the Rankloss results in Table 5, we see that combining it with the Mallow model achieves its highest IoU and F1 scores. This is because, for Viterbi decoding, the Rankloss model trained only once with the uniform initialization as pseudo label lacks a strong temporal prior. Considering SSTEN, our Viterbi decoding with two-step clustering clearly outperforms the Mallow model. With Mallow, the SSTEN embedding achieves competitive IoU and F1 scores but a significantly lower MoF. We also tried running the Mallow model on SSTEN embedded features for multiple iterations. However, this resulted in a reduced number of clusters. Thus, we see that an appropriate combination of embedding and decoding strategy is necessary.

To have a closer look at the Viterbi decoding, we visualize the likelihood grids computed from the global clusters, as well as the resulting decoding path over time, for two videos on Breakfast in Fig. 4. It shows that the decoding, which generates a full sequence of actions, is able to marginalize actions that do not occur in the video by assigning only very few frames to them, while the majority of frames are assigned to the clusters that do occur in the video. Even though the given temporal order constrains the resulting K coherent segments to follow a fixed order, segments that do not actually belong to the sequence are marginalized, because the Viterbi algorithm decodes a path that maximizes the posterior probability. Overall, the Viterbi decoding constrained by a temporal order performs better than the Mallow model's iterative re-ordering.

Figure 4: Comparison of the Viterbi decoding path on the likelihood grid (cluster axis over frame axis) computed from the global clusters resulting from two-step clustering on the SSTEN embeddings, for two videos (making juice and making fried egg) on Breakfast, together with the prediction and the ground truth. Warm (red) / cool (blue) colors in the grid indicate high / low likelihoods of a frame belonging to an action class. The decoding assigns most frames to occurring actions while marginalizing actions that do not occur in the sequence by assigning them only a few frames.
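A simplified sketch of the order-constrained decoding illustrated in Fig. 4 is given below: given frame-wise log-likelihoods for the K global clusters and a temporal cluster order (video-wise or uniform), dynamic programming finds the monotonic path with maximal summed log-likelihood, which yields K coherent segments in the given order. This is only a sketch under stated assumptions: it requires at least K frames, forces the path to start in the first and end in the last cluster, and omits any length prior or background handling that the full decoding may use.

```python
import numpy as np

def order_constrained_viterbi(log_lik, order):
    """Decode one video into K ordered, coherent segments.

    log_lik: (T, K) frame-wise log-likelihoods for the K global clusters.
    order:   list of the K cluster indices in their temporal order.
    Returns a frame-wise label sequence (one cluster index per frame).
    """
    T, K = log_lik.shape[0], len(order)
    scores = log_lik[:, order]                       # columns reordered to the given order
    dp = np.full((T, K), -np.inf)
    back = np.zeros((T, K), dtype=int)
    dp[0, 0] = scores[0, 0]                          # must start in the first cluster
    for t in range(1, T):
        for k in range(K):
            stay = dp[t - 1, k]                      # remain in the current segment
            advance = dp[t - 1, k - 1] if k > 0 else -np.inf   # move to the next segment
            if advance > stay:
                dp[t, k], back[t, k] = advance + scores[t, k], k - 1
            else:
                dp[t, k], back[t, k] = stay + scores[t, k], k
    # backtrack from the last cluster so that all K segments occur in order
    labels, k = np.zeros(T, dtype=int), K - 1
    labels[T - 1] = order[k]
    for t in range(T - 1, 0, -1):
        k = back[t, k]
        labels[t - 1] = order[k]
    return labels
```

Under this monotonic constraint, a cluster with low likelihoods throughout the video can only shrink to a minimal segment, which is the marginalization behavior visible in Fig. 4.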
3.5. Quantitative Segmentation Results

3.5.1. Results of Clustering and Final Segmentation

In order to show the advantage of two-step clustering over K-means when combined with the proposed SSTEN embedding, we report both the clustering results (before Viterbi decoding) and the final segmentation performance (after Viterbi decoding) on Breakfast in Table 6. The proposed two-step clustering leads to superior performance compared to K-means, in most metrics of the clustering results (before Viterbi decoding) and in all metrics of the final segmentation (after Viterbi decoding).

Table 6: Comparison of combinations of SSTEN and different clustering methods in terms of clustering and final segmentation after Viterbi decoding on Breakfast (in %).

Embedding | Clustering       | Clustering results | Final results
          |                  | MoF   IoU   F1     | MoF   IoU   F1
SSTEN     | K-means          | 27.2  13.5  26.3   | 39.3  17.8  31.9
SSTEN     | two-step cluster | 38.6  13.7  25.9   | 50.3  19.0  33.6

3.5.2. Segmentation Results on Each Activity

We report the ground truth number of classes and the segmentation performance of MLP with K-means (MLP+kmeans, our reimplementation of [12]) and TAEC for each activity on Breakfast (Table 7), YTI (Table 8), and 50 Salads (Table 9). The evaluation is done with global Hungarian matching on all videos. The number of clusters is set to the maximum number of ground truth classes for each activity (K = max.#gt).
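A minimal sketch of this evaluation protocol: the frame overlap between every (cluster, ground-truth class) pair is accumulated over all videos of an activity, a one-to-one assignment maximizing the total overlap is computed with the Hungarian algorithm, and the metrics are evaluated under that mapping. The function below illustrates this for MoF; the treatment of unmatched clusters and of the background class on YTI is an assumption, and IoU and F1 are computed analogously once the mapping is fixed.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def global_hungarian_mof(pred, gt, n_clusters, n_classes):
    """Map cluster labels to ground-truth classes over all videos of an activity
    (one-to-one, maximizing total frame overlap) and return MoF in %.
    pred, gt: frame-wise label arrays concatenated over all videos of the activity."""
    overlap = np.zeros((n_clusters, n_classes))
    for c in range(n_clusters):
        for a in range(n_classes):
            overlap[c, a] = np.sum((pred == c) & (gt == a))
    rows, cols = linear_sum_assignment(-overlap)     # Hungarian: maximize total overlap
    mapping = dict(zip(rows, cols))
    mapped = np.array([mapping.get(c, -1) for c in pred])   # unmatched clusters count as errors
    return 100.0 * np.mean(mapped == gt)
```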
Table 7: Maximum number of ground truth action classes and segmentation performance of MLP+kmeans (our reimplementation of [12]) and TAEC for the 10 activities on Breakfast (in %), with global Hungarian matching on all videos. The number of clusters is set to the maximum number of ground truth classes for each activity (K = max.#gt).

Activity      | K  | Method     | MoF   IoU   F1
coffee        | 7  | MLP+kmeans | 46.8  15.7  26.2
              |    | TAEC       | 35.6  15.2  24.9
cereals       | 5  | MLP+kmeans | 48.8  25.8  37.2
              |    | TAEC       | 59.0  31.4  47.7
tea           | 7  | MLP+kmeans | 32.2  13.0  22.7
              |    | TAEC       | 39.2  16.3  26.1
milk          | 5  | MLP+kmeans | 40.4  21.2  36.6
              |    | TAEC       | 46.7  27.3  43.5
juice         | 8  | MLP+kmeans | 36.9  14.1  27.9
              |    | TAEC       | 52.2  22.3  36.2
sandwich      | 9  | MLP+kmeans | 47.4  15.0  25.3
              |    | TAEC       | 53.7  19.8  33.5
scrambled egg | 12 | MLP+kmeans | 34.5  10.8  19.9
              |    | TAEC       | 48.1  15.2  28.3
fried egg     | 9  | MLP+kmeans | 36.4  11.5  24.5
              |    | TAEC       | 49.1  17.4  30.5
salad         | 8  | MLP+kmeans | 34.7   7.8  27.5
              |    | TAEC       | 42.0  15.2  34.0
pancake       | 14 | MLP+kmeans | 57.4   8.6  19.2
              |    | TAEC       | 58.2  19.2  35.8
All           | -  | MLP+kmeans | 42.9  13.1  25.5
              |    | TAEC       | 50.3  19.0  33.6

Table 8: Maximum number of ground truth action classes and segmentation performance of MLP+kmeans (our reimplementation of [12]) and TAEC for the 5 activities on YouTube Instructions (YTI, in %), with global Hungarian matching on all videos. The number of clusters is set to the maximum number of ground truth classes for each activity (K = max.#gt). Results are reported without background frames (w/o bkg: MoF, IoU, F1) and with background frames (w bkg: MoF, IoU).

Activity    | K  | Method     | MoF   IoU   F1  (w/o bkg) | MoF   IoU  (w bkg)
coffee      | 10 | MLP+kmeans | 40.9  10.4  34.2          | 11.9   9.5
            |    | TAEC       | 42.8  10.5  26.6          | 12.4   9.6
change tire | 11 | MLP+kmeans | 45.9  17.2  34.0          | 24.7  15.8
            |    | TAEC       | 58.6  20.0  37.2          | 31.5  18.4
jump car    | 12 | MLP+kmeans | 30.6   4.5  24.4          |  5.1   4.1
            |    | TAEC       | 34.3   6.2  26.4          |  5.7   5.8
cpr         | 7  | MLP+kmeans | 34.4   9.9  31.2          | 15.0   8.6
            |    | TAEC       | 38.0   7.9  33.3          | 16.6   6.9
repot       | 8  | MLP+kmeans | 29.8   7.2  23.1          | 10.1   6.4
            |    | TAEC       | 35.8   7.1  22.4          | 12.1   6.3
All         | -  | MLP+kmeans | 39.4   9.9  29.6          | 14.4   9.7
            |    | TAEC       | 46.6  10.7  29.5          | 17.0  10.5

Table 9: Maximum number of ground truth action classes and segmentation performance of MLP+kmeans (our reimplementation of [12]) and TAEC for the single activity on 50 Salads (in %), at the eval and mid granularity levels, with global Hungarian matching on all videos. The number of clusters is set to the maximum number of ground truth classes (K = max.#gt).

Activity     | K  | Method     | MoF   IoU   F1
salad (eval) | 12 | MLP+kmeans | 37.9  24.6  40.2
             |    | TAEC       | 48.4  26.0  44.8
salad (mid)  | 19 | MLP+kmeans | 29.1  15.7  23.4
             |    | TAEC       | 26.6  14.9  23.4

3.6. Qualitative Segmentation Results

We show the qualitative results of clustering and final segmentation on 7 composite activities: making cereals (Fig. 5), making milk (Fig. 6), making juice (Fig. 7), making fried egg (Fig. 8), and making pancake (Fig. 9) on Breakfast, changing tire (Fig. 10) on YTI, and making salad (Fig. 11) on 50 Salads (eval level, 12 classes). The mapping between cluster labels and ground truth classes is done with global Hungarian matching on all videos. The number of clusters is set to the maximum number of ground truth classes for each activity (K = max.#gt). For each activity, we visualize the results of 10 videos. For each video, the 3-row group displays the ground truth (1st row), TAEC (2nd row), and MLP+kmeans [12] (3rd row).

Figure 5: Qualitative results of clustering and final segmentation for 10 making cereals videos on Breakfast (classes: empty, take bowl, pour cereals, pour milk, stir milk). For each video, the 3-row group shows the ground truth (1st row), TAEC (2nd row), and MLP+kmeans [12] (3rd row). The mapping between cluster labels and ground truth classes is done with global Hungarian matching on all videos; the number of clusters is set to the maximum number of ground truth classes (K = max.#gt).

Figure 6: Qualitative results for 10 making milk videos on Breakfast (classes: empty, spoon powder, pour milk, stir milk, take cup).

Figure 7: Qualitative results for 10 making juice videos on Breakfast (classes: empty, cut orange, squeeze orange, take glass, take squeezer, take knife, take plate, pour juice).

Figure 8: Qualitative results for 10 making fried egg videos on Breakfast (classes: empty, pour oil, take plate, take eggs, fry egg, butter pan, add salt, crack egg, put egg2plate).
Figure 9: Qualitative results for 10 making pancake videos on Breakfast (classes: empty, pour flour, put pancake2plate, fry pancake, pour dough2pan, stir dough, spoon flour, butter pan, take eggs, take plate, crack egg, pour oil, take bowl, pour milk).

Figure 10: Qualitative results for 10 changing tire videos of the YouTube Instructions dataset (classes: bkg, get things out, start loose, jack up, unscrew wheel, tight wheel, put wheel, screw wheel, break on, jack down, put things back). Similar to the Breakfast illustrations, for each video the 3-row group shows the ground truth (1st row), TAEC (2nd row), and MLP+kmeans [12] (3rd row).

Figure 11: Qualitative results for 10 making salad videos on the 50 Salads dataset (eval level, 12 action classes: action start, add dressing, add oil, add pepper, cut lettuce, serve salad, add vinegar, mix dressing, mix ingredients, peel cucumber, place lettuce, action end). Again, for each video the 3-row group shows the ground truth (1st row), TAEC (2nd row), and MLP+kmeans [12] (3rd row).

References

[1] J. Shi, J. Malik, Normalized cuts and image segmentation, PAMI 22 (2000) 888–905.
[2] U. von Luxburg, A tutorial on spectral clustering, Statistics and Computing 17 (2007) 395–416.
[3] H.-C. Huang, Y.-Y. Chuang, C.-S. Chen, Affinity aggregation for spectral clustering, in: CVPR, 2012.
[4] Z. Li, J. Chen, Superpixel segmentation using linear spectral clustering, in: CVPR, 2015.
[5] R. Liu, H. Zhang, Segmentation of 3D meshes through spectral clustering, in: Pacific Conference on Computer Graphics and Applications, 2004.
[6] A. Richard, H. Kuehne, A. Iqbal, J. Gall, NeuralNetwork-Viterbi: A framework for weakly supervised video learning, in: CVPR, 2018.
[7] F. Sener, A. Yao, Unsupervised learning and segmentation of complex activities from video, in: CVPR, 2018.
[8] Y. A. Farha, J. Gall, MS-TCN: Multi-stage temporal convolutional network for action segmentation, in: CVPR, 2019.
[9] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, G. D. Hager, Temporal convolutional networks for action segmentation and detection, in: CVPR, 2017.
[10] V. I. Morariu, L. S. Davis, Multi-agent event recognition in structured scenarios, in: CVPR, 2011.
[11] H. S. Koppula, R. Gupta, A. Saxena, Learning human activities and object affordances from RGB-D videos, The International Journal of Robotics Research 32 (2013) 951–970.
[12] A. Kukleva, H. Kuehne, F. Sener, J. Gall, Unsupervised learning of action classes with continuous temporal embedding, in: CVPR, 2019.