TAEC: Unsupervised Action Segmentation with
Temporal-Aware Embedding and Clustering
Wei Lin 1,2, Anna Kukleva 3, Horst Possegger 1, Hilde Kuehne 4 and Horst Bischof 1
1 Institute of Computer Graphics and Vision, Graz University of Technology, Austria
2 Christian Doppler Laboratory for Semantic 3D Computer Vision, Austria
3 Max-Planck-Institute for Informatics, Germany
4 Goethe University Frankfurt, Germany

26th Computer Vision Winter Workshop, Robert Sablatnig and Florian Kleber (eds.), Krems, Lower Austria, Austria, Feb. 15-17, 2023
wei.lin@icg.tugraz.at (W. Lin); akukleva@mpi-inf.mpg.de (A. Kukleva); possegger@icg.tugraz.at (H. Possegger); kuehne@uni-frankfurt.de (H. Kuehne); bischof@icg.tugraz.at (H. Bischof)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)


Abstract
Temporal action segmentation in untrimmed videos has gained increased attention recently. However, annotating action classes and frame-wise boundaries is extremely time consuming and cost intensive, especially on large-scale datasets. To address this issue, we propose an unsupervised approach for learning action classes from untrimmed video sequences. In particular, we propose a temporal embedding network that combines relative time prediction, feature reconstruction, and sequence-to-sequence learning, to preserve the spatial layout and sequential nature of the video features. A two-step clustering pipeline on these embedded feature representations then allows us to enforce temporal consistency within, as well as across videos. Based on the identified clusters, we decode the video into coherent temporal segments that correspond to semantically meaningful action classes. Our evaluation on three challenging datasets shows the impact of each component and, furthermore, demonstrates our state-of-the-art unsupervised action segmentation results.

Keywords
Unsupervised learning, unsupervised clustering, action segmentation



1. Introduction

Action recognition has seen tremendous success in recent years, especially in the context of short video clip classification [1, 2, 3], action detection [4, 5, 6], and temporal action segmentation [7, 8, 9, 10]. The top-performing methods for temporal action segmentation, however, require frame-wise annotations, which is expensive and impractical for large-scale datasets [7, 8, 9, 11]. Consequently, a large body of research focuses on weakly-supervised approaches where only an ordered list [12, 13, 14, 15, 16] or an unordered set [17, 18, 19] of action labels is needed. These approaches assume that the actions that occur in each training video are known, and sometimes even require their exact ordering. Acquiring such ordered action lists, however, can still be time consuming or even infeasible. For applications like indexing large video datasets or human behavior analysis in neuroscience or medicine, it is often unclear what actions should be annotated. In these cases, it is necessary to automatically discover and identify recurring actions in large video datasets.

To address this problem, Sener and Yao [20] proposed the task of unsupervised action segmentation to identify patterns of recurring actions in long, untrimmed video sequences that correspond to semantically meaningful action classes. Recent approaches for unsupervised action segmentation, e.g. [20, 21, 22], focus on three aspects: (1) finding a suitable embedding space for the video data, (2) identifying clusters of temporal segments across a large amount of videos, and (3) parsing the input videos given the respective feature embedding and cluster information. Sener and Yao [20] tackle the action segmentation task with a linear embedding and Mallow decoding, while other approaches follow the pipeline of an MLP [21] or U-Net embedding [22], K-means clustering, and Viterbi decoding. However, these methods do not fully leverage the temporal relationships of frames within a video, either neglecting this information when learning the embedding [20, 22] or when clustering [21, 22].

Since temporal consistency is essential for all steps of the action segmentation pipeline, we propose TAEC, an approach that considers the sequential nature of frames in a video for both, embedding learning and clustering. The main steps of our approach are illustrated in Fig. 1. Specifically, we first propose a sequence-to-sequence temporal embedding network (SSTEN) that combines pretext tasks of relative timestamp prediction and autoencoder feature reconstruction. While the autoencoder reconstruction retains the feature layout, the relative timestamp prediction encodes the relative temporal information within a video. Sequence-to-sequence learning enables the embedding of spatial layout and temporal information on a complete video sequence.

To cluster the embedded features, we propose a temporal-aware two-step clustering approach that consists of a within-video clustering and a cross-video global cluster assignment.


[Figure 1 shows the TAEC pipeline on three videos of "Making coffee": (a) feature embedding of each video, (b) within-video clustering (C_k,n denotes the k-th within-video cluster in video n), (c) cross-video global cluster assignment into global clusters, which also defines the cluster ordering per video, and (d) Viterbi decoding.]

Figure 1: Pipeline of TAEC. We compute the embedded features with the sequence-to-sequence temporal embedding network (SSTEN). Thereupon, we perform a within-video clustering on each video individually and apply a cross-video global cluster assignment to group the within-video clusters into global clusters. The global cluster assignment also defines the ordering of the clusters in each video. Finally, we use Viterbi decoding to estimate temporally coherent segments for each video.



Specifically, we perform clustering within each video, with a spatio-temporal similarity among frames. Then we conduct global cluster assignment to group the clusters across videos. The global cluster assignment defines the ordering of the clusters for each video. In this way, we overcome the unrealistic assumption that actions of an activity always follow the same temporal order. Such an assumption is commonly used in related works, e.g. [21, 22]. For instance, in the activity of making coffee, a unified temporal order between actions such as adding milk and adding sugar is assumed for all videos of making coffee, whereas our approach can handle changes of the action order in different videos. After assigning all within-video clusters to a set of global clusters, we perform Viterbi decoding to obtain a segmentation of temporally coherent segments.

Our contributions can be summarized as follows:

  • We design a sequence-to-sequence temporal embedding network (SSTEN), which combines relative timestamp prediction, autoencoder reconstruction and sequence-to-sequence learning.
  • We propose a within-video clustering with a novel spatio-temporal similarity formulation among frames.
  • We propose a cross-video global cluster assignment to group within-video clusters across videos into global clusters, which also overcomes the assumption that in all videos of an activity, actions follow the same temporal order.


2. Related Work

Unsupervised learning of video representations is commonly performed via pretext tasks, such as reconstruction [23, 24], future frame prediction [22, 25, 26], and recognition of frame orders [27, 28, 29, 30, 31]. For instance, Srivastava et al. [24] exploit an LSTM-based autoencoder for learning video representations. Villegas et al. [26] and Denton and Birodkar [25] employed two encoders to generate feature representations of content and motion. The temporal order of frames or small chunks is utilized as a self-supervision signal for representation learning on short video clips in [27] and [28]. Inspired by these approaches, we employ two self-supervision tasks: feature reconstruction and relative time prediction.

Clustering of temporal sequences has been explored for parsing human motions [32, 33, 34, 35]. While Zhang et al. [35] proposed a hierarchical dynamic clustering framework, Li et al. [33] and Tierney et al. [34] explored temporal subspace clustering to segment human motion data. In contrast to unsupervised action segmentation, these methods are applied on each temporal sequence individually and do not consider the association among sequences. Instead, we propose a cross-video global cluster assignment to group within-video clusters across different videos into global clusters.

Unsupervised action segmentation on fine-grained activities has been addressed by recent work that focuses either on representation learning [20, 22, 36] or on the clustering step [23]. However, the temporal information is neglected in at least one of these two steps. For representation learning, Sener and Yao [20] construct a feature embedding by learning a linear mapping from visual features to a latent space with a ranking loss. However, the linear model trained with individual frames does not consider the temporal association between frames. VidalMata et al. [22] employ a U-Net trained on individual frames for future frame prediction. Predicting for one or a few steps ahead only requires temporal relations within a small temporal window. Instead, we propose to learn a representation by predicting the complete sequence of relative timestamps to encode the long-range temporal information.







For the clustering step, related works [22, 23] neglect temporal consistency of frames within a video. Instead, we apply within-video clustering on each video with a proposed similarity formulation that considers both spatial and temporal distances.

Two recent approaches perform clustering [37] or cluster-agnostic boundary detection [38] on each video separately, without identifying clusters or segments across videos. [37] solves a task similar to human motion parsing and evaluates the segmentation for each video individually. [38] only detects boundaries of category-agnostic segments, and does not identify if some segments within a video or across videos are of the same category. On the contrary, our segments on all videos are category-aware as they are aligned globally across videos by our global cluster assignment.

[Figure 2 shows the SSTEN architecture: stacked encoder and decoder stages built from dilated residual layers (dilated 1D convolution + ReLU), with the stage output concatenated with the predicted relative timestamps.]

Figure 2: Architecture of SSTEN: We stack 2 stages of encoder and decoder for sequence-to-sequence feature reconstruction. Each stage consists of Q dilated residual layers with dilated temporal convolution. The intermediate representation in the encoder is used for relative time prediction (red block).

3. Temporal-Aware Embedding and Clustering (TAEC)

We address unsupervised action segmentation as illustrated in Fig. 1. First, we learn a suitable feature embedding (Sec. 3.1). We then perform within-video clustering on each video (Sec. 3.2.1), and group the within-video clusters into global clusters (Sec. 3.2.2). Finally, we compute temporally coherent segments on each video using Viterbi decoding (Sec. 3.3).

3.1. SSTEN: Sequence-to-Sequence Temporal Embedding Network

To learn a latent representation for temporal sequences, we adopt a sequence-to-sequence autoencoder. Inspired by the multi-stage temporal convolutional network [7], we use a concatenation of two stages for both encoder and decoder, as shown in Fig. 2. Given a set {X_n}_{n=1..N} of N videos, where each video X_n = {x_{t,n}}_{t=1..T_n} has T_n frames, the outputs are reconstructed frame features {x̂_{t,n}}_{t=1..T_n}. The embedded features are the hidden representation {e_{t,n}}_{t=1..T_n}.

Every encoder and decoder stage consists of 1×1 convolution layers for dimension adjustment (Fig. 2, blue) and Q dilated residual layers (green), each containing a dilated temporal 1D convolution. Since no fully connected layers are employed, sequences of variable lengths can be processed seamlessly. The dilation rate at the q-th layer is 2^(q−1). By stacking dilated residual layers, the temporal receptive field increases exponentially: the receptive field of the q-th layer is 1 + (r − 1) × (2^q − 1), where r is the kernel size. Therefore, each frame in the hidden representation has a long temporal dependency on the input video. In each encoder stage, we use a 1×1 convolution layer (in red) to predict the frame-wise relative timestamps s_{t,n} = t/T_n. At the end of each encoder stage, the hidden representation is a concatenation (in yellow) of the features from the dilated residual layers and the predicted relative timestamps. The training loss is

    ℒ = 𝜆 ∑_{n=1}^{N} ∑_{t=1}^{T_n} ‖x_{t,n} − x̂_{t,n}‖₂²  +  ∑_{p∈{1,2}} ∑_{n=1}^{N} ∑_{t=1}^{T_n} (s_{t,n} − ŝ_{p,t,n})²,    (1)

where the coefficient 𝜆 balances the two terms and ŝ_{p,t,n} is the relative timestamp predicted by the p-th encoder stage. The pretext tasks of reconstruction and relative timestamp prediction encode both the spatial distribution and the global temporal information into the embedded features. We compare SSTEN with several baseline embedding networks in the supplementary.
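As a concrete illustration of Eq. (1), the following is a minimal PyTorch sketch of the combined reconstruction and relative-time objective. The function name, tensor layout, and the list of per-stage time predictions are assumptions made for illustration; they do not reproduce the authors' implementation.

```python
# Minimal sketch of the SSTEN training objective of Eq. (1), assuming the
# network returns reconstructed features and one relative-time prediction per
# encoder stage. Shapes: features (T, D), time predictions (T,).
import torch

def ssten_loss(x, recon, stage_time_preds, lam=0.002):
    """x, recon: (T, D) input and reconstructed frame features of one video.
    stage_time_preds: list of (T,) relative-time predictions, one per encoder stage.
    lam: weight of the reconstruction term (lambda in Eq. (1))."""
    T = x.shape[0]
    # ground-truth relative timestamps s_t = t / T
    s = torch.arange(1, T + 1, dtype=x.dtype, device=x.device) / T
    recon_term = ((x - recon) ** 2).sum(dim=1).sum()           # sum_t ||x_t - x_hat_t||^2
    time_term = sum(((s - s_hat) ** 2).sum() for s_hat in stage_time_preds)
    return lam * recon_term + time_term

# toy usage with random tensors
if __name__ == "__main__":
    T, D = 100, 64
    x = torch.randn(T, D)
    loss = ssten_loss(x, x + 0.1 * torch.randn(T, D), [torch.rand(T), torch.rand(T)])
    print(loss.item())
```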
3.2. Two-Step Clustering

After learning the feature embedding, we group the embedded features into K clusters by a within-video clustering and a cross-video global cluster assignment.

3.2.1. Within-Video Clustering

We perform spectral clustering on frames within each video (detailed description in the supplementary). Given the embedded feature sequence [e₁, e₂, ..., e_T] (for ease of notation, we omit the video index n), we build a frame-to-frame similarity matrix G ∈ ℝ^(T×T). The entries g(i,j), i,j ∈ {1, ..., T}, represent the similarity between frame i and frame j. To consider both the spatial and temporal distance of features, we propose to measure the similarity by the product of two Gaussian kernels

    g(i,j) = exp(−‖e_i − e_j‖₂² / σ_spat²) · exp(−(s_i − s_j)² / σ_tmp²),    (2)

where s_i, s_j are the corresponding relative timestamps of frames i, j and σ_spat, σ_tmp are the scaling factors for the spatial and temporal Gaussian kernels.
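A minimal NumPy/scikit-learn sketch of this within-video step is given below, assuming fixed kernel widths sigma_spat and sigma_tmp (the locally scaled variant is sketched after the next paragraph). Using SpectralClustering with a precomputed affinity is an illustrative choice, not necessarily the authors' exact implementation.

```python
# Sketch: spatio-temporal affinity of Eq. (2) + spectral clustering of one video.
# Assumes fixed sigma_spat / sigma_tmp; see the local-scaling variant below.
import numpy as np
from sklearn.cluster import SpectralClustering

def within_video_clustering(E, K, sigma_spat=1.0, sigma_tmp=0.1):
    """E: (T, d) embedded frame features of one video. Returns per-frame cluster ids."""
    T = E.shape[0]
    s = np.arange(1, T + 1) / T                               # relative timestamps s_i
    d_spat = ((E[:, None, :] - E[None, :, :]) ** 2).sum(-1)   # ||e_i - e_j||^2
    d_tmp = (s[:, None] - s[None, :]) ** 2                    # (s_i - s_j)^2
    G = np.exp(-d_spat / sigma_spat**2) * np.exp(-d_tmp / sigma_tmp**2)
    labels = SpectralClustering(n_clusters=K,
                                affinity="precomputed").fit_predict(G)
    return labels

# toy usage
if __name__ == "__main__":
    E = np.random.randn(200, 32)
    print(within_video_clustering(E, K=5)[:20])
```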






To avoid manually tuning σ_spat, we use local scaling [39] to estimate σ_spat dynamically. To this end, we replace σ_spat² by σ_i σ_j, where σ_i is the distance from e_i to its m-th nearest neighbor in the embedding space. We provide an ablation study on scaling of the spatio-temporal similarity in the supplementary. Consequently, frames of similar visual content and relative timestamps are encouraged to be grouped into the same cluster.
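The locally scaled affinity can be sketched as follows; the value of m and the use of a NearestNeighbors query are illustrative assumptions.

```python
# Sketch: local scaling of the spatial kernel (sigma_spat^2 -> sigma_i * sigma_j),
# where sigma_i is the distance from e_i to its m-th nearest neighbor.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def locally_scaled_affinity(E, sigma_tmp=0.1, m=7):
    T = E.shape[0]
    s = np.arange(1, T + 1) / T
    d_spat = ((E[:, None, :] - E[None, :, :]) ** 2).sum(-1)
    # sigma_i: Euclidean distance from e_i to its m-th nearest neighbor
    nn = NearestNeighbors(n_neighbors=m + 1).fit(E)           # +1 accounts for the point itself
    dist, _ = nn.kneighbors(E)
    sigma = dist[:, m]
    G = np.exp(-d_spat / (sigma[:, None] * sigma[None, :] + 1e-8)) \
        * np.exp(-((s[:, None] - s[None, :]) ** 2) / sigma_tmp**2)
    return G
```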
[Figure 3 depicts one iteration of the multiple-hub heuristic with one hub vertex set and the non-hub vertex sets.]

Figure 3: In the h-th iteration, h ∈ {1, ..., N}, the hub vertex set V_h is chosen and an assignment is computed by assigning the vertices between V_h and the N−1 non-hub vertex sets. Solid lines denote bipartite matching results between V_h and non-hub vertex sets (step (1)). Dashed lines denote connections between non-hub vertex sets (step (2)).

3.2.2. Cross-Video Global Cluster Assignment

After within-video clustering, we assign the N×K within-video clusters across videos into K global clusters. Every global cluster should contain N within-video clusters, each coming from a different video (cf. Fig. 1). This can be interpreted as an N-dimensional assignment problem [40].

We regard the n-th video V_n = {c_{k,n} | k = 1, ..., K} as a vertex set, where each k-th within-video cluster c_{k,n} is a vertex. We construct an N-partite graph G = (V_1 ∪ V_2 ∪ ... ∪ V_N, E), where E = ⋃_{m<n, m,n∈{1,...,N}} {(c, c′) | c ∈ V_m, c′ ∈ V_n} is the set of edges between within-video clusters across videos. The edge weight w(c, c′) is the distance between the centroids of two within-video clusters c, c′. The solution to the N-dimensional assignment is a partition that divides the graph G into K cliques Z_1, Z_2, ..., Z_K. A clique Z_k, which is a subset of N vertices from N different vertex sets, defines the k-th global cluster. The induced sub-graphs of the cliques Z_1, Z_2, ..., Z_K are complete and disjoint. We denote the edge set of the induced sub-graph of Z_k as E_{Z_k}. The cost of a clique is the sum of pairwise edge weights between the contained vertices. The cost of an assignment solution is the sum of the costs of all the K cliques, i.e.,

    ℒ(Z_1, Z_2, ..., Z_K) = ∑_{k=1}^{K} ∑_{(c,c′)∈E_{Z_k}} w(c, c′).    (3)

In order to solve this NP-hard problem, we employ an iterative multiple-hub heuristic [41]. In each iteration, we choose a hub vertex set V_h = {c_{k,h} | k = 1, ..., K}, and the remaining (N−1) vertex sets are non-hub vertex sets. We compute an assignment solution in each iteration in two steps, as shown in Fig. 3: (1) We first perform (N−1) bipartite matchings between V_h and each of the non-hub vertex sets. (2) Secondly, we determine the edge connections between pairs of non-hub vertex sets: on two non-hub vertex sets V_{h′}, V_{h″}, we connect two vertices c_{i,h′} ∈ V_{h′} and c_{i′,h″} ∈ V_{h″} if c_{i,h′} and c_{i′,h″} are connected to the same vertex c_{k,h} in V_h. After the two steps, every hub vertex c_{k,h}, with k ∈ {1, ..., K}, and all the non-hub vertices connected to c_{k,h} form the k-th clique Z_k. Therefore, the N-partite graph G is partitioned into K complete and disjoint subgraphs.

By iterating over all possible initial hub vertex sets h ∈ {1, ..., N}, we choose the assignment solution f_ĥ which minimizes the assignment cost

    f_ĥ = arg min_{h∈{1,...,N}} ∑_{(c,c′)∈E} f_h(c, c′) · w(c, c′),    (4)

where f_h(c, c′), ∀(c, c′) ∈ E, is a binary indicator function describing the edge connections of the solution obtained with hub vertex set V_h: f_h(c, c′) equals 1 when the two vertices c, c′ are connected. The assignment solution f_ĥ describes the partition which leads to the K global clusters.
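The multiple-hub heuristic of Eqs. (3)-(4) can be sketched with SciPy's Hungarian solver for the bipartite matchings of step (1). Taking cluster centroids as input and the exact cost bookkeeping are simplifying assumptions for illustration, not the authors' released code.

```python
# Sketch of the cross-video global cluster assignment (multiple-hub heuristic):
# each video contributes K within-video cluster centroids; clusters matched to the
# same hub cluster form one global cluster. Uses the Hungarian algorithm
# (scipy.optimize.linear_sum_assignment) for the bipartite matchings of step (1).
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def global_cluster_assignment(centroids):
    """centroids: list of N arrays, each (K, d), the within-video cluster centroids.
    Returns an (N, K) array mapping (video n, local cluster k) -> global cluster id."""
    N, K = len(centroids), centroids[0].shape[0]
    best_cost, best_assign = np.inf, None
    for h in range(N):                                    # try every hub vertex set V_h
        assign = np.zeros((N, K), dtype=int)
        assign[h] = np.arange(K)                          # hub clusters define the global ids
        for n in range(N):                                # step (1): bipartite matching to the hub
            if n == h:
                continue
            rows, cols = linear_sum_assignment(cdist(centroids[n], centroids[h]))
            assign[n, rows] = cols
        # assignment cost of Eq. (3): pairwise edge weights inside every clique
        cost = 0.0
        for m in range(N):
            for n in range(m + 1, N):
                d = cdist(centroids[m], centroids[n])
                same = assign[m][:, None] == assign[n][None, :]
                cost += d[same].sum()
        if cost < best_cost:                              # Eq. (4): keep the cheapest hub choice
            best_cost, best_assign = cost, assign
    return best_assign
```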



3.3. Frame Labeling by Viterbi Decoding

Given the embedded feature sequence e_{1∼T_n,n} of video n, we determine the optimal label sequence ŷ_{1∼T_n,n}. The posterior probability can be factorized into the product of likelihoods and the probability of a given temporal order, i.e.,

    ŷ_{1∼T_n,n} = arg max_{y_{1∼T_n,n}} p(y_{1∼T_n,n} | e_{1∼T_n,n}) = arg max_{y_{1∼T_n,n}} { ∏_{i=1}^{T_n} p_n(e_{i,n} | y_{i,n}) · ∏_{i=1}^{T_n} p_n(y_{i,n} | y_{1∼i−1,n}) }.

We fit a Gaussian model on each global cluster and compute the frame-wise likelihoods, i.e., p_n(x | k) = 𝒩_k(x; μ_k, Σ_k), k ∈ {1, ..., K}. The temporal order constraint is used to limit the search space for the optimal label sequence by filtering out the sequences that do not follow the temporal order.

The related works [21, 22] apply K-means on the frames of all the videos. From the unified clustering they derive only a single temporal order of clusters for all the videos. However, this is an unrealistic assumption due to interchangeable steps in the activities, e.g., pour milk and pour sugar in making coffee. Instead, we can easily derive the temporal order for each video separately. We do so by sorting the within-video clusters according to the average timestamp of the frames in each cluster. The output of the Viterbi decoding is the optimal cluster label sequence ŷ_{1∼T_n,n}. More details are given in the supplementary.
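A compact NumPy sketch of the decoding step is given below, assuming Gaussian log-likelihoods per global cluster and a per-video cluster order derived from average timestamps. Modeling the order constraint as a monotonic left-to-right pass through the ordered clusters is an illustrative simplification of the search-space restriction described above.

```python
# Sketch: per-video temporal order + order-constrained Viterbi decoding.
# States are positions in the per-video cluster order; transitions may only stay
# in the current action or advance to the next one (left-to-right model).
import numpy as np

def video_cluster_order(frame_clusters):
    """Order the clusters of one video by the average timestamp of their frames."""
    ids = np.unique(frame_clusters)
    mean_t = [np.mean(np.where(frame_clusters == k)[0]) for k in ids]
    return ids[np.argsort(mean_t)]

def viterbi_decode(log_lik, order):
    """log_lik: (T, K) frame log-likelihoods under the K global-cluster Gaussians.
    order: sequence of cluster ids in their per-video temporal order."""
    T, S = log_lik.shape[0], len(order)
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0, 0] = log_lik[0, order[0]]
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1, s]
            move = score[t - 1, s - 1] if s > 0 else -np.inf
            back[t, s] = s if stay >= move else s - 1
            score[t, s] = max(stay, move) + log_lik[t, order[s]]
    # backtrack from the last state, i.e. every action in the order is visited once
    states = [S - 1]
    for t in range(T - 1, 0, -1):
        states.append(back[t, states[-1]])
    return np.array([order[s] for s in states[::-1]])
```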



4. Experiments

4.1. Datasets & Evaluation Metrics

We evaluate on Breakfast [42], the YouTube Instructions dataset (YTI) [36] and 50 Salads [43]. Breakfast is comprised of 1712 videos recorded in various kitchens. There are 10 composite activities of breakfast preparation. YTI is composed of 150 videos of 5 activities collected from YouTube. 50 Salads contains 50 videos of people preparing salads. Following [20, 21, 22], we use the dense trajectory Fisher vector features (DTFV) [44] for Breakfast and 50 Salads, and the features provided by Alayrac et al. [36] on YTI. We use the evaluation protocol in [21] and report the performance in three metrics: (1) Mean over Frames (MoF) is the frame-level accuracy over the frames of all the videos. More frequent or longer action instances have a higher impact on the result. (2) Class-wise mean Intersection over Union (cIoU) is the average over the IoU performance for each class and penalizes segmentation results with dominating segments. (3) The F1-score penalizes results with oversegmentation.

4.2. Implementation Details

For our SSTEN, we adapt the number of dilated residual layers Q according to the dataset size: We set Q = 5 for YTI (15k frames per activity subset on average) and Q = 10 for Breakfast (360k) and 50 Salads (577k). The dimension of the hidden representation is set to 32. We set 𝜆 in Eq. (1) to 0.002 (Breakfast), 0.01 (YTI) and 0.005 (50 Salads). For clustering, we follow the protocol of [20, 36] and define the number of clusters K separately for each activity as the maximum number of ground truth classes. The values of K for the three datasets are provided in the supplementary material.

4.3. Comparison with the State-of-the-Art

We compare with unsupervised learning methods, as well as weakly and fully supervised approaches on Breakfast (Table 1), YTI (Table 2) and 50 Salads (Table 3). Most unsupervised segmentation approaches yield cluster-aware segments that are aligned across all the videos [20, 21, 22, 36, 45]. These approaches are evaluated with the global Hungarian matching on all videos, where the mapping between ground truth classes and clusters is performed on all the videos of an activity, which results in one mapping for each activity. The number of clusters K is set to the maximum number of ground truth classes for each activity (i.e., K=max.#gt). We focus on the performance comparison in this setting and follow this setting in all the ablation studies.

Two recent approaches perform clustering (i.e., TWFINCH [37]) or category-agnostic boundary detection (i.e., LSTM+AL [38]) on each video individually, without solving the alignment among different clusters or segments across videos. For a fair comparison, these are evaluated by local Hungarian matching on individual videos, where a per-video best ground-truth-to-cluster-label mapping is determined using the ground truth on each video separately. This results in a separate label mapping for each video. Following [37], we also report results with K set to the average number of actions for each activity (i.e., K=avg.#gt) for a complete comparison.

In Table 1, TAEC achieves strong results in comparison to the unsupervised state-of-the-art and is even comparable to weakly supervised approaches. Although approaches without solving the alignment of clusters across videos inherently lead to better scores in the evaluation settings of the local Hungarian matching, our approach still compares favorably.

We compare qualitative results (with global Hungarian matching) of TAEC and MLP+kmeans [21] on 3 Breakfast activities in Fig. 4. We see that our two-step clustering (the 2nd rows in all clustering result plots) already leads to temporally consistent segments with relatively accurate boundaries of action instances, while K-means (the 4th rows in all clustering result plots) results in serious oversegmentation. The Viterbi decoding further improves the segmentation by suppressing the oversegmentation and the domination of incorrect clusters (the 2nd rows in all final result plots). Moreover, MLP+kmeans [21] follows the constraint of a fixed temporal order of segments on videos of each activity (the 4th rows in all final result plots). In contrast, TAEC yields an individual temporal order for each video (the 2nd rows in all final result plots). Additional qualitative results and evaluation scores are included in the supplemental material.

For the YouTube Instructions dataset, we follow the protocol of [20, 21, 36] and report the results with and without considering background frames. Here, our TAEC outperforms all recent works in almost all of the metrics under all three settings.

50 Salads is a particularly challenging dataset for unsupervised approaches, as each video has a different order of actions and additionally includes many repetitive action instances. In the eval-level of 12 classes, TAEC outperforms all approaches under the global Hungarian matching evaluation and achieves competitive results under the local Hungarian matching. In the challenging mid-level evaluation of 19 classes, the sequential nature of frames is less advantageous. Therefore, MLP+kmeans [21] outperforms TAEC. Generally, in the local matching case, approaches without alignment across videos compare favorably.
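To illustrate the two evaluation settings, the following is a hedged NumPy/SciPy sketch of global vs. local Hungarian matching followed by MoF and class-wise IoU. The exact protocol (background handling, F1 computation) follows [21] and is not reproduced here; the function names are illustrative.

```python
# Sketch: map cluster ids to ground-truth classes via Hungarian matching, either
# globally (one mapping per activity) or locally (one mapping per video), then
# compute MoF and class-wise IoU. F1 and background handling are omitted.
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_mapping(gt, pred, n_classes, n_clusters):
    """Return predictions remapped by the cluster -> class mapping maximizing overlap."""
    overlap = np.zeros((n_clusters, n_classes))
    for c in range(n_clusters):
        for g in range(n_classes):
            overlap[c, g] = np.sum((pred == c) & (gt == g))
    rows, cols = linear_sum_assignment(-overlap)              # maximize frame overlap
    mapping = dict(zip(rows, cols))
    return np.array([mapping.get(c, -1) for c in pred])

def mof_and_ciou(gt, mapped):
    mof = np.mean(gt == mapped)
    ious = []
    for g in np.unique(gt):
        inter = np.sum((gt == g) & (mapped == g))
        union = np.sum((gt == g) | (mapped == g))
        ious.append(inter / union if union > 0 else 0.0)
    return mof, float(np.mean(ious))

def evaluate(videos, n_classes, n_clusters, local=False):
    """videos: list of (gt, pred) per-frame label arrays of one activity."""
    if local:   # local Hungarian matching: one mapping per video
        mapped = [hungarian_mapping(g, p, n_classes, n_clusters) for g, p in videos]
    else:       # global Hungarian matching: one mapping over all videos of the activity
        gt_all = np.concatenate([g for g, _ in videos])
        pr_all = np.concatenate([p for _, p in videos])
        mapped_all = hungarian_mapping(gt_all, pr_all, n_classes, n_clusters)
        mapped, i = [], 0
        for g, _ in videos:                                   # split back per video
            mapped.append(mapped_all[i:i + len(g)]); i += len(g)
    gt_cat = np.concatenate([g for g, _ in videos])
    return mof_and_ciou(gt_cat, np.concatenate(mapped))
```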






[Figure 4 shows color-coded segmentations for three Breakfast activities (Making cereals, Making juice, Making fried egg), with the action legend per activity and, for each of three videos per activity, a clustering-result plot and a final-result plot.]

Figure 4: Qualitative results of clustering and final segmentation (with global Hungarian matching) on 3 activities (3 videos each) on Breakfast. For each video, the 4-row-group displays the ground truth (1st row), TAEC (2nd row), TAEC with naïve assignment (3rd row, quantitative comparison in Sec. 4.6), MLP+kmeans [21] (4th row). More quantitative and qualitative segmentation results are given in the supplementary.



Table 1
Comparison with state-of-the-art approaches on Breakfast (in %). * denotes approach without segment alignment across videos, ‡ denotes our reimplementation, underlined scores are acquired from the author.

Breakfast
Approach              Supervision   MoF     IoU     F1
MSTCN++ [8]           full          67.6     -       -
G-FRNet [46]          full          67.7     -       -
DTGRM [10]            full          68.3     -       -
SSTDA [47]            full          70.3     -       -
BCN [9]               full          70.4     -       -
Global2local [48]     full          70.7     -       -
ASFormer [49]         full          73.5     -       -
NN-Viterbi [15]       weak          43.0     -       -
D3TW [12]             weak          45.7     -       -
TASL [50]             weak          47.8     -       -
CDFL [13]             weak          50.2     -       -
Global Hungarian matching on all videos (K = max. #gt)
Mallow [20]           w/o           34.6     -       -
UNet+MLP [22]         w/o           48.1     -      29.9
ASAL [45]             w/o           52.5     -      37.9
MLP+kmeans [21]       w/o           41.8     -      26.4
MLP+kmeans ‡          w/o           42.9    13.1    25.5
TAEC                  w/o           50.3    19.0    33.6
Local Hungarian matching on each video (K = max. #gt)
LSTM+AL [38]*         w/o           42.9*   46.9*    -
UNet+MLP [22]         w/o           52.2     -       -
TWFINCH [37]*         w/o           57.8*    -       -
MLP+kmeans ‡          w/o           61.2    30.3    35.9
TAEC                  w/o           64.3    41.2    42.6
Local Hungarian matching on each video (K = avg. #gt)
TWFINCH [37]*         w/o           62.7*   42.3*    -
MLP+kmeans ‡          w/o           60.6    27.8    46.4
TAEC                  w/o           62.6    32.3    49.6

Table 2
Comparison with the state-of-the-art on YTI (in %). * denotes approach without segment alignment across videos, ‡ denotes our reimplementation, underlined scores are acquired from the author.

YouTube Instructions
Approach              Supervision   MoF      IoU      F1       MoF     IoU
                                    w/o bg   w/o bg   w/o bg   w bg    w bg
Global Hungarian matching on all videos (K = max. #gt)
Frank-Wolfe [36]      w/o            -        -       24.4      -       -
Mallow [20]           w/o           27.8      -       27.0      -       -
UNet+MLP [22]         w/o           28.9     8.3      29.9      -       -
ASAL [45]             w/o           44.9      -       32.1      -       -
MLP+kmeans [21]       w/o           39.0     9.8      28.3     14.5    9.6
MLP+kmeans ‡          w/o           39.4     9.9      29.6     14.4    9.7
TAEC                  w/o           46.6    10.7      29.5     17.0    10.5
Local Hungarian matching on each video (K = max. #gt)
LSTM+AL [38]*         w/o            -        -       39.7*     -       -
MLP+kmeans ‡          w/o           62.2    21.6      47.0     22.7    21.6
TAEC                  w/o           67.9    23.7      49.4     24.8    23.6
Local Hungarian matching on each video (K = avg. #gt)
TWFINCH [37]*         w/o           56.7*    -        48.2*     -       -
MLP+kmeans ‡          w/o           63.6    20.5      52.3     23.2    20.5
TAEC                  w/o           65.3    20.9      51.0     23.9    20.8


4.4. Embedding and Clustering

We first evaluate our SSTEN embedding in combination with K-means and two-step clustering on three feature types: the AlexNet fc6 features [57] pre-trained on ImageNet [58], I3D features [59] pre-trained on the Kinetics dataset [60], and the precomputed dense trajectory Fisher vectors (DTFV) [44]. We also report results of raw features without any temporal embedding. For a fair comparison, we reduce the dimensions of the three features without embedding to 32 via PCA. We conduct the experiments on Breakfast and report the results with and without our SSTEN embedding in Table 4.
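For the raw-feature baselines, the dimensionality reduction can be done as in the following sketch; the 32-dimensional target matches the SSTEN hidden size, while the scikit-learn call is an illustrative choice rather than the authors' exact setup.

```python
# Sketch: reduce raw frame features (AlexNet fc6 / I3D / DTFV) to 32 dimensions
# with PCA so that they match the dimensionality of the SSTEN embedding.
import numpy as np
from sklearn.decomposition import PCA

def reduce_features(features_per_video, dim=32):
    """features_per_video: list of (T_n, D) arrays. Returns list of (T_n, dim) arrays."""
    stacked = np.concatenate(features_per_video, axis=0)
    pca = PCA(n_components=dim).fit(stacked)
    return [pca.transform(f) for f in features_per_video]
```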






Table 3                                                                                     Comparison of raw features without embedding.
Comparison with the state-of-the-art on 50 Salads (in %). *                              Among the three types of features without temporal
denotes approach without segment alignment across videos,                                embedding, I3D achieves the best performance, while
‡ denotes our reimplementation.                                                          AlexNet features lead to the worst results. AlexNet fea-
                                    50 Salads                                            tures are computed from individual spatial frames. On
   Level            Approach         Supervision         MoF      IoU      F1
                                                                                         the contrary, each frame feature of DTFV and I3D is com-
                   STCNN [51]              full          72.0      -        -
                                                                                         puted based on a chunk of its temporal neighbor frames.
                   EDTCN [52]              full          73.4      -        -            Therefore, the features already carry intrinsic temporal
                     MA [53]               full          88.5      -        -
                                                                                         consistency. Furthermore, the two-stream I3D model can
              Global Hungarian matching on all videos (𝐾= max. #gt)                      leverage both RGB and optical flow. Therefore, I3D fea-
              UNet+MLP [22]                w/o           30.6     -        -             tures achieve a better performance than DTFV, which
                ASAL [45]                  w/o           39.2     -        -
   eval      MLP+kmeans [21]               w/o           35.5     -        -             rely on handcrafted dense trajectories.
              MLP+kmeans ‡                 w/o           37.9    24.6     40.2              Comparison of SSTEN embeddings learned on
                 TAEC                      w/o           48.4    26.0     44.8
                                                                                         different features. When comparing the SSTEN em-
                   Local Hungarian matching on each video (𝐾= max. #gt)
                                                                                         beddings to the performance of the raw features, we see
              LSTM+AL [38]*
              MLP+kmeans ‡
                                           w/o
                                           w/o
                                                         60.6*
                                                         58.0
                                                                  -
                                                                 33.7
                                                                           -
                                                                          49.8
                                                                                         that SSTEN leads to a significant performance gain for
                 TAEC                      w/o           59.7    35.0     54.4           both clustering methods. For DTFV, the performance
                   Local Hungarian matching on each video (𝐾= avg. #gt)                  improvements by SSTEN are MoF 8.5%, IoU 6.0%, F1 8.9%
              TWFINCH [37]*                w/o           71.1*    -        -             with K-means and MoF 15.8%, IoU 6.9%, F1 11.6% with
              MLP+kmeans ‡                 w/o           51.5    22.3     43.4           two-step clustering.
                 TAEC                      w/o           59.7    35.7     54.7
                                                                                            Among the three types of SSTEN embedded features,
                   SSTDA [47]              full          83.2      -        -
                  MSTCN++ [8]              full          83.7      -        -            I3D has slightly better IoU and F1 scores while DTFV
                   HASR [54]               full          83.9      -        -            leads to the best MoF scores for both, K-means and the
                    ASRF [55]              full          84.5      -        -
                  ASFormer [49]            full          85.6      -        -            two-step clustering. Overall, the SSTEN embeddings
                  NNViterbi [15]        weak             49.4      -        -
                                                                                         learned from these two features perform comparably. We
                   CDFL [56]            weak             54.7      -        -            conduct the following experiments using DTFV, which
              Global Hungarian matching on all videos (𝐾= max. #gt)                      is also used in related works.
   mid        UNet+MLP [22]                w/o           24.2     -        -
                ASAL [45]                  w/o           34.4     -        -
             MLP+kmeans [21]               w/o           30.2     -        -             4.5. Impact of Loss Terms on Clustering
              MLP+kmeans ‡                 w/o           29.1    15.7     23.4
                 TAEC                      w/o           26.6    14.9     23.4           To evaluate the impact of the two loss terms in Eq. (1), we
Local Hungarian matching on each video (K = max. #gt)
MLP+kmeans ‡      w/o    55.6    29.6    39.6
TAEC              w/o    50.2    29.4    40.3

Local Hungarian matching on each video (K = avg. #gt)
TWFINCH [37]*     w/o    66.5*    -       -
MLP+kmeans ‡      w/o    53.4    28.1    39.0
TAEC              w/o    51.9    30.2    41.9

The features without embedding are reduced to 32 dimensions via PCA. We conduct the experiments on Breakfast and report the results with and without our SSTEN embedding in Table 4.

Table 4
Comparison of features with and without the SSTEN embedding, combined with the two clustering methods, on Breakfast (in %).

                             K-means                  Two-step clustering
Feature    Embedding    MoF     IoU     F1        MoF     IoU     F1
AlexNet                 25.9    11.3    22.3      33.7    10.7    20.2
I3D          w/o        33.4    14.4    25.6      37.7    14.3    26.1
DTFV                    30.8    11.8    23.0      34.5    12.1    22.0
AlexNet                 33.0    14.2    27.0      39.1    16.1    30.7
I3D         SSTEN       37.9    18.7    33.3      45.2    20.5    35.1
DTFV                    39.3    17.8    31.9      50.3    19.0    33.6
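For reference, the PCA reduction of the raw (non-embedded) features mentioned above can be sketched in a few lines of numpy. The function name pca_reduce and the toy feature matrix are our own illustrations, not part of the released code:

```python
import numpy as np

def pca_reduce(features, dim=32):
    """Project frame features (T x D) onto their first `dim` principal components."""
    centered = features - features.mean(axis=0, keepdims=True)
    # SVD of the centered data; rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:dim].T

# Toy example: 500 frames of 64-dim features reduced to 32 dimensions.
frames = np.random.randn(500, 64)
reduced = pca_reduce(frames, dim=32)
print(reduced.shape)  # (500, 32)
```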



We plot the quantitative segmentation results of SSTEN with both K-means and the two-step clustering w.r.t. different reconstruction loss coefficients λ in Fig. 5. In general, two-step clustering leads to better performance than K-means for almost all λ values (except for the case of only the reconstruction loss). With decreasing λ, the relative time prediction loss has an increasing impact and the embedded features have better global temporal consistency, which explains the increasing IoU and F1 scores. However, at extremely small λ values, the embedded features overfit to the relative time prediction task, which results in saturated IoU and F1 scores and a significant drop in MoF for both K-means and two-step clustering.

Figure 5: Segmentation performance of both clustering methods on SSTEN embeddings with different λ on Breakfast.

To intuitively illustrate the impact of the loss terms on the two-step clustering, we plot the similarity matrices for SSTEN embeddings trained with three different λ in Fig. 6. Here, we look at the similarity matrices with the temporal Gaussian kernel (bottom row). Intuitively, the similarity matrix with a clear diagonal block structure (Fig. 6(a2)), which results from an appropriate ratio between the reconstruction loss and the relative time prediction loss (λ = 0.002), leads to the best segmentation performance. When λ becomes larger (e.g., λ = 0.01), the reconstruction loss has a larger impact and the diagonal block structure (Fig. 6(b2)) fades. Therefore, the performance of the embedded features trained with λ = 0.005, λ = 0.01 and only the reconstruction loss degrades successively. On the other hand, for extremely small λ values (e.g., λ = 0.0005), the diagonal block structure (Fig. 6(c2)) becomes noisy due to overfitting on the relative time prediction.

Figure 6: Frame-to-frame similarity matrices of SSTEN embeddings for the same Breakfast video. Columns show the similarity matrices for different λ, while the rows show results without (top) and with (bottom) the temporal Gaussian kernel.

Therefore, both the reconstruction and the relative timestamp prediction loss, when combined in an appropriate ratio, are indispensable for learning an effective representation that preserves both the spatial layout and the temporal information.
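As a rough illustration of the role of λ, the sketch below combines the two pretext objectives as a weighted sum with λ on the reconstruction term. The use of mean-squared errors and this particular weighting are assumptions for illustration only; the exact loss is defined in the main manuscript:

```python
import numpy as np

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

def total_loss(recon, frames, t_pred, t_rel, lam=0.002):
    """Assumed weighted sum: relative-time-prediction term plus lam * reconstruction term."""
    return mse(t_pred, t_rel) + lam * mse(recon, frames)

# Toy example: 100 frames, 64-dim features, relative timestamps normalized to [0, 1].
frames = np.random.randn(100, 64)
recon = frames + 0.1 * np.random.randn(100, 64)   # hypothetical decoder output
t_rel = np.linspace(0.0, 1.0, 100)                # normalized timestamps
t_pred = t_rel + 0.05 * np.random.randn(100)      # hypothetical predictions
print(total_loss(recon, frames, t_pred, t_rel, lam=0.002))
```

A smaller λ lets the relative time prediction term dominate, which matches the trend discussed above.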
4.6. Impact of Cluster Assignment

In this ablation study, we evaluate the efficacy of the global cluster assignment. For two-step clustering, we evaluate two strategies for grouping within-video clusters into global clusters: (1) the naïve assignment, for which we order the sub-clusters according to their average timestamp and simply group the k-th sub-clusters of all videos into a global cluster, i.e., the global cluster Z_k = {c_{k,n} | n = 1, ..., N}; and (2) the global cluster assignment, as detailed in Sec. 3.2.2.
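The naïve assignment is simple enough to sketch directly; the global cluster assignment solves a multi-dimensional assignment problem (Sec. 3.2.2 of the main manuscript) and is not reproduced here. The data layout below (a list of centroid/timestamp pairs per video) is a hypothetical simplification:

```python
import numpy as np

def naive_assignment(sub_clusters):
    """Group the k-th sub-cluster (ordered by average timestamp) of every video
    into global cluster Z_k.

    sub_clusters: list over videos; each entry is a list of
                  (centroid, average_timestamp) tuples for that video.
    Returns: list of K global clusters, each a list of (video_idx, centroid).
    """
    num_global = len(sub_clusters[0])
    global_clusters = [[] for _ in range(num_global)]
    for n, clusters in enumerate(sub_clusters):
        # order this video's sub-clusters by their average timestamp
        ordered = sorted(clusters, key=lambda c: c[1])
        for k, (centroid, _) in enumerate(ordered):
            global_clusters[k].append((n, centroid))
    return global_clusters

# Toy example: 3 videos with K = 2 sub-clusters each (2-dim centroids).
rng = np.random.default_rng(0)
videos = [[(rng.standard_normal(2), 0.2), (rng.standard_normal(2), 0.7)] for _ in range(3)]
print([len(z) for z in naive_assignment(videos)])  # [3, 3]
```

As the ablation below shows, ignoring the feature-space distances in this way is exactly what makes the naïve strategy weaker.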
In order to show how the different cluster assignment strategies affect the clustering result, we report both the results of the two-step clustering (before Viterbi decoding) and the final segmentation performance (after Viterbi decoding) on Breakfast and 50 Salads in Table 5. The global cluster assignment outperforms the naïve assignment by a large margin, both for the clustering results and for the final segmentation results, on both datasets. The advantage of the global cluster assignment is even more evident on 50 Salads.

Table 5
Impact of cluster assignment strategies for two-step clustering on SSTEN embeddings on Breakfast and 50 Salads (eval level, i.e., 12 classes) (in %).

                               Clustering                Final
Dataset      Strategy        MoF    IoU    F1        MoF    IoU    F1
Breakfast    global cluster  38.6   13.7   25.9      50.3   19.0   33.6
             naïve           25.1   12.4   23.9      42.3   17.7   32.0
50 Salads    global cluster  45.3   23.0   43.7      48.4   26.0   44.8
             naïve           28.9   16.2   32.9      35.0   22.7   38.0

We illustrate exemplary qualitative results of the clustering and the final segmentation for 3 activities (with 3 videos each) on Breakfast in Fig. 4. For each video, the 4-row group displays the ground truth (1st row), the result with global cluster assignment (2nd row), the result with naïve assignment (3rd row) and the result of MLP+kmeans [21] (4th row). By comparing the 3rd rows of the three cereals videos in Fig. 4, we see that the naïve assignment simply assumes that the sub-clusters at the same temporal position in each video belong to the same global cluster, even though they might not be close to each other in the feature space. On the contrary, the global cluster assignment (the 2nd rows of the cereals videos) yields an optimal assignment with respect to the pairwise distances between sub-clusters, resulting in different orderings of the sub-clusters in each video. Note that on some videos, the global cluster assignment can lead to the same result as the naïve assignment.

5. Conclusion

We proposed a new pipeline for the unsupervised learning of action segmentation. For the feature embedding, we propose a temporal-aware embedding network that performs sequence-to-sequence learning with the pretext tasks of relative timestamp prediction and feature reconstruction. For clustering, we propose a two-step clustering schema, consisting of within-video clustering and cross-video global cluster assignment. The temporal embedding from sequence-to-sequence learning together with two-step clustering proves to be a well-suited combination that considers the sequential nature of frames in both processing steps. Ultimately, we combine the temporal embedding with a frame-to-cluster assignment based on Viterbi decoding, which achieves the unsupervised state-of-the-art on three challenging benchmarks.







References

[1] Z. Wang, Q. She, A. Smolic, Action-net: Multipath excitation for action recognition, in: CVPR, 2021.
[2] C. Feichtenhofer, X3d: Expanding architectures for efficient video recognition, in: CVPR, 2020.
[3] C. Yang, Y. Xu, J. Shi, B. Dai, B. Zhou, Temporal pyramid network for action recognition, in: CVPR, 2020.
[4] M. Xu, C. Zhao, D. S. Rojas, A. Thabet, B. Ghanem, G-tad: Sub-graph localization for temporal action detection, in: CVPR, 2020.
[5] P. Zhao, L. Xie, C. Ju, Y. Zhang, Y. Wang, Q. Tian, Bottom-up temporal action localization with mutual regularization, in: ECCV, 2020.
[6] Y. Bai, Y. Wang, Y. Tong, Y. Yang, Q. Liu, J. Liu, Boundary content graph neural network for temporal action proposal generation, in: ECCV, 2020.
[7] Y. A. Farha, J. Gall, MS-TCN: Multi-stage temporal convolutional network for action segmentation, in: CVPR, 2019.
[8] S. Li, Y. A. Farha, Y. Liu, M.-M. Cheng, J. Gall, MS-TCN++: Multi-stage temporal convolutional network for action segmentation, PAMI (2020).
[9] Z. Wang, Z. Gao, L. Wang, Z. Li, G. Wu, Boundary-aware cascade networks for temporal action segmentation, in: ECCV, 2020.
[10] D. Wang, D. Hu, X. Li, D. Dou, Temporal relational modeling with self-supervision for action segmentation, in: AAAI, 2021.
[11] Y. Huang, Y. Sugano, Y. Sato, Improving action segmentation via graph-based temporal reasoning, in: CVPR, 2020.
[12] C.-Y. Chang, D.-A. Huang, Y. Sui, L. Fei-Fei, J. C. Niebles, D3tw: Discriminative differentiable dynamic time warping for weakly supervised action alignment and segmentation, in: CVPR, 2019.
[13] J. Li, P. Lei, S. Todorovic, Weakly supervised energy-based learning for action segmentation, in: ICCV, 2019.
[14] A. Richard, H. Kuehne, J. Gall, Weakly supervised action learning with RNN based fine-to-coarse modeling, in: CVPR, 2017.
[15] A. Richard, H. Kuehne, A. Iqbal, J. Gall, NeuralNetwork-Viterbi: A framework for weakly supervised video learning, in: CVPR, 2018.
[16] D.-A. Huang, F.-F. Li, J. C. Niebles, Connectionist temporal modeling for weakly supervised action labeling, in: ECCV, 2016.
[17] M. Fayyaz, J. Gall, SCT: Set constrained temporal transformer for set supervised action segmentation, in: CVPR, 2020.
[18] J. Li, S. Todorovic, Set-constrained Viterbi for set-supervised action segmentation, in: CVPR, 2020.
[19] A. Richard, H. Kuehne, J. Gall, Action sets: Weakly supervised action segmentation without ordering constraints, in: CVPR, 2018.
[20] F. Sener, A. Yao, Unsupervised learning and segmentation of complex activities from video, in: CVPR, 2018.
[21] A. Kukleva, H. Kuehne, F. Sener, J. Gall, Unsupervised learning of action classes with continuous temporal embedding, in: CVPR, 2019.
[22] R. G. VidalMata, W. J. Scheirer, H. Kuehne, Joint visual-temporal embedding for unsupervised learning of actions in untrimmed sequences, in: WACV, 2021.
[23] B. L. Bhatnagar, S. Singh, C. Arora, C. V. Jawahar, Unsupervised learning of deep feature representation for clustering egocentric actions, in: IJCAI, 2017.
[24] N. Srivastava, E. Mansimov, R. Salakhudinov, Unsupervised learning of video representations using LSTMs, in: ICML, 2015.
[25] E. L. Denton, V. Birodkar, Unsupervised learning of disentangled representations from video, in: NeurIPS, 2017.
[26] R. Villegas, J. Yang, S. Hong, X. Lin, H. Lee, Decomposing motion and content for natural video sequence prediction, in: ICLR, 2017.
[27] D. Kim, D. Cho, I. S. Kweon, Self-supervised video representation learning with space-time cubic puzzles, in: AAAI, 2019.
[28] H.-Y. Lee, J.-B. Huang, M. Singh, M.-H. Yang, Unsupervised representation learning by sorting sequences, in: ICCV, 2017.
[29] V. Ramanathan, K. Tang, G. Mori, L. Fei-Fei, Learning temporal embeddings for complex video analysis, in: ICCV, 2015.
[30] B. Fernando, E. Gavves, J. M. Oramas, A. Ghodrati, T. Tuytelaars, Modeling video evolution for action recognition, in: CVPR, 2015.
[31] A. Cherian, B. Fernando, M. Harandi, S. Gould, Generalized rank pooling for activity recognition, in: CVPR, 2017.
[32] M. Hoai, F. D. la Torre, Maximum margin temporal clustering, in: Artificial Intelligence and Statistics, 2012.
[33] S. Li, K. Li, Y. Fu, Temporal subspace clustering for human motion segmentation, in: CVPR, 2015.
[34] S. Tierney, J. Gao, Y. Guo, Subspace clustering for sequential data, in: CVPR, 2014.
[35] Y. Zhang, S. Tang, H. Sun, H. Neumann, Human motion parsing by hierarchical dynamic clustering, in: BMVC, 2018.
[36] J.-B. Alayrac, P. Bojanowski, N. Agrawal, J. Sivic, I. Laptev, S. Lacoste-Julien, Unsupervised learning from narrated instruction videos, in: CVPR, 2016.
[37] M. S. Sarfraz, N. Murray, V. Sharma, A. Diba, L. Van Gool, R. Stiefelhagen, Temporally-weighted hierarchical clustering for unsupervised action segmentation, in: CVPR, 2021.
[38] S. N. Aakur, S. Sarkar, A perceptual prediction framework for self supervised event segmentation, in: CVPR, 2019.
[39] L. Zelnik-Manor, P. Perona, Self-tuning spectral clustering, in: NeurIPS, 2005.
[40] W. P. Pierskalla, Letter to the editor—the multi-dimensional assignment problem, Operations Research 16 (1968) 422–431.
[41] H.-J. Bandelt, Y. Crama, F. C. Spieksma, Approximation algorithms for multi-dimensional assignment problems with decomposable costs, Discrete Applied Mathematics 49 (1994) 25–50.
[42] H. Kuehne, A. Arslan, T. Serre, The language of actions: Recovering the syntax and semantics of goal-directed human activities, in: CVPR, 2014.
[43] S. Stein, S. J. McKenna, Combining embedded accelerometers with computer vision for recognizing food preparation activities, in: ACM International Joint Conference on Pervasive and Ubiquitous Computing, 2013.
[44] H. Wang, C. Schmid, Action recognition with improved trajectories, in: ICCV, 2013.
[45] J. Li, S. Todorovic, Action shuffle alternating learning for unsupervised action segmentation, in: CVPR, 2021.
[46] D. Wang, Y. Yuan, Q. Wang, Gated forward refinement network for action segmentation, Neurocomputing 407 (2020) 63–71.
[47] M.-H. Chen, B. Li, Y. Bao, G. AlRegib, Z. Kira, Action segmentation with joint self-supervised temporal domain adaptation, in: CVPR, 2020.
[48] S.-H. Gao, Q. Han, Z.-Y. Li, P. Peng, L. Wang, M.-M. Cheng, Global2local: Efficient structure search for video action segmentation, in: CVPR, 2021.
[49] F. Yi, H. Wen, T. Jiang, Asformer: Transformer for action segmentation, in: BMVC, 2021.
[50] Z. Lu, E. Elhamifar, Weakly-supervised action segmentation and alignment via transcript-aware union-of-subspaces learning, in: ICCV, 2021.
[51] C. Lea, A. Reiter, R. Vidal, G. D. Hager, Segmental spatiotemporal cnns for fine-grained action segmentation, in: ECCV, 2016.
[52] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, G. D. Hager, Temporal convolutional networks for action segmentation and detection, in: CVPR, 2017.
[53] C. Fermüller, F. Wang, Y. Yang, K. Zampogiannis, Y. Zhang, F. Barranco, M. Pfeiffer, Prediction of manipulation actions, International Journal of Computer Vision 126 (2018) 358–374.
[54] H. Ahn, D. Lee, Refining action segmentation with hierarchical video representations, in: ICCV, 2021.
[55] Y. Ishikawa, S. Kasai, Y. Aoki, H. Kataoka, Alleviating over-segmentation errors by detecting action boundaries, in: WACV, 2021.
[56] J. Liu, A. Shahroudy, M. L. Perez, G. Wang, L.-Y. Duan, A. K. Chichung, NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding, PAMI (2019).
[57] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: NeurIPS, 2012.
[58] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: A large-scale hierarchical image database, in: CVPR, 2009.
[59] J. Carreira, A. Zisserman, Quo vadis, action recognition? A new model and the kinetics dataset, in: CVPR, 2017.
[60] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al., The kinetics human action video dataset, arXiv preprint arXiv:1705.06950 (2017).
TAEC: Unsupervised Action Segmentation with
Temporal-Aware Embedding and Clustering
Supplementary


1. Introduction

For additional insights into TAEC, we introduce the background of spectral clustering in Sec. 2.1 and give details of the Viterbi decoding in Sec. 2.2. We perform further ablation studies comparing baseline embeddings and clustering methods (Sec. 3.1), the scaling of the spatio-temporal similarity (Sec. 3.2), the cluster ordering (Sec. 3.3), and the decoding strategies (Sec. 3.4). Finally, we provide more quantitative (Sec. 3.5) and qualitative segmentation results (Sec. 3.6) on the three datasets.

2. Method

2.1. Spectral Clustering

Background information related to Sec. 3.2.1 in the main manuscript: Given the embedded feature sequence e_1, e_2, ..., e_T, we build a frame-to-frame similarity graph G ∈ R^{T×T}, whose edge weight g(i, j), i, j ∈ {1, ..., T}, represents the similarity between frame i and frame j. Grouping the frames into K clusters can be interpreted as a graph partition problem of cutting edges on G, resulting in K subgraphs G_1, G_2, ..., G_K. The normalized cut (Ncut) problem [1] is employed to compute a balanced partition by minimizing the energy

    \mathcal{L}_{cut}(G_1, G_2, ..., G_K) = \frac{1}{2} \sum_{k=1}^{K} \frac{W(G_k, \bar{G}_k)}{\mathrm{vol}(G_k)},   (1)

where W(G_k, \bar{G}_k) is the sum of the edge weights between elements in the subgraph G_k and elements of all the other subgraphs, i.e., the sum of the weights of the edges to be cut, and vol(G_k) is the sum of the weights of the edges within the resulting subgraph G_k. Spectral clustering [2] is a relaxed solution to the NP-hard minimization problem in Eq. (1) and has shown good performance on many graph-based clustering problems, e.g., [3, 4, 5]. Note that while K-means operates on the Euclidean distance in the feature space and assumes convex and isotropic clusters, spectral clustering can find clusters with non-convex boundaries.
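A generic sketch of this relaxation, assuming the common recipe of embedding the frames with the top eigenvectors of the symmetrically normalized affinity and clustering them with K-means; this is illustrative background, not the paper's implementation:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_clustering(affinity, k):
    """Cluster frames from a T x T affinity matrix, following the usual
    relaxation of the normalized cut objective."""
    d = affinity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d + 1e-12)
    # symmetrically normalized affinity D^{-1/2} W D^{-1/2}
    norm_aff = affinity * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # eigenvectors for the k largest eigenvalues (eigh returns ascending order)
    _, eigvecs = np.linalg.eigh(norm_aff)
    embedding = eigvecs[:, -k:]
    # row-normalize and run K-means in the spectral embedding
    embedding /= np.linalg.norm(embedding, axis=1, keepdims=True) + 1e-12
    _, labels = kmeans2(embedding, k, minit='++')
    return labels

# Toy example: two well-separated groups of frames.
rng = np.random.default_rng(0)
frames = np.vstack([rng.normal(0, 0.1, (20, 8)), rng.normal(3, 0.1, (20, 8))])
dists = np.linalg.norm(frames[:, None] - frames[None, :], axis=-1)
affinity = np.exp(-dists ** 2)
print(spectral_clustering(affinity, 2))
```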
                                                                                                                                           = 1(𝑦𝑖,𝑛 = 𝑦𝑖−1,𝑛 ∨ 𝑦𝑖−1,𝑛 → 𝑦𝑖,𝑛 ∈ 𝑂𝑛 ).
2.2. Frame Labeling by Viterbi Decoding

Additional explanations to Sec. 3.3 in the main manuscript: The global cluster assignment delivers the ordered clusters on each video, which are aligned across all videos. To compute the final segmentation, we use the resulting ordering and decode each video into a sequence of K temporally consistent segments. That is, we determine the optimal label sequence ŷ_{1∼T_n,n} = {y_{1,n}, ..., y_{T_n,n}} by re-assigning each frame to one of the temporally ordered clusters.

Given the embedded feature sequence e_{1∼T_n,n} = {e_{1,n}, ..., e_{T_n,n}} and the temporal order of the clusters, we search for the optimal label sequence that maximizes the probability p(y_{1∼T_n,n} | e_{1∼T_n,n}). Following [6], this posterior probability can be factorized into the product of the likelihoods and the probability of a given temporal order, i.e.,

    \hat{y}_{1\sim T_n,n} = \arg\max_{y_{1\sim T_n,n}} p(y_{1\sim T_n,n} \mid \mathbf{e}_{1\sim T_n,n})
                          = \arg\max_{y_{1\sim T_n,n}} \Big\{ \prod_{t=1}^{T_n} p(\mathbf{e}_{t,n} \mid y_{t,n}) \cdot \prod_{t=1}^{T_n} p_n(y_{t,n} \mid y_{1\sim(t-1),n}) \Big\}
                          = \arg\max_{y_{1\sim T_n,n}} \Big\{ \prod_{t=1}^{T_n} p(\mathbf{e}_{t,n} \mid y_{t,n}) \cdot p_n(y_{t,n} \mid y_{t-1,n}) \Big\}   (2)

Here, the likelihood p(e_{t,n} | y_{t,n}) is the probability of a frame embedding e_{t,n} from video n belonging to a cluster. We therefore fit a Gaussian distribution to each global cluster and compute the frame-wise likelihoods with the Gaussian model, i.e.,

    p(\mathbf{x} \mid k) = \mathcal{N}_k(\mathbf{x}; \mu_k, \Sigma_k), \quad k \in \{1, ..., K\}.   (3)

p_n(y_{t,n} | y_{t-1,n}) is the transition probability from label y_{t-1,n} at frame t-1 to label y_{t,n} at frame t, which is defined by the temporal order of the clusters. We denote the set of frame transitions defined by the temporal order of clusters on the n-th video by O_n; e.g., for the temporal order a → b → c → d, O_n = {a → b, b → c, c → d}. The transition probability is binary, i.e.,

    p_n(y_{i,n} \mid y_{i-1,n}) = \mathbb{1}\big(y_{i,n} = y_{i-1,n} \;\vee\; (y_{i-1,n} \rightarrow y_{i,n}) \in O_n\big).   (4)

This means that we allow either a transition to the next cluster according to the temporal order, or we keep the cluster assignment of the previous frame.
Note that in the two-step clustering, we derive the temporal order of the clusters on each video separately, by sorting the clusters of the video according to their average timestamp. Therefore, we have an individual O_n for each video n. On the contrary, in K-means there is one uniform order of the global clusters for all videos, and O_n is thus the same for every video n.

The Viterbi algorithm for solving Eq. (2) proceeds iteratively using dynamic programming, i.e.,

    p(y_{1\sim t,n} \mid \mathbf{e}_{1\sim t,n}) = \max_{y_{t,n}} \big\{ p(y_{1\sim(t-1),n} \mid \mathbf{e}_{1\sim(t-1),n}) \cdot p(\mathbf{e}_{t,n} \mid y_{t,n}) \cdot p(y_{t,n} \mid y_{t-1,n}) \big\}.   (5)

Sequences that do not follow the temporal order are filtered out at an early stage to narrow down the search range for the optimal label sequence. The output of the Viterbi decoding is the optimal cluster label sequence ŷ_{1∼T_n,n}.
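A minimal sketch of this constrained decoding (Eqs. (2)-(5)): per-cluster Gaussian log-likelihoods, a transition model that only allows staying in the current cluster or advancing to the next one in the given order, and dynamic programming with backtracking. Running in log space, forcing the path to start at the first and end at the last cluster of the order, and the toy Gaussians and variable names are our additions, not the paper's implementation:

```python
import numpy as np
from scipy.stats import multivariate_normal

def viterbi_decode(frames, order, means, covs):
    """Decode frames (T x D) into segments that follow `order`, a list of
    cluster ids (e.g. the video-wise temporal order a -> b -> c -> d)."""
    T, K = len(frames), len(order)
    # per-cluster Gaussian log-likelihoods (Eq. 3), T x K
    log_lik = np.stack([multivariate_normal.logpdf(frames, means[c], covs[c])
                        for c in order], axis=1)
    dp = np.full((T, K), -np.inf)
    back = np.zeros((T, K), dtype=int)
    dp[0, 0] = log_lik[0, 0]                 # assumed: the path starts with the first cluster
    for t in range(1, T):
        for k in range(K):
            stay = dp[t - 1, k]              # keep the previous cluster (Eq. 4)
            advance = dp[t - 1, k - 1] if k > 0 else -np.inf   # move to the next cluster in the order
            back[t, k] = k if stay >= advance else k - 1
            dp[t, k] = max(stay, advance) + log_lik[t, k]
    # backtrack from the last cluster so the full order is traversed
    path = [K - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return [order[k] for k in reversed(path)]

# Toy example: 3 clusters with 1-dim features.
rng = np.random.default_rng(0)
frames = np.concatenate([rng.normal(m, 0.2, 30) for m in (0.0, 2.0, 4.0)])[:, None]
means = {c: np.array([m]) for c, m in enumerate((0.0, 2.0, 4.0))}
covs = {c: np.array([[0.04]]) for c in range(3)}
print(viterbi_decode(frames, order=[0, 1, 2], means=means, covs=covs)[:5])  # first frames -> cluster 0
```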
                                                                     five embeddings for the same Breakfast video in Fig. 2.
3. Additional Results                                                The plots show that Rankloss MLP, MLP and AEMLP,
                                                                     which are trained on individual frames, do not expose
3.1. Embedding and clustering                                        an appropriate temporal structure. There are noisy block
                                                                     patterns even in positions far away from the diagonal,
Further, we compare our SSTEN embedding with three                   which results in noisy clusters and thus, leads to erro-
baseline variants (shown in Fig. 1): MLP temporal em-                neous temporal orders and inferior assignment results in
bedding, autoencoder with MLP (AEMLP) and temporal                   the two-step clustering. The least noisy Rankloss MLP
convolutional network (TCN), in combination with the                 has the highest performance among these three. On the
two clustering methods.                                              contrary, TCN and SSTEN embedded features, which
   MLP uses three FC layers for relative timestamp pre-              show a clear diagonal block structure in the similarity
diction. AEMLP uses MLP-based autencoder for both                    graph, achieve a better performance in the two-step clus-
relative timestamp prediction and feature reconstruction.            tering. This verifies that the sequence-to-sequence em-
TCN deploys 𝑄 stacked dilated residual layers only for               bedding learning (TCN and SSTEN) and two-step cluster-
relative timestamp prediction.                                       ing are a well-suited combination to address the sequen-
   Here, we also implement the Rankloss MLP embed-                   tial nature of frames in both processing steps of feature
ding [7] for reference. We report the performance of                 embedding and clustering.
these five embeddings in Table 1.                                       Considering K-means clustering, the merit of having
   Comparison of the five embeddings. We learn the                   a better sequential nature of the embedded features via
five embeddings (Rankloss MLP, MLP, AEMLP, TCN and                   sequence-to-sequence learning can also be seen from
SSTEN) on the DTFV features. Here, the Rankloss MLP                  the higher IoU and F1 scores (TCN: IoU 17.8%/F1 31.3%,
(consisting of two FC layers) is trained with a ranking              vs. SSTEN: 17.8%/31.9%), as these penalize dominating
loss. We use the initialization of uniform segmentation
In contrast to TCN, SSTEN can preserve the spatial layout of the input features due to the feature reconstruction via the autoencoder. By comparing TCN and SSTEN, we see that the SSTEN embedding with feature reconstruction leads to a boost in the MoF score. The marginal improvement of AEMLP over MLP is due to the fact that the MLP structure with only FC layers is not well-suited for feature reconstruction.

Comparison between K-means and two-step clustering. Considering the performance of the five embeddings with the two clustering methods, we see that K-means leads to higher scores on the inferior embeddings (Rankloss MLP, MLP and AEMLP) trained on individual frames, while two-step clustering performs better on the sequence-to-sequence learning-based embeddings (TCN and SSTEN). When combined with the proposed SSTEN embedding, two-step clustering outperforms K-means by a large margin in terms of the MoF score. We also tried applying K-means on each video separately; however, the performance dropped significantly. K-means depends only on the spatial distance and results in oversegmentation, which leads to an erroneous temporal order on each video and thus an inferior global cluster assignment.

3.2. Impact of Scaling in Spatiotemporal Similarity

We perform spectral clustering with the proposed spatiotemporal similarity. Here, we analyze the impact of the scaling factors in the spatial and temporal Gaussian kernels, i.e., σ_spat² and σ_tmp². These adjust the extent to which two frames are considered similar to each other and influence the clustering quality. The experiments are conducted for SSTEN embeddings on Breakfast.

Impact of the scaling of the spatial Gaussian kernel. For local scaling, we set σ_spat² = σ_i·σ_j, where σ_i is the distance from e_i to its m-th nearest neighbor in the feature space. The resulting segmentation performance w.r.t. m is shown in Fig. 3. With m varying in the range of 3 to 20, the IoU and F1 scores remain stable. The best MoF scores are achieved for m ∈ {8, 9}, whereas for other scaling parameters the MoF score drops. Thus, we set m = 9 for all following evaluations.

Figure 3: Segmentation performance of the two-step clustering on SSTEN embeddings (λ = 0.002) with different m (for local scaling) on Breakfast.

For comparison, we also set σ_spat to fixed values (without local scaling) and report the segmentation performance in Table 2. We achieve good results at smaller σ_spat values (0.5 and 0.7). However, with increasing σ_spat the MoF score drops significantly, while there are only minor fluctuations in IoU and F1. Apparently, σ_spat has a large impact on the clustering quality. The local scaling eases the effort of tuning σ_spat by dynamically determining the scaling factor.

Table 2
Segmentation performance of two-step clustering on SSTEN embeddings (λ = 0.002) with respect to a fixed spatial scaling factor σ_spat (without local scaling) on Breakfast (in %).

σ_spat    MoF    IoU    F1
0.5       47.1   18.0   33.4
0.7       48.0   18.5   33.9
1         41.1   19.2   32.9
2         38.4   18.6   32.2
10        35.0   17.9   31.2

Impact of the scaling of the temporal Gaussian kernel. The temporal Gaussian kernel operates on the temporal distance between frames in a video. With σ_tmp² = 2σ′², the term exp(−(s_i − s_j)² / (2σ′²)) is in the standard form of a Gaussian kernel. We set σ′ = 1/6 so that the 6σ′ range of the temporal Gaussian equals the video length (since the length of each video is normalized to 1 for the relative timestamp prediction). The segmentation performance with respect to σ′ is shown in Table 3. Apparently, σ′ = 1/6 leads to the best result. Here, we also evaluate the case without the temporal Gaussian kernel, which leads to a drop in performance. The impact of the temporal Gaussian kernel on the similarity matrices of SSTEN embeddings can also be seen by comparing the top and bottom rows of Fig. 6 in the main manuscript. For example, by adding the temporal Gaussian kernel, we decrease the similarities in Fig. 6(a1) according to the temporal distance between two frames, which leads to a clearer diagonal block structure in Fig. 6(a2). Thus, we set σ′ = 1/6 for all following evaluations.

Table 3
Segmentation performance of two-step clustering on SSTEN embeddings (λ = 0.002) with respect to the temporal scaling factor σ′ (σ_tmp² = 2σ′²) on Breakfast (in %).

σ′ (σ_tmp² = 2σ′²)       MoF    IoU    F1
∞ (w/o tmp. Gauss)       41.5   16.5   30.6
1/3                      43.5   16.9   31.3
1/6                      50.3   19.0   33.6
1/12                     44.3   18.5   34.1
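Putting the two kernels together, the sketch below computes the spatiotemporal similarity under the assumption that the spatial kernel with local scaling is exp(−‖e_i − e_j‖² / (σ_i σ_j)) and that the two kernels are combined multiplicatively; the exact construction is given in Sec. 3.2.1 of the main manuscript, so treat this only as an illustration:

```python
import numpy as np

def spatiotemporal_similarity(embeddings, m=9, sigma_t=1.0 / 6.0):
    """Frame-to-frame similarity: a spatial Gaussian with local scaling
    (sigma_i = distance to the m-th nearest neighbour) multiplied by a
    temporal Gaussian over relative timestamps with scale sigma_t."""
    T = len(embeddings)
    dists = np.linalg.norm(embeddings[:, None] - embeddings[None, :], axis=-1)
    # local scaling: sigma_i is the distance from frame i to its m-th nearest neighbour
    sigma = np.sort(dists, axis=1)[:, m]
    spatial = np.exp(-dists ** 2 / (sigma[:, None] * sigma[None, :] + 1e-12))
    # relative timestamps in [0, 1]; 6 * sigma_t covers the whole video for sigma_t = 1/6
    s = np.linspace(0.0, 1.0, T)
    temporal = np.exp(-(s[:, None] - s[None, :]) ** 2 / (2.0 * sigma_t ** 2))
    return spatial * temporal

# Toy example: 100 frames of 16-dim embedded features.
rng = np.random.default_rng(0)
sim = spatiotemporal_similarity(rng.standard_normal((100, 16)))
print(sim.shape, sim[0, 0], sim[0, -1] < sim[0, 1])  # (100, 100) 1.0 True
```

The temporal factor suppresses similarities between temporally distant frames, which is what produces the clearer diagonal block structure discussed above.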
3.3. Impact of Cluster Order

One merit of performing within-video clustering is that we can derive the temporal order of the sub-clusters for each video separately. The video-wise individual order of clusters is used to guide the Viterbi decoding, which breaks the common assumption that clusters follow the same temporal order in all videos. In the following, we verify the efficacy of the derived video-wise order of clusters. We use the same within-video clustering result with global cluster assignment and perform Viterbi decoding with two different temporal cluster orders: (1) video-wise order: the temporal order of the sub-clusters is determined on each video separately; and (2) uniform order: the uniform order is determined by sorting the average timestamps of the global clusters and is then applied to all videos. Table 4 reports the segmentation performance (after Viterbi decoding) with these two orders for our SSTEN embeddings on Breakfast and YTI.

Table 4
Impact of cluster order for two-step clustering on SSTEN embeddings (in %).

                        Breakfast                        YTI
Order             MoF    IoU    F1    Edit        MoF    IoU    F1    Edit
video-wise        50.3   19.0   33.6  42.3        46.6   10.7   29.5  25.5
uniform           53.5   15.7   32.2  33.0        40.7   7.7    25.1  20.3

To measure the correctness of the predicted segment order, we adopt the segmental edit distance (Edit), which is a common metric for supervised action segmentation, e.g., [8, 9, 10, 11]. It penalizes segmentation results that have a different segment order than the ground truth, i.e., it penalizes out-of-order predictions as well as oversegmentation.
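For completeness, a sketch of the segmental edit score as it is commonly computed: collapse the frame-wise labels into segment sequences and normalize a Levenshtein distance. The exact evaluation code may differ in details:

```python
import numpy as np

def to_segments(frame_labels):
    """Collapse a frame-wise label sequence into its segment-level sequence."""
    return [lab for i, lab in enumerate(frame_labels)
            if i == 0 or lab != frame_labels[i - 1]]

def edit_score(pred_frames, gt_frames):
    """Segmental edit score in %, i.e. 100 * (1 - normalized Levenshtein
    distance between the two segment sequences)."""
    p, g = to_segments(pred_frames), to_segments(gt_frames)
    D = np.zeros((len(p) + 1, len(g) + 1))
    D[:, 0] = np.arange(len(p) + 1)
    D[0, :] = np.arange(len(g) + 1)
    for i in range(1, len(p) + 1):
        for j in range(1, len(g) + 1):
            cost = 0 if p[i - 1] == g[j - 1] else 1
            D[i, j] = min(D[i - 1, j] + 1, D[i, j - 1] + 1, D[i - 1, j - 1] + cost)
    return 100.0 * (1.0 - D[-1, -1] / max(len(p), len(g)))

# Out-of-order and over-segmented predictions are both penalized.
gt   = [0] * 10 + [1] * 10 + [2] * 10
pred = [1] * 10 + [0] * 5 + [1] * 5 + [2] * 10
print(edit_score(gt, gt), edit_score(pred, gt))  # 100.0 75.0
```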
From Table 4 we see that our video-wise order clearly outperforms the uniform order, except for MoF on Breakfast. Furthermore, the edit score verifies that our derived video-wise temporal orders are valid.

In our experiments we especially notice that the MoF and IoU scores can act contradictory to each other, e.g., the uniform order results in higher MoF scores (on Breakfast) at the cost of lower IoU scores. MoF tends to overfit on dominant classes (e.g., classes with longer action instances), while IoU is sensitive to underrepresented classes and penalizes segmentation results with dominating segments. Therefore, it is necessary to consider all metrics for evaluation, as a higher MoF score does not always correspond to better performance in practice.
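The following toy computation illustrates this effect: a prediction that collapses everything onto the dominant class still reaches a high MoF, while the mean per-class IoU drops. The metric definitions below follow common conventions and are not taken from the paper's evaluation scripts:

```python
import numpy as np

def mof_and_iou(pred, gt, num_classes):
    """Mean-over-frames accuracy and mean per-class IoU for frame-wise labels
    (assumes predictions are already matched to ground-truth classes)."""
    pred, gt = np.asarray(pred), np.asarray(gt)
    mof = 100.0 * np.mean(pred == gt)
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (gt == c))
        union = np.sum((pred == c) | (gt == c))
        if union > 0:                      # ignore classes absent from both
            ious.append(inter / union)
    return float(mof), float(100.0 * np.mean(ious))

# A dominant class inflates MoF while the missed short class pulls IoU down.
gt   = [0] * 90 + [1] * 10
pred = [0] * 100
print(mof_and_iou(pred, gt, num_classes=2))  # (90.0, 45.0)
```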
3.4. Impact of Decoding Strategies

We compare our approach, which uses Viterbi decoding, with the Mallow model decoding proposed in [7]. The authors propose a Rankloss embedding over all video frames from the same activity with respect to a pseudo ground truth action annotation. The embedded frames of the whole activity set are then clustered, and the likelihood for each frame and each cluster is computed. For the decoding, the authors build a histogram of features with respect to their clusters with a hard assignment and set the length of each action according to the overall amount of features per bin. After that, they apply a Mallow model to sample different orderings for each video with respect to the sampled distribution. The resulting model is a combination of Mallow model sampling and action length estimation based on the frame distribution.

For this experiment, we evaluate the impact of the different decoding strategies on two embeddings: the Rankloss embedding [7] and our SSTEN embedding. Table 5 reports the results of these two embeddings in combination with three decodings: the Mallow model, Viterbi decoding with K-means, and Viterbi decoding with two-step clustering.

Table 5
Comparison of combinations of embeddings and decoding strategies on Breakfast (in %).

                        Rankloss [7] embed.       SSTEN embed.
Decoding              MoF    IoU    F1          MoF    IoU    F1
Mallow [7]            34.7   17.8   31.4        36.4   18.1   31.5
kmeans+Viterbi        35.2   15.6   28.8        39.3   17.8   31.9
two-step.+Viterbi     34.7   13.4   23.7        50.3   19.0   33.6

Following [7], we run 7 iterations for the Rankloss embedding with the Mallow model. In each iteration, the Rankloss embedding is retrained using the segmentation result from the last iteration as pseudo label, and the frame-wise likelihoods and the Mallow model are updated.

Unlike the Mallow model, our Viterbi decoding is a one-iteration procedure: it operates on an embedding that is trained only once. When combining with Viterbi, we train the Rankloss model only once, using the initialized uniform segmentation as a prior. For SSTEN with the Mallow model, we only run one iteration, as we do not need to train SSTEN with pseudo labels iteratively.

Considering the Rankloss results in Table 5, we see that combining it with the Mallow model achieves its highest IoU and F1 scores. This is because, for Viterbi decoding, the Rankloss model trained only once with the uniform initialization as pseudo label lacks a strong temporal prior. Considering SSTEN, our Viterbi decoding with two-step clustering clearly outperforms the Mallow model. With Mallow, the SSTEN embedding achieves competitive IoU and F1 scores but significantly lower MoF. We also tried running the Mallow model on SSTEN embedded features for multiple iterations; however, this resulted in a reduced number of clusters. Thus, we see that an appropriate combination of embedding and decoding strategy is necessary.
To have a closer look into the Viterbi decoding, we visualize the likelihood grids computed from the global clusters, as well as the resulting decoding paths over time, for two videos on Breakfast in Fig. 4. It shows that the decoding, which generates a full sequence of actions, is able to marginalize actions that do not occur in the video by assigning only very few frames to them, while the majority of frames are assigned to the clusters that do occur in the video. Even though the given temporal order constrains the resulting K coherent segments to follow the fixed temporal order, the segments that do not belong in the sequence are marginalized, because the Viterbi algorithm decodes a path that maximizes the posterior probability. Overall, it turns out that the Viterbi decoding constrained by a temporal order performs better than the Mallow model's iterative re-ordering.

Figure 4: Comparison of Viterbi decoding paths on the likelihood grids computed from the global clusters resulting from two-step clustering on the SSTEN embeddings, for two videos on Breakfast ("Making juice" and "Making fried egg"). The warm (red)/cool (blue) colors in the grid indicate high/low likelihoods of a frame belonging to an action class. The decoding assigns most frames to occurring actions while marginalizing actions that do not occur in the sequence by assigning only a few frames.

3.5. Quantitative Segmentation Results

3.5.1. Results of Clustering and Final Segmentation

In order to show the advantage of two-step clustering over K-means when combined with the proposed SSTEN embedding, we report both the clustering results (before Viterbi decoding) and the final segmentation performance (after Viterbi decoding) on Breakfast in Table 6. The proposed two-step clustering leads to superior performance compared to K-means, both for the clustering (before Viterbi decoding) in most metrics and for the final segmentation (after Viterbi decoding).

Table 6
Comparison of combinations of SSTEN and different clustering methods in terms of clustering and final segmentation after Viterbi decoding on Breakfast (in %).

                                   Clustering results        Final results
Embedding    Clustering          MoF    IoU    F1          MoF    IoU    F1
SSTEN        K-means             27.2   13.5   26.3        39.3   17.8   31.9
             two-step.cluster    38.6   13.7   25.9        50.3   19.0   33.6

3.5.2. Segmentation Results On Each Activity

We report the ground truth number of classes and the segmentation performance of MLP with K-means (MLP+kmeans, our reimplementation of [12]) and TAEC for each activity on Breakfast (Table 7), YTI (Table 8) and 50 Salads (Table 9). The evaluation is done with global Hungarian matching on all videos. The number of clusters is set to the maximum number of ground truth classes for each activity (K = max.#gt).

In order to show the advantage of two-step clustering                            salad                8
                                                                                                              MLP+kmeans
                                                                                                                TAEC
                                                                                                                                   34.7
                                                                                                                                   42.0
                                                                                                                                              7.8
                                                                                                                                             15.2
                                                                                                                                                          27.5
                                                                                                                                                          34.0
over K-means, when combined with the proposed SSTEN
                                                                                                              MLP+kmeans           57.4       8.6         19.2
embedding, we report both, the results of clustering (be-                      pancake               14
                                                                                                                TAEC               58.2      19.2         35.8
fore Viterbi decoding) and the final segmentation perfor-                                                     MLP+kmeans           42.9      13.1         25.5
mance (after Viterbi decoding) on Breakfast in Table 6.                           All                 -
                                                                                                                TAEC               50.3      19.0         33.6
We see that the proposed two-step clustering leads to
superior performance than K-means, in terms of both
clustering (before Viterbi decoding) in most metrics, and
in terms of final segmentation (after Viterbi decoding).
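
To make the decoding step concrete, the following is a minimal sketch of Viterbi decoding over a frame-by-cluster log-likelihood grid, as visualized in Figure 4. It assumes a single fixed action order per video and no explicit length or transition model; the function and variable names are ours and not taken from the paper's implementation.

```python
import numpy as np

def viterbi_decode(log_probs, action_order):
    """Monotonically align T frames to an ordered list of actions.

    log_probs:    (T, K) array of frame-wise log-likelihoods for K clusters.
    action_order: cluster indices in the (hypothesized) order in which the
                  actions occur in this video.
    Returns a length-T array of cluster labels.
    """
    T, N = log_probs.shape[0], len(action_order)
    emit = log_probs[:, action_order]      # (T, N) scores per ordered action
    dp = np.full((T, N), -np.inf)
    back = np.zeros((T, N), dtype=int)     # 0 = stayed, 1 = advanced from the previous action
    dp[0, 0] = emit[0, 0]                  # the video must start in the first action
    for t in range(1, T):
        for n in range(N):
            stay = dp[t - 1, n]
            move = dp[t - 1, n - 1] if n > 0 else -np.inf
            if move > stay:
                dp[t, n], back[t, n] = move + emit[t, n], 1
            else:
                dp[t, n], back[t, n] = stay + emit[t, n], 0
    labels = np.empty(T, dtype=int)        # backtrack from the last action at the last frame
    n = N - 1
    for t in range(T - 1, -1, -1):
        labels[t] = action_order[n]
        n -= back[t, n]
    return labels
```

Because every frame must be assigned to some action in the given order, actions that are unlikely throughout a video naturally end up with only a few frames, which is the behavior visible in Figure 4.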
Table 8
Maximum number of ground truth action classes and the segmentation performance of MLP+kmeans (our reimplementation of [12]) and TAEC for the 5 activities on YTI (in %). The number of clusters is set to the maximum number of ground truth classes for each activity (𝐾 = max.#gt).

YouTube Instructions: global Hungarian matching on all videos (𝐾 = max.#gt)

Activity      𝐾    Methods       MoF       IoU       F1        MoF      IoU
                                 w/o bkg   w/o bkg   w/o bkg   w/ bkg   w/ bkg
coffee        10   MLP+kmeans    40.9      10.4      34.2      11.9      9.5
                   TAEC          42.8      10.5      26.6      12.4      9.6
change tire   11   MLP+kmeans    45.9      17.2      34.0      24.7     15.8
                   TAEC          58.6      20.0      37.2      31.5     18.4
jump car      12   MLP+kmeans    30.6       4.5      24.4       5.1      4.1
                   TAEC          34.3       6.2      26.4       5.7      5.8
cpr           7    MLP+kmeans    34.4       9.9      31.2      15.0      8.6
                   TAEC          38.0       7.9      33.3      16.6      6.9
repot         8    MLP+kmeans    29.8       7.2      23.1      10.1      6.4
                   TAEC          35.8       7.1      22.4      12.1      6.3
All           -    MLP+kmeans    39.4       9.9      29.6      14.4      9.7
                   TAEC          46.6      10.7      29.5      17.0     10.5


Table 9
Maximum number of ground truth action classes and the segmentation performance of MLP+kmeans (our reimplementation of [12]) and TAEC for the single activity on 50 Salads (in %). The number of clusters is set to the maximum number of ground truth classes for each activity (𝐾 = max.#gt).

50 Salads: global Hungarian matching on all videos (𝐾 = max.#gt)

Activity   𝐾         Methods       MoF    IoU    F1
salad      eval 12   MLP+kmeans    37.9   24.6   40.2
                     TAEC          48.4   26.0   44.8
           mid 19    MLP+kmeans    29.1   15.7   23.4
                     TAEC          26.6   14.9   23.4


References

 [1] J. Shi, J. Malik, Normalized cuts and image segmentation, PAMI 22 (2000) 888–905.
 [2] U. von Luxburg, A tutorial on spectral clustering, Statistics and Computing 17 (2007) 395–416.
 [3] H.-C. Huang, Y.-Y. Chuang, C.-S. Chen, Affinity aggregation for spectral clustering, in: CVPR, 2012.
 [4] Z. Li, J. Chen, Superpixel segmentation using linear spectral clustering, in: CVPR, 2015.
 [5] R. Liu, H. Zhang, Segmentation of 3D meshes through spectral clustering, in: Pacific Conference on Computer Graphics and Applications, 2004.
 [6] A. Richard, H. Kuehne, A. Iqbal, J. Gall, NeuralNetwork-Viterbi: A framework for weakly supervised video learning, in: CVPR, 2018.
 [7] F. Sener, A. Yao, Unsupervised learning and segmentation of complex activities from video, in: CVPR, 2018.
 [8] Y. A. Farha, J. Gall, MS-TCN: Multi-stage temporal convolutional network for action segmentation, in: CVPR, 2019.
 [9] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, G. D. Hager, Temporal convolutional networks for action segmentation and detection, in: CVPR, 2017.
[10] V. I. Morariu, L. S. Davis, Multi-agent event recognition in structured scenarios, in: CVPR, 2011.
[11] H. S. Koppula, R. Gupta, A. Saxena, Learning human activities and object affordances from RGB-D videos, The International Journal of Robotics Research 32 (2013) 951–970.
[12] A. Kukleva, H. Kuehne, F. Sener, J. Gall, Unsupervised learning of action classes with continuous temporal embedding, in: CVPR, 2019.




3.6. Qualitative Segmentation Results
We show the qualitative results of clustering and final segmentation for 7 composite activities: making cereals (Fig. 5), making milk (Fig. 6), making juice (Fig. 7), making fried egg (Fig. 8), and making pancake (Fig. 9) on Breakfast, changing tire (Fig. 10) on YTI, and making salad (Fig. 11) on 50 Salads (eval level, 12 classes). The mapping between cluster labels and ground truth classes is done with global Hungarian matching on all videos. The number of clusters is set to the maximum number of ground truth classes for each activity (𝐾 = max.#gt).
For each activity, we visualize the results of 10 videos. For each video, the 3-row group displays the ground truth (1st row), TAEC (2nd row), and MLP+kmeans [12] (3rd row).
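
The mapping described above can be made explicit with a small amount of code. The following is a minimal sketch, using SciPy's linear_sum_assignment, of global Hungarian matching over all videos of an activity together with the resulting MoF score; the helper name is ours, and IoU and F1 would be computed from the same mapped labels.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def global_hungarian_mof(pred_videos, gt_videos, n_clusters, n_classes):
    """Match cluster labels to ground-truth classes with one global assignment
    over all videos of an activity and return the mapping and MoF accuracy.

    pred_videos / gt_videos: lists of 1-D integer arrays, one pair per video.
    """
    pred = np.concatenate(pred_videos)
    gt = np.concatenate(gt_videos)
    # co-occurrence matrix: number of frames where cluster k overlaps class c
    overlap = np.zeros((n_clusters, n_classes), dtype=np.int64)
    np.add.at(overlap, (pred, gt), 1)
    # Hungarian matching maximizes the total overlap (hence the negation)
    rows, cols = linear_sum_assignment(-overlap)
    mapping = dict(zip(rows, cols))            # cluster id -> class id
    mapped = np.array([mapping.get(k, -1) for k in pred])
    mof = float((mapped == gt).mean())         # mean over frames of all videos
    return mapping, mof
```

Matching once on the pooled frames of all videos of an activity is what "global Hungarian matching on all videos" refers to in the table and figure captions.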
[Figure 5 panels: Making cereals; Clustering result / Final result for vid 1 to vid 10; color legend: empty, take bowl, pour cereals, pour milk, stir milk]

Figure 5: Qualitative results of clustering and final segmentation of 10 making cereals videos on Breakfast. For each video, the 3-row group displays the ground truth (1st row), TAEC (2nd row), and MLP+kmeans [12] (3rd row). The mapping between cluster labels and ground truth classes is done with global Hungarian matching on all videos. The number of clusters is set to the maximum number of ground truth classes for each activity (𝐾 = max.#gt).
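
As a side note, 3-row groups like those in Figures 5 to 11 can be reproduced from frame-wise label arrays with a few lines of matplotlib; the sketch below is only illustrative, and the color map and figure size are arbitrary choices of ours.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_segmentation_rows(rows, n_classes,
                           row_names=("ground truth", "TAEC", "MLP+kmeans")):
    """Render frame-wise label sequences as colored horizontal bars, one row per sequence."""
    fig, ax = plt.subplots(figsize=(8, 0.4 * len(rows)))
    # fixed color range keeps class colors consistent across videos
    ax.imshow(np.stack(rows), aspect="auto", interpolation="nearest",
              cmap="tab20", vmin=0, vmax=n_classes - 1)
    ax.set_yticks(range(len(rows)))
    ax.set_yticklabels(row_names[:len(rows)])
    ax.set_xlabel("frame")
    return fig
```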




[Figure 6 panels: Making milk; Clustering result / Final result for vid 1 to vid 10; color legend: empty, spoon powder, pour milk, stir milk, take cup]

Figure 6: Qualitative results for 10 making milk videos on Breakfast.




[Figure 7 panels: Making juice; Clustering result / Final result for vid 1 to vid 10; color legend: empty, cut orange, squeeze orange, take glass, take squeezer, take knife, take plate, pour juice]

Figure 7: Qualitative results for 10 making juice videos on Breakfast.
[Figure 8 panels: Making fried egg; Clustering result / Final result for vid 1 to vid 10; color legend: empty, pour oil, take plate, take eggs, fry egg, butter pan, add salt, crack egg, put egg2plate]

Figure 8: Qualitative results for 10 making fried egg videos on Breakfast.




[Figure 9 panels: Making pancake; Clustering result / Final result for vid 1 to vid 10; color legend: empty, pour flour, put pancake2plate, fry pancake, pour dough2pan, stir dough, spoon flour, butter pan, take eggs, take plate, crack egg, pour oil, take bowl, pour milk]

Figure 9: Qualitative results for 10 making pancake videos on Breakfast.




[Figure 10 panels: Changing tire; Clustering result / Final result for vid 1 to vid 10; color legend: bkg, get things out, start loose, jack up, unscrew wheel1, break on, tight wheel, put wheel, unscrew wheel2, screw wheel, jack down, put things back]

Figure 10: Qualitative results for 10 changing tire videos of the YouTube Instructions dataset. Similar to the Breakfast illustrations, for each video the 3-row group shows the ground truth (1st row), TAEC (2nd row), and MLP+kmeans [12] (3rd row).
[Figure 11 panels: Making salad; Clustering result / Final result for vid 1 to vid 10; color legend: action start, add dressing, add oil, add pepper, cut lettuce, serve salad, add vinegar, mix dressing, mix ingredients, peel cucumber, place lettuce, action end]

Figure 11: Qualitative results for 10 making salad videos on the 50 Salads dataset (eval level, 12 action classes). Again, for each video the 3-row group shows the ground truth (1st row), TAEC (2nd row), and MLP+kmeans [12] (3rd row).