TAEC: Unsupervised Action Segmentation with
Temporal-Aware Embedding and Clustering
Wei Lin 1,2, Anna Kukleva 3, Horst Possegger 1, Hilde Kuehne 4 and Horst Bischof 1
1 Institute of Computer Graphics and Vision, Graz University of Technology, Austria
2 Christian Doppler Laboratory for Semantic 3D Computer Vision, Austria
3 Max-Planck-Institute for Informatics, Germany
4 Goethe University Frankfurt, Germany

26th Computer Vision Winter Workshop, Robert Sablatnig and Florian Kleber (eds.), Krems, Lower Austria, Austria, Feb. 15-17, 2023
wei.lin@icg.tugraz.at (W. Lin); akukleva@mpi-inf.mpg.de (A. Kukleva); possegger@icg.tugraz.at (H. Possegger); kuehne@uni-frankfurt.de (H. Kuehne); bischof@icg.tugraz.at (H. Bischof)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)


Abstract
Temporal action segmentation in untrimmed videos has gained increased attention recently. However, annotating action classes and frame-wise boundaries is extremely time consuming and cost intensive, especially on large-scale datasets. To address this issue, we propose an unsupervised approach for learning action classes from untrimmed video sequences. In particular, we propose a temporal embedding network that combines relative time prediction, feature reconstruction, and sequence-to-sequence learning, to preserve the spatial layout and sequential nature of the video features. A two-step clustering pipeline on these embedded feature representations then allows us to enforce temporal consistency within, as well as across videos. Based on the identified clusters, we decode the video into coherent temporal segments that correspond to semantically meaningful action classes. Our evaluation on three challenging datasets shows the impact of each component and, furthermore, demonstrates our state-of-the-art unsupervised action segmentation results.

Keywords
Unsupervised learning, unsupervised clustering, action segmentation



1. Introduction

Action recognition has seen tremendous success in recent years, especially in the context of short video clip classification [1, 2, 3], action detection [4, 5, 6], and temporal action segmentation [7, 8, 9, 10]. The top-performing methods for temporal action segmentation, however, require frame-wise annotations, which is expensive and impractical for large-scale datasets [7, 8, 9, 11]. Consequently, a large body of research focuses on weakly-supervised approaches where only an ordered list [12, 13, 14, 15, 16] or an unordered set [17, 18, 19] of action labels is needed. These approaches assume that the actions that occur in each training video are known, and sometimes even require their exact ordering. Acquiring such ordered action lists, however, can still be time consuming or even infeasible. For applications like indexing large video datasets or human behavior analysis in neuroscience or medicine, it is often unclear what actions should be annotated. In these cases, it is necessary to automatically discover and identify recurring actions in large video datasets.

To address this problem, Sener and Yao [20] proposed the task of unsupervised action segmentation to identify patterns of recurring actions in long, untrimmed video sequences that correspond to semantically meaningful action classes. Recent approaches for unsupervised action segmentation, e.g. [20, 21, 22], focus on three aspects: (1) finding a suitable embedding space for the video data, (2) identifying clusters of temporal segments across a large amount of videos, and (3) parsing the input videos given the respective feature embedding and cluster information. Sener and Yao [20] tackle the action segmentation task with a linear embedding and Mallow decoding, while other approaches follow the pipeline of an MLP [21] or U-Net embedding [22], K-means clustering, and Viterbi decoding. However, these methods do not fully leverage the temporal relationships of frames within a video, either neglecting this information when learning the embedding [20, 22] or when clustering [21, 22].

Since temporal consistency is essential for all steps of the action segmentation pipeline, we propose TAEC, an approach that considers the sequential nature of frames in a video for both, embedding learning and clustering. The main steps of our approach are illustrated in Fig. 1. Specifically, we first propose a sequence-to-sequence temporal embedding network (SSTEN) that combines pretext tasks of relative timestamp prediction and autoencoder feature reconstruction. While the autoencoder reconstruction retains the feature layout, the relative timestamp prediction encodes the relative temporal information within a video. Sequence-to-sequence learning enables the embedding of spatial layout and temporal information on a complete video sequence.

To cluster the embedded features, we propose a temporal-aware two-step clustering approach that consists of a within-video clustering and a cross-video global cluster assignment.


[Figure 1 shows the TAEC pipeline on three videos of "Making coffee": (a) feature embedding of each video, (b) within-video clustering (C_k,n denotes the k-th within-video cluster in video n), (c) cross-video global cluster assignment into global clusters, which also defines the cluster ordering per video, and (d) Viterbi decoding.]

Figure 1: Pipeline of TAEC. We compute the embedded features with the sequence-to-sequence temporal embedding network (SSTEN). Thereupon, we perform a within-video clustering on each video individually and apply a cross-video global cluster assignment to group the within-video clusters into global clusters. The global cluster assignment also defines the ordering of the clusters in each video. Finally, we use Viterbi decoding to estimate temporally coherent segments for each video.



Specifically, we perform clustering within each video, with a spatio-temporal similarity among frames. Then we conduct global cluster assignment to group the clusters across videos. The global cluster assignment defines the ordering of the clusters for each video. In this way, we overcome the unrealistic assumption that actions of an activity always follow the same temporal order. Such an assumption is commonly used in related works, e.g. [21, 22]. For instance, in the activity of making coffee, a unified temporal order between actions such as adding milk and adding sugar is assumed for all videos of making coffee, whereas our approach can handle changes of the action order in different videos. After assigning all within-video clusters to a set of global clusters, we perform Viterbi decoding to obtain a segmentation of temporally coherent segments.

Our contributions can be summarized as follows:

  • We design a sequence-to-sequence temporal embedding network (SSTEN), which combines relative timestamp prediction, autoencoder reconstruction and sequence-to-sequence learning.
  • We propose a within-video clustering with a novel spatio-temporal similarity formulation among frames.
  • We propose a cross-video global cluster assignment to group within-video clusters across videos into global clusters, which also overcomes the assumption that in all videos of an activity, actions follow the same temporal order.


2. Related Work

Unsupervised learning of video representations is commonly performed via pretext tasks, such as reconstruction [23, 24], future frame prediction [22, 25, 26], and recognition of frame orders [27, 28, 29, 30, 31]. For instance, Srivastava et al. [24] exploit an LSTM-based autoencoder for learning video representations. Villegas et al. [26] and Denton and Birodkar [25] employed two encoders to generate feature representations of content and motion. The temporal order of frames or small chunks is utilized as a self-supervision signal for representation learning on short video clips in [27] and [28]. Inspired by these approaches, we employ two self-supervision tasks: feature reconstruction and relative time prediction.

Clustering of temporal sequences has been explored for parsing human motions [32, 33, 34, 35]. While Zhang et al. [35] proposed a hierarchical dynamic clustering framework, Li et al. [33] and Tierney et al. [34] explored temporal subspace clustering to segment human motion data. In contrast to unsupervised action segmentation, these methods are applied on each temporal sequence individually and do not consider the association among sequences. Instead, we propose a cross-video global cluster assignment to group within-video clusters across different videos into global clusters.

Unsupervised action segmentation on fine-grained activities has been addressed by recent work that focuses either on representation learning [20, 22, 36] or on the clustering step [23]. However, the temporal information is neglected in at least one of these two steps. For representation learning, Sener and Yao [20] construct a feature embedding by learning a linear mapping from visual features to a latent space with a ranking loss. However, the linear model trained with individual frames does not consider the temporal association between frames. VidalMata et al. [22] employ a U-Net trained on individual frames for future frame prediction. Predicting for one or a few steps ahead only requires temporal relations within a small temporal window. Instead, we propose to learn a representation by predicting the complete sequence of relative timestamps to encode the long-range temporal information.







For the clustering step, related works [22, 23] neglect temporal consistency of frames within a video. Instead, we apply within-video clustering on each video with a proposed similarity formulation that considers both spatial and temporal distances.

Two recent approaches perform clustering [37] or cluster-agnostic boundary detection [38] on each video separately, without identifying clusters or segments across videos. [37] solves a task similar to human motion parsing and evaluates the segmentation for each video individually. [38] only detects boundaries of category-agnostic segments, and does not identify if some segments within a video or across videos are of the same category. On the contrary, our segments on all videos are category-aware as they are aligned globally across videos by our global cluster assignment.

[Figure 2 shows the SSTEN architecture: stacked encoder and decoder stages built from dilated residual layers (dilated 1D convolution + ReLU), with the stage output concatenated with the predicted relative timestamps.]

Figure 2: Architecture of SSTEN: We stack 2 stages of encoder and decoder for sequence-to-sequence feature reconstruction. Each stage consists of Q dilated residual layers with dilated temporal convolution. The intermediate representation in the encoder is used for relative time prediction (red block).

3. Temporal-Aware Embedding and Clustering (TAEC)

We address unsupervised action segmentation as illustrated in Fig. 1. First, we learn a suitable feature embedding (Sec. 3.1). We then perform within-video clustering on each video (Sec. 3.2.1), and group the within-video clusters into global clusters (Sec. 3.2.2). Finally, we compute temporally coherent segments on each video using Viterbi decoding (Sec. 3.3).

3.1. SSTEN: Sequence-to-Sequence Temporal Embedding Network

To learn a latent representation for temporal sequences, we adopt a sequence-to-sequence autoencoder. Inspired by the multi-stage temporal convolutional network [7], we use a concatenation of two stages for both encoder and decoder, as shown in Fig. 2. Given a set {X_n}_{n=1..N} of N videos, where each video X_n = {x_{t,n}}_{t=1..T_n} has T_n frames, the outputs are reconstructed frame features {x̂_{t,n}}_{t=1..T_n}. The embedded features are the hidden representation {e_{t,n}}_{t=1..T_n}.

Every encoder and decoder stage consists of 1×1 convolution layers for dimension adjustment (Fig. 2, blue) and Q dilated residual layers (green), each containing a dilated temporal 1D convolution. Since no fully connected layers are employed, sequences of variable lengths can be processed seamlessly. The dilation rate at the q-th layer is 2^(q−1). By stacking dilated residual layers, the temporal receptive field increases exponentially: the receptive field of the q-th layer is 1 + (r − 1) × (2^q − 1), where r is the kernel size. Therefore, each frame in the hidden representation has a long temporal dependency on the input video. In each encoder stage, we use a 1×1 convolution layer (in red) to predict the frame-wise relative timestamps s_{t,n} = t/T_n. At the end of each encoder stage, the hidden representation is a concatenation (in yellow) of the features from the dilated residual layers and the predicted relative timestamps. The training loss is

    ℒ = 𝜆 ∑_{n=1}^{N} ∑_{t=1}^{T_n} ‖x_{t,n} − x̂_{t,n}‖₂²  +  ∑_{p∈{1,2}} ∑_{n=1}^{N} ∑_{t=1}^{T_n} (s_{t,n} − ŝ_{p,t,n})²,    (1)

where the coefficient 𝜆 balances the two terms and ŝ_{p,t,n} is the relative timestamp predicted by the p-th encoder stage. The pretext tasks of reconstruction and relative timestamp prediction encode both the spatial distribution and the global temporal information into the embedded features. We compare SSTEN with several baseline embedding networks in the supplementary.
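As a concrete illustration of Eq. (1), the following is a minimal PyTorch sketch of the combined reconstruction and relative-time objective. The function name, tensor layout, and the list of per-stage time predictions are assumptions made for illustration; they do not reproduce the authors' implementation.

```python
# Minimal sketch of the SSTEN training objective of Eq. (1), assuming the
# network returns reconstructed features and one relative-time prediction per
# encoder stage. Shapes: features (T, D), time predictions (T,).
import torch

def ssten_loss(x, recon, stage_time_preds, lam=0.002):
    """x, recon: (T, D) input and reconstructed frame features of one video.
    stage_time_preds: list of (T,) relative-time predictions, one per encoder stage.
    lam: weight of the reconstruction term (lambda in Eq. (1))."""
    T = x.shape[0]
    # ground-truth relative timestamps s_t = t / T
    s = torch.arange(1, T + 1, dtype=x.dtype, device=x.device) / T
    recon_term = ((x - recon) ** 2).sum(dim=1).sum()           # sum_t ||x_t - x_hat_t||^2
    time_term = sum(((s - s_hat) ** 2).sum() for s_hat in stage_time_preds)
    return lam * recon_term + time_term

# toy usage with random tensors
if __name__ == "__main__":
    T, D = 100, 64
    x = torch.randn(T, D)
    loss = ssten_loss(x, x + 0.1 * torch.randn(T, D), [torch.rand(T), torch.rand(T)])
    print(loss.item())
```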
3.2. Two-Step Clustering

After learning the feature embedding, we group the embedded features into K clusters by a within-video clustering and a cross-video global cluster assignment.

3.2.1. Within-Video Clustering

We perform spectral clustering on frames within each video (detailed description in the supplementary). Given the embedded feature sequence [e₁, e₂, ..., e_T] (for ease of notation, we omit the video index n), we build a frame-to-frame similarity matrix G ∈ ℝ^(T×T). The entries g(i,j), i,j ∈ {1, ..., T}, represent the similarity between frame i and frame j. To consider both the spatial and temporal distance of features, we propose to measure the similarity by the product of two Gaussian kernels

    g(i,j) = exp(−‖e_i − e_j‖₂² / σ_spat²) · exp(−(s_i − s_j)² / σ_tmp²),    (2)

where s_i, s_j are the corresponding relative timestamps of frames i, j and σ_spat, σ_tmp are the scaling factors for the spatial and temporal Gaussian kernels.
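A minimal NumPy/scikit-learn sketch of this within-video step is given below, assuming fixed kernel widths sigma_spat and sigma_tmp (the locally scaled variant is sketched after the next paragraph). Using SpectralClustering with a precomputed affinity is an illustrative choice, not necessarily the authors' exact implementation.

```python
# Sketch: spatio-temporal affinity of Eq. (2) + spectral clustering of one video.
# Assumes fixed sigma_spat / sigma_tmp; see the local-scaling variant below.
import numpy as np
from sklearn.cluster import SpectralClustering

def within_video_clustering(E, K, sigma_spat=1.0, sigma_tmp=0.1):
    """E: (T, d) embedded frame features of one video. Returns per-frame cluster ids."""
    T = E.shape[0]
    s = np.arange(1, T + 1) / T                               # relative timestamps s_i
    d_spat = ((E[:, None, :] - E[None, :, :]) ** 2).sum(-1)   # ||e_i - e_j||^2
    d_tmp = (s[:, None] - s[None, :]) ** 2                    # (s_i - s_j)^2
    G = np.exp(-d_spat / sigma_spat**2) * np.exp(-d_tmp / sigma_tmp**2)
    labels = SpectralClustering(n_clusters=K,
                                affinity="precomputed").fit_predict(G)
    return labels

# toy usage
if __name__ == "__main__":
    E = np.random.randn(200, 32)
    print(within_video_clustering(E, K=5)[:20])
```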






To avoid manually tuning σ_spat, we use local scaling [39] to estimate σ_spat dynamically. To this end, we replace σ_spat² by σ_i σ_j, where σ_i is the distance from e_i to its m-th nearest neighbor in the embedding space. We provide an ablation study on scaling of the spatio-temporal similarity in the supplementary. Consequently, frames of similar visual content and relative timestamps are encouraged to be grouped into the same cluster.
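The locally scaled affinity can be sketched as follows; the value of m and the use of a NearestNeighbors query are illustrative assumptions.

```python
# Sketch: local scaling of the spatial kernel (sigma_spat^2 -> sigma_i * sigma_j),
# where sigma_i is the distance from e_i to its m-th nearest neighbor.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def locally_scaled_affinity(E, sigma_tmp=0.1, m=7):
    T = E.shape[0]
    s = np.arange(1, T + 1) / T
    d_spat = ((E[:, None, :] - E[None, :, :]) ** 2).sum(-1)
    # sigma_i: Euclidean distance from e_i to its m-th nearest neighbor
    nn = NearestNeighbors(n_neighbors=m + 1).fit(E)           # +1 accounts for the point itself
    dist, _ = nn.kneighbors(E)
    sigma = dist[:, m]
    G = np.exp(-d_spat / (sigma[:, None] * sigma[None, :] + 1e-8)) \
        * np.exp(-((s[:, None] - s[None, :]) ** 2) / sigma_tmp**2)
    return G
```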
[Figure 3 depicts one iteration of the multiple-hub heuristic with one hub vertex set and the non-hub vertex sets.]

Figure 3: In the h-th iteration, h ∈ {1, ..., N}, the hub vertex set V_h is chosen and an assignment is computed by assigning the vertices between V_h and the N−1 non-hub vertex sets. Solid lines denote bipartite matching results between V_h and non-hub vertex sets (step (1)). Dashed lines denote connections between non-hub vertex sets (step (2)).

3.2.2. Cross-Video Global Cluster Assignment

After within-video clustering, we assign the N×K within-video clusters across videos into K global clusters. Every global cluster should contain N within-video clusters, each coming from a different video (cf. Fig. 1). This can be interpreted as an N-dimensional assignment problem [40].

We regard the n-th video V_n = {c_{k,n} | k = 1, ..., K} as a vertex set, where each k-th within-video cluster c_{k,n} is a vertex. We construct an N-partite graph G = (V_1 ∪ V_2 ∪ ... ∪ V_N, E), where E = ⋃_{m<n, m,n∈{1,...,N}} {(c, c′) | c ∈ V_m, c′ ∈ V_n} is the set of edges between within-video clusters across videos. The edge weight w(c, c′) is the distance between the centroids of two within-video clusters c, c′. The solution to the N-dimensional assignment is a partition that divides the graph G into K cliques Z_1, Z_2, ..., Z_K. A clique Z_k, which is a subset of N vertices from N different vertex sets, defines the k-th global cluster. The induced sub-graphs of the cliques Z_1, Z_2, ..., Z_K are complete and disjoint. We denote the edge set of the induced sub-graph of Z_k as E_{Z_k}. The cost of a clique is the sum of pairwise edge weights between the contained vertices. The cost of an assignment solution is the sum of the costs of all the K cliques, i.e.,

    ℒ(Z_1, Z_2, ..., Z_K) = ∑_{k=1}^{K} ∑_{(c,c′)∈E_{Z_k}} w(c, c′).    (3)

In order to solve this NP-hard problem, we employ an iterative multiple-hub heuristic [41]. In each iteration, we choose a hub vertex set V_h = {c_{k,h} | k = 1, ..., K}, and the remaining (N−1) vertex sets are non-hub vertex sets. We compute an assignment solution in each iteration in two steps, as shown in Fig. 3: (1) We first perform (N−1) bipartite matchings between V_h and each of the non-hub vertex sets. (2) Secondly, we determine the edge connections between pairs of non-hub vertex sets: on two non-hub vertex sets V_{h′}, V_{h″}, we connect two vertices c_{i,h′} ∈ V_{h′} and c_{i′,h″} ∈ V_{h″} if c_{i,h′} and c_{i′,h″} are connected to the same vertex c_{k,h} in V_h. After the two steps, every hub vertex c_{k,h}, with k ∈ {1, ..., K}, and all the non-hub vertices connected to c_{k,h} form the k-th clique Z_k. Therefore, the N-partite graph G is partitioned into K complete and disjoint subgraphs.

By iterating over all possible initial hub vertex sets h ∈ {1, ..., N}, we choose the assignment solution f_ĥ which minimizes the assignment cost

    f_ĥ = arg min_{h∈{1,...,N}} ∑_{(c,c′)∈E} f_h(c, c′) · w(c, c′),    (4)

where f_h(c, c′), ∀(c, c′) ∈ E, is a binary indicator function describing the edge connections of the solution obtained with hub vertex set V_h: f_h(c, c′) equals 1 when the two vertices c, c′ are connected. The assignment solution f_ĥ describes the partition which leads to the K global clusters.
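The multiple-hub heuristic of Eqs. (3)-(4) can be sketched with SciPy's Hungarian solver for the bipartite matchings of step (1). Taking cluster centroids as input and the exact cost bookkeeping are simplifying assumptions for illustration, not the authors' released code.

```python
# Sketch of the cross-video global cluster assignment (multiple-hub heuristic):
# each video contributes K within-video cluster centroids; clusters matched to the
# same hub cluster form one global cluster. Uses the Hungarian algorithm
# (scipy.optimize.linear_sum_assignment) for the bipartite matchings of step (1).
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def global_cluster_assignment(centroids):
    """centroids: list of N arrays, each (K, d), the within-video cluster centroids.
    Returns an (N, K) array mapping (video n, local cluster k) -> global cluster id."""
    N, K = len(centroids), centroids[0].shape[0]
    best_cost, best_assign = np.inf, None
    for h in range(N):                                    # try every hub vertex set V_h
        assign = np.zeros((N, K), dtype=int)
        assign[h] = np.arange(K)                          # hub clusters define the global ids
        for n in range(N):                                # step (1): bipartite matching to the hub
            if n == h:
                continue
            rows, cols = linear_sum_assignment(cdist(centroids[n], centroids[h]))
            assign[n, rows] = cols
        # assignment cost of Eq. (3): pairwise edge weights inside every clique
        cost = 0.0
        for m in range(N):
            for n in range(m + 1, N):
                d = cdist(centroids[m], centroids[n])
                same = assign[m][:, None] == assign[n][None, :]
                cost += d[same].sum()
        if cost < best_cost:                              # Eq. (4): keep the cheapest hub choice
            best_cost, best_assign = cost, assign
    return best_assign
```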



3.3. Frame Labeling by Viterbi Decoding

Given the embedded feature sequence e_{1∼T_n,n} of video n, we determine the optimal label sequence ŷ_{1∼T_n,n}. The posterior probability can be factorized into the product of likelihoods and the probability of a given temporal order, i.e.,

    ŷ_{1∼T_n,n} = arg max_{y_{1∼T_n,n}} p(y_{1∼T_n,n} | e_{1∼T_n,n}) = arg max_{y_{1∼T_n,n}} { ∏_{i=1}^{T_n} p_n(e_{i,n} | y_{i,n}) · ∏_{i=1}^{T_n} p_n(y_{i,n} | y_{1∼i−1,n}) }.

We fit a Gaussian model on each global cluster and compute the frame-wise likelihoods, i.e., p_n(x | k) = 𝒩_k(x; μ_k, Σ_k), k ∈ {1, ..., K}. The temporal order constraint is used to limit the search space for the optimal label sequence by filtering out the sequences that do not follow the temporal order.

The related works [21, 22] apply K-means on the frames of all the videos. From the unified clustering they derive only a single temporal order of clusters for all the videos. However, this is an unrealistic assumption due to interchangeable steps in the activities, e.g., pour milk and pour sugar in making coffee. Instead, we can easily derive the temporal order for each video separately. We do so by sorting the within-video clusters according to the average timestamp of the frames in each cluster. The output of the Viterbi decoding is the optimal cluster label sequence ŷ_{1∼T_n,n}. More details are given in the supplementary.
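A compact NumPy sketch of the decoding step is given below, assuming Gaussian log-likelihoods per global cluster and a per-video cluster order derived from average timestamps. Modeling the order constraint as a monotonic left-to-right pass through the ordered clusters is an illustrative simplification of the search-space restriction described above.

```python
# Sketch: per-video temporal order + order-constrained Viterbi decoding.
# States are positions in the per-video cluster order; transitions may only stay
# in the current action or advance to the next one (left-to-right model).
import numpy as np

def video_cluster_order(frame_clusters):
    """Order the clusters of one video by the average timestamp of their frames."""
    ids = np.unique(frame_clusters)
    mean_t = [np.mean(np.where(frame_clusters == k)[0]) for k in ids]
    return ids[np.argsort(mean_t)]

def viterbi_decode(log_lik, order):
    """log_lik: (T, K) frame log-likelihoods under the K global-cluster Gaussians.
    order: sequence of cluster ids in their per-video temporal order."""
    T, S = log_lik.shape[0], len(order)
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0, 0] = log_lik[0, order[0]]
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1, s]
            move = score[t - 1, s - 1] if s > 0 else -np.inf
            back[t, s] = s if stay >= move else s - 1
            score[t, s] = max(stay, move) + log_lik[t, order[s]]
    # backtrack from the last state, i.e. every action in the order is visited once
    states = [S - 1]
    for t in range(T - 1, 0, -1):
        states.append(back[t, states[-1]])
    return np.array([order[s] for s in states[::-1]])
```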



4. Experiments

4.1. Datasets & Evaluation Metrics

We evaluate on Breakfast [42], the YouTube Instructions dataset (YTI) [36] and 50 Salads [43]. Breakfast is comprised of 1712 videos recorded in various kitchens. There are 10 composite activities of breakfast preparation. YTI is composed of 150 videos of 5 activities collected from YouTube. 50 Salads contains 50 videos of people preparing salads. Following [20, 21, 22], we use the dense trajectory Fisher vector features (DTFV) [44] for Breakfast and 50 Salads, and the features provided by Alayrac et al. [36] on YTI. We use the evaluation protocol in [21] and report the performance in three metrics: (1) Mean over Frames (MoF) is the frame-level accuracy over the frames of all the videos. More frequent or longer action instances have a higher impact on the result. (2) Class-wise mean Intersection over Union (cIoU) is the average over the IoU performance for each class and penalizes segmentation results with dominating segments. (3) The F1-score penalizes results with oversegmentation.

4.2. Implementation Details

For our SSTEN, we adapt the number of dilated residual layers Q according to the dataset size: We set Q = 5 for YTI (15k frames per activity subset on average) and Q = 10 for Breakfast (360k) and 50 Salads (577k). The dimension of the hidden representation is set to 32. We set 𝜆 in Eq. (1) to 0.002 (Breakfast), 0.01 (YTI) and 0.005 (50 Salads). For clustering, we follow the protocol of [20, 36] and define the number of clusters K separately for each activity as the maximum number of ground truth classes. The values of K for the three datasets are provided in the supplementary material.

4.3. Comparison with the State-of-the-Art

We compare with unsupervised learning methods, as well as weakly and fully supervised approaches on Breakfast (Table 1), YTI (Table 2) and 50 Salads (Table 3). Most unsupervised segmentation approaches yield cluster-aware segments that are aligned across all the videos [20, 21, 22, 36, 45]. These approaches are evaluated with the global Hungarian matching on all videos, where the mapping between ground truth classes and clusters is performed on all the videos of an activity, which results in one mapping for each activity. The number of clusters K is set to the maximum number of ground truth classes for each activity (i.e., K=max.#gt). We focus on the performance comparison in this setting and follow this setting in all the ablation studies.

Two recent approaches perform clustering (i.e., TWFINCH [37]) or category-agnostic boundary detection (i.e., LSTM+AL [38]) on each video individually, without solving the alignment among different clusters or segments across videos. For a fair comparison, these are evaluated by local Hungarian matching on individual videos, where a per-video best ground-truth-to-cluster-label mapping is determined using the ground truth on each video separately. This results in a separate label mapping for each video. Following [37], we also report results with K set to the average number of actions for each activity (i.e., K=avg.#gt) for a complete comparison.

In Table 1, TAEC achieves strong results in comparison to the unsupervised state-of-the-art and is even comparable to weakly supervised approaches. Although approaches without solving the alignment of clusters across videos inherently lead to better scores in the evaluation settings of the local Hungarian matching, our approach still compares favorably.

We compare qualitative results (with global Hungarian matching) of TAEC and MLP+kmeans [21] on 3 Breakfast activities in Fig. 4. We see that our two-step clustering (the 2nd rows in all clustering result plots) already leads to temporally consistent segments with relatively accurate boundaries of action instances, while K-means (the 4th rows in all clustering result plots) results in serious oversegmentation. The Viterbi decoding further improves the segmentation by suppressing the oversegmentation and the domination of incorrect clusters (the 2nd rows in all final result plots). Moreover, MLP+kmeans [21] follows the constraint of a fixed temporal order of segments on videos of each activity (the 4th rows in all final result plots). In contrast, TAEC yields an individual temporal order for each video (the 2nd rows in all final result plots). Additional qualitative results and evaluation scores are included in the supplemental material.

For the YouTube Instructions dataset, we follow the protocol of [20, 21, 36] and report the results with and without considering background frames. Here, our TAEC outperforms all recent works in almost all of the metrics under all three settings.

50 Salads is a particularly challenging dataset for unsupervised approaches, as each video has a different order of actions and additionally includes many repetitive action instances. In the eval-level of 12 classes, TAEC outperforms all approaches under the global Hungarian matching evaluation and achieves competitive results under the local Hungarian matching. In the challenging mid-level evaluation of 19 classes, the sequential nature of frames is less advantageous. Therefore, MLP+kmeans [21] outperforms TAEC. Generally, in the local matching case, approaches without alignment across videos compare favorably.
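To illustrate the two evaluation settings, the following is a hedged NumPy/SciPy sketch of global vs. local Hungarian matching followed by MoF and class-wise IoU. The exact protocol (background handling, F1 computation) follows [21] and is not reproduced here; the function names are illustrative.

```python
# Sketch: map cluster ids to ground-truth classes via Hungarian matching, either
# globally (one mapping per activity) or locally (one mapping per video), then
# compute MoF and class-wise IoU. F1 and background handling are omitted.
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_mapping(gt, pred, n_classes, n_clusters):
    """Return predictions remapped by the cluster -> class mapping maximizing overlap."""
    overlap = np.zeros((n_clusters, n_classes))
    for c in range(n_clusters):
        for g in range(n_classes):
            overlap[c, g] = np.sum((pred == c) & (gt == g))
    rows, cols = linear_sum_assignment(-overlap)              # maximize frame overlap
    mapping = dict(zip(rows, cols))
    return np.array([mapping.get(c, -1) for c in pred])

def mof_and_ciou(gt, mapped):
    mof = np.mean(gt == mapped)
    ious = []
    for g in np.unique(gt):
        inter = np.sum((gt == g) & (mapped == g))
        union = np.sum((gt == g) | (mapped == g))
        ious.append(inter / union if union > 0 else 0.0)
    return mof, float(np.mean(ious))

def evaluate(videos, n_classes, n_clusters, local=False):
    """videos: list of (gt, pred) per-frame label arrays of one activity."""
    if local:   # local Hungarian matching: one mapping per video
        mapped = [hungarian_mapping(g, p, n_classes, n_clusters) for g, p in videos]
    else:       # global Hungarian matching: one mapping over all videos of the activity
        gt_all = np.concatenate([g for g, _ in videos])
        pr_all = np.concatenate([p for _, p in videos])
        mapped_all = hungarian_mapping(gt_all, pr_all, n_classes, n_clusters)
        mapped, i = [], 0
        for g, _ in videos:                                   # split back per video
            mapped.append(mapped_all[i:i + len(g)]); i += len(g)
    gt_cat = np.concatenate([g for g, _ in videos])
    return mof_and_ciou(gt_cat, np.concatenate(mapped))
```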






[Figure 4 shows color-coded segmentations for three Breakfast activities (Making cereals, Making juice, Making fried egg), with the action legend per activity and, for each of three videos per activity, a clustering-result plot and a final-result plot.]

Figure 4: Qualitative results of clustering and final segmentation (with global Hungarian matching) on 3 activities (3 videos each) on Breakfast. For each video, the 4-row-group displays the ground truth (1st row), TAEC (2nd row), TAEC with naïve assignment (3rd row, quantitative comparison in Sec. 4.6), MLP+kmeans [21] (4th row). More quantitative and qualitative segmentation results are given in the supplementary.



Table 1
Comparison with state-of-the-art approaches on Breakfast (in %). * denotes approach without segment alignment across videos, ‡ denotes our reimplementation, underlined scores are acquired from the author.

Breakfast
Approach              Supervision   MoF     IoU     F1
MSTCN++ [8]           full          67.6     -       -
G-FRNet [46]          full          67.7     -       -
DTGRM [10]            full          68.3     -       -
SSTDA [47]            full          70.3     -       -
BCN [9]               full          70.4     -       -
Global2local [48]     full          70.7     -       -
ASFormer [49]         full          73.5     -       -
NN-Viterbi [15]       weak          43.0     -       -
D3TW [12]             weak          45.7     -       -
TASL [50]             weak          47.8     -       -
CDFL [13]             weak          50.2     -       -
Global Hungarian matching on all videos (K = max. #gt)
Mallow [20]           w/o           34.6     -       -
UNet+MLP [22]         w/o           48.1     -      29.9
ASAL [45]             w/o           52.5     -      37.9
MLP+kmeans [21]       w/o           41.8     -      26.4
MLP+kmeans ‡          w/o           42.9    13.1    25.5
TAEC                  w/o           50.3    19.0    33.6
Local Hungarian matching on each video (K = max. #gt)
LSTM+AL [38]*         w/o           42.9*   46.9*    -
UNet+MLP [22]         w/o           52.2     -       -
TWFINCH [37]*         w/o           57.8*    -       -
MLP+kmeans ‡          w/o           61.2    30.3    35.9
TAEC                  w/o           64.3    41.2    42.6
Local Hungarian matching on each video (K = avg. #gt)
TWFINCH [37]*         w/o           62.7*   42.3*    -
MLP+kmeans ‡          w/o           60.6    27.8    46.4
TAEC                  w/o           62.6    32.3    49.6

Table 2
Comparison with the state-of-the-art on YTI (in %). * denotes approach without segment alignment across videos, ‡ denotes our reimplementation, underlined scores are acquired from the author.

YouTube Instructions
Approach              Supervision   MoF      IoU      F1       MoF     IoU
                                    w/o bg   w/o bg   w/o bg   w bg    w bg
Global Hungarian matching on all videos (K = max. #gt)
Frank-Wolfe [36]      w/o            -        -       24.4      -       -
Mallow [20]           w/o           27.8      -       27.0      -       -
UNet+MLP [22]         w/o           28.9     8.3      29.9      -       -
ASAL [45]             w/o           44.9      -       32.1      -       -
MLP+kmeans [21]       w/o           39.0     9.8      28.3     14.5    9.6
MLP+kmeans ‡          w/o           39.4     9.9      29.6     14.4    9.7
TAEC                  w/o           46.6    10.7      29.5     17.0    10.5
Local Hungarian matching on each video (K = max. #gt)
LSTM+AL [38]*         w/o            -        -       39.7*     -       -
MLP+kmeans ‡          w/o           62.2    21.6      47.0     22.7    21.6
TAEC                  w/o           67.9    23.7      49.4     24.8    23.6
Local Hungarian matching on each video (K = avg. #gt)
TWFINCH [37]*         w/o           56.7*    -        48.2*     -       -
MLP+kmeans ‡          w/o           63.6    20.5      52.3     23.2    20.5
TAEC                  w/o           65.3    20.9      51.0     23.9    20.8


4.4. Embedding and Clustering

We first evaluate our SSTEN embedding in combination with K-means and two-step clustering on three feature types: the AlexNet fc6 features [57] pre-trained on ImageNet [58], I3D features [59] pre-trained on the Kinetics dataset [60], and the precomputed dense trajectory Fisher vectors (DTFV) [44]. We also report results of raw features without any temporal embedding. For a fair comparison, we reduce the dimensions of the three features without embedding to 32 via PCA. We conduct the experiments on Breakfast and report the results with and without our SSTEN embedding in Table 4.
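For the raw-feature baselines, the dimensionality reduction can be done as in the following sketch; the 32-dimensional target matches the SSTEN hidden size, while the scikit-learn call is an illustrative choice rather than the authors' exact setup.

```python
# Sketch: reduce raw frame features (AlexNet fc6 / I3D / DTFV) to 32 dimensions
# with PCA so that they match the dimensionality of the SSTEN embedding.
import numpy as np
from sklearn.decomposition import PCA

def reduce_features(features_per_video, dim=32):
    """features_per_video: list of (T_n, D) arrays. Returns list of (T_n, dim) arrays."""
    stacked = np.concatenate(features_per_video, axis=0)
    pca = PCA(n_components=dim).fit(stacked)
    return [pca.transform(f) for f in features_per_video]
```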






Table 3                                                                                     Comparison of raw features without embedding.
Comparison with the state-of-the-art on 50 Salads (in %). *                              Among the three types of features without temporal
denotes approach without segment alignment across videos,                                embedding, I3D achieves the best performance, while
‡ denotes our reimplementation.                                                          AlexNet features lead to the worst results. AlexNet fea-
                                    50 Salads                                            tures are computed from individual spatial frames. On
   Level            Approach         Supervision         MoF      IoU      F1
                                                                                         the contrary, each frame feature of DTFV and I3D is com-
                   STCNN [51]              full          72.0      -        -
                                                                                         puted based on a chunk of its temporal neighbor frames.
                   EDTCN [52]              full          73.4      -        -            Therefore, the features already carry intrinsic temporal
                     MA [53]               full          88.5      -        -
                                                                                         consistency. Furthermore, the two-stream I3D model can
              Global Hungarian matching on all videos (𝐾= max. #gt)                      leverage both RGB and optical flow. Therefore, I3D fea-
              UNet+MLP [22]                w/o           30.6     -        -             tures achieve a better performance than DTFV, which
                ASAL [45]                  w/o           39.2     -        -
   eval      MLP+kmeans [21]               w/o           35.5     -        -             rely on handcrafted dense trajectories.
              MLP+kmeans ‡                 w/o           37.9    24.6     40.2              Comparison of SSTEN embeddings learned on
                 TAEC                      w/o           48.4    26.0     44.8
                                                                                         different features. When comparing the SSTEN em-
                   Local Hungarian matching on each video (𝐾= max. #gt)
                                                                                         beddings to the performance of the raw features, we see
              LSTM+AL [38]*
              MLP+kmeans ‡
                                           w/o
                                           w/o
                                                         60.6*
                                                         58.0
                                                                  -
                                                                 33.7
                                                                           -
                                                                          49.8
                                                                                         that SSTEN leads to a significant performance gain for
                 TAEC                      w/o           59.7    35.0     54.4           both clustering methods. For DTFV, the performance
                   Local Hungarian matching on each video (𝐾= avg. #gt)                  improvements by SSTEN are MoF 8.5%, IoU 6.0%, F1 8.9%
              TWFINCH [37]*                w/o           71.1*    -        -             with K-means and MoF 15.8%, IoU 6.9%, F1 11.6% with
              MLP+kmeans ‡                 w/o           51.5    22.3     43.4           two-step clustering.
                 TAEC                      w/o           59.7    35.7     54.7
                                                                                            Among the three types of SSTEN embedded features,
                   SSTDA [47]              full          83.2      -        -
                  MSTCN++ [8]              full          83.7      -        -            I3D has slightly better IoU and F1 scores while DTFV
                   HASR [54]               full          83.9      -        -            leads to the best MoF scores for both, K-means and the
                    ASRF [55]              full          84.5      -        -
                  ASFormer [49]            full          85.6      -        -            two-step clustering. Overall, the SSTEN embeddings
                  NNViterbi [15]        weak             49.4      -        -
                                                                                         learned from these two features perform comparably. We
                   CDFL [56]            weak             54.7      -        -            conduct the following experiments using DTFV, which
              Global Hungarian matching on all videos (𝐾= max. #gt)                      is also used in related works.
   mid        UNet+MLP [22]                w/o           24.2     -        -
                ASAL [45]                  w/o           34.4     -        -
             MLP+kmeans [21]               w/o           30.2     -        -             4.5. Impact of Loss Terms on Clustering
              MLP+kmeans ‡                 w/o           29.1    15.7     23.4
                 TAEC                      w/o           26.6    14.9     23.4           To evaluate the impact of the two loss terms in Eq. (1), we
Local Hungarian matching on each video (K = max. #gt)
MLP+kmeans ‡      w/o    55.6    29.6    39.6
TAEC              w/o    50.2    29.4    40.3

Local Hungarian matching on each video (K = avg. #gt)
TWFINCH [37]*     w/o    66.5*    -       -
MLP+kmeans ‡      w/o    53.4    28.1    39.0
TAEC              w/o    51.9    30.2    41.9

The features without embedding are reduced to 32 dimensions via PCA. We conduct the experiments on Breakfast and report the results with and without our SSTEN embedding in Table 4.

Table 4
Comparison of features with and without the SSTEN embedding, combined with the two clustering methods, on Breakfast (in %).

                             K-means                  Two-step clustering
Feature    Embedding    MoF     IoU     F1        MoF     IoU     F1
AlexNet                 25.9    11.3    22.3      33.7    10.7    20.2
I3D          w/o        33.4    14.4    25.6      37.7    14.3    26.1
DTFV                    30.8    11.8    23.0      34.5    12.1    22.0
AlexNet                 33.0    14.2    27.0      39.1    16.1    30.7
I3D         SSTEN       37.9    18.7    33.3      45.2    20.5    35.1
DTFV                    39.3    17.8    31.9      50.3    19.0    33.6
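For reference, the PCA reduction of the raw (non-embedded) features mentioned above can be sketched in a few lines of numpy. The function name pca_reduce and the toy feature matrix are our own illustrations, not part of the released code:

```python
import numpy as np

def pca_reduce(features, dim=32):
    """Project frame features (T x D) onto their first `dim` principal components."""
    centered = features - features.mean(axis=0, keepdims=True)
    # SVD of the centered data; rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:dim].T

# Toy example: 500 frames of 64-dim features reduced to 32 dimensions.
frames = np.random.randn(500, 64)
reduced = pca_reduce(frames, dim=32)
print(reduced.shape)  # (500, 32)
```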



We plot the quantitative segmentation results of SSTEN with both K-means and the two-step clustering w.r.t. different reconstruction loss coefficients λ in Fig. 5. In general, two-step clustering leads to better performance than K-means for almost all λ values (except for the case of only the reconstruction loss). With decreasing λ, the relative time prediction loss has an increasing impact and the embedded features have better global temporal consistency, which explains the increasing IoU and F1 scores. However, at extremely small λ values, the embedded features overfit to the relative time prediction task, which results in saturated IoU and F1 scores and a significant drop in MoF for both K-means and two-step clustering.

Figure 5: Segmentation performance of both clustering methods on SSTEN embeddings with different λ on Breakfast.

To intuitively illustrate the impact of the loss terms on the two-step clustering, we plot the similarity matrices for SSTEN embeddings trained with three different λ in Fig. 6. Here, we look at the similarity matrices with the temporal Gaussian kernel (bottom row). Intuitively, the similarity matrix with a clear diagonal block structure (Fig. 6(a2)), which results from an appropriate ratio between the reconstruction loss and the relative time prediction loss (λ = 0.002), leads to the best segmentation performance. When λ becomes larger (e.g., λ = 0.01), the reconstruction loss has a larger impact and the diagonal block structure (Fig. 6(b2)) fades. Therefore, the performance of the embedded features trained with λ = 0.005, λ = 0.01 and only the reconstruction loss degrades successively. On the other hand, for extremely small λ values (e.g., λ = 0.0005), the diagonal block structure (Fig. 6(c2)) becomes noisy due to overfitting on the relative time prediction.

Figure 6: Frame-to-frame similarity matrices of SSTEN embeddings for the same Breakfast video. Columns show the similarity matrices for different λ, while the rows show results without (top) and with (bottom) the temporal Gaussian kernel.

Therefore, both the reconstruction and the relative timestamp prediction loss, when combined in an appropriate ratio, are indispensable for learning an effective representation that preserves both the spatial layout and the temporal information.
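As a rough illustration of the role of λ, the sketch below combines the two pretext objectives as a weighted sum with λ on the reconstruction term. The use of mean-squared errors and this particular weighting are assumptions for illustration only; the exact loss is defined in the main manuscript:

```python
import numpy as np

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

def total_loss(recon, frames, t_pred, t_rel, lam=0.002):
    """Assumed weighted sum: relative-time-prediction term plus lam * reconstruction term."""
    return mse(t_pred, t_rel) + lam * mse(recon, frames)

# Toy example: 100 frames, 64-dim features, relative timestamps normalized to [0, 1].
frames = np.random.randn(100, 64)
recon = frames + 0.1 * np.random.randn(100, 64)   # hypothetical decoder output
t_rel = np.linspace(0.0, 1.0, 100)                # normalized timestamps
t_pred = t_rel + 0.05 * np.random.randn(100)      # hypothetical predictions
print(total_loss(recon, frames, t_pred, t_rel, lam=0.002))
```

A smaller λ lets the relative time prediction term dominate, which matches the trend discussed above.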
4.6. Impact of Cluster Assignment

In this ablation study, we evaluate the efficacy of the global cluster assignment. For two-step clustering, we evaluate two strategies for grouping within-video clusters into global clusters: (1) the naïve assignment, for which we order the sub-clusters according to their average timestamp and simply group the k-th sub-clusters of all videos into a global cluster, i.e., the global cluster Z_k = {c_{k,n} | n = 1, ..., N}; and (2) the global cluster assignment, as detailed in Sec. 3.2.2.
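The naïve assignment is simple enough to sketch directly; the global cluster assignment solves a multi-dimensional assignment problem (Sec. 3.2.2 of the main manuscript) and is not reproduced here. The data layout below (a list of centroid/timestamp pairs per video) is a hypothetical simplification:

```python
import numpy as np

def naive_assignment(sub_clusters):
    """Group the k-th sub-cluster (ordered by average timestamp) of every video
    into global cluster Z_k.

    sub_clusters: list over videos; each entry is a list of
                  (centroid, average_timestamp) tuples for that video.
    Returns: list of K global clusters, each a list of (video_idx, centroid).
    """
    num_global = len(sub_clusters[0])
    global_clusters = [[] for _ in range(num_global)]
    for n, clusters in enumerate(sub_clusters):
        # order this video's sub-clusters by their average timestamp
        ordered = sorted(clusters, key=lambda c: c[1])
        for k, (centroid, _) in enumerate(ordered):
            global_clusters[k].append((n, centroid))
    return global_clusters

# Toy example: 3 videos with K = 2 sub-clusters each (2-dim centroids).
rng = np.random.default_rng(0)
videos = [[(rng.standard_normal(2), 0.2), (rng.standard_normal(2), 0.7)] for _ in range(3)]
print([len(z) for z in naive_assignment(videos)])  # [3, 3]
```

As the ablation below shows, ignoring the feature-space distances in this way is exactly what makes the naïve strategy weaker.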
In order to show how the different cluster assignment strategies affect the clustering result, we report both the results of the two-step clustering (before Viterbi decoding) and the final segmentation performance (after Viterbi decoding) on Breakfast and 50 Salads in Table 5. The global cluster assignment outperforms the naïve assignment by a large margin, both for the clustering results and for the final segmentation results, on both datasets. The advantage of the global cluster assignment is even more evident on 50 Salads.

Table 5
Impact of cluster assignment strategies for two-step clustering on SSTEN embeddings on Breakfast and 50 Salads (eval level, i.e., 12 classes) (in %).

                               Clustering                Final
Dataset      Strategy        MoF    IoU    F1        MoF    IoU    F1
Breakfast    global cluster  38.6   13.7   25.9      50.3   19.0   33.6
             naïve           25.1   12.4   23.9      42.3   17.7   32.0
50 Salads    global cluster  45.3   23.0   43.7      48.4   26.0   44.8
             naïve           28.9   16.2   32.9      35.0   22.7   38.0

We illustrate exemplary qualitative results of the clustering and the final segmentation for 3 activities (with 3 videos each) on Breakfast in Fig. 4. For each video, the 4-row group displays the ground truth (1st row), the result with global cluster assignment (2nd row), the result with naïve assignment (3rd row) and the result of MLP+kmeans [21] (4th row). By comparing the 3rd rows of the three cereals videos in Fig. 4, we see that the naïve assignment simply assumes that the sub-clusters at the same temporal position in each video belong to the same global cluster, even though they might not be close to each other in the feature space. On the contrary, the global cluster assignment (the 2nd rows of the cereals videos) yields an optimal assignment with respect to the pairwise distances between sub-clusters, resulting in different orderings of the sub-clusters in each video. Note that on some videos, the global cluster assignment can lead to the same result as the naïve assignment.

5. Conclusion

We proposed a new pipeline for the unsupervised learning of action segmentation. For the feature embedding, we propose a temporal-aware embedding network that performs sequence-to-sequence learning with the pretext tasks of relative timestamp prediction and feature reconstruction. For clustering, we propose a two-step clustering schema, consisting of within-video clustering and cross-video global cluster assignment. The temporal embedding from sequence-to-sequence learning together with two-step clustering proves to be a well-suited combination that considers the sequential nature of frames in both processing steps. Ultimately, we combine the temporal embedding with a frame-to-cluster assignment based on Viterbi decoding, which achieves the unsupervised state-of-the-art on three challenging benchmarks.







References

[1] Z. Wang, Q. She, A. Smolic, Action-net: Multipath excitation for action recognition, in: CVPR, 2021.
[2] C. Feichtenhofer, X3d: Expanding architectures for efficient video recognition, in: CVPR, 2020.
[3] C. Yang, Y. Xu, J. Shi, B. Dai, B. Zhou, Temporal pyramid network for action recognition, in: CVPR, 2020.
[4] M. Xu, C. Zhao, D. S. Rojas, A. Thabet, B. Ghanem, G-tad: Sub-graph localization for temporal action detection, in: CVPR, 2020.
[5] P. Zhao, L. Xie, C. Ju, Y. Zhang, Y. Wang, Q. Tian, Bottom-up temporal action localization with mutual regularization, in: ECCV, 2020.
[6] Y. Bai, Y. Wang, Y. Tong, Y. Yang, Q. Liu, J. Liu, Boundary content graph neural network for temporal action proposal generation, in: ECCV, 2020.
[7] Y. A. Farha, J. Gall, MS-TCN: Multi-stage temporal convolutional network for action segmentation, in: CVPR, 2019.
[8] S. Li, Y. A. Farha, Y. Liu, M.-M. Cheng, J. Gall, MS-TCN++: Multi-stage temporal convolutional network for action segmentation, PAMI (2020).
[9] Z. Wang, Z. Gao, L. Wang, Z. Li, G. Wu, Boundary-aware cascade networks for temporal action segmentation, in: ECCV, 2020.
[10] D. Wang, D. Hu, X. Li, D. Dou, Temporal relational modeling with self-supervision for action segmentation, in: AAAI, 2021.
[11] Y. Huang, Y. Sugano, Y. Sato, Improving action segmentation via graph-based temporal reasoning, in: CVPR, 2020.
[12] C.-Y. Chang, D.-A. Huang, Y. Sui, L. Fei-Fei, J. C. Niebles, D3tw: Discriminative differentiable dynamic time warping for weakly supervised action alignment and segmentation, in: CVPR, 2019.
[13] J. Li, P. Lei, S. Todorovic, Weakly supervised energy-based learning for action segmentation, in: ICCV, 2019.
[14] A. Richard, H. Kuehne, J. Gall, Weakly supervised action learning with RNN based fine-to-coarse modeling, in: CVPR, 2017.
[15] A. Richard, H. Kuehne, A. Iqbal, J. Gall, NeuralNetwork-Viterbi: A framework for weakly supervised video learning, in: CVPR, 2018.
[16] D.-A. Huang, F.-F. Li, J. C. Niebles, Connectionist temporal modeling for weakly supervised action labeling, in: ECCV, 2016.
[17] M. Fayyaz, J. Gall, SCT: Set constrained temporal transformer for set supervised action segmentation, in: CVPR, 2020.
[18] J. Li, S. Todorovic, Set-constrained Viterbi for set-supervised action segmentation, in: CVPR, 2020.
[19] A. Richard, H. Kuehne, J. Gall, Action sets: Weakly supervised action segmentation without ordering constraints, in: CVPR, 2018.
[20] F. Sener, A. Yao, Unsupervised learning and segmentation of complex activities from video, in: CVPR, 2018.
[21] A. Kukleva, H. Kuehne, F. Sener, J. Gall, Unsupervised learning of action classes with continuous temporal embedding, in: CVPR, 2019.
[22] R. G. VidalMata, W. J. Scheirer, H. Kuehne, Joint visual-temporal embedding for unsupervised learning of actions in untrimmed sequences, in: WACV, 2021.
[23] B. L. Bhatnagar, S. Singh, C. Arora, C. V. Jawahar, Unsupervised learning of deep feature representation for clustering egocentric actions, in: IJCAI, 2017.
[24] N. Srivastava, E. Mansimov, R. Salakhudinov, Unsupervised learning of video representations using LSTMs, in: ICML, 2015.
[25] E. L. Denton, V. Birodkar, Unsupervised learning of disentangled representations from video, in: NeurIPS, 2017.
[26] R. Villegas, J. Yang, S. Hong, X. Lin, H. Lee, Decomposing motion and content for natural video sequence prediction, in: ICLR, 2017.
[27] D. Kim, D. Cho, I. S. Kweon, Self-supervised video representation learning with space-time cubic puzzles, in: AAAI, 2019.
[28] H.-Y. Lee, J.-B. Huang, M. Singh, M.-H. Yang, Unsupervised representation learning by sorting sequences, in: ICCV, 2017.
[29] V. Ramanathan, K. Tang, G. Mori, L. Fei-Fei, Learning temporal embeddings for complex video analysis, in: ICCV, 2015.
[30] B. Fernando, E. Gavves, J. M. Oramas, A. Ghodrati, T. Tuytelaars, Modeling video evolution for action recognition, in: CVPR, 2015.
[31] A. Cherian, B. Fernando, M. Harandi, S. Gould, Generalized rank pooling for activity recognition, in: CVPR, 2017.
[32] M. Hoai, F. D. la Torre, Maximum margin temporal clustering, in: Artificial Intelligence and Statistics, 2012.
[33] S. Li, K. Li, Y. Fu, Temporal subspace clustering for human motion segmentation, in: CVPR, 2015.
[34] S. Tierney, J. Gao, Y. Guo, Subspace clustering for sequential data, in: CVPR, 2014.
[35] Y. Zhang, S. Tang, H. Sun, H. Neumann, Human motion parsing by hierarchical dynamic clustering, in: BMVC, 2018.
[36] J.-B. Alayrac, P. Bojanowski, N. Agrawal, J. Sivic, I. Laptev, S. Lacoste-Julien, Unsupervised learning from narrated instruction videos, in: CVPR, 2016.
[37] M. S. Sarfraz, N. Murray, V. Sharma, A. Diba, L. Van Gool, R. Stiefelhagen, Temporally-weighted hierarchical clustering for unsupervised action segmentation, in: CVPR, 2021.
[38] S. N. Aakur, S. Sarkar, A perceptual prediction framework for self supervised event segmentation, in: CVPR, 2019.
[39] L. Zelnik-Manor, P. Perona, Self-tuning spectral clustering, in: NeurIPS, 2005.
[40] W. P. Pierskalla, Letter to the editor—the multi-dimensional assignment problem, Operations Research 16 (1968) 422–431.
[41] H.-J. Bandelt, Y. Crama, F. C. Spieksma, Approximation algorithms for multi-dimensional assignment problems with decomposable costs, Discrete Applied Mathematics 49 (1994) 25–50.
[42] H. Kuehne, A. Arslan, T. Serre, The language of actions: Recovering the syntax and semantics of goal-directed human activities, in: CVPR, 2014.
[43] S. Stein, S. J. McKenna, Combining embedded accelerometers with computer vision for recognizing food preparation activities, in: ACM International Joint Conference on Pervasive and Ubiquitous Computing, 2013.
[44] H. Wang, C. Schmid, Action recognition with improved trajectories, in: ICCV, 2013.
[45] J. Li, S. Todorovic, Action shuffle alternating learning for unsupervised action segmentation, in: CVPR, 2021.
[46] D. Wang, Y. Yuan, Q. Wang, Gated forward refinement network for action segmentation, Neurocomputing 407 (2020) 63–71.
[47] M.-H. Chen, B. Li, Y. Bao, G. AlRegib, Z. Kira, Action segmentation with joint self-supervised temporal domain adaptation, in: CVPR, 2020.
[48] S.-H. Gao, Q. Han, Z.-Y. Li, P. Peng, L. Wang, M.-M. Cheng, Global2local: Efficient structure search for video action segmentation, in: CVPR, 2021.
[49] F. Yi, H. Wen, T. Jiang, Asformer: Transformer for action segmentation, in: BMVC, 2021.
[50] Z. Lu, E. Elhamifar, Weakly-supervised action segmentation and alignment via transcript-aware union-of-subspaces learning, in: ICCV, 2021.
[51] C. Lea, A. Reiter, R. Vidal, G. D. Hager, Segmental spatiotemporal cnns for fine-grained action segmentation, in: ECCV, 2016.
[52] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, G. D. Hager, Temporal convolutional networks for action segmentation and detection, in: CVPR, 2017.
[53] C. Fermüller, F. Wang, Y. Yang, K. Zampogiannis, Y. Zhang, F. Barranco, M. Pfeiffer, Prediction of manipulation actions, International Journal of Computer Vision 126 (2018) 358–374.
[54] H. Ahn, D. Lee, Refining action segmentation with hierarchical video representations, in: ICCV, 2021.
[55] Y. Ishikawa, S. Kasai, Y. Aoki, H. Kataoka, Alleviating over-segmentation errors by detecting action boundaries, in: WACV, 2021.
[56] J. Liu, A. Shahroudy, M. L. Perez, G. Wang, L.-Y. Duan, A. K. Chichung, NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding, PAMI (2019).
[57] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: NeurIPS, 2012.
[58] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: A large-scale hierarchical image database, in: CVPR, 2009.
[59] J. Carreira, A. Zisserman, Quo vadis, action recognition? A new model and the kinetics dataset, in: CVPR, 2017.
[60] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al., The kinetics human action video dataset, arXiv preprint arXiv:1705.06950 (2017).
TAEC: Unsupervised Action Segmentation with
Temporal-Aware Embedding and Clustering
Supplementary


1. Introduction

For additional insights into TAEC, we introduce the background of spectral clustering in Sec. 2.1 and give details of the Viterbi decoding in Sec. 2.2. We perform further ablation studies comparing baseline embeddings and clustering methods (Sec. 3.1), the scaling of the spatio-temporal similarity (Sec. 3.2), the cluster ordering (Sec. 3.3), and the decoding strategies (Sec. 3.4). Finally, we provide more quantitative (Sec. 3.5) and qualitative segmentation results (Sec. 3.6) on the three datasets.

2. Method

2.1. Spectral Clustering

Background information related to Sec. 3.2.1 in the main manuscript: Given the embedded feature sequence e_1, e_2, ..., e_T, we build a frame-to-frame similarity graph G ∈ R^{T×T}, whose edge weight g(i, j), i, j ∈ {1, ..., T}, represents the similarity between frame i and frame j. Grouping the frames into K clusters can be interpreted as a graph partition problem of cutting edges on G, resulting in K subgraphs G_1, G_2, ..., G_K. The normalized cut (Ncut) problem [1] is employed to compute a balanced partition by minimizing the energy

    \mathcal{L}_{cut}(G_1, G_2, ..., G_K) = \frac{1}{2} \sum_{k=1}^{K} \frac{W(G_k, \bar{G}_k)}{\mathrm{vol}(G_k)},   (1)

where W(G_k, \bar{G}_k) is the sum of the edge weights between elements in the subgraph G_k and elements of all the other subgraphs, i.e., the sum of the weights of the edges to be cut, and vol(G_k) is the sum of the weights of the edges within the resulting subgraph G_k. Spectral clustering [2] is a relaxed solution to the NP-hard minimization problem in Eq. (1) and has shown good performance on many graph-based clustering problems, e.g., [3, 4, 5]. Note that while K-means operates on the Euclidean distance in the feature space and assumes convex and isotropic clusters, spectral clustering can find clusters with non-convex boundaries.
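A generic sketch of this relaxation, assuming the common recipe of embedding the frames with the top eigenvectors of the symmetrically normalized affinity and clustering them with K-means; this is illustrative background, not the paper's implementation:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_clustering(affinity, k):
    """Cluster frames from a T x T affinity matrix, following the usual
    relaxation of the normalized cut objective."""
    d = affinity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d + 1e-12)
    # symmetrically normalized affinity D^{-1/2} W D^{-1/2}
    norm_aff = affinity * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # eigenvectors for the k largest eigenvalues (eigh returns ascending order)
    _, eigvecs = np.linalg.eigh(norm_aff)
    embedding = eigvecs[:, -k:]
    # row-normalize and run K-means in the spectral embedding
    embedding /= np.linalg.norm(embedding, axis=1, keepdims=True) + 1e-12
    _, labels = kmeans2(embedding, k, minit='++')
    return labels

# Toy example: two well-separated groups of frames.
rng = np.random.default_rng(0)
frames = np.vstack([rng.normal(0, 0.1, (20, 8)), rng.normal(3, 0.1, (20, 8))])
dists = np.linalg.norm(frames[:, None] - frames[None, :], axis=-1)
affinity = np.exp(-dists ** 2)
print(spectral_clustering(affinity, 2))
```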
                                                                                                                                           = 1(𝑦𝑖,𝑛 = 𝑦𝑖−1,𝑛 ∨ 𝑦𝑖−1,𝑛 → 𝑦𝑖,𝑛 ∈ 𝑂𝑛 ).
2.2. Frame Labeling by Viterbi Decoding

Additional explanations to Sec. 3.3 in the main manuscript: The global cluster assignment delivers the ordered clusters on each video, which are aligned across all videos. To compute the final segmentation, we use the resulting ordering and decode each video into a sequence of K temporally consistent segments. That is, we determine the optimal label sequence ŷ_{1∼T_n,n} = {y_{1,n}, ..., y_{T_n,n}} by re-assigning each frame to one of the temporally ordered clusters.

Given the embedded feature sequence e_{1∼T_n,n} = {e_{1,n}, ..., e_{T_n,n}} and the temporal order of the clusters, we search for the optimal label sequence that maximizes the probability p(y_{1∼T_n,n} | e_{1∼T_n,n}). Following [6], this posterior probability can be factorized into the product of the likelihoods and the probability of a given temporal order, i.e.,

    \hat{y}_{1\sim T_n,n} = \arg\max_{y_{1\sim T_n,n}} p(y_{1\sim T_n,n} \mid \mathbf{e}_{1\sim T_n,n})
                          = \arg\max_{y_{1\sim T_n,n}} \Big\{ \prod_{t=1}^{T_n} p(\mathbf{e}_{t,n} \mid y_{t,n}) \cdot \prod_{t=1}^{T_n} p_n(y_{t,n} \mid y_{1\sim(t-1),n}) \Big\}
                          = \arg\max_{y_{1\sim T_n,n}} \Big\{ \prod_{t=1}^{T_n} p(\mathbf{e}_{t,n} \mid y_{t,n}) \cdot p_n(y_{t,n} \mid y_{t-1,n}) \Big\}   (2)

Here, the likelihood p(e_{t,n} | y_{t,n}) is the probability of a frame embedding e_{t,n} from video n belonging to a cluster. We therefore fit a Gaussian distribution to each global cluster and compute the frame-wise likelihoods with the Gaussian model, i.e.,

    p(\mathbf{x} \mid k) = \mathcal{N}_k(\mathbf{x}; \mu_k, \Sigma_k), \quad k \in \{1, ..., K\}.   (3)

p_n(y_{t,n} | y_{t-1,n}) is the transition probability from label y_{t-1,n} at frame t-1 to label y_{t,n} at frame t, which is defined by the temporal order of the clusters. We denote the set of frame transitions defined by the temporal order of clusters on the n-th video by O_n; e.g., for the temporal order a → b → c → d, O_n = {a → b, b → c, c → d}. The transition probability is binary, i.e.,

    p_n(y_{i,n} \mid y_{i-1,n}) = \mathbb{1}\big(y_{i,n} = y_{i-1,n} \;\vee\; (y_{i-1,n} \rightarrow y_{i,n}) \in O_n\big).   (4)

This means that we allow either a transition to the next cluster according to the temporal order, or we keep the cluster assignment of the previous frame.
Note that in the two-step clustering, we derive the temporal order of the clusters on each video separately, by sorting the clusters of the video according to their average timestamp. Therefore, we have an individual O_n for each video n. On the contrary, in K-means there is one uniform order of the global clusters for all videos, and O_n is thus the same for every video n.

The Viterbi algorithm for solving Eq. (2) proceeds iteratively using dynamic programming, i.e.,

    p(y_{1\sim t,n} \mid \mathbf{e}_{1\sim t,n}) = \max_{y_{t,n}} \big\{ p(y_{1\sim(t-1),n} \mid \mathbf{e}_{1\sim(t-1),n}) \cdot p(\mathbf{e}_{t,n} \mid y_{t,n}) \cdot p(y_{t,n} \mid y_{t-1,n}) \big\}.   (5)

Sequences that do not follow the temporal order are filtered out at an early stage to narrow down the search range for the optimal label sequence. The output of the Viterbi decoding is the optimal cluster label sequence ŷ_{1∼T_n,n}.
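A minimal sketch of this constrained decoding (Eqs. (2)-(5)): per-cluster Gaussian log-likelihoods, a transition model that only allows staying in the current cluster or advancing to the next one in the given order, and dynamic programming with backtracking. Running in log space, forcing the path to start at the first and end at the last cluster of the order, and the toy Gaussians and variable names are our additions, not the paper's implementation:

```python
import numpy as np
from scipy.stats import multivariate_normal

def viterbi_decode(frames, order, means, covs):
    """Decode frames (T x D) into segments that follow `order`, a list of
    cluster ids (e.g. the video-wise temporal order a -> b -> c -> d)."""
    T, K = len(frames), len(order)
    # per-cluster Gaussian log-likelihoods (Eq. 3), T x K
    log_lik = np.stack([multivariate_normal.logpdf(frames, means[c], covs[c])
                        for c in order], axis=1)
    dp = np.full((T, K), -np.inf)
    back = np.zeros((T, K), dtype=int)
    dp[0, 0] = log_lik[0, 0]                 # assumed: the path starts with the first cluster
    for t in range(1, T):
        for k in range(K):
            stay = dp[t - 1, k]              # keep the previous cluster (Eq. 4)
            advance = dp[t - 1, k - 1] if k > 0 else -np.inf   # move to the next cluster in the order
            back[t, k] = k if stay >= advance else k - 1
            dp[t, k] = max(stay, advance) + log_lik[t, k]
    # backtrack from the last cluster so the full order is traversed
    path = [K - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return [order[k] for k in reversed(path)]

# Toy example: 3 clusters with 1-dim features.
rng = np.random.default_rng(0)
frames = np.concatenate([rng.normal(m, 0.2, 30) for m in (0.0, 2.0, 4.0)])[:, None]
means = {c: np.array([m]) for c, m in enumerate((0.0, 2.0, 4.0))}
covs = {c: np.array([[0.04]]) for c in range(3)}
print(viterbi_decode(frames, order=[0, 1, 2], means=means, covs=covs)[:5])  # first frames -> cluster 0
```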
                                                                     five embeddings for the same Breakfast video in Fig. 2.
3. Additional Results                                                The plots show that Rankloss MLP, MLP and AEMLP,
                                                                     which are trained on individual frames, do not expose
3.1. Embedding and clustering                                        an appropriate temporal structure. There are noisy block
                                                                     patterns even in positions far away from the diagonal,
Further, we compare our SSTEN embedding with three                   which results in noisy clusters and thus, leads to erro-
baseline variants (shown in Fig. 1): MLP temporal em-                neous temporal orders and inferior assignment results in
bedding, autoencoder with MLP (AEMLP) and temporal                   the two-step clustering. The least noisy Rankloss MLP
convolutional network (TCN), in combination with the                 has the highest performance among these three. On the
two clustering methods.                                              contrary, TCN and SSTEN embedded features, which
   MLP uses three FC layers for relative timestamp pre-              show a clear diagonal block structure in the similarity
diction. AEMLP uses MLP-based autencoder for both                    graph, achieve a better performance in the two-step clus-
relative timestamp prediction and feature reconstruction.            tering. This verifies that the sequence-to-sequence em-
TCN deploys 𝑄 stacked dilated residual layers only for               bedding learning (TCN and SSTEN) and two-step cluster-
relative timestamp prediction.                                       ing are a well-suited combination to address the sequen-
   Here, we also implement the Rankloss MLP embed-                   tial nature of frames in both processing steps of feature
ding [7] for reference. We report the performance of                 embedding and clustering.
these five embeddings in Table 1.                                       Considering K-means clustering, the merit of having
   Comparison of the five embeddings. We learn the                   a better sequential nature of the embedded features via
five embeddings (Rankloss MLP, MLP, AEMLP, TCN and                   sequence-to-sequence learning can also be seen from
SSTEN) on the DTFV features. Here, the Rankloss MLP                  the higher IoU and F1 scores (TCN: IoU 17.8%/F1 31.3%,
(consisting of two FC layers) is trained with a ranking              vs. SSTEN: 17.8%/31.9%), as these penalize dominating
loss. We use the initialization of uniform segmentation
In contrast to TCN, SSTEN can preserve the spatial layout of the input features due to the feature reconstruction via the autoencoder. By comparing TCN and SSTEN, we see that the SSTEN embedding with feature reconstruction leads to a boost in the MoF score. The marginal improvement of AEMLP over MLP is due to the fact that the MLP structure with only FC layers is not well-suited for feature reconstruction.

Comparison between K-means and two-step clustering. Considering the performance of the five embeddings with the two clustering methods, we see that K-means leads to higher scores on the inferior embeddings (Rankloss MLP, MLP and AEMLP) trained on individual frames, while two-step clustering performs better on the sequence-to-sequence learning-based embeddings (TCN and SSTEN). When combined with the proposed SSTEN embedding, two-step clustering outperforms K-means by a large margin in terms of the MoF score. We also tried applying K-means on each video separately; however, the performance dropped significantly. K-means depends only on the spatial distance and results in oversegmentation, which leads to an erroneous temporal order on each video and thus an inferior global cluster assignment.

3.2. Impact of Scaling in Spatiotemporal Similarity

We perform spectral clustering with the proposed spatiotemporal similarity. Here, we analyze the impact of the scaling factors in the spatial and temporal Gaussian kernels, i.e., σ_spat² and σ_tmp². These adjust the extent to which two frames are considered similar to each other and influence the clustering quality. The experiments are conducted for SSTEN embeddings on Breakfast.

Impact of the scaling of the spatial Gaussian kernel. For local scaling, we set σ_spat² = σ_i·σ_j, where σ_i is the distance from e_i to its m-th nearest neighbor in the feature space. The resulting segmentation performance w.r.t. m is shown in Fig. 3. With m varying in the range of 3 to 20, the IoU and F1 scores remain stable. The best MoF scores are achieved for m ∈ {8, 9}, whereas for other scaling parameters the MoF score drops. Thus, we set m = 9 for all following evaluations.

Figure 3: Segmentation performance of the two-step clustering on SSTEN embeddings (λ = 0.002) with different m (for local scaling) on Breakfast.

For comparison, we also set σ_spat to fixed values (without local scaling) and report the segmentation performance in Table 2. We achieve good results at smaller σ_spat values (0.5 and 0.7). However, with increasing σ_spat the MoF score drops significantly, while there are only minor fluctuations in IoU and F1. Apparently, σ_spat has a large impact on the clustering quality. The local scaling eases the effort of tuning σ_spat by dynamically determining the scaling factor.

Table 2
Segmentation performance of two-step clustering on SSTEN embeddings (λ = 0.002) with respect to a fixed spatial scaling factor σ_spat (without local scaling) on Breakfast (in %).

σ_spat    MoF    IoU    F1
0.5       47.1   18.0   33.4
0.7       48.0   18.5   33.9
1         41.1   19.2   32.9
2         38.4   18.6   32.2
10        35.0   17.9   31.2

Impact of the scaling of the temporal Gaussian kernel. The temporal Gaussian kernel operates on the temporal distance between frames in a video. With σ_tmp² = 2σ′², the term exp(−(s_i − s_j)² / (2σ′²)) is in the standard form of a Gaussian kernel. We set σ′ = 1/6 so that the 6σ′ range of the temporal Gaussian equals the video length (since the length of each video is normalized to 1 for the relative timestamp prediction). The segmentation performance with respect to σ′ is shown in Table 3. Apparently, σ′ = 1/6 leads to the best result. Here, we also evaluate the case without the temporal Gaussian kernel, which leads to a drop in performance. The impact of the temporal Gaussian kernel on the similarity matrices of SSTEN embeddings can also be seen by comparing the top and bottom rows of Fig. 6 in the main manuscript. For example, by adding the temporal Gaussian kernel, we decrease the similarities in Fig. 6(a1) according to the temporal distance between two frames, which leads to a clearer diagonal block structure in Fig. 6(a2). Thus, we set σ′ = 1/6 for all following evaluations.

Table 3
Segmentation performance of two-step clustering on SSTEN embeddings (λ = 0.002) with respect to the temporal scaling factor σ′ (σ_tmp² = 2σ′²) on Breakfast (in %).

σ′ (σ_tmp² = 2σ′²)       MoF    IoU    F1
∞ (w/o tmp. Gauss)       41.5   16.5   30.6
1/3                      43.5   16.9   31.3
1/6                      50.3   19.0   33.6
1/12                     44.3   18.5   34.1
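Putting the two kernels together, the sketch below computes the spatiotemporal similarity under the assumption that the spatial kernel with local scaling is exp(−‖e_i − e_j‖² / (σ_i σ_j)) and that the two kernels are combined multiplicatively; the exact construction is given in Sec. 3.2.1 of the main manuscript, so treat this only as an illustration:

```python
import numpy as np

def spatiotemporal_similarity(embeddings, m=9, sigma_t=1.0 / 6.0):
    """Frame-to-frame similarity: a spatial Gaussian with local scaling
    (sigma_i = distance to the m-th nearest neighbour) multiplied by a
    temporal Gaussian over relative timestamps with scale sigma_t."""
    T = len(embeddings)
    dists = np.linalg.norm(embeddings[:, None] - embeddings[None, :], axis=-1)
    # local scaling: sigma_i is the distance from frame i to its m-th nearest neighbour
    sigma = np.sort(dists, axis=1)[:, m]
    spatial = np.exp(-dists ** 2 / (sigma[:, None] * sigma[None, :] + 1e-12))
    # relative timestamps in [0, 1]; 6 * sigma_t covers the whole video for sigma_t = 1/6
    s = np.linspace(0.0, 1.0, T)
    temporal = np.exp(-(s[:, None] - s[None, :]) ** 2 / (2.0 * sigma_t ** 2))
    return spatial * temporal

# Toy example: 100 frames of 16-dim embedded features.
rng = np.random.default_rng(0)
sim = spatiotemporal_similarity(rng.standard_normal((100, 16)))
print(sim.shape, sim[0, 0], sim[0, -1] < sim[0, 1])  # (100, 100) 1.0 True
```

The temporal factor suppresses similarities between temporally distant frames, which is what produces the clearer diagonal block structure discussed above.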
3.3. Impact of Cluster Order

One merit of performing within-video clustering is that we can derive the temporal order of the sub-clusters for each video separately. The video-wise individual order of clusters is used to guide the Viterbi decoding, which breaks the common assumption that clusters follow the same temporal order in all videos. In the following, we verify the efficacy of the derived video-wise order of clusters. We use the same within-video clustering result with global cluster assignment and perform Viterbi decoding with two different temporal cluster orders: (1) video-wise order: the temporal order of the sub-clusters is determined on each video separately; and (2) uniform order: the uniform order is determined by sorting the average timestamps of the global clusters and is then applied to all videos. Table 4 reports the segmentation performance (after Viterbi decoding) with these two orders for our SSTEN embeddings on Breakfast and YTI.

Table 4
Impact of cluster order for two-step clustering on SSTEN embeddings (in %).

                        Breakfast                        YTI
Order             MoF    IoU    F1    Edit        MoF    IoU    F1    Edit
video-wise        50.3   19.0   33.6  42.3        46.6   10.7   29.5  25.5
uniform           53.5   15.7   32.2  33.0        40.7   7.7    25.1  20.3

To measure the correctness of the predicted segment order, we adopt the segmental edit distance (Edit), which is a common metric for supervised action segmentation, e.g., [8, 9, 10, 11]. It penalizes segmentation results that have a different segment order than the ground truth, i.e., it penalizes out-of-order predictions as well as oversegmentation.
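For completeness, a sketch of the segmental edit score as it is commonly computed: collapse the frame-wise labels into segment sequences and normalize a Levenshtein distance. The exact evaluation code may differ in details:

```python
import numpy as np

def to_segments(frame_labels):
    """Collapse a frame-wise label sequence into its segment-level sequence."""
    return [lab for i, lab in enumerate(frame_labels)
            if i == 0 or lab != frame_labels[i - 1]]

def edit_score(pred_frames, gt_frames):
    """Segmental edit score in %, i.e. 100 * (1 - normalized Levenshtein
    distance between the two segment sequences)."""
    p, g = to_segments(pred_frames), to_segments(gt_frames)
    D = np.zeros((len(p) + 1, len(g) + 1))
    D[:, 0] = np.arange(len(p) + 1)
    D[0, :] = np.arange(len(g) + 1)
    for i in range(1, len(p) + 1):
        for j in range(1, len(g) + 1):
            cost = 0 if p[i - 1] == g[j - 1] else 1
            D[i, j] = min(D[i - 1, j] + 1, D[i, j - 1] + 1, D[i - 1, j - 1] + cost)
    return 100.0 * (1.0 - D[-1, -1] / max(len(p), len(g)))

# Out-of-order and over-segmented predictions are both penalized.
gt   = [0] * 10 + [1] * 10 + [2] * 10
pred = [1] * 10 + [0] * 5 + [1] * 5 + [2] * 10
print(edit_score(gt, gt), edit_score(pred, gt))  # 100.0 75.0
```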
From Table 4 we see that our video-wise order clearly outperforms the uniform order, except for MoF on Breakfast. Furthermore, the edit score verifies that our derived video-wise temporal orders are valid.

In our experiments we especially notice that the MoF and IoU scores can act contradictory to each other, e.g., the uniform order results in higher MoF scores (on Breakfast) at the cost of lower IoU scores. MoF tends to overfit on dominant classes (e.g., classes with longer action instances), while IoU is sensitive to underrepresented classes and penalizes segmentation results with dominating segments. Therefore, it is necessary to consider all metrics for evaluation, as a higher MoF score does not always correspond to better performance in practice.
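The following toy computation illustrates this effect: a prediction that collapses everything onto the dominant class still reaches a high MoF, while the mean per-class IoU drops. The metric definitions below follow common conventions and are not taken from the paper's evaluation scripts:

```python
import numpy as np

def mof_and_iou(pred, gt, num_classes):
    """Mean-over-frames accuracy and mean per-class IoU for frame-wise labels
    (assumes predictions are already matched to ground-truth classes)."""
    pred, gt = np.asarray(pred), np.asarray(gt)
    mof = 100.0 * np.mean(pred == gt)
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (gt == c))
        union = np.sum((pred == c) | (gt == c))
        if union > 0:                      # ignore classes absent from both
            ious.append(inter / union)
    return float(mof), float(100.0 * np.mean(ious))

# A dominant class inflates MoF while the missed short class pulls IoU down.
gt   = [0] * 90 + [1] * 10
pred = [0] * 100
print(mof_and_iou(pred, gt, num_classes=2))  # (90.0, 45.0)
```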
3.4. Impact of Decoding Strategies

We compare our approach, which uses Viterbi decoding, with the Mallow model decoding proposed in [7]. The authors propose a Rankloss embedding over all video frames from the same activity with respect to a pseudo ground truth action annotation. The embedded frames of the whole activity set are then clustered, and the likelihood for each frame and each cluster is computed. For the decoding, the authors build a histogram of features with respect to their clusters with a hard assignment and set the length of each action according to the overall amount of features per bin. After that, they apply a Mallow model to sample different orderings for each video with respect to the sampled distribution. The resulting model is a combination of Mallow model sampling and action length estimation based on the frame distribution.

For this experiment, we evaluate the impact of the different decoding strategies on two embeddings: the Rankloss embedding [7] and our SSTEN embedding. Table 5 reports the results of these two embeddings in combination with three decodings: the Mallow model, Viterbi decoding with K-means, and Viterbi decoding with two-step clustering.

Table 5
Comparison of combinations of embeddings and decoding strategies on Breakfast (in %).

                        Rankloss [7] embed.       SSTEN embed.
Decoding              MoF    IoU    F1          MoF    IoU    F1
Mallow [7]            34.7   17.8   31.4        36.4   18.1   31.5
kmeans+Viterbi        35.2   15.6   28.8        39.3   17.8   31.9
two-step.+Viterbi     34.7   13.4   23.7        50.3   19.0   33.6

Following [7], we run 7 iterations for the Rankloss embedding with the Mallow model. In each iteration, the Rankloss embedding is retrained using the segmentation result from the last iteration as pseudo label, and the frame-wise likelihoods and the Mallow model are updated.

Unlike the Mallow model, our Viterbi decoding is a one-iteration procedure: it operates on an embedding that is trained only once. When combining with Viterbi, we train the Rankloss model only once, using the initialized uniform segmentation as a prior. For SSTEN with the Mallow model, we only run one iteration, as we do not need to train SSTEN with pseudo labels iteratively.

Considering the Rankloss results in Table 5, we see that combining it with the Mallow model achieves its highest IoU and F1 scores. This is because, for Viterbi decoding, the Rankloss model trained only once with the uniform initialization as pseudo label lacks a strong temporal prior. Considering SSTEN, our Viterbi decoding with two-step clustering clearly outperforms the Mallow model. With Mallow, the SSTEN embedding achieves competitive IoU and F1 scores but significantly lower MoF. We also tried running the Mallow model on SSTEN embedded features for multiple iterations; however, this resulted in a reduced number of clusters. Thus, we see that an appropriate combination of embedding and decoding strategy is necessary.
To have a closer look into the Viterbi decoding, we visualize the likelihood grids computed from the global clusters, as well as the resulting decoding paths over time, for two videos on Breakfast in Fig. 4. It shows that the decoding, which generates a full sequence of actions, is able to marginalize actions that do not occur in the video by assigning only very few frames to them, while the majority of frames are assigned to the clusters that do occur in the video. Even though the given temporal order constrains the resulting K coherent segments to follow the fixed temporal order, the segments that do not belong in the sequence are marginalized, because the Viterbi algorithm decodes a path that maximizes the posterior probability. Overall, it turns out that the Viterbi decoding constrained by a temporal order performs better than the Mallow model's iterative re-ordering.

Figure 4: Comparison of Viterbi decoding paths on the likelihood grids computed from the global clusters resulting from two-step clustering on the SSTEN embeddings, for two videos on Breakfast ("Making juice" and "Making fried egg"). The warm (red)/cool (blue) colors in the grid indicate high/low likelihoods of a frame belonging to an action class. The decoding assigns most frames to occurring actions while marginalizing actions that do not occur in the sequence by assigning only a few frames.

3.5. Quantitative Segmentation Results

3.5.1. Results of Clustering and Final Segmentation

In order to show the advantage of two-step clustering over K-means when combined with the proposed SSTEN embedding, we report both the clustering results (before Viterbi decoding) and the final segmentation performance (after Viterbi decoding) on Breakfast in Table 6. The proposed two-step clustering leads to superior performance compared to K-means, both for the clustering (before Viterbi decoding) in most metrics and for the final segmentation (after Viterbi decoding).

Table 6
Comparison of combinations of SSTEN and different clustering methods in terms of clustering and final segmentation after Viterbi decoding on Breakfast (in %).

                                   Clustering results        Final results
Embedding    Clustering          MoF    IoU    F1          MoF    IoU    F1
SSTEN        K-means             27.2   13.5   26.3        39.3   17.8   31.9
             two-step.cluster    38.6   13.7   25.9        50.3   19.0   33.6

3.5.2. Segmentation Results On Each Activity

We report the ground truth number of classes and the segmentation performance of MLP with K-means (MLP+kmeans, our reimplementation of [12]) and TAEC for each activity on Breakfast (Table 7), YTI (Table 8) and 50 Salads (Table 9). The evaluation is done with global Hungarian matching on all videos. The number of clusters is set to the maximum number of ground truth classes for each activity (K = max.#gt).

In order to show the advantage of two-step clustering                            salad                8
                                                                                                              MLP+kmeans
                                                                                                                TAEC
                                                                                                                                   34.7
                                                                                                                                   42.0
                                                                                                                                              7.8
                                                                                                                                             15.2
                                                                                                                                                          27.5
                                                                                                                                                          34.0
over K-means, when combined with the proposed SSTEN
                                                                                                              MLP+kmeans           57.4       8.6         19.2
embedding, we report both, the results of clustering (be-                      pancake               14
                                                                                                                TAEC               58.2      19.2         35.8
fore Viterbi decoding) and the final segmentation perfor-                                                     MLP+kmeans           42.9      13.1         25.5
mance (after Viterbi decoding) on Breakfast in Table 6.                           All                 -
                                                                                                                TAEC               50.3      19.0         33.6
We see that the proposed two-step clustering leads to
superior performance than K-means, in terms of both
clustering (before Viterbi decoding) in most metrics, and
in terms of final segmentation (after Viterbi decoding).
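
To make the decoding step concrete, the following is a minimal sketch of Viterbi decoding over a frame-by-cluster log-likelihood grid, as visualized in Figure 4. It assumes a single fixed action order per video and no explicit length or transition model; the function and variable names are ours and not taken from the paper's implementation.

```python
import numpy as np

def viterbi_decode(log_probs, action_order):
    """Monotonically align T frames to an ordered list of actions.

    log_probs:    (T, K) array of frame-wise log-likelihoods for K clusters.
    action_order: cluster indices in the (hypothesized) order in which the
                  actions occur in this video.
    Returns a length-T array of cluster labels.
    """
    T, N = log_probs.shape[0], len(action_order)
    emit = log_probs[:, action_order]      # (T, N) scores per ordered action
    dp = np.full((T, N), -np.inf)
    back = np.zeros((T, N), dtype=int)     # 0 = stayed, 1 = advanced from the previous action
    dp[0, 0] = emit[0, 0]                  # the video must start in the first action
    for t in range(1, T):
        for n in range(N):
            stay = dp[t - 1, n]
            move = dp[t - 1, n - 1] if n > 0 else -np.inf
            if move > stay:
                dp[t, n], back[t, n] = move + emit[t, n], 1
            else:
                dp[t, n], back[t, n] = stay + emit[t, n], 0
    labels = np.empty(T, dtype=int)        # backtrack from the last action at the last frame
    n = N - 1
    for t in range(T - 1, -1, -1):
        labels[t] = action_order[n]
        n -= back[t, n]
    return labels
```

Because every frame must be assigned to some action in the given order, actions that are unlikely throughout a video naturally end up with only a few frames, which is the behavior visible in Figure 4.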
Table 8
Maximum number of ground truth action classes and the segmentation performance of MLP+kmeans (our reimplementation of [12]) and TAEC for the 5 activities on YTI (in %). The number of clusters is set to the maximum number of ground truth classes for each activity (𝐾 = max.#gt).

YouTube Instructions: global Hungarian matching on all videos (𝐾 = max.#gt)

Activity      𝐾    Methods       MoF       IoU       F1        MoF      IoU
                                 w/o bkg   w/o bkg   w/o bkg   w/ bkg   w/ bkg
coffee        10   MLP+kmeans    40.9      10.4      34.2      11.9      9.5
                   TAEC          42.8      10.5      26.6      12.4      9.6
change tire   11   MLP+kmeans    45.9      17.2      34.0      24.7     15.8
                   TAEC          58.6      20.0      37.2      31.5     18.4
jump car      12   MLP+kmeans    30.6       4.5      24.4       5.1      4.1
                   TAEC          34.3       6.2      26.4       5.7      5.8
cpr           7    MLP+kmeans    34.4       9.9      31.2      15.0      8.6
                   TAEC          38.0       7.9      33.3      16.6      6.9
repot         8    MLP+kmeans    29.8       7.2      23.1      10.1      6.4
                   TAEC          35.8       7.1      22.4      12.1      6.3
All           -    MLP+kmeans    39.4       9.9      29.6      14.4      9.7
                   TAEC          46.6      10.7      29.5      17.0     10.5


Table 9
Maximum number of ground truth action classes and the segmentation performance of MLP+kmeans (our reimplementation of [12]) and TAEC for the single activity on 50 Salads (in %). The number of clusters is set to the maximum number of ground truth classes for each activity (𝐾 = max.#gt).

50 Salads: global Hungarian matching on all videos (𝐾 = max.#gt)

Activity   𝐾         Methods       MoF    IoU    F1
salad      eval 12   MLP+kmeans    37.9   24.6   40.2
                     TAEC          48.4   26.0   44.8
           mid 19    MLP+kmeans    29.1   15.7   23.4
                     TAEC          26.6   14.9   23.4


References

 [1] J. Shi, J. Malik, Normalized cuts and image segmentation, PAMI 22 (2000) 888–905.
 [2] U. von Luxburg, A tutorial on spectral clustering, Statistics and Computing 17 (2007) 395–416.
 [3] H.-C. Huang, Y.-Y. Chuang, C.-S. Chen, Affinity aggregation for spectral clustering, in: CVPR, 2012.
 [4] Z. Li, J. Chen, Superpixel segmentation using linear spectral clustering, in: CVPR, 2015.
 [5] R. Liu, H. Zhang, Segmentation of 3D meshes through spectral clustering, in: Pacific Conference on Computer Graphics and Applications, 2004.
 [6] A. Richard, H. Kuehne, A. Iqbal, J. Gall, NeuralNetwork-Viterbi: A framework for weakly supervised video learning, in: CVPR, 2018.
 [7] F. Sener, A. Yao, Unsupervised learning and segmentation of complex activities from video, in: CVPR, 2018.
 [8] Y. A. Farha, J. Gall, MS-TCN: Multi-stage temporal convolutional network for action segmentation, in: CVPR, 2019.
 [9] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, G. D. Hager, Temporal convolutional networks for action segmentation and detection, in: CVPR, 2017.
[10] V. I. Morariu, L. S. Davis, Multi-agent event recognition in structured scenarios, in: CVPR, 2011.
[11] H. S. Koppula, R. Gupta, A. Saxena, Learning human activities and object affordances from RGB-D videos, The International Journal of Robotics Research 32 (2013) 951–970.
[12] A. Kukleva, H. Kuehne, F. Sener, J. Gall, Unsupervised learning of action classes with continuous temporal embedding, in: CVPR, 2019.




3.6. Qualitative Segmentation Results
We show the qualitative results of clustering and final segmentation for 7 composite activities: making cereals (Fig. 5), making milk (Fig. 6), making juice (Fig. 7), making fried egg (Fig. 8), and making pancake (Fig. 9) on Breakfast, changing tire (Fig. 10) on YTI, and making salad (Fig. 11) on 50 Salads (eval level, 12 classes). The mapping between cluster labels and ground truth classes is done with global Hungarian matching on all videos. The number of clusters is set to the maximum number of ground truth classes for each activity (𝐾 = max.#gt).
For each activity, we visualize the results of 10 videos. For each video, the 3-row group displays the ground truth (1st row), TAEC (2nd row), and MLP+kmeans [12] (3rd row).
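
The mapping described above can be made explicit with a small amount of code. The following is a minimal sketch, using SciPy's linear_sum_assignment, of global Hungarian matching over all videos of an activity together with the resulting MoF score; the helper name is ours, and IoU and F1 would be computed from the same mapped labels.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def global_hungarian_mof(pred_videos, gt_videos, n_clusters, n_classes):
    """Match cluster labels to ground-truth classes with one global assignment
    over all videos of an activity and return the mapping and MoF accuracy.

    pred_videos / gt_videos: lists of 1-D integer arrays, one pair per video.
    """
    pred = np.concatenate(pred_videos)
    gt = np.concatenate(gt_videos)
    # co-occurrence matrix: number of frames where cluster k overlaps class c
    overlap = np.zeros((n_clusters, n_classes), dtype=np.int64)
    np.add.at(overlap, (pred, gt), 1)
    # Hungarian matching maximizes the total overlap (hence the negation)
    rows, cols = linear_sum_assignment(-overlap)
    mapping = dict(zip(rows, cols))            # cluster id -> class id
    mapped = np.array([mapping.get(k, -1) for k in pred])
    mof = float((mapped == gt).mean())         # mean over frames of all videos
    return mapping, mof
```

Matching once on the pooled frames of all videos of an activity is what "global Hungarian matching on all videos" refers to in the table and figure captions.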
[Figure 5 panels: Making cereals; Clustering result / Final result for vid 1 to vid 10; color legend: empty, take bowl, pour cereals, pour milk, stir milk]

Figure 5: Qualitative results of clustering and final segmentation of 10 making cereals videos on Breakfast. For each video, the 3-row group displays the ground truth (1st row), TAEC (2nd row), and MLP+kmeans [12] (3rd row). The mapping between cluster labels and ground truth classes is done with global Hungarian matching on all videos. The number of clusters is set to the maximum number of ground truth classes for each activity (𝐾 = max.#gt).
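
As a side note, 3-row groups like those in Figures 5 to 11 can be reproduced from frame-wise label arrays with a few lines of matplotlib; the sketch below is only illustrative, and the color map and figure size are arbitrary choices of ours.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_segmentation_rows(rows, n_classes,
                           row_names=("ground truth", "TAEC", "MLP+kmeans")):
    """Render frame-wise label sequences as colored horizontal bars, one row per sequence."""
    fig, ax = plt.subplots(figsize=(8, 0.4 * len(rows)))
    # fixed color range keeps class colors consistent across videos
    ax.imshow(np.stack(rows), aspect="auto", interpolation="nearest",
              cmap="tab20", vmin=0, vmax=n_classes - 1)
    ax.set_yticks(range(len(rows)))
    ax.set_yticklabels(row_names[:len(rows)])
    ax.set_xlabel("frame")
    return fig
```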




[Figure 6 panels: Making milk; Clustering result / Final result for vid 1 to vid 10; color legend: empty, spoon powder, pour milk, stir milk, take cup]

Figure 6: Qualitative results for 10 making milk videos on Breakfast.




[Figure 7 panels: Making juice; Clustering result / Final result for vid 1 to vid 10; color legend: empty, cut orange, squeeze orange, take glass, take squeezer, take knife, take plate, pour juice]

Figure 7: Qualitative results for 10 making juice videos on Breakfast.
[Figure 8 panels: Making fried egg; Clustering result / Final result for vid 1 to vid 10; color legend: empty, pour oil, take plate, take eggs, fry egg, butter pan, add salt, crack egg, put egg2plate]

Figure 8: Qualitative results for 10 making fried egg videos on Breakfast.




[Figure 9 panels: Making pancake; Clustering result / Final result for vid 1 to vid 10; color legend: empty, pour flour, put pancake2plate, fry pancake, pour dough2pan, stir dough, spoon flour, butter pan, take eggs, take plate, crack egg, pour oil, take bowl, pour milk]

Figure 9: Qualitative results for 10 making pancake videos on Breakfast.




[Figure 10 panels: Changing tire; Clustering result / Final result for vid 1 to vid 10; color legend: bkg, get things out, start loose, jack up, unscrew wheel1, break on, tight wheel, put wheel, unscrew wheel2, screw wheel, jack down, put things back]

Figure 10: Qualitative results for 10 changing tire videos of the YouTube Instructions dataset. Similar to the Breakfast illustrations, for each video the 3-row group shows the ground truth (1st row), TAEC (2nd row), and MLP+kmeans [12] (3rd row).
[Figure 11 panels: Making salad; Clustering result / Final result for vid 1 to vid 10; color legend: action start, add dressing, add oil, add pepper, cut lettuce, serve salad, add vinegar, mix dressing, mix ingredients, peel cucumber, place lettuce, action end]

Figure 11: Qualitative results for 10 making salad videos on the 50 Salads dataset (eval level, 12 action classes). Again, for each video the 3-row group shows the ground truth (1st row), TAEC (2nd row), and MLP+kmeans [12] (3rd row).