<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>TAEC: Unsupervised Action Segmentation with Temporal-Aware Embedding and Clustering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Wei Lin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anna Kukleva</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Horst Possegger</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hilde Kuehne</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Horst Bischof</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Christian Doppler Laboratory for Semantic 3D Computer Vision</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Goethe University Frankfurt</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Institute of Computer Graphics and Vision, Graz University of Technology</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Max-Planck-Institute for Informatics</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Temporal action segmentation in untrimmed videos has gained increased attention recently. However, annotating action classes and frame-wise boundaries is extremely time consuming and cost intensive, especially on large-scale datasets. To address this issue, we propose an unsupervised approach for learning action classes from untrimmed video sequences. In particular, we propose a temporal embedding network that combines relative time prediction, feature reconstruction, and sequence-to-sequence learning, to preserve the spatial layout and sequential nature of the video features. A two-step clustering pipeline on these embedded feature representations then allows us to enforce temporal consistency within, as well as across videos. Based on the identified clusters, we decode the video into coherent temporal segments that correspond to semantically meaningful action classes. Our evaluation on three challenging datasets shows the impact of each component and, furthermore, demonstrates our state-of-the-art unsupervised action segmentation results.</p>
      </abstract>
      <kwd-group>
        <kwd>Unsupervised learning</kwd>
        <kwd>unsupervised clustering</kwd>
        <kwd>action segmentation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>Feature Embedding</title>
        <p>3 videos of Making coffee
Feature
embedding
frames of video 1</p>
        <p>Feature
embedding
frames of video 2</p>
        <p>Feature
embedding
frames of video 3
(a)</p>
      </sec>
      <sec id="sec-1-2">
        <title>Within-Video</title>
      </sec>
      <sec id="sec-1-3">
        <title>Clustering</title>
      </sec>
      <sec id="sec-1-4">
        <title>Cross-Video</title>
      </sec>
      <sec id="sec-1-5">
        <title>Global Cluster Assignment</title>
      </sec>
      <sec id="sec-1-6">
        <title>Viterbi Decoding</title>
        <p>Ck,n : the k-th within-video cluster
in video n
(b)
2
1
1
Global
cluster 1
1
2
2
Global
cluster 2
(c)
3
3
3
Global
cluster 3
2 → 1 → 3
1 → 2 → 3
1 → 2 → 3
(d)
sists of a within-video clustering and a cross-video global and recognition of frame orders [27, 28, 29, 30, 31]. For
cluster assignment. Specifically, we perform cluster- instance, Srivastava et al. [24] exploit an LSTM-based
auing within each video, with a spatio-temporal similarity toencoder for learning video representations. Villegas et
among frames. Then we conduct global cluster assign- al. [26] and Denton and Birodkar [25] employed two
enment to group the clusters across videos. The global coders to generate feature representations of content and
cluster assignment defines the ordering of the clusters motion. The temporal order of frames or small chunks
for each video. In this way, we overcome the unrealis- is utilized as a self-supervision signal for representation
tic assumption that actions of an activity always follow learning on short video clips in [27] and [28]. Inspired by
the same temporal order. Such an assumption is com- these approaches, we employ two self-supervision tasks:
monly used in related works, e.g. [21, 22]. For instance, feature reconstruction and relative time prediction.
in the activity of making cofee, a unified temporal order Clustering of temporal sequences has been explored
between actions such as adding milk and adding sugar for parsing human motions [32, 33, 34, 35]. While Zhang
is assumed for all videos of making cofee, whereas our et al. [35] proposed a hierarchical dynamic clustering
approach can handle changes of the action order in dif- framework, Li et al. [33] and Tierney et al. [34] explored
ferent videos. After assigning all within-video clusters to temporal subspace clustering to segment human motion
a set of global clusters, we perform Viterbi decoding to data. In contrast to unsupervised action segmentation,
obtain a segmentation of temporally coherent segments. these methods are applied on each temporal sequence
Our contributions can summarized as following: individually and do not consider association among
se• We design a sequence-to-sequence temporal em- quences. Instead, we propose a cross-video global cluster
bedding network (SSTEN), which combines rel- assignment to group within-video clusters across
diferative timestamp prediction, autoencoder recon- ent videos into global clusters.</p>
        <p>struction and sequence-to-sequence learning. Unsupervised action segmentation on fine-grained
• We propose a within-video clustering with a activities has recent work that either focus on the
reprenovel spatio-temporal similarity formulation sentation learning [20, 22, 36] or the clustering step [23].
among frames. However, the temporal information is neglected in at
least one of these two steps. For representation learning,
• We propose a cross-video global cluster
assign</p>
        <p>Sener and Yao [20] construct a feature embedding by
ment to group within-video clusters across videos</p>
        <p>learning a linear mapping from visual features to a latent
into global clusters, which also overcomes the
as</p>
        <p>space with a ranking loss. However, the linear model
sumption that in all videos of an activity, actions</p>
        <p>trained with individual frames does not consider the
temfollow the same temporal order.</p>
        <p>poral association between frames. VidalMata et al. [22]
employ a U-Net trained on individual frames for future
2. Related Work frame prediction. Predicting for one or a few steps ahead
only requires temporal relations within a small temporal
Unsupervised learning of video representations is window. Instead, we propose to learn a representation by
commonly performed via pretext tasks, such as recon- predicting the complete sequence of relative timestamps
struction [23, 24], future frame prediction [22, 25, 26], to encode the long-range temporal information.</p>
        <p>For the clustering step, related works [22, 23] neglect
temporal consistency of frames within a video. Instead,
we apply within-video clustering on each video with
a proposed similarity formulation that considers both
spatial and temporal distances.</p>
        <p>Two recent approaches perform clustering [37] or
cluster-agnostic boundary detection [38] on each video
separately, without identifying clusters or segments
across videos. [37] solves a task similar to human motion
parsing and evaluates the segmentation for each video
individually. [38] only detects boundaries of
categoryagnostic segments, and does not identify if some
segments within a video or across videos are of the same
category. On the contrary, our segments on all videos
are category-aware as they are aligned globally across
videos by our global cluster assignment.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Temporal-Aware Embedding and Clustering (TAEC)</title>
      <p>We address unsupervised action segmentation as illustrated in Fig. 1. First, we learn a suitable feature embedding (Sec. 3.1). We then perform within-video clustering on each video (Sec. 3.2.1) and group the within-video clusters into global clusters (Sec. 3.2.2). Finally, we compute temporally coherent segments on each video using Viterbi decoding (Sec. 3.3).</p>
      <sec id="sec-2-1">
        <title>3.1. SSTEN: Sequence-to-Sequence Temporal Embedding Network</title>
        <p>To learn a latent representation for temporal sequences, we adopt a sequence-to-sequence autoencoder. Inspired by the multi-stage temporal convolutional network [7], we use a concatenation of two stages for both encoder and decoder, as shown in Fig. 2. Given a set {X_n}_{n=1}^{N} of N videos, where each video X_n = {x_{t,n}}_{t=1}^{T_n} has T_n frames, the outputs are the reconstructed frame features {x̂_{t,n}}_{t=1}^{T_n}. The embedded features are the hidden representation {e_{t,n}}_{t=1}^{T_n}.</p>
        <p>Every encoder and decoder stage consists of 1×1 convolution layers for dimension adjustment (Fig. 2, blue) and L dilated residual layers (green), each containing a dilated temporal 1D convolution. Since no fully connected layers are employed, sequences of variable length can be processed seamlessly. The dilation rate at the l-th layer is 2^{l-1}, so the temporal receptive field grows exponentially when stacking dilated residual layers: the receptive field of the l-th layer is 1 + (k - 1)(2^l - 1), where k is the kernel size. Therefore, each frame in the hidden representation has a long temporal dependency on the input video. In each encoder stage, we use a 1×1 convolution layer (in red) to predict the frame-wise relative timestamps τ_{t,n} = t / T_n. At the end of each encoder stage, the hidden representation is a concatenation (in yellow) of the features from the dilated residual layers and the predicted relative timestamps. The training loss is

\mathcal{L} = \lambda \sum_{n=1}^{N} \sum_{t=1}^{T_n} \lVert x_{t,n} - \hat{x}_{t,n} \rVert_2^2 + \sum_{s \in \{1,2\}} \sum_{n=1}^{N} \sum_{t=1}^{T_n} (\tau_{t,n} - \hat{\tau}_{t,n,s})^2, \qquad (1)

where the coefficient λ balances the two terms. The pretext tasks of reconstruction and relative timestamp prediction encode both the spatial distribution and the global temporal information into the embedded features.</p>
        <p>We compare SSTEN with several baseline embedding networks in the supplementary.</p>
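        <p>For concreteness, the following is a minimal PyTorch-style sketch of the loss in Eq. (1) for a single video; the function name, tensor shapes and the toy data are illustrative assumptions rather than the original implementation.</p>
        <preformat>
import torch

def ssten_loss(x, x_hat, tau, tau_hat_stages, lam):
    """Sketch of Eq. (1): reconstruction term weighted by lam plus the
    relative-timestamp prediction terms of the two encoder stages.

    x, x_hat        : (T, D) input and reconstructed frame features of one video
    tau             : (T,) ground-truth relative timestamps t / T
    tau_hat_stages  : list of (T,) predictions, one per encoder stage (here two)
    lam             : coefficient balancing the reconstruction term
    """
    recon = ((x - x_hat) ** 2).sum(dim=1).sum()            # sum_t ||x_t - x_hat_t||_2^2
    time = sum(((tau - tau_hat) ** 2).sum() for tau_hat in tau_hat_stages)
    return lam * recon + time

# toy usage for a video with T = 100 frames and D = 64 feature dimensions
T, D = 100, 64
x = torch.randn(T, D)
x_hat = x + 0.1 * torch.randn(T, D)
tau = torch.arange(1, T + 1, dtype=torch.float32) / T
tau_hat_stages = [tau + 0.05 * torch.randn(T), tau + 0.05 * torch.randn(T)]
loss = ssten_loss(x, x_hat, tau, tau_hat_stages, lam=0.002)
        </preformat>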
      </sec>
      <sec id="sec-2-3">
        <title>3.2. Two-Step Clustering</title>
        <p>To learn a latent representation for temporal sequences, After learning the feature embedding, we group the
emwe adopt a sequence-to-sequence autoencoder. Inspired bedded features into  clusters by a within-video
clusby the multi-stage temporal convolutional network [7], tering and a cross-video global cluster assignment.
we use a concatenation of two stages for both encoder</p>
        <p>and decoder, as shown in Fig. 2. Given a set {X}=1
of  videos, where each video X = {x,}=1 has 3.2.1. Within-Video Clustering
 frames, the outputs are reconstructed frame features We perform spectral clustering on frames within each
{xˆ,}=1. The embedded features are the hidden repre- video (detailed description in the supplementary). Given
sentation {e,}=1. the embedded feature sequence1 [e1, e2, ..., e ], we build</p>
        <p>Every encoder and decoder stage consist of 1 × 1 con- a frame-to-frame similarity matrix  ∈ R × . The
volution layers for dimension adjustment (Fig. 2 blue) and entries (, ), ,  ∈ {1, ...,  }, represent the similarity
 dilated residual layers (green), each containing a di- between frame  and frame . To consider both the spatial
lated temporal 1D convolution. Since no fully connected and temporal distance of features, we propose to measure
layers are employed, sequences of variable lengths can the similarity by the product of two Gaussian kernels
be processed seamlessly. The dilation rate at the -th
ltaeymeproirsal2re−c1e.ptBivyesfietladckinincrgeadsielasteexdproenseidnutiaallllya.yTehrse, rteh-e (, ) = exp(︃− ‖e−s2paet ‖22 )︃ · exp(︂− ( −t2mp )2 )︂,
ceptive field of the -th layer is 1 + ( − 1) × (2 − 1), (2)
where  is the kernel size. Therefore, each frame in the where ,  are the corresponding relative timestamps
hidden representation has a long temporal dependency of frame ,  and spat, tmp are the scaling factors for the
on the input video. In each encoder stage, we use a 1 × 1
convolution layer (in red) to predict the frame-wise
rela</p>
        <p>. At the end of each encoder 1For ease of notation, we omit the video index .
tive timestamps , =</p>
        <p>Wei Lin et al. CEUR Workshop Proceedings
1–10
spatial and temporal Gaussian kernels. To avoid
manually tuning spat, we use local scaling [39] to estimate spat
dynamically. To this end, we replace s2pat by  , where
 is the distance from e to its -th nearest neighbor in
the embedding space. We provide an ablation study on
scaling of the spatio-temporal similarity in the
supplementary. Consequently, frames of similar visual content
and relative timestamps are encouraged to be grouped
into the same cluster.
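        <p>A small NumPy sketch of the similarity in Eq. (2) with local scaling is given below; the function name, the neighbor index m and the temporal scaling choice (taken from the supplementary, Sec. 3.2) are assumptions for illustration.</p>
        <preformat>
import numpy as np

def spatio_temporal_similarity(e, m=9, sigma_prime=1.0 / 6.0):
    """Sketch of Eq. (2): product of a spatial and a temporal Gaussian kernel.

    e           : (T, D) embedded features of one video
    m           : neighbor index used for local scaling of the spatial kernel
    sigma_prime : temporal scale, with sigma_tmp^2 = 2 * sigma_prime^2
    """
    T = e.shape[0]
    tau = np.arange(1, T + 1) / T                                  # relative timestamps
    d2 = ((e[:, None, :] - e[None, :, :]) ** 2).sum(-1)            # squared spatial distances
    # local scaling: sigma_i is the distance from e_i to its m-th nearest neighbor
    sigma = np.sqrt(np.sort(d2, axis=1)[:, m])
    spatial = np.exp(-d2 / (sigma[:, None] * sigma[None, :]))
    temporal = np.exp(-((tau[:, None] - tau[None, :]) ** 2) / (2.0 * sigma_prime ** 2))
    return spatial * temporal
        </preformat>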
        <p>3.2.2. Cross-Video Global Cluster Assignment. After within-video clustering, we assign the N × K within-video clusters across videos into K global clusters. Every global cluster should contain N within-video clusters, each coming from a different video (c.f. Fig. 1). This can be interpreted as an N-dimensional assignment problem [40].</p>
        <p>We regard the n-th video V_n = {c_{n,k} | k = 1,..,K} as a vertex set, where each k-th within-video cluster c_{n,k} is a vertex. We construct an N-partite graph G = (V_1 ∪ V_2 ∪ ... ∪ V_N, E), where E = ⋃_{m&lt;n, m,n∈{1,...,N}} {(c, c') | c ∈ V_m, c' ∈ V_n} is the set of edges between within-video clusters across videos. The edge weight w(c, c') is the distance between the centroids of the two within-video clusters c, c'. The solution to the N-dimensional assignment is a partition obtained by dividing the graph G into K cliques Q_1, Q_2, ..., Q_K. A clique Q_k, which is a subset of N vertices from N different vertex sets, defines the k-th global cluster. The induced sub-graphs of the cliques Q_1, Q_2, ..., Q_K are complete and disjoint. We denote the edge set of the induced sub-graph of Q_k by E_k. The cost of a clique is the sum of the pairwise edge weights between the contained vertices, and the cost of an assignment solution is the sum of the costs of all K cliques, i.e.,

\mathcal{L}(Q_1, Q_2, ..., Q_K) = \sum_{k=1}^{K} \sum_{(c,c') \in E_k} w(c, c'). \qquad (3)</p>
        <p>In order to solve this NP-hard problem, we employ an iterative multiple-hub heuristic [41]. In each iteration, we choose a hub vertex set V_h = {c_{h,k} | k = 1,..,K}, and there are (N - 1) non-hub vertex sets. We compute an assignment solution in each iteration in two steps, as shown in Fig. 3: (1) We first perform (N - 1) bipartite matchings between V_h and each of the remaining non-hub vertex sets. (2) Secondly, we determine the edge connections between pairs of non-hub vertex sets: on two non-hub vertex sets V_m, V_{m'}, we connect two vertices c ∈ V_m and c' ∈ V_{m'} if c and c' are connected to the same vertex of V_h. After the two steps, every hub vertex c_{h,k}, with k ∈ {1,..,K}, and all the non-hub vertices connected to c_{h,k} form the k-th clique Q_k. Therefore, the N-partite graph G is partitioned into K complete and disjoint subgraphs.</p>
        <p>By iterating over all possible initial hub vertex sets h ∈ {1, ..., N}, we choose the assignment solution ĥ which minimizes the assignment cost

\hat{h} = \arg\min_{h \in \{1,...,N\}} \sum_{(c,c') \in E} \delta_h(c, c') \cdot w(c, c'), \qquad (4)

where δ_h(c, c'), ∀(c, c') ∈ E, is a binary indicator function that describes the edge connections: δ_h(c, c') equals 1 when the two vertices c, c' are connected. The assignment solution ĥ describes the partition which leads to the K global clusters.</p>
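        <p>The multi-hub heuristic can be sketched with an off-the-shelf Hungarian solver for the bipartite matchings in step (1); the following NumPy/SciPy snippet is only an illustration of Eqs. (3) and (4), and its function and variable names are not taken from the original code.</p>
        <preformat>
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def global_cluster_assignment(centroids):
    """Sketch of the iterative multiple-hub heuristic.

    centroids : list of N arrays, each (K, D), holding the within-video cluster
                centroids of one video.  Returns one permutation per video that
                maps its K within-video clusters to the K global clusters.
    """
    N, K = len(centroids), centroids[0].shape[0]
    candidates = []
    for h in range(N):                                   # iterate over all hub choices
        perms = [None] * N
        perms[h] = np.arange(K)                          # hub clusters define the global ids
        for n in range(N):
            if n == h:
                continue
            # step (1): bipartite matching between the hub and one non-hub vertex set
            row, col = linear_sum_assignment(cdist(centroids[h], centroids[n]))
            perm = np.empty(K, dtype=int)
            perm[col] = row                              # cluster `col` of video n joins clique `row`
            perms[n] = perm
        # step (2): vertices matched to the same hub vertex form a clique; the cost of
        # the solution is the sum of pairwise edge weights inside all K cliques
        cost = 0.0
        for k in range(K):
            members = np.stack([centroids[n][int(np.where(perms[n] == k)[0][0])]
                                for n in range(N)])
            cost += cdist(members, members).sum() / 2.0
        candidates.append((cost, perms))
    best = int(np.argmin([c for c, _ in candidates]))    # Eq. (4): keep the cheapest hub
    return candidates[best][1]
        </preformat>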
      </sec>
      <sec id="sec-2-4">
        <title>3.3. Frame Labeling by Viterbi Decoding</title>
        <p>Given the embedded feature sequence e_{1∼T_n,n} of video n, we determine the optimal label sequence ĉ_{1∼T_n,n}. The posterior probability can be factorized into the product of likelihoods and the probability of a given temporal order, i.e.,

\hat{c}_{1 \sim T_n, n} = \arg\max_{c_{1 \sim T_n, n}} p(c_{1 \sim T_n, n} | e_{1 \sim T_n, n}) = \arg\max_{c_{1 \sim T_n, n}} \Big\{ \prod_{t=1}^{T_n} p(e_{t,n} | c_{t,n}) \cdot \prod_{t=1}^{T_n} p(c_{t,n} | c_{1 \sim (t-1), n}) \Big\}.

We fit a Gaussian model on each global cluster and compute the frame-wise likelihoods, i.e., p(x | k) = N(x; μ_k, Σ_k), k ∈ {1, ..., K}. The temporal order constraint is used to limit the search space for the optimal label sequence by filtering out the sequences that do not follow the temporal order.</p>
        <p>The related works [21, 22] apply K-means on the frames of all the videos. From the unified clustering, they derive only a single temporal order of clusters for all the videos. However, this is an unrealistic assumption due to interchangeable steps in the activities, e.g., pour milk and pour sugar in making coffee. Instead, we can easily derive the temporal order for each video separately. We do so by sorting the within-video clusters according to the average timestamp of the frames in each cluster. The output of the Viterbi decoding is the optimal cluster label sequence ĉ_{1∼T_n,n}. More details are given in the supplementary.</p>
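        <p>A compact sketch of this decoding step is shown below, assuming the per-video cluster order has already been derived by sorting average timestamps; the Gaussian fitting, the forced full sequence and all names are illustrative choices, not the reference implementation.</p>
        <preformat>
import numpy as np
from scipy.stats import multivariate_normal

def viterbi_decode(frames, clusters, order):
    """Sketch of Sec. 3.3: Gaussian likelihoods per global cluster plus a binary
    transition model that only allows staying in the current cluster or moving
    to the next one in the video-specific temporal order.

    frames   : (T, D) embedded features of one video
    clusters : list of K arrays (N_k, D) with the frames of each global cluster
    order    : cluster indices sorted by the average timestamp of their frames
    """
    T, K = frames.shape[0], len(order)
    loglik = np.stack([
        multivariate_normal(clusters[k].mean(0),
                            np.cov(clusters[k].T) + 1e-6 * np.eye(frames.shape[1]),
                            allow_singular=True).logpdf(frames)
        for k in order], axis=1)                          # (T, K), columns follow `order`
    score = np.full((T, K), -np.inf)
    back = np.zeros((T, K), dtype=int)
    score[0, 0] = loglik[0, 0]                            # the path starts in the first cluster
    for t in range(1, T):
        for j in range(K):
            prev = [j] if j == 0 else [j - 1, j]          # stay, or advance by one cluster
            best = prev[int(np.argmax(score[t - 1, prev]))]
            score[t, j] = score[t - 1, best] + loglik[t, j]
            back[t, j] = best
    labels = np.zeros(T, dtype=int)
    labels[-1] = K - 1                                    # force the full ordered sequence
    for t in range(T - 2, -1, -1):
        labels[t] = back[t + 1, labels[t + 1]]
    return [order[j] for j in labels]                     # map back to global cluster ids
        </preformat>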
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Experiments</title>
      <p>4.1. Datasets &amp; Evaluation Metrics. We evaluate on Breakfast [42], the YouTube Instructions dataset (YTI) [36] and 50 Salads [43]. Breakfast is comprised of 1712 videos recorded in various kitchens. There are 10 composite activities of breakfast preparation. YTI is composed of 150 videos of 5 activities collected from YouTube. 50 Salads contains 50 videos of people preparing salads. Following [20, 21, 22], we use the dense trajectory Fisher vector features (DTFV) [44] for Breakfast and 50 Salads, and the features provided by Alayrac et al. [36] for YTI. We use the evaluation protocol in [21] and report the performance in three metrics: (1) Mean over Frames (MoF) is the frame-level accuracy over the frames of all the videos. More frequent or longer action instances have a higher impact on the result. (2) Class-wise mean Intersection over Union (cIoU) is the average over the IoU performance for each class and penalizes segmentation results with dominating segments. (3) The F1-score penalizes results with oversegmentation.</p>
      <p>4.2. Implementation Details. For our SSTEN, we adapt the number of dilated residual layers L according to the dataset size: we set L = 5 for YTI (15k frames per activity subset on average) and L = 10 for Breakfast (360k) and 50 Salads (577k). The dimension of the hidden representation is set to 32. We set λ in Eq. (1) to 0.002 (Breakfast), 0.01 (YTI) and 0.005 (50 Salads). For clustering, we follow the protocol of [20, 36] and define the number of clusters K separately for each activity as the maximum number of ground truth classes. The values of K for the three datasets are provided in the supplementary material.</p>
      <p>4.3. Comparison with the State-of-the-Art. We compare with unsupervised learning methods, as well as weakly and fully supervised approaches, on Breakfast (Table 1), YTI (Table 2) and 50 Salads (Table 3). Most unsupervised segmentation approaches yield cluster-aware segments that are aligned across all the videos [20, 21, 22, 36, 45]. These approaches are evaluated with global Hungarian matching on all videos, where the mapping between ground truth classes and clusters is performed on all the videos of an activity, which results in one mapping for each activity. The number of clusters K is set to the maximum number of ground truth classes for each activity (i.e., K=max.#gt). We focus on the performance comparison in this setting and follow this setting in all the ablation studies.</p>
      <p>Two recent approaches perform clustering (i.e., TW-FINCH [37]) or category-agnostic boundary detection (i.e., LSTM+AL [38]) on each video individually, without solving the alignment among different clusters or segments across videos. For a fair comparison, these are evaluated by local Hungarian matching on individual videos, where a per-video best ground-truth-to-cluster-label mapping is determined using the ground truth of each video separately. This results in a separate label mapping for each video. Following [37], we also report results with K set to the average number of actions for each activity (i.e., K=avg.#gt) for a complete comparison.</p>
      <p>In Table 1, TAEC achieves strong results in comparison to the unsupervised state-of-the-art and is even comparable to weakly supervised approaches. Although approaches without solving the alignment of clusters across videos inherently lead to better scores in the evaluation settings of the local Hungarian matching, our approach still compares favorably.</p>
      <p>We compare qualitative results (with global Hungarian matching) of TAEC and MLP+kmeans [21] on 3 Breakfast activities in Fig. 4. We see that our two-step clustering (the 2nd rows in all clustering result plots) already leads to temporally consistent segments with relatively accurate boundaries of action instances, while K-means (the 4th rows in all clustering result plots) results in serious oversegmentation. The Viterbi decoding further improves the segmentation by suppressing the oversegmentation and the domination of incorrect clusters (the 2nd rows in all final result plots). Moreover, MLP+kmeans [21] follows the constraint of a fixed temporal order of segments on the videos of each activity (the 4th rows in all final result plots). In contrast, TAEC yields an individual temporal order for each video (the 2nd rows in all final result plots). Additional qualitative results and evaluation scores are included in the supplemental material.</p>
      <p>For the YouTube Instructions dataset, we follow the protocol of [20, 21, 36] and report the results with and without considering background frames. Here, our TAEC outperforms all recent works in almost all of the metrics under all three settings.</p>
      <p>50 Salads is a particularly challenging dataset for unsupervised approaches, as each video has a different order of actions and additionally includes many repetitive action instances. In the eval-level of 12 classes, TAEC outperforms all approaches under the global Hungarian matching evaluation and achieves competitive results under the local Hungarian matching. In the challenging mid-level evaluation of 19 classes, the sequential nature of frames is less advantageous. Therefore, MLP+kmeans [21] outperforms TAEC. Generally, in the local matching case, approaches without alignment across videos compare favorably.</p>
      <p>Figure 4: Qualitative clustering and final segmentation results for three Breakfast activities (making cereals, making juice and making fried egg), three videos each, with ground truth action classes such as take bowl, pour cereals, pour milk, stir milk; cut orange, squeeze orange, take squeezer, take knife, take plate, take glass, pour juice; pour oil, butter pan, take eggs, crack egg, fry egg, add salt, put egg2plate.</p>
      <p>Comparison of raw features without embedding.</p>
      <p>Among the three types of features without temporal
embedding, I3D achieves the best performance, while
AlexNet features lead to the worst results. AlexNet
features are computed from individual spatial frames. On
the contrary, each frame feature of DTFV and I3D is
computed from a chunk of temporally neighboring frames.</p>
      <p>Therefore, the features already carry intrinsic temporal
consistency. Furthermore, the two-stream I3D model can
leverage both RGB and optical flow. Therefore, I3D
features achieve a better performance than DTFV, which
rely on handcrafted dense trajectories.</p>
      <p>Comparison of SSTEN embeddings learned on
different features. When comparing the SSTEN
embeddings to the performance of the raw features, we see
that SSTEN leads to a significant performance gain for
both clustering methods. For DTFV, the performance
improvements by SSTEN are MoF 8.5%, IoU 6.0%, F1 8.9%
with K-means and MoF 15.8%, IoU 6.9%, F1 11.6% with
two-step clustering.</p>
      <p>Among the three types of SSTEN embedded features,
I3D has slightly better IoU and F1 scores while DTFV
leads to the best MoF scores for both K-means and the
two-step clustering. Overall, the SSTEN embeddings
learned from these two features perform comparably. We
conduct the following experiments using DTFV, which
is also used in related works.</p>
      <sec id="sec-3-1">
        <title>4.5. Impact of Loss Terms on Clustering</title>
        <p>To evaluate the impact of the two loss terms in Eq. (1), we plot the quantitative segmentation results of SSTEN with both K-means and the two-step clustering w.r.t. different reconstruction loss coefficients λ in Fig. 5. In general, two-step clustering leads to a better performance than K-means for almost all λ values (except for the case of only the reconstruction loss). With decreasing λ, the relative time prediction loss has an increasing impact and the embedded features have better global temporal consistency, which explains the increasing IoU and F1 scores. However, at extremely small λ values, the embedded features overfit to the relative time prediction task, which results in saturated IoU and F1 scores, and a significant drop in MoF for both K-means and two-step clustering.</p>
        <p>Figure 5: MoF, IoU and F1 of K-means and two-step clustering on SSTEN embeddings trained with different reconstruction loss coefficients λ, ranging from only relative time prediction to only reconstruction.</p>
        <p>To intuitively illustrate the impact of the loss terms on the two-step clustering, we plot the similarity matrices for SSTEN embeddings trained with three different λ values in Fig. 6. Here, we look at the similarity matrices with the temporal Gaussian kernel (bottom row). Intuitively, the similarity matrix with a clear diagonal block structure (Fig. 6(a2)), which is the result of an appropriate ratio between the reconstruction loss and the relative time prediction loss (λ = 0.002), leads to the best segmentation performance. When λ becomes larger (e.g., λ = 0.01), the reconstruction loss has a larger impact and the diagonal block structure (Fig. 6(b2)) becomes pale. Therefore, the performances of the embedded features with λ = 0.005, λ = 0.01 and only the reconstruction loss degrade successively. On the other hand, for extremely small λ values (e.g., λ = 0.0005), the block diagonal structure (Fig. 6(c2)) becomes noisy due to overfitting on the relative time prediction.</p>
        <p>Figure 6: Frame-to-frame similarity matrices of SSTEN embeddings for the same Breakfast video. Columns show the similarity matrices for different λ, while the rows show results without (top) and with (bottom) the temporal Gaussian kernel.</p>
        <p>Therefore, both the reconstruction and the relative timestamp prediction loss, when combined in an appropriate ratio, are indispensable to learn an effective representation that preserves both the spatial layout and the temporal information.</p>
        <p>4.6. Impact of Cluster Assignment. In this ablation study, we evaluate the efficacy of the global cluster assignment. For two-step clustering, we evaluate two strategies of grouping within-video clusters into global clusters: (1) the naïve assignment, for which we order the sub-clusters according to the average timestamp and simply group the k-th sub-clusters of all videos into a global cluster, i.e., the global cluster Q_k = {c_{n,k} | n = 1, .., N}, and (2) the global cluster assignment, as detailed in Sec. 3.2.2.</p>
        <p>In order to show how the different cluster assignment strategies affect the clustering result, we report both the results of the two-step clustering (before Viterbi decoding) and the final segmentation performance (after Viterbi decoding) on Breakfast and 50 Salads in Table 5. The global cluster assignment outperforms the naïve assignment by a large margin, for both the clustering results and the final segmentation results, on both datasets. The advantage of the global cluster assignment is even more evident on 50 Salads.</p>
        <p>We illustrate exemplary qualitative results of the clustering and the final segmentation for 3 activities (with 3 videos each) on Breakfast in Fig. 4. For each video, the plots display the ground truth (1st row), the result with the global cluster assignment (2nd row), the result with the naïve assignment (3rd row) and the result of MLP+kmeans [21] (4th row). By comparing them, we see that the naïve assignment simply assumes that the sub-clusters at the same temporal position in each video belong to the same global cluster, while they might not be close to each other in the feature space. On the contrary, the global cluster assignment (the 2nd rows of the final result plots) yields an optimal assignment solution with respect to the pairwise distances between sub-clusters, resulting in different orderings of sub-clusters on each video. Note that on some videos, the global cluster assignment can lead to the same assignment result as the naïve assignment.</p>
        <p>5. Conclusion. We proposed a new pipeline for the unsupervised learning of action segmentation. For the feature embedding, we propose a temporal-aware embedding network that performs sequence-to-sequence learning with the pretext tasks of relative timestamp prediction and feature reconstruction. For clustering, we propose a two-step clustering schema, consisting of within-video clustering and cross-video global cluster assignment. The temporal embedding of sequence-to-sequence learning together with two-step clustering is proven to be a well-suited combination that considers the sequential nature of frames in both processing steps. Ultimately, we combine the temporal embedding with a frame-to-cluster assignment based on Viterbi decoding, which achieves the unsupervised state-of-the-art on three challenging benchmarks.</p>
        <p>TAEC: Unsupervised Action Segmentation with Temporal-Aware Embedding and Clustering</p>
        <p>Supplementary</p>
        <p>1. Introduction. For additional insights into TAEC, we introduce the background of spectral clustering in Sec. 2.1 and give details of the Viterbi decoding in Sec. 2.2. We perform more ablation studies comparing baseline embeddings and clustering methods (Sec. 3.1), the scaling of the spatio-temporal similarity (Sec. 3.2), cluster ordering (Sec. 3.3) and decoding strategies (Sec. 3.4). Finally, we provide more quantitative (Sec. 3.5) and qualitative segmentation results (Sec. 3.6) on the three datasets.</p>
        <p>2. Method</p>
        <p>2.1. Spectral Clustering. Background information related to Sec. 3.2.1 in the main manuscript: Given the embedded feature sequence e_1, e_2, ..., e_T, we build a frame-to-frame similarity graph A ∈ R^{T×T}, whose edge weight A(i,j), i,j ∈ {1,...,T}, represents the similarity between frame i and frame j. Grouping the frames into K clusters can be interpreted as a graph partition problem by cutting edges on A, resulting in K subgraphs A_1, A_2, ..., A_K. The normalized cut (Ncut) problem [1] is employed to compute a balanced partition by minimizing the energy

\mathcal{L}(A_1, A_2, ..., A_K) = \frac{1}{2} \sum_{k=1}^{K} \frac{W(A_k, \bar{A}_k)}{vol(A_k)}, \qquad (1)

where W(A_k, \bar{A}_k) represents the sum of edge weights between elements in the subgraph A_k and elements of all the other subgraphs, i.e., the sum of the weights of the edges to be cut, and vol(A_k) is the sum of the weights of the edges within the resulting subgraph A_k. Spectral clustering [2] is a relaxed solution to this NP-hard minimization problem in Eq. (1) and has shown good performance on many graph-based clustering problems, e.g., [3, 4, 5]. Note that while K-means operates on the Euclidean distance in the feature space and assumes convex and isotropic clusters, spectral clustering can find clusters with non-convex boundaries.</p>
        <p>2.2. Frame Labeling by Viterbi Decoding. Additional explanations to Sec. 3.3 in the main manuscript: The global cluster assignment delivers the ordered clusters on each video, which are aligned across all videos. To compute the final segmentation, we use the resulting ordering and decode each video into a sequence of K temporally consistent segments. That is, we determine the optimal label sequence ĉ_{1∼T_n,n} = {c_{1,n}, ..., c_{T_n,n}} by re-assigning each frame to one of the temporally ordered clusters.</p>
        <p>Given the embedded feature sequence e_{1∼T_n,n} = {e_{1,n}, ..., e_{T_n,n}} and the temporal order of the clusters, we search for the optimal label sequence that maximizes the probability p(c_{1∼T_n,n} | e_{1∼T_n,n}). Following [6], this posterior probability can be factorized into the product of likelihoods and the probability of a given temporal order, i.e.,

\hat{c}_{1 \sim T_n, n} = \arg\max_{c_{1 \sim T_n, n}} p(c_{1 \sim T_n, n} | e_{1 \sim T_n, n}) = \arg\max_{c_{1 \sim T_n, n}} \Big\{ \prod_{t=1}^{T_n} p(e_{t,n} | c_{t,n}) \cdot \prod_{t=1}^{T_n} p(c_{t,n} | c_{1 \sim (t-1), n}) \Big\} = \arg\max_{c_{1 \sim T_n, n}} \Big\{ \prod_{t=1}^{T_n} p(e_{t,n} | c_{t,n}) \cdot p(c_{t,n} | c_{t-1,n}) \Big\}. \qquad (2)

Here the likelihood p(e_{t,n} | c_{t,n}) is the probability of a frame embedding e_{t,n} from the video n belonging to a cluster. Therefore, we fit a Gaussian distribution on each global cluster and compute the frame-wise likelihoods with the Gaussian model, i.e.,

p(x | k) = \mathcal{N}(x; \mu_k, \Sigma_k), \quad k \in \{1, ..., K\}. \qquad (3)

p(c_{t,n} | c_{t-1,n}) is the transition probability from label c_{t-1,n} at frame t-1 to label c_{t,n} at frame t, which is defined by the temporal order of clusters. We denote the set of frame transitions defined by the temporal order of clusters on the n-th video by O_n; e.g., for the temporal order a → b → c → d, O_n = {a → b, b → c, c → d}. The transition probability is binary, i.e.,

p(c_{t,n} | c_{t-1,n}) = \mathbb{1}(c_{t,n} = c_{t-1,n} \vee c_{t-1,n} \rightarrow c_{t,n} \in O_n). \qquad (4)

This means that we allow either a transition to the next cluster according to the temporal order, or we keep the cluster assignment of the previous frame.</p>
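        <p>The relaxed Ncut solution of Sec. 2.1 can be sketched as the standard normalized-Laplacian spectral embedding followed by K-means; the snippet below illustrates that common recipe and is not the authors' exact implementation.</p>
        <preformat>
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(A, K):
    """Relaxed Ncut of Eq. (1): embed frames with the eigenvectors of the
    normalized graph Laplacian of the similarity matrix A, then run K-means.

    A : (T, T) frame-to-frame similarity matrix, e.g. from Eq. (2) of the main paper
    K : number of clusters
    """
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    # symmetric normalized Laplacian  L = I - D^{-1/2} A D^{-1/2}
    L = np.eye(A.shape[0]) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L)                        # eigenvalues in ascending order
    U = eigvecs[:, :K]                                    # K smallest eigenvalues
    U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    return KMeans(n_clusters=K, n_init=10).fit_predict(U)
        </preformat>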
        <p>Figure 1: Baseline embedding architectures: (a) MLP, (b) AEMLP, (c) TCN with dilated residual layers.</p>
        <p>Note that in two-step clustering, we derive the temporal order of clusters on each video separately, by sorting the clusters of the video according to the average timestamp. Therefore, we have an individual O_n for each video n. On the contrary, in K-means, there is a uniform order of global clusters for all the videos, and O_n is thus the same for each video n.</p>
        <p>The Viterbi algorithm for solving Eq. (2) is performed in an iterative process using dynamic programming, i.e.,

p(c_{1 \sim t, n} | e_{1 \sim t, n}) = \max_{c_{t-1,n}} \big\{ p(c_{1 \sim t-1, n} | e_{1 \sim t-1, n}) \cdot p(e_{t,n} | c_{t,n}) \cdot p(c_{t,n} | c_{t-1,n}) \big\}. \qquad (5)

The sequences that do not follow the temporal order are filtered out at an early stage to narrow down the search range for the optimal label sequence. The output of the Viterbi decoding is the optimal cluster label sequence, i.e., ĉ_{1∼T_n,n}.</p>
        <p>3. Additional Results</p>
        <p>3.1. Embedding and Clustering. Further, we compare our SSTEN embedding with three baseline variants (shown in Fig. 1): MLP temporal embedding, autoencoder with MLP (AEMLP) and temporal convolutional network (TCN), in combination with the two clustering methods. MLP uses three FC layers for relative timestamp prediction. AEMLP uses an MLP-based autoencoder for both relative timestamp prediction and feature reconstruction. TCN deploys L stacked dilated residual layers only for relative timestamp prediction. Here, we also implement the Rankloss MLP embedding [7] for reference. We report the performance of these five embeddings in Table 1.</p>
        <p>Comparison of the five embeddings. We learn the five embeddings (Rankloss MLP, MLP, AEMLP, TCN and SSTEN) on the DTFV features. Here, the Rankloss MLP (consisting of two FC layers) is trained with a ranking loss. We use the initialization of uniform segmentation as the temporal prior to train the model with only one iteration.</p>
        <p>TCN and SSTEN are both networks for sequence-to-sequence learning, while Rankloss MLP, MLP and AEMLP are trained on individual frames. By comparing the performance between these two groups in Table 1, we see that sequence-to-sequence learning leads to better performance, especially when combined with the two-step clustering, which results in clusters with better temporal consistency.</p>
        <p>For the two-step clustering, we also plot the frame-to-frame similarity matrices (spatial Gaussian kernel) of the five embeddings for the same Breakfast video in Fig. 2. The plots show that Rankloss MLP, MLP and AEMLP, which are trained on individual frames, do not expose an appropriate temporal structure. There are noisy block patterns even in positions far away from the diagonal, which results in noisy clusters and thus leads to erroneous temporal orders and inferior assignment results in the two-step clustering. The least noisy Rankloss MLP has the highest performance among these three. On the contrary, TCN and SSTEN embedded features, which show a clear diagonal block structure in the similarity graph, achieve a better performance in the two-step clustering. This verifies that sequence-to-sequence embedding learning (TCN and SSTEN) and two-step clustering are a well-suited combination to address the sequential nature of frames in both processing steps of feature embedding and clustering.</p>
        <p>Considering K-means clustering, the merit of having a better sequential nature of the embedded features via sequence-to-sequence learning can also be seen from the higher IoU and F1 scores (TCN: IoU 17.8%/F1 31.3% vs. SSTEN: 17.8%/31.9%), as these penalize dominating segments and oversegmentation.</p>
        <p>In contrast to TCN, SSTEN can preserve the spatial
layout of the input features due to the feature
reconstruction via the autoencoder. By comparing TCN and SSTEN,
we see that the SSTEN embedding with feature
reconstruction leads to a boost in the MoF score. The marginal
improvement of AEMLP over MLP is due to the fact that
the MLP structure with only FC layers is not well-suited
for feature reconstruction.</p>
        <p>Comparison between K-means and two-step
clustering. Considering the performance of the five
embeddings with the two clustering methods, we see that
K-means leads to higher scores on the inferior embeddings
(Rankloss MLP, MLP and AEMLP) trained on
individual frames, while two-step clustering performs better on
sequence-to-sequence learning-based embeddings (TCN
and SSTEN). When combined with the proposed SSTEN
embedding, two-step clustering outperforms K-means by
a large margin in terms of the MoF score. We also tried
applying K-means on each video separately. However,
the performance dropped significantly. K-means depends
only on the spatial distance and results in
oversegmentation, which leads to erroneous temporal order on each
video and thus, an inferior global cluster assignment.</p>
        <p>3.2. Impact of Scaling in the Spatio-temporal Similarity. We perform spectral clustering with the proposed spatio-temporal similarity. Here, we analyze the impact of the scaling factors in the spatial and temporal Gaussian kernels, i.e., σ_spat^2 and σ_tmp^2. These adjust the extent to which two frames are considered similar to each other and influence the clustering quality. The experiments are conducted for SSTEN embeddings on Breakfast.</p>
        <p>Impact of the scaling of the spatial Gaussian kernel. For local scaling, we set σ_spat^2 = σ_i σ_j, where σ_i is the distance from e_i to its m-th nearest neighbor in the feature space. The resulting segmentation performance w.r.t. m is shown in Fig. 3. With m varying in the range of 3 to 20, the IoU and F1 scores remain stable. There is a range of m ∈ {8, 9} where the best MoF scores are achieved, whereas for other scaling parameters the MoF score drops. Thus, we set m = 9 for all following evaluations.</p>
        <p>For comparison, we also set σ_spat to fixed values (without local scaling) and report the segmentation performance in Table 2. We achieve good results at smaller σ_spat values (0.5 and 0.7). However, with increasing σ_spat the MoF score drops significantly, while there are only minor fluctuations in IoU and F1. Apparently, σ_spat has a large impact on the clustering quality. The local scaling eases the effort of tuning σ_spat by dynamically determining the scaling factor.</p>
        <p>Impact of the scaling of the temporal Gaussian kernel. The temporal Gaussian kernel operates on the temporal distance between frames in a video. With σ_tmp^2 = 2σ'^2, the term exp(−(τ_i − τ_j)^2 / (2σ'^2)) is in the standard form of a Gaussian kernel. We set σ' = 1/6 so that the 6σ' range of the temporal Gaussian is equal to the video length (since the length of each video is normalized to 1 for the relative timestamp prediction). The segmentation performance with respect to σ' is shown in Table 3. Apparently, σ' = 1/6 leads to the best result. Here, we also evaluate the case without the temporal Gaussian kernel, which leads to a drop in performance. The impact of the temporal Gaussian kernel on the similarity matrices of SSTEN embeddings can also be seen by comparing the top and bottom rows of Fig. 6 in the main manuscript. For example, by adding the temporal Gaussian kernel, we decrease the similarities in Fig. 6(a1) according to the temporal distance between two frames, which leads to a clearer diagonal block structure in Fig. 6(a2). Thus, we set σ' = 1/6 for all following evaluations.</p>
        <p>Table 3: Segmentation performance of two-step clustering on SSTEN embeddings (λ = 0.002) with respect to the temporal scaling factor σ' (σ_tmp^2 = 2σ'^2) on Breakfast (in %).
σ'                    MoF   IoU   F1
∞ (w/o tmp. Gauss)    41.5  16.5  30.6
1/3                   43.5  16.9  31.3
1/6                   50.3  19.0  33.6
1/12                  44.3  18.5  34.1</p>
        <p>3.3. Impact of Cluster Order. One merit of performing within-video clustering is that we can derive the temporal order of sub-clusters for each video separately. The video-wise individual order of clusters is used to guide the Viterbi decoding, which breaks the common assumption that clusters follow the same temporal order in all the videos. In the following, we verify the efficacy of the derived video-wise order of clusters. We use the same within-video clustering result with global cluster assignment and perform Viterbi decoding using two different temporal cluster orders: (1) video-wise order: the temporal order of sub-clusters is determined on each video separately; and (2) uniform order: the uniform order is determined by sorting the average timestamps of the global clusters and is then applied to all the videos. Table 4 reports the segmentation performance (after Viterbi) with these two orders for our SSTEN embeddings on Breakfast and YTI. To measure the correctness of the predicted segment order, we adopt the segmental edit distance (Edit), which is a common metric for supervised action segmentation, e.g., [8, 9, 10, 11]. It penalizes segmentation results that have a different segment order than the ground truth (i.e., it penalizes out-of-order predictions, as well as oversegmentation).</p>
        <p>Table 4: Impact of the cluster order for two-step clustering on SSTEN embeddings (in %).
Order        Breakfast: MoF  IoU   F1    Edit     YTI: MoF  IoU   F1    Edit
video-wise   50.3  19.0  33.6  42.3              46.6  10.7  29.5  25.5
uniform      53.5  15.7  32.2  33.0              40.7   7.7  25.1  20.3</p>
        <p>From Table 4 we see that our video-wise order clearly outperforms the uniform order, except for MoF on Breakfast. Furthermore, the edit score verifies that our derived video-wise temporal orders are valid.</p>
        <p>In our experiments we especially notice that the MoF and IoU scores can act contradictory to each other, e.g., the uniform order results in higher MoF scores (on Breakfast) at the cost of lower IoU scores. MoF tends to overfit on dominant classes (e.g., classes with longer action instances), while IoU is sensitive to underrepresented classes and penalizes segmentation results with dominating segments. Therefore, it is necessary to consider all metrics for evaluation, as a higher MoF score does not always correspond to better performance in practice.</p>
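        <p>The segmental edit score used in Table 4 can be sketched as a normalized Levenshtein distance over the per-video sequences of segment labels; the exact normalization below is one common convention and is assumed here for illustration.</p>
        <preformat>
import numpy as np

def edit_score(gt, pred):
    """Segmental edit score: collapse frame labels into segment label
    sequences and compute a normalized Levenshtein distance (higher is better)."""
    def segments(labels):
        return [labels[i] for i in range(len(labels))
                if i == 0 or labels[i] != labels[i - 1]]

    a, b = segments(list(gt)), segments(list(pred))
    D = np.zeros((len(a) + 1, len(b) + 1))
    D[:, 0] = np.arange(len(a) + 1)
    D[0, :] = np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            D[i, j] = min(D[i - 1, j] + 1, D[i, j - 1] + 1, D[i - 1, j - 1] + cost)
    return 100.0 * (1.0 - D[len(a), len(b)] / max(len(a), len(b), 1))
        </preformat>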
        <p>3.4. Impact of Decoding Strategies. We compare our approach, which uses Viterbi decoding, with the Mallow model decoding that has been proposed in [7]. The authors propose a Rankloss embedding over all video frames from the same activity with respect to a pseudo ground truth action annotation. The embedded frames of the whole activity set are then clustered, and the likelihood for each frame and each cluster is computed. For the decoding, the authors build a histogram of features with respect to their clusters with a hard assignment and set the length of each action with respect to the overall amount of features per bin. After that, they apply a Mallow model to sample different orderings for each video with respect to the sampled distribution. The resulting model is a combination of Mallow model sampling and action length estimation based on the frame distribution.</p>
        <p>For this experiment, we evaluate the impact of the different decoding strategies on two embeddings: the Rankloss embedding [7] and our SSTEN embedding. Table 5 reports the results of these two embeddings in combination with three decodings: the Mallow model, Viterbi decoding with K-means and Viterbi decoding with two-step clustering.</p>
        <p>Table 5: Comparison of combinations of embeddings and decoding strategies on Breakfast (in %).
Decoding            Rankloss [7] embed.: MoF  IoU   F1      SSTEN embed.: MoF  IoU   F1
Mallow [7]          34.2  17.6  31.8                        36.4  18.1  31.5
kmeans+Viterbi      35.7  15.8  28.4                        39.3  17.8  31.9
two-step+Viterbi    34.7  13.4  23.7                        50.3  19.0  33.6</p>
        <p>Following [7], we run 7 iterations for the Rankloss embedding with the Mallow model. In each iteration, the Rankloss embedding is retrained using the segmentation result from the last iteration as pseudo label, and the frame-wise likelihoods and the Mallow model are updated.</p>
        <p>Unlike the Mallow model, our Viterbi decoding is a one-iteration procedure. It operates on the embedding, which is trained only once. When combining with Viterbi, we train the Rankloss model only once, using the initialized uniform segmentation as a prior. For SSTEN with the Mallow model, we only run one iteration, as we do not need to train SSTEN with pseudo labels iteratively.</p>
        <p>Considering the Rankloss results in Table 5, we see that combining it with the Mallow model achieves its highest IoU and F1 scores. This is because, for Viterbi decoding, the Rankloss model trained only once using the uniform initialization as pseudo label lacks a strong temporal prior. Considering SSTEN, our Viterbi decoding with two-step clustering clearly outperforms the Mallow model. With Mallow, the SSTEN embedding has competitive IoU and F1 scores but significantly lower MoF. We also tried running the Mallow model on SSTEN embedded features for multiple iterations. However, this resulted in a reduced number of clusters. Thus, we see that an appropriate combination of embedding and decoding strategy is necessary.</p>
        <p>To have a closer look into the Viterbi decoding, we visualize the likelihood grids computed from the global clusters, as well as the resulting decoding paths over time, for two videos on Breakfast in Fig. 4. It shows that the decoding, which generates a full sequence of actions, is able to marginalize actions that do not occur in the video by assigning only very few frames to them, while the majority of frames are assigned to the clusters that do occur in the video. Even if the given temporal order constrains the resulting K coherent segments to follow the fixed temporal order, the segments that actually do not belong in the sequence will be marginalized, because the Viterbi algorithm decodes a path that maximizes the posterior probability. Overall, it turns out that Viterbi decoding constrained by a temporal order performs better than the Mallow model's iterative re-ordering.</p>
        <p>Figure 4: Viterbi decoding paths on the likelihood grids (prediction vs. ground truth over the frame axis) for a making juice and a making fried egg video from Breakfast.</p>
        <p>3.5. Quantitative Segmentation Results</p>
        <p>3.5.1. Results of Clustering and Final Segmentation. In order to show the advantage of two-step clustering over K-means when combined with the proposed SSTEN embedding, we report both the results of the clustering (before Viterbi decoding) and the final segmentation performance (after Viterbi decoding) on Breakfast in Table 6. We see that the proposed two-step clustering leads to superior performance compared to K-means, in terms of both the clustering (before Viterbi decoding) in most metrics, and the final segmentation (after Viterbi decoding).</p>
        <p>Table 6: Comparison of combinations of SSTEN and different clustering methods in terms of clustering and final segmentation after Viterbi decoding on Breakfast (in %).
Embedding   Clustering        Clustering results: MoF  IoU   F1      Final results: MoF  IoU   F1
SSTEN       K-means           27.2  13.5  26.3                       39.3  17.8  31.9
SSTEN       two-step cluster  38.6  13.7  25.9                       50.3  19.0  33.6</p>
        <p>3.5.2. Segmentation Results on Each Activity. We report the ground truth number of classes and the segmentation performance of MLP with K-means (MLP+kmeans, our reimplementation of [12]) and TAEC for each activity on Breakfast (Table 7), YTI (Table 8) and 50 Salads (Table 9). The evaluation is done with global Hungarian matching on all videos. The number of clusters is set to the maximum number of ground truth classes for each activity (K = max.#gt).</p>
        <p>Table 7: Per-activity segmentation results of MLP+kmeans and TAEC on Breakfast (coffee, cereals, tea, milk, juice, sandwich, scrambled egg, fried egg, salad, pancake, and All), together with the number of ground truth classes per activity.</p>
        <p>3.6. Qualitative Segmentation Results. We show the qualitative results of clustering and final segmentation on 7 composite activities: making cereals (Fig. 5), making milk (Fig. 6), making juice (Fig. 7), making fried egg (Fig. 8) and making pancake (Fig. 9) on Breakfast, changing tire (Fig. 10) on YTI, and making salad (Fig. 11) on 50 Salads (eval, 12 classes). The mapping between cluster labels and ground truth classes is done with global Hungarian matching on all videos. The number of clusters is set to the maximum number of ground truth classes for each activity (K = max.#gt).</p>
        <p>For each activity, we visualize the results of 10 videos.</p>
        <p>For each video, the 3-row group displays the ground truth (1st row), TAEC (2nd row) and MLP+kmeans [12] (3rd row).</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          constraints, in: CVPR,
          <year>2018</year>
          . [1]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>She</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Smolic</surname>
          </string-name>
          , Action-net: Multipath [20]
          <string-name>
            <given-names>F.</given-names>
            <surname>Sener</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <article-title>Unsupervised learning</article-title>
          and segmen-
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <article-title>excitation for action recognition</article-title>
          ,
          <source>in: CVPR</source>
          ,
          <year>2021</year>
          .
          <article-title>tation of complex activities from video</article-title>
          , in: CVPR, [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Feichtenhofer</surname>
          </string-name>
          , X3d: Expanding architectures for
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <article-title>eficient video recognition</article-title>
          ,
          <source>in: CVPR</source>
          ,
          <year>2020</year>
          . [21]
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>