<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LoGo: Local-Global Context Modeling and Cross-Level Regression for Temporal Action Localization</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Li Xinxin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yang Zhe</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Computer and Software, Chengdu Jincheng College</institution>
          ,
          <addr-line>Chengdu, 611731</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Tangshan Research Institute, Southwest Jiaotong University</institution>
          ,
          <addr-line>Tangshan 063000</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>The explosive growth of user-generated videos has driven the demand for automated video understanding in applications such as retrieval, surveillance, and human-computer interaction. Temporal Action Localization (TAL), a critical task in this domain, aims to identify temporal boundaries and categories of actions in untrimmed videos. However, existing methods struggle with challenges including large variations in action duration, ambiguous boundaries, and strong background noise. This paper proposes LoGo, an end-to-end framework that unifies Local-Global Context Modeling and a Cross-Level Feature Fusion Regression Head (CLFF-Head) to significantly improve localization accuracy. The key innovations include: 1) The LoGo Block, which integrates depthwise separable convolutions for local structural modeling with channel attention mechanisms for global semantic awareness, achieving balanced local-global dependency learning through residual fusion; 2) The CLFF-Head, which enhances boundary regression stability and accuracy via adaptive multi-scale feature fusion. Extensive experiments on THUMOS14 and ActivityNet-1.3 demonstrate that LoGo outperforms state-of-the-art methods, achieving new SOTA performance on THUMOS14 and competitive results on ActivityNet-1.3, validating its effectiveness and generalizability.</p>
      </abstract>
      <kwd-group>
        <kwd>Temporal Action Localization</kwd>
        <kwd>Video Understanding</kwd>
        <kwd>Local-Global Context Modeling</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In recent years, the explosive growth of user-generated videos on internet platforms has fueled the
demand for automatic video understanding in applications such as retrieval, surveillance, and
human-computer interaction. Temporal Action Localization (TAL), a crucial task in this domain, aims to identify
the temporal boundaries and categories of actions in untrimmed videos. Despite recent advances, TAL
remains challenging due to large variations in action duration, ambiguous boundaries, and strong
background noise.</p>
      <p>
        To tackle these challenges, many methods have been proposed using convolutional [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ], recurrent [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
or graph-based networks[
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6</xref>
        ] to model temporal dependencies. However, the uneven distribution of
action durations—ranging from brief gestures to prolonged activities—demands simultaneous modeling
of both short- and long-term dependencies. CNNs offer strong local modeling but struggle with
long-range context, while Transformers [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ] capture global semantics effectively but lack sensitivity to
local details, which are critical for precise boundary detection.
      </p>
      <p>
        Accurate boundary localization remains a bottleneck. While [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] introduces an efficient regression
head, its expressiveness is limited in complex scenes. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] improves regression with a stronger head, yet
its limited cross-level feature fusion results in suboptimal localization under challenging backgrounds.
      </p>
      <p>To address these issues, we propose LoGo, an end-to-end TAL framework that unifies Local-Global
Context modeling with a Cross-Level Feature Fusion Regression Head (CLFF-Head). The LoGo Block
integrates depthwise separable convolutions for local structure modeling with channel attention for
global context, connected via a residual fusion mechanism to balance precision and long-range awareness.
Furthermore, the CLFF-Head adaptively fuses multi-scale semantic features from the feature pyramid,
significantly enhancing boundary regression stability and accuracy without compromising efficiency.</p>
      <p>The main contributions of this paper are summarized as follows:
• We introduce the LoGo Block module, which combines the local modeling capability based on
depthwise separable convolutions with the global modeling capability driven by channel attention
mechanisms, achieving efficient integration through a residual structure. This module effectively
captures both local structural details and long-range contextual dependencies.
• We design the Cross-Level Feature Fusion Regression Head (CLFF-Head), which introduces a
multi-scale feature adaptive fusion mechanism, significantly improving the stability and accuracy
of boundary localization.</p>
      <p>Section 2 outlines relevant prior work. Section 3 details the proposed methodology. Section 4 presents
and discusses the results of our experiments. Lastly, Section 5 summarizes our main conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Temporal Action Localization (TAL). In TAL, two-stage and
single-stage methods are employed to detect actions in videos. Two-stage methods involve generating action
proposals and classifying them, which can be achieved through anchor windows[
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ], action boundary
localization[
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ], graph representation[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], or Transformers[
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. On the other hand, single-stage
TAL performs both proposal generation and classification in a single pass, without a separate proposal
generation step. Pioneering work[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] in this field developed anchor-based single-stage TAL using
convolutional networks, inspired by single-stage object detectors[
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Additionally, there have been
anchor-free single-stage models proposed[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], incorporating a saliency-based refinement module.
Object detection Object detection is closely related to Temporal Action Localization (TAL), with
both tasks sharing similar challenges. General Focal Loss [17] enhances bounding box regression
by transforming it from learning a Dirac delta distribution to a more general distribution function.
Several methods [18, 19, 20] leverage Depthwise Convolution to model network structures, while certain
branched designs [21, 22] have demonstrated strong generalization capabilities. These approaches offer
valuable insights for designing the architecture of TAL systems.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>Temporal Action Localization. Given an input video, we assume it can be represented by a
collection of feature vectors X = {x_1, x_2, …, x_T} defined at discrete time steps t = {1, 2, …, T},
where the total duration T varies across different videos. For instance, x_t could represent the
feature vector of a video clip extracted by a 3D convolutional network at time t. The objective of
temporal action localization (TAL) is to predict the action label set Y = {y_1, y_2, …, y_N} based on
the input video sequence X. Here, Y comprises N action instances y_i, and the number of instances N
can vary across videos. Each instance y_i = (s_i, e_i, a_i) is characterized by its starting time s_i
(onset), ending time e_i (offset), and the corresponding action label a_i. The starting time s_i and
ending time e_i both lie in the range [1, T], the action label a_i belongs to the set of pre-defined
categories {1, …, C}, where C is the total number of categories, and it is required that s_i &lt; e_i
for each instance. TAL is therefore a challenging structured output prediction problem.</p>
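      <p>The structured output described above can be made concrete with a small sketch; the class and
function names below are illustrative, not part of the paper:</p>

```python
from dataclasses import dataclass

@dataclass
class ActionInstance:
    """One ground-truth action instance y_i = (s_i, e_i, a_i)."""
    start: int  # onset s_i, in [1, T]
    end: int    # offset e_i, in [1, T]
    label: int  # category a_i, in {1, ..., C}

def validate(instances, T, C):
    """Check the structural constraints stated in the formulation."""
    for y in instances:
        assert T >= y.start >= 1 and T >= y.end >= 1
        assert y.end > y.start          # the onset must precede the offset
        assert C >= y.label >= 1
    return True

# A hypothetical video of T=100 clips with N=2 annotated actions.
gt = [ActionInstance(3, 17, 5), ActionInstance(40, 88, 2)]
ok = validate(gt, T=100, C=20)
```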
      <sec id="sec-3-1">
        <title>3.1. Method Overview</title>
        <p>The overall architecture of LoGo is shown in Fig. 1. Our model consists of three parts: a video feature
extractor, a Multi-scale LoGo Encoder, and two sub-task heads. Concretely, for a given video clip, we
extract video features using a pre-trained 3D-CNN model. Then, the extracted features are passed
through the Multi-scale LoGo Encoder, which performs downsampling operations to better represent
features at diferent temporal scales. Finally, the pyramid features produced by the Multi-scale LoGo
Encoder are processed by two task-specific heads to generate the final predictions. In the following, we
will describe the details of our model.</p>
        <p>[Figure 1: Overall architecture of LoGo. Input features pass through stacked LoGo Blocks with
Layer Norm, Group Norm, and MLP sub-layers, interleaved with downsampling, to produce the pyramid
features Z^1, …, Z^l.]</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Multi-scale LoGo Encoder</title>
        <p>The input feature X is first encoded into a multi-scale temporal feature pyramid Z = {Z^1, Z^2, …, Z^L}
by the Multi-scale LoGo Encoder. The encoder simply contains two 1D convolutional layers
as feature projection layers, followed by L − 1 Local-Global Context Modeling (LoGo) blocks to produce
the feature pyramid Z.</p>
        <p>First, the input features Z^{l−1} from the previous layer of the pyramid are passed through a Layer
Normalization (LN) operation to stabilize the feature distribution. These normalized features are then
fed into the LoGo block, which jointly captures local and global temporal information. Through a
residual connection, the output of the LoGo block is added to the original input features, resulting in
the new feature representation Z̄^{l−1}.</p>
        <p>Next, the feature Z̄^{l−1} is processed through a Group Normalization (GN) operation to further enhance
the training stability of the model. Then, a Multi-Layer Perceptron (MLP) is applied to perform a nonlinear
transformation, yielding the feature Ẑ^{l−1}. Again, through a residual connection, the output of the MLP
is added to Z̄^{l−1}, producing the updated feature representation.</p>
        <p>Finally, the processed feature Ẑ^{l−1} undergoes a downsampling operation, implemented via a 1D
max-pooling with a window size of 3 and a stride of 2, reducing the temporal dimension of the features
and passing them to the next layer of the pyramid.</p>
        <p>These steps correspond to the following formulas:
Z̄^{l−1} = LoGo(LN(Z^{l−1})) + Z^{l−1},   (1)
Ẑ^{l−1} = MLP(GN(Z̄^{l−1})) + Z̄^{l−1},   (2)
Z^l = Downsample(Ẑ^{l−1}),   (3)
where l ∈ [1, L], LN is the LayerNorm operation, GN is the GroupNorm operation, and Downsample is
implemented by a 1D max-pooling with a window size of 3 and a stride of 2.</p>
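        <p>The downsampling step can be sketched as follows; this is a minimal NumPy sketch, and the exact
padding convention (here, symmetric padding with −inf so that the output length is roughly half the
input) is an assumption:</p>

```python
import numpy as np

def downsample_maxpool(z, window=3, stride=2):
    """1D max-pooling over the temporal axis (z: [C, T]), as used between
    pyramid levels. Pads with -inf so T_out = ceil(T / stride)."""
    C, T = z.shape
    pad = window // 2
    zp = np.pad(z, ((0, 0), (pad, pad)), constant_values=-np.inf)
    out_T = (T + stride - 1) // stride
    out = np.empty((C, out_T))
    for i in range(out_T):
        s = i * stride
        out[:, i] = zp[:, s:s + window].max(axis=1)   # max over each window
    return out

z = np.arange(16, dtype=float).reshape(1, 16)   # one channel, T = 16
z_down = downsample_maxpool(z)                  # temporal dimension halved to 8
```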
        <p>Finally, the encoded feature pyramid is constructed by combining the outputs of all the LoGo blocks
as Z = {Z^1, Z^2, …, Z^L}.</p>
        <p>LoGo Block. To simultaneously capture local structural details and global semantics in action
instances, this study proposes the LoGo Block, which integrates depthwise separable convolution
(for local modeling) and channel attention mechanisms (for global modeling) to achieve efficient fusion
of multi-scale temporal features. Specifically, the LoGo Block first applies LayerNorm normalization to
stabilize the feature distribution and improve training effectiveness. It then employs depthwise separable
convolution to model local patterns, effectively capturing short-term dependencies and fine-grained
variations in temporal sequences.</p>
        <p>Global modeling is implemented through two pathways: Pathway 1 generates channel attention
weights (global modulation factor) via global average pooling followed by a fully connected layer, while
Pathway 2 extracts salient features through max pooling and multiplies them with linearly transformed
features for dynamic channel weighting. These pathways respectively focus on global context and
salient features to enhance multi-scale action characteristics.</p>
        <p>The final output combines three components: local features multiplied by the global modulation
factor, dynamically weighted salient features, and the original input through residual connections. This
design strengthens feature representation capabilities while facilitating stable gradient propagation.</p>
        <p>[Figure 2: Structure of the LoGo Block. The input x feeds a depthwise convolution branch, an
AvgPool → FC + ReLU channel-attention branch, and a MaxPool → FC + ReLU salient-feature branch; the
branch outputs are multiplied, summed, and combined with the input through a residual connection.]</p>
        <p>Overall, the LoGo Block adopts a lightweight structure to enable multi-level modeling of temporal
action information, significantly improving the model’s ability to recognize action boundaries and
understand semantics in complex backgrounds.</p>
        <p>Mathematically, the LoGo Block can be written as:
y = DWConv(x) ⊙ ϕ(x) + FC(x) ⊙ φ(x) + x,   (4)
ϕ(x) = ReLU(FC(AvgPool(x))),   (5)
φ(x) = ReLU(FC(MaxPool(x))),   (6)
where FC denotes a fully-connected layer, DWConv denotes the 1-D depthwise convolution over the
temporal dimension, and ⊙ denotes channel-wise multiplication. AvgPool(x) is the average pooling of
all features over the temporal dimension and MaxPool(x) is the corresponding max pooling.</p>
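        <p>A minimal NumPy sketch of this forward pass follows; the shapes, random weights, and kernel size
are illustrative assumptions, not the paper's exact configuration:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda a: np.maximum(a, 0.0)

def logo_block(x, w_dw, W_fc, W_avg, W_max):
    """Sketch of the LoGo Block forward pass (x: [C, T])."""
    C, T = x.shape
    # Local pathway: 1-D depthwise convolution over time ('same' padding).
    k = w_dw.shape[1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    # np.convolve flips its kernel, so flip first to compute correlation.
    local = np.stack([np.convolve(xp[c], w_dw[c][::-1], mode="valid")
                      for c in range(C)])                     # [C, T]
    # Global pathway 1: channel gate from temporal average pooling.
    gate_avg = relu(W_avg @ x.mean(axis=1, keepdims=True))    # [C, 1]
    # Global pathway 2: salient-feature gate from temporal max pooling.
    gate_max = relu(W_max @ x.max(axis=1, keepdims=True))     # [C, 1]
    # Fuse: modulated local features + gated linear features + residual input.
    return local * gate_avg + (W_fc @ x) * gate_max + x

C, T, k = 4, 12, 3
x = rng.standard_normal((C, T))
out = logo_block(x,
                 rng.standard_normal((C, k)),   # depthwise kernels
                 rng.standard_normal((C, C)),   # FC on the main branch
                 rng.standard_normal((C, C)),   # FC after AvgPool
                 rng.standard_normal((C, C)))   # FC after MaxPool
```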
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Temporal Action Localization Decoder</title>
        <p>Next, our model uses a decoder to decode the feature pyramid Z = {Z^1, Z^2, …, Z^L} extracted by the
Multi-scale LoGo Encoder into sequence labels Ŷ = {ŷ_1, ŷ_2, …, ŷ_N}. The decoder of this model
consists of a classification head and a cross-level feature fusion regression head, both of which are
lightweight convolutional networks.</p>
        <p>Classification Head. Based on the feature pyramid Z, the task of the classification head is to predict
the action category probabilities p(c_t) for each moment t at different levels of the pyramid. The
classification head adopts a simple and efficient 1D convolutional network, with shared
parameters across different levels to reduce model complexity. Specifically, the network consists of three
1D convolutional layers with a kernel size of 3. ReLU activation and LayerNorm normalization are
applied in the first two layers to enhance training stability. Finally, after a Sigmoid mapping,
the output is the probability distribution over action categories.</p>
        <p>Cross-Level Feature Fusion Regression Head. Inspired by the Trident-head structure in TriDet, the
regression head in this paper emphasizes the role of relative boundary feature information at different
levels of the feature pyramid in action localization. However, the Trident-head relies solely on a single
feature layer for boundary estimation, which may limit its boundary localization performance. To
address this, we propose the Cross-Level Feature Fusion Regression Head (CLFF-Head). This method
not only utilizes the features of the current pyramid layer when predicting boundary offsets but also
incorporates the feature information from the previous layer, thus enhancing the complementarity of
features at different scales and improving the stability and accuracy of boundary regression.</p>
        <p>[Figure 3: Cross-Level Feature Fusion. The previous-level feature Z^{l−1} is downsampled and
scaled by w2, the current-level feature Z^l is scaled by w1, and the two are summed to form the fused
feature Ẑ^l.]</p>
        <p>
          Given the feature sequence Z^l ∈ ℝ^{T×C} output from the feature pyramid, we first obtain three feature
sequences from three branches: f_s ∈ ℝ^T, f_e ∈ ℝ^T and f_m ∈ ℝ^{T×2×(B+1)}. f_s and f_e represent the response
values for the start and end boundaries of an action at each moment, respectively, and both are
obtained through 1D convolutions. For f_m ∈ ℝ^{T×2×(B+1)}, during its derivation, we not only utilize the
features of the current layer but also incorporate the feature information from the previous pyramid
layer. This cross-level feature fusion allows the model to leverage richer contextual information,
enhancing the robustness of boundary prediction. In the feature fusion process, we introduce two
learnable parameters, w1 and w2, to ensure that the fusion weights can be dynamically adjusted. This
process is illustrated in Fig. 3, where the blue blocks denote the features processed by the LoGo
Encoder. This mechanism enables the model to adaptively select the appropriate feature proportion,
optimizing boundary prediction performance under different tasks and data distributions. For more
details about the Trident-head decoder, readers can refer to [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. All heads are modeled using a
three-layer convolutional neural network, with shared parameters across all feature pyramid layers to reduce
the number of parameters.
        </p>
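        <p>The fusion step can be sketched as follows; using the same strided max-pooling as the encoder's
downsample operation and plain scalar weights w1, w2 are assumptions of this sketch:</p>

```python
import numpy as np

def maxpool1d(z, window=3, stride=2):
    """Strided temporal max-pool used here as the downsample op (z: [C, T])."""
    C, T = z.shape
    pad = window // 2
    zp = np.pad(z, ((0, 0), (pad, pad)), constant_values=-np.inf)
    out_T = (T + stride - 1) // stride
    return np.stack([zp[:, i * stride:i * stride + window].max(axis=1)
                     for i in range(out_T)], axis=1)

def fuse_levels(z_prev, z_cur, w1, w2):
    """Cross-level fusion: bring the previous (finer) level down to the
    current resolution, then blend with the learnable scalars w1 and w2."""
    return w1 * z_cur + w2 * maxpool1d(z_prev)

z_prev = np.ones((8, 16))        # level l-1: [C, T]
z_cur = np.full((8, 8), 2.0)     # level l:   [C, T/2]
fused = fuse_levels(z_prev, z_cur, w1=0.7, w2=0.3)
```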
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Loss Function</title>
        <p>The loss function plays a crucial role in model training by providing an optimization objective that
measures the error between predictions and ground truth, thereby evaluating model performance.
A well-designed loss function not only impacts learning effectiveness but also directly influences
convergence speed. Thus, the selection and optimization of the loss function are critical aspects of
model training that cannot be overlooked.</p>
        <p>For training, our model adopts a hybrid optimization strategy combining classification loss and
regression loss to improve classification accuracy and bounding box localization precision.</p>
        <p>Classification Loss: We employ Focal Loss, a loss function specifically designed to address class
imbalance. By introducing a modulating factor, Focal Loss assigns higher weights to hard-to-classify
samples, enhancing the model’s focus on challenging categories during training and improving overall
classification performance.</p>
        <p>Regression Loss: We utilize GIoU Loss (Generalized IoU). Compared to traditional IoU, GIoU not only
considers the overlapping area between predicted and ground-truth bounding boxes but also accounts
for differences in their minimum enclosing rectangles. This improvement addresses the limitations
of IoU when bounding boxes are not fully aligned, enabling more precise positional optimization and
enhancing temporal localization accuracy for action regions.</p>
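        <p>For temporal segments, GIoU reduces to a 1-D computation over (start, end) pairs. The following
is a minimal NumPy sketch of this loss; the batching and epsilon handling are assumptions:</p>

```python
import numpy as np

def giou_loss_1d(pred, target, eps=1e-8):
    """GIoU loss for temporal segments (1-D 'boxes').
    pred, target: arrays of shape [N, 2] holding (start, end)."""
    s1, e1 = pred[:, 0], pred[:, 1]
    s2, e2 = target[:, 0], target[:, 1]
    inter = np.clip(np.minimum(e1, e2) - np.maximum(s1, s2), 0, None)
    union = (e1 - s1) + (e2 - s2) - inter
    iou = inter / (union + eps)
    # Smallest segment enclosing both: the GIoU penalty term.
    enclose = np.maximum(e1, e2) - np.minimum(s1, s2)
    giou = iou - (enclose - union) / (enclose + eps)
    return 1.0 - giou    # per-segment loss; 0 for a perfect match

pred = np.array([[2.0, 6.0], [0.0, 4.0]])
gt = np.array([[2.0, 6.0], [6.0, 10.0]])
loss = giou_loss_1d(pred, gt)   # first pair matches; second is disjoint
```

Unlike plain IoU, the second (non-overlapping) pair still receives a useful gradient signal through the enclosing-segment term.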
        <p>Each layer l in the feature pyramid outputs a temporal feature Z^l ∈ ℝ^{(T/2^{l−1})×C}, which is then processed
by the classification head and the cross-level feature fusion regression head for temporal action localization.
The output of layer l at time t is denoted as ô_t = (ĉ_t, d̂_st, d̂_et). The overall loss function is formulated as:
ℒ = (1/N_pos) Σ_{t: c_t &gt; 0} (σ_IoU ℒ_cls + ℒ_reg) + (1/N_neg) Σ_{t: c_t = 0} ℒ_cls,   (7)
where N_pos and N_neg represent the numbers of positive and negative samples, respectively.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Dataset and Experimental Setup</title>
        <p>Datasets</p>
        <p>We conduct experiments on two datasets, including THUMOS14, and ActivityNet-1.3. These
datasets have been widely adopted as standard benchmarks in the temporal action localization task.</p>
        <p>THUMOS14 is a large-scale video dataset, which contains a large number of open-source videos
capturing human actions from 20 classes in real environments. Among all the videos, there are 200
(3,007 action instances) and 213 (3,358 action instances) untrimmed videos with temporal annotations in the
validation and test set, respectively. Following the common setting in THUMOS14, we use the validation
set for training and report results on the test set.</p>
        <p>ActivityNet-1.3 is another popular large-scale dataset for TAL. It includes around 20,000 videos (more
than 600 hours) with 200 action categories. The dataset has three subsets: 10,024 videos for training,
4,926 for validation, and 5,044 for testing. On average, each video comprises approximately 1.5 actions.
Following the common practice, we train our model on the training set and report the performance on
the validation set.</p>
        <p>Evaluation Metric</p>
        <p>We use the mean Average Precision (mAP) at various temporal Intersection
over Union (tIoU) thresholds to evaluate the TAL performance of different methods. For the THUMOS14
dataset, we report results at tIoU thresholds from 0.3 to 0.7 with a step of 0.1. For the ActivityNet-1.3
dataset, we report results at tIoU thresholds [0.5, 0.75, 0.95].</p>
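        <p>The tIoU criterion underlying this metric can be sketched as follows (a prediction counts as a
true positive at threshold τ when its tIoU with a matched ground-truth segment is at least τ):</p>

```python
def tiou(seg_a, seg_b):
    """Temporal IoU between two segments given as (start, end)."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0

pred, gt = (10.0, 20.0), (12.0, 22.0)
score = tiou(pred, gt)     # overlap 8 over union 12
# True-positive decision at each THUMOS14 threshold.
hits = {tau: score >= tau for tau in (0.3, 0.4, 0.5, 0.6, 0.7)}
```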
        <p>Implementation Details</p>
        <p>Our model is trained end-to-end with the AdamW [23] optimizer. The initial
learning rate is set to 10^−4 for THUMOS14 and 10^−3 for ActivityNet. We detach the gradient before
the start- and end-boundary heads and initialize the CNN weights of these two heads
with a Gaussian distribution N(0, 0.1) to stabilize the training process. The learning rate is updated
with a Cosine Annealing schedule. We train for 40 epochs on THUMOS14 and 15 epochs on ActivityNet
(including 20 and 10 warmup epochs, respectively). We conduct our experiments on a single NVIDIA A100 GPU.</p>
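        <p>For the THUMOS14 setting (base learning rate 10^−4, 20 warmup epochs out of 40), the schedule can
be sketched as follows; the linear warmup shape and a floor of zero are assumptions of this sketch:</p>

```python
import math

def lr_at(epoch, base_lr=1e-4, warmup=20, total=40, min_lr=0.0):
    """Linear warmup followed by cosine annealing to min_lr."""
    if warmup > epoch:
        return base_lr * (epoch + 1) / warmup        # ramp up to base_lr
    t = (epoch - warmup) / max(1, total - warmup)    # annealing progress in [0, 1]
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))

schedule = [lr_at(e) for e in range(40)]   # peaks at base_lr, then decays
```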
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Main Results</title>
        <p>THUMOS14</p>
        <p>
          We adopt the commonly used I3D [33] as our backbone feature and Tab. 1 presents the
results. Our method achieves an average mAP of 69.2%, outperforming all previous methods including
one-stage and two-stage methods. Notably, our method also achieves better performance than recent
Transformer-based methods [
          <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
          ], which demonstrates that a simple design can also achieve impressive
results. The performance improvement is more evident at higher tIoU thresholds (e.g., 0.6 and 0.7),
highlighting the model’s strength in accurate localization. Although our regression head is inspired
by TriDet, the overall architecture of our framework differs significantly from TriDet. Therefore, a
direct comparison may not effectively demonstrate the effectiveness of the CLFF-Head. Instead, we choose
to validate it through ablation studies under a unified framework.
        </p>
        <p>
          ActivityNet. For the ActivityNet-1.3 dataset, we adopt the TSP R(2+1)D [34] as our
backbone feature. Following previous methods [
          <xref ref-type="bibr" rid="ref16 ref7 ref8">7, 8, 16</xref>
          ], the video classification score predicted from
the UntrimmedNet is multiplied with the final detection score. Tab. 2 presents the results.
Our method achieves the highest scores at tIoU thresholds of 0.5 and 0.75, reaching 54.8% and 37.8%
respectively, on par with TransGMC. The average mAP also reaches 36.7%, matching or slightly
outperforming the current best methods. Although the performance at the strictest tIoU threshold (0.95)
is slightly lower than that of TadTR [26], our method still maintains leading overall performance, especially
under the more commonly used thresholds of 0.5 and 0.75. This suggests that our method is not only
effective for densely labeled and boundary-ambiguous videos (such as those in THUMOS14) but also
adaptable to longer videos with large action spans in more complex scenes (as in ActivityNet-1.3).
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Ablation Study</title>
        <p>
          In this section, we mainly conduct the ablation studies on the THUMOS14 dataset.
The efectiveness of LoGo block To assess the contribution of the LoGo block, we conduct a set
of ablation studies by replacing or simplifying its components. Specifically, we begin with a baseline
temporal feature pyramid adopted from [
          <xref ref-type="bibr" rid="ref16 ref8">8, 16</xref>
          ], which consists of two 1D convolutional layers and a
shortcut connection. We then progressively enhance this baseline by introducing ActionFormer (SA),
the LoGo block, and the CLFF-Head.
        </p>
        <p>
          As shown in Tab. 3, replacing the standard convolutional block with self-attention improves the
average mAP from 62.1% to 66.8%. Further adding the LoGo block yields a notable gain to 68.0%, and
finally, the full model with LoGo and CLFF-Head achieves the best performance with an average mAP
of 69.2% on THUMOS14.
        </p>
        <p>
          The effectiveness of the regression head. To verify the effectiveness of the proposed Cross-Level
Feature Fusion Regression Head, we conduct ablation studies on three types of regression heads: (1) a
lightweight regression head adopted from [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], (2) the regression head used in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], and (3) our proposed
Cross-Level Feature Fusion Regression Head. All other hyperparameters (e.g., the number of pyramid
layers) are kept identical to those used in our framework. Tab. 4 presents the results. While the
regression head from TriDet performs slightly better at high tIoU (0.7), our proposed head shows more
stable performance across thresholds and achieves the highest overall mAP (69.2%). This demonstrates
the benefit of integrating multi-level features to enhance regression robustness and precision.
        </p>
        <p>
          The effectiveness of the feature pyramid level. To investigate the impact of the number of feature
pyramid layers on model performance, we conducted experiments with different pyramid depths and
evaluated their performance under multiple tIoU thresholds. Tab. 5 presents the results. As the number
of layers increases from 3 to 6, performance steadily improves, peaking at 69.2% mAP with 6 layers.
However, further increasing to 7 layers slightly degrades performance, suggesting that while deeper
pyramids can better capture multi-scale action information, excessive depth may introduce redundancy
or training difficulties.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>This paper addresses the critical challenge of jointly optimizing local detail capture and global semantic
modeling in temporal action localization (TAL) by proposing the LoGo framework based on local-global
context modeling. The designed LoGo Block integrates the local structural modeling capability of
depthwise separable convolutions with the global semantic awareness of channel attention mechanisms,
achieving efficient fusion of multi-scale temporal features. Furthermore, the proposed Cross-Level
Feature Fusion Regression Head (CLFF-Head) significantly enhances boundary localization stability
and accuracy through adaptive fusion of multi-level semantic information from the feature pyramid.
Experiments on THUMOS14 and ActivityNet-1.3 demonstrate that the LoGo framework excels in complex
backgrounds and multi-scale action scenarios, outperforming existing methods. These results validate
the effectiveness of the local-global collaborative modeling strategy and cross-layer feature fusion
mechanism.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work is partially supported by National Natural Science Foundation of China (62376231), Sichuan
Science and Technology Program (24NSFSC1070), Fundamental Research Funds for the Central
Universities (2682025ZTPY052).</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used DeepSeek and Grammarly for grammar
and spelling checks and for paraphrasing and rewording. After using these tools, the authors reviewed and
edited the content as needed and take full responsibility for the publication's content.</p>
      <p>… feature for anchor-free temporal action localization, in: Proceedings of the IEEE/CVF conference
on computer vision and pattern recognition, 2021, pp. 3320–3329.
[17] X. Li, W. Wang, L. Wu, S. Chen, X. Hu, J. Li, J. Tang, J. Yang, Generalized focal loss: Learning
qualified and distributed bounding boxes for dense object detection, Advances in Neural Information
Processing Systems 33 (2020) 21002–21012.
[18] F. Chollet, Xception: Deep learning with depthwise separable convolutions, in: Proceedings of
the IEEE conference on computer vision and pattern recognition, 2017, pp. 1251–1258.
[19] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, H. Adam,
MobileNets: Efficient convolutional neural networks for mobile vision applications, arXiv preprint
arXiv:1704.04861 (2017).
[20] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, S. Xie, A ConvNet for the 2020s, in:
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp.
11976–11986.
[21] J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, in: Proceedings of the IEEE conference
on computer vision and pattern recognition, 2018, pp. 7132–7141.
[22] C. Szegedy, S. Ioffe, V. Vanhoucke, A. Alemi, Inception-v4, Inception-ResNet and the impact of
residual connections on learning, in: Proceedings of the AAAI conference on artificial intelligence,
volume 31, 2017.
[23] I. Loshchilov, F. Hutter, Decoupled weight decay regularization, arXiv preprint arXiv:1711.05101
(2017).
[24] Z. Zhu, W. Tang, L. Wang, N. Zheng, G. Hua, Enriching local and global contexts for temporal
action localization, in: Proceedings of the IEEE/CVF international conference on computer vision,
2021, pp. 13516–13525.
[25] D. Shi, Y. Zhong, Q. Cao, J. Zhang, L. Ma, J. Li, D. Tao, ReAct: Temporal action detection with
relational queries, in: European conference on computer vision, Springer, 2022, pp. 105–121.
[26] X. Liu, Q. Wang, Y. Hu, X. Tang, S. Zhang, S. Bai, X. Bai, End-to-end temporal action detection
with transformer, IEEE Transactions on Image Processing 31 (2022) 5427–5441.
[27] T. N. Tang, K. Kim, K. Sohn, TemporalMaxer: Maximize temporal context with only max pooling
for temporal action localization, arXiv preprint arXiv:2303.09055 (2023).
[28] J. Shao, X. Wang, R. Quan, J. Zheng, J. Yang, Y. Yang, Action sensitivity learning for temporal action
localization, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023,
pp. 13457–13469.
[29] K. Xia, L. Wang, Y. Shen, S. Zhou, G. Hua, W. Tang, Exploring action centers for temporal action
localization, IEEE Transactions on Multimedia 25 (2023) 9425–9436.
[30] J. Yang, P. Wei, Z. Ren, N. Zheng, Gated multi-scale transformer for temporal action localization,</p>
      <p>IEEE Transactions on Multimedia 26 (2023) 5705–5717.
[31] W. Wu, T. Lu, J. Wang, P. Tang, F. Gao, Temporal action detection with frequency attention
mechanism, in: 2024 7th International Conference on Mechatronics, Robotics and Automation
(ICMRA), IEEE, 2024, pp. 137–141.
[32] Z. Zhang, C. Palmero, S. Escalera, Dualh: A dual hierarchical model for temporal action localization,
in: 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG),
IEEE, 2024, pp. 1–10.
[33] J. Carreira, A. Zisserman, Quo vadis, action recognition? a new model and the kinetics dataset,
in: proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp.
6299–6308.
[34] H. Alwassel, S. Giancola, B. Ghanem, Tsp: Temporally-sensitive pretraining of video encoders for
localization tasks, in: Proceedings of the IEEE/CVF International Conference on Computer Vision,
2021, pp. 3173–3183.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Temporal action detection with structured segment networks</article-title>
          ,
          <source>in: Proceedings of the IEEE international conference on computer vision</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>2914</fpage>
          -
          <lpage>2923</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Shou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zareian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Miyazawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-F.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <article-title>Cdc: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>5734</fpage>
          -
          <lpage>5743</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Buch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Escorcia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ghanem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fei-Fei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Niebles</surname>
          </string-name>
          ,
          <article-title>End-to-end, single-stream temporal action detection in untrimmed videos</article-title>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Boundary content graph neural network for temporal action proposal generation</article-title>
          ,
          <source>in: Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XXVIII 16</source>
          , Springer,
          <year>2020</year>
          , pp.
          <fpage>121</fpage>
          -
          <lpage>137</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Rojas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Thabet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ghanem</surname>
          </string-name>
          , G-tad:
          <article-title>Sub-graph localization for temporal action detection</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>10156</fpage>
          -
          <lpage>10165</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Huang</surname>
          </string-name>
          , Acgnet:
          <article-title>Action complement graph network for weakly-supervised temporal action localization</article-title>
          ,
          <source>in: Proceedings of the AAAI conference on artificial intelligence</source>
          , volume
          <volume>36</volume>
          ,
          <year>2022</year>
          , pp.
          <fpage>3090</fpage>
          -
          <lpage>3098</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Bertasius</surname>
          </string-name>
          ,
          <article-title>Tallformer: Temporal action localization with a long-memory transformer</article-title>
          ,
          <source>in: European Conference on Computer Vision</source>
          , Springer,
          <year>2022</year>
          , pp.
          <fpage>503</fpage>
          -
          <lpage>521</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.-L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Actionformer: Localizing moments of actions with transformers</article-title>
          ,
          <source>in: European Conference on Computer Vision</source>
          , Springer,
          <year>2022</year>
          , pp.
          <fpage>492</fpage>
          -
          <lpage>510</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <article-title>Tridet: Temporal action detection with relative boundary modeling</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>18857</fpage>
          -
          <lpage>18866</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>V.</given-names>
            <surname>Escorcia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Caba Heilbron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Niebles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ghanem</surname>
          </string-name>
          ,
          <article-title>Daps: Deep action proposals for action understanding</article-title>
          ,
          <source>in: Computer Vision-ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III 14</source>
          , Springer,
          <year>2016</year>
          , pp.
          <fpage>768</fpage>
          -
          <lpage>784</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Buch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Escorcia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ghanem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Carlos</given-names>
            <surname>Niebles</surname>
          </string-name>
          ,
          <article-title>Sst: Single-stream temporal action proposals</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>2911</fpage>
          -
          <lpage>2920</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>T.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wen</surname>
          </string-name>
          , Bmn:
          <article-title>Boundary-matching network for temporal action proposal generation</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF international conference on computer vision</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>3889</fpage>
          -
          <lpage>3898</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>G.</given-names>
            <surname>Gong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mu</surname>
          </string-name>
          , Scale matters:
          <article-title>Temporal scale aggregation network for precise action localization in untrimmed videos</article-title>
          ,
          <source>in: 2020 IEEE international conference on multimedia and expo (ICME)</source>
          , IEEE,
          <year>2020</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Qing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Gan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Sang</surname>
          </string-name>
          ,
          <article-title>Temporal context aggregation network for temporal action proposal refinement</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>485</fpage>
          -
          <lpage>494</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Redmon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Divvala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          ,
          <article-title>You only look once: Unified, real-time object detection</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>779</fpage>
          -
          <lpage>788</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>C.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <article-title>Learning salient boundary feature for anchor-free temporal action localization</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>