<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LoGo: Local-Global Context Modeling and Cross-Level Regression for Temporal Action Localization</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Li Xinxin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yang Zhe</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Computer and Software, Chengdu Jincheng College</institution>
          ,
          <addr-line>Chengdu, 611731</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Tangshan Research Institute, Southwest Jiaotong University</institution>
          ,
          <addr-line>Tangshan 063000</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>The explosive growth of user-generated videos has driven the demand for automated video understanding in applications such as retrieval, surveillance, and human-computer interaction. Temporal Action Localization (TAL), a critical task in this domain, aims to identify temporal boundaries and categories of actions in untrimmed videos. However, existing methods struggle with challenges including large variations in action duration, ambiguous boundaries, and strong background noise. This paper proposes LoGo, an end-to-end framework that unifies Local-Global Context Modeling and a Cross-Level Feature Fusion Regression Head (CLFF-Head) to significantly improve localization accuracy. The key innovations include: 1) The LoGo Block, which integrates depthwise separable convolutions for local structural modeling with channel attention mechanisms for global semantic awareness, achieving balanced local-global dependency learning through residual fusion; 2) The CLFF-Head, which enhances boundary regression stability and accuracy via adaptive multi-scale feature fusion. Extensive experiments on THUMOS14 and ActivityNet-1.3 demonstrate that LoGo outperforms state-of-the-art methods, achieving new SOTA performance on THUMOS14 and competitive results on ActivityNet-1.3, validating its effectiveness and generalizability.</p>
      </abstract>
      <kwd-group>
        <kwd>Temporal Action Localization</kwd>
        <kwd>Video Understanding</kwd>
        <kwd>Local-Global Context Modeling</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In recent years, the explosive growth of user-generated videos on internet platforms has fueled the
demand for automatic video understanding in applications such as retrieval, surveillance, and
human-computer interaction. Temporal Action Localization (TAL), a crucial task in this domain, aims to identify
the temporal boundaries and categories of actions in untrimmed videos. Despite recent advances, TAL
remains challenging due to large variations in action duration, ambiguous boundaries, and strong
background noise.</p>
      <p>
        To tackle these challenges, many methods have been proposed using convolutional [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ], recurrent [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
or graph-based networks[
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6</xref>
        ] to model temporal dependencies. However, the uneven distribution of
action durations—ranging from brief gestures to prolonged activities—demands simultaneous modeling
of both short- and long-term dependencies. CNNs offer strong local modeling but struggle with
long-range context, while Transformers [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ] capture global semantics effectively but lack sensitivity to
local details, which are critical for precise boundary detection.
      </p>
      <p>
        Accurate boundary localization remains a bottleneck. While [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] introduces an efficient regression
head, its expressiveness is limited in complex scenes. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] improves regression with a stronger head, yet
its limited cross-level feature fusion results in suboptimal localization under challenging backgrounds.
      </p>
      <p>To address these issues, we propose LoGo, an end-to-end TAL framework that unifies Local-Global
Context modeling with a Cross-Level Feature Fusion Regression Head (CLFF-Head). The LoGo Block
integrates depthwise separable convolutions for local structure modeling with channel attention for
global context, connected via a residual fusion mechanism to balance precision and long-range awareness.
Furthermore, the CLFF-Head adaptively fuses multi-scale semantic features from the feature pyramid,
significantly enhancing boundary regression stability and accuracy without compromising efficiency.</p>
      <p>The main contributions of this paper are summarized as follows:
• We introduce the LoGo Block module, which combines the local modeling capability based on
depthwise separable convolutions with the global modeling capability driven by channel attention
mechanisms, achieving efficient integration through a residual structure. This module effectively
captures both local structural details and long-range contextual dependencies.
• We design the Cross-Level Feature Fusion Regression Head (CLFF-Head), which introduces a
multi-scale feature adaptive fusion mechanism, significantly improving the stability and accuracy
of boundary localization.</p>
      <p>Section 2 outlines relevant prior work. Section 3 details the proposed methodology. Section 4 presents
and discusses the results of our experiments. Lastly, Section 5 summarizes our main conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Temporal Action Localization (TAL). In TAL, two-stage and
single-stage methods are employed to detect actions in videos. Two-stage methods involve generating action
proposals and classifying them, which can be achieved through anchor windows[
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ], action boundary
localization[
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ], graph representation[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], or Transformers[
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. On the other hand, single-stage
TAL performs both proposal generation and classification in a single pass, without a separate proposal
generation step. Pioneering work[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] in this field developed anchor-based single-stage TAL using
convolutional networks, inspired by single-stage object detectors[
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Additionally, there have been
anchor-free single-stage models proposed[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], incorporating a saliency-based refinement module.
Object detection Object detection is closely related to Temporal Action Localization (TAL), with
both tasks sharing similar challenges. General Focal Loss [17] enhances bounding box regression
by transforming it from learning a Dirac delta distribution to a more general distribution function.
Several methods [18, 19, 20] leverage Depthwise Convolution to model network structures, while certain
branched designs [21, 22] have demonstrated strong generalization capabilities. These approaches offer
valuable insights for designing the architecture of TAL systems.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>Temporal Action Localization. Given an input video, we assume it can be represented by a
collection of feature vectors X = {x_1, x_2, …, x_T} defined at discrete time steps t = {1, 2, …, T},
where the total duration T varies across different videos. For instance, x_t could represent the
feature vector of a video clip extracted by a 3D convolutional network at time t. The objective of
temporal action localization (TAL) is to predict the action label set Y = {y_1, y_2, …, y_N} based on
the input video sequence X. Here, Y comprises N action instances y_i, and the number of instances N
can vary across videos. Each instance y_i = (s_i, e_i, a_i) is characterized by its starting time s_i
(onset), ending time e_i (offset), and the corresponding action label a_i. The starting time s_i and
ending time e_i both lie in the range [1, T], the action label a_i belongs to the set of pre-defined
categories {1, …, C}, where C is the total number of categories, and it is required that s_i &lt; e_i
for each instance. TAL is therefore a challenging structured output prediction problem.</p>
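      <p>The structured output described above can be made concrete with a small sketch; the class and
function names below are illustrative, not part of the paper:</p>

```python
from dataclasses import dataclass

@dataclass
class ActionInstance:
    """One ground-truth action instance y_i = (s_i, e_i, a_i)."""
    start: int  # onset s_i, in [1, T]
    end: int    # offset e_i, in [1, T]
    label: int  # category a_i, in {1, ..., C}

def validate(instances, T, C):
    """Check the structural constraints stated in the formulation."""
    for y in instances:
        assert T >= y.start >= 1 and T >= y.end >= 1
        assert y.end > y.start          # the onset must precede the offset
        assert C >= y.label >= 1
    return True

# A hypothetical video of T=100 clips with N=2 annotated actions.
gt = [ActionInstance(3, 17, 5), ActionInstance(40, 88, 2)]
ok = validate(gt, T=100, C=20)
```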
      <sec id="sec-3-1">
        <title>3.1. Method Overview</title>
        <p>The overall architecture of LoGo is shown in Fig. 1. Our model consists of three parts: a video feature
extractor, a Multi-scale LoGo Encoder, and two sub-task heads. Concretely, for a given video clip, we
extract video features using a pre-trained 3D-CNN model. Then, the extracted features are passed
through the Multi-scale LoGo Encoder, which performs downsampling operations to better represent
features at diferent temporal scales. Finally, the pyramid features produced by the Multi-scale LoGo
Encoder are processed by two task-specific heads to generate the final predictions. In the following, we
will describe the details of our model.</p>
        <p>[Figure 1: Overall architecture of LoGo. Input features pass through stacked LoGo Blocks with
Layer Norm, Group Norm, and MLP sub-layers, interleaved with downsampling, to produce the pyramid
features Z^1, …, Z^l.]</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Multi-scale LoGo Encoder</title>
        <p>The input feature X is first encoded into a multi-scale temporal feature pyramid Z = {Z^1, Z^2, …, Z^L}
by the Multi-scale LoGo Encoder. The encoder simply contains two 1D convolutional layers
as feature projection layers, followed by L − 1 Local-Global Context Modeling (LoGo) blocks to produce
the feature pyramid Z.</p>
        <p>First, the input features Z^{l−1} from the previous layer of the pyramid are passed through a Layer
Normalization (LN) operation to stabilize the feature distribution. These normalized features are then
fed into the LoGo block, which jointly captures local and global temporal information. Through a
residual connection, the output of the LoGo block is added to the original input features, resulting in
the new feature representation Z̄^{l−1}.</p>
        <p>Next, the feature Z̄^{l−1} is processed through a Group Normalization (GN) operation to further enhance
the training stability of the model. Then, a Multi-Layer Perceptron (MLP) is applied to perform a nonlinear
transformation, yielding the feature Ẑ^{l−1}. Again, through a residual connection, the output of the MLP
is added to Z̄^{l−1}, producing the updated feature representation.</p>
        <p>Finally, the processed feature Ẑ^{l−1} undergoes a downsampling operation, implemented via a 1D
max-pooling with a window size of 3 and a stride of 2, reducing the temporal dimension of the features
and passing them to the next layer of the pyramid.</p>
        <p>These steps correspond to the following formulas:
Z̄^{l−1} = LoGo(LN(Z^{l−1})) + Z^{l−1},   (1)
Ẑ^{l−1} = MLP(GN(Z̄^{l−1})) + Z̄^{l−1},   (2)
Z^l = Downsample(Ẑ^{l−1}),   (3)
where l ∈ [1, L], LN is the LayerNorm operation, GN is the GroupNorm operation, and Downsample is
implemented by a 1D max-pooling with a window size of 3 and a stride of 2.</p>
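        <p>The downsampling step can be sketched as follows; this is a minimal NumPy sketch, and the exact
padding convention (here, symmetric padding with −inf so that the output length is roughly half the
input) is an assumption:</p>

```python
import numpy as np

def downsample_maxpool(z, window=3, stride=2):
    """1D max-pooling over the temporal axis (z: [C, T]), as used between
    pyramid levels. Pads with -inf so T_out = ceil(T / stride)."""
    C, T = z.shape
    pad = window // 2
    zp = np.pad(z, ((0, 0), (pad, pad)), constant_values=-np.inf)
    out_T = (T + stride - 1) // stride
    out = np.empty((C, out_T))
    for i in range(out_T):
        s = i * stride
        out[:, i] = zp[:, s:s + window].max(axis=1)   # max over each window
    return out

z = np.arange(16, dtype=float).reshape(1, 16)   # one channel, T = 16
z_down = downsample_maxpool(z)                  # temporal dimension halved to 8
```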
        <p>Finally, the encoded feature pyramid is constructed by combining the outputs of all the LoGo blocks
as Z = {Z^1, Z^2, …, Z^L}.</p>
        <p>LoGo Block. To simultaneously capture local structural details and global semantics in action
instances, this study proposes the LoGo Block, which integrates depthwise separable convolution
(for local modeling) and channel attention mechanisms (for global modeling) to achieve efficient fusion
of multi-scale temporal features. Specifically, the LoGo Block first applies LayerNorm normalization to
stabilize the feature distribution and improve training effectiveness. It then employs depthwise separable
convolution to model local patterns, effectively capturing short-term dependencies and fine-grained
variations in temporal sequences.</p>
        <p>Global modeling is implemented through two pathways: Pathway 1 generates channel attention
weights (global modulation factor) via global average pooling followed by a fully connected layer, while
Pathway 2 extracts salient features through max pooling and multiplies them with linearly transformed
features for dynamic channel weighting. These pathways respectively focus on global context and
salient features to enhance multi-scale action characteristics.</p>
        <p>The final output combines three components: local features multiplied by the global modulation
factor, dynamically weighted salient features, and the original input through residual connections. This
design strengthens feature representation capabilities while facilitating stable gradient propagation.</p>
        <p>[Figure 2: Structure of the LoGo Block. The input x feeds a depthwise convolution branch, an
AvgPool → FC + ReLU channel-attention branch, and a MaxPool → FC + ReLU salient-feature branch; the
branch outputs are multiplied, summed, and combined with the input through a residual connection.]</p>
        <p>Overall, the LoGo Block adopts a lightweight structure to enable multi-level modeling of temporal
action information, significantly improving the model’s ability to recognize action boundaries and
understand semantics in complex backgrounds.</p>
        <p>Mathematically, the LoGo Block can be written as:
y = DWConv(x) ⊙ ϕ(x) + FC(x) ⊙ φ(x) + x,   (4)
ϕ(x) = ReLU(FC(AvgPool(x))),   (5)
φ(x) = ReLU(FC(MaxPool(x))),   (6)
where FC denotes a fully-connected layer, DWConv denotes the 1-D depthwise convolution over the
temporal dimension, and ⊙ denotes channel-wise multiplication. AvgPool(x) is the average pooling of
all features over the temporal dimension and MaxPool(x) is the corresponding max pooling.</p>
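        <p>A minimal NumPy sketch of this forward pass follows; the shapes, random weights, and kernel size
are illustrative assumptions, not the paper's exact configuration:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda a: np.maximum(a, 0.0)

def logo_block(x, w_dw, W_fc, W_avg, W_max):
    """Sketch of the LoGo Block forward pass (x: [C, T])."""
    C, T = x.shape
    # Local pathway: 1-D depthwise convolution over time ('same' padding).
    k = w_dw.shape[1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    # np.convolve flips its kernel, so flip first to compute correlation.
    local = np.stack([np.convolve(xp[c], w_dw[c][::-1], mode="valid")
                      for c in range(C)])                     # [C, T]
    # Global pathway 1: channel gate from temporal average pooling.
    gate_avg = relu(W_avg @ x.mean(axis=1, keepdims=True))    # [C, 1]
    # Global pathway 2: salient-feature gate from temporal max pooling.
    gate_max = relu(W_max @ x.max(axis=1, keepdims=True))     # [C, 1]
    # Fuse: modulated local features + gated linear features + residual input.
    return local * gate_avg + (W_fc @ x) * gate_max + x

C, T, k = 4, 12, 3
x = rng.standard_normal((C, T))
out = logo_block(x,
                 rng.standard_normal((C, k)),   # depthwise kernels
                 rng.standard_normal((C, C)),   # FC on the main branch
                 rng.standard_normal((C, C)),   # FC after AvgPool
                 rng.standard_normal((C, C)))   # FC after MaxPool
```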
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Temporal Action Localization Decoder</title>
        <p>Next, our model uses a decoder to decode the feature pyramid Z = {Z^1, Z^2, …, Z^L} extracted by the
Multi-scale LoGo Encoder into sequence labels Ŷ = {ŷ_1, ŷ_2, …, ŷ_N}. The decoder of this model
consists of a classification head and a cross-level feature fusion regression head, both of which are
lightweight convolutional networks.</p>
        <p>Classification Head. Based on the feature pyramid Z, the task of the classification head is to predict
the action category probabilities p(c_t) for each moment t at different levels of the pyramid. The
classification head adopts a simple and efficient 1D convolutional network, with shared
parameters across different levels to reduce model complexity. Specifically, the network consists of three
1D convolutional layers with a kernel size of 3. ReLU activation and LayerNorm normalization are
applied in the first two layers to enhance training stability. Finally, after a Sigmoid mapping,
the output is the probability distribution over action categories.</p>
        <p>Cross-Level Feature Fusion Regression Head. Inspired by the Trident-head structure in TriDet, the
regression head in this paper emphasizes the role of relative boundary feature information at different
levels of the feature pyramid in action localization. However, the Trident-head relies solely on a single
feature layer for boundary estimation, which may limit its boundary localization performance. To
address this, we propose the Cross-Level Feature Fusion Regression Head (CLFF-Head). This method
not only utilizes the features of the current pyramid layer when predicting boundary offsets but also
incorporates the feature information from the previous layer, thus enhancing the complementarity of
features at different scales and improving the stability and accuracy of boundary regression.</p>
        <p>[Figure 3: Cross-Level Feature Fusion. The previous-level feature Z^{l−1} is downsampled and
scaled by w2, the current-level feature Z^l is scaled by w1, and the two are summed to form the fused
feature Ẑ^l.]</p>
        <p>
          Given the feature sequence Z^l ∈ ℝ^{T×C} output from the feature pyramid, we first obtain three feature
sequences from three branches: f_s ∈ ℝ^T, f_e ∈ ℝ^T and f_m ∈ ℝ^{T×2×(B+1)}. f_s and f_e represent the response
values for the start and end boundaries of an action at each moment, respectively, and both are
obtained through 1D convolutions. For f_m ∈ ℝ^{T×2×(B+1)}, during its derivation, we not only utilize the
features of the current layer but also incorporate the feature information from the previous pyramid
layer. This cross-level feature fusion allows the model to leverage richer contextual information,
enhancing the robustness of boundary prediction. In the feature fusion process, we introduce two
learnable parameters, w1 and w2, to ensure that the fusion weights can be dynamically adjusted. This
process is illustrated in Fig. 3, where the blue blocks denote the features processed by the LoGo
Encoder. This mechanism enables the model to adaptively select the appropriate feature proportion,
optimizing boundary prediction performance under different tasks and data distributions. For more
details about the Trident-head decoder, readers can refer to [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. All heads are modeled using a
three-layer convolutional neural network, with shared parameters across all feature pyramid layers to reduce
the number of parameters.
        </p>
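        <p>The fusion step can be sketched as follows; using the same strided max-pooling as the encoder's
downsample operation and plain scalar weights w1, w2 are assumptions of this sketch:</p>

```python
import numpy as np

def maxpool1d(z, window=3, stride=2):
    """Strided temporal max-pool used here as the downsample op (z: [C, T])."""
    C, T = z.shape
    pad = window // 2
    zp = np.pad(z, ((0, 0), (pad, pad)), constant_values=-np.inf)
    out_T = (T + stride - 1) // stride
    return np.stack([zp[:, i * stride:i * stride + window].max(axis=1)
                     for i in range(out_T)], axis=1)

def fuse_levels(z_prev, z_cur, w1, w2):
    """Cross-level fusion: bring the previous (finer) level down to the
    current resolution, then blend with the learnable scalars w1 and w2."""
    return w1 * z_cur + w2 * maxpool1d(z_prev)

z_prev = np.ones((8, 16))        # level l-1: [C, T]
z_cur = np.full((8, 8), 2.0)     # level l:   [C, T/2]
fused = fuse_levels(z_prev, z_cur, w1=0.7, w2=0.3)
```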
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Loss Function</title>
        <p>The loss function plays a crucial role in model training by providing an optimization objective that
measures the error between predictions and ground truth, thereby evaluating model performance.
A well-designed loss function not only impacts learning effectiveness but also directly influences
convergence speed. Thus, the selection and optimization of the loss function are critical aspects of
model training that cannot be overlooked.</p>
        <p>For training, our model adopts a hybrid optimization strategy combining classification loss and
regression loss to improve classification accuracy and bounding box localization precision.</p>
        <p>Classification Loss: We employ Focal Loss, a loss function specifically designed to address class
imbalance. By introducing a modulating factor, Focal Loss assigns higher weights to hard-to-classify
samples, enhancing the model’s focus on challenging categories during training and improving overall
classification performance.</p>
        <p>Regression Loss: We utilize GIoU Loss (Generalized IoU). Compared to traditional IoU, GIoU not only
considers the overlapping area between predicted and ground-truth bounding boxes but also accounts
for differences in their minimum enclosing rectangles. This improvement addresses the limitations
of IoU when bounding boxes are not fully aligned, enabling more precise positional optimization and
enhancing temporal localization accuracy for action regions.</p>
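        <p>For temporal segments, GIoU reduces to a 1-D computation over (start, end) pairs. The following
is a minimal NumPy sketch of this loss; the batching and epsilon handling are assumptions:</p>

```python
import numpy as np

def giou_loss_1d(pred, target, eps=1e-8):
    """GIoU loss for temporal segments (1-D 'boxes').
    pred, target: arrays of shape [N, 2] holding (start, end)."""
    s1, e1 = pred[:, 0], pred[:, 1]
    s2, e2 = target[:, 0], target[:, 1]
    inter = np.clip(np.minimum(e1, e2) - np.maximum(s1, s2), 0, None)
    union = (e1 - s1) + (e2 - s2) - inter
    iou = inter / (union + eps)
    # Smallest segment enclosing both: the GIoU penalty term.
    enclose = np.maximum(e1, e2) - np.minimum(s1, s2)
    giou = iou - (enclose - union) / (enclose + eps)
    return 1.0 - giou    # per-segment loss; 0 for a perfect match

pred = np.array([[2.0, 6.0], [0.0, 4.0]])
gt = np.array([[2.0, 6.0], [6.0, 10.0]])
loss = giou_loss_1d(pred, gt)   # first pair matches; second is disjoint
```

Unlike plain IoU, the second (non-overlapping) pair still receives a useful gradient signal through the enclosing-segment term.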
        <p>Each layer l in the feature pyramid outputs a temporal feature Z^l ∈ ℝ^{(T/2^{l−1})×C}, which is then processed
by the classification head and the cross-level feature fusion regression head for temporal action localization.
The output of layer l at time t is denoted as ô_t = (ĉ_t, d̂_st, d̂_et). The overall loss function is formulated as:
ℒ = (1/N_pos) Σ_{t: c_t &gt; 0} (σ_IoU ℒ_cls + ℒ_reg) + (1/N_neg) Σ_{t: c_t = 0} ℒ_cls,   (7)
where N_pos and N_neg represent the numbers of positive and negative samples, respectively.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Dataset and Experimental Setup</title>
        <p>Datasets</p>
        <p>We conduct experiments on two datasets, including THUMOS14, and ActivityNet-1.3. These
datasets have been widely adopted as standard benchmarks in the temporal action localization task.</p>
        <p>THUMOS14 is a large-scale video dataset, which contains a large number of open-source videos
capturing human actions from 20 classes in real environments. Among all the videos, there are 200
(3,007 action instances) and 213 (3,358 action instances) untrimmed videos with temporal annotations in the
validation and test set, respectively. Following the common setting in THUMOS14, we use the validation
set for training and report results on the test set.</p>
        <p>ActivityNet-1.3 is another popular large-scale dataset for TAL. It includes around 20,000 videos (more
than 600 hours) with 200 action categories. The dataset has three subsets: 10,024 videos for training,
4,926 for validation, and 5,044 for testing. On average, each video comprises approximately 1.5 actions.
Following the common practice, we train our model on the training set and report the performance on
the validation set.</p>
        <p>Evaluation Metric</p>
        <p>We use the mean Average Precision (mAP) at various temporal Intersection
over Union (tIoU) thresholds to evaluate the TAL performance of different methods. For the THUMOS14
dataset, we report results at tIoU thresholds from 0.3 to 0.7 with a step of 0.1. For the ActivityNet-1.3
dataset, we report results at tIoU thresholds [0.5, 0.75, 0.95].</p>
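        <p>The tIoU criterion underlying this metric can be sketched as follows (a prediction counts as a
true positive at threshold τ when its tIoU with a matched ground-truth segment is at least τ):</p>

```python
def tiou(seg_a, seg_b):
    """Temporal IoU between two segments given as (start, end)."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0

pred, gt = (10.0, 20.0), (12.0, 22.0)
score = tiou(pred, gt)     # overlap 8 over union 12
# True-positive decision at each THUMOS14 threshold.
hits = {tau: score >= tau for tau in (0.3, 0.4, 0.5, 0.6, 0.7)}
```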
        <p>Implementation Details</p>
        <p>Our model is trained end-to-end with the AdamW [23] optimizer. The initial
learning rate is set to 10^−4 for THUMOS14 and 10^−3 for ActivityNet. We detach the gradient before
the start- and end-boundary heads and initialize the CNN weights of these two heads
with a Gaussian distribution N(0, 0.1) to stabilize the training process. The learning rate is updated
with a Cosine Annealing schedule. We train for 40 epochs on THUMOS14 and 15 epochs on ActivityNet
(including 20 and 10 warmup epochs, respectively). We conduct our experiments on a single NVIDIA A100 GPU.</p>
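        <p>For the THUMOS14 setting (base learning rate 10^−4, 20 warmup epochs out of 40), the schedule can
be sketched as follows; the linear warmup shape and a floor of zero are assumptions of this sketch:</p>

```python
import math

def lr_at(epoch, base_lr=1e-4, warmup=20, total=40, min_lr=0.0):
    """Linear warmup followed by cosine annealing to min_lr."""
    if warmup > epoch:
        return base_lr * (epoch + 1) / warmup        # ramp up to base_lr
    t = (epoch - warmup) / max(1, total - warmup)    # annealing progress in [0, 1]
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))

schedule = [lr_at(e) for e in range(40)]   # peaks at base_lr, then decays
```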
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Main Results</title>
        <p>THUMOS14</p>
        <p>
          We adopt the commonly used I3D [33] as our backbone feature and Tab. 1 presents the
results. Our method achieves an average mAP of 69.2%, outperforming all previous methods including
one-stage and two-stage methods. Notably, our method also achieves better performance than recent
Transformer-based methods [
          <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
          ], which demonstrates that a simple design can also achieve impressive
results. The performance improvement is more evident at higher tIoU thresholds (e.g., 0.6 and 0.7),
highlighting the model’s strength in accurate localization. Although our regression head is inspired
by TriDet, the overall architecture of our framework differs significantly from TriDet. Therefore, a
direct comparison may not effectively demonstrate the effectiveness of the CLFF-Head. Instead, we choose
to validate it through ablation studies under a unified framework.
        </p>
        <p>
          ActivityNet. For the ActivityNet-1.3 dataset, we adopt the TSP R(2+1)D [34] as our
backbone feature. Following previous methods [
          <xref ref-type="bibr" rid="ref16 ref7 ref8">7, 8, 16</xref>
          ], the video classification score predicted from
the UntrimmedNet is multiplied with the final detection score. Tab. 2 presents the results.
Our method achieves the highest scores at tIoU thresholds of 0.5 and 0.75, reaching 54.8% and 37.8%
respectively, on par with TransGMC. The average mAP also reaches 36.7%, matching or slightly
outperforming the current best methods. Although the performance at the strictest tIoU threshold (0.95)
is slightly lower than that of TadTR [26], our method still maintains leading overall performance, especially
under the more commonly used thresholds of 0.5 and 0.75. This suggests that our method is not only
effective for densely labeled and boundary-ambiguous videos (such as those in THUMOS14) but also
adaptable to longer videos with large action spans in more complex scenes (as in ActivityNet-1.3).
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Ablation Study</title>
        <p>
          In this section, we mainly conduct the ablation studies on the THUMOS14 dataset.
The efectiveness of LoGo block To assess the contribution of the LoGo block, we conduct a set
of ablation studies by replacing or simplifying its components. Specifically, we begin with a baseline
temporal feature pyramid adopted from [
          <xref ref-type="bibr" rid="ref16 ref8">8, 16</xref>
          ], which consists of two 1D convolutional layers and a
shortcut connection. We then progressively enhance this baseline by introducing ActionFormer (SA),
the LoGo block, and the CLFF-Head.
        </p>
        <p>
          As shown in Tab. 3, replacing the standard convolutional block with self-attention improves the
average mAP from 62.1% to 66.8%. Further adding the LoGo block yields a notable gain to 68.0%, and
finally, the full model with LoGo and CLFF-Head achieves the best performance with an average mAP
of 69.2% on THUMOS14.
        </p>
        <p>
          The effectiveness of the regression head. To verify the effectiveness of the proposed Cross-Level
Feature Fusion Regression Head, we conduct ablation studies on three types of regression heads: (1) a
lightweight regression head adopted from [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], (2) the regression head used in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], and (3) our proposed
Cross-Level Feature Fusion Regression Head. All other hyperparameters (e.g., the number of pyramid
layers) are kept identical to those used in our framework. Tab. 4 presents the results. While the
regression head from TriDet performs slightly better at high tIoU (0.7), our proposed head shows more
stable performance across thresholds and achieves the highest overall mAP (69.2%). This demonstrates
the benefit of integrating multi-level features to enhance regression robustness and precision.
        </p>
        <p>
          The effectiveness of the feature pyramid level. To investigate the impact of the number of feature
pyramid layers on model performance, we conducted experiments with different pyramid depths and
evaluated their performance under multiple tIoU thresholds. Tab. 5 presents the results. As the number
of layers increases from 3 to 6, performance steadily improves, peaking at 69.2% mAP with 6 layers.
However, further increasing to 7 layers slightly degrades performance, suggesting that while deeper
pyramids can better capture multi-scale action information, excessive depth may introduce redundancy
or training difficulties.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>This paper addresses the critical challenge of jointly optimizing local detail capture and global semantic
modeling in temporal action localization (TAL) by proposing the LoGo framework based on local-global
context modeling. The designed LoGo Block integrates the local structural modeling capability of
depthwise separable convolutions with the global semantic awareness of channel attention mechanisms,
achieving efficient fusion of multi-scale temporal features. Furthermore, the proposed Cross-Level
Feature Fusion Regression Head (CLFF-Head) significantly enhances boundary localization stability
and accuracy through adaptive fusion of multi-level semantic information from the feature pyramid.
Experiments on THUMOS14 and ActivityNet-1.3 demonstrate that the LoGo framework excels in complex
backgrounds and multi-scale action scenarios, outperforming existing methods. These results validate
the effectiveness of the local-global collaborative modeling strategy and cross-layer feature fusion
mechanism.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work is partially supported by National Natural Science Foundation of China (62376231), Sichuan
Science and Technology Program (24NSFSC1070), Fundamental Research Funds for the Central
Universities (2682025ZTPY052).</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used DeepSeek and Grammarly for grammar
and spelling checks and for paraphrasing and rewording. After using these tools, the authors reviewed and
edited the content as needed and take full responsibility for the publication's content.</p>
      <p>… feature for anchor-free temporal action localization, in: Proceedings of the IEEE/CVF conference
on computer vision and pattern recognition, 2021, pp. 3320–3329.
[17] X. Li, W. Wang, L. Wu, S. Chen, X. Hu, J. Li, J. Tang, J. Yang, Generalized focal loss: Learning
qualified and distributed bounding boxes for dense object detection, Advances in Neural Information
Processing Systems 33 (2020) 21002–21012.
[18] F. Chollet, Xception: Deep learning with depthwise separable convolutions, in: Proceedings of
the IEEE conference on computer vision and pattern recognition, 2017, pp. 1251–1258.
[19] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, H. Adam,
MobileNets: Efficient convolutional neural networks for mobile vision applications, arXiv preprint
arXiv:1704.04861 (2017).
[20] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, S. Xie, A ConvNet for the 2020s, in:
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp.
11976–11986.
[21] J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, in: Proceedings of the IEEE conference
on computer vision and pattern recognition, 2018, pp. 7132–7141.
[22] C. Szegedy, S. Ioffe, V. Vanhoucke, A. Alemi, Inception-v4, Inception-ResNet and the impact of
residual connections on learning, in: Proceedings of the AAAI conference on artificial intelligence,
volume 31, 2017.
[23] I. Loshchilov, F. Hutter, Decoupled weight decay regularization, arXiv preprint arXiv:1711.05101
(2017).
[24] Z. Zhu, W. Tang, L. Wang, N. Zheng, G. Hua, Enriching local and global contexts for temporal
action localization, in: Proceedings of the IEEE/CVF international conference on computer vision,
2021, pp. 13516–13525.
[25] D. Shi, Y. Zhong, Q. Cao, J. Zhang, L. Ma, J. Li, D. Tao, ReAct: Temporal action detection with
relational queries, in: European conference on computer vision, Springer, 2022, pp. 105–121.
[26] X. Liu, Q. Wang, Y. Hu, X. Tang, S. Zhang, S. Bai, X. Bai, End-to-end temporal action detection
with transformer, IEEE Transactions on Image Processing 31 (2022) 5427–5441.
[27] T. N. Tang, K. Kim, K. Sohn, TemporalMaxer: Maximize temporal context with only max pooling
for temporal action localization, arXiv preprint arXiv:2303.09055 (2023).
[28] J. Shao, X. Wang, R. Quan, J. Zheng, J. Yang, Y. Yang, Action sensitivity learning for temporal action
localization, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023,
pp. 13457–13469.
[29] K. Xia, L. Wang, Y. Shen, S. Zhou, G. Hua, W. Tang, Exploring action centers for temporal action
localization, IEEE Transactions on Multimedia 25 (2023) 9425–9436.
[30] J. Yang, P. Wei, Z. Ren, N. Zheng, Gated multi-scale transformer for temporal action localization,</p>
      <p>IEEE Transactions on Multimedia 26 (2023) 5705–5717.
[31] W. Wu, T. Lu, J. Wang, P. Tang, F. Gao, Temporal action detection with frequency attention
mechanism, in: 2024 7th International Conference on Mechatronics, Robotics and Automation
(ICMRA), IEEE, 2024, pp. 137–141.
[32] Z. Zhang, C. Palmero, S. Escalera, Dualh: A dual hierarchical model for temporal action localization,
in: 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG),
IEEE, 2024, pp. 1–10.
[33] J. Carreira, A. Zisserman, Quo vadis, action recognition? a new model and the kinetics dataset,
in: proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp.
6299–6308.
[34] H. Alwassel, S. Giancola, B. Ghanem, Tsp: Temporally-sensitive pretraining of video encoders for
localization tasks, in: Proceedings of the IEEE/CVF International Conference on Computer Vision,
2021, pp. 3173–3183.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Temporal action detection with structured segment networks</article-title>
          ,
          <source>in: Proceedings of the IEEE international conference on computer vision</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>2914</fpage>
          -
          <lpage>2923</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Shou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zareian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Miyazawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-F.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <article-title>Cdc: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>5734</fpage>
          -
          <lpage>5743</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Buch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Escorcia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ghanem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fei-Fei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Niebles</surname>
          </string-name>
          ,
          <article-title>End-to-end, single-stream temporal action detection in untrimmed videos</article-title>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Boundary content graph neural network for temporal action proposal generation</article-title>
          ,
          <source>in: Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XXVIII 16</source>
          , Springer,
          <year>2020</year>
          , pp.
          <fpage>121</fpage>
          -
          <lpage>137</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Rojas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Thabet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ghanem</surname>
          </string-name>
          , G-tad:
          <article-title>Sub-graph localization for temporal action detection</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>10156</fpage>
          -
          <lpage>10165</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Huang</surname>
          </string-name>
          , Acgnet:
          <article-title>Action complement graph network for weakly-supervised temporal action localization</article-title>
          ,
          <source>in: Proceedings of the AAAI conference on artificial intelligence</source>
          , volume
          <volume>36</volume>
          ,
          <year>2022</year>
          , pp.
          <fpage>3090</fpage>
          -
          <lpage>3098</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Bertasius</surname>
          </string-name>
          ,
          <article-title>Tallformer: Temporal action localization with a long-memory transformer</article-title>
          ,
          <source>in: European Conference on Computer Vision</source>
          , Springer,
          <year>2022</year>
          , pp.
          <fpage>503</fpage>
          -
          <lpage>521</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.-L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Actionformer: Localizing moments of actions with transformers</article-title>
          ,
          <source>in: European Conference on Computer Vision</source>
          , Springer,
          <year>2022</year>
          , pp.
          <fpage>492</fpage>
          -
          <lpage>510</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <article-title>Tridet: Temporal action detection with relative boundary modeling</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>18857</fpage>
          -
          <lpage>18866</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>V.</given-names>
            <surname>Escorcia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Caba Heilbron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Niebles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ghanem</surname>
          </string-name>
          ,
          <article-title>Daps: Deep action proposals for action understanding</article-title>
          ,
          <source>in: Computer Vision-ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III 14</source>
          , Springer,
          <year>2016</year>
          , pp.
          <fpage>768</fpage>
          -
          <lpage>784</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Buch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Escorcia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ghanem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Carlos</given-names>
            <surname>Niebles</surname>
          </string-name>
          ,
          <article-title>Sst: Single-stream temporal action proposals</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>2911</fpage>
          -
          <lpage>2920</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>T.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wen</surname>
          </string-name>
          , Bmn:
          <article-title>Boundary-matching network for temporal action proposal generation</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF international conference on computer vision</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>3889</fpage>
          -
          <lpage>3898</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>G.</given-names>
            <surname>Gong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mu</surname>
          </string-name>
          , Scale matters:
          <article-title>Temporal scale aggregation network for precise action localization in untrimmed videos</article-title>
          ,
          <source>in: 2020 IEEE international conference on multimedia and expo (ICME)</source>
          , IEEE,
          <year>2020</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Qing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Gan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Sang</surname>
          </string-name>
          ,
          <article-title>Temporal context aggregation network for temporal action proposal refinement</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>485</fpage>
          -
          <lpage>494</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Redmon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Divvala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          ,
          <article-title>You only look once: Unified, real-time object detection</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>779</fpage>
          -
          <lpage>788</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>C.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <article-title>Learning salient boundary feature for anchor-free temporal action localization</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>