<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Self-Attention Refinement for Actionness Temporal Action Localization</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jiale Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yanzhu Hu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yingjian Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yan Qin</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Beijing University of Posts and Telecommunications</institution>
          ,
          <addr-line>Beijing</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Urban Safety and Environmental Science, Beijing Academy of Science and Technology</institution>
          ,
          <addr-line>Beijing</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <fpage>194</fpage>
      <lpage>199</lpage>
      <abstract>
        <p>In this paper, we address the problem of temporal action localisation. We propose an actionness-based method that obtains the start time, end time and type of each action by aggregating action instance segmentation over a sequence of temporal features. In addition, we argue that the context of actions is reflected not only in the results of convolution: inter-class similarity and intra-class consistency are also essential. For this reason, we design a temporal self-attention mechanism (TSA) and a temporal pyramid pooling module (TPP). Our results show that a single-stage model can achieve considerable accuracy after proper feature fusion.</p>
      </abstract>
      <kwd-group>
        <kwd>actionness</kwd>
        <kwd>TSA</kwd>
        <kwd>TPP</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Action recognition is a key technology in the field of computer vision. In recent years, great progress
has been made in action recognition, and the relevant techniques have been applied to video
understanding, intelligent security and other areas. With the continuous progress of deep learning and
image algorithms, large-scale action recognition network models and datasets of complex scenes have
appeared, which has further advanced the field. As research has deepened, action recognition has moved
from simple classification of trimmed videos to the identification of action instances in complex scenes,
and then to action localization. As network models improve, more information is learned and the model's
output becomes more complex.</p>
      <p>At present, action recognition algorithms are mainly divided into video classification, temporal
action recognition and spatio-temporal action recognition. Video classification mainly takes trimmed
videos of fixed length as input and determines the video category after extracting features through a
backbone network. Temporal action recognition builds on this: it takes an untrimmed video and uses the
feature information to predict the category and the start and end times of the actions in the video.
Spatio-temporal action recognition locates not only the time of an action but also its spatial position.
Compared with simple video classification and complex spatio-temporal localization, temporal action
recognition is the most widely used method in the field of abnormal behavior recognition.</p>
      <p>
        The main approach to temporal action detection is similar to that of object detection. After data
processing, different proposal methods are combined with features to complete decoding and obtain the
output. Many detectors are derived from object detection networks, for example SSAD[
        <xref ref-type="bibr" rid="ref2">1</xref>
        ], which was developed from SSD, and DaoTAD[
        <xref ref-type="bibr" rid="ref1 ref3">2</xref>
        ], which was developed from RetinaNet. According to the proposal method, temporal
action recognition techniques can be divided into anchor-based, anchor-free and actionness-based
methods. Anchor-based methods preset anchors of different sizes and scales, calculate the intersection
over union (IoU) between the current output position of the model and each anchor, and assign positive
and negative samples according to that IoU. R-C3D[
        <xref ref-type="bibr" rid="ref4">3</xref>
        ] uses 3D fully convolutional layers
to extract features from videos. TAL-Net[
        <xref ref-type="bibr" rid="ref5">4</xref>
        ] enhances the feature effectiveness of R-C3D by obtaining
global attention over the temporal extent of its receptive field. STPN[
        <xref ref-type="bibr" rid="ref6">5</xref>
        ] imposes constraints by enhancing
feature sparsity, MAAN[
        <xref ref-type="bibr" rid="ref7">6</xref>
        ] improves generalization by reducing the dominant factors of the model, and Zhong et al.[
        <xref ref-type="bibr" rid="ref8">7</xref>
        ]
introduce refinement operations to improve the proposal accuracy of the model. In this paper,
we design a method based on a feature extension mechanism and a receptive field fusion mechanism, and
achieve strong performance.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Method</title>
      <p>The self-attentive temporal action recognition model uses a backbone feature network to extract
three-dimensional features and a fully connected layer to obtain one-dimensional features.</p>
      <sec id="sec-2-1">
        <p>Suppose we are given an untrimmed video and its action instance clips, each annotated with a
category, a start time and an end time, where the category label is a one-hot encoding vector and C is the
number of action categories. Our goal is to output the action category, start time and end time of every
action instance.</p>
        <p>We use the mixed mode of optical flow and RGB as the input, and use an I3D network pre-trained on
Kinetics as the backbone feature extraction network. For an input video, the backbone outputs a
four-dimensional feature map whose spatial dimensions disappear after dimensional compression, giving
new 1D data with one axis for the data dimension and one axis of length T, where T represents the
temporal length. The new 1D data contains the spatial information of each video frame and the overall
temporal information. The feature map is then sent to the TSA module to obtain the global attention,
which is weighted and fused into the original sequence to obtain a new sequence. Through different
convolutions over shared channels, action proposal sequences are obtained, including the action instance
distribution, the classifier output and the time boundary regression. Finally, the obtained action instance
distribution is merged with the new sequence and sent to the TPP module, and the refined action instance
sequence is output through the shared convolution channel.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>2.1. Feature Extractor</title>
      <p>For feature extraction, we use the high-performance ResNet I3D as the backbone network. We remove
the last average pooling layer of ResNet I3D when extracting the video feature information, and compress
the features with global average pooling instead. Given an input video clip, 1D features of shape
(512, 100) are obtained.</p>
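      <p>The compression step can be illustrated with a minimal NumPy sketch. It assumes that the backbone output is laid out as (channels, time, height, width) and that the spatial grid is 4x4; neither the layout nor the grid size is stated in the paper, and the function name global_average_pool is hypothetical.</p>
      <preformat><![CDATA[
```python
import numpy as np


def global_average_pool(features: np.ndarray) -> np.ndarray:
    """Average a (C, T, H, W) feature map over H and W, returning (C, T)."""
    if features.ndim != 4:
        raise ValueError("expected a (C, T, H, W) array")
    return features.mean(axis=(2, 3))


rng = np.random.default_rng(0)
# ASSUMPTION: 512 channels, 100 time steps and a 4x4 spatial grid.
backbone_out = rng.standard_normal((512, 100, 4, 4))
pooled = global_average_pool(backbone_out)
print(pooled.shape)  # (512, 100)
```
]]></preformat>
      <p>Each time step keeps one 512-dimensional vector, so the temporal axis is preserved while the spatial axes are removed.</p>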
    </sec>
    <sec id="sec-4">
      <title>2.2. Temporal Self Attention Module</title>
      <p>In order to obtain the global attention of action instances throughout the time series and to reduce
the impact of the limitations of local feature extraction, we designed the TSA module. We consider that
different action instances have some similarity in their sample distributions. Because actions are
continuous in time, we believe that it is valuable to calculate the feature similarity between successive
categories and between successive action instances. We find that pairs of different actions, and pairs of
an action and the background, should receive different weight coefficients from pairs of similar actions;
in this way, we can avoid learning too much irrelevant information. Similarly, if two sets of patches
represent the same action instance, the information they learn is redundant. To avoid this situation, we
calculate the temporal intersection over union (tIoU) of the two sets of patches, so that the time span
between them is guaranteed to exceed a certain threshold. The working model of TSA is shown in
Figure 2.</p>
      <p>Formally, two different action instances i and j, and the context B between them, form a local
time series (i, B, j), and each instance carries its own temporal information. By decoding, we obtain the
category information in the input information. We then build the condition matrix A, where A(i, j)
represents whether instances i and j belong to the same type of action. The category of instance i is the
class with the maximum value after softmax, which is used as the current prediction category; bg
represents a background sample, and action represents that the current category belongs to the action
categories.</p>
      <p>Similarly, in order to ensure that the correlation weight of two groups of different instances is
calculated, we construct the distance similarity matrix B, where B(i, j) represents the distance similarity
between instances i and j, and we measure this feature through the temporal intersection over union:</p>
      <p>
        <disp-formula>
          <tex-math><![CDATA[\mathrm{tIoU}(i, j) = \frac{\mathrm{interval}_i \cap \mathrm{interval}_j}{\mathrm{interval}_i \cup \mathrm{interval}_j}]]></tex-math>
        </disp-formula>
      </p>
      <p>where interval represents the span between the start and end time of the action instance predicted
by a patch.</p>
      <p>To obtain the core mapping of each patch, we use the self-attention mechanism of the Transformer to
map the time series y to different feature spaces through space mapping, which gives the feature maps of
the time series in different spaces. Finally, the final weight coefficient matrix M is obtained by fusion,
where M(i, j) represents the weight map of feature i on feature j.</p>
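      <p>The temporal overlap that underlies the distance similarity matrix can be sketched as follows. This is a minimal illustration, assuming that every predicted instance is described by a start time and an end time; the function name pairwise_tiou and the example intervals are hypothetical, and the paper does not state how the resulting values are thresholded.</p>
      <preformat><![CDATA[
```python
import numpy as np


def pairwise_tiou(starts, ends):
    """Return the (N, N) matrix of temporal IoU between N intervals."""
    starts = np.asarray(starts, dtype=float)
    ends = np.asarray(ends, dtype=float)
    # Length of the overlap of every pair, clipped at zero for disjoint pairs.
    inter_start = np.maximum(starts[:, None], starts[None, :])
    inter_end = np.minimum(ends[:, None], ends[None, :])
    inter = np.clip(inter_end - inter_start, 0.0, None)
    # Union length = sum of both lengths minus the overlap.
    lengths = ends - starts
    union = lengths[:, None] + lengths[None, :] - inter
    return np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)


# Hypothetical intervals in seconds: two overlapping instances and a distant one.
tiou = pairwise_tiou([0.0, 5.0, 20.0], [10.0, 15.0, 30.0])
print(np.round(tiou, 3))
```
]]></preformat>
      <p>In this example the first two intervals share 5 of 15 seconds, so their tIoU is one third, while the third interval does not overlap either of them.</p>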
      <sec id="sec-4-1">
        <p>The final mapping coefficient matrix G(i, j) is formed through matrices A, B and M, and
represents the correlation coefficient between actions and background. A new time series vector is
obtained through fusion weighted by this coefficient.</p>
        <p>The matrix G is obtained by fusion to effectively ensure that the currently acquired weight matrix
does not incorporate redundant information. Through the matrix V, a filtering operation is performed
on the feature vector after the global attention has been acquired, and finally a new vector is obtained
for the output, which at this point effectively captures the similarity between action classes and expands
the action information.</p>
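        <p>One way to picture this fusion is scaled dot-product self-attention whose weights are gated by a second matrix. The sketch below is an illustration under stated assumptions, not the authors' implementation: the paper does not give the fusion rule, so an elementwise product followed by row renormalisation is assumed, a single gate matrix stands in for the combination of A and B, and the residual addition mirrors the statement that the attention is fused into the original sequence. All names (gated_self_attention, w_q, w_k, w_v, gate) are hypothetical.</p>
        <preformat><![CDATA[
```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def gated_self_attention(y, w_q, w_k, w_v, gate):
    """Self-attention over a (T, C) time series with a (T, T) gate.

    w_q and w_k have shape (C, D); w_v has shape (C, C) so that the
    attended values can be added back to y.
    """
    q, k, v = y @ w_q, y @ w_k, y @ w_v
    m = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # attention weights M
    g = m * gate  # ASSUMPTION: fuse by elementwise product
    g = g / np.clip(g.sum(axis=-1, keepdims=True), 1e-8, None)  # renormalise rows
    return y + g @ v, g


rng = np.random.default_rng(0)
T, C, D = 6, 8, 4
y = rng.standard_normal((T, C))
w_q = 0.1 * rng.standard_normal((C, D))
w_k = 0.1 * rng.standard_normal((C, D))
w_v = 0.1 * rng.standard_normal((C, C))
gate = np.ones((T, T))  # stand-in for the fusion of A and B
out, g = gated_self_attention(y, w_q, w_k, w_v, gate)
print(out.shape)  # (6, 8)
```
]]></preformat>
        <p>With a gate of all ones the function reduces to ordinary self-attention with a residual connection; setting entries of the gate to zero removes the corresponding pairs from the weighted sum.</p>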
      </sec>
    </sec>
    <sec id="sec-5">
      <title>2.3. Temporal Pyramid Pooling Module</title>
      <p>In order to better judge the enhanced time series features, we obtain the class and interval
information, as well as the distribution of the most important action instances, through the convolution
channel, and use them to prejudge the current features.</p>
      <p>Next, in order to fully extract the deep features of the time series and to constrain the start and
end times to precise boundaries, we use the TPP module to refine the temporal features and obtain
salient boundary features, as shown in Figure 3. The input is the fusion of the action instance distribution
feature and the global attention temporal feature. To aggregate deeper temporal features, we downsample
the features with a feature pyramid, processing each feature f with a convolution operation.</p>
      <p>When temporal features are processed with convolution operations, too much feature information is
lost during downsampling because some temporal features span only a short time. At the same time,
convolution blocks have to be stacked in order to maximise the receptive field. We therefore introduce
dilated (hole) convolution: dilated convolution blocks with different dilation rates are applied to each
feature f to obtain temporal feature information at different scales.</p>
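      <p>The effect of different dilation rates can be sketched with a small NumPy example. It assumes a kernel of size 3 shared across channels and dilation rates of 1, 2 and 4, and it concatenates the branches along the channel axis; none of these choices is given in the paper, and the names dilated_conv1d and dilated_pyramid are hypothetical.</p>
      <preformat><![CDATA[
```python
import numpy as np


def dilated_conv1d(x, kernel, dilation):
    """Same-padded 1D dilated convolution (cross-correlation) per channel.

    x has shape (C, T); kernel has shape (K,) with K odd.
    """
    k = kernel.shape[0]
    pad = dilation * (k - 1) // 2
    padded = np.pad(x, ((0, 0), (pad, pad)))
    t = x.shape[1]
    out = np.zeros_like(x, dtype=float)
    for i, weight in enumerate(kernel):
        offset = i * dilation  # taps are spaced `dilation` steps apart
        out += weight * padded[:, offset:offset + t]
    return out


def dilated_pyramid(x, kernel, dilations=(1, 2, 4)):
    """Stack the outputs of several dilation rates along the channel axis."""
    return np.concatenate([dilated_conv1d(x, kernel, d) for d in dilations], axis=0)


# A single impulse makes the spacing of the kernel taps visible.
impulse = np.zeros((1, 11))
impulse[0, 5] = 1.0
kernel = np.array([1.0, 2.0, 3.0])
branches = dilated_pyramid(impulse, kernel)
print(branches.shape)  # (3, 11)
```
]]></preformat>
      <p>With a kernel of size 3, a dilation rate d covers 2d + 1 time steps, so the three branches see 3, 5 and 9 steps without any downsampling of the sequence.</p>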
      <p>Following the actionness approach, we use a mask to obtain the current action instance
distribution, and obtain the action interval distribution by aggregating the instance distributions.
Specifically, we use linear interpolation to upsample the features to the original feature sequence
length T. With the temporal distribution of the action instances, we restore the action distribution in the
original output temporal sequence by means of feature mapping, and finally obtain the category
information by applying softmax to the features of each aggregated dimension.</p>
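      <p>The upsampling and classification steps can be sketched as follows. The example assumes a (classes, time) score map, aligns the first and last time steps when interpolating, and applies softmax over the class axis at each time step; the function names and the toy scores are hypothetical.</p>
      <preformat><![CDATA[
```python
import numpy as np


def upsample_linear(features, target_len):
    """Linearly interpolate a (C, t) array along time to (C, target_len)."""
    t = features.shape[1]
    src = np.linspace(0.0, 1.0, t)
    dst = np.linspace(0.0, 1.0, target_len)
    return np.stack([np.interp(dst, src, row) for row in features])


def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


# Two hypothetical classes scored on a coarse sequence of 3 time steps.
coarse = np.array([[0.0, 2.0, 4.0],
                   [4.0, 2.0, 0.0]])
fine = upsample_linear(coarse, 5)  # restore a sequence of length T = 5
probs = softmax(fine, axis=0)      # class probabilities per time step
labels = probs.argmax(axis=0)
print(fine[0])  # [0. 1. 2. 3. 4.]
```
]]></preformat>
      <p>Consecutive time steps that share the same label can then be aggregated into one action interval.</p>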
    </sec>
    <sec id="sec-6">
      <title>3. Experiment</title>
    </sec>
    <sec id="sec-7">
      <title>3.1. Datasets</title>
      <p>THUMOS14 includes 200 training videos and 212 test videos for the temporal action detection
task. The dataset covers 20 classes of actions, most of which are daily actions. The frame-level
annotation includes the start time, end time and class of each action.</p>
      <p>ActivityNet1.2 is a large action recognition dataset. It contains 4819 training videos and 2383
videos for testing, and it also uses frame-level annotation for training and testing.</p>
    </sec>
    <sec id="sec-8">
      <title>3.2. Training</title>
      <p>We sampled the RGB and optical flow streams at 10 frames per second, with each segment limited
to 128 frames in length and no overlapping frames between segments. During training, we used temporal
random sampling. Specifically, each model input was required to contain at least one action instance
with a tIoU of 0.75 or higher, and the spatial size of each frame was limited to 112x112. To speed up
training, we initialised the backbone feature extraction network with the weights of I3D pre-trained on
Kinetics.</p>
    </sec>
    <sec id="sec-9">
      <title>3.3. Result</title>
      <p>It can be seen from Table 1 and Table 2 that our model performs at a leading level among all
networks that use I3D as the feature extractor. On the THUMOS14 dataset, our model achieves the best
mAP@0.5 and mAP@0.6, which benefits from our TSA and TPP modules, and on the ActivityNet1.2
dataset it also achieves the leading mAP@0.75. This shows that the modules we have designed
are effective.</p>
    </sec>
    <sec>
      <title>4. Conclusions</title>
      <p>To address the inaccurate localisation of temporal action recognition, we propose a TSA module
that extracts the global attention of the model while using suitable filters to remove redundant temporal
information, and a TPP module that fuses temporal information from receptive fields of different levels
to enhance the performance of the model on action instances of different scales. Our model is
implemented end-to-end and achieves almost the same level of effectiveness as two-stage models while
greatly reducing inference time. We use a single-stream columnar network structure, which requires less
computation than a large two-stream network such as SlowFast.</p>
      <p>However, our model also has certain limitations. Firstly, it requires frame-level annotation of real
samples, which raises the difficulty of acquiring datasets. Secondly, attention information can easily be
fused incorrectly when the model faces actions that are relatively similar. Much improvement is
therefore still needed on these points in future work.</p>
    </sec>
    <sec id="sec-10">
      <title>5. References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          Table 2. Performance of different algorithms on ActivityNet1.2 (columns: Model, Backbone, mAP@0.5, mAP@0.75).
          TURN, I3D, --, --;
          R-C3D, C3D, 26.8, --;
          TAL, I3D, 38.2, 18.3;
          GTAN, P3D, 52.6, 34.1;
          SSN, TS, 43.2, 28.7;
          BSN, TS, 46.5, 30.0;
          BMN, TS, 50.1, 34.8;
          BU-TAL, I3D, 43.5, 33.9;
          TSA-TAL, I3D, 51.1, 35.0.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Tianwei</given-names>
            <surname>Lin</surname>
          </string-name>
          :
          <article-title>Single shot temporal action detection</article-title>
          .
          <source>In Proceedings of the 25th ACM international conference on Multimedia</source>
          , pages
          <fpage>988</fpage>
          -
          <lpage>996</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Wang</surname>
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>RGB stream is enough for temporal action detection</article-title>
          .
          <source>arXiv preprint arXiv:2107.04362</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Huijuan</given-names>
            <surname>Xu</surname>
          </string-name>
          :
          <article-title>R-C3D: Region convolutional 3D network for temporal activity detection</article-title>
          .
          <source>In Proceedings of the IEEE International Conference on Computer Vision</source>
          , pages
          <fpage>5783</fpage>
          -
          <lpage>5792</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Yu-Wei</given-names>
            <surname>Chao</surname>
          </string-name>
          :
          <article-title>Rethinking the Faster R-CNN architecture for temporal action localization</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pages
          <fpage>1130</fpage>
          -
          <lpage>1139</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Phuc</given-names>
            <surname>Nguyen</surname>
          </string-name>
          :
          <article-title>Weakly supervised action localization by sparse temporal pooling network</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pages
          <fpage>6752</fpage>
          -
          <lpage>6761</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Yuan</surname>
            <given-names>Yuan</given-names>
          </string-name>
          :
          <article-title>Marginalized average attentional network for weakly-supervised learning</article-title>
          .
          <source>arXiv preprint arXiv:1905.08586</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Jia-Xing</given-names>
            <surname>Zhong</surname>
          </string-name>
          :
          <article-title>Step-by-step erasion, one-by-one collection: a weakly supervised temporal action detector</article-title>
          .
          <source>In Proceedings of the 26th ACM international conference on Multimedia</source>
          , pages
          <fpage>35</fpage>
          -
          <lpage>44</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>