<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Two-Stream Network and Attention Mechanism for Sports Video Classification in Table tennis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pengcheng Dong</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hongxin Xie</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fuqiang Zheng</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jiande Sun</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Information Science and Engineering, Shandong Normal University</institution>
          ,
          <addr-line>Jinan</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Physical Education, Shandong Normal University</institution>
          ,
          <addr-line>Jinan</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Precise recognition of fine-grained actions in sports videos requires models that capture subtle spatiotemporal cues. We introduce a hybrid framework that combines SlowFast[1] for temporal modeling with CBAM[2] for channel and spatial attention and TAM[3] for temporal attention. The model targets the recognition of fine-grained actions in high-speed sports videos, with a specific emphasis on table tennis. We validate the framework on the TTStroke-21 dataset[4, 5], showing improved performance in fine-grained action classification and position detection in table tennis videos. Experimental results demonstrate the efficacy of the hybrid approach in discerning subtle stroke variations and localizing actions[6], indicating its potential for sports analytics and player performance assessment.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Sports video analysis has grown rapidly in recent years, particularly the recognition of
fine-grained movements in table tennis videos[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Such methods give
coaches and analysts detailed insight into player performance, helping them
refine training strategies and develop athletes’ potential.
      </p>
      <p>
        Our framework combines several established techniques. SlowFast[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] handles temporal modeling,
allowing the model to discern the rapid stroke variations typical of table tennis. CBAM[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
dynamically recalibrates channel and spatial information, while TAM[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] refines temporal
representations. Together, they improve the model’s precision in recognizing
fine-grained actions within video sequences.
      </p>
      <p>We chose SlowFast for its ability to combine spatial
and temporal cues in sports videos, which suits the fast-paced nature of table tennis.
It serves as the backbone of our model, offering the insight
into temporal dynamics needed to recognize and categorize fine-grained actions in
table tennis. We rely on its capacity to capture the
rapid and intricate motions of this high-speed sport to improve
sports analytics and the understanding of athlete performance in dynamic
sporting environments.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Approach</title>
      <sec id="sec-2-1">
        <title>2.1. Integration of Attention Mechanisms</title>
        <p>
          To improve fine-grained action classification and position detection in table tennis video
analysis, our network incorporates two attention mechanisms: the Convolutional Block
Attention Module (CBAM)[
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] for channel and spatial focus, and the temporal attention mechanism
TAM[
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] for temporal insights. Integrated into the network, these mechanisms enhance its
discriminative capabilities, effectively capturing the intricate spatiotemporal patterns inherent in
table tennis sequences.
        </p>
        <p>
          The CBAM[
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] module strengthens the network by modeling channel-wise
relationships within feature maps. It adaptively recalibrates feature responses to emphasize
salient features while suppressing irrelevant information. Integrating CBAM[
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] facilitates
the extraction of discriminative spatial features, improving fine-grained action classification
and position detection in table tennis sequences.
        </p>
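        <p>The channel and spatial recalibration described above can be illustrated with a minimal sketch. It is not the implementation used in this work: it operates on a single 2D feature map stored as nested lists, the shared-MLP weights w1 and w2 are hypothetical, and the learned 7x7 convolution of the spatial branch is replaced by a plain sum of the pooled maps.</p>
        <preformat>
```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(fmap, w1, w2):
    """Return one gate in (0, 1) per channel.

    fmap: C channels, each an H x W nested list.
    w1 (hidden x C) and w2 (C x hidden): hypothetical shared-MLP weights.
    """
    flat = [[v for row in ch for v in row] for ch in fmap]
    avg_desc = [sum(f) / len(f) for f in flat]   # global average pooling
    max_desc = [max(f) for f in flat]            # global max pooling

    def mlp(vec):
        # One ReLU hidden layer, shared by both pooled descriptors.
        hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w1]
        return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

    return [sigmoid(a + m) for a, m in zip(mlp(avg_desc), mlp(max_desc))]


def spatial_attention(fmap):
    """Return an H x W map of gates in (0, 1).

    Assumption: the learned 7x7 convolution of CBAM is replaced by a plain
    sum of the channel-wise average and max maps.
    """
    channels, height, width = len(fmap), len(fmap[0]), len(fmap[0][0])
    gates = []
    for i in range(height):
        row = []
        for j in range(width):
            vals = [fmap[c][i][j] for c in range(channels)]
            row.append(sigmoid(sum(vals) / channels + max(vals)))
        gates.append(row)
    return gates


def cbam(fmap, w1, w2):
    """Apply channel gating first, then spatial gating."""
    ch = channel_attention(fmap, w1, w2)
    scaled = [[[v * g for v in row] for row in chan] for chan, g in zip(fmap, ch)]
    sp = spatial_attention(scaled)
    return [[[v * g for v, g in zip(row, gate_row)]
             for row, gate_row in zip(chan, sp)] for chan in scaled]


# Toy input: 2 channels of size 2 x 2, hidden size 1 (illustrative values).
feature_map = [[[1.0, 2.0], [3.0, 4.0]], [[0.5, 0.0], [0.0, 0.5]]]
w1 = [[0.5, -0.5]]
w2 = [[1.0], [-1.0]]
refined = cbam(feature_map, w1, w2)
```
        </preformat>
        <p>Because every gate lies strictly between 0 and 1, the refined map keeps the shape of the input and never increases the magnitude of any activation; it only re-weights channels and locations relative to one another.</p>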
        <p>
          In parallel, TAM[
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] captures long-range dependencies and temporal
relationships in table tennis videos. By applying attention along the temporal
dimension, TAM[
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] lets the model highlight crucial temporal information, which
contributes to the precision of action recognition and position detection. It effectively down-weights
redundant frames and accentuates subtle temporal dynamics during gameplay.
        </p>
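        <p>As a rough illustration of attention along the temporal dimension, the sketch below scores each frame feature against a query vector, normalizes the scores with a softmax, and pools the frames with the resulting weights. This is a generic temporal attention example, not the exact TAM design; the query vector and the toy frame features are hypothetical.</p>
        <preformat>
```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(frame_features, query):
    """Weight T frame vectors by their scaled dot product with a query.

    frame_features: list of T vectors, each of dimension D.
    query: vector of dimension D (hypothetical learned parameter).
    Returns (weights over frames, attention-pooled feature vector).
    """
    dim = len(query)
    scores = [sum(f * q for f, q in zip(feat, query)) / math.sqrt(dim)
              for feat in frame_features]
    weights = softmax(scores)
    pooled = [sum(w * feat[k] for w, feat in zip(weights, frame_features))
              for k in range(dim)]
    return weights, pooled


# Three toy frames; the last one aligns most strongly with the query.
frames = [[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]]
query = [1.0, 1.0]
weights, pooled = temporal_attention(frames, query)
```
        </preformat>
        <p>Frames that align poorly with the query receive small weights, which is one way to read the down-weighting of redundant frames described above.</p>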
      </sec>
      <sec id="sec-2-2">
        <title>2.2. SlowFast Networks</title>
        <p>
          The SlowFast[
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] architecture processes spatial and temporal information via dual
pathways, enabling a thorough analysis of motion dynamics in table tennis videos. The
slow pathway captures detailed spatial information, while the fast pathway focuses on rapid
temporal changes; fusing the two yields features that are both spatially detailed and temporally dynamic.
This fusion improves the precision of identifying fine-grained actions and
detecting player positions during gameplay. Furthermore, the integration of Adaptive
Time Attention accentuates critical temporal segments, enhancing the network’s ability
to discern significant temporal dynamics and thereby refining action recognition and position
detection[
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. In summary, incorporating SlowFast[
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] markedly improves the accuracy of
fine-grained action classification and position detection in table tennis videos.
        </p>
        <p>
          The model architecture, depicted
in Figure 1, integrates CBAM[
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] within the slow branch and TAM within the
fast branch. Placing CBAM[
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] in the slow branch capitalizes on its ability to extract
spatial and channel-related details, since that branch processes fewer image frames but carries richer
channel information. Conversely, the fast branch has fewer channels but
more image frames, so TAM is well placed there to capture temporal relationships among these frames.
        </p>
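        <p>The contrast between the two branches (few frames and many channels versus many frames and few channels) can be made concrete with a small sampling sketch. The values tau=16, alpha=8 and the 1/8 channel ratio are typical settings from the SlowFast paper and are assumptions here, not the configuration reported in this work; the function name is illustrative.</p>
        <preformat>
```python
def slowfast_indices(num_frames, tau=16, alpha=8):
    """Frame indices seen by the slow and fast pathways.

    tau: temporal stride of the slow pathway.
    alpha: frame-rate ratio, so the fast pathway uses stride tau // alpha.
    """
    if tau % alpha != 0:
        raise ValueError("tau must be divisible by alpha")
    fast_stride = tau // alpha
    slow = list(range(0, num_frames, tau))
    fast = list(range(0, num_frames, fast_stride))
    return slow, fast


slow_idx, fast_idx = slowfast_indices(num_frames=64)

# Channel widths at the first stage (assumed): the fast pathway is 8x thinner.
slow_channels = 64
fast_channels = slow_channels // 8

print(len(slow_idx), len(fast_idx))   # 4 32
print(slow_channels, fast_channels)   # 64 8
```
        </preformat>
        <p>Under these assumed settings, a 64-frame clip gives the slow branch 4 frames with 64 channels and the fast branch 32 frames with 8 channels, which matches the rationale for placing the channel and spatial attention in the slow branch and the temporal attention in the fast branch.</p>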
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results and Analysis</title>
      <p>
        We evaluate video classification with a model built on the SlowFast[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] architecture. Our model was benchmarked against
TimeSformer[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and SlowFast[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. TimeSformer[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] is a
video understanding model built on a Transformer[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] architecture that applies
space-time attention to enhance temporal feature processing. We use the
pre-trained models from OpenMMLab[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. The experimental setup uses a batch size
of 8, an initial learning rate of 1e-2, a weight decay of 1e-5, a momentum of 0.9,
and 500 training epochs. We train with the cross-entropy loss
for classification[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and a learning rate scheduler that adjusts the rate based on validation set
performance.
      </p>
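      <p>The training setup can be summarized in a small sketch. The hyperparameter values are those stated above; the plateau rule (multiply the learning rate by a factor after a number of epochs without validation improvement), its factor and patience, and all class and function names are assumptions, since the exact scheduler is not specified.</p>
      <preformat>
```python
import math
from dataclasses import dataclass


@dataclass
class TrainConfig:
    # Values taken from the experimental setup described in the text.
    batch_size: int = 8
    lr: float = 1e-2
    weight_decay: float = 1e-5
    momentum: float = 0.9
    epochs: int = 500


def cross_entropy(logits, target):
    """Cross-entropy of one sample: log-sum-exp of the logits minus the target logit."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]


class PlateauScheduler:
    """Hypothetical scheduler: shrink the learning rate when validation accuracy stalls."""

    def __init__(self, lr, factor=0.1, patience=10):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_accuracy):
        if val_accuracy > self.best:
            self.best = val_accuracy
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr


config = TrainConfig()
scheduler = PlateauScheduler(config.lr, factor=0.1, patience=2)
for acc in [0.50, 0.40, 0.45]:   # toy validation accuracies
    current_lr = scheduler.step(acc)
```
      </preformat>
      <p>In this toy run the validation accuracy fails to improve for two consecutive epochs, so the learning rate drops from 1e-2 to 1e-3. In practice the factor and patience would be tuned on the validation set.</p>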
      <p>In Table 1, the accuracy of TimeSformer is reported as 82.17%. We then carried out ablation
experiments to assess the contribution of each component. Accuracy was
82.17% with SlowFast alone and rose to 84.28% when CBAM was integrated with
SlowFast. Combining both CBAM and TAM
with SlowFast raised it further to 87.39%. Additionally, in the comparison against the Baseline model, our model reached
an accuracy of 74.6%, as shown in Table 2.</p>
      <p>Our analysis identifies three
primary reasons for the remaining errors. First, although the SlowFast network captures
spatiotemporal information, it may struggle to discern very subtle
actions, especially in highly dynamic sports like table tennis. This limitation could affect the
precision of action classification and position detection because fine-grained
details are not adequately captured. Second, the dual-pathway design of SlowFast introduces a trade-off between capturing
spatial details and temporal dynamics. Balancing the two pathways so that they
effectively capture both spatial and temporal information remains a challenge, which affects the model’s
consistency in discerning fine-grained actions and detecting player positions. Lastly,
the defensive and offensive classes contain similar actions despite their different
labels. Defensive data includes "backspin," "block," and "push," while offensive data
comprises "flip," "hit," and "loop." The high similarity between these actions
makes accurate classification difficult.</p>
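      <p>The incremental gains in the ablation study follow directly from the accuracies quoted in the text; the short sketch below only reproduces that arithmetic and adds no new results.</p>
      <preformat>
```python
# Accuracies (%) as reported in the text for the ablation study.
ablation = [
    ("SlowFast", 82.17),
    ("SlowFast + CBAM", 84.28),
    ("SlowFast + CBAM + TAM", 87.39),
]


def incremental_gains(rows):
    """Gain of each configuration over the previous one, in percentage points."""
    return [(rows[k][0], round(rows[k][1] - rows[k - 1][1], 2))
            for k in range(1, len(rows))]


gains = incremental_gains(ablation)
total_gain = round(ablation[-1][1] - ablation[0][1], 2)

for name, gain in gains:
    print(f"{name}: +{gain:.2f} points")
print(f"total: +{total_gain:.2f} points")
```
      </preformat>
      <p>Adding CBAM contributes 2.11 percentage points and adding TAM a further 3.11, for a total gain of 5.22 points over SlowFast alone.</p>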
    </sec>
    <sec id="sec-4">
      <title>4. Discussion and Outlook</title>
      <p>Future research endeavors may concentrate on refining the model’s adaptability to diverse
scenarios within table tennis matches. Exploring supplementary attention mechanisms or
incorporating diverse deep learning architectures holds promise in augmenting comprehension
and precision for identifying intricate table tennis movements. Moreover, investigating transfer
learning and generalization across diverse sports domains could significantly broaden the scope
of applying this technology in sports video analysis.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Feichtenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Malik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <article-title>SlowFast networks for video recognition</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF international conference on computer vision</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>6202</fpage>
          -
          <lpage>6211</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Woo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-Y.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. S.</given-names>
            <surname>Kweon</surname>
          </string-name>
          ,
          <article-title>CBAM: Convolutional block attention module</article-title>
          ,
          <source>in: Proceedings of the European conference on computer vision (ECCV)</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>19</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. Y.</given-names>
            <surname>Chau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. R. L.</given-names>
            <surname>Sheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Y.</given-names>
            <surname>Tam</surname>
          </string-name>
          ,
          <article-title>Examining the technology acceptance model using physician acceptance of telemedicine technology</article-title>
          ,
          <source>Journal of management information systems 16</source>
          (
          <year>1999</year>
          )
          <fpage>91</fpage>
          -
          <lpage>112</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.-E.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Benois-Pineau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Péteri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Morlier</surname>
          </string-name>
          ,
          <article-title>Fine grained sport action recognition with twin spatio-temporal convolutional neural networks: Application to table tennis</article-title>
          ,
          <source>Multimedia Tools and Applications</source>
          <volume>79</volume>
          (
          <year>2020</year>
          )
          <fpage>20429</fpage>
          -
          <lpage>20447</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.-E.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Benois-Pineau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Péteri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Morlier</surname>
          </string-name>
          ,
          <article-title>Sport action recognition with Siamese spatio-temporal CNNs: Application to table tennis</article-title>
          ,
          <source>in: 2018 International Conference on Content-Based Multimedia Indexing (CBMI)</source>
          , IEEE,
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Erades</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. V. B.</given-names>
            <surname>Mansencal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Péteri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Morlier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Duffner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Benois-Pineau</surname>
          </string-name>
          ,
          <article-title>Sportsvideo: A multimedia dataset for event and position detection in table tennis and swimming</article-title>
          ,
          <source>in: Working Notes Proceedings of the MediaEval 2023 Workshop</source>
          , Amsterdam, The Netherlands and Online, 1-2 February
          <year>2024</year>
          , CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. G.</given-names>
            <surname>Ali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Muhammad</surname>
          </string-name>
          , et al.,
          <article-title>Video-based table tennis tracking and trajectory prediction using convolutional neural networks</article-title>
          ,
          <source>Fractals</source>
          <volume>30</volume>
          (
          <year>2022</year>
          )
          <fpage>2240156</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>3D convolutional neural networks for human action recognition</article-title>
          ,
          <source>IEEE transactions on pattern analysis and machine intelligence</source>
          <volume>35</volume>
          (
          <year>2012</year>
          )
          <fpage>221</fpage>
          -
          <lpage>231</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mei</surname>
          </string-name>
          ,
          <article-title>Learning spatio-temporal representation with pseudo-3d residual networks</article-title>
          ,
          <source>in: Proceedings of the IEEE International Conference on Computer Vision</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>5533</fpage>
          -
          <lpage>5541</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>G.</given-names>
            <surname>Bertasius</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Torresani</surname>
          </string-name>
          ,
          <article-title>Is space-time attention all you need for video understanding?</article-title>
          , in: ICML, volume
          <volume>2</volume>
          ,
          <year>2021</year>
          , p.
          <fpage>4</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>K.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Transformer in transformer</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>34</volume>
          (
          <year>2021</year>
          )
          <fpage>15908</fpage>
          -
          <lpage>15919</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>MMAction2 Contributors</surname>
          </string-name>
          ,
          <article-title>OpenMMLab's next generation video understanding toolbox and benchmark</article-title>
          , https://github.com/open-mmlab/mmaction2
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>L.</given-names>
            <surname>Hui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Belkin</surname>
          </string-name>
          ,
          <article-title>Evaluation of neural architectures trained with square loss vs cross-entropy in classification tasks</article-title>
          , arXiv preprint arXiv:2006.07322 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>P.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <article-title>Baseline method for the sport task of MediaEval 2023: 3D CNNs using attention mechanisms for table tennis stroke detection and classification</article-title>
          ,
          <source>in: Working Notes Proceedings of the MediaEval 2023 Workshop</source>
          , Amsterdam, The Netherlands and Online, 1-2 February
          <year>2024</year>
          , CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>