1. Introduction

Two-Stream Network and Attention Mechanism for Sports Video Classification in Table tennis

Pengcheng Dong

Hongxin Xie

Fuqiang Zheng

Jiande Sun

0 0 School of Information Science and Engineering, Shandong Normal University , Jinan , China 1 School of Physical Education, Shandong Normal University , Jinan , China

Precise recognition of fine-grained actions in sports videos requires robust models proficient in capturing intricate spatiotemporal cues. Our study introduces a novel hybrid framework that combines SlowFast[1] for refined temporal modeling with CBAM[ 2] for channel and spatial attention. Additionally, TAM[3] integrates sophisticated temporal attention mechanisms within our innovative architecture. Our model aims to elevate the comprehension and identification of intricate actions within high-speed sports videos, with a specific emphasis on table tennis. We validate our proposed framework using the rigorous TTStroke-21 dataset[4, 5], showcasing its superior performance in fine-grained action classification and accurate position detection within table tennis videos. Experimental outcomes vividly demonstrate the eficacy of our hybrid approach in discerning nuanced stroke variations and precisely localizing actions[6], signifying its substantial potential in sports analytics and comprehensive player performance assessment.

1. Introduction

Recent years have witnessed a notable surge in sports video analysis, notably in the nuanced deciphering of intricate movements within table tennis videos[ 7 ]. Advanced methodologies have become pivotal in extracting comprehensive insights into player performance, enabling coaches and analysts to refine training strategies and optimize athletes’ potential.

Our proposed framework stands as a significant stride in this domain, amalgamating a sophisticated fusion of cutting-edge techniques. Employing SlowFast[ 1 ] for temporal modeling assures a profound comprehension of temporal dynamics, empowering our model to discern the intrinsic rapid stroke variations prevalent in the realm of table tennis. Moreover, the CBAM[ 2 ] dynamically recalibrates channel and spatial information, while the TAM[ 3 ] refines temporal representations. Their collective efect significantly boosts our model’s precision in recognizing ifne-grained actions within video sequences.

This choice of model was propelled by SlowFast’s capability to harmoniously combine spatial and temporal cues in sports videos, particularly suited for the fast-paced nature of table tennis. The incorporation of SlowFast serves as a cornerstone in our model, ofering nuanced insights into temporal dynamics crucial for recognizing and categorizing fine-grained actions within the context of table tennis. Therefore, its selection was grounded in its capacity to comprehend the rapid and intricate motions intrinsic to this high-speed sport, aiming to substantially enhance sports analytics methodologies and further our understanding of athlete performance in dynamic sporting environments.

2. Approach 2.1. Integration of Attention Mechanisms

To refine Fine-Grained Action Classification and Position Detection within Table tennis video analysis, our network architecture incorporates attention mechanisms such as Channel Attention Block (CBAM)[ 2 ] for channel-specific focus and Spatiotemporal Attention Mechanism (TAM)[ 3 ] for temporal insights. Seamlessly integrated, these mechanisms significantly enhance discriminative capabilities, efectively capturing intricate spatiotemporal patterns inherent in Table tennis sequences.

The CBAM[ 2 ] module enriches network capability by focusing on insightful channel-wise relationships within feature maps. It adaptively recalibrates feature responses to emphasize salient features while mitigating irrelevant information. The integration of CBAM[ 2 ] facilitates the extraction of discriminative spatial features, enhancing fine-grained action classification and precise position detection within Table tennis sequences.

Simultaneously, TAM [ 3 ] adeptly captures long-range dependencies and temporal relationships in Table tennis videos. Operating through attention mechanisms across temporal dimensions, TAM[ 3 ] empowers the model to highlight crucial temporal information, significantly contributing to the precision of action recognition and position detection. It efectively filters out redundant frames and accentuates subtle temporal dynamics during gameplay.

2.2. SlowFast Networks

The SlowFast[ 1 ]architecture adeptly processes spatial and temporal information via dual pathways, enabling a thorough analysis of motion dynamics in Table tennis videos. While the slow pathway meticulously captures intricate spatial details, the fast pathway focuses on rapid temporal changes, facilitating the fusion of detailed spatial and dynamic temporal features. This fusion significantly augments the precision of identifying fine-grained actions and accurately detecting player positions during gameplay. Furthermore, the integration of Adaptive Time Attention accentuates critical temporal segments, enhancing the network’s proficiency in discerning significant temporal dynamics, thereby refining action recognition and position detection[ 8 ]. In summary, the incorporation of SlowFast[ 1 ]markedly amplifies the accuracy of Fine-Grained Action Classification and Position Detection in Table tennis videos, showcasing its pivotal role in advancing the landscape of sports video analysis.The model architecture, depicted in Figure 1, showcases the integration of CBAM[ 2 ] within the slow branch and TAM within the fast branch. CBAM’s[ 2 ] placement within the slow branch capitalizes on its capability to extract spatial and channel-related details, given the branch’s lower count of image frames but richer channel information. Conversely, in the fast branch characterized by fewer channel details but more image frames, TAM excels in capturing temporal relationships among these frames.

3. Results and Analysis

This study presents a comprehensive experimental framework for video classification employing the sophisticated architecture termed SlowFast[ 1 ].Our model was benchmarked against Timesformer[ 10 ] and SlowFast[ 1 ] in experimental comparisons.Timesformer[ 10 ] is a neural network model designed for time series data, utilizing a Transformer[ 11 ] architecture optimized with time-based attention mechanisms to enhance temporal feature processing.We use the pre-training model from open-mmlab in [ 12 ]. The experimental setup involves a batch size of 8 samples, an initial learning rate of 1e-2, weight decay of 1e-5, momentum of 0.9, encompassing a training regime spanning 500 epochs. Furthermore, employing the cross-entropy loss function for classification tasks[ 13 ], a dynamic learning rate scheduler, based on validation set performance, ensures continual performance improvement.

In Table 1, the accuracy of Timesformer is reported as 82.17%. Subsequent ablation experiments were carried out to substantiate the superiority of our model. Accuracy was recorded at 82.17% when employing only SlowFast, which increased to 84.28% upon integrating CBAM with SlowFast. Further enhancement to 87.39% was achieved by combining both CBAM and TAM with SlowFast. Additionally, comparative analysis against the Baseline model demonstrated an accuracy of 74.6% for our model, as illustrated in Table 2. Our analysis identifies three primary reasons for these errors. Firstly, despite the SlowFast network’s capability in capturing spatiotemporal information, it might encounter challenges in discerning very subtle or nuanced actions, especially in highly dynamic sports like table tennis. This limitation could afect the precision of action classification and position detection by not adequately capturing fine-grained details.Secondly, the dual-pathway design of SlowFast introduces a trade-of between capturing spatial details and temporal dynamics. Achieving a balance between these pathways to efectively capture both spatial and temporal information remains a challenge, impacting the model’s consistency in discerning fine-grained actions and accurately detecting player positions.Lastly, within the defensive and ofensive datasets, there exists similarity in actions despite the diferent labels assigned. While defensive data includes "backspin," "block," and "push," ofensive data comprises "flip," "hit," and "loop." Despite the diferent labels, the high similarity in the actions poses challenges for accurate classification.

4. Discussion and Outlook

Future research endeavors may concentrate on refining the model’s adaptability to diverse scenarios within table tennis matches. Exploring supplementary attention mechanisms or incorporating diverse deep learning architectures holds promise in augmenting comprehension and precision for identifying intricate table tennis movements. Moreover, investigating transfer learning and generalization across diverse sports domains could significantly broaden the scope of applying this technology in sports video analysis.

[1]

Feichtenhofer ,

Fan ,

Malik ,

He , Slowfast networks for video recognition , in: Proceedings of the IEEE/CVF international conference on computer vision , 2019 , pp. 6202 - 6211 .

[2]

Woo ,

Park , J.-

Lee ,

I. S.

Kweon , Cbam: Convolutional block attention module , in: Proceedings of the European conference on computer vision (ECCV) , 2018 , pp. 3 - 19 .

[3]

P. J.

Hu ,

P. Y.

Chau ,

O. R. L.

Sheng ,

K. Y.

Tam , Examining the technology acceptance model using physician acceptance of telemedicine technology , Journal of management information systems 16 ( 1999 ) 91 - 112 .

[4]

P.-E.

Martin ,

Benois-Pineau ,

Péteri ,

Morlier , Fine grained sport action recognition with twin spatio-temporal convolutional neural networks: Application to table tennis , Multimedia Tools and Applications 79 ( 2020 ) 20429 - 20447 .

[5]

P.-E.

Martin ,

Benois-Pineau ,

Péteri , J. Morlier, Sport action recognition with siamese spatiotemporal cnns: Application to table tennis , in: 2018 International Conference on Content-Based Multimedia Indexing (CBMI) , IEEE, 2018 , pp. 1 - 6 .

[6]

Erades ,

Martin ,

R. V. B.

Mansencal ,

Péteri ,

Morlier ,

Dufner ,

Benois-Pineau , Sportsvideo: A multimedia dataset for event and position detection in table tennis and swimming , in: Working Notes Proceedings of the MediaEval 2023 Workshop , Amsterdam, The Netherlands and Online and Online, 1-2 February 2024 , CEUR Workshop Proceedings, CEUR-WS.org, 2023 .

[7]

Li ,

S. G.

Ali ,

Zhang ,

Sheng ,

Li ,

Jung ,

Wang ,

Yang ,

Lu ,

Muhammad , et al., Video-based table tennis tracking and trajectory prediction using convolutional neural networks , Fractals 30 ( 2022 ) 2240156 .

[8]

Ji ,

Xu ,

Yang ,

Yu , 3d convolutional neural networks for human action recognition , IEEE transactions on pattern analysis and machine intelligence 35 ( 2012 ) 221 - 231 .

[9]

Qiu ,

Yao ,

Mei , Learning spatio-temporal representation with pseudo-3d residual networks , in: proceedings of the IEEE International Conference on Computer Vision , 2017 , pp. 5533 - 5541 .

[10]

Bertasius ,

Wang ,

Torresani , Is space-time attention all you need for video understanding? , in: ICML, volume 2 , 2021 , p. 4 .

[11]

Han , A . Xiao , E. Wu,

Guo ,

Xu ,

Wang , Transformer in transformer, Advances in Neural Information Processing Systems 34 ( 2021 ) 15908 - 15919 .

[12]

Contributors , Openmmlab's next generation video understanding toolbox and benchmark , https://github.com/open-mmlab/ mmaction2 , 2020 .

[13]

Hui ,

Belkin , Evaluation of neural architectures trained with square loss vs cross-entropy in classification tasks , arXiv preprint arXiv: 2006 . 07322 ( 2020 ).

[14]

Martin , Baseline method for the sport task of mediaeval 2023 3d cnns using attention mechanisms for table tennis stoke detection and classification ., in: Working Notes Proceedings of the MediaEval 2023 Workshop , Amsterdam, The Netherlands and Online and Online, 1-2 February 2024 , CEUR Workshop Proceedings, CEUR-WS.org, 2023 .