Two Stream Network for Stroke Detection in Table Tennis

Anam Zahra, Pierre-Etienne Martin
CCP Department, Max Planck Institute for Evolutionary Anthropology, D-04103 Leipzig, Germany
anam_zahra@eva.mpg.de, pierre_etienne_martin@eva.mpg.de

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
MediaEval’21, December 13-15 2021, Online

Figure 1: Pipeline method for stroke detection from videos. Cuboids of RGB and optical flow are fed to the network and classified as stroke or non-stroke. The feature dimension is described as follows: channels × temporal × height × width.

ABSTRACT
This paper presents a table tennis stroke detection method from videos. The method relies on a two-stream Convolutional Neural Network processing in parallel the RGB stream and its computed optical flow. The method has been developed as part of the MediaEval 2021 benchmark for the Sport task. Our contribution did not outperform the provided baseline on the test set but performed best among the other participants with regard to the mAP metric.

1 INTRODUCTION
With the advent of Convolutional Neural Networks (CNNs), especially after the success of AlexNet [9], object detection, localization, and classification from images and videos have greatly progressed [3, 5, 9, 22]. The development of computer vision methods has motivated broader applications in the academic world. Our team is currently working on egocentric recordings from children in kindergarten and at home. The analysis of these recordings shall give us an automatic overview of their interactions on a daily basis. We hope to link these interactions with their cognitive development and, thereby, better understand early child development. With our participation in the Sports Video Task [12], in the stroke detection subtask, we hope to perfect our knowledge in event detection and transpose it to our project.

The diversity of applications and visual data in sport makes sports video analysis attractive for researchers. Automated sport event detection and action classification, especially from low-resolution videos, are helpful for monitoring and training purposes. For example, in [6, 10], the authors automate the performance analysis for the training optimization of players. Similarly, the Sports Video Task of the MediaEval 2021 benchmark aims at improving athlete performance and training experience through the first steps of stroke detection and classification from videos.

Event detection in videos is the first step to many other hot topics such as video summarization [7], automated semantic segmentation [1] and action recognition [4, 16]. These methods may be used to build summaries, select highlights, and assist players in training sessions. One way to approach event detection in ball sports is through ball detection and tracking. Several researchers have tried to recover the 2D and 3D ball trajectories to do so [17, 18, 20].

Inspired by [8, 14, 15, 21], this method combines the optical flow and features learned from the RGB stream in order to detect a stroke in table tennis and assess its duration. This implementation is an extension of the baseline code provided by the Sport Task organizers [11].

2 APPROACH
Initially, we sought to use ball detection and tracking to perform stroke detection. The first implementation used the pretrained model TTNet [21]. However, the model failed to adapt to the acquisition conditions of TTStroke-21 [13], on which the task is built, and no fine-tuning was possible since no ball coordinates are available in the provided annotations. Therefore we decided to train a model from scratch.
In this section, we first present the preparation of the videos and then the model that processes the prepared data. Both processes are depicted in Fig. 1. Post processing is then performed to form a final decision.

2.1 Data Preparation
In video content analysis, the motion of objects of interest between frames can be of significant interest in order to understand their evolution in space. As such, we decided to use optical flow as a modality to perform stroke detection. Inspired by [14], we decided to use the DeepFlow method [23] to compute the optical flow from consecutive frames. The optical flow is computed from frames resized to 320 × 128. This size was initially chosen to keep the ball at least two pixels in size, as previously done in [21]. Both the RGB and optical flow frames are consecutively stacked in a tensor of length 75. As in [11], stroke detection is tackled as a classification problem with two classes: “Stroke” and “Non-stroke”.

Figure 2: Training process.

Table 1: Stroke concentration and stroke duration (in frames) per set.

Set     # Strokes/1K frames   Mean              Min   Max
Train   1.85                  143.2 ± 36.16     52    296
Valid   2.28                  134.3 ± 26.13     72    292
Test    0.57                  361.0 ± 770.7     75    4500

2.2 Model
As shown in Fig. 1, our two-stream model is composed of two branches of the same length. Each branch is a succession of four blocks, and each block is composed of a convolutional layer with 3 × 3 × 3 filters, followed by a ReLU activation function and a 2 × 2 × 2 pooling layer. The output of each branch is then flattened and fed to a fully connected layer that outputs a feature vector of length 500. Both feature vectors are then concatenated and fed into a final fully connected layer of length two to predict the “Stroke” and “Non-stroke” classes. One branch takes the RGB frames of the video and the other takes the computed optical flow.
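To make the branch description above concrete, the following is a minimal shape-bookkeeping sketch for one branch, written in plain Python. It is not the authors' implementation. It assumes that the 3 × 3 × 3 convolutions are padded so that they preserve the cuboid size, that the 2 × 2 × 2 pooling uses stride 2 without padding, and that 320 × 128 means width × height. The helper names and the final channel count of 64 are hypothetical, since the paper does not report channel widths.

```python
# Hedged sketch: shape bookkeeping for one branch of the two-stream model.
# Assumptions (not stated in the paper): convolutions keep the (T, H, W) size,
# pooling uses stride 2 with no padding, and channel widths are free parameters.

FEATURE_LEN = 500  # length of each branch's feature vector (from the paper)


def pooled(size, kernel=2, stride=2):
    """Output length of an unpadded pooling window along one axis."""
    return (size - kernel) // stride + 1


def branch_shapes(temporal=75, height=128, width=320, n_blocks=4):
    """Return the (T, H, W) size after each conv + ReLU + pool block."""
    shapes = []
    t, h, w = temporal, height, width
    for _ in range(n_blocks):
        # The size-preserving 3x3x3 convolution leaves (t, h, w) unchanged;
        # only the 2x2x2 pooling shrinks the cuboid.
        t, h, w = pooled(t), pooled(h), pooled(w)
        shapes.append((t, h, w))
    return shapes


def flattened_length(final_channels, shape):
    """Number of values fed to the branch's fully connected layer."""
    t, h, w = shape
    return final_channels * t * h * w


if __name__ == "__main__":
    shapes = branch_shapes()
    for i, s in enumerate(shapes, start=1):
        print(f"after block {i}: T x H x W = {s}")
    # 64 output channels is a hypothetical value used only for illustration.
    print("flattened length per branch:", flattened_length(64, shapes[-1]))
    # Two 500-d branch vectors are concatenated before the 2-class layer.
    print("fused feature length:", 2 * FEATURE_LEN)
```

Under these assumptions, the 75 × 128 × 320 cuboid shrinks to 4 × 8 × 20 after the fourth block, so the first fully connected layer of each branch maps 640 values per output channel to the 500-dimensional feature vector.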
The model is trained using stochastic gradient descent over 250 epochs with a learning rate of 0.001, a batch size of 10, a weight decay of 0.005, and a Nesterov momentum [19] of 0.5. The creation of negative samples and the input processing are the same as in the baseline [11].

2.3 Post Processing
Our model classifies 75 consecutive frames. In order to create stroke segments over the whole video, we classify every 75 frames of the videos, which amounts to applying a sliding window without overlap. If two consecutive segments are classified as stroke, the segments are fused to create only one stroke.

3 RESULTS AND ANALYSIS
The metrics for evaluating the detection performance are described in [12]. Our approach reached a mean Average Precision (mAP) of 0.00124 and a Global Intersection over Union (G-IoU) of 0.0700. It falls behind the baseline, which reaches 0.0173 and 0.144, respectively. Our other attempts, using early concatenation of the RGB and optical flow modalities (meaning an input of size 5 × 320 × 128 in a one-branch model) or training without shuffling the data, performed even worse.

Nevertheless, from a classification point of view, and according to Fig. 2, our model learned the stroke features and can reach reasonable results when stroke boundaries are known: 86.4% accuracy on the validation set after only 60 epochs. This may indicate that the main failure comes from the post-processing method.

Indeed, looking at the stroke distribution across the different sets (see Table 1), we may notice how low the inferred stroke rate is on the test set: 0.57 strokes per 1000 frames, whereas the stroke rate is 1.85 and 2.28 per 1000 frames in the training and validation sets. Furthermore, our post processing was not limited in terms of stroke duration, leading to everlasting strokes: 4500 frames, meaning the fusion of 60 consecutive video segments. These points indicate that our post-processing method can be improved.

A better separation of the strokes may be reached by defining the event using ball tracking and the ball motion [2]. This was our initial attempt, inspired by [21], but the available pretrained model considers a different point of view and was unable to adapt to the point of view of the TTStroke-21 videos.

4 CONCLUSION
The Sports Video Task, and more specifically the stroke detection subtask, has proven to be challenging. Even if our implementation has learned to classify strokes, we were not able to outperform the baseline performance. We have underlined the importance of the post-processing step through a stroke concentration and duration analysis. Furthermore, our failure to adapt a model pretrained on a similar dataset, but with a different acquisition point of view, stresses the difficulty deep trained models have in adapting to a change of scene, which is inherent to the fine-grained aspect of the classification subtask. As first-time participants, we chose to tackle only one subtask to ease our submission. However, we now believe that a method tackling both detection and classification may be the best for solving the Sports Video subtasks.

REFERENCES
[1] Lamberto Ballan, Marco Bertini, Alberto Del Bimbo, Lorenzo Seidenari, and Giuseppe Serra. 2011. Event detection and recognition for semantic annotation of video. Multimedia tools and applications 51, 1 (2011), 279–302.
[2] Jordan Calandre, Renaud Péteri, Laurent Mascarilla, and Benoit Tremblais. 2021. Table Tennis ball kinematic parameters estimation from non-intrusive single-view videos. In 2021 International Conference on Content-Based Multimedia Indexing (CBMI). IEEE, 1–6.
[3] Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6299–6308.
[4] Chandni J Dhamsania and Tushar V Ratanpara. 2016. A survey on human action recognition from videos. In 2016 online international conference on green engineering and technologies (IC-GET). IEEE, 1–5.
[5] Ross Girshick. 2015. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision. 1440–1448.
[6] Mike D Hughes and Roger M Bartlett. 2002. The use of performance indicators in performance analysis. Journal of sports sciences 20, 10 (2002), 739–754.
[7] Yasmin S Khan and Soudamini Pawar. 2015. Video summarization: survey on event detection and summarization in soccer videos. International Journal of Advanced Computer Science and Applications 6, 11 (2015), 256–259.
[8] Gregory Koch, Richard Zemel, Ruslan Salakhutdinov, and others. 2015. Siamese neural networks for one-shot image recognition. In ICML deep learning workshop, Vol. 2. Lille.
[9] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25 (2012), 1097–1105.
[10] Adrian Lees. 2003. Science and the major racket sports: a review. Journal of sports sciences 21, 9 (2003), 707–732.
[11] Pierre-Etienne Martin. 2021. Spatio-Temporal CNN baseline method for the Sports Video Task of MediaEval 2021 benchmark. In MediaEval (CEUR Workshop Proceedings). CEUR-WS.org.
[12] Pierre-Etienne Martin, Jenny Benois-Pineau, Boris Mansencal, Renaud Péteri, Laurent Mascarilla, Jordan Calandre, and Julien Morlier. 2021. Sports Video: Fine-Grained Action Detection and Classification of Table Tennis Strokes from videos for MediaEval 2021. In MediaEval (CEUR Workshop Proceedings). CEUR-WS.org.
[13] Pierre-Etienne Martin, Jenny Benois-Pineau, Renaud Péteri, and Julien Morlier. 2018. Sport Action Recognition with Siamese Spatio-Temporal CNNs: Application to Table Tennis. In CBMI. IEEE, 1–6.
[14] Pierre-Etienne Martin, Jenny Benois-Pineau, Renaud Péteri, and Julien Morlier. 2019. Optimal Choice of Motion Estimation Methods for Fine-Grained Action Classification with 3D Convolutional Networks. In 2019 IEEE International Conference on Image Processing, ICIP 2019, Taipei, Taiwan, September 22-25, 2019. IEEE, 554–558. https://doi.org/10.1109/ICIP.2019.8803780
[15] Pierre-Etienne Martin, Jenny Benois-Pineau, Renaud Péteri, and Julien Morlier. 2020. 3D attention mechanisms in Twin Spatio-Temporal Convolutional Neural Networks. Application to action classification in videos of table tennis games. In 25th International Conference on Pattern Recognition (ICPR2020) - MiCo Milano Congress Center, Italy, 10-15 January 2021.
[16] Pierre-Etienne Martin, Jenny Benois-Pineau, Renaud Péteri, and Julien Morlier. 2020. Fine grained sport action recognition with twin spatio-temporal convolutional neural networks. Multimedia Tools and Applications 79, 27 (2020), 20429–20447.
[17] Hnin Myint, Patrick Wong, Laurence Dooley, and Adrian Hopgood. 2015. Tracking a table tennis ball for umpiring purposes. In 2015 14th IAPR International Conference on Machine Vision Applications (MVA). IEEE, 170–173.
[18] Hnin Myint, Patrick Wong, Laurence Dooley, and Adrian Hopgood. 2016. Tracking a table tennis ball for umpiring purposes using a multi-agent system. (2016).
[19] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. 2013. On the importance of initialization and momentum in deep learning. In International conference on machine learning. PMLR, 1139–1147.
[20] Sho Tamaki and Hideo Saito. 2013. Reconstruction of 3d trajectories for performance analysis in table tennis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 1019–1026.
[21] Roman Voeikov, Nikolay Falaleev, and Ruslan Baikulov. 2020. TTNet: Real-time temporal and spatial video analysis of table tennis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 884–885.
[22] Heng Wang and Cordelia Schmid. 2013. Action recognition with improved trajectories. In Proceedings of the IEEE international conference on computer vision. 3551–3558.
[23] Philippe Weinzaepfel, Jerome Revaud, Zaid Harchaoui, and Cordelia Schmid. 2013. DeepFlow: Large Displacement Optical Flow with Deep Matching. In 2013 IEEE International Conference on Computer Vision. 1385–1392. https://doi.org/10.1109/ICCV.2013.175
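As a supplementary illustration of the post processing of Section 2.3, the sketch below fuses consecutive positive windows of 75 frames into stroke segments. It is a minimal reading of the description in the paper, not the submitted code. The function name `fuse_windows`, the boolean input list standing in for the classifier decisions, and the (start, end) output format with an exclusive end frame are all assumptions made for this example.

```python
# Hedged sketch of the non-overlapping window + fusion post processing.
# `is_stroke` stands in for the trained classifier's per-window decisions;
# the function name and the (start, end) output format are assumptions.

WINDOW = 75  # frames per classified cuboid, as stated in the paper


def fuse_windows(is_stroke, window=WINDOW):
    """Merge consecutive positive windows into (start, end) frame segments.

    is_stroke: list of booleans, one per non-overlapping window.
    Returns a list of segments in frames, with `end` exclusive.
    """
    segments = []
    start = None
    for i, positive in enumerate(is_stroke):
        if positive and start is None:
            start = i * window  # a new stroke segment begins
        elif not positive and start is not None:
            segments.append((start, i * window))  # close the running segment
            start = None
    if start is not None:  # the video ended while a stroke was running
        segments.append((start, len(is_stroke) * window))
    return segments


if __name__ == "__main__":
    decisions = [False, True, True, False, True]
    print(fuse_windows(decisions))
    # 60 consecutive positive windows fuse into a single 4500-frame segment.
    long_run = fuse_windows([True] * 60)
    print(long_run[0][1] - long_run[0][0])
```

Because nothing bounds the length of a fused segment, a long run of positive windows produces a single very long stroke, which is consistent with the 4500-frame maximum reported for the test set in Table 1.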