=Paper=
{{Paper
|id=Vol-3181/paper55
|storemode=property
|title=Two Stream Network for Stroke Detection in Table Tennis
|pdfUrl=https://ceur-ws.org/Vol-3181/paper55.pdf
|volume=Vol-3181
|authors=Anam Zahra,Pierre-Etienne Martin
|dblpUrl=https://dblp.org/rec/conf/mediaeval/ZahraM21
}}
==Two Stream Network for Stroke Detection in Table Tennis==
Anam Zahra, Pierre-Etienne Martin
CCP Department, Max Planck Institute for Evolutionary Anthropology, D-04103 Leipzig, Germany
anam_zahra@eva.mpg.de,pierre_etienne_martin@eva.mpg.de
Figure 1: Pipeline method for stroke detection from videos. Cuboids of RGB and optical flow are fed to the network and
classified as stroke or non-stroke. The feature dimension is described as follows: RGB channels × temporal × height × width.
Copyright 2021 for this paper by its authors. Use permitted under Creative Commons
License Attribution 4.0 International (CC BY 4.0).
MediaEval’21, December 13-15 2021, Online

ABSTRACT
This paper presents a method for detecting table tennis strokes in
videos. The method relies on a two-stream Convolutional Neural
Network that processes the RGB stream and its computed optical flow
in parallel. The method was developed as part of the MediaEval 2021
benchmark for the Sport task. Our contribution did not outperform the
provided baseline on the test set, but it performed best among the
other participants with regard to the mAP metric.

1 INTRODUCTION
With the advent of Convolutional Neural Networks (CNNs), especially
after the success of AlexNet [9], object detection, localization, and
classification from images and videos have greatly progressed
[3, 5, 9, 22]. The development of computer vision methods has
motivated broader applications in the academic world. Our team is
currently working on egocentric recordings from children in
kindergarten and at home. The analysis of these recordings shall give
us an automatic overview of their interactions on a daily basis. We
hope to link these interactions with their cognitive development and,
thereby, better understand early child development. With our
participation in the Sports Video Task [12], in the stroke detection
subtask, we hope to perfect our knowledge in event detection and
transpose it to our project.

The diversity of applications and visual data in sport makes sports
video analysis attractive for researchers. Automated sport event
detection and action classification, especially from low-resolution
videos, are helpful for monitoring and training purposes. For example,
in [6, 10], the authors automate the performance analysis for the
training optimization of players. Similarly, the Sports Video task of
the MediaEval 2021 benchmark aims at improving athlete performance and
training experience through the first steps of stroke detection and
classification from videos.

Event detection in videos is the first step toward many other hot
topics such as video summarization [7], automated semantic
segmentation [1], and action recognition [4, 16]. These methods may be
used to build summaries, select highlights, and assist players in
training sessions. One way to approach event detection in ball sports
is through ball detection and tracking. Several researchers have tried
to recover the 2D and 3D ball trajectories to this end [17, 18, 20].

Inspired by [8, 14, 15, 21], our method combines the optical flow and
features learned from the RGB stream in order to detect a stroke in
table tennis and assess its duration. This implementation is an
extension of the baseline code provided by the Sport Task organizers
[11].

2 APPROACH
Initially, we sought to use ball detection and tracking to perform
stroke detection. The first implementation used the pretrained model
TTNet [21]. However, the model failed to adapt to the acquisition
conditions of TTStroke-21 [13], on which the task is built, and no
fine-tuning was possible since no ball coordinates are available in
the provided annotations. Therefore, we decided to train a model from
scratch.
In this section, we first present the preparation of the videos
and then the model that processes the prepared data. Both steps
are depicted in Fig. 1. Post-processing is then performed to form
the final decision.
2.1 Data Preparation
In video content analysis, the motion of objects of interest between
frames helps to understand how they evolve in space. We therefore
use optical flow as a modality for stroke detection. Inspired by [14],
we use the DeepFlow method [23] to compute the optical flow from
consecutive frames. The optical flow is computed from frames
resized to 320 × 128. This size was initially chosen to keep the
ball at least two pixels wide, as previously done in [21].
Both the RGB and optical flow frames are stacked consecutively
in tensors of length 75. As in [11], stroke detection is tackled as a
classification problem with two classes: “Stroke” and “Non-stroke”.

Figure 2: Training Process
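
To make the tensor layout concrete, the stacking step can be sketched in plain Python. This is only an illustrative sketch, not the code used for the submission: the helper names `stack_cuboid` and `shape` are hypothetical, frames are assumed to be given as height × width × channels nested lists, and we assume that 320 × 128 denotes width × height. In practice an array library would likely replace the nested lists.

```python
from typing import List, Sequence, Tuple

# One frame as nested sequences: height x width x channels (assumed layout).
Frame = Sequence[Sequence[Sequence[float]]]

CUBOID_LENGTH = 75       # frames per cuboid, as stated in the text
FRAME_SIZE = (128, 320)  # (height, width); assumes "320 x 128" is width x height


def stack_cuboid(frames: Sequence[Frame], start: int,
                 length: int = CUBOID_LENGTH) -> List[List[List[List[float]]]]:
    """Stack `length` consecutive frames into a
    channels x time x height x width cuboid."""
    clip = frames[start:start + length]
    if len(clip) != length:
        raise ValueError("not enough frames for a full cuboid")
    height = len(clip[0])
    width = len(clip[0][0])
    channels = len(clip[0][0][0])
    return [[[[clip[t][y][x][c] for x in range(width)]
              for y in range(height)]
             for t in range(length)]
            for c in range(channels)]


def shape(nested) -> Tuple[int, ...]:
    """Return the shape of a regular nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


# Tiny demo: 10 frames of 2 x 4 pixels with 3 channels, each filled with
# its own frame index so that the temporal axis is easy to check.
demo_frames = [[[[float(t)] * 3 for _x in range(4)] for _y in range(2)]
               for t in range(10)]
demo_cuboid = stack_cuboid(demo_frames, start=2, length=5)
print(shape(demo_cuboid))  # (3, 5, 2, 4)
```

With full-size inputs, an RGB cuboid would thus have shape 3 × 75 × 128 × 320 and, assuming two flow components per pixel, an optical flow cuboid would have shape 2 × 75 × 128 × 320.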
2.2 Model
As shown in Figure 1, our two-stream model is composed of two
branches of the same length. Each branch is a succession of four
blocks, and each block is composed of a convolutional layer with
3 × 3 × 3 filters, followed by a ReLU activation function and a
2 × 2 × 2 pooling layer. The output of each branch is then flattened
and fed to a fully connected layer that outputs a feature vector of
length 500. Both feature vectors are then concatenated and fed into
a final fully connected layer of length two to predict the “Stroke”
and “Non-stroke” classes. One branch takes the RGB frames of the
video and the other the computed optical flow. The model is trained
using stochastic gradient descent over 250 epochs with a learning
rate of 0.001, a batch size of 10, a weight decay of 0.005, and
a Nesterov momentum [19] of 0.5. The negative sample creation
and input processing are the same as in the baseline [11].

2.3 Post Processing
Our model classifies 75 consecutive frames. In order to create stroke
segments over the whole video, we classify every 75 frames of the
videos, which amounts to applying a sliding window without overlap.
If two consecutive segments are classified as stroke, the segments
are fused to create only one stroke.

3 RESULTS AND ANALYSIS
The metrics for evaluating the detection performance are described
in [12]. Our approach reached a mean Average Precision (mAP)
of 0.00124 and a Global Intersection over Union (G-IoU) of 0.0700.
It falls behind the baseline, which reaches 0.0173 and 0.144,
respectively. Our other attempts, using early concatenation of the
RGB and optical flow modalities (meaning an input of size
5 × 320 × 128 in a one-branch model) or training without shuffling
the data, performed even worse.

Nevertheless, from a classification point of view, and according
to Fig. 2, our model learned the stroke features and achieves
reasonable results when stroke boundaries are known: 86.4%
accuracy on the validation set after only 60 epochs. This may
indicate that the main failure comes from the post-processing
method.

Table 1: Stroke concentration and duration (in frames) per set.
Set    # Strokes/1K frames   Mean            Min   Max
Train  1.85                  143.2 ± 36.16   52    296
Valid  2.28                  134.3 ± 26.13   72    292
Test   0.57                  361.0 ± 770.7   75    4500

Indeed, looking at the stroke distribution across the different
sets (see Table 1), we notice how low the inferred stroke rate is
on the test set: 0.57 strokes per 1000 frames, whereas the stroke rate
is 1.85 and 2.28 per 1000 frames in the training and validation sets.
Furthermore, our post-processing did not limit stroke duration,
leading to everlasting strokes: 4500 frames, meaning the fusion of
60 consecutive video segments. These points indicate that our
post-processing method can be improved.

A better separation of the strokes may be reached by defining
the event using ball tracking and the ball motion [2]. This was our
initial attempt, inspired by [21], but the available pretrained model
considers a different point of view and was unable to adapt to the
point of view of the TTStroke-21 videos.

4 CONCLUSION
The Sports Video Task, and more specifically the stroke detection
subtask, has proven to be challenging. Even if our implementation
has learned to classify strokes, we were not able to outperform
the baseline performance. We have underlined the importance of
the post-processing step through a stroke concentration and duration
analysis. Furthermore, our failure to adapt a model pretrained on
a similar dataset, but with a different acquisition point of view,
stresses the difficulty deep models have in adapting to a change
of scene, which is inherent to the fine-grained aspect of the
classification subtask. As first-time participants, we chose to
tackle only one task to ease our submission. However, we now
believe that a method tackling both detection and classification
may be the best way to solve the Sport Video subtasks.
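
As a complement to Section 2.2, the size of the features produced by one branch can be worked out with a few lines of Python. This is a sketch under assumptions that the text does not state: convolutions are assumed to be padded so that they preserve the temporal and spatial size, pooling is assumed to use a stride of 2 with flooring, the input is assumed to be a full 75 × 128 × 320 cuboid, and the number of filters in the last block (64 here) is a purely hypothetical value. The function names are illustrative as well.

```python
from typing import Tuple

Shape = Tuple[int, int, int, int]  # channels, time, height, width

NUM_BLOCKS = 4        # from the text: four conv + ReLU + pooling blocks per branch
FEATURE_LENGTH = 500  # from the text: output length of each branch


def branch_output_shape(input_shape: Shape, last_filters: int = 64) -> Shape:
    """Shape of the feature map after the four blocks of one branch.

    `last_filters` is a hypothetical filter count for the last block.
    """
    _, t, h, w = input_shape
    for _ in range(NUM_BLOCKS):
        # 3x3x3 convolution: assumed padded, so the size is unchanged.
        # 2x2x2 pooling: assumed stride 2, sizes halved with flooring.
        t, h, w = t // 2, h // 2, w // 2
    return (last_filters, t, h, w)


def flattened_length(shape: Shape) -> int:
    """Number of values fed to the fully connected layer of a branch."""
    c, t, h, w = shape
    return c * t * h * w


rgb_out = branch_output_shape((3, 75, 128, 320))   # RGB branch
flow_out = branch_output_shape((2, 75, 128, 320))  # optical flow branch
print(rgb_out)                    # (64, 4, 8, 20)
print(flattened_length(rgb_out))  # 40960
print(2 * FEATURE_LENGTH)         # 1000, length of the concatenated vector
```

Under these assumptions, both branches end with the same temporal and spatial size, each flattened vector is mapped to 500 values, and the final two-class layer receives their 1000-value concatenation.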
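
The fusion rule of Section 2.3 and the stroke statistics of Table 1 can be illustrated with the following sketch. It is not the code used for the submission: the function names are hypothetical, the per-window decisions are assumed to be available as a list of booleans, and the label sequence in the demo is made up to reproduce the 4500-frame case.

```python
from typing import List, Tuple

WINDOW = 75  # frames per classified segment, as stated in the text


def fuse_segments(labels: List[bool],
                  window: int = WINDOW) -> List[Tuple[int, int]]:
    """Merge consecutive stroke windows into (start, end) frame pairs.

    `labels[i]` is the decision for frames [i * window, (i + 1) * window).
    The returned end frame is exclusive.
    """
    strokes = []
    start = None
    for i, is_stroke in enumerate(labels):
        if is_stroke and start is None:
            start = i * window                  # a new stroke begins
        elif not is_stroke and start is not None:
            strokes.append((start, i * window))  # the current stroke ends
            start = None
    if start is not None:                        # stroke runs to the last window
        strokes.append((start, len(labels) * window))
    return strokes


def strokes_per_1k_frames(strokes: List[Tuple[int, int]],
                          total_frames: int) -> float:
    """Stroke concentration, as reported in Table 1."""
    return 1000.0 * len(strokes) / total_frames


# Made-up decisions: one 2-window stroke, then 60 consecutive stroke windows.
labels = [False, True, True, False] + [True] * 60 + [False]
strokes = fuse_segments(labels)
durations = [end - start for start, end in strokes]
print(strokes)    # [(75, 225), (300, 4800)]
print(durations)  # [150, 4500]
```

The second stroke shows how 60 consecutive positive windows are fused into a single 4500-frame detection when no limit is placed on the stroke duration.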
REFERENCES
[1] Lamberto Ballan, Marco Bertini, Alberto Del Bimbo, Lorenzo Seidenari, and Giuseppe Serra. 2011. Event detection and recognition for semantic annotation of video. Multimedia tools and applications 51, 1 (2011), 279–302.
[2] Jordan Calandre, Renaud Péteri, Laurent Mascarilla, and Benoit Tremblais. 2021. Table Tennis ball kinematic parameters estimation from non-intrusive single-view videos. In 2021 International Conference on Content-Based Multimedia Indexing (CBMI). IEEE, 1–6.
[3] Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6299–6308.
[4] Chandni J Dhamsania and Tushar V Ratanpara. 2016. A survey on human action recognition from videos. In 2016 online international conference on green engineering and technologies (IC-GET). IEEE, 1–5.
[5] Ross Girshick. 2015. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision. 1440–1448.
[6] Mike D Hughes and Roger M Bartlett. 2002. The use of performance indicators in performance analysis. Journal of sports sciences 20, 10 (2002), 739–754.
[7] Yasmin S Khan and Soudamini Pawar. 2015. Video summarization: survey on event detection and summarization in soccer videos. International Journal of Advanced Computer Science and Applications 6, 11 (2015), 256–259.
[8] Gregory Koch, Richard Zemel, Ruslan Salakhutdinov, and others. 2015. Siamese neural networks for one-shot image recognition. In ICML deep learning workshop, Vol. 2. Lille.
[9] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25 (2012), 1097–1105.
[10] Adrian Lees. 2003. Science and the major racket sports: a review. Journal of sports sciences 21, 9 (2003), 707–732.
[11] Pierre-Etienne Martin. 2021. Spatio-Temporal CNN baseline method for the Sports Video Task of MediaEval 2021 benchmark. In MediaEval (CEUR Workshop Proceedings). CEUR-WS.org.
[12] Pierre-Etienne Martin, Jenny Benois-Pineau, Boris Mansencal, Renaud Péteri, Laurent Mascarilla, Jordan Calandre, and Julien Morlier. 2021. Sports Video: Fine-Grained Action Detection and Classification of Table Tennis Strokes from videos for MediaEval 2021. In MediaEval (CEUR Workshop Proceedings). CEUR-WS.org.
[13] Pierre-Etienne Martin, Jenny Benois-Pineau, Renaud Péteri, and Julien Morlier. 2018. Sport Action Recognition with Siamese Spatio-Temporal CNNs: Application to Table Tennis. In CBMI. IEEE, 1–6.
[14] Pierre-Etienne Martin, Jenny Benois-Pineau, Renaud Péteri, and Julien Morlier. 2019. Optimal Choice of Motion Estimation Methods for Fine-Grained Action Classification with 3D Convolutional Networks. In 2019 IEEE International Conference on Image Processing, ICIP 2019, Taipei, Taiwan, September 22-25, 2019. IEEE, 554–558. https://doi.org/10.1109/ICIP.2019.8803780
[15] Pierre-Etienne Martin, Jenny Benois-Pineau, Renaud Péteri, and Julien Morlier. 2020. 3D attention mechanisms in Twin Spatio-Temporal Convolutional Neural Networks. Application to action classification in videos of table tennis games. In 25th International Conference on Pattern Recognition (ICPR2020) - MiCo Milano Congress Center, Italy, 10-15 January 2021.
[16] Pierre-Etienne Martin, Jenny Benois-Pineau, Renaud Péteri, and Julien Morlier. 2020. Fine grained sport action recognition with twin spatio-temporal convolutional neural networks. Multimedia Tools and Applications 79, 27 (2020), 20429–20447.
[17] Hnin Myint, Patrick Wong, Laurence Dooley, and Adrian Hopgood. 2015. Tracking a table tennis ball for umpiring purposes. In 2015 14th IAPR International Conference on Machine Vision Applications (MVA). IEEE, 170–173.
[18] Hnin Myint, Patrick Wong, Laurence Dooley, and Adrian Hopgood. 2016. Tracking a table tennis ball for umpiring purposes using a multi-agent system. (2016).
[19] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. 2013. On the importance of initialization and momentum in deep learning. In International conference on machine learning. PMLR, 1139–1147.
[20] Sho Tamaki and Hideo Saito. 2013. Reconstruction of 3d trajectories for performance analysis in table tennis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 1019–1026.
[21] Roman Voeikov, Nikolay Falaleev, and Ruslan Baikulov. 2020. TTNet: Real-time temporal and spatial video analysis of table tennis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 884–885.
[22] Heng Wang and Cordelia Schmid. 2013. Action recognition with improved trajectories. In Proceedings of the IEEE international conference on computer vision. 3551–3558.
[23] Philippe Weinzaepfel, Jerome Revaud, Zaid Harchaoui, and Cordelia Schmid. 2013. DeepFlow: Large Displacement Optical Flow with Deep Matching. In 2013 IEEE International Conference on Computer Vision. 1385–1392. https://doi.org/10.1109/ICCV.2013.175