=Paper=
{{Paper
|id=Vol-2882/MediaEval_20_paper_2
|storemode=property
|title=Sports Video Classification: Classification of Strokes in Table Tennis for MediaEval 2020
|pdfUrl=https://ceur-ws.org/Vol-2882/paper2.pdf
|volume=Vol-2882
|authors=Pierre-Etienne Martin,Jenny Benois-Pineau,Boris Mansencal,Renaud Péteri,Laurent Mascarilla,Jordan Calandre,Julien Morlier
|dblpUrl=https://dblp.org/rec/conf/mediaeval/MartinBMPMCM20
}}
==Sports Video Classification: Classification of Strokes in Table Tennis for MediaEval 2020==
Sports Video Classification: Classification of Strokes in Table Tennis for MediaEval 2020

Pierre-Etienne Martin¹, Jenny Benois-Pineau¹, Boris Mansencal¹, Renaud Péteri², Laurent Mascarilla², Jordan Calandre², Julien Morlier³

¹ Univ. Bordeaux, CNRS, Bordeaux INP, LaBRI, Talence, France
² MIA, La Rochelle University, La Rochelle, France
³ IMS, University of Bordeaux, Talence, France
mediaeval.sport.task@diff.u-bordeaux.fr
ABSTRACT
Fine-grained action classification raises new challenges compared to the classical action classification problem. Sport video analysis is a very popular research topic, due to the variety of application areas, ranging from multimedia intelligent devices with user-tailored digests up to the analysis of athletes' performances. Running since 2019 as part of MediaEval, we offer a task which consists in classifying table tennis strokes from videos recorded in natural conditions at the University of Bordeaux. The aim is to build tools for teachers, coaches and players to analyse table tennis games. Such tools could lead to an automatic profiling of the player, and the training sessions could then be adapted to improve sports skills more efficiently.

1 INTRODUCTION
Action detection and classification is one of the main challenges in visual content analysis and mining [26]. Over the last few years, the number of datasets for action classification has drastically increased in terms of video content, resolution, localization and number of classes. However, the latest research shows that classification performed using deep neural networks often focuses on the whole scene and the background rather than on the action itself.

Sport video analysis has been a very popular research topic, due to the variety of application areas, ranging from multimedia intelligent devices with user-tailored digests up to the analysis of athletes' performance [5]. The Sport Video Classification project was initiated by the Faculty of Sports (STAPS) and the computer science laboratory (LaBRI) of the University of Bordeaux, and by the MIA laboratory of La Rochelle University¹. The goal of this project is to develop artificial intelligence and multimedia indexing methods for the recognition of table tennis activities. The ultimate goal is to evaluate the performance of athletes, with a particular focus on students, in order to develop optimal training strategies. To that aim, a video corpus named TTStroke-21 was recorded with volunteer players. These data are of great scientific interest for the multimedia community participating in the MediaEval campaign.

Several datasets, such as UCF-101 [24], HMDB [10] and AVA [7], have been used for many years as benchmarks for action classification methods. In [15], spatio-temporal dependencies are learned from the video using only RGB images for classification. This method is promising, but its scores are still below multi-modal methods such as I3D [4]. More recently, datasets have been enriched, like JHMDB [8] and Kinetics [2, 3, 9], or fused, like AVA-Kinetics [12]. Some also focus on intra-class dissimilarity, such as the Something-Something dataset. Others, such as the Olympic Sports dataset [22], focus on sport actions only. However, those datasets are not dedicated to a specific sport and its associated rules. Few datasets focus on fine-grained classification. We can cite FineGym [23], introduced recently, which focuses on gymnastics videos, and our TTStroke-21 [21], comprising table tennis strokes.

TTStroke-21 is manually annotated by professional table tennis players or teachers, which makes the annotation process more time-consuming but temporally and qualitatively more accurate. Classification methods such as the I3D model [4] or the LTC model [28], which perform well on the UCF-101 dataset, inspired the work done in [18, 21], which introduces the TSTCNN, a Twin Spatio-Temporal Convolutional Neural Network. Here, the video stream and the optical flow computed from it are passed through the two branches of the TSTCNN. In [19], the normalization of the flow is also investigated to improve the classification score, while in [20] an attention block is introduced to improve performance and speed of convergence. The similarity between the different stroke classes in TTStroke-21 makes the classification task challenging, and the multi-modal method seemed to improve performance. To better understand the learned features and the classification process taking place in the TSTCNN, we also developed a new visualization technique [6].

Recent work focusing on table tennis [30] tries to infer the tactics of the players from their performance during matches using a Markov chain model. In [14, 27, 32], stroke recognition is performed using sensors. In [29], segmentation of the player, ball coordinates and event detection are explored, while [13, 31] focus solely on the trajectory of the ball.

In this task overview paper, we introduce the specific conditions of usage of the data in section 2, then describe TTStroke-21 and the task in sections 3 and 4 respectively. The evaluation method is explained in section 5. Supplementary notes are shared in section 6. More information can be found on the dedicated GitHub web page².

¹ This work was supported by the New Aquitania Region through the CRISP project - ComputeR vIsion for Sport Performance - and the MIRES federation.
Copyright 2020 for this paper by its authors. Use permitted under Creative Commons
License Attribution 4.0 International (CC BY 4.0).
MediaEval’20, 14-15 December 2020, Online
² https://multimediaeval.github.io/2020-Sports-Video-Classification-Task/
[Figure 1: TTStroke-21 acquisition process. a. Video acquisition; b. Annotation platform]

2 SPECIFIC CONDITIONS OF USAGE
TTStroke-21 consists of videos of players playing table tennis in natural conditions. Even though we use an automatic tool for blurring the players' faces, some faces are missed on a few frames, and some players therefore remain identifiable. In order to respect the personal data and privacy of the players, this dataset is subject to a usage agreement, referred to as the Special Conditions. These Special Conditions apply to the use of the videos, referred to as Images, generated in the framework of the program Sports video classification: classification of strokes in table tennis, for the implementation of the MediaEval program. They correspond to the specific usage agreement referred to in the Usage agreement for the MediaEval 2020 Research Collections, signed between the User and the University of Delft. The full and complete acceptance, without any reservation, of these Special Conditions is a mandatory prerequisite for the provision of the Images as part of the MediaEval 2020 evaluation campaign. A complete reading of these conditions is necessary; they require the user, for example, to obscure the faces (blurring, black banner, etc.) in the videos before use in any publication, and to destroy the data by October 1st, 2021.

3 DATASET DESCRIPTION
In the MediaEval 2020 campaign, we release the same subset of the TTStroke-21 dataset as last year. The only differences are the blurring of the faces and the specification of whether the player is right-handed or left-handed. The dataset was recorded in a sports faculty facility using lightweight equipment, such as GoPro cameras. It consists of player-centred videos recorded in natural conditions without markers or sensors, see Fig. 1. It comprises 20 table tennis stroke classes, i.e. 8 services: Serve Forehand Backspin, Serve Forehand Loop, Serve Forehand Sidespin, Serve Forehand Topspin, Serve Backhand Backspin, Serve Backhand Loop, Serve Backhand Sidespin, Serve Backhand Topspin; 6 offensive strokes: Offensive Forehand Hit, Offensive Forehand Loop, Offensive Forehand Flip, Offensive Backhand Hit, Offensive Backhand Loop, Offensive Backhand Flip; and 6 defensive strokes: Defensive Forehand Push, Defensive Forehand Block, Defensive Forehand Backspin, Defensive Backhand Push, Defensive Backhand Block, Defensive Backhand Backspin. In addition, all the strokes can be divided into two super-classes: Forehand and Backhand. This taxonomy was designed with professional table tennis teachers.

All videos are recorded in MPEG-4 format. Unlike the task at MediaEval 2019 [16], most of the faces are blurred. To do so, faces are detected with the OpenCV deep learning face detector, based on the Single Shot Detector (SSD) framework with a ResNet base network, for each frame of the original video. The detected faces are blurred and the frames are re-encoded into a video.

The organisation of the delivered data is as follows:
• The provided dataset is split into two subsets: i) a training set and ii) a test set;
• Each directory contains several videos (in MPEG-4 format), and each video may contain several actions;
• Each video file is provided with an XML file describing the actions present in the video and whether the player is right-handed or left-handed;
• Each action has 3 attributes: the starting frame, the ending frame, and the stroke class;
• In the training set XML files, all the attributes are specified. In the test set XML files, only the starting and ending frames are specified. The stroke class attribute is purposely set to the value "Unknown" and should be updated by the participants to one of the 20 valid classes.

4 TASK DESCRIPTION
The Sport Video Annotation task consists, for each action of each test video, in assigning a label using a given taxonomy of 20 classes of table tennis strokes.

Participants may submit up to five runs. For each run, they must provide one XML file per video file, containing the actions associated with the recognised stroke class. Runs may be submitted as an archive (zip or tar.gz file) with each run in a different directory. Participants should also indicate whether any external data, such as other datasets or pretrained networks, was used to compute their runs. The task is considered fully automatic: once the videos are provided to the system, results should be produced without any human intervention.

5 EVALUATION
For MediaEval 2020, we propose a lightweight classification task. It consists in the classification of table tennis strokes whose temporal borders are supplied in the XML files accompanying each video file. Hence, for each test video, the participants are invited to produce an XML file in which each stroke is labelled according to the given taxonomy. This means that the default label "Unknown" has to be replaced by the label of the stroke class that the participant's system has assigned. All submissions will be evaluated in terms of per-class accuracy (A_i) and global accuracy (GA).

The organizers will also provide the participants with different confusion matrices: one considering all the classes, and others considering the type of the stroke (such as offensive or defensive) and/or the forehand and backhand super-classes of the strokes.

6 DISCUSSION
The participants from previous years reached maximum accuracies of 22.9% [17], 14.1% [1] and 11.3% [25], leaving room for improvement. Participants are welcome to share their difficulties and their results even if these do not seem sufficiently good.
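The evaluation measures are not spelled out as formulas in this overview. Under the usual reading, per-class accuracy A_i is the fraction of class-i strokes labelled correctly, and global accuracy GA is the fraction of all strokes labelled correctly. A minimal sketch (the function and variable names are ours, not from the task's evaluation code):

```python
from collections import Counter


def accuracies(y_true, y_pred):
    """Global accuracy GA and per-class accuracies A_i for paired label lists."""
    assert len(y_true) == len(y_pred) and y_true
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct per class
    totals = Counter(y_true)                                     # strokes per class
    ga = sum(hits.values()) / len(y_true)
    per_class = {c: hits[c] / n for c, n in totals.items()}
    return ga, per_class
```

Note that GA weights every stroke equally, so frequent classes dominate it, whereas the mean of the A_i values treats all 20 classes equally.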
REFERENCES
[1] Jordan Calandre, Renaud Péteri, and Laurent Mascarilla. 2019. Optical Flow Singularities for Sports Video Annotation: Detection of Strokes in Table Tennis, See [11].
[2] João Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, and Andrew Zisserman. 2018. A Short Note about Kinetics-600. CoRR abs/1808.01340 (2018).
[3] João Carreira, Eric Noland, Chloe Hillier, and Andrew Zisserman. 2019. A Short Note on the Kinetics-700 Human Action Dataset. CoRR abs/1907.06987 (2019).
[4] João Carreira and Andrew Zisserman. 2017. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. (2017), 4724–4733.
[5] Moritz Einfalt, Dan Zecha, and Rainer Lienhart. 2018. Activity-Conditioned Continuous Human Pose Estimation for Performance Analysis of Athletes Using the Example of Swimming. In IEEE WACV 2018, Lake Tahoe, NV, USA, March 12-15, 2018. 446–455.
[6] Kazi Ahmed Asif Fuad, Pierre-Etienne Martin, Romain Giot, Romain Bourqui, Jenny Benois-Pineau, and Akka Zemmari. 2020. Feature Understanding in 3D CNNs for Actions Recognition in Video. In Tenth International Conference on Image Processing Theory, Tools and Applications, IPTA 2020, Paris, France, November 9-12, 2020. 1–6.
[7] Chunhui Gu, Chen Sun, David A. Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, and Jitendra Malik. 2018. AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions. (2018), 6047–6056.
[8] Hueihan Jhuang, Juergen Gall, Silvia Zuffi, Cordelia Schmid, and Michael J. Black. 2013. Towards Understanding Action Recognition. In IEEE ICCV 2013, Sydney, Australia, December 1-8, 2013. IEEE Computer Society, 3192–3199.
[9] Will Kay, João Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. 2017. The Kinetics Human Action Video Dataset. CoRR abs/1705.06950 (2017).
[10] Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso A. Poggio, and Thomas Serre. 2011. HMDB: A large video database for human motion recognition. In IEEE ICCV 2011, Barcelona, Spain, November 6-13, 2011, Dimitris N. Metaxas, Long Quan, Alberto Sanfeliu, and Luc Van Gool (Eds.). IEEE Computer Society, 2556–2563.
[11] Martha A. Larson, Steven Alexander Hicks, Mihai Gabriel Constantin, Benjamin Bischke, Alastair Porter, Peijian Zhao, Mathias Lux, Laura Cabrera Quiros, Jordan Calandre, and Gareth Jones (Eds.). 2020. Working Notes Proceedings of the MediaEval 2019 Workshop, Sophia Antipolis, France, 27-30 October 2019. CEUR Workshop Proceedings, Vol. 2670. CEUR-WS.org.
[12] Ang Li, Meghana Thotakuri, David A. Ross, João Carreira, Alexander Vostrikov, and Andrew Zisserman. 2020. The AVA-Kinetics Localized Human Actions Video Dataset. CoRR abs/2005.00214 (2020).
[13] Hsien-I Lin, Zhangguo Yu, and Yi-Chen Huang. 2020. Ball Tracking and Trajectory Prediction for Table-Tennis Robots. Sensors 20, 2 (2020).
[14] Ruichen Liu, Zhelong Wang, Xin Shi, Hongyu Zhao, Sen Qiu, Jie Li, and Ning Yang. 2019. Table Tennis Stroke Recognition Based on Body Sensor Network. In IDCS 2019, Naples, Italy, October 10-12, 2019, Proceedings (Lecture Notes in Computer Science), Raffaele Montella, Angelo Ciaramella, Giancarlo Fortino, Antonio Guerrieri, and Antonio Liotta (Eds.), Vol. 11874. Springer, 1–10.
[15] Zheng Liu and Haifeng Hu. 2019. Spatiotemporal Relation Networks for Video Action Recognition. IEEE Access 7 (2019), 14969–14976.
[16] Pierre-Etienne Martin, Jenny Benois-Pineau, Boris Mansencal, Renaud Péteri, Laurent Mascarilla, Jordan Calandre, and Julien Morlier. 2019. Sports Video Annotation: Detection of Strokes in Table Tennis Task for MediaEval 2019, See [11].
[17] Pierre-Etienne Martin, Jenny Benois-Pineau, Boris Mansencal, Renaud Péteri, and Julien Morlier. 2019. Siamese Spatio-Temporal Convolutional Neural Network for Stroke Classification in Table Tennis Games, See [11].
[18] Pierre-Etienne Martin, Jenny Benois-Pineau, Renaud Péteri, and Julien Morlier. 2018. Sport Action Recognition with Siamese Spatio-Temporal CNNs: Application to Table Tennis. In CBMI 2018, La Rochelle, France, September 4-6, 2018. IEEE, 1–6.
[19] Pierre-Etienne Martin, Jenny Benois-Pineau, Renaud Péteri, and Julien Morlier. 2019. Optimal Choice of Motion Estimation Methods for Fine-Grained Action Classification with 3D Convolutional Networks. In IEEE ICIP 2019, Taipei, Taiwan, September 22-25, 2019. IEEE, 554–558.
[20] Pierre-Etienne Martin, Jenny Benois-Pineau, Renaud Péteri, and Julien Morlier. 2020. 3D attention mechanisms in Twin Spatio-Temporal Convolutional Neural Networks. Application to action classification in videos of table tennis games. In ICPR 2020, MiCo Milano Congress Center, Italy, 10-15 January 2021.
[21] Pierre-Etienne Martin, Jenny Benois-Pineau, Renaud Péteri, and Julien Morlier. 2020. Fine grained sport action recognition with Twin spatio-temporal convolutional neural networks. Multim. Tools Appl. 79, 27-28 (2020), 20429–20447.
[22] Juan Carlos Niebles, Chih-Wei Chen, and Fei-Fei Li. 2010. Modeling Temporal Structure of Decomposable Motion Segments for Activity Classification. In Computer Vision - ECCV 2010, Heraklion, Crete, Greece, September 5-11, 2010, Proceedings, Part II (Lecture Notes in Computer Science), Kostas Daniilidis, Petros Maragos, and Nikos Paragios (Eds.), Vol. 6312. Springer, 392–405.
[23] Dian Shao, Yue Zhao, Bo Dai, and Dahua Lin. 2020. FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding. CoRR abs/2004.06704 (2020).
[24] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild. CoRR abs/1212.0402 (2012).
[25] Siddharth Sriraman, Srinath Srinivasan, Vishnu K. Krishnan, Bhuvana J, and T. T. Mirnalinee. 2019. MediaEval 2019: LRCNs for Stroke Detection in Table Tennis, See [11].
[26] Andrei Stoian, Marin Ferecatu, Jenny Benois-Pineau, and Michel Crucianu. 2016. Fast Action Localization in Large-Scale Video Archives. IEEE Trans. Circuits Syst. Video Techn. 26, 10 (2016), 1917–1930.
[27] S. S. Tabrizi, S. Pashazadeh, and V. Javani. 2020. Comparative Study of Table Tennis Forehand Strokes Classification Using Deep Learning and SVM. IEEE Sensors Journal (2020), 1–1.
[28] Gül Varol, Ivan Laptev, and Cordelia Schmid. 2018. Long-Term Temporal Convolutions for Action Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 40, 6 (2018), 1510–1517.
[29] Roman Voeikov, Nikolay Falaleev, and Ruslan Baikulov. 2020. TTNet: Real-time temporal and spatial video analysis of table tennis. CoRR abs/2004.09927 (2020).
[30] Jiachen Wang, Kejian Zhao, Dazhen Deng, Anqi Cao, Xiao Xie, Zheng Zhou, Hui Zhang, and Yingcai Wu. 2020. Tac-Simur: Tactic-based Simulative Visual Analytics of Table Tennis. IEEE Trans. Vis. Comput. Graph. 26, 1 (2020), 407–417.
[31] Erwin Wu and Hideki Koike. 2020. FuturePong: Real-time Table Tennis Trajectory Forecasting using Pose Prediction Network. In CHI 2020, Honolulu, HI, USA, Regina Bernhaupt, Florian 'Floyd' Mueller, David Verweij, Josh Andres, Joanna McGrenere, Andy Cockburn, Ignacio Avellino, Alix Goguey, Pernille Bjørn, Shengdong Zhao, Briane Paul Samson, and Rafal Kocielnik (Eds.). ACM, 1–8.
[32] Kun Xia, Hanyu Wang, Menghan Xu, Zheng Li, Sheng He, and Yusong Tang. 2020. Racquet Sports Recognition Using a Hybrid Clustering Model Learned from Integrated Wearable Sensor. Sensors 20, 6 (2020), 1638.