Sports Video Classification: Classification of Strokes in Table Tennis for MediaEval 2020

Pierre-Etienne Martin, Univ. Bordeaux, CNRS, Bordeaux INP, LaBRI, Talence, France

Jenny Benois-Pineau, Univ. Bordeaux, CNRS, Bordeaux INP, LaBRI, Talence, France

Boris Mansencal, Univ. Bordeaux, CNRS, Bordeaux INP, LaBRI, Talence, France

Renaud Péteri, MIA, La Rochelle University, La Rochelle, France

Laurent Mascarilla, MIA, La Rochelle University, La Rochelle, France

Jordan Calandre, MIA, La Rochelle University, La Rochelle, France

Julien Morlier, IMS, University of Bordeaux, Talence, France


Fine-grained action classification raises new challenges compared with the classical action classification problem. Sports video analysis is a very popular research topic, due to the variety of application areas, ranging from multimedia intelligent devices with user-tailored digests to the analysis of athletes' performance. Running since 2019 as part of MediaEval, this task consists in classifying table tennis strokes from videos recorded in natural conditions at the University of Bordeaux. The aim is to build tools for teachers, coaches and players to analyse table tennis games. Such tools could lead to automatic profiling of the player, and training sessions could then be adapted to improve sports skills more efficiently.

INTRODUCTION

Action detection and classification is one of the main challenges in visual content analysis and mining [26]. Over the last few years, the number of datasets for action classification has drastically increased in terms of video content, resolution, localization and number of classes. However, recent research shows that classification performed using deep neural networks often focuses on the whole scene and the background rather than on the action itself.

Sports video analysis has been a very popular research topic, due to the variety of application areas, ranging from multimedia intelligent devices with user-tailored digests to the analysis of athletes' performance [5]. The Sport Video Classification project was initiated by the Faculty of Sports (STAPS) and the computer science laboratory (LaBRI) of the University of Bordeaux, and by the MIA laboratory of La Rochelle University. The goal of this project is to develop artificial intelligence and multimedia indexing methods for the recognition of table tennis activities. The ultimate goal is to evaluate the performance of athletes, with a particular focus on students, in order to develop optimal training strategies. To that end, a video corpus named TTStroke-21 was recorded with volunteer players. These data are of great scientific interest for the multimedia community participating in the MediaEval campaign.

Several datasets such as UCF-101 [24], HMDB [10] and AVA [7] have been used for many years as benchmarks for action classification methods. In [15], spatio-temporal dependencies are learned from the video using only RGB images for classification. This method is promising, but its scores are still below those of multi-modal methods such as I3D [4]. More recently, datasets have been enriched, like J-HMDB [8] and Kinetics [2,3,9], or fused, like AVA-Kinetics [12]. Some, such as the Something-Something dataset, also focus on intra-class dissimilarity. Others, such as the Olympic Sports dataset [22], focus on sport actions only. However, those datasets are not dedicated to a specific sport and its associated rules. Few datasets focus on fine-grained classification. We can cite FineGym [23], introduced recently, which focuses on gymnastics videos, and our TTStroke-21 [21], comprising table tennis strokes.

TTStroke-21 is manually annotated by professional players or teachers of table tennis, which makes the annotation process more time-consuming but more accurate, both temporally and qualitatively. Classification methods such as the I3D model [4] or the LTC model [28], which perform well on the UCF-101 dataset, inspired the work done in [18,21], which introduces the Twin Spatio-Temporal Convolutional Neural Network (TSTCNN). Here, the video stream and the optical flow computed from it are passed through the two branches of the TSTCNN. In [19], the normalization of the flow is also investigated to improve the classification score, while in [20] an attention block is introduced to improve the performance and the speed of convergence. The inter-similarity of actions (strokes) in TTStroke-21 makes the classification task challenging, and the multi-modal method seemed to improve performance. To better understand the learned features and the classification process taking place in the TSTCNN, we also developed a new visualization technique [6].

Recent work focusing on table tennis [30] tries to infer the tactics of the players from their performance during matches using a Markov chain model. In [14,27,32], stroke recognition is performed using sensors. In [29], segmentation of the player, ball coordinates and event detection are explored, while [13,31] focus solely on the trajectory of the ball.

In this task overview paper, in section 2, we introduce the specific conditions of usage of this data, then describe TTStroke-21 and the task respectively in sections 3 and 4. The evaluation method is explained in section 5. Supplementary notes are shared in section 6. More information can be found on the dedicated GitHub web page.


DATASET DESCRIPTION

In the MediaEval 2020 campaign, we release the same subset of the TTStroke-21 dataset as last year. The only differences are the blurring of the faces and the specification of whether the player is right-handed or left-handed. The dataset has been recorded in a sport faculty facility using lightweight equipment, such as GoPro cameras. Unlike in last year's release [16], most of the faces are blurred. To do so, faces are detected in each frame of the original video with the OpenCV deep learning face detector, based on the Single Shot Detector (SSD) framework with a ResNet base network. Each detected face is blurred and the frames are re-encoded into a video.
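The blurring step described above can be illustrated with a short sketch. It is not the organizers' pipeline: it assumes that face detections are already available as an array of shape (1, 1, N, 7), the layout commonly returned by SSD-style detectors (confidence at index 2, normalized box corners at indices 3 to 6), and it uses a plain NumPy mean filter in place of whatever blur was actually applied. The names `box_blur`, `blur_faces`, `min_conf` and `k` are hypothetical.

```python
import numpy as np


def box_blur(region, k=15):
    """Blur an H x W x C region with a k x k mean filter (edge-padded)."""
    pad = k // 2
    padded = np.pad(region.astype(np.float64),
                    ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    # Summed-area table with a leading row/column of zeros.
    integral = np.pad(padded.cumsum(axis=0).cumsum(axis=1),
                      ((1, 0), (1, 0), (0, 0)))
    h, w = region.shape[:2]
    total = (integral[k:k + h, k:k + w] - integral[:h, k:k + w]
             - integral[k:k + h, :w] + integral[:h, :w])
    return np.rint(total / (k * k)).astype(region.dtype)


def blur_faces(frame, detections, min_conf=0.5, k=15):
    """Return a copy of `frame` with every confident detection blurred.

    `detections` is ASSUMED to have shape (1, 1, N, 7) with the confidence
    at index 2 and normalized (x1, y1, x2, y2) at indices 3..6.
    """
    out = frame.copy()
    h, w = frame.shape[:2]
    for det in detections.reshape(-1, 7):
        if det[2] < min_conf:
            continue  # skip low-confidence detections
        x1, y1, x2, y2 = (det[3:7] * np.array([w, h, w, h])).astype(int)
        x1, y1 = max(x1, 0), max(y1, 0)
        x2, y2 = min(x2, w), min(y2, h)
        if x2 > x1 and y2 > y1:
            out[y1:y2, x1:x2] = box_blur(out[y1:y2, x1:x2], k)
    return out


# Synthetic example: a dotted frame and two made-up detections.
frame = np.zeros((40, 40, 3), dtype=np.uint8)
frame[::2, ::2] = 255
detections = np.array([[[[0, 1, 0.9, 0.25, 0.25, 0.75, 0.75],
                         [0, 1, 0.1, 0.00, 0.00, 0.20, 0.20]]]])
blurred = blur_faces(frame, detections)
```

Only the first detection exceeds the threshold, so only the central square of the frame is modified.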

The organisation of the delivered data is as follows:

• The provided dataset is split into two subsets: i) training set and ii) test set;

• In each directory, there are several videos (in MPEG-4 format) and each video may contain several actions;

• Each video file is provided with an XML file describing the actions present in the video and whether the player is right-handed or left-handed;

• Each action has 3 attributes: the starting frame, the ending frame, and the stroke class;

• In the training set XML files, all the attributes are specified;

• In the test set XML files, only the starting and ending frames are specified. The stroke class attribute is purposely set to the value "Unknown", and should be updated by the participants to one of the 20 valid classes.
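As an illustration of how a test annotation file might be updated, here is a minimal Python sketch using the standard library. The XML schema shown (a `video` root with `action` elements carrying `begin`, `end` and `move` attributes) is an assumption made for the example and may differ from the files actually distributed; the class names and the `dummy_classifier` function are placeholders for a participant's own taxonomy labels and model.

```python
import xml.etree.ElementTree as ET

# HYPOTHETICAL annotation file: element and attribute names are assumptions.
SAMPLE = """<video name="example.mp4" player="right-handed">
  <action begin="120" end="210" move="Unknown" />
  <action begin="400" end="480" move="Unknown" />
</video>"""


def label_actions(xml_text, classify):
    """Replace the stroke class of every action with classify(begin, end)."""
    root = ET.fromstring(xml_text)
    for action in root.iter("action"):
        begin = int(action.get("begin"))
        end = int(action.get("end"))
        action.set("move", classify(begin, end))
    return ET.tostring(root, encoding="unicode")


def dummy_classifier(begin, end):
    """Placeholder for a real model: decides on the segment length only."""
    return "Hypothetical Class A" if end - begin < 85 else "Hypothetical Class B"


labelled = label_actions(SAMPLE, dummy_classifier)
moves = [a.get("move") for a in ET.fromstring(labelled).iter("action")]
```

The starting and ending frames are left untouched; only the stroke class attribute changes.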

TASK DESCRIPTION

The Sport Video Annotation task consists, for each action of each test video, in assigning a label using a given taxonomy of 20 classes of table tennis strokes. Participants may submit up to five runs. For each run, they must provide one XML file per video file, in which each action is associated with the recognised stroke class. Runs may be submitted as an archive (zip or tar.gz file) with each run in a different directory. Participants should also indicate whether any external data, such as other datasets or pretrained networks, were used to compute their runs. The task is considered fully automatic: once the videos are provided to the system, results should be produced without any human intervention.
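The submission layout described above (one directory per run, one XML file per video) could be packaged along the following lines. This is a sketch only: the run names, file names and the `package_runs` helper are hypothetical, and the exact naming expected by the organizers is not specified here.

```python
import tempfile
import zipfile
from pathlib import Path


def package_runs(runs, archive_path):
    """Write {run_name: {xml_filename: xml_text}} into a zip archive,
    with each run stored in its own directory."""
    if not 1 <= len(runs) <= 5:
        raise ValueError("between one and five runs may be submitted")
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for run_name, files in runs.items():
            for filename, xml_text in files.items():
                zf.writestr(f"{run_name}/{filename}", xml_text)
    return archive_path


# Hypothetical content: two runs covering the same two test videos.
runs = {
    "run1": {"video_a.xml": "<video/>", "video_b.xml": "<video/>"},
    "run2": {"video_a.xml": "<video/>", "video_b.xml": "<video/>"},
}

with tempfile.TemporaryDirectory() as tmp:
    archive = package_runs(runs, Path(tmp) / "submission.zip")
    with zipfile.ZipFile(archive) as zf:
        names = sorted(zf.namelist())
```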

EVALUATION

For MediaEval 2020, we propose a lightweight classification task. It consists in classifying table tennis strokes whose temporal boundaries are supplied in the XML files accompanying each video file. Hence, for each test video, the participants are invited to produce an XML file in which each stroke is labelled according to the given taxonomy. This means that the default label "Unknown" has to be replaced by the label of the stroke class that the participant's system has assigned. All submissions will be evaluated in terms of per-class accuracy (A_i) and of global accuracy (GA).
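The two measures can be sketched as follows, assuming the usual definitions: A_i is the fraction of strokes of class i that are labelled correctly, and GA is the fraction of all strokes that are labelled correctly. These definitions and the `accuracies` helper are assumptions made for illustration, not the organizers' evaluation script; the class labels "A", "B" and "C" are placeholders.

```python
from collections import Counter


def accuracies(true_labels, predicted_labels):
    """Return ({class: A_i}, GA) under the assumed definitions above."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    total = Counter(true_labels)
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    per_class = {c: correct[c] / n for c, n in total.items()}
    global_accuracy = sum(correct.values()) / len(true_labels)
    return per_class, global_accuracy


truth = ["A", "A", "B", "B", "B", "C"]
predicted = ["A", "B", "B", "B", "C", "C"]
per_class, global_accuracy = accuracies(truth, predicted)
# Class A: 1 of 2 correct, class B: 2 of 3, class C: 1 of 1; overall 4 of 6.
```

Reporting both measures is useful when classes are unbalanced, since GA alone can be dominated by the most frequent classes.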

The organizers will also provide the participants with different confusion matrices: one considering all the classes, and others considering the type of the stroke (such as offensive or defensive) and/or the forehand and backhand superclasses of the strokes.
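A confusion matrix over superclasses can be obtained by first mapping each fine-grained label to its superclass and then counting (true, predicted) pairs. The sketch below assumes hypothetical two-word labels of the form "Type Hand"; the real taxonomy names and the organizers' grouping may differ.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(true_labels, predicted_labels))
    return [[counts[(t, p)] for p in classes] for t in classes]


def collapse(labels, superclass_of):
    """Map every fine-grained label to its superclass."""
    return [superclass_of(label) for label in labels]


# HYPOTHETICAL labels: "<type> <hand>".
truth = ["Offensive Forehand", "Offensive Backhand",
         "Defensive Forehand", "Defensive Backhand"]
predicted = ["Offensive Backhand", "Offensive Backhand",
             "Defensive Forehand", "Offensive Backhand"]


def stroke_type(label):
    return label.split()[0]


def stroke_hand(label):
    return label.split()[1]


type_matrix = confusion_matrix(collapse(truth, stroke_type),
                               collapse(predicted, stroke_type),
                               ["Offensive", "Defensive"])
hand_matrix = confusion_matrix(collapse(truth, stroke_hand),
                               collapse(predicted, stroke_hand),
                               ["Forehand", "Backhand"])
```

In this toy example, one defensive stroke is mistaken for an offensive one, and one forehand stroke for a backhand one.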

DISCUSSION

Last year's participants reached maximum accuracies of 22.9% [17], 14.1% [1] and 11.3% [25], leaving room for improvement. Participants are welcome to share their difficulties and their results, even if the results do not seem good enough.

Figure 1: TTStroke-21 acquisition process. a. Video acquisition. b. Annotation platform.

Task web page: https://multimediaeval.github.io/2020-Sports-Video-Classification-Task/

This work was supported by the New Aquitania Region through the CRISP project (ComputeR vIsion for Sport Performance) and the MIRES federation.

REFERENCES

[1] Jordan Calandre, Renaud Péteri, and Laurent Mascarilla. 2019. Optical Flow Singularities for Sports Video Annotation: Detection of Strokes in Table Tennis. In [11].

[2] João Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, and Andrew Zisserman. 2018. A Short Note about Kinetics-600. CoRR abs/1808.01340 (2018).

[3] João Carreira, Eric Noland, Chloe Hillier, and Andrew Zisserman. 2019. A Short Note on the Kinetics-700 Human Action Dataset. CoRR abs/1907.06987 (2019).

[4] João Carreira and Andrew Zisserman. 2017. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset.

[5] Moritz Einfalt, Dan Zecha, and Rainer Lienhart. 2018. Activity-Conditioned Continuous Human Pose Estimation for Performance Analysis of Athletes Using the Example of Swimming. In IEEE WACV 2018, Lake Tahoe, NV, USA, March 12-15, 2018.

[6] Kazi Ahmed Asif Fuad, Pierre-Etienne Martin, Romain Giot, Romain Bourqui, Jenny Benois-Pineau, and Akka Zemmari. 2020. Feature Understanding in 3D CNNs for Actions Recognition in Video. In Tenth International Conference on Image Processing Theory, Tools and Applications, IPTA 2020, Paris, France, November 9-12, 2020.

[7] Chunhui Gu, Chen Sun, David A. Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, and Jitendra Malik. 2018. AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions.

[8] Hueihan Jhuang, Juergen Gall, Silvia Zuffi, Cordelia Schmid, and Michael J. Black. 2013. Towards Understanding Action Recognition. In IEEE ICCV 2013, Sydney, Australia, December 1-8, 2013. IEEE Computer Society.

[9] Will Kay, João Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. 2017. The Kinetics Human Action Video Dataset. CoRR abs/1705.06950 (2017).

[10] Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso A. Poggio, and Thomas Serre. 2011. HMDB: A large video database for human motion recognition. In IEEE ICCV 2011, Barcelona, Spain, November 6-13, 2011, Dimitris N. Metaxas, Long Quan, Alberto Sanfeliu, and Luc Van Gool (Eds.). IEEE Computer Society.

[11] Martha A. Larson, Steven Alexander Hicks, Mihai Gabriel Constantin, Benjamin Bischke, Alastair Porter, Peijian Zhao, Mathias Lux, Laura Cabrera Quiros, Jordan Calandre, and Gareth Jones (Eds.). 2020. Working Notes Proceedings of the MediaEval 2019 Workshop, Sophia Antipolis, France, 27-30 October 2019. CEUR Workshop Proceedings, Vol. 2670.

[12] Ang Li, Meghana Thotakuri, David A. Ross, João Carreira, Alexander Vostrikov, and Andrew Zisserman. 2020. The AVA-Kinetics Localized Human Actions Video Dataset. CoRR abs/2005.00214 (2020).

[13] Hsien-I Lin, Zhangguo Yu, and Yi-Chen Huang. 2020. Ball Tracking and Trajectory Prediction for Table-Tennis Robots. Sensors 20, 2 (2020).

[14] Ruichen Liu, Zhelong Wang, Xin Shi, Hongyu Zhao, Sen Qiu, Jie Li, and Ning Yang. 2019. Table Tennis Stroke Recognition Based on Body Sensor Network. In IDCS 2019, Naples, Italy, October 10-12, 2019 (Lecture Notes in Computer Science, Vol. 11874), Raffaele Montella, Angelo Ciaramella, Giancarlo Fortino, Antonio Guerrieri, and Antonio Liotta (Eds.). Springer.

[15] Zheng Liu and Haifeng Hu. 2019. Spatiotemporal Relation Networks for Video Action Recognition. IEEE Access 7 (2019).

[16] Pierre-Etienne Martin, Jenny Benois-Pineau, Boris Mansencal, Renaud Péteri, Laurent Mascarilla, Jordan Calandre, and Julien Morlier. 2019. Sports Video Annotation: Detection of Strokes in Table Tennis Task for MediaEval 2019.

[17] Pierre-Etienne Martin, Jenny Benois-Pineau, Boris Mansencal, Renaud Péteri, and Julien Morlier. 2019. Siamese Spatio-Temporal Convolutional Neural Network for Stroke Classification in Table Tennis Games.

[18] Pierre-Etienne Martin, Jenny Benois-Pineau, Renaud Péteri, and Julien Morlier. 2018. Sport Action Recognition with Siamese Spatio-Temporal CNNs: Application to Table Tennis. In CBMI 2018, La Rochelle, France, September 4-6, 2018. IEEE.

[19] Pierre-Etienne Martin, Jenny Benois-Pineau, Renaud Péteri, and Julien Morlier. 2019. Optimal Choice of Motion Estimation Methods for Fine-Grained Action Classification with 3D Convolutional Networks. In IEEE ICIP 2019, Taipei, Taiwan, September 22-25, 2019. IEEE.

[20] Pierre-Etienne Martin, Jenny Benois-Pineau, Renaud Péteri, and Julien Morlier. 2020. 3D attention mechanisms in Twin Spatio-Temporal Convolutional Neural Networks. Application to action classification in videos of table tennis games. In ICPR2020 - MiCo Milano Congress Center, Italy, 10-15 January 2021.

[21] Pierre-Etienne Martin, Jenny Benois-Pineau, Renaud Péteri, and Julien Morlier. 2020. Fine grained sport action recognition with Twin spatio-temporal convolutional neural networks. Multim. Tools Appl. 79 (2020).

[22] Juan Carlos Niebles, Chih-Wei Chen, and Fei-Fei Li. 2010. Modeling Temporal Structure of Decomposable Motion Segments for Activity Classification. In Computer Vision - ECCV 2010, Heraklion, Crete, Greece, September 5-11, 2010, Proceedings, Part II (Lecture Notes in Computer Science, Vol. 6312), Kostas Daniilidis, Petros Maragos, and Nikos Paragios (Eds.). Springer.

[23] Dian Shao, Yue Zhao, Bo Dai, and Dahua Lin. 2020. FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding. CoRR abs/2004.06704 (2020).

[24] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild. CoRR abs/1212.0402 (2012).

[25] Siddharth Sriraman, Srinath Srinivasan, Vishnu K. Krishnan, J. Bhuvana, and T. T. Mirnalinee. 2019. MediaEval 2019: LRCNs for Stroke Detection in Table Tennis.

[26] Andrei Stoian, Marin Ferecatu, Jenny Benois-Pineau, and Michel Crucianu. 2016. Fast Action Localization in Large-Scale Video Archives. IEEE Trans. Circuits Syst. Video Techn. 26, 10 (2016).

[27] S. S. Tabrizi, S. Pashazadeh, and V. Javani. 2020. Comparative Study of Table Tennis Forehand Strokes Classification Using Deep Learning and SVM. IEEE Sensors Journal (2020).

[28] Gül Varol, Ivan Laptev, and Cordelia Schmid. 2018. Long-Term Temporal Convolutions for Action Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 40, 6 (2018).

[29] Roman Voeikov, Nikolay Falaleev, and Ruslan Baikulov. 2020. TTNet: Real-time temporal and spatial video analysis of table tennis. CoRR abs/2004.09927 (2020).

[30] Jiachen Wang, Kejian Zhao, Dazhen Deng, Anqi Cao, Xiao Xie, Zheng Zhou, Hui Zhang, and Yingcai Wu. 2020. Tac-Simur: Tactic-based Simulative Visual Analytics of Table Tennis. IEEE Trans. Vis. Comput. Graph. 26, 1 (2020).

[31] Erwin Wu and Hideki Koike. 2020. FuturePong: Real-time Table Tennis Trajectory Forecasting using Pose Prediction Network. In CHI 2020, Honolulu, HI, USA, Regina Bernhaupt, Florian 'Floyd' Mueller, David Verweij, Josh Andres, Joanna McGrenere, Andy Cockburn, Ignacio Avellino, Alix Goguey, Pernille Bjøn, Shengdong Zhao, Briane Paul Samson, and Rafal Kocielnik (Eds.).

[32] Kun Xia, Hanyu Wang, Menghan Xu, Zheng Li, Sheng He, and Yusong Tang. 2020. Racquet Sports Recognition Using a Hybrid Clustering Model Learned from Integrated Wearable Sensor. Sensors 20 (2020), 1638.