Sports Video: Fine-Grained Action Detection and Classification of Table Tennis Strokes from Videos for MediaEval 2021

Pierre-Etienne Martin1, Jordan Calandre2, Boris Mansencal3, Jenny Benois-Pineau3, Renaud Péteri2, Laurent Mascarilla2, Julien Morlier4
1 CCP Department, Max Planck Institute for Evolutionary Anthropology, D-04103 Leipzig, Germany
2 MIA, La Rochelle University, La Rochelle, France
3 Univ. Bordeaux, CNRS, Bordeaux INP, LaBRI, Talence, France
4 IMS, University of Bordeaux, Talence, France
mediaeval.sport.task@diff.u-bordeaux.fr

ABSTRACT
Sports video analysis is a prevalent research topic due to the variety of application areas, ranging from multimedia intelligent devices with user-tailored digests up to the analysis of athletes' performance. The Sports Video task is part of the MediaEval 2021 benchmark. This task tackles fine-grained action detection and classification from videos. The focus is on recordings of table tennis games. Running since 2019, the task has offered a classification challenge from untrimmed videos recorded in natural conditions with known temporal boundaries for each stroke. This year, the dataset is extended and offers, in addition, a detection challenge from untrimmed videos without annotations. This work aims at creating tools for sports coaches and players in order to analyze sports performance. Movement analysis and player profiling may be built upon such technology to enrich the training experience of athletes and improve their performance.

1 INTRODUCTION
Action detection and classification are among the main challenges in computer vision [20]. Over the last few years, the number and complexity of datasets dedicated to action classification have drastically increased [12]. Sports video analysis is one branch of computer vision, and applications in this area range from multimedia intelligent devices with user-tailored digests up to the analysis of athletes' performance [4, 21, 28].

A large amount of work is devoted to the analysis of sports gestures using motion capture systems. However, body-worn sensors and markers can disturb the natural behavior of sports players. This issue motivates the development of methods for game analysis using non-invasive equipment such as video recordings from cameras.

The Sports Video Classification project was initiated by the Sports Faculty (STAPS) and the computer science laboratory LaBRI of the University of Bordeaux, and the MIA laboratory of La Rochelle University1. This project aims to develop artificial intelligence and multimedia indexing methods for the recognition of table tennis activities. The ultimate goal is to evaluate the performance of athletes, with a particular focus on students, in order to develop optimal training strategies. To that aim, the video corpus named TTStroke-21 was recorded with volunteer players.

Datasets such as UCF-101 [26], HMDB [6, 8], AVA [5], and Kinetics [1, 2, 7, 9, 25] are being used in the scope of action recognition with, year after year, an increasing number of considered videos and classes. Few datasets focus on fine-grained classification in sports, such as FineGym [24] and TTStroke-21 [19].

To tackle the increasing complexity of the datasets, there are, on the one hand, methods that get the most out of the temporal information: for example, in [11], spatio-temporal dependencies are learned from the video using only RGB data. On the other hand, there are methods that combine other modalities extracted from the videos, such as the optical flow [3, 18, 27]. The inter-similarity of the actions - strokes - in TTStroke-21 makes the classification task challenging, and both cited aspects shall be used to improve performance.

The following sections present this year's Sports Video task and its specific terms of use. Complementary information on the task may be found on the dedicated page of the MediaEval website2.

1 This work was supported by the New Aquitaine Region through the CRISP project (ComputeR vIsion for Sports Performance) and the MIRES federation.
2 https://multimediaeval.github.io/editions/2021/tasks/sportsvideo/

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'21, December 13-15 2021, Online.

2 TASK DESCRIPTION
This task uses the TTStroke-21 database [19]. This dataset consists of recordings of table tennis players performing in natural conditions. The task offers researchers an opportunity to test their fine-grained classification methods for detecting and classifying strokes in table tennis videos. Compared to the Sports Video 2020 edition, this year we extend the task with detection and enrich the dataset with new and more diverse stroke samples. The task now offers two subtasks. Each subtask has its own split of the dataset, leading to different train, validation, and test sets.

Participants can choose to participate in only one or both subtasks and submit up to five runs for each. The participants must provide one XML file per video file present in the test set for each run. The content of the XML file varies according to the subtask. Runs may be submitted as an archive (zip file), with each run in a different directory for each subtask. Participants should also submit a working notes paper, which describes their method and indicates if any external data, such as other datasets or pretrained networks, was used to compute their runs. The task is considered fully automatic: once the videos are provided to the system, results should be produced without human intervention. Participants are encouraged to release their code publicly with their submission. This year, a baseline for both subtasks was shared publicly [13].
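For illustration only, the short Python sketch below shows how one result file per test video could be written without human intervention. The actual XML schema is defined by the annotation files distributed with the dataset and is not reproduced here: the tag and attribute names (video, action, begin, end, label) and the example stroke labels are hypothetical placeholders, and participants should mirror the structure of the provided train and validation files instead.

import xml.etree.ElementTree as ET

def write_run_file(video_name, strokes, out_path):
    # strokes: list of (begin_frame, end_frame, label) triplets predicted for one test video.
    root = ET.Element("video", name=video_name)
    for begin, end, label in strokes:
        ET.SubElement(root, "action", begin=str(begin), end=str(end), label=label)
    ET.ElementTree(root).write(out_path, encoding="utf-8", xml_declaration=True)

# One XML file per test video. For the classification subtask, the default "Unknown"
# labels are replaced by the predicted classes (illustrative names below).
write_run_file("video_test_01.mp4",
               [(1200, 1310, "Offensive Forehand Hit"),
                (1840, 1952, "Defensive Backhand Push")],
               "run1/video_test_01.xml")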
2.1 Subtask 1 - Stroke Detection
Participants must build a system that detects whether a stroke has been performed, whatever its class, and extracts its temporal boundaries. The aim is to distinguish moments of interest in a game (players performing strokes) from irrelevant moments (time between strokes, picking up the ball, having a break, etc.). This subtask can be a preliminary step for later recognizing which stroke has been performed.

Participants have to segment the regions where a stroke is performed in the provided videos. The provided XML files contain the stroke temporal boundaries (frame indices in the videos) for the train and validation sets. We invite the participants to fill an XML file for each test video, in which each stroke should be temporally segmented frame-wise following the same structure.

For this subtask, the videos are not shared across the train, validation, and test sets; however, the same player may appear in the different sets. The Intersection over Union (IoU) and Average Precision (AP) metrics will be used for evaluation. Both are usually used for image segmentation but are adapted for this task:

• Global IoU: the frame-wise overlap between the ground truth and the predicted strokes across all the videos.
• Instance AP: each stroke represents an instance to be detected. A detection is considered a true positive when the IoU between prediction and ground truth is above a given threshold. Ten thresholds from 0.5 to 0.95 with a step of 0.05 are considered, similarly to the COCO challenge [10]. This metric will be used for the final ranking of participants.
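As a rough illustration of these two measures (not the official evaluation tool), the sketch below assumes that each stroke is a pair of inclusive frame indices and that predicted strokes are sorted by decreasing confidence; the AP approximation simply accumulates precision at each matched ground-truth instance.

import numpy as np

IOU_THRESHOLDS = np.arange(0.5, 1.0, 0.05)  # 0.5, 0.55, ..., 0.95, as in the COCO challenge

def segment_iou(a, b):
    # Temporal IoU between two (begin, end) segments given as inclusive frame indices.
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union

def global_iou(gt, pred, n_frames):
    # Frame-wise overlap between ground-truth and predicted strokes of one video.
    gt_mask = np.zeros(n_frames, dtype=bool)
    pred_mask = np.zeros(n_frames, dtype=bool)
    for b, e in gt:
        gt_mask[b:e + 1] = True
    for b, e in pred:
        pred_mask[b:e + 1] = True
    union = (gt_mask | pred_mask).sum()
    return (gt_mask & pred_mask).sum() / union if union else 1.0

def instance_ap(gt, pred):
    # Average precision of detected strokes, averaged over the ten IoU thresholds.
    # pred must be sorted by decreasing confidence; each ground-truth stroke matches at most once.
    aps = []
    for t in IOU_THRESHOLDS:
        matched, tp, precisions = set(), 0, []
        for rank, p in enumerate(pred, start=1):
            j, iou = max(((j, segment_iou(p, g)) for j, g in enumerate(gt) if j not in matched),
                         key=lambda x: x[1], default=(None, 0.0))
            if iou >= t:
                matched.add(j)
                tp += 1
                precisions.append(tp / rank)  # precision at each recall step
        aps.append(sum(precisions) / len(gt) if gt else 1.0)
    return float(np.mean(aps))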
2.2 Subtask 2 - Stroke Classification
This subtask is similar to the main task of the previous edition [15]. This year the dataset is extended, and a validation set is provided.

Participants are required to build a classification system that automatically labels video segments according to the performed stroke. There are 20 possible stroke classes. The temporal boundaries of each stroke are supplied in the XML files accompanying each video in each set. The XML files dedicated to the train and validation sets contain the stroke class as a label, while in the test set the label is set to "Unknown". Hence, for each XML file in the test set, the participants are invited to replace the default label "Unknown" with the stroke class that their system has assigned according to the given taxonomy.

For this subtask, the videos are shared across the sets following a random distribution of all the strokes with proportions of 60%, 20%, and 20% for the train, validation, and test sets respectively. All submissions will be evaluated in terms of global accuracy for ranking and detailed with per-class accuracy.

Last year, the best global accuracy (31.4%) was obtained by [22] using a Channel-Separated CNN. [17] came second (26.6%) using a 3D attention mechanism, and [23] came third (16.7%) using pose information and a cascade labelling method. An improvement has been observed compared to the previous edition [14], whose best accuracy was 22.9% [16]. This improvement seems to be correlated with several factors, such as: i) multi-modal methods, ii) deeper and more complex CNNs capturing spatial and temporal features simultaneously, and iii) class decisions following a cascade method.

Figure 1: Key frames of the same stroke from TTStroke-21

3 DATASET DESCRIPTION
The dataset has been recorded at the STAPS using lightweight equipment. It consists of player-centered videos recorded in natural conditions without markers or sensors, see Fig. 1. Professional table tennis teachers designed a dedicated taxonomy. The dataset comprises 20 table tennis stroke classes: eight services, six offensive strokes, and six defensive strokes. The strokes may be divided into two super-classes: Forehand and Backhand.

All videos are recorded in MPEG-4 format. We blurred the faces of the players in each original video frame using the OpenCV deep learning face detector, based on the Single Shot Detector (SSD) framework with a ResNet base network. A tracking method has been implemented to decrease the false positive rate. The detected faces are blurred, and the video is re-encoded in MPEG-4.

Compared with the Sports Video 2020 edition, this year the dataset is enriched with new and more diverse video samples. A total of 100 minutes of table tennis games across 28 videos recorded at 120 frames per second is considered. It represents more than 718 000 frames in HD (1920 × 1080). An additional validation set is also provided for better comparison across participants. This set may be used for training when submitting the test set's results. Twenty-two videos are used for the Stroke Classification subtask, representing 1017 strokes randomly distributed in the different sets following the previously given proportions. The same videos are used in the train and validation sets of the Stroke Detection subtask, and six additional videos, without annotations, are dedicated to its test set.
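The anonymization step described above can be approximated with OpenCV's DNN module, as in the minimal sketch below. The model file names correspond to the ResNet-10 SSD face detector distributed with the OpenCV samples and are an assumption about the exact detector used; the tracking stage that reduces false positives, and the video decoding and re-encoding loop, are omitted here.

import cv2
import numpy as np

# Assumed model files of OpenCV's SSD/ResNet-10 face detector (shipped with the OpenCV samples).
net = cv2.dnn.readNetFromCaffe("deploy.prototxt", "res10_300x300_ssd_iter_140000.caffemodel")

def blur_faces(frame, conf_threshold=0.5):
    # Detect faces on a 300x300 resized copy and blur the corresponding regions of the frame.
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)), 1.0, (300, 300),
                                 (104.0, 177.0, 123.0))
    net.setInput(blob)
    detections = net.forward()  # shape (1, 1, N, 7): [.., confidence, x1, y1, x2, y2]
    for i in range(detections.shape[2]):
        if detections[0, 0, i, 2] < conf_threshold:
            continue
        x1, y1, x2, y2 = (detections[0, 0, i, 3:7] * np.array([w, h, w, h])).astype(int)
        x1, y1, x2, y2 = max(0, x1), max(0, y1), min(w, x2), min(h, y2)
        if x2 > x1 and y2 > y1:
            frame[y1:y2, x1:x2] = cv2.GaussianBlur(frame[y1:y2, x1:x2], (51, 51), 30)
    return frame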
4 SPECIFIC TERMS OF USE
Although faces are automatically blurred to preserve anonymity, some faces are misdetected, and thus some players remain identifiable. In order to respect the personal data of the players, this dataset is subject to a usage agreement, referred to as the Special Conditions. The complete acceptance of these Special Conditions is a mandatory prerequisite for the provision of the Images as part of the MediaEval 2021 evaluation campaign. A complete reading of these conditions is necessary; they require the user, for example, to obscure the faces (blurring, black banner) in the videos before use in any publication and to destroy the data by October 1st, 2022.

5 DISCUSSIONS
This year the Sports Video task of MediaEval proposes two subtasks: i) detection and ii) classification of strokes from videos. Even if the players' faces are blurred, the provided videos still fall under particular usage conditions that the participants need to accept. Participants are encouraged to share their difficulties and their results even if these do not seem sufficiently good. All the investigations, even when not successful, may inspire future methods.

ACKNOWLEDGMENTS
Many thanks to the players, coaches, and annotators who contributed to TTStroke-21.

REFERENCES
[1] João Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, and Andrew Zisserman. 2018. A Short Note about Kinetics-600. CoRR abs/1808.01340 (2018).
[2] João Carreira, Eric Noland, Chloe Hillier, and Andrew Zisserman. 2019. A Short Note on the Kinetics-700 Human Action Dataset. CoRR abs/1907.06987 (2019).
[3] João Carreira and Andrew Zisserman. 2017. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In CVPR. IEEE Computer Society, 4724–4733.
[4] Moritz Einfalt, Dan Zecha, and Rainer Lienhart. 2018. Activity-Conditioned Continuous Human Pose Estimation for Performance Analysis of Athletes Using the Example of Swimming. In WACV. 446–455.
[5] Chunhui Gu, Chen Sun, David A. Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, and Jitendra Malik. 2018. AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions. (2018), 6047–6056.
[6] Hueihan Jhuang, Juergen Gall, Silvia Zuffi, Cordelia Schmid, and Michael J. Black. 2013. Towards Understanding Action Recognition. In ICCV. IEEE Computer Society, 3192–3199.
[7] Will Kay, João Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. 2017. The Kinetics Human Action Video Dataset. CoRR abs/1705.06950 (2017).
[8] Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso A. Poggio, and Thomas Serre. 2011. HMDB: A large video database for human motion recognition. In ICCV. IEEE Computer Society, 2556–2563.
[9] Ang Li, Meghana Thotakuri, David A. Ross, João Carreira, Alexander Vostrikov, and Andrew Zisserman. 2020. The AVA-Kinetics Localized Human Actions Video Dataset. CoRR abs/2005.00214 (2020).
[10] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V (Lecture Notes in Computer Science), David J. Fleet, Tomás Pajdla, Bernt Schiele, and Tinne Tuytelaars (Eds.), Vol. 8693. Springer, 740–755. https://doi.org/10.1007/978-3-319-10602-1_48
[11] Zheng Liu and Haifeng Hu. 2019. Spatiotemporal Relation Networks for Video Action Recognition. IEEE Access 7 (2019), 14969–14976.
[12] Pierre-Etienne Martin. 2020. Fine-Grained Action Detection and Classification from Videos with Spatio-Temporal Convolutional Neural Networks. Application to Table Tennis. (Détection et classification fines d'actions à partir de vidéos par réseaux de neurones à convolutions spatio-temporelles. Application au tennis de table). Ph.D. Dissertation. University of La Rochelle, France. https://tel.archives-ouvertes.fr/tel-03128769
[13] Pierre-Etienne Martin. 2021. Spatio-Temporal CNN baseline method for the Sports Video Task of MediaEval 2021 benchmark. In MediaEval (CEUR Workshop Proceedings). CEUR-WS.org.
[14] Pierre-Etienne Martin, Jenny Benois-Pineau, Boris Mansencal, Renaud Péteri, Laurent Mascarilla, Jordan Calandre, and Julien Morlier. 2019. Sports Video Annotation: Detection of Strokes in Table Tennis Task for MediaEval 2019. In MediaEval (CEUR Workshop Proceedings), Vol. 2670. CEUR-WS.org.
[15] Pierre-Etienne Martin, Jenny Benois-Pineau, Boris Mansencal, Renaud Péteri, Laurent Mascarilla, Jordan Calandre, and Julien Morlier. 2020. Sports Video Classification: Classification of Strokes in Table Tennis for MediaEval 2020. In MediaEval (CEUR Workshop Proceedings), Vol. 2882. CEUR-WS.org.
[16] Pierre-Etienne Martin, Jenny Benois-Pineau, Boris Mansencal, Renaud Péteri, and Julien Morlier. 2019. Siamese Spatio-Temporal Convolutional Neural Network for Stroke Classification in Table Tennis Games. In MediaEval (CEUR Workshop Proceedings), Vol. 2670. CEUR-WS.org.
[17] Pierre-Etienne Martin, Jenny Benois-Pineau, Boris Mansencal, Renaud Péteri, and Julien Morlier. 2020. Classification of Strokes in Table Tennis with a Three Stream Spatio-Temporal CNN for MediaEval 2020. In MediaEval (CEUR Workshop Proceedings), Vol. 2882. CEUR-WS.org.
[18] Pierre-Etienne Martin, Jenny Benois-Pineau, Renaud Péteri, and Julien Morlier. 2019. Optimal Choice of Motion Estimation Methods for Fine-Grained Action Classification with 3D Convolutional Networks. In ICIP. IEEE, 554–558.
[19] Pierre-Etienne Martin, Jenny Benois-Pineau, Renaud Péteri, and Julien Morlier. 2020. Fine grained sport action recognition with Twin spatio-temporal convolutional neural networks. Multim. Tools Appl. 79, 27-28 (2020), 20429–20447.
[20] Pierre-Etienne Martin, Jenny Benois-Pineau, Renaud Péteri, Akka Zemmari, and Julien Morlier. 2021. 3D Convolutional Networks for Action Recognition: Application to Sport Gesture Recognition. Springer International Publishing.
[21] Marion Morel, Catherine Achard, Richard Kulpa, and Séverine Dubuisson. 2017. Automatic evaluation of sports motion: A generic computation of spatial and temporal errors. Image Vis. Comput. 64 (2017), 67–78.
[22] Hai Nguyen-Truong, San Cao, N. A. Khoa Nguyen, Bang-Dang Pham, Hieu Dao, Minh-Quan Le, Hoang-Phuc Nguyen-Dinh, Hai-Dang Nguyen, and Minh-Triet Tran. 2020. HCMUS at MediaEval 2020: Ensembles of Temporal Deep Neural Networks for Table Tennis Strokes Classification Task. In MediaEval (CEUR Workshop Proceedings), Vol. 2882. CEUR-WS.org.
[23] Soichiro Sato and Masaki Aono. 2020. Leveraging Human Pose Estimation Model for Stroke Classification in Table Tennis. In MediaEval (CEUR Workshop Proceedings), Vol. 2882. CEUR-WS.org.
[24] Dian Shao, Yue Zhao, Bo Dai, and Dahua Lin. 2020. FineGym: A Hierarchical Video Dataset for Fine-Grained Action Understanding. In CVPR. IEEE, 2613–2622.
[25] Lucas Smaira, João Carreira, Eric Noland, Ellen Clancy, Amy Wu, and Andrew Zisserman. 2020. A Short Note on the Kinetics-700-2020 Human Action Dataset. CoRR abs/2010.10864 (2020).
[26] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild. CoRR abs/1212.0402 (2012).
[27] Gül Varol, Ivan Laptev, and Cordelia Schmid. 2018. Long-Term Temporal Convolutions for Action Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 40, 6 (2018), 1510–1517.
[28] Roman Voeikov, Nikolay Falaleev, and Ruslan Baikulov. 2020. TTNet: Real-time temporal and spatial video analysis of table tennis. (2020), 3866–3874.