Sports Video: Fine-Grained Action Detection and Classification of Table Tennis Strokes from Videos for MediaEval 2021

Pierre-Etienne Martin1, Jordan Calandre2, Boris Mansencal3, Jenny Benois-Pineau3, Renaud Péteri2, Laurent Mascarilla2, Julien Morlier4
1 CCP Department, Max Planck Institute for Evolutionary Anthropology, D-04103 Leipzig, Germany
2 MIA, La Rochelle University, La Rochelle, France
3 Univ. Bordeaux, CNRS, Bordeaux INP, LaBRI, Talence, France
4 IMS, University of Bordeaux, Talence, France
mediaeval.sport.task@diff.u-bordeaux.fr

ABSTRACT
Sports video analysis is a prevalent research topic due to the variety of application areas, ranging from multimedia intelligent devices with user-tailored digests up to the analysis of athletes' performance. The Sports Video task is part of the MediaEval 2021 benchmark. This task tackles fine-grained action detection and classification from videos. The focus is on recordings of table tennis games. Running since 2019, the task has offered a classification challenge from untrimmed videos recorded in natural conditions with known temporal boundaries for each stroke. This year, the dataset is extended and offers, in addition, a detection challenge from untrimmed videos without annotations. This work aims at creating tools for sports coaches and players in order to analyze sports performance. Movement analysis and player profiling may be built upon such technology to enrich the training experience of athletes and improve their performance.

1 INTRODUCTION
Action detection and classification are among the main challenges in computer vision [20]. Over the last few years, the number and complexity of datasets dedicated to action classification have drastically increased [12]. Sports video analysis is one branch of computer vision, and applications in this area range from multimedia intelligent devices with user-tailored digests up to the analysis of athletes' performance [4, 21, 28].

A large amount of work is devoted to the analysis of sports gestures using motion capture systems. However, body-worn sensors and markers can disturb the natural behavior of sports players. This issue motivates the development of methods for game analysis using non-invasive equipment such as video recordings from cameras.

The Sports Video Classification project was initiated by the Sports Faculty (STAPS) and the computer science laboratory LaBRI of the University of Bordeaux, and the MIA laboratory of La Rochelle University1. This project aims to develop artificial intelligence and multimedia indexing methods for the recognition of table tennis activities. The ultimate goal is to evaluate the performance of athletes, with a particular focus on students, in order to develop optimal training strategies. To that aim, the video corpus named TTStroke-21 was recorded with volunteer players.

Datasets such as UCF-101 [26], HMDB [6, 8], AVA [5], and Kinetics [1, 2, 7, 9, 25] are being used in the scope of action recognition with, year after year, an increasing number of considered videos and classes. Few datasets focus on fine-grained classification in sports, such as FineGym [24] and TTStroke-21 [19].

To tackle the increasing complexity of the datasets, there are, on the one hand, methods that get the most out of the temporal information: for example, in [11], spatio-temporal dependencies are learned from the video using only RGB data. On the other hand, there are methods that combine other modalities extracted from the videos, such as the optical flow [3, 18, 27]. The inter-similarity of the actions - strokes - in TTStroke-21 makes the classification task challenging, and both cited aspects shall be used to improve performance.

The following sections present this year's Sports Video task and its specific terms of use. Complementary information on the task may be found on the dedicated page of the MediaEval website2.

1 This work was supported by the New Aquitaine Region through the CRISP project (ComputeR vIsion for Sports Performance) and the MIRES federation.
2 https://multimediaeval.github.io/editions/2021/tasks/sportsvideo/

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'21, December 13-15 2021, Online.

2 TASK DESCRIPTION
This task uses the TTStroke-21 database [19]. This dataset consists of recordings of table tennis players performing in natural conditions. The task offers researchers an opportunity to test their fine-grained classification methods for detecting and classifying strokes in table tennis videos. Compared to the Sports Video 2020 edition, this year we extend the task with detection and enrich the dataset with new and more diverse stroke samples. The task now offers two subtasks. Each subtask has its own split of the dataset, leading to different train, validation, and test sets.

Participants can choose to participate in only one or both subtasks and submit up to five runs for each. The participants must provide one XML file per video file present in the test set for each run. The content of the XML file varies according to the subtask. Runs may be submitted as an archive (zip file), with each run in a different directory for each subtask. Participants should also submit a working notes paper, which describes their method and indicates if any external data, such as other datasets or pretrained networks, was used to compute their runs. The task is considered fully automatic: once the videos are provided to the system, results should be produced without human intervention. Participants are encouraged to release their code publicly with their submission. This year, a baseline for both subtasks was shared publicly [13].
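For illustration only, the short Python sketch below shows how one result file per test video could be written without human intervention. The actual XML schema is defined by the annotation files distributed with the dataset and is not reproduced here: the tag and attribute names (video, action, begin, end, label) and the example stroke labels are hypothetical placeholders, and participants should mirror the structure of the provided train and validation files instead.

import xml.etree.ElementTree as ET

def write_run_file(video_name, strokes, out_path):
    # strokes: list of (begin_frame, end_frame, label) triplets predicted for one test video.
    root = ET.Element("video", name=video_name)
    for begin, end, label in strokes:
        ET.SubElement(root, "action", begin=str(begin), end=str(end), label=label)
    ET.ElementTree(root).write(out_path, encoding="utf-8", xml_declaration=True)

# One XML file per test video. For the classification subtask, the default "Unknown"
# labels are replaced by the predicted classes (illustrative names below).
write_run_file("video_test_01.mp4",
               [(1200, 1310, "Offensive Forehand Hit"),
                (1840, 1952, "Defensive Backhand Push")],
               "run1/video_test_01.xml")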
2.1 Subtask 1 - Stroke Detection
Participants must build a system that detects whether a stroke has been performed, whatever its class, and extracts its temporal boundaries. The aim is to distinguish moments of interest in a game (players performing strokes) from irrelevant moments (time between strokes, picking up the ball, having a break, etc.). This subtask can be a preliminary step for later recognizing which stroke has been performed.

Participants have to segment the regions where a stroke is performed in the provided videos. The provided XML files contain the stroke temporal boundaries (frame indices in the videos) for the train and validation sets. We invite the participants to fill an XML file for each test video, in which each stroke should be temporally segmented frame-wise following the same structure.

For this subtask, the videos are not shared across the train, validation, and test sets; however, the same player may appear in the different sets. The Intersection over Union (IoU) and Average Precision (AP) metrics will be used for evaluation. Both are usually used for image segmentation but are adapted for this task:

• Global IoU: the frame-wise overlap between the ground truth and the predicted strokes across all the videos.
• Instance AP: each stroke represents an instance to be detected. A detection is considered a true positive when the IoU between prediction and ground truth is above a given threshold. Ten thresholds from 0.5 to 0.95 with a step of 0.05 are considered, similarly to the COCO challenge [10]. This metric will be used for the final ranking of participants.
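As a rough illustration of these two measures (not the official evaluation tool), the sketch below assumes that each stroke is a pair of inclusive frame indices and that predicted strokes are sorted by decreasing confidence; the AP approximation simply accumulates precision at each matched ground-truth instance.

import numpy as np

IOU_THRESHOLDS = np.arange(0.5, 1.0, 0.05)  # 0.5, 0.55, ..., 0.95, as in the COCO challenge

def segment_iou(a, b):
    # Temporal IoU between two (begin, end) segments given as inclusive frame indices.
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union

def global_iou(gt, pred, n_frames):
    # Frame-wise overlap between ground-truth and predicted strokes of one video.
    gt_mask = np.zeros(n_frames, dtype=bool)
    pred_mask = np.zeros(n_frames, dtype=bool)
    for b, e in gt:
        gt_mask[b:e + 1] = True
    for b, e in pred:
        pred_mask[b:e + 1] = True
    union = (gt_mask | pred_mask).sum()
    return (gt_mask & pred_mask).sum() / union if union else 1.0

def instance_ap(gt, pred):
    # Average precision of detected strokes, averaged over the ten IoU thresholds.
    # pred must be sorted by decreasing confidence; each ground-truth stroke matches at most once.
    aps = []
    for t in IOU_THRESHOLDS:
        matched, tp, precisions = set(), 0, []
        for rank, p in enumerate(pred, start=1):
            j, iou = max(((j, segment_iou(p, g)) for j, g in enumerate(gt) if j not in matched),
                         key=lambda x: x[1], default=(None, 0.0))
            if iou >= t:
                matched.add(j)
                tp += 1
                precisions.append(tp / rank)  # precision at each recall step
        aps.append(sum(precisions) / len(gt) if gt else 1.0)
    return float(np.mean(aps))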
2.2 Subtask 2 - Stroke Classification
This subtask is similar to the main task of the previous edition [15]. This year the dataset is extended, and a validation set is provided.

Participants are required to build a classification system that automatically labels video segments according to the performed stroke. There are 20 possible stroke classes. The temporal boundaries of each stroke are supplied in the XML files accompanying each video in each set. The XML files dedicated to the train and validation sets contain the stroke class as a label, while in the test set the label is set to "Unknown". Hence, for each XML file in the test set, the participants are invited to replace the default label "Unknown" with the stroke class that their system has assigned according to the given taxonomy.

For this subtask, the videos are shared across the sets following a random distribution of all the strokes with proportions of 60%, 20%, and 20% for the train, validation, and test sets respectively. All submissions will be evaluated in terms of global accuracy for ranking and detailed with per-class accuracy.

Last year, the best global accuracy (31.4%) was obtained by [22] using a Channel-Separated CNN. [17] came second (26.6%) using a 3D attention mechanism, and [23] came third (16.7%) using pose information and a cascade labelling method. An improvement has been observed compared to the previous edition [14], whose best accuracy was 22.9% [16]. This improvement seems to be correlated with several factors, such as: i) multi-modal methods, ii) deeper and more complex CNNs capturing spatial and temporal features simultaneously, and iii) class decisions following a cascade method.

Figure 1: Key frames of the same stroke from TTStroke-21

3 DATASET DESCRIPTION
The dataset has been recorded at the STAPS using lightweight equipment. It consists of player-centered videos recorded in natural conditions without markers or sensors, see Fig. 1. Professional table tennis teachers designed a dedicated taxonomy. The dataset comprises 20 table tennis stroke classes: eight services, six offensive strokes, and six defensive strokes. The strokes may be divided into two super-classes: Forehand and Backhand.

All videos are recorded in MPEG-4 format. We blurred the faces of the players in each original video frame using the OpenCV deep learning face detector, based on the Single Shot Detector (SSD) framework with a ResNet base network. A tracking method has been implemented to decrease the false positive rate. The detected faces are blurred, and the video is re-encoded in MPEG-4.

Compared with the Sports Video 2020 edition, this year the dataset is enriched with new and more diverse video samples. A total of 100 minutes of table tennis games across 28 videos recorded at 120 frames per second is considered. It represents more than 718 000 frames in HD (1920 × 1080). An additional validation set is also provided for better comparison across participants. This set may be used for training when submitting the test set's results. Twenty-two videos are used for the Stroke Classification subtask, representing 1017 strokes randomly distributed in the different sets following the previously given proportions. The same videos are used in the train and validation sets of the Stroke Detection subtask, and six additional videos, without annotations, are dedicated to its test set.
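The anonymization step described above can be approximated with OpenCV's DNN module, as in the minimal sketch below. The model file names correspond to the ResNet-10 SSD face detector distributed with the OpenCV samples and are an assumption about the exact detector used; the tracking stage that reduces false positives, and the video decoding and re-encoding loop, are omitted here.

import cv2
import numpy as np

# Assumed model files of OpenCV's SSD/ResNet-10 face detector (shipped with the OpenCV samples).
net = cv2.dnn.readNetFromCaffe("deploy.prototxt", "res10_300x300_ssd_iter_140000.caffemodel")

def blur_faces(frame, conf_threshold=0.5):
    # Detect faces on a 300x300 resized copy and blur the corresponding regions of the frame.
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)), 1.0, (300, 300),
                                 (104.0, 177.0, 123.0))
    net.setInput(blob)
    detections = net.forward()  # shape (1, 1, N, 7): [.., confidence, x1, y1, x2, y2]
    for i in range(detections.shape[2]):
        if detections[0, 0, i, 2] < conf_threshold:
            continue
        x1, y1, x2, y2 = (detections[0, 0, i, 3:7] * np.array([w, h, w, h])).astype(int)
        x1, y1, x2, y2 = max(0, x1), max(0, y1), min(w, x2), min(h, y2)
        if x2 > x1 and y2 > y1:
            frame[y1:y2, x1:x2] = cv2.GaussianBlur(frame[y1:y2, x1:x2], (51, 51), 30)
    return frame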
4 SPECIFIC TERMS OF USE
Although faces are automatically blurred to preserve anonymity, some faces are misdetected, and thus some players remain identifiable. In order to respect the personal data of the players, this dataset is subject to a usage agreement, referred to as the Special Conditions. The complete acceptance of these Special Conditions is a mandatory prerequisite for the provision of the Images as part of the MediaEval 2021 evaluation campaign. A complete reading of these conditions is necessary; they require the user, for example, to obscure the faces (blurring, black banner) in the videos before use in any publication and to destroy the data by October 1st, 2022.

5 DISCUSSIONS
This year the Sports Video task of MediaEval proposes two subtasks: i) detection and ii) classification of strokes from videos. Even if the players' faces are blurred, the provided videos still fall under particular usage conditions that the participants need to accept. Participants are encouraged to share their difficulties and their results even if these do not seem sufficiently good. All the investigations, even when not successful, may inspire future methods.

ACKNOWLEDGMENTS
Many thanks to the players, coaches, and annotators who contributed to TTStroke-21.

REFERENCES
[1] João Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, and Andrew Zisserman. 2018. A Short Note about Kinetics-600. CoRR abs/1808.01340 (2018).
[2] João Carreira, Eric Noland, Chloe Hillier, and Andrew Zisserman. 2019. A Short Note on the Kinetics-700 Human Action Dataset. CoRR abs/1907.06987 (2019).
[3] João Carreira and Andrew Zisserman. 2017. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In CVPR. IEEE Computer Society, 4724–4733.
[4] Moritz Einfalt, Dan Zecha, and Rainer Lienhart. 2018. Activity-Conditioned Continuous Human Pose Estimation for Performance Analysis of Athletes Using the Example of Swimming. In WACV. 446–455.
[5] Chunhui Gu, Chen Sun, David A. Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, and Jitendra Malik. 2018. AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions. (2018), 6047–6056.
[6] Hueihan Jhuang, Juergen Gall, Silvia Zuffi, Cordelia Schmid, and Michael J. Black. 2013. Towards Understanding Action Recognition. In ICCV. IEEE Computer Society, 3192–3199.
[7] Will Kay, João Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. 2017. The Kinetics Human Action Video Dataset. CoRR abs/1705.06950 (2017).
[8] Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso A. Poggio, and Thomas Serre. 2011. HMDB: A large video database for human motion recognition. In ICCV. IEEE Computer Society, 2556–2563.
[9] Ang Li, Meghana Thotakuri, David A. Ross, João Carreira, Alexander Vostrikov, and Andrew Zisserman. 2020. The AVA-Kinetics Localized Human Actions Video Dataset. CoRR abs/2005.00214 (2020).
[10] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V (Lecture Notes in Computer Science), David J. Fleet, Tomás Pajdla, Bernt Schiele, and Tinne Tuytelaars (Eds.), Vol. 8693. Springer, 740–755. https://doi.org/10.1007/978-3-319-10602-1_48
[11] Zheng Liu and Haifeng Hu. 2019. Spatiotemporal Relation Networks for Video Action Recognition. IEEE Access 7 (2019), 14969–14976.
[12] Pierre-Etienne Martin. 2020. Fine-Grained Action Detection and Classification from Videos with Spatio-Temporal Convolutional Neural Networks. Application to Table Tennis. (Détection et classification fines d'actions à partir de vidéos par réseaux de neurones à convolutions spatio-temporelles. Application au tennis de table). Ph.D. Dissertation. University of La Rochelle, France. https://tel.archives-ouvertes.fr/tel-03128769
[13] Pierre-Etienne Martin. 2021. Spatio-Temporal CNN baseline method for the Sports Video Task of MediaEval 2021 benchmark. In MediaEval (CEUR Workshop Proceedings). CEUR-WS.org.
[14] Pierre-Etienne Martin, Jenny Benois-Pineau, Boris Mansencal, Renaud Péteri, Laurent Mascarilla, Jordan Calandre, and Julien Morlier. 2019. Sports Video Annotation: Detection of Strokes in Table Tennis Task for MediaEval 2019. In MediaEval (CEUR Workshop Proceedings), Vol. 2670. CEUR-WS.org.
[15] Pierre-Etienne Martin, Jenny Benois-Pineau, Boris Mansencal, Renaud Péteri, Laurent Mascarilla, Jordan Calandre, and Julien Morlier. 2020. Sports Video Classification: Classification of Strokes in Table Tennis for MediaEval 2020. In MediaEval (CEUR Workshop Proceedings), Vol. 2882. CEUR-WS.org.
[16] Pierre-Etienne Martin, Jenny Benois-Pineau, Boris Mansencal, Renaud Péteri, and Julien Morlier. 2019. Siamese Spatio-Temporal Convolutional Neural Network for Stroke Classification in Table Tennis Games. In MediaEval (CEUR Workshop Proceedings), Vol. 2670. CEUR-WS.org.
[17] Pierre-Etienne Martin, Jenny Benois-Pineau, Boris Mansencal, Renaud Péteri, and Julien Morlier. 2020. Classification of Strokes in Table Tennis with a Three Stream Spatio-Temporal CNN for MediaEval 2020. In MediaEval (CEUR Workshop Proceedings), Vol. 2882. CEUR-WS.org.
[18] Pierre-Etienne Martin, Jenny Benois-Pineau, Renaud Péteri, and Julien Morlier. 2019. Optimal Choice of Motion Estimation Methods for Fine-Grained Action Classification with 3D Convolutional Networks. In ICIP. IEEE, 554–558.
[19] Pierre-Etienne Martin, Jenny Benois-Pineau, Renaud Péteri, and Julien Morlier. 2020. Fine grained sport action recognition with Twin spatio-temporal convolutional neural networks. Multim. Tools Appl. 79, 27-28 (2020), 20429–20447.
[20] Pierre-Etienne Martin, Jenny Benois-Pineau, Renaud Péteri, Akka Zemmari, and Julien Morlier. 2021. 3D Convolutional Networks for Action Recognition: Application to Sport Gesture Recognition. Springer International Publishing.
[21] Marion Morel, Catherine Achard, Richard Kulpa, and Séverine Dubuisson. 2017. Automatic evaluation of sports motion: A generic computation of spatial and temporal errors. Image Vis. Comput. 64 (2017), 67–78.
[22] Hai Nguyen-Truong, San Cao, N. A. Khoa Nguyen, Bang-Dang Pham, Hieu Dao, Minh-Quan Le, Hoang-Phuc Nguyen-Dinh, Hai-Dang Nguyen, and Minh-Triet Tran. 2020. HCMUS at MediaEval 2020: Ensembles of Temporal Deep Neural Networks for Table Tennis Strokes Classification Task. In MediaEval (CEUR Workshop Proceedings), Vol. 2882. CEUR-WS.org.
[23] Soichiro Sato and Masaki Aono. 2020. Leveraging Human Pose Estimation Model for Stroke Classification in Table Tennis. In MediaEval (CEUR Workshop Proceedings), Vol. 2882. CEUR-WS.org.
[24] Dian Shao, Yue Zhao, Bo Dai, and Dahua Lin. 2020. FineGym: A Hierarchical Video Dataset for Fine-Grained Action Understanding. In CVPR. IEEE, 2613–2622.
[25] Lucas Smaira, João Carreira, Eric Noland, Ellen Clancy, Amy Wu, and Andrew Zisserman. 2020. A Short Note on the Kinetics-700-2020 Human Action Dataset. CoRR abs/2010.10864 (2020).
[26] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild. CoRR abs/1212.0402 (2012).
[27] Gül Varol, Ivan Laptev, and Cordelia Schmid. 2018. Long-Term Temporal Convolutions for Action Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 40, 6 (2018), 1510–1517.
[28] Roman Voeikov, Nikolay Falaleev, and Ruslan Baikulov. 2020. TTNet: Real-time temporal and spatial video analysis of table tennis. (2020), 3866–3874.