=Paper=
{{Paper
|id=Vol-2882/MediaEval_20_paper_2
|storemode=property
|title=Sports Video Classification: Classification of Strokes in Table Tennis for MediaEval 2020
|pdfUrl=https://ceur-ws.org/Vol-2882/paper2.pdf
|volume=Vol-2882
|authors=Pierre-Etienne Martin,Jenny Benois-Pineau,Boris Mansencal,Renaud Péteri,Laurent Mascarilla,Jordan Calandre,Julien Morlier
|dblpUrl=https://dblp.org/rec/conf/mediaeval/MartinBMPMCM20
}}
==Sports Video Classification: Classification of Strokes in Table Tennis for MediaEval 2020==
Sports Video Classification: Classification of Strokes in Table Tennis for MediaEval 2020

Pierre-Etienne Martin¹, Jenny Benois-Pineau¹, Boris Mansencal¹, Renaud Péteri², Laurent Mascarilla², Jordan Calandre², Julien Morlier³

¹ Univ. Bordeaux, CNRS, Bordeaux INP, LaBRI, Talence, France
² MIA, La Rochelle University, La Rochelle, France
³ IMS, University of Bordeaux, Talence, France
mediaeval.sport.task@diff.u-bordeaux.fr
ABSTRACT
Fine-grained action classification raises new challenges compared to the classical action classification problem. Sport video analysis is a very popular research topic, due to the variety of application areas, ranging from multimedia intelligent devices with user-tailored digests up to the analysis of athletes' performances. Running since 2019 as part of MediaEval, we offer a task which consists in classifying table tennis strokes from videos recorded in natural conditions at the University of Bordeaux. The aim is to build tools for teachers, coaches and players to analyse table tennis games. Such tools could lead to an automatic profiling of the player, and the training sessions could then be adapted to improve sports skills more efficiently.

1 INTRODUCTION
Action detection and classification is one of the main challenges in visual content analysis and mining [26]. Over the last few years, the number of datasets for action classification has drastically increased in terms of video content, resolution, localization and number of classes. However, the latest research shows that classification performed using deep neural networks often focuses on the whole scene and the background rather than on the action itself.

Sport video analysis has been a very popular research topic, due to the variety of application areas, ranging from multimedia intelligent devices with user-tailored digests up to the analysis of athletes' performance [5]. The Sport Video Classification project was initiated by the Faculty of Sports (STAPS) and the computer science laboratory (LaBRI) of the University of Bordeaux, and by the MIA laboratory of La Rochelle University¹. The goal of this project is to develop artificial intelligence and multimedia indexing methods for the recognition of table tennis activities. The ultimate goal is to evaluate the performance of athletes, with a particular focus on students, in order to develop optimal training strategies. To that aim, a video corpus named TTStroke-21 was recorded with volunteer players. These data are of great scientific interest for the multimedia community participating in the MediaEval campaign.

Several datasets, such as UCF-101 [24], HMDB [10] and AVA [7], have been used for many years as benchmarks for action classification methods. In [15], spatio-temporal dependencies are learned from the video using only RGB images for classification. This method is promising, but its scores are still below multi-modal methods such as I3D [4]. More recently, datasets have been enriched, like JHMDB [8] and Kinetics [2, 3, 9], or fused, like AVA-Kinetics [12]. Some also focus on intra-class dissimilarity, such as the Something-Something dataset. Others, such as the Olympic Sports dataset [22], focus on sport actions only. However, those datasets are not dedicated to a specific sport and its associated rules. Few datasets focus on fine-grained classification. We can cite FineGym [23], introduced recently, which focuses on gymnastics videos, and our TTStroke-21 [21], comprising table tennis strokes.

TTStroke-21 is manually annotated by professional table tennis players or teachers, which makes the annotation process more time-consuming but temporally and qualitatively more accurate. Classification methods such as the I3D model [4] or the LTC model [28], which perform well on the UCF-101 dataset, inspired the work done in [18, 21], which introduces the TSTCNN, a Twin Spatio-Temporal Convolutional Neural Network. Here, the video stream and the optical flow computed from it are passed through the two branches of the TSTCNN. In [19], the normalization of the flow is also investigated to improve the classification score, while in [20] an attention block is introduced to improve performance and speed of convergence. The similarity between the different stroke classes in TTStroke-21 makes the classification task challenging, and the multi-modal method seemed to improve performance. To better understand the learned features and the classification process taking place in the TSTCNN, we also developed a new visualization technique [6].

Recent work focusing on table tennis [30] tries to infer the tactics of the players from their performance during matches using a Markov chain model. In [14, 27, 32], stroke recognition is performed using sensors. In [29], segmentation of the player, ball coordinates and event detection are explored, while [13, 31] focus solely on the trajectory of the ball.

In this task overview paper, we introduce the specific conditions of usage of the data in section 2, then describe TTStroke-21 and the task in sections 3 and 4 respectively. The evaluation method is explained in section 5. Supplementary notes are shared in section 6. More information can be found on the dedicated GitHub web page².

¹ This work was supported by the New Aquitania Region through the CRISP project - ComputeR vIsion for Sport Performance - and the MIRES federation.
Copyright 2020 for this paper by its authors. Use permitted under Creative Commons
License Attribution 4.0 International (CC BY 4.0).
MediaEval’20, 14-15 December 2020, Online
² https://multimediaeval.github.io/2020-Sports-Video-Classification-Task/
[Figure 1: TTStroke-21 acquisition process. a. Video acquisition; b. Annotation platform]

2 SPECIFIC CONDITIONS OF USAGE
TTStroke-21 consists of videos of players playing table tennis in natural conditions. Even though we use an automatic tool for blurring the players' faces, some faces are missed on a few frames, and some players therefore remain identifiable. In order to respect the personal data and privacy of the players, this dataset is subject to a usage agreement, referred to as the Special Conditions. These Special Conditions apply to the use of the videos, referred to as Images, generated in the framework of the program Sports video classification: classification of strokes in table tennis, for the implementation of the MediaEval program. They correspond to the specific usage agreement referred to in the Usage agreement for the MediaEval 2020 Research Collections, signed between the User and the University of Delft. The full and complete acceptance, without any reservation, of these Special Conditions is a mandatory prerequisite for the provision of the Images as part of the MediaEval 2020 evaluation campaign. A complete reading of these conditions is necessary; they require the user, for example, to obscure the faces (blurring, black banner, etc.) in the videos before use in any publication, and to destroy the data by October 1st, 2021.

3 DATASET DESCRIPTION
In the MediaEval 2020 campaign, we release the same subset of the TTStroke-21 dataset as last year. The only differences are the blurring of the faces and the specification of whether the player is right-handed or left-handed. The dataset was recorded in a sports faculty facility using lightweight equipment, such as GoPro cameras. It consists of player-centred videos recorded in natural conditions without markers or sensors, see Fig. 1. It comprises 20 table tennis stroke classes, i.e. 8 services: Serve Forehand Backspin, Serve Forehand Loop, Serve Forehand Sidespin, Serve Forehand Topspin, Serve Backhand Backspin, Serve Backhand Loop, Serve Backhand Sidespin, Serve Backhand Topspin; 6 offensive strokes: Offensive Forehand Hit, Offensive Forehand Loop, Offensive Forehand Flip, Offensive Backhand Hit, Offensive Backhand Loop, Offensive Backhand Flip; and 6 defensive strokes: Defensive Forehand Push, Defensive Forehand Block, Defensive Forehand Backspin, Defensive Backhand Push, Defensive Backhand Block, Defensive Backhand Backspin. In addition, all the strokes can be divided into two super-classes: Forehand and Backhand. This taxonomy was designed with professional table tennis teachers.

All videos are recorded in MPEG-4 format. Unlike the task at MediaEval 2019 [16], most of the faces are blurred. To do so, faces are detected with the OpenCV deep learning face detector, based on the Single Shot Detector (SSD) framework with a ResNet base network, for each frame of the original video. The detected faces are blurred and the frames are re-encoded into a video.

The organisation of the delivered data is as follows:
• The provided dataset is split into two subsets: i) a training set and ii) a test set;
• Each directory contains several videos (in MPEG-4 format), and each video may contain several actions;
• Each video file is provided with an XML file describing the actions present in the video and whether the player is right-handed or left-handed;
• Each action has 3 attributes: the starting frame, the ending frame, and the stroke class;
• In the training set XML files, all the attributes are specified. In the test set XML files, only the starting and ending frames are specified. The stroke class attribute is purposely set to the value "Unknown" and should be updated by the participants to one of the 20 valid classes.

4 TASK DESCRIPTION
The Sport Video Annotation task consists, for each action of each test video, in assigning a label using a given taxonomy of 20 classes of table tennis strokes.

Participants may submit up to five runs. For each run, they must provide one XML file per video file, containing the actions associated with the recognised stroke class. Runs may be submitted as an archive (zip or tar.gz file) with each run in a different directory. Participants should also indicate whether any external data, such as other datasets or pretrained networks, was used to compute their runs. The task is considered fully automatic: once the videos are provided to the system, results should be produced without any human intervention.

5 EVALUATION
For MediaEval 2020, we propose a lightweight classification task. It consists in the classification of table tennis strokes whose temporal borders are supplied in the XML files accompanying each video file. Hence, for each test video, the participants are invited to produce an XML file in which each stroke is labelled according to the given taxonomy. This means that the default label "Unknown" has to be replaced by the label of the stroke class that the participant's system has assigned. All submissions will be evaluated in terms of per-class accuracy (A_i) and global accuracy (GA).

The organizers will also provide the participants with different confusion matrices: one considering all the classes, and others considering the type of the stroke (such as offensive or defensive) and/or the forehand and backhand super-classes of the strokes.

6 DISCUSSION
The participants from previous years reached maximum accuracies of 22.9% [17], 14.1% [1] and 11.3% [25], leaving room for improvement. Participants are welcome to share their difficulties and their results even if these do not seem sufficiently good.
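The evaluation measures are not spelled out as formulas in this overview. Under the usual reading, per-class accuracy A_i is the fraction of class-i strokes labelled correctly, and global accuracy GA is the fraction of all strokes labelled correctly. A minimal sketch (the function and variable names are ours, not from the task's evaluation code):

```python
from collections import Counter


def accuracies(y_true, y_pred):
    """Global accuracy GA and per-class accuracies A_i for paired label lists."""
    assert len(y_true) == len(y_pred) and y_true
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct per class
    totals = Counter(y_true)                                     # strokes per class
    ga = sum(hits.values()) / len(y_true)
    per_class = {c: hits[c] / n for c, n in totals.items()}
    return ga, per_class
```

Note that GA weights every stroke equally, so frequent classes dominate it, whereas the mean of the A_i values treats all 20 classes equally.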
REFERENCES
[1] Jordan Calandre, Renaud Péteri, and Laurent Mascarilla. 2019. Optical Flow Singularities for Sports Video Annotation: Detection of Strokes in Table Tennis, See [11].
[2] João Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, and Andrew Zisserman. 2018. A Short Note about Kinetics-600. CoRR abs/1808.01340 (2018).
[3] João Carreira, Eric Noland, Chloe Hillier, and Andrew Zisserman. 2019. A Short Note on the Kinetics-700 Human Action Dataset. CoRR abs/1907.06987 (2019).
[4] João Carreira and Andrew Zisserman. 2017. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. (2017), 4724–4733.
[5] Moritz Einfalt, Dan Zecha, and Rainer Lienhart. 2018. Activity-Conditioned Continuous Human Pose Estimation for Performance Analysis of Athletes Using the Example of Swimming. In IEEE WACV 2018, Lake Tahoe, NV, USA, March 12-15, 2018. 446–455.
[6] Kazi Ahmed Asif Fuad, Pierre-Etienne Martin, Romain Giot, Romain Bourqui, Jenny Benois-Pineau, and Akka Zemmari. 2020. Feature Understanding in 3D CNNs for Actions Recognition in Video. In Tenth International Conference on Image Processing Theory, Tools and Applications, IPTA 2020, Paris, France, November 9-12, 2020. 1–6.
[7] Chunhui Gu, Chen Sun, David A. Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, and Jitendra Malik. 2018. AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions. (2018), 6047–6056.
[8] Hueihan Jhuang, Juergen Gall, Silvia Zuffi, Cordelia Schmid, and Michael J. Black. 2013. Towards Understanding Action Recognition. In IEEE ICCV 2013, Sydney, Australia, December 1-8, 2013. IEEE Computer Society, 3192–3199.
[9] Will Kay, João Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. 2017. The Kinetics Human Action Video Dataset. CoRR abs/1705.06950 (2017).
[10] Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso A. Poggio, and Thomas Serre. 2011. HMDB: A large video database for human motion recognition. In IEEE ICCV 2011, Barcelona, Spain, November 6-13, 2011, Dimitris N. Metaxas, Long Quan, Alberto Sanfeliu, and Luc Van Gool (Eds.). IEEE Computer Society, 2556–2563.
[11] Martha A. Larson, Steven Alexander Hicks, Mihai Gabriel Constantin, Benjamin Bischke, Alastair Porter, Peijian Zhao, Mathias Lux, Laura Cabrera Quiros, Jordan Calandre, and Gareth Jones (Eds.). 2020. Working Notes Proceedings of the MediaEval 2019 Workshop, Sophia Antipolis, France, 27-30 October 2019. CEUR Workshop Proceedings, Vol. 2670. CEUR-WS.org.
[12] Ang Li, Meghana Thotakuri, David A. Ross, João Carreira, Alexander Vostrikov, and Andrew Zisserman. 2020. The AVA-Kinetics Localized Human Actions Video Dataset. CoRR abs/2005.00214 (2020).
[13] Hsien-I Lin, Zhangguo Yu, and Yi-Chen Huang. 2020. Ball Tracking and Trajectory Prediction for Table-Tennis Robots. Sensors 20, 2 (2020).
[14] Ruichen Liu, Zhelong Wang, Xin Shi, Hongyu Zhao, Sen Qiu, Jie Li, and Ning Yang. 2019. Table Tennis Stroke Recognition Based on Body Sensor Network. In IDCS 2019, Naples, Italy, October 10-12, 2019, Proceedings (Lecture Notes in Computer Science), Raffaele Montella, Angelo Ciaramella, Giancarlo Fortino, Antonio Guerrieri, and Antonio Liotta (Eds.), Vol. 11874. Springer, 1–10.
[15] Zheng Liu and Haifeng Hu. 2019. Spatiotemporal Relation Networks for Video Action Recognition. IEEE Access 7 (2019), 14969–14976.
[16] Pierre-Etienne Martin, Jenny Benois-Pineau, Boris Mansencal, Renaud Péteri, Laurent Mascarilla, Jordan Calandre, and Julien Morlier. 2019. Sports Video Annotation: Detection of Strokes in Table Tennis Task for MediaEval 2019, See [11].
[17] Pierre-Etienne Martin, Jenny Benois-Pineau, Boris Mansencal, Renaud Péteri, and Julien Morlier. 2019. Siamese Spatio-Temporal Convolutional Neural Network for Stroke Classification in Table Tennis Games, See [11].
[18] Pierre-Etienne Martin, Jenny Benois-Pineau, Renaud Péteri, and Julien Morlier. 2018. Sport Action Recognition with Siamese Spatio-Temporal CNNs: Application to Table Tennis. In CBMI 2018, La Rochelle, France, September 4-6, 2018. IEEE, 1–6.
[19] Pierre-Etienne Martin, Jenny Benois-Pineau, Renaud Péteri, and Julien Morlier. 2019. Optimal Choice of Motion Estimation Methods for Fine-Grained Action Classification with 3D Convolutional Networks. In IEEE ICIP 2019, Taipei, Taiwan, September 22-25, 2019. IEEE, 554–558.
[20] Pierre-Etienne Martin, Jenny Benois-Pineau, Renaud Péteri, and Julien Morlier. 2020. 3D attention mechanisms in Twin Spatio-Temporal Convolutional Neural Networks. Application to action classification in videos of table tennis games. In ICPR 2020, MiCo Milano Congress Center, Italy, 10-15 January 2021.
[21] Pierre-Etienne Martin, Jenny Benois-Pineau, Renaud Péteri, and Julien Morlier. 2020. Fine grained sport action recognition with Twin spatio-temporal convolutional neural networks. Multim. Tools Appl. 79, 27-28 (2020), 20429–20447.
[22] Juan Carlos Niebles, Chih-Wei Chen, and Fei-Fei Li. 2010. Modeling Temporal Structure of Decomposable Motion Segments for Activity Classification. In Computer Vision - ECCV 2010, Heraklion, Crete, Greece, September 5-11, 2010, Proceedings, Part II (Lecture Notes in Computer Science), Kostas Daniilidis, Petros Maragos, and Nikos Paragios (Eds.), Vol. 6312. Springer, 392–405.
[23] Dian Shao, Yue Zhao, Bo Dai, and Dahua Lin. 2020. FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding. CoRR abs/2004.06704 (2020).
[24] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild. CoRR abs/1212.0402 (2012).
[25] Siddharth Sriraman, Srinath Srinivasan, Vishnu K. Krishnan, Bhuvana J, and T. T. Mirnalinee. 2019. MediaEval 2019: LRCNs for Stroke Detection in Table Tennis, See [11].
[26] Andrei Stoian, Marin Ferecatu, Jenny Benois-Pineau, and Michel Crucianu. 2016. Fast Action Localization in Large-Scale Video Archives. IEEE Trans. Circuits Syst. Video Techn. 26, 10 (2016), 1917–1930.
[27] S. S. Tabrizi, S. Pashazadeh, and V. Javani. 2020. Comparative Study of Table Tennis Forehand Strokes Classification Using Deep Learning and SVM. IEEE Sensors Journal (2020), 1–1.
[28] Gül Varol, Ivan Laptev, and Cordelia Schmid. 2018. Long-Term Temporal Convolutions for Action Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 40, 6 (2018), 1510–1517.
[29] Roman Voeikov, Nikolay Falaleev, and Ruslan Baikulov. 2020. TTNet: Real-time temporal and spatial video analysis of table tennis. CoRR abs/2004.09927 (2020).
[30] Jiachen Wang, Kejian Zhao, Dazhen Deng, Anqi Cao, Xiao Xie, Zheng Zhou, Hui Zhang, and Yingcai Wu. 2020. Tac-Simur: Tactic-based Simulative Visual Analytics of Table Tennis. IEEE Trans. Vis. Comput. Graph. 26, 1 (2020), 407–417.
[31] Erwin Wu and Hideki Koike. 2020. FuturePong: Real-time Table Tennis Trajectory Forecasting using Pose Prediction Network. In CHI 2020, Honolulu, HI, USA, Regina Bernhaupt, Florian 'Floyd' Mueller, David Verweij, Josh Andres, Joanna McGrenere, Andy Cockburn, Ignacio Avellino, Alix Goguey, Pernille Bjørn, Shengdong Zhao, Briane Paul Samson, and Rafal Kocielnik (Eds.). ACM, 1–8.
[32] Kun Xia, Hanyu Wang, Menghan Xu, Zheng Li, Sheng He, and Yusong Tang. 2020. Racquet Sports Recognition Using a Hybrid Clustering Model Learned from Integrated Wearable Sensor. Sensors 20, 6 (2020), 1638.