Spatio-Temporal CNN Baseline Method for the Sports Video Task of MediaEval 2021 Benchmark

Pierre-Etienne Martin
CCP Department, Max Planck Institute for Evolutionary Anthropology, D-04103 Leipzig, Germany
pierre_etienne_martin@eva.mpg.de

Figure 1: Spatio-Temporal CNN architecture for Stroke Classification and Detection. (RGB video stream processed by three 3x3x3 convolution, pooling and ReLU blocks, followed by fully connected layers and a SoftMax output: 2 units for detection, 20 for classification.)

ABSTRACT
This paper presents the baseline method proposed for the Sports Video task of the MediaEval 2021 benchmark. The task comprises a stroke detection subtask and a stroke classification subtask; this baseline addresses both. The spatio-temporal CNN architecture and the training process of the model are tailored to the addressed subtask. The method is meant to help the participants solve the task and is not intended to reach state-of-the-art performance. Still, on the detection subtask, the baseline performs better than the other participants, which stresses the difficulty of such a task.

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
MediaEval'21, December 13-15 2021, Online

1 INTRODUCTION
Most recent action detection and classification methods in the literature rely on deep learning approaches and high-dimensional feature spaces. In the domain of image classification, a specific kind of neural network has become very popular: the Convolutional Neural Network (CNN). Since the breakthrough at the 2012 ImageNet Challenge, CNNs have brought great improvements to image classification.

For video applications in general, and action recognition in particular, the first models proposed were direct extensions of image classification methods [1, 20] using 2D convolutions. However, to better capture the temporal information proper to video content, the use of 3D convolutions has emerged [7, 8]. Temporal information can also be captured through motion extracted from successive frames, such as the optical flow. The latter can be used i) as a single modality or in parallel with the RGB information [2, 3, 20, 21]; or ii) to train a network that extracts motion features for classification at a later stage [4]. These methods also raise the question of how to fuse the different modalities [5, 10]. In [9, 19], the estimated pose is used jointly with these two modalities to perform action classification. In [15], all three modalities are used and fused in order to perform stroke classification.

As part of the task organization, and for the first time since the beginning of the Sports Video task (in 2019 [11]), we decided to provide a baseline to alleviate minor aspects of the task, such as video and XML processing, and to help the participants with their submissions. The baseline method uses a 3D CNN inspired by [13, 14]. We adjusted the method to address both subtasks of this year's edition [12]: stroke detection and stroke classification from videos of the TTStroke-21 corpus. The implementation of the method is publicly available on GitHub¹.

¹ https://github.com/ccp-eva/SportTaskME21

2 METHOD
In order to perform classification and detection, we consider the model architecture presented in Fig. 1. For each subtask, a distinct model is trained on the train set. We train both models using a stochastic gradient approach with a Nesterov momentum of 0.5 [16], a weight decay of 0.005 [6] and a constant learning rate of 0.0001. Both models are trained over 500 epochs. The objective function is the cross-entropy loss of the output processed by the softmax function (eq. 1), summed over the batch:

\mathcal{L}(y, class) = -\log\left( \frac{\exp(y_{class})}{\sum_{i}^{N} \exp(y_i)} \right) \qquad (1)

At each epoch, the model is validated on the validation set. The model performing best on this set is saved and then evaluated on the test set. The model is fed with the video frames resized to 120 x 120 pixels and stacked successively in cuboids of length 98 frames, representing approximately 0.82 seconds.

For the detection task, we inferred Non-stroke segments from the annotated Stroke segments. We considered only segments between two consecutive strokes longer than 200 frames. Such a segment is divided into successive, non-overlapping blocks of 200 frames, each added as a negative sample for training the model. The 200-frame split yields a suitable number of negative samples: from the 783 train and 234 validation segments, we inferred 1196 and 260 negative segments, respectively. No negative segments were inferred from the test set. Stroke detection is then tackled as a classification task with two classes: Stroke and Non-stroke. From the test set, which has no temporal boundaries, we created window proposals of length 150 frames every 150 frames for all the videos. This size was chosen empirically and is meant to be revised to achieve good performance.

For the classification task, not all classes were represented in the dataset, but we still consider all 20 possible stroke classes. To train the model, we input the RGB cuboids composed of the successive frames from the starting frame of the considered segment. The desired output is the class vector, summing to one and binary at training time. Its length is the number of considered classes: 2 for detection and 20 for classification. Each element represents the probability of belonging to a class.

Table 1: Baseline performance in terms of accuracy (%).

Global    Type and Hand-Sided    Type    Hand-Side
 20.4             33             48.9      59.3
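As a concrete illustration of eq. (1), the per-sample objective (the cross-entropy of the softmax output for the ground-truth class) can be sketched in plain Python. This is an illustrative re-implementation, not code from the baseline repository:

```python
import math

def cross_entropy(y, class_index):
    """Eq. (1): -log(exp(y[class_index]) / sum_i exp(y[i])),
    computed via a max-shifted log-sum-exp for numerical stability."""
    m = max(y)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in y))
    return log_sum_exp - y[class_index]

# With uniform raw scores over N outputs the loss is log(N):
detection_scores = [0.0, 0.0]  # 2 outputs: Stroke / Non-stroke
assert abs(cross_entropy(detection_scores, 0) - math.log(2)) < 1e-12
```

At training time, the batch loss is the sum of these per-sample terms, as stated above.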
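The inference of Non-stroke segments described above (gaps between two consecutive strokes longer than 200 frames, split into non-overlapping 200-frame blocks) can be sketched as follows. All names are hypothetical rather than taken from the published implementation, and discarding the remainder of each gap is an assumption:

```python
def infer_negative_segments(strokes, block=200):
    """Infer Non-stroke segments from annotated Stroke segments.

    `strokes`: list of (start, end) frame intervals, sorted by start.
    Only gaps between two consecutive strokes that are longer than
    `block` frames are kept; each gap is cut into non-overlapping
    blocks of `block` frames (the remainder is discarded here)."""
    negatives = []
    for (_, prev_end), (next_start, _) in zip(strokes, strokes[1:]):
        if next_start - prev_end > block:
            for start in range(prev_end, next_start - block + 1, block):
                negatives.append((start, start + block))
    return negatives

print(infer_negative_segments([(0, 100), (550, 700), (1000, 1200)]))
# [(100, 300), (300, 500), (700, 900)]
```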
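The test-time window proposals (one 150-frame window every 150 frames over each untrimmed video) admit a similarly small sketch; again, the names are illustrative, and dropping a trailing partial window is an assumption:

```python
def window_proposals(num_frames, length=150, stride=150):
    """Generate fixed-length window proposals over an untrimmed video.
    With stride == length the windows do not overlap; trailing frames
    that do not fill a whole window are dropped."""
    return [(start, start + length)
            for start in range(0, num_frames - length + 1, stride)]

print(window_proposals(500))
# [(0, 150), (150, 300), (300, 450)]
```

Each proposal is then classified independently by the detection model.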
During inference, we follow a similar procedure, and the class decision is the argmax of the output vector.

3 RESULTS
This section presents the results per subtask according to the metrics presented in [12].

3.1 Subtask 1 - Stroke Detection
The detection subtask was tackled as a classification task over the stroke and non-stroke samples. After 500 epochs, the model reached 98.3% and 75.7% accuracy on the train and validation sets, respectively. On the test set, the model is evaluated using the mAP metric, which takes into account the number of detected actions and their overlap with the ground truth. The baseline achieves an mAP of 0.0173, which the two participants in this subtask did not outperform.

Runs are also evaluated using a global IoU that considers only the frame-wise overlap of the detected strokes with the ground-truth annotations; the number of detected strokes is no longer taken into account. The baseline achieves a global IoU of 0.144, which was outperformed by one participant.

The method's performance is quite low because the method is kept relatively simple. It also relies on a straightforward, inefficient window proposal to segment the strokes, without fusing the output decisions. Indeed, two consecutive windows that belong to the same stroke and are both classified as strokes are counted as two different strokes instead of a single one, which penalizes the mAP metric. The method can easily be improved by considering better proposals and by fusing the output decisions.

3.2 Subtask 2 - Stroke Classification
The results for the stroke classification subtask on the test set are reported in Table 1. The table is divided into sections corresponding to successively coarser classifications. After training, the model reached only 25.2% and 28.9% accuracy on the train and validation sets, respectively.

• "Global" considers all 20 classes;
• "Type" considers only the type of the stroke: Defensive, Offensive or Service;
• "Hand-Side" considers only the Forehand and Backhand super-classes;
• "Type and Hand-Sided" considers the intersection of the two last clusterings, leading to 6 classes.

The confusion matrices for "Type" and "Hand-Side" are also depicted in Fig. 2 for further analysis.

Figure 2: Confusion matrices with higher-level categories (a. Type; b. Hand-Side).

From Table 1, we can state that the performance of the baseline, considering all classes, is limited. This may be improved by further analysis of the corpus and further training. Indeed, only 18 of the 20 possible classes were present in the corpus this year, which reduces the complexity of the task and could have been taken into account in the model's design. Fig. 2.a reveals that the services have not been learned at all, which is undoubtedly due to the input processing during training, which considers only the first 100 frames and is therefore unable to capture features from these longer strokes. Finally, Fig. 2.b underlines the main weakness of the model: its inability to distinguish Forehand from Backhand strokes. The pipeline could consider higher-level categories, following a cascade approach, to improve performance. Two of the three participants outperformed the baseline by far [17, 18].

4 CONCLUSION
This baseline is intended to help the participants solve the Sports Video Task. Its performance remains limited, but its publicly available implementation allows the participants not to start from scratch. Many aspects of the method may be improved, such as the data processing: spatial and temporal ROIs may increase performance. The same applies to the architecture of the model, which was kept very simple, and to the training procedure, which could have merged the train and validation sets before inferring on the test set.

The detection subtask appears to be challenging. No participant was able to beat the baseline performance with regard to the mAP metric, which is the ranking metric. This subtask is new in the Sports Video Task, which also explains the low results obtained. However, we believe much improvement can be obtained, since our method tackled it as a classification task, and its window proposal is very crude and can easily be improved.

The classification subtask gathered more participants with, overall, more successful performance. This may be explained by the task's presence in previous editions of the MediaEval benchmark and by the more active investigation in this field.

Next year, we plan to gather ideas from this year's submissions to improve the baseline and give a more substantial base to the new participants joining the Sports Video Task.

REFERENCES
[1] Hakan Bilen, Basura Fernando, Efstratios Gavves, and Andrea Vedaldi. 2018. Action Recognition with Dynamic Image Networks. IEEE Trans. Pattern Anal. Mach. Intell. 40, 12 (2018), 2799–2813.
[2] Jordan Calandre, Renaud Péteri, and Laurent Mascarilla. 2019. Optical Flow Singularities for Sports Video Annotation: Detection of Strokes in Table Tennis. In MediaEval (CEUR Workshop Proceedings), Vol. 2670. CEUR-WS.org.
[3] João Carreira and Andrew Zisserman. 2017. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In CVPR. IEEE Computer Society, 4724–4733.
[4] Nieves Crasto, Philippe Weinzaepfel, Karteek Alahari, and Cordelia Schmid. 2019. MARS: Motion-Augmented RGB Stream for Action Recognition. In CVPR. IEEE Computer Society, 7882–7891.
[5] Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. 2016. Convolutional Two-Stream Network Fusion for Video Action Recognition. In CVPR. IEEE Computer Society, 1933–1941.
[6] Stephen Jose Hanson and Lorien Y. Pratt. 1988. Comparing Biases for Minimal Network Construction with Back-Propagation. In NIPS. 177–185.
[7] Ho Joon Kim, Joseph S. Lee, and Hyun Seung Yang. 2007. Human Action Recognition Using a Modified Convolutional Neural Network. In ISNN (2) (Lecture Notes in Computer Science), Vol. 4492. Springer, 715–723.
[8] Tiago Lima, Bruno J. T. Fernandes, and Pablo V. A. Barros. 2017. Human action recognition with 3D convolutional neural network. In LA-CCI. IEEE, 1–6.
[9] Diogo C. Luvizon, David Picard, and Hedi Tabia. 2018. 2D/3D Pose Estimation and Action Recognition Using Multitask Deep Learning. In CVPR. IEEE Computer Society, 5137–5146.
[10] Pierre-Etienne Martin. 2020. Fine-Grained Action Detection and Classification from Videos with Spatio-Temporal Convolutional Neural Networks. Application to Table Tennis. (Détection et classification fines d'actions à partir de vidéos par réseaux de neurones à convolutions spatio-temporelles. Application au tennis de table). Ph.D. Dissertation. University of La Rochelle, France. https://tel.archives-ouvertes.fr/tel-03128769
[11] Pierre-Etienne Martin, Jenny Benois-Pineau, Boris Mansencal, Renaud Péteri, Laurent Mascarilla, Jordan Calandre, and Julien Morlier. 2019. Sports Video Annotation: Detection of Strokes in Table Tennis Task for MediaEval 2019. In MediaEval (CEUR Workshop Proceedings), Vol. 2670. CEUR-WS.org.
[12] Pierre-Etienne Martin, Jenny Benois-Pineau, Boris Mansencal, Renaud Péteri, Laurent Mascarilla, Jordan Calandre, and Julien Morlier. 2021. Sports Video: Fine-Grained Action Detection and Classification of Table Tennis Strokes from videos for MediaEval 2021. In MediaEval (CEUR Workshop Proceedings). CEUR-WS.org.
[13] Pierre-Etienne Martin, Jenny Benois-Pineau, Boris Mansencal, Renaud Péteri, and Julien Morlier. 2019. Siamese Spatio-Temporal Convolutional Neural Network for Stroke Classification in Table Tennis Games. In MediaEval (CEUR Workshop Proceedings), Vol. 2670. CEUR-WS.org.
[14] Pierre-Etienne Martin, Jenny Benois-Pineau, Renaud Péteri, and Julien Morlier. 2018. Sport Action Recognition with Siamese Spatio-Temporal CNNs: Application to Table Tennis. In CBMI. IEEE, 1–6.
[15] Pierre-Etienne Martin, Jenny Benois-Pineau, Renaud Péteri, and Julien Morlier. 2021. Three-Stream 3D/1D CNN for Fine-Grained Action Classification and Segmentation in Table Tennis. CoRR abs/2109.14306 (2021). arXiv:2109.14306 https://arxiv.org/abs/2109.14306
[16] Yurii E. Nesterov. 2004. Introductory Lectures on Convex Optimization - A Basic Course. Applied Optimization, Vol. 87. Springer.
[17] Trong-Tung Nguyen, Thanh-Son Nguyen, Gia-Bao Dinh Ho, Hai-Dang Nguyen, and Minh-Triet Tran. 2021. HCMUS at MediaEval 2021: Ensembles of Action Recognition Networks with Prior Knowledge for Table Tennis Strokes Classification Task. In MediaEval (CEUR Workshop Proceedings). CEUR-WS.org.
[18] Yijun Qian, Lijun Yu, Wenhe Liu, and Alexander G. Hauptmann. 2021. Learning Unbiased Transformer for Long-Tail Sports Action Classification. In MediaEval (CEUR Workshop Proceedings). CEUR-WS.org.
[19] Grégory Rogez, Philippe Weinzaepfel, and Cordelia Schmid. 2020. LCR-Net++: Multi-Person 2D and 3D Pose Detection in Natural Images. IEEE Trans. Pattern Anal. Mach. Intell. 42, 5 (2020), 1146–1161.
[20] Karen Simonyan and Andrew Zisserman. 2014. Two-Stream Convolutional Networks for Action Recognition in Videos. In NIPS. 568–576.
[21] Xuanhan Wang, Lianli Gao, Peng Wang, Xiaoshuai Sun, and Xianglong Liu. 2018. Two-Stream 3-D convNet Fusion for Action Recognition in Videos With Arbitrary Size and Length. IEEE Trans. Multimedia 20, 3 (2018), 634–644.