=Paper=
{{Paper
|id=Vol-2670/MediaEval_19_paper_36
|storemode=property
|title=MediaEval 2019: LRCNs for Stroke Detection in Table Tennis
|pdfUrl=https://ceur-ws.org/Vol-2670/MediaEval_19_paper_36.pdf
|volume=Vol-2670
|authors=Siddharth Sriraman,Srinath Srinivasan,Vishnu K Krishnan,Bhuvana J,T.T. Mirnalinee
|dblpUrl=https://dblp.org/rec/conf/mediaeval/SriramanSKJM19
}}
==MediaEval 2019: LRCNs for Stroke Detection in Table Tennis==
MediaEval 2019: LRCNs for Stroke Detection in Table Tennis

Siddharth Sriraman, Srinath Srinivasan, Vishnu K Krishnan, Bhuvana J, T.T. Mirnalinee
SSN College of Engineering, India
(siddharth18150,srinath18205,vishnukrishnan18200)@cse.ssn.edu.in, (bhuvanaj,mirnalineett)@ssn.edu.in

Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'19, 27-29 October 2019, Sophia Antipolis, France.

ABSTRACT

Recognizing actions in videos is one of the most widely researched tasks in video analytics. Sports action recognition is one such area that has been extensively researched in order to make strategic decisions in athletic training. We present a model to classify strokes made by table tennis players as part of the 2019 MediaEval Challenge. Our approach extracts features into a spatio-temporal model trained on the MediaEval Sports Video Classification dataset to detect the move made.

1 INTRODUCTION

In this paper we discuss our method to classify strokes in a dataset consisting of various strokes performed by table tennis players during games [5]. The dataset consists of 20 different shot techniques on which the classification is based, and these moves are shot in natural conditions. Research into this specific domain can improve athletic performance by computer-aided analysis of moves.

The main challenge we faced was to train a model that would carefully take into account the temporal features. We implemented an existing spatio-temporal model for this problem and discuss our results on the given dataset. We applied a Long-term Recurrent Convolutional Network (LRCN) [1]. The work mainly focuses on testing the implementation and feasibility of using this model in such a setting.

2 RELATED WORK

Extensive research has been carried out in the field of action recognition in videos, which usually tends to focus on recognising a large number of actions using spatio-temporal models. These videos usually last longer than table tennis strokes, which are relatively brief. Our focus is on sports video classification, specifically table tennis.

While 2D ConvNets like VGG16 [7] have produced outstanding results for image classification, video classification research has focused on extending this to the temporal dimension using 3D ConvNets [4] and Long-term Recurrent Convolutional Networks. The application of action recognition to table tennis for stroke detection [6] has been researched, and the closest work uses a 3D ConvNet model along with optical flow data. Our approach does not use optical flow data to detect the moves and instead directly uses the frame sequences. Although this reduces the complexity of the model, we found using larger batch sizes more demanding to run.

3 APPROACH

The approach we used had to take temporal information in the frames into account efficiently, because the moves have very subtle differences. This low inter-class variability was the main obstacle we faced. A vanilla Convolutional Neural Network (CNN) with a rolling average prediction works well enough for highly spatial data, since each class is very distinct from the others. Here, due to low inter-class variability, the move could not be classified from just a single frame, so we decided to implement a spatio-temporal model. Our basic idea was to implement a Long-term Recurrent Convolutional Network (LRCN), trying multiple architectures for the CNN used in it and searching for the best hyperparameters to ensure the model performed well on moves of shorter duration.

3.1 Data Pre-processing

Time-distributed models take fixed input sizes for each mini-batch. The frames were downscaled to 80x80. Since the frame rate of the data (120 fps) is very high, the initial models used 25-frame sequences for each move. The sequences are then scaled using mean and standard deviation before being sent to the model. A smaller batch size of 64 was used to train the model, as the size of the data was large.

The moves are of varying duration, with some being significantly short-lived while others are more drawn out. This meant we had to ensure the extracted frames carried information on the entire move, irrespective of its duration. Initially, we tested a model that took 25 frames from each sequence, sampled at different rates. The rate used was every 5 frames, as higher rates showed significant skips in hand movement from one frame to the next. We then used variable sequence lengths, padded to ensure the data sent to the network were of uniform length. We also tested a variable rate for each video (based on the duration of the move) to ensure each training example had a fixed sequence length, but this did not lead to any improvements.

3.2 Model

Our approach sends an RGB frame sequence of the move to a Long-term Recurrent Convolutional Network to classify the stroke. The Convolutional Neural Network implemented is time-distributed, meaning the same CNN architecture is applied to each frame in the sequence independently, resulting in a collection of outputs whose length is the frame sequence length. The architecture is a modified version of VGG16 that adds Batch Normalization and dropout to address overfitting. The parameters were initialised using Glorot initialisation [2]. This feature sequence is then sent to a vanilla Long Short-Term Memory (LSTM) layer [3]. Another variant of this model used a second LSTM layer after that.

The LSTM outputs are sent to a standard fully-connected network that maps them to the final output of 20 classes, namely Defensive Backhand Backspin, Backhand Block, Backhand Push, Forehand Backspin, Forehand Block, Forehand Push, Offensive Backhand Flip, Backhand Hit, Backhand Loop, Forehand Flip, Forehand Hit, Forehand Loop, Serve Backhand Backspin, Serve Backhand Loop, Serve Backhand Sidespin, Serve Backhand Topspin, Serve Forehand Backspin, Serve Forehand Loop, Serve Forehand Sidespin and Serve Forehand Topspin.

The hyperparameters adopted by our approach are listed in Table 1 and the training metrics are shown in Table 2.

Table 1: Training Hyperparameters

  Hyperparameter                     Value
  Learning Rate                      1e-3
  Weight Initialisation              Glorot
  CNN Activation Function            ReLU
  Final Activation Function          Softmax
  Decay (in Adam)                    1e-6
  Loss Function                      Categorical Crossentropy
  Regularisation Parameter           0.001
  Batch Size                         64
  Dropout (retention probability)    0.5
  K-Fold Cross-Validation Splits     8

Table 2: Training Metrics

  Metric                 Value
  Number of epochs       30-40
  Validation loss        1.66
  Validation accuracy    0.829
  Train loss             1.63
  Train accuracy         0.839

4 RESULTS AND ANALYSIS

The model was only able to classify 40 out of the 354 test moves (11.3%), showing that using an LRCN was not a viable option for this dataset (Table 3). The training set was skewed towards certain classes more than others, and Stratified K-Fold Cross-Validation was used to ensure the best train and validation splits were chosen. The per-class accuracy data shows that the model learnt certain moves very well, but could not learn the moves which had less data to work with. We also observed that short and swift actions like table tennis moves are not efficiently learnt by sequence models like LSTMs, and that they require data of other forms in addition to direct RGB frame pixel data.

Table 3: Test Run

  Test Run Accuracy    Ratio
  0.1130               40/354

The different variants of the model did not show very significant changes in results, with the best run reaching 11.3% accuracy (Table 3). The only difference in the other run we submitted was allowing the model to train for 5 more epochs; the training results we obtained were very similar to this run. All other hyperparameters were kept the same.

Using only the RGB sequence information (without optical flow data), the model could not generalise over the specific differences between classes of moves on the test set, leading to a low test accuracy. We observed that the model could not predict certain moves at all, while performing reasonably well on other classes. A closer analysis shows that the model fails to distinguish between the moves belonging to a specific group (such as Serve, Defensive, Offensive), as the differences are very intricate. The model tended to prefer certain moves significantly more than others on the test set, which arose from the distribution of the training set. Using uniform amounts of data per class resulted in a very low number of examples to train on. The difference in accuracy between test and validation data might be due to the frequency of the different classes on the test set differing from the training and validation sets.

Overfitting was a major problem. The model faced significant overfitting issues during training even when the complexity was reduced and regularisation was employed. Using multiple dense layers before the final activations caused overfitting; another cause was the depth of the CNN used. Kernel regularisers were used for all the CNN layers. Using dropout with a 0.3 to 0.5 retention probability for the LSTM layer showed the best results.

5 DISCUSSION AND OUTLOOK

The main insight we gained is that data with classes having very low variability require more complex models and richer extracted features. We learnt that data of this kind cannot be generalised even by models known to work well with spatio-temporal data. This was due to the model overfitting, learning the intricacies of the moves too deeply from the training data; hence it failed to reproduce the results on the test data.

We also gained an understanding of why swift action moves are difficult to classify, the limitations of sequence models in this regard, and how temporal features with low variability cannot be differentiated easily. Working on the MediaEval Sports Classification dataset helped us grasp why problems in this domain are important to solve and how we should approach data of this kind in the future.

REFERENCES

[1] Jeff Donahue, Lisa Anne Hendricks, Marcus Rohrbach, Subhashini Venugopalan, Sergio Guadarrama, Kate Saenko, and Trevor Darrell. 2014. Long-term Recurrent Convolutional Networks for Visual Recognition and Description. arXiv:1411.4389.
[2] Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (Proceedings of Machine Learning Research), Yee Whye Teh and Mike Titterington (Eds.), Vol. 9. PMLR, Chia Laguna Resort, Sardinia, Italy, 249-256. http://proceedings.mlr.press/v9/glorot10a.html
[3] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Comput. 9, 8 (Nov. 1997), 1735-1780. https://doi.org/10.1162/neco.1997.9.8.1735
[4] S. Ji, W. Xu, M. Yang, and K. Yu. 2013. 3D Convolutional Neural Networks for Human Action Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 1 (Jan 2013), 221-231. https://doi.org/10.1109/TPAMI.2012.59
[5] Pierre-Etienne Martin, Jenny Benois-Pineau, Boris Mansencal, Renaud Péteri, Laurent Mascarilla, Jordan Calandre, and Julien Morlier. 2019. Sports Video Annotation: Detection of Strokes in Table Tennis task for MediaEval 2019. In Proc. of the MediaEval 2019 Workshop, Sophia Antipolis, France, 27-29 October 2019.
[6] P. Martin, J. Benois-Pineau, R. Péteri, and J. Morlier. 2018. Sport Action Recognition with Siamese Spatio-Temporal CNNs: Application to Table Tennis. In 2018 International Conference on Content-Based Multimedia Indexing (CBMI). 1-6. https://doi.org/10.1109/CBMI.2018.8516488
[7] Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv:1409.1556.
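
APPENDIX: ILLUSTRATIVE SKETCHES

The pre-processing in Section 3.1 (sampling every 5th frame of a 120 fps clip, padding or trimming to a 25-frame sequence, then mean/standard-deviation scaling) can be sketched in NumPy as follows. This is a minimal illustration written for this page, not the authors' code; the function name and zero-padding choice are our assumptions.

```python
import numpy as np

def preprocess_clip(frames, rate=5, seq_len=25):
    """Subsample a high-frame-rate clip, pad/trim to a fixed length,
    and standardise pixel values.

    frames: array of shape (num_frames, H, W, C), already downscaled
    (e.g. to 80x80 as in the paper).
    NOTE: hypothetical helper, not taken from the authors' code.
    """
    sampled = frames[::rate][:seq_len]          # keep every `rate`-th frame
    if len(sampled) < seq_len:                  # zero-pad short moves (assumption)
        pad = np.zeros((seq_len - len(sampled),) + sampled.shape[1:],
                       dtype=sampled.dtype)
        sampled = np.concatenate([sampled, pad], axis=0)
    sampled = sampled.astype(np.float32)
    # scale using mean and standard deviation, as described in Section 3.1
    return (sampled - sampled.mean()) / (sampled.std() + 1e-8)
```

For a one-second move at 120 fps this yields 24 sampled frames plus one padding frame, i.e. a fixed (25, 80, 80, 3) input per example.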
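The "time-distributed" idea in Section 3.2 — the same CNN applied to each frame independently, producing one feature vector per time step for the LSTM — can be illustrated with a toy shared linear map standing in for the modified VGG16. Everything below (names, the 128-d feature size, the linear extractor) is illustrative only, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the per-frame CNN: one shared weight matrix over the
# flattened 80x80x3 frame. The paper uses a modified VGG16 here; a
# linear map is used purely to keep the sketch self-contained.
W = (rng.standard_normal((80 * 80 * 3, 128)) * 0.01).astype(np.float32)

def frame_features(frame):
    """Extract a 128-d feature vector from one frame with shared weights."""
    return frame.reshape(-1).astype(np.float32) @ W

def time_distributed(clip):
    """Apply the same extractor to every frame independently, yielding a
    (seq_len, 128) feature sequence that would feed the LSTM layer."""
    return np.stack([frame_features(f) for f in clip])
```

Because the weights are shared across time steps, identical frames produce identical feature vectors; in a framework like Keras this corresponds to wrapping the CNN in a TimeDistributed layer before the LSTM.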
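Section 4 mentions Stratified K-Fold Cross-Validation (8 splits, Table 1) to cope with the skewed class distribution. A simple round-robin version of stratified splitting is sketched below; it is our own minimal implementation, not the (likely library-based) one the authors used.

```python
import numpy as np

def stratified_folds(labels, k=8, seed=0):
    """Deal the indices of each class round-robin into k folds, so every
    fold roughly preserves the overall class distribution.
    NOTE: illustrative sketch, not the authors' implementation.
    """
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(k)]
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)   # all samples of this class
        rng.shuffle(idx)                      # randomise within the class
        for i, sample in enumerate(idx):
            folds[i % k].append(sample)       # round-robin assignment
    return [np.array(f) for f in folds]
```

Each fold then serves once as the validation split while the rest form the training split, which is how the best train/validation partition was selected in the paper's setup.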