    MediaEval 2019: LRCNs for Stroke Detection in Table Tennis
      Siddharth Sriraman, Srinath Srinivasan, Vishnu K Krishnan, Bhuvana J, T.T. Mirnalinee
                                        SSN College of Engineering, India
       (siddharth18150,srinath18205,vishnukrishnan18200)@cse.ssn.edu.in,(bhuvanaj,mirnalineett)@ssn.edu.in

ABSTRACT
Recognizing actions in videos is one of the most widely researched tasks in video analytics. Sports action recognition is one such task, studied extensively to support strategic decisions in athletic training. We present a model that classifies strokes made by table tennis players, as part of the 2019 MediaEval challenge. Our approach extracts features into a spatio-temporal model trained on the MediaEval Sports Video Classification dataset to detect the move made.

Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval’19, 27-29 October 2019, Sophia Antipolis, France.

1 INTRODUCTION
In this paper we discuss our method for classifying strokes in a dataset of various strokes performed by table tennis players during games [5]. The dataset consists of 20 different shot techniques on which the classification is based, and these moves are filmed in natural conditions. Research into this specific domain can improve athletic performance through computer-aided analysis of moves.

The main challenge we faced was training a model that carefully takes the temporal features into account. We implemented an existing spatio-temporal model for this problem, a Long-term Recurrent Convolutional Network (LRCN) [1], and discuss our results on the given dataset. The work focuses mainly on testing the implementation and the feasibility of using this model in such a paradigm.

2 RELATED WORK
Extensive research has been carried out in the field of action recognition in videos, usually focusing on recognising a large number of actions using spatio-temporal models. The videos in such work usually last longer than table tennis strokes, which are relatively brief.

Our focus is on sports video classification, specifically table tennis. While 2D ConvNets like VGG16 [7] have produced outstanding results for image classification, video classification research has focused on extending this to the temporal dimension using 3D ConvNets [4] and Long-term Recurrent Convolutional Networks.

The application of action recognition to table tennis for stroke detection has been researched, and the closest work [6] uses a 3D ConvNet model along with optical flow data. Our approach does not use optical flow data to detect the moves and instead uses the frame sequences directly. Although this reduces the complexity of the model, we found larger batch sizes more demanding to run.

3 APPROACH
Our approach had to take temporal information in the frames into account efficiently, because the moves have very subtle differences: low inter-class variability was the main obstacle we faced. A vanilla Convolutional Neural Network (CNN) with a rolling-average prediction works well enough for highly spatial data, where each class is very distinct from the others. Here, due to the low inter-class variability, a move cannot be classified from a single frame, so we decided to implement a spatio-temporal model. Our basic idea was to implement a Long-term Recurrent Convolutional Network (LRCN), trying multiple architectures for the CNN used in it and searching for the hyperparameters that make the model perform well on moves of shorter duration.

3.1 Data Pre-processing
Time-distributed models take fixed input sizes for each mini-batch. The frames were downscaled to 80x80. Since the frame rate of the data (120 fps) is very high, the initial models used 25-frame sequences for each move. The sequences are then scaled using the mean and standard deviation before being sent into the model. A smaller batch size of 64 was used to train the model, as the data was large.

The moves are of varying duration, some significantly short-lived and others more drawn out. This meant we had to ensure the extracted frames carried information on the entire move, irrespective of its duration. Initially, we tested a model that took 25 frames from each sequence, sampled at different rates. We settled on sampling every 5th frame, as sparser sampling showed significant skips in hand movement from one frame to the next. We then used variable sequence lengths, padded so that the data sent to the network is of uniform length. We also tested a variable rate per video (based on the duration of the move) so that each training example had a fixed sequence length, but this did not lead to any improvement.

3.2 Model
Our approach sends an RGB frame sequence of the move to a Long-term Recurrent Convolutional Network to classify it. The Convolutional Neural Network is time-distributed, meaning the same CNN is applied to each frame in the sequence independently, yielding a collection of outputs whose length equals the frame-sequence length. The architecture is a modified version of VGG16 with Batch Normalization and dropout to address overfitting. The parameters were initialised using Glorot initialisation [2]. The resulting sequence is then fed to a vanilla Long Short-Term Memory (LSTM) layer [3].
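As a concrete illustration of the pre-processing in Section 3.1 (a sketch, not the authors' code), the sampling, downscaling, padding and scaling steps can be written in numpy. The block-mean resize and the zero-padding value are our own assumptions, since the paper does not specify them:

```python
import numpy as np

# Values taken from Section 3.1; the resize method and padding value
# are assumptions, as the paper does not specify them.
SEQ_LEN, RATE, SIZE = 25, 5, 80   # 25-frame sequences, every 5th frame, 80x80

def downscale(frame, size=SIZE):
    # Block-mean resize stand-in for an unspecified downscaling step.
    h, w, c = frame.shape
    fh, fw = h // size, w // size
    return frame[:fh * size, :fw * size].reshape(size, fh, size, fw, c).mean(axis=(1, 3))

def preprocess(clip):
    # clip: (n_frames, H, W, 3) array holding a 120 fps video of one move.
    seq = clip[::RATE][:SEQ_LEN]                      # sample every 5th frame
    seq = np.stack([downscale(f.astype(np.float32)) for f in seq])
    seq = (seq - seq.mean()) / (seq.std() + 1e-8)     # mean/std scaling
    if len(seq) < SEQ_LEN:                            # zero-pad short moves
        pad = np.zeros((SEQ_LEN - len(seq),) + seq.shape[1:], seq.dtype)
        seq = np.concatenate([seq, pad])
    return seq

clip = np.random.randint(0, 256, (300, 160, 160, 3), dtype=np.uint8)
seq = preprocess(clip)   # (25, 80, 80, 3), whatever the clip's duration
```

A 40-frame clip would yield only 8 sampled frames, which the padding step extends back to the fixed sequence length the time-distributed model requires.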


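The model just described can be sketched in miniature. In the following numpy toy (purely illustrative, with random weights and assumed dimensions), a single linear-plus-ReLU map stands in for the time-distributed VGG16-style CNN, followed by a vanilla LSTM cell and a 20-way softmax:

```python
import numpy as np

rng = np.random.default_rng(0)
# Frames, frame size, and classes follow the paper; the feature and
# hidden dimensions (D, HID) are illustrative choices, not the paper's.
T, H, W, D, HID, C = 25, 80, 80, 32, 64, 20

def frame_features(frame, W_cnn):
    # Stand-in for the time-distributed CNN: the *same* weights are
    # applied to every frame of the sequence independently.
    return np.maximum(0.0, frame.reshape(-1) @ W_cnn)

def lstm_step(x, h, c, Wx, Wh, b):
    # One step of a vanilla LSTM cell; gates are slices of a fused projection.
    z = x @ Wx + h @ Wh + b
    i, f, g, o = np.split(z, 4)
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    c = sig(f) * c + sig(i) * np.tanh(g)
    h = sig(o) * np.tanh(c)
    return h, c

frames = rng.standard_normal((T, H, W, 3))          # one pre-processed sequence
W_cnn = rng.standard_normal((H * W * 3, D)) * 0.01
Wx = rng.standard_normal((D, 4 * HID)) * 0.1
Wh = rng.standard_normal((HID, 4 * HID)) * 0.1
b = np.zeros(4 * HID)
W_out = rng.standard_normal((HID, C)) * 0.1

h = c = np.zeros(HID)
for t in range(T):                                  # CNN per frame, LSTM over time
    h, c = lstm_step(frame_features(frames[t], W_cnn), h, c, Wx, Wh, b)

logits = h @ W_out                                  # final fully-connected layer
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                # softmax over the 20 classes
```

The last LSTM state summarises the whole stroke, which is why the final fully-connected layer maps a single vector, not the per-frame outputs, to the 20 classes.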
Another variant of this model used a second LSTM layer after the first. The LSTM outputs are sent to a standard fully-connected network that maps them to the final output of 20 classes, namely Defensive Backhand Backspin, Backhand Block, Backhand Push, Forehand Backspin, Forehand Block, Forehand Push, Offensive Backhand Flip, Backhand Hit, Backhand Loop, Forehand Flip, Forehand Hit, Forehand Loop, Serve Backhand Backspin, Serve Backhand Loop, Serve Backhand Sidespin, Serve Backhand Topspin, Serve Forehand Backspin, Serve Forehand Loop, Serve Forehand Sidespin and Serve Forehand Topspin.

The hyperparameters adopted by our approach are listed in Table 1 and the training metrics in Table 2.

Table 1: Training Hyperparameters

 Hyperparameter                      Value
 Learning Rate                       1e-3
 Weight Initialisation               Glorot
 CNN Activation Function             ReLU
 Final Activation Function           Softmax
 Decay (in Adam)                     1e-6
 Loss Function                       Categorical Crossentropy
 Regularisation Parameter            0.001
 Batch Size                          64
 Dropout (retention probability)     0.5
 K-Fold Crossvalidation Splits       8

Overfitting was a major problem. Using multiple dense layers before the final activations caused overfitting; another cause was the depth of the CNN used. Kernel regularisers were used for all the CNN layers. Dropout with a retention probability of 0.3 to 0.5 on the LSTM layer showed the best results.

Table 2: Training Metrics

 Metric                 Value
 Number of epochs       30-40
 Validation loss        1.66
 Validation accuracy    0.829
 Train loss             1.63
 Train accuracy         0.839

4 RESULTS AND ANALYSIS
The model was able to classify only 40 of the 354 test moves (11.3%), showing that an LRCN was not a viable option for this dataset. The training set was skewed towards certain classes more than others, and Stratified K-Fold Cross Validation was used to ensure the best train and test splits were chosen. The per-class accuracy data shows that the model learnt certain moves very well but could not learn the moves that had less data to work with. We also observed that short, swift actions like table tennis moves are not learnt efficiently by sequence models like LSTMs, and that they require data in other forms in addition to raw RGB frame pixel data.

Table 3: Test Run

 Test Run Accuracy      Ratio
 0.1130                 40/354

The different variants of the model did not show very significant changes in results, with the best run reaching 11.3% accuracy (Table 3). The only difference in the other run we submitted was allowing the model to train for 5 more epochs; its training results were very similar to this run, and all other hyperparameters were kept the same. Using only the RGB sequence information (without optical flow data), the model could not generalise the specific differences between classes of moves to the test set, leading to low test accuracy. We observed that the model could not predict certain moves at all, while performing reasonably well on other classes of moves.

A closer analysis shows that the model fails to distinguish between moves belonging to a specific group (such as Serve, Defensive or Offensive), as their differences are very intricate. The model tended to prefer certain moves significantly more than others on the test set, which arose from the distribution of the training set. Using uniform amounts of data per class left very few examples to train on. The model faced significant overfitting issues during training even when its complexity was reduced and regularisation was employed. The difference in accuracy between test and validation data might be due to the class frequencies of the test set differing from those of the training and validation sets.

5 DISCUSSION AND OUTLOOK
The main insight we gained is that data whose classes have very low variability requires more complex models and richer extracted features. We learnt that data of this kind cannot be generalised even by models known to work well with spatio-temporal data: the model overfit, learning the intricacies of the moves too deeply from the training data, and hence failed to reproduce its results on the test data.

We also gained an understanding of why swift actions are difficult to classify, of the limitations of sequence models in this regard, and of how temporal features with low variability cannot be differentiated easily. Working on the MediaEval Sports Video Classification dataset helped us grasp why problems in this domain are important to solve and how we should approach data of this kind in the future.


REFERENCES
 [1] Jeff Donahue, Lisa Anne Hendricks, Marcus Rohrbach, Sub-
     hashini Venugopalan, Sergio Guadarrama, Kate Saenko, and
     Trevor Darrell. 2014. Long-term Recurrent Convolutional
     Networks for Visual Recognition and Description. (2014).
     arXiv:1411.4389
 [2] Xavier Glorot and Yoshua Bengio. 2010. Understanding the
     difficulty of training deep feedforward neural networks. In
     Proceedings of the Thirteenth International Conference on
     Artificial Intelligence and Statistics (Proceedings of Machine
     Learning Research), Yee Whye Teh and Mike Titterington
     (Eds.), Vol. 9. PMLR, Chia Laguna Resort, Sardinia, Italy,
     249–256. http://proceedings.mlr.press/v9/glorot10a.html
 [3] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-
     Term Memory. Neural Comput. 9, 8 (Nov. 1997), 1735–1780.
     https://doi.org/10.1162/neco.1997.9.8.1735
 [4] S. Ji, W. Xu, M. Yang, and K. Yu. 2013. 3D Convolutional
     Neural Networks for Human Action Recognition. IEEE Trans-
     actions on Pattern Analysis and Machine Intelligence 35, 1
     (Jan 2013), 221–231. https://doi.org/10.1109/TPAMI.2012.59
 [5] Pierre-Etienne Martin, Jenny Benois-Pineau, Boris Mansen-
     cal, Renaud Péteri, Laurent Mascarilla, Jordan Calandre, and
     Julien Morlier. 2019. Sports Video Annotation: Detection of
     Strokes in Table Tennis task for MediaEval 2019. In Proc.
     of the MediaEval 2019 Workshop, Sophia Antipolis, France,
     27-29 October 2019.
 [6] P. Martin, J. Benois-Pineau, R. Péteri, and J. Morlier. 2018.
     Sport Action Recognition with Siamese Spatio-Temporal
     CNNs: Application to Table Tennis. In 2018 International
     Conference on Content-Based Multimedia Indexing (CBMI).
     1–6. https://doi.org/10.1109/CBMI.2018.8516488
 [7] Karen Simonyan and Andrew Zisserman. 2014. Very Deep
     Convolutional Networks for Large-Scale Image Recognition.
     (2014). arXiv:1409.1556