Spatio-Temporal Based Table Tennis Hit Assessment Using LSTM
                         Algorithm
                                              Kadir Aktas1 , Mehmet Demirel2 , Marilin Moor1 ,
                                                Johanna Olesk1 , Gholamreza Anbarjafari1,3
                                                                  1 University of Tartu, Estonia
                                                       2 University of Manchester, United Kingdom
                                          3 PwC Advisory Finland, Itämerentori 2, 00180 Helsinki, Finland

                                                     (kadir.aktas,marilinm,johanna.olesk,shb)@ut.ee
                                                       mehmet.demirel@student.manchester.ac.uk
ABSTRACT                                                                             extract and resize the frames to (320 × 180) for each stroke in or-
In these working notes, we present our approach and results for                      der to use them as input data [8]. Instead, we resize the frames to
Mediaeval 2020 Sports Video Classification Task [6]. We imple-                       (120 × 80) to increase the processing speed.
mented a multi-stage pipeline with LSTM-based network. In the                           Sriraman et al. present another approach which extracts features
developed approach, firstly, the frames are extracted, sampled and                   using Convolutional Neural Network (CNN) and applies them to
resized. Then, considering that the stroke type has three different                  a spatio-temporal model [10]. They use VGG16 network [11] as
parts, each part is labelled and predicted separately. In order to                   the feature extractor and apply Long Short Term Memory (LSTM)
obtain the predicted stroke type, the predictions for each part are                  [4] layer on the extracted features. They use 25 frames, which are
fused together.                                                                      sampled by a varying rate, per each move. In our work, we use
                                                                                     21 frames based on centroids of the k-nearest neighbour method.
                                                                                     Also, we extract spatio-temporal features only and do not use a
                                                                                     CNN-based feature extractor.
1    INTRODUCTION
Sports action recognition is a well-studied research topic due to                    3     APPROACH
the wide application area and commercial value. Although many                        To face the challenge of a high number of classes and low variance
methods are developed for different sports tasks [2, 9], the challenge               between them, we designed a multi-stage approach. We divided the
of performing more precise analysis still remains open, especially                   initial 20 labels into 5 groups (see Figure 1). In stage 1 and 2, the
for low variance classification tasks such as table tennis stroke                    first and second parts of the final label are predicted. In stage 3, the
type classification. To address this claiming, Martin et al. collected               third part of the final label is predicted, however, the prediction is
TTSTROKE-21 dataset [8] and the Mediaeval 2019 and 2020 Sports                       done based on stage 1 results. For example, if the stage 1 predicts
Video Classification Task were created [6, 7].                                       ‘Serve’ then in stage 3 the model which is trained for predicting
    In this paper, we present a multi-stage spatio-temporal recog-                   one of ‘Topspin’, ‘Sidespin’, ‘Backspin’, ‘Loop’ is used. We used the
nition method using long short-term memory (LSTM) [4, 13, 15]                        same input and model structure for each stage, meaning, that we
based network. Our architecture predicts the final label in three                    trained the same model for each label subset, for 5 times in total.
stages. In the first stage, the position (serve, offensive, defensive)
is classified. In the second stage, the hand orientation (forehand,
backhand) is classified. Finally, in the third stage, the hit technique
(flip, hit, push, block, loop, topspin, backspin, sidespin) is predicted
using one of the 3 different models. The first model classifies serve
techniques. Second and third models classify offensive and defen-
sive techniques, respectively. Lastly, in order to obtain the final
stroke type, a fusion of labels                                                                             Figure 1: Labels splits

2    RELATED WORK
Recently there has been an increase in the number of studies in table                3.1    Data pre-processing
tennis stroke type recognition from videos. Martin et al. have col-                  The dataset contains videos with 120 fps and resolution of (1920 ×
lected TTSTROKE-21 dataset and proposed a Twin Spatio-Temporal                       1080). Considering that a single stroke has minimum of 100 frames
Convolutional Neural Network (TSTCNN). Their network uses an                         [8], processing the data without resizing causes memory and time-
RGB image sequence and optical flow calculations as an input. They                   related issues. So, to speed up the process and decrease memory
                                                                                     restrictions we resize each frame to (120 × 80).
                                                                                        Each move in the dataset has a varying frame range. This means
Copyright 2020 for this paper by its authors. Use permitted under Creative Commons
License Attribution 4.0 International (CC BY 4.0).                                   that we need to sample them in a fixed size as our model require
MediaEval’20, December 14-15 2020, Online                                            a fixed input size. We sample 21 frames per move. This number is
                                                                                     picked heuristically, however, we have tested that if the sample size
MediaEval’20, December 14-15 2020, Online                                                                                          K. Aktas, et al.


is too low, i.e. 7 frames per move, then the accuracy is decreased          predicted labels in the final label, we obtained 78.1% stroke type
significantly. This way we boost the processing performance and             prediction accuracy.
provide a fixed input size to our model. Such approaches are well-             These results can be explained by a couple of factors. Firstly, the
known in the video indexing [1].                                            volume and distribution of the data affect the results. In stage 3 each
    To decide on which frames to sample we use centroids of k-              label has considerably fewer data compared to the data numbers
nearest neighbor (KNN) method. This method reflects the data                of labels in other stages. Also, especially in stage 3 labels, the data
distribution as the centroids are calculated using nearest neigh-           distribution is highly biased towards some classes, causing biased
bours [16]. We use RGB values of the resized images to compute              learning. Additionally, due to the nature of the task, stage 3 labels
KNN. And, We calculate 21 centroids and then sample 21 frames               have less variance between each other compared to other stages.
closest to each of the calculated centroids, respectively. Additionally,    Lastly, since stage 3 is conditioned on the outcome of stage 1, some
we flatten each frame in order to ensure they fit into our model.           of the errors are caused by this outcome.

3.2    Model                                                                                    Table 1: Training Results
Our model uses RGB images as the input data without any prior
feature extraction. In our model, firstly, the batch normalization                     Stage   Validation Accuracy       Test Accuracy
layer is used to regularize the input data. This step processes the
input batch by batch, subtracts mean and divides by standard devia-                      1             91.4%             94.7%
tion [5]. Then, two LSTM [4] layers are included in order to capture                     2             98.7%             98%
spatio-temporal features. These layers are constructed with unit                         3             82.8%             80.1%
numbers 128 and 32 respectively. Afterwards, a fully connected                         Final           79.8%             78.1%
layer with 64 units is included to model the relation between fea-
tures and the output. Each of these 3 layers is followed by a dropout          Our method got 9.32% accuracy on the run processed by Media-
layer at the rate of 0.2 in order to prevent overfitting [12]. Finally,     Eval on a different test set. It was able to correctly predict classes
an output layer with softmax activation is added to do the classifi-        of 33 samples out of 354. Our run results show that the method was
cation.                                                                     able to achieve 50.85% for stage 1 prediction and 66.67% for stage 2
                                                                            prediction. Although the accuracy for stage 3 is not published, it is
                                                                            obvious that the model had the lowest accuracy on stage 3 with a
                                                                            big gap (See Table 2).

                                                                                                   Table 2: Run Result

                                                                                                     Stage     Accuracy
                                                                                                        1       50.85%
                   Figure 2: Model architecture                                                         2       66.67%
                                                                                                        3        N/A
                                                                                                      Final     9.32%
3.3    Training
Fully connected layers are initialized by using Glorot initializa-
                                                                               Results of the MediaEval test run show that the model did not
tion [3]. We use categorical cross-entropy as the loss function and
                                                                            learn properly. Also, when we precisely analyze the results we see
RMSprop as optimizer [14] with a learning rate of 0.0001. The train-
                                                                            that the method is biased towards some classes.
ing is done with batch size 8 in 30 epochs. 10-fold cross-validation
is applied to prevent a biased data split.
                                                                            5   DISCUSSION AND OUTLOOK
   We use the same model architecture and hyperparameters to
train 5 different models. Each model has its own purpose, so they           We obtained promising results during the training and validation,
get trained with different subsets of the training data for different       which assures there is no occurrence of overfitting. However, as the
sets of labels (see Table ??).                                              test results show, the model has failed to properly learn, i.e. was not
   We split our data into train, validation and test splits by 0.6, 0.2,    able to generalise the learning. It is expected this can be addressed
0.2 proportions. Train and validation splits are used during model          by having more labelled data in the training.
training. Test split is only used to test the trained model.                    We also argue that the low variance between the classes and
                                                                            nature of the task causes the aforementioned challenge. Considering
4     RESULTS AND ANALYSIS                                                  that a single class can be sampled in many ways for different players,
                                                                            i.e. right/left-handed or high/low experienced, we discuss that the
Training results are shown in Table 1. We obtained 94.7% test ac-
                                                                            dataset can be improved to increase the coverage of the classes as
curacy for stage 1, i.e. classifier for ‘Serve’, ‘Defensive’, ‘Offensive’
                                                                            well as reducing the bias among the classes.
labels. An accuracy of 98% was achieved for stage 2, i.e. ‘Forehand’,
‘Backhand’ classification. However, for stage 3 we got 80.1% ac-
curacy, which is much lower than others. When combining the
Sports Video Classification: Classification of Strokes in Table Tennis                MediaEval’20, December 14-15 2020, Online


REFERENCES
 [1] Maylis Delest, Anthony Don, and Jenny Benois-Pineau. 2006. DAG-
     based visual interfaces for navigation in indexed video content. Mul-
     timedia Tools and Applications 31 (10 2006), 51–72. https://doi.org/10.
     1007/s11042-006-0032-4
 [2] Mehrnaz Fani, Kanav Vats, Christopher Dulhanty, David A. Clausi,
     and John S. Zelek. 2019. Pose-Projected Action Recognition Hourglass
     Network (PARHN) in Soccer. In 16th Conference on Computer and Robot
     Vision, CRV 2019, Kingston, ON, Canada, May 29-31, 2019. 201–208.
     https://doi.org/10.1109/CRV.2019.00035
 [3] Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty
     of training deep feedforward neural networks. In Proceedings of the
     thirteenth international conference on artificial intelligence and statistics.
     249–256.
 [4] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term
     memory. Neural computation 9, 8 (1997), 1735–1780.
 [5] Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Ac-
     celerating deep network training by reducing internal covariate shift.
     arXiv preprint arXiv:1502.03167 (2015).
 [6] Pierre-Etienne Martin, Jenny Benois-Pineau, Boris Mansencal, Renaud
     Péteri, Laurent Mascarilla, Jordan Calandre, and Julien Morlier. 2020.
     Sports Video Classification: Classification of Strokes in Table Tennis
     for MediaEval 2020. In Proc. of the MediaEval 2020 Workshop, Online,
     14-15 December 2020.
 [7] Pierre-Etienne Martin, Jenny Benois-Pineau, Boris Mansencal, Renaud
     Péteri, Laurent Mascarilla, Jordan Calandre, and Julien Morlier. 2019.
     Sports Video Annotation: Detection of Strokes in Table Tennis task
     for MediaEval 2019. In MediaEval 2019 Workshop.
 [8] Pierre-Etienne Martin, Jenny Benois-Pineau, Renaud Péteri, and Julien
     Morlier. 2020. Fine grained sport action recognition with Twin spatio-
     temporal convolutional neural networks: Application to table tennis.
     Multimedia Tools and Applications (2020), 1–19.
 [9] Dian Shao, Yue Zhao, Bo Dai, and Dahua Lin. 2020. FineGym: A
     Hierarchical Video Dataset for Fine-Grained Action Understanding.
     (2020), 2613–2622. https://doi.org/10.1109/CVPR42600.2020.00269
[10] Vishnu K. Krishnan Bhuvana J Siddharth Sriraman, Srinath Srinivasan
     and T. T. Mirnalinee. 2019. MediaEval 2019: LRCNs for Stroke Detec-
     tion in Table Tennis. (2019).
[11] Karen Simonyan and Andrew Zisserman. 2014. Very deep convo-
     lutional networks for large-scale image recognition. arXiv preprint
     arXiv:1409.1556 (2014).
[12] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever,
     and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent
     neural networks from overfitting. The journal of machine learning
     research 15, 1 (2014), 1929–1958.
[13] Martin Tammvee and Gholamreza Anbarjafari. 2020. Human activity
     recognition-based path planning for autonomous vehicles. Signal,
     Image and Video Processing (2020), 1–8.
[14] Tijmen Tieleman and Geoffrey Hinton. 2012. Lecture 6.5-rmsprop:
     Divide the gradient by a running average of its recent magnitude.
     COURSERA: Neural networks for machine learning 4, 2 (2012), 26–31.
[15] Jun Wan, Chi Lin, Longyin Wen, Yunan Li, Qiguang Miao, Sergio
     Escalera, Gholamreza Anbarjafari, Isabelle Guyon, Guodong Guo, and
     Stan Z Li. 2020. ChaLearn Looking at People: IsoGD and ConGD Large-
     Scale RGB-D Gesture Recognition. IEEE Transactions on Cybernetics
     (2020).
[16] Qingjiu Zhang and Shiliang Sun. 2010. A centroid k-nearest neighbor
     method. In International Conference on Advanced Data Mining and
     Applications. Springer, 278–285.