YOLOv5 for Stroke Detection and Classification in Table Tennis

Bhuvana J, T.T. Mirnalinee, B. Bharathi, Jayasooryan S, Lokesh N N
Sri Sivasubramaniya Nadar College of Engineering, Tamil Nadu, India
(bhuvanaj,mirnalineett,bharathib)@ssn.edu.in
(jayasooryan19042,lokesh19055)@cse.ssn.edu.in

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'21, December 13-15 2021, Online

ABSTRACT
Sports action detection and classification is one of the most researched topics in video analytics. It is very useful for fine-tuning athletic training and obtaining a better analysis of an athlete's performance. We present a model to detect and classify table tennis strokes made by players as part of the MediaEval 2021 benchmark. The dataset contains many videos of actions and moves made by players with very subtle differences amongst them, so we had to take the temporal information in the frames into account in an effective manner. Given this low inter-class variability, the task was difficult to handle.

1 INTRODUCTION
Action recognition is the task of associating a predefined set of actions with a video. An automatic analysis of actions in videos is the need of the day. In this paper we propose a method to detect and classify strokes in a dataset of various table tennis strokes performed during a match or during practice. Action recognition involves a sequence of tasks: localizing the objects, identifying them, and then classifying the action. Strategic decisions can be taken once the actions are detected and classified. The dataset consists of 20 different classes of strokes [5] on which the detection and classification are based, and these moves are shot in natural conditions. Applying machine learning in this specific domain can improve athletic performance through computer-aided analysis of moves. We implemented a YOLOv5 model, which is based on CNNs, for this problem and discuss our results on the given dataset.

2 RELATED WORK
Sports action classification is a topic on which a lot of research has been carried out, much of it focusing on recognizing a large number of actions in videos using spatio-temporal models. Feature extraction, dictionary learning, and classification are the steps involved in action localization and recognition in sports videos [7], where a sliding-window approach is used to choose the maximum score of the classifier in the spatio-temporal volume. A Siamese Spatio-Temporal Convolutional Neural Network (SSTCNN) has been used to detect table tennis strokes [6]; it uses the RGB video frames and optical flow normalization to enhance performance. Similar action recognition research is found in the literature using 3D ConvNets [3] and HOG features extracted from the Temporal Difference Map (TDMap) [1]. A Long-term Recurrent Convolutional Network (LRCN) has been used to classify table tennis strokes [8], extracting features with the pretrained VGG16 model. Our approach does not use optical flow data to detect the moves and instead uses the frame sequences directly.

3 APPROACH
Our approach extracts features using a YOLOv5 model trained on the MediaEval Fine-Grained Action Detection and Classification of Table Tennis Strokes dataset provided to us, to detect and classify the moves/actions made. CNN models are found to be optimal classifiers when the data is highly spatial with proper discrimination among the classes, so we decided to study this task with an object detection and recognition deep learning framework, the YOLOv5 architecture [2].

3.1 Data Pre-processing
The YOLO model takes fixed input sizes for each mini-batch. The frames were downscaled to 512 × 512 in order to keep the file sizes manageable. CVAT (Computer Vision Annotation Tool) was used to annotate the actions of the players by drawing bounding boxes over the body, focused on the hand holding the bat, at the frame numbers given in the dataset. The annotated strokes are of varying duration, some very short and others lengthy, which meant we had to ensure the extracted frames carried information on the entire move, irrespective of its duration. There were two different annotation sets, for detection and classification, and we observed that the detection frame sequences overlapped with the classification ones. We therefore annotated only from the detection set, using the stroke classes when present or marking a segment as just "stroke" when no class was given, and then split the annotated files into the two required file types. This saved us a lot of work, as we did not have to annotate the same video twice or draw two bounding boxes.
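To make the pre-processing concrete, the sketch below shows one way the frame extraction and label export could be implemented. It is a minimal illustration, not the exact code used for this work: the helper names (extract_frames, write_yolo_label) and paths are hypothetical and OpenCV is assumed for frame handling; only the YOLO txt label format (one "class x_center y_center width height" line per box, normalized to [0, 1]) is fixed by the YOLOv5 tooling.

import cv2
from pathlib import Path

FRAME_SIZE = 512  # frames are downscaled to 512 x 512 (Section 3.1)

def extract_frames(video_path, out_dir):
    # Read a match video frame by frame, downscale each frame and save it.
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.resize(frame, (FRAME_SIZE, FRAME_SIZE))
        cv2.imwrite(str(out / f"frame_{idx:06d}.jpg"), frame)
        idx += 1
    cap.release()

def write_yolo_label(label_path, class_id, box):
    # Append one bounding box in YOLO txt format:
    # "class x_center y_center width height", all normalized to [0, 1].
    # `box` is a pixel-space (x_min, y_min, x_max, y_max) tuple, e.g.
    # exported from CVAT.
    x_min, y_min, x_max, y_max = box
    xc = (x_min + x_max) / 2.0 / FRAME_SIZE
    yc = (y_min + y_max) / 2.0 / FRAME_SIZE
    w = (x_max - x_min) / FRAME_SIZE
    h = (y_max - y_min) / FRAME_SIZE
    with open(label_path, "a") as f:
        f.write(f"{class_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}\n")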
3.2 Proposed Model
Our approach uses the complete RGB frame sequences of the videos. The YOLOv5 architecture is a modified version of YOLOv4 implemented by Ultralytics. YOLOv5 has three functional components: the Backbone (CSPDarknet), the Neck (PANet) and the Head (the YOLO layer). CSPDarknet extracts the features from the frames of the table tennis videos. Feature pyramids are constructed in the PANet stage, which helps the model generalize to objects of different sizes. The head layer performs the object detection task by predicting bounding boxes over the features.

The YOLOv5 model was trained for 15 epochs to detect the strokes, classify them and find their respective bounding boxes in the frames. The model is trained to detect 20 different classes of strokes. It obtained training and validation losses of about 0.0039 and 0.0021 with the loss function mentioned in Table 1; the hyper-parameters adopted by our approach are listed in Table 1. As we considered the whole action sequence during detection, over-fitting was a major problem: even a still pose of the player was fed to the model with a positive label, which caused the model to over-fit. This could have been prevented by taking only the frame sequences in which a move was actually performed. On the classification part the model seemed to perform better than on the detection part, as the classes were present in smaller amounts in the dataset compared to the detection part.
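A run of this kind could be launched as sketched below, with the Table 1 values written into a YOLOv5 hyper-parameter file. This is a sketch under stated assumptions, not our exact invocation: it assumes the 2021-era Ultralytics train.py interface and its hyp-file key names (lr0, momentum, weight_decay, iou_t, anchor_t); the dataset config strokes.yaml is a hypothetical name, and a real hyp file must also contain the remaining default keys, of which only the Table 1 overrides are shown.

import subprocess
import yaml

# Hyper-parameters from Table 1, expressed with the key names used by the
# 2021-era YOLOv5 hyp.scratch.yaml. A real hyp file needs the full default
# key set; only the values overridden per Table 1 are shown here.
hyp_overrides = {
    "lr0": 0.01,            # initial learning rate
    "momentum": 0.937,      # optimizer momentum
    "weight_decay": 0.0005,
    "iou_t": 0.20,          # IoU training threshold
    "anchor_t": 4.0,        # anchor-multiple threshold
}
with open("hyp.strokes.yaml", "w") as f:
    yaml.safe_dump(hyp_overrides, f)

# Train for 15 epochs at 512 x 512 on the 20 stroke classes; --adam selects
# the Adam optimizer as in Table 1. strokes.yaml (hypothetical) would list
# the image/label paths and the 20 class names.
subprocess.run([
    "python", "train.py",
    "--img", "512",
    "--epochs", "15",
    "--data", "strokes.yaml",
    "--hyp", "hyp.strokes.yaml",
    "--weights", "yolov5s.pt",
    "--adam",
], check=True)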
Table 1: Hyper-parameters used

    Hyper-parameter     Value
    Learning Rate       0.01
    Optimizer           Adam
    Loss                Binary Cross-Entropy with Logits Loss
    Momentum            0.937
    Weight Decay        0.0005
    IoU Threshold       0.2
    Anchor Threshold    4.0

Figure 1: Stroke detections. (a) Detected as stroke. (b) Detected as offensive forehand flip.

4 RESULTS
The model was able to classify 13 out of the 20 classes, as we could not annotate the videos containing the other 7 classes. We achieved an accuracy of 9.95% on the 13 classes, with some classes predicted with good accuracy and others poorly. With respect to per-class accuracy, we observed that the model learnt some of the moves better than others. Sample detections are shown in Figures 1a and 1b.

The detection performance on the test set, however, was very poor (mAP = 0.000525, G-IoU = 0.247), showing that YOLOv5 with the current training was not a suitable model for this dataset. That the G-IoU is better than the mAP indicates the detections were moving towards the ground truth, and that training the network with different hyper-parameters for more epochs would have improved detection. The baseline results for this dataset can be seen in [4]. The model was not able to reliably detect moves such as serve backhand topspin, serve backhand backspin, forehand loop and forehand sidespin, leading to very poor test accuracy. The model did not perform well on the detection part because the frame sequences contained not only moves but other actions as well, such as standing still or walking, which resulted in incorrect detection of strokes from the frame sequences. A closer analysis shows that the model fails to distinguish between moves belonging to the same group (such as Serve, Defensive, Offensive), as the differences are very intricate. The model tended to prefer certain moves significantly more than others on the test set, which arose from the distribution of the training set. Using uniform amounts of data per class resulted in a very low number of examples to train on. The difference in accuracy between the test and validation data might be due to the class frequencies on the test set differing from those of the training and validation sets.

5 DISCUSSION AND OUTLOOK
As we processed the data such that each move/action spanned a very long frame sequence, the model over-fitted. Over-fitting could have been avoided if the moves/actions had been precisely annotated in the video dataset and handled correctly when fed into the model. We learnt that data of this kind needs precise annotations after pre-processing, which could lead to better results. Our model thus could not show accuracy comparable to the baseline model provided for reference. The performance could have been enhanced further by annotating all the videos and by training for a larger number of epochs. A Conv3D model with hyper-parameters different from those of the baseline can also be attempted to study the performance.
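For reference, the G-IoU figure reported in Section 4 is the generalized intersection-over-union between a predicted and a ground-truth box. The snippet below is a self-contained illustration of the standard GIoU definition for axis-aligned boxes, not code from our pipeline; it shows why G-IoU stays informative (it degrades smoothly towards -1 as boxes move apart) even when mAP is near zero.

def giou(box_a, box_b):
    # Generalized IoU for two axis-aligned boxes given as
    # (x_min, y_min, x_max, y_max); both boxes are assumed to have
    # positive area. GIoU = IoU - (area(C) - area(A union B)) / area(C),
    # where C is the smallest box enclosing both A and B.
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection of A and B (zero if they do not overlap).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / union
    # Smallest enclosing box C.
    cx1, cy1 = min(ax1, bx1), min(ay1, by1)
    cx2, cy2 = max(ax2, bx2), max(ay2, by2)
    area_c = (cx2 - cx1) * (cy2 - cy1)
    return iou - (area_c - union) / area_c

# Example: a detection offset from the ground truth still receives partial
# credit under G-IoU, even though strict mAP may count it as a miss.
print(giou((10, 10, 50, 50), (20, 20, 60, 60)))  # ~0.31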
REFERENCES
[1] Omar Elharrouss, Noor Almaadeed, Somaya Al-Maadeed, Ahmed Bouridane, and Azeddine Beghdadi. 2021. A combined multiple action recognition and summarization for surveillance video sequences. Applied Intelligence 51, 2 (2021), 690–712.
[2] Glenn Jocher. 2020. YOLOv5. https://github.com/ultralytics/yolov5. Online; accessed 29 October 2021.
[3] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 2012. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 1 (2012), 221–231.
[4] Pierre-Etienne Martin. 2021. Spatio-Temporal CNN baseline method for the Sports Video Task of MediaEval 2021 benchmark. In MediaEval (CEUR Workshop Proceedings). CEUR-WS.org.
[5] Pierre-Etienne Martin, Jordan Calandre, Boris Mansencal, Jenny Benois-Pineau, Renaud Péteri, Laurent Mascarilla, and Julien Morlier. 2021. Sports Video: Fine-Grained Action Detection and Classification of Table Tennis Strokes from Videos for MediaEval 2021. In MediaEval (CEUR Workshop Proceedings). CEUR-WS.org.
[6] Pierre-Etienne Martin, Jenny Benois-Pineau, Renaud Péteri, and Julien Morlier. 2020. Fine grained sport action recognition with twin spatio-temporal convolutional neural networks. Multimedia Tools and Applications 79, 27 (2020), 20429–20447.
[7] Khurram Soomro and Amir R Zamir. 2014. Action recognition in realistic sports videos. In Computer Vision in Sports. Springer, 181–208.
[8] Siddharth Sriraman, Srinath Srinivasan, Vishnu K Krishnan, J Bhuvana, and T.T. Mirnalinee. 2019. MediaEval 2019: LRCNs for Stroke Detection in Table Tennis. In MediaEval.