ConvLSTM for Table Tennis Stroke Classification

Jansi Rani Sella Veluswami, Ananth Narayanan P, Bhuvan S and Shobith Kumar R
Sri Sivasubramaniya Nadar College of Engineering, Tamil Nadu, India

Abstract

Our study concentrates on sports video analytics, particularly stroke classification. We utilize a model that combines Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks, trained on the MediaEval Fine-Grained Action Classification of Table Tennis Strokes dataset. With an accuracy of 81.4%, our model effectively classifies table tennis strokes, providing insights for post-match commentary and playstyle analysis. This effectiveness is demonstrated in the context of the MediaEval 2023 benchmark.

1. Introduction

The field of action recognition involves associating a predefined set of actions with video content to meet the increasing demand for automated action analysis in videos. This paper presents a method that targets the classification of strokes within a dataset of table tennis strokes performed in match and practice settings. The action recognition process involves localizing objects, identifying them, and then classifying the detected actions. The ability to detect and classify actions is crucial for making strategic decisions, particularly in the context of athletic performance analysis.

The overview paper [1] describes the TTStroke-21 dataset used in this study, which includes 21 different stroke classes and provides two annotated sets: a training set and a validation set. Utilizing machine learning in this domain has the potential to enhance athletic performance through computer-aided analysis of moves. In this study, we developed a model implemented in TensorFlow that combines a Convolutional Neural Network (CNN) with a Long Short-Term Memory (LSTM) architecture. Our approach aims to contribute to the improvement of athletic performance by automating the analysis of various strokes.
We discuss the results obtained with our model on the given dataset, highlighting the significance of effective action recognition in sports analytics.

2. Related Work

The provided baseline methodology [2] proposes two 3D-CNN architectures to solve the subtask; both use spatio-temporal convolutions and attention mechanisms. The predominant strategies in this area have centered on CNN- and LSTM-based methodologies. For example, Kaustubh Milind Kulkarni et al. [3] presented an LSTM model, a TCN model, and a combined TCN + LSTM model, using pose estimation and a Savitzky-Golay filter for feature extraction. Kadir Aktas et al. [4] present another approach in which RGB images were used as input without any prior feature extraction; their LSTM model achieved about 79.8% accuracy on validation data. We were inspired by the idea of using RGB images directly, without any feature extraction, and adopted the same approach in our work.

MediaEval'23: Multimedia Evaluation Workshop, February 1–2, 2024, Amsterdam, The Netherlands and Online
svjansi@ssn.edu.in (J. R. S. Veluswami); ananthnarayanan2210384@ssn.edu.in (A. N. P); bhuvan2210511@ssn.edu.in (B. S); shobithkumar2210399@ssn.edu.in (S. K. R)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

3. Approach

A Convolutional Neural Network (CNN or ConvNet) is a type of deep neural network specifically designed for processing image data. This network excels at analyzing images and making predictions based on them. It uses kernels, known as filters, to examine the image and generate feature maps, which represent the presence of specific features at various locations within the image.
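As a minimal illustration of this filtering step (the filter count and image size here are our own arbitrary choices, not values from our model), a single convolutional layer maps an input image to one feature map per filter:

```python
import numpy as np
import tensorflow as tf

# A single convolutional layer: each 3x3 filter slides over the image and
# produces one feature map. The filter count (8) is illustrative only.
conv = tf.keras.layers.Conv2D(filters=8, kernel_size=3, padding="same",
                              activation="relu")

image = np.random.rand(1, 64, 64, 3).astype("float32")  # one 64x64 RGB frame
feature_maps = conv(image)
print(tuple(feature_maps.shape))  # (1, 64, 64, 8): one map per filter
```

With `padding="same"`, the spatial dimensions are preserved and only the channel dimension changes, from 3 input channels to 8 feature maps.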
Initially, the network produces a small number of feature maps, which are augmented and refined through subsequent layers using pooling operations that retain the most salient information.

A Long Short-Term Memory (LSTM) network, on the other hand, is designed to handle sequential data, taking all previous inputs into account when generating an output. LSTMs are a type of Recurrent Neural Network (RNN) that addresses the vanishing gradient problem, a limitation of traditional RNNs in handling long-term dependencies in input sequences. This enables LSTM cells to maintain context for extended periods, making them well suited for tasks such as time series prediction, speech recognition, language translation, and music composition.

In the context of action recognition, we employ a CNN + LSTM network to leverage the spatial-temporal aspects of videos. This combination enables the network to effectively analyze and recognize actions within video sequences.

3.1. Data Preprocessing

Data preparation involves class identification and two pivotal functions: one for frame extraction, which handles resizing and normalization, and another for dataset construction, which assembles features, labels, and video paths. Notably, dataset creation rigorously filters videos to match the specified sequence length. Executing the dataset-creation function on a specified directory produces the dataset objects: features, labels, and video file paths. The features are the frames extracted from the videos; the labels serve as identifiers for subsequent model training; and the paths act as references to the physical location of each video.

3.2.
Proposed Model

The dataset feeds seamlessly into a TensorFlow model that combines Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) architecture. This choice is informed by the well-established effectiveness of these architectures in video content analysis tasks. In the construction phase of our model, we utilize Keras ConvLSTM recurrent layers, a critical architectural decision for video classification tasks. These layers excel at processing spatiotemporal information within video sequences. We configure each layer with parameters such as the number of filters, the kernel size, and the activation function to facilitate the convolutional operations. The resulting sequences are subsequently processed through pooling layers, which reduce frame dimensions to alleviate computational load, and Dropout layers, which mitigate overfitting risks. The architecture is intentionally kept simple, with a limited number of trainable parameters commensurate with the scale of the dataset. A vital element is the final Dense layer with softmax activation, which yields probability distributions across the action categories.

The model is compiled using categorical cross-entropy as the loss function, the Adam optimizer, and accuracy as the evaluation metric. Training incorporates an early-stopping callback to prevent overfitting. This structure forms a cohesive and efficient framework for the in-depth analysis of spatiotemporal patterns within table tennis stroke videos. The model's adherence to best practices in architectural design and training strategy enhances its adaptability and potential for robust performance in action recognition tasks.

4. Results and Analysis

The accuracy of the model was recorded after every epoch of training, and the results demonstrate a high level of accuracy: the training accuracy reached a peak of 97%, while the validation accuracy reached 98.8%.
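The pipeline of Sections 3.1 and 3.2 can be sketched as follows. This is a minimal sketch, not our exact implementation: the filter counts, kernel sizes, dropout rates, frame-sampling strategy, and helper names are assumptions chosen to match the description above (the 64×64 frame size and sequence length of 60 correspond to the best configuration reported below).

```python
import numpy as np
import tensorflow as tf

SEQUENCE_LENGTH = 60          # sequence length used in our best configuration
IMG_HEIGHT, IMG_WIDTH = 64, 64  # frame size of the best configuration
NUM_CLASSES = 21              # stroke classes in TTStroke-21

def extract_frames(video_path):
    """Sample SEQUENCE_LENGTH evenly spaced frames, resized and normalized."""
    import cv2  # OpenCV; imported here since only frame extraction needs it
    capture = cv2.VideoCapture(video_path)
    total = int(capture.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(total // SEQUENCE_LENGTH, 1)
    frames = []
    for i in range(SEQUENCE_LENGTH):
        capture.set(cv2.CAP_PROP_POS_FRAMES, i * step)
        ok, frame = capture.read()
        if not ok:
            break
        frame = cv2.resize(frame, (IMG_WIDTH, IMG_HEIGHT))
        frames.append(frame / 255.0)  # normalize pixel values to [0, 1]
    capture.release()
    return np.asarray(frames, dtype="float32")

def build_model():
    """ConvLSTM layers, pooling, Dropout, and a final softmax Dense layer."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(SEQUENCE_LENGTH, IMG_HEIGHT, IMG_WIDTH, 3)),
        tf.keras.layers.ConvLSTM2D(4, kernel_size=3, activation="tanh",
                                   recurrent_dropout=0.2, return_sequences=True),
        tf.keras.layers.MaxPooling3D(pool_size=(1, 2, 2)),  # shrink frames
        tf.keras.layers.TimeDistributed(tf.keras.layers.Dropout(0.2)),
        tf.keras.layers.ConvLSTM2D(8, kernel_size=3, activation="tanh",
                                   recurrent_dropout=0.2, return_sequences=True),
        tf.keras.layers.MaxPooling3D(pool_size=(1, 2, 2)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(loss="categorical_crossentropy", optimizer="adam",
                  metrics=["accuracy"])
    return model

model = build_model()
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                              restore_best_weights=True)
# model.fit(features, one_hot_labels, validation_split=0.2,
#           callbacks=[early_stop])
```

The `fit` call is commented out because it requires the extracted `features` and one-hot `one_hot_labels` arrays produced by the preprocessing step.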
Several factors contributed to these results. First, the volume and distribution of the data had a significant impact on accuracy. The image height, image width, and sequence length also had a significant effect, with accuracy ranging from 0.7408 for frames of 90×80 pixels and a sequence length of 60, to 0.9876 for frames of 64×64 pixels and a sequence length of 60. It is worth noting that the distribution of some labels is highly biased towards certain classes, leading to biased learning. Over the course of five runs, the highest global accuracy achieved by the model was 81.4%.

Figure 1: Accuracy across various runs.

5. Discussion and Outlook

Throughout the training and validation phases, we attained encouraging outcomes that lead us to conclude that overfitting is not present. Nonetheless, the model's performance on the test data reveals that it has not learned effectively and is unable to generalize. In our opinion, this challenge can be remedied by augmenting the quantity of labeled data used in training. Moreover, we posit that the low variability between the classes and the nature of the task contribute to this issue. Considering that a single class can be sampled in various ways by different players (e.g., right- or left-handed, more or less experienced), we suggest that the dataset could be enhanced by increasing the coverage of the classes and reducing bias among them.

Figure 2: Heatmap of the model with highest global accuracy.

References

[1] P.-E. Martin, Baseline method for the sport task of MediaEval 2023: 3D CNNs using attention mechanisms for table tennis stroke detection and classification, MediaEval Workshop 2023 (2023).
[2] A. Erades, P.-E. Martin, R. Vuillemot, B. Mansencal, R. Peteri, J. Morlier, S. Duffner, J. Benois-Pineau, SportsVideo: a multimedia dataset for event and position detection in table tennis and swimming, MediaEval Workshop 2023 (2023).
[3] K. M. Kulkarni, S. Shenoy, Table tennis stroke recognition using two-dimensional human pose estimation, CVPR Sports Workshop (2021).
[4] K. Aktas, M. Demirel, M. Moor, J. Olesk, G. Anbarjafari, Spatio-temporal based table tennis hit assessment using LSTM algorithm, MediaEval (2020).
[5] A. Zahra, P.-E. Martin, Two stream network for stroke detection in table tennis, MediaEval (2021).
[6] HCMUS at MediaEval'20: ensembles of temporal deep neural networks for table tennis strokes classification task, 2020.
[7] P.-E. Martin, J. Benois-Pineau, B. Mansencal, R. Péteri, J. Morlier, Siamese spatio-temporal convolutional neural network for stroke classification in table tennis games (2020).