A Hybrid Approach To Stroke Detection In Swimming A Ankitha Reddy1,*,† , Pranav Moorthi2,† , Samyuktaa Sivakumar3,† , Shwetha S4,† , Prabavathy Balasundaram5,† and Pravinkrishnan K6,† 1 Sri Sivasubramaniya Nadar College Of Engineering, India Abstract This research explores technology integration in sports, focusing on video-assisted performance diag- nostics for swimming. Motion capture techniques, particularly using machine learning models, aim to automate stroke classification, offering efficient analysis of swimmers’ techniques. The study proposes a hybrid approach with VGG16 for feature extraction and a Random Forest classifier for stroke classification. Despite challenges stemming from limited data, resulting in an accuracy of 0.28125, the study emphasises the potential of deep learning neural networks for both feature extraction and classification with the aid of larger datasets in the context of swimming performance analysis. 1. Introduction In recent times, technology has become a crucial part of the world of sports. From being used to scrutinise and analyse athletes’ performance during their training period to identifying minor edges that can lead an athlete to victory, cameras and performance monitoring systems are used at every stage of the journey. Applying motion capture techniques through video cameras can go a long way in enhancing a person’s capabilities and preventing risks of injuries and player fatigue. In today’s day and age, video-assisted performance diagnostics in the context of swimming have become an indispensable part of improving and enhancing a swimmer’s technique. This sort of evaluation of stroke rates, the body postures, and the different phases of a stroke cycle are of great importance to athletes to help them understand how to make movements with minimum movement economy and maximum speed. But it is still largely done manually by replaying and physically noting important features from the video playback. This is not only labour-intensive but also exhausting and time-consuming. Automating this process using state-of-the-art models will largely help in catering to swimmers and athletes who do not have the skill and knowledge to evaluate their technique and are largely dependent on skilled experts for the same. Using machine learning models to detect the type of swimming stroke is an efficient way to analyse sports performance and derive conclusions about the improvements required in terms of technique and posture. The task [1] given to us is to classify an image into different swimming styles: Freestyle, Backstroke, Breaststroke, Butterfly. All four strokes have their specific technique, and movement economy and have different requirements. MediaEval’23: Multimedia Evaluation Workshop, February 1–2, 2024, Amsterdam, The Netherlands and Online * Corresponding author. † These authors contributed equally. $ ankithareddy2210178@ssn.edu.in (A. A. Reddy); pranav221076@ssn.edu.in (P. Moorthi); samyuktaa2210189@ssn.edu.in (S. Sivakumar); shwetha2210210@ssn.edu.in (S. S); prabavathyb@ssn.edu.in (P. Balasundaram) © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings 2. Related Work In a study by Hosseini Fani [2], swimming strokes were analysed to predict arm stroke efficiency in videos utilising the OpenPose python library to extract joint features and angles. The classification task utilized the Random Forest technique, achieving an accuracy of 67%. In another study on recognising basketball turning and dribbling, Zhang et al. [3] introduced flow images to capture the relationship between basketball motions. A convolutional neural network model was utilised with multi-feature learning to extract spatiotemporal features for basketball turning and dribbling recognition effectively. Furthermore, a study [4] on basketball shooting action efficiency employed a Sparse Gaussian Process Latent Variable Model for motion tracking. Classification methods included Random Forest, Support Vector Machine, SOM neural network, and Bayesian network. A study [5] addressing stroke detection in tennis videos employed particle filters, motion descriptors and event detectors. The process involved player tracking, extraction of player- centred images, and the use of the Lucas Kanade algorithm for optical flow analysis. Motion descriptors are generated, and feature detection is performed using a 3-D extension of the Viola Jones algorithm. Training utilized the Adaptive Boosting algorithm in machine learning. Another study [6] on detecting football player activities used CNN and GCN to decipher spatial and temporal patterns in player poses and motions and classify them based on both visual appearances and pose configurations. The model is enhanced with data augmentation and regularisation. The model achieved an F1 score of 0.90. 3. Dataset The dataset comprised 96 labeled images which were partitioned into an 80 per cent training set and a 20 per cent validation set. The images belonged to 4 classes - backstroke, freestyle, breaststroke, and butterfly, as shown in figures 1-4, with the class-wise distribution being 24. Figure 1: Freestyle Figure 2: Breaststroke Figure 3: Backstroke Figure 4: Butterfly 4. Methodology 4.1. Convolutional Neural Networks Convolutional Neural Network is a widely used machine learning model and is a fundamental element of deep learning algorithms. A CNN consists of four components: A convolutional layer, an Activation operation, a Pooling layer and a Fully Connected layer. The convolutional layer is the heart of the neural network and is responsible for feature extraction. It involves applying a filter to the image to identify and locate significant features. An element-wise multiplication is performed between the two-dimensional array of weights called the filter and the two-dimensional input array and the result is summed up. This computation is performed using a sliding window [7]. The second component of the CNN is the activation operation. This component is responsible for recognizing specific features in the images it is being trained on and allows the network to learn the non-linear relationships between the weighted sum of the inputs and the output. The most widely used activation functions are Rectified linear unit (ReLU), Sigmoid and Hyperbolic tangent functions. The third part comprises the pooling layer that is used to reduce the size of the feature map which leads to dimensionality reduction. This sort of down sampling helps in maintaining a lower-resolution feature map while retaining the important features that are important to the classification task. There are largely two types of pooling: Average pooling and Maximum Pooling. The final component is the Fully Connected(FC) layer. The FC is a densely connected layer whose weights and biases are learned during the training process. This layer’s functionality is to flatten the 2D feature map obtained from the previous layers into a 1D array and output a score for each one of the classification classes. 4.2. VGG16 VGG 16 represents a convolutional neural network designed for image classification tasks. Pre-trained on the extensive ImageNet database, this model serves as a feature extractor, demon- strating its proficiency in learning hierarchical representations of visual features. Comprising a total of 16 layers, VGG 16 consists of 13 convolutional layers followed by 3 fully connected layers. The model operates by constructing a 3D tensor, which is subsequently flattened into a 2D vector before being fed into the Random Forest Classifier. The utilisation of a pre-trained model facilitates the extraction of intricate and abstract features. 4.2.1. Random Forest The Random Forest algorithm serves as an ensemble classification method that leverages the collective decision-making capabilities of multiple decision trees. This method involves the aggregation of individual decision trees to collectively determine the class output. The fundamental unit of the Random Forest is the decision tree, and each tree is trained independently on a randomly selected subset of the training data. An advanced form of the bagging algorithm is employed, introducing an element of randomness to enhance the diversity of the individual trees within the ensemble. 5. Implementation 5.1. Convolutional Neural Networks The network architecture consists of subsequent Conv2D and MaxPooling2D layers, with the Conv2D layers employing 32 filters of size 3x3 to learn the hierarchical representations. This layer is followed by a MaxPooling2D layer with a 2x2 pooling size, providing a degree of translational invariance. The third Conv2D layer introduces 64 filters, succeeded by a MaxPooling2D layer with a 2x2 pooling size, resulting in a down-sampled output that is reshaped into a one-dimensional array. The architecture includes two dense layers, helping the model understand complex combinations of lower-level features related to different strokes. To enhance regularisation, a Dropout layer with a dropout rate of 0.24 is applied. The architecture employs Rectified Linear Unit (ReLU) activations after each convolutional layer ensuring non- linearity and spatial down-sampling. Additionally, Softmax activation in the final layer facilitates the generation of output class scores through probabilistic distributions. 5.1.1. VGG16 Figure 5: Feature Map pre-VGG16 layer Figure 6: Feature Map post-VGG16 layer The implementation leverages the VGG16 architecture, consisting of a total of 19 layers with 16 convolutional and 3 fully connected layers. However, only the 16 layers comprising the convolutional base of VGG16 are utilised for the specific purpose of feature extraction. The convolutional layers employ 3x3 filters with a stride of 1, facilitating precise pixel movement. The number of filters increases with the depth of the model, and the architecture is characterised by multiple Conv2D and MaxPooling blocks, each contributing to the extraction of intricate features. Max pooling is executed with 2x2 windows and a stride of 2, efficiently down-sampling the input. Rectified Linear Unit (ReLU) functions are applied to activate the convolutional blocks, providing non-linearity to the model. Addressing the advantages of using smaller filter sizes, hierarchical feature learning and parameter sharing contributes extensively to capturing fine-grained details and reducing the risk of overfitting. The dataset was used to further train VGG16, which was pre-trained incipiently on the extensive ImageNet dataset. Figures 5 and 6 represent the feature maps extracted in each filter before and after the application VGG16 layer. Figure 7: CNN Architecture Figure 8: VGG16 Architecture 6. Results and Analysis The final accuracy of the random forest model on the test data was 0.28125. This low accuracy score is a result of training a neural network like VGG16 on a smaller dataset consisting of 96 images. To accommodate the limited nature of the dataset, we performed data augmentation and opted to use hybrid models that combined deep learning neural networks like VGG16 for feature extraction and traditional models like Random Forest for classification. The CNN model was also applied on the training data which yielded a low validation accuracy of 0.30 due to the extensive dataset requirement of the model. We aimed to extract the most significant features by leveraging state-of-the-art technology and tuning it to fit our dataset by making changes in the hyperparameters, but it underperformed and overfit due to the smaller size of the training dataset. 7. Conclusion Through the scope of this research, we implemented a hybrid approach. The capability of the CNN and VGG16 models to extract and capture fine-grained details was utilised in identifying the significant features in the images consisting of swimming strokes to correctly classify them. References [1] A. Erades, P. Martin, R. V. B. Mansencal, R. Péteri, J. Morlier, S. Duffner, J. Benois-Pineau, Sportsvideo: A multimedia dataset for event and position detection in table tennis and swimming, in: Working Notes Proceedings of the MediaEval 2023 Workshop, Amsterdam, The Netherlands and Online and Online, 1-2 February 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2023. [2] H. Fani, A. Mirlohi, H. Hosseini, R. Herperst, Swim stroke analytic: Front crawl pulling pose classification, in: 2018 25th IEEE International Conference on Image Processing (ICIP), 2018, pp. 4068–4072. doi:10.1109/ICIP.2018.8451756. [3] B. Zhang, T. Wang, Visual image recognition of basketball turning and dribbling based on feature extraction., Traitement du Signal 39 (2022). [4] R. Ji, Research on basketball shooting action based on image feature extraction and machine learning, IEEE Access 8 (2020) 138743–138751. doi:10.1109/ACCESS.2020.3012456. [5] K. Dokic, T. Mesic, M. Martinovic, Table tennis forehand and backhand stroke recognition based on neural network, in: International Conference on Advances in Computing and Data Sciences, Springer, 2020, pp. 24–35. [6] L. Lamport, LaTeX User’s Guide and Document Reference Manual, Addison-Wesley Publishing Company, Reading, Massachusetts, 1986. [7] P. Martin, Baseline method for the sport task of mediaeval 2023 3d cnns using attention mechanisms for table tennis stoke detection and classification., in: Working Notes Proceedings of the MediaEval 2023 Workshop, Amsterdam, The Netherlands and Online and Online, 1-2 February 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2023.