<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Automatic Sports Video Classification Using CNN-LSTM Approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Benoughidene Abdelhalim</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Titouna Faiza</string-name>
          <email>ftitouna@yahoo.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>6th International Hybrid Conference On Informatics And Applied Mathematics</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, LaSTIC Laboratory, University of Batna 2</institution>
          ,
          <country country="DZ">Algeria</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>With the exponential growth and availability of sports video data, the need for video analysis has become crucial. Sports video classification is one of the most challenging problems in computer vision. This paper proposes a novel method for sports video classification based on a deep learning approach that combines the pre-trained Inception V3 model with long short-term memory (LSTM). The pre-trained Inception V3 model extracts low-, middle-, and high-level spatial features, and feeding these spatial features to the LSTM yields the temporal features. We then train the resulting InceptionV3-LSTM model on these spatio-temporal features to classify sports videos into specific categories. Experiments conducted on the UCF sports dataset show that our proposed model obtains considerably more encouraging results than existing methods.</p>
      </abstract>
      <kwd-group>
        <kwd>Transfer learning</kwd>
        <kwd>Inception V3</kwd>
        <kwd>Long short-term memory (LSTM)</kwd>
        <kwd>Deep learning</kwd>
        <kwd>Sport video classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>As the volume of sports video data increases, video classification becomes essential for understanding and organizing this data. Video classification is an important visual task in computer vision and has been used in a variety of applications, including motion recognition [<xref ref-type="bibr" rid="ref1">1</xref>] and scene classification [<xref ref-type="bibr" rid="ref2">2</xref>]. Video classification enables users to easily access and understand sports videos and provides insight into the game or sport [3]. The aim of sports video classification is to automatically classify sports videos according to the sporting events they contain, which have spatial and motion features [4].</p>
      <p>Deep learning algorithms have many applications, such as object recognition from images, search engines, and speech recognition. However, sports video classification is a relatively new field. With the development of deep learning techniques, this field has attracted the attention of researchers looking for new challenges. A video is composed of a set of sequential images. Each image contains information about the spatial content, while the temporal sequence of images contains information about the motion. To represent a video in a comprehensive and informative way, it is necessary to capture both spatial and temporal information and combine them [5].</p>
      <p>To take account of these two spatio-temporal aspects of video, we use a hybrid architecture consisting of convolutional layers to handle spatial information and recurrent layers to handle temporal information. The two-stream convolutional neural network (CNN) and the recurrent neural network (RNN), which together handle spatio-temporal information, have proved better than their single-stream counterparts [6].</p>
      <p>In this paper, we focus on the classification of sports videos from the UCF dataset. We use a transfer learning method with a pre-trained CNN model (Inception V3) to extract low-, middle-, and high-level spatial features. Then, we use Long Short-Term Memory (LSTM) layers to extract temporal features and classify sports videos into specific categories. The proposed model achieves impressive performance on this dataset compared with some existing methods. The contributions of this work include:</p>
      <sec id="sec-1-1">
        <title>1. A new deep neural network model based on LSTM</title>
        <p>is proposed to classify video frames into specific
classes;
2. In this model, three spatial feature classes are
extracted by a pre-trained CNN model (Inception
V3) instead of hand-created features. This
decision was taken to avoid system complexity and
improve performance. The aim of this work is to
prove that CNN features can outperform
handcrafted ones;
3. In addition, we combine spatial information
from the CNN model (Inception V3) with
temporal information from the LSTM model to
improve model performance and decision-making.</p>
      <p>This method is effective because it provides better-quality information by combining different sources rather than using them separately.</p>
      <p>The rest of this paper is organized as follows. Section 2 describes related work. Section 3 presents our proposed sports video classification method. Section 4 discusses the experimental results, and Section 5 concludes the paper.</p>
    </sec>
    <sec id="sec-1-1">
      <title>2. Related Work</title>
      <p>Researchers are currently looking to develop more accurate and reliable algorithms to improve the classification of sports videos. The classification of sports videos is a new field that is developing rapidly with the use of deep learning techniques. Features can be extracted from text, audio, and visual information [7]. In this work, we focus solely on visual information, as vision is the primary means by which humans perceive information [8].</p>
      <p>Deep learning methods have recently been shown to be able to automatically extract complex features from images. Building on this, many researchers have worked with the two latest deep learning techniques, convolutional neural networks (CNNs) and recurrent neural networks (RNNs), to extract spatial and temporal features from sports videos [4]. In [9], the authors presented a transfer learning algorithm for DNN-based video classification; the algorithm is designed as a video classification system based on a convolutional neural network. Podgorelec et al. [10] used a transfer learning approach to fine-tune a pre-trained CNN model for a sports image classification task, developing a classification method that uses CNN transfer learning with Hyper-Parameter Optimization (HPO). Russo et al. [11] proposed a model that combines deep learning and transfer learning; they combined VGGNET16, RNN, and GRU functions with transfer learning for the final classification. Chen et al. [12] investigated simple sports classification in sports videos by detecting motion poses in video frames; for example, they can detect running, jumping, translation, and zooming. Qiu et al. [13] showed that principal components can be used to reduce the dimensions of visual and audio video features, and that a time series of motion features can be used to characterize motion event classifications in soccer sports videos.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Proposed Methodology</title>
      <p>In this work, we aim to develop a model for sports video classification that consists of two basic steps: a feature extraction process carried out by the pre-trained InceptionV3 model, and a training process carried out by an LSTM to predict the action, as presented in Figure 1. We utilized an LSTM network, which is equipped with mechanisms known as gates. These gates are responsible for managing the cell state, thereby controlling the long-term and short-term memory of the network; they play a crucial role in deciding which information is retained and which is discarded. The LSTM layer is followed by a dropout layer, and the output of the final layer is fed into a fully connected (FC) layer and passed through a softmax activation function for the final prediction. This entire approach was implemented using the Keras API. Indeed, deep learning methods have proven to be effective and to perform well on video classification tasks.</p>
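      <p>To make this training head concrete, the following is a minimal Keras sketch of the LSTM classifier just described. It assumes 20-frame sequences of 2048-dimensional InceptionV3 feature vectors and the 10 UCF sports classes; the dropout rate and optimizer are illustrative assumptions, not values reported in this paper.</p>
      <preformat>
# Minimal Keras sketch of the LSTM classification head described above.
# Assumed (not from the paper): dropout rate 0.5 and the Adam optimizer.
from tensorflow.keras import layers, models

NUM_FRAMES, FEATURE_DIM, NUM_CLASSES = 20, 2048, 10

model = models.Sequential([
    # Gated LSTM cell: its gates manage the cell state, i.e. the
    # long-term and short-term memory of the network.
    layers.LSTM(128, input_shape=(NUM_FRAMES, FEATURE_DIM)),
    layers.Dropout(0.5),
    # Fully connected (FC) layer with softmax for the final prediction.
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
      </preformat>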
      <sec id="sec-2-1">
        <title>3.1. Features Extraction Process</title>
        <p>Classifying actions from videos requires extracting a set of features that are anticipated to contain the data necessary to distinguish between various actions. There are three types of features: the first is low-level features such as corners, edges, and simple textures; the second is middle-level features such as complex textures and shapes; and the third is high-level features such as objects or parts of objects. The feature extraction technique is explained below.</p>
      <sec id="sec-2-2">
          <title>3.1.1. Transfer learning via feature extraction (low, middle, high)</title>
          <p>Conventional approaches to video classification have used frame-based features to generate a representation for the videos. In this work, various types of features, namely low-, middle-, and high-level, have been used to capture various types of information. These features are obtained with the help of a transfer learning strategy. Transfer learning is an effective strategy for training networks on a small dataset: the network is pre-trained on a large dataset, such as ImageNet, and then reused for a new task. This offers a significant advantage over training the network from scratch, as it requires less time and less data, saving both time and computing costs [14]. There are many models pre-trained on the ImageNet dataset, such as AlexNet, VGG16, ResNet, and InceptionV3. These models can be used to extract features from the data or can be fine-tuned to perform a new task.</p>
          <p>In this work, we used transfer learning to extract features. Convolutional Neural Networks (CNNs) automatically learn features from images at a hierarchical level. A sketch of this strategy is given below.</p>
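          <p>As a concrete illustration, the snippet below loads an ImageNet-pre-trained InceptionV3 in Keras, freezes its weights, and uses it as a per-frame feature extractor. This is a sketch of the general technique under stated assumptions, not the exact configuration used in our experiments.</p>
          <preformat>
# Sketch: reuse InceptionV3 pre-trained on ImageNet as a frozen feature
# extractor instead of training a network from scratch.
import numpy as np
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.applications.inception_v3 import preprocess_input

# include_top=False drops the ImageNet classifier head; pooling="avg"
# applies global average pooling, giving one 2048-d vector per frame.
base = InceptionV3(weights="imagenet", include_top=False,
                   input_shape=(224, 224, 3), pooling="avg")
base.trainable = False  # pre-trained weights are frozen, not updated

frames = np.random.uniform(0, 255, (20, 224, 224, 3)).astype("float32")
features = base.predict(preprocess_input(frames))  # shape: (20, 2048)
          </preformat>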
        <sec id="sec-2-2-1">
          <title>Inception V3, a widely recognized convolutional neural network, is frequently utilized in tasks involving image classification and feature extraction. It demonstrates</title>
          <p>Inception V3, a widely recognized convolutional neural network, is frequently utilized in tasks involving image classification and feature extraction. It demonstrates high accuracy, achieving over 75%-78% on the ImageNet dataset, and is known for its speed and precision. The architecture of InceptionV3, equipped with multiple 'Inception' modules, is designed to capture intricate patterns and hierarchies in images, making it particularly suitable for sports video classification, where the data often exhibits complex spatial hierarchies and patterns. However, the selection of a model can be influenced by the specific requirements of the task and the characteristics of the data; therefore, it could be advantageous to compare the performance of InceptionV3 with other pre-trained models in our future work.</p>
          <p>In this work, we propose an approach based on transfer learning that uses InceptionV3 to extract features from frames. The selection of specific blocks from the InceptionV3 model for feature extraction is typically based on the type of information these blocks can capture and on their impact on the overall performance of the model.</p>
          <p>The first layers of any neural network are mainly responsible for identifying low-level features; the later convolutional layers identify middle-level features; and the last convolutional layers identify high-level features, which are usually very specific to the task they are trained for. In the InceptionV3 architecture, lower layers (such as block-02) are usually responsible for capturing low-level features such as edges and textures, mid-level layers (such as block-05) can capture more complex features like shapes, and higher layers (such as block-10) can capture high-level semantic information, such as the presence of specific objects in the image (see Figure 2):</p>
          <list list-type="order">
            <list-item>
              <p>The low-level features were extracted from block02, with 288 dimensions.</p>
            </list-item>
            <list-item>
              <p>The middle-level features were extracted from block05, with 768 dimensions.</p>
            </list-item>
            <list-item>
              <p>The high-level features were extracted from the last block, block10, with 2048 dimensions.</p>
            </list-item>
          </list>
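          <p>A possible Keras realization of this multi-block extraction is sketched below. It assumes that block02, block05, and block10 correspond to the InceptionV3 layers named mixed2, mixed7, and mixed10 in Keras, whose channel widths (288, 768, and 2048) match the dimensions listed above; this layer-name mapping is an assumption, not something stated in the paper.</p>
          <preformat>
# Sketch: tap low-, middle-, and high-level feature maps from three
# InceptionV3 blocks and pool each one into a single vector per frame.
# Layer-name mapping (mixed2/mixed7/mixed10) is an assumption based on
# the matching channel widths 288/768/2048.
import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.applications import InceptionV3

base = InceptionV3(weights="imagenet", include_top=False,
                   input_shape=(224, 224, 3))
base.trainable = False

taps = ["mixed2", "mixed7", "mixed10"]  # low / middle / high level
outputs = [layers.GlobalAveragePooling2D()(base.get_layer(n).output)
           for n in taps]
extractor = models.Model(inputs=base.input, outputs=outputs)

frames = np.random.uniform(0, 1, (20, 224, 224, 3)).astype("float32")
low, mid, high = extractor.predict(frames)
# low: (20, 288), mid: (20, 768), high: (20, 2048)
          </preformat>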
          <p>The rationale behind selecting these specific blocks is that the combination of low-, mid-, and high-level features provides a comprehensive representation of the video content, which is beneficial for the task of sports video classification. As illustrated in Figure 1, a sequence of 20 frames with input shape (20, 224, 224, 3) passes through the InceptionV3 blocks (features extraction process); global average pooling produces (20, 2048) high-level features, which are fed to an LSTM with output size 128 (training process).</p>
        </sec>
        <sec id="sec-2-2-3">
          <title>Model parameters are updated using gradient descent</title>
          <p>and backpropagation error. During the learning process,
the joint weights of the pre-trained model are frozen, in
particular from block 0 to block 10. Finally, the LSTM
output passes through a fully connected layer with softmax
activation functions.</p>
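        <p>For clarity, Eq. (1) can be checked numerically; the numpy sketch below implements the batch-averaged categorical cross-entropy with illustrative values that are not taken from our experiments.</p>
        <preformat>
# Numpy version of Eq. (1): CE = -sum_{c=1..M} Y_c * log(P_c),
# averaged over a batch. The sample values are illustrative only.
import numpy as np

def categorical_cross_entropy(Y, P, eps=1e-12):
    """Y: one-hot ground truth (batch, M); P: softmax outputs (batch, M)."""
    P = np.clip(P, eps, 1.0)  # guard against log(0)
    return -np.mean(np.sum(Y * np.log(P), axis=1))

Y = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])  # two samples, M = 3
P = np.array([[0.1, 0.2, 0.7], [0.8, 0.1, 0.1]])
print(categorical_cross_entropy(Y, P))  # (-ln 0.7 - ln 0.8) / 2 ~= 0.29
        </preformat>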
        <p>Experimental results show that our method is effective in classifying sports videos, with the model achieving 89.5% accuracy, 90% precision, 88% recall, and an 88% F1 score.</p>
        </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Experimental results and discussion</title>
      <sec id="sec-3-1">
        <title>4.1. UCF Sports Action Data Set</title>
        <p>We evaluated the performance of our proposed model by testing it on the UCF sports action dataset, which contains videos of various actions from different sports (Figure 3). The dataset includes human-annotated action boundaries, which are shown in yellow. UCF is considered one of the best datasets for applications that require action localization and recognition.</p>
        <p>The dataset comprises 150 videos divided into 10 categories, filmed in different environments. Since each category contains a different number of videos, Figure 4 shows the distribution of the number of videos per category. The video resolution is 720 x 480 and the frame rate is 10 fps. The total duration of the dataset is 958 seconds, and the average sequence length is 6.39 seconds. Table 1 shows the characteristics of the dataset [15][16]. Twenty frames of each video were used for video classification, as shown in Table 2; a sketch of the sampling procedure is given below.</p>
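        <p>A minimal sketch of such a fixed-length sampling step is given below, using OpenCV to pick 20 evenly spaced frames and resize them to the model input size; the file path, helper name, and resizing policy are illustrative assumptions.</p>
        <preformat>
# Sketch: sample 20 evenly spaced frames from a video and resize them
# to 224x224 RGB. Path and helper name are illustrative.
import cv2
import numpy as np

def sample_frames(video_path, num_frames=20, size=(224, 224)):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # OpenCV reads BGR
        frames.append(cv2.resize(frame, size))
    cap.release()
    return np.stack(frames)  # (20, 224, 224, 3)
        </preformat>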
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Evaluation Metrics</title>
        <p>The quality of video classification models is measured using various performance metrics; precision, accuracy, recall, and F1 score are among the most common measures, and we use them to evaluate our model. In the definitions below, TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.</p>
        <list list-type="order">
          <list-item>
            <p>Accuracy: a measure of how well a model performs across all categories, useful when all categories are equally important. It is calculated by dividing the number of correct predictions by the total number of predictions: Accuracy = (TP + TN) / (TP + TN + FP + FN). (2)</p>
          </list-item>
          <list-item>
            <p>Precision: the ratio of the number of positive samples correctly classified to the total number of samples classified as positive, including incorrectly classified positive samples: Precision = TP / (TP + FP). (3)</p>
          </list-item>
          <list-item>
            <p>Recall: the ratio of the number of positive samples correctly detected to the total number of positive samples: Recall = TP / (TP + FN). (4)</p>
          </list-item>
          <list-item>
            <p>F1 score: a measure that combines precision and recall into a single score. The F1 score ranges from 0 to 1, with a score of 1 indicating the best model performance.</p>
          </list-item>
        </list>
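        <p>The definitions in Eqs. (2)-(4) translate directly into code; the sketch below computes all four metrics from confusion-matrix counts, using illustrative numbers rather than our experimental results.</p>
        <preformat>
# Direct implementation of Eqs. (2)-(4) plus the F1 score, computed
# from true/false positive/negative counts. Counts are illustrative.
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)   # Eq. (2)

def precision(tp, fp):
    return tp / (tp + fp)                    # Eq. (3)

def recall(tp, fn):
    return tp / (tp + fn)                    # Eq. (4)

def f1_score(p, r):
    return 2 * p * r / (p + r)               # harmonic mean of precision and recall

tp, tn, fp, fn = 42, 100, 5, 6               # example confusion-matrix counts
p, r = precision(tp, fp), recall(tp, fn)
print(accuracy(tp, tn, fp, fn), p, r, f1_score(p, r))
        </preformat>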
      </sec>
      <sec id="sec-3-3">
        <title>4.3. Results and Discussion</title>
        <p>The goal of a learning algorithm is to find a model that has a good fit, meaning that the model is neither overfitting nor underfitting. Our model has a good fit, as evidenced by the decrease of the training and validation losses to a point of stability with a minimal gap between the final loss values.</p>
        <p>In Figure 5, it can be observed that the maximum training accuracy of the model was reached at epoch 100, where the validation accuracy also reached its maximum of 89.5%. The training and validation curves show good agreement, which confirms the good fit of our model.</p>
        <p>Figure 6 shows a confusion matrix for the ten actions in the UCF Sports action recognition experiment using a batch normalization layer. The diagonal elements indicate the accuracy of recognizing each action type; each row of the confusion matrix represents the true action, and each column represents the predicted action. Our model performs well on some actions, achieving a validation accuracy of 100% for seven actions.</p>
        <p>The classification report is shown in Figure 7, including three criteria: precision, recall, and F1-score. Our model successfully predicted 89.5% of all classes, with an overall accuracy of 89.5%, a precision of 90%, a recall of 88%, and an F1 score of 88%. Table 3 compares our results on the UCF sports dataset with some state-of-the-art methods. Our model significantly outperformed all studies that used handcrafted features, while for studies that used the same approach as ours to extract features, that is, learned features, our model achieves competitive results compared to [21], [22], and [23] in terms of accuracy.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusion and future work</title>
      <p>This study proposes a new approach to human action recognition in sports. The approach is based on feature extraction and uses two models: Inception V3 to extract spatial features (low-, middle-, and high-level) and LSTM to extract temporal features. The FC layer receives the output of the final LSTM layer, and the softmax layer predicts the type of action. One of the key benefits of the proposed method is its ability to aggregate spatial features from Inception V3 and temporal features from LSTM at each time step and in each video frame. This approach improves the model's performance and decision-making by combining information from two sources rather than using them independently. Experimental results on the UCF sports dataset show that our proposed method achieves higher classification accuracy (89.5%) than state-of-the-art sports video classification methods. Our proposed method has shown promising results, but it has its limitations. One potential limitation could be the reliance on the InceptionV3 model for feature extraction, which may not be ideal for all sports videos. Also, the LSTM model, used for capturing time-related changes, might struggle with long sequences.</p>
      <p>For future work, we suggest looking into other pre-trained models for feature extraction, exploring other recurrent architectures such as Gated Recurrent Units (GRU) or Transformer models, and combining additional data such as player statistics or game context. We believe these improvements could make our method even better at classifying sports videos.</p>
      <p>tional bangladeshi sports video classification using [16] K. Soomro, A. R. Zamir, Action recognition in
realdeep learning method, Applied Sciences 11 (2021) istic sports videos, in: Computer vision in sports,
2149. doi:10.3390/app11052149. Springer, 2014, pp. 181–208.
[5] M. A. Russo, A. Filonenko, K.-H. Jo, Sports clas- [17] L. Yefet, L. Wolf, Local trinary patterns for human
sification in sequential frames using cnn and rnn, action recognition, in: 2009 IEEE 12th International
in: 2018 International Conference on Information Conference on Computer Vision, IEEE, 2009. doi:10.
and Communication Technology Robotics (ICT- 1109/iccv.2009.5459201.</p>
      <p>ROBOT), 2018, pp. 1–3. doi:10.1109/ICT-ROBOT. [18] A. Klaser, M. Marszalek, I. Laptev, C. Schmid, Will
2018.8549884. person detection help bag-of-features action
recog[6] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Suk- nition?, Research Report RR-7373, INRIA, 2010.
thankar, L. Fei-Fei, Large-scale video classification URL: https://hal.inria.fr/inria-00514828.
with convolutional neural networks, in: Proceed- [19] H. Wang, A. Klaser, C. Schmid, C.-L. Liu, Action
ings of the IEEE Conference on Computer Vision recognition by dense trajectories, in: CVPR 2011,
and Pattern Recognition (CVPR), 2014. IEEE, 2011. doi:10.1109/cvpr.2011.5995407.
[7] D. Brezeale, D. Cook, Automatic video classifi- [20] J. Yu, M. Jeon, W. Pedrycz, Weighted feature
trajeccation: A survey of the literature, IEEE Trans- tories and concatenated bag-of-features for action
actions on Systems, Man, and Cybernetics, Part recognition, Neurocomputing 131 (2014) 200–207.
C (Applications and Reviews) 38 (2008) 416–430. doi:10.1016/j.neucom.2013.10.024.
doi:10.1109/tsmcc.2008.919173. [21] V. de Oliveira Silva, F. de Barros Vidal, A. R. S.
Ro[8] M. C. Darji, D. Mathpal, A review of video classifi- mariz, Human action recognition based on a
twocation techniques, IRJET Journal 4 (2017). stream convolutional network classifier, in: 2017
[9] H. Guangyu, Analysis of sports video intelligent 16th IEEE International Conference on Machine
classification technology based on neural network Learning and Applications (ICMLA), IEEE, 2017.
algorithm and transfer learning, Computational doi:10.1109/icmla.2017.00-64.
Intelligence and Neuroscience 2022 (2022) 1–10. [22] A. Zare, H. A. Moghaddam, A. Sharifi, Video
spadoi:10.1155/2022/7474581. tiotemporal mapping for human action recognition
[10] V. Podgorelec, Š. Pečnik, G. Vrbančič, Classifica- by convolutional neural network, Pattern Analysis
tion of similar sports images using convolutional and Applications 23 (2019) 265–279. doi:10.1007/
neural network with hyper-parameter optimiza- s10044-019-00788-1.
tion, Applied Sciences 10 (2020) 8494. doi:10. [23] N. Jaouedi, N. Boujnah, M. S. Bouhlel, A new
hy3390/app10238494. brid deep learning model for human action
recog[11] M. A. Russo, L. Kurnianggoro, K.-H. Jo, Classifi- nition, Journal of King Saud University -
Comcation of sports videos with combination of deep puter and Information Sciences 32 (2020) 447–453.
learning models and transfer learning, in: 2019 In- doi:10.1016/j.jksuci.2019.09.004.
ternational Conference on Electrical, Computer and
Communication Engineering (ECCE), IEEE, 2019.</p>
      <p>doi:10.1109/ecace.2019.8679371.
[12] C. Lifu, W. Hong, C. Xianliang, G. Zhenghua, J.
Zhiwei, Convolutional neural network sar image
target recognition based on transfer learning, Chinese</p>
      <p>Space Science Technology 38 (2018) 45–51.
[13] N. Qiu, X. Wang, P. Wang, S. Zhou, Y. Wang,
Research on convolutional neural network algorithm
combined with transfer learning model, Computer</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] H. Liu, N. Shu, Q. Tang, W. Zhang, Computational model based on neural network of visual cortex for human action recognition, IEEE Transactions on Neural Networks and Learning Systems 29 (2018) 1427–1440. doi:10.1109/tnnls.2017.2669522.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K. G.</given-names>
            <surname>Derpanis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lecce</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Daniilidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. P.</given-names>
            <surname>Wildes</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>