<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Automatic Sports Video Classification Using CNN-LSTM Approach⋆</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Benoughidene Abdelhalim</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Titouna Faiza</string-name>
          <email>ftitouna@yahoo.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>6th International Hybrid Conference On Informatics And Applied Mathematics</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of computer science, LaSTIC Laboratory, University of Batna2</institution>
          ,
          <country country="DZ">Algeria</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>With the exponential growth and availability of sports video data, the need for video analysis has become crucial. Sports video classification is one of the most challenging problems among computer vision researchers. This paper proposes a novel method for sports video classification based on a deep learning approach using the pre-trained model Inception V3 combined with long short-term memory (LSTM). The pre-trained model Inception V3 extracts the low, middle, and high spatial features. Additionally, we will get the temporal features by feeding LSTM with spatial features. Then, we trained our InceptionV3-LSTM model based on spatial-temporal features by feeding it to the LSTM to classify the video sports into specific categories. The experiments are conducted on the UCF sports dataset. The experiments performed showed that our proposed model has obtained much more encouraging experimental results than the others.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Transfer learning</kwd>
        <kwd>Inception V3</kwd>
        <kwd>Long short-term memory (LSTM)</kwd>
        <kwd>Deep learning</kwd>
        <kwd>Sport video classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>combine them [5].</p>
      <p>
        To take account of the two spatio-temporal aspects
As the volume of sports video data increases, video classi- of video, we use a hybrid architecture consisting of
conifcation becomes essential for understanding and organiz- volutional layers to handle spatial information and
reing this data. Video classification is an important visual current layers to handle temporal information. The
twotask in computer vision and has been used in a variety of stream convolutional neural network (CNN) and the
reapplications, including motion recognition [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and scene current neural network (RNN), which handle both
spatioclassification [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Video classification enables users to temporal information, proved to be better than their
easily access and understand sports videos and provides single-stream counterparts [6].
insight into the game or sport [3].The aim of sports video In this paper, we focus on the classification of sports
classification is to automatically classify sports videos ac- videos from the UCF dataset. We will use a transfer
learncording to the sporting events they contain, which have ing method to pre-train a CNN model (Inception V3)
spatial and motion features [4]. to extract low, middle, and high spatial features. Then,
      </p>
      <p>Deep learning algorithms have many applications, we will use Long Short-Term Memory (LSTM) layers to
such as object recognition from images, search engines, extract temporal features to classify sports videos into
and speech recognition. However, sports video classifi- specific categories. The proposed model has achieved
cation is a relatively new field. With the development impressive performance on this dataset compared with
of deep learning techniques, this field has attracted the some existing methods. Contributions to this work
inattention of researchers looking for new challenges. A clude:
video is composed of a set of sequential images. Each
image contains information about the spatial content,
while the temporal sequence of images contains
information about the motion. In order to represent a video
in a comprehensive and informative way, it is necessary
to capture both spatial and temporal information and</p>
      <sec id="sec-1-1">
        <title>1. A new deep neural network model based on LSTM</title>
        <p>is proposed to classify video frames into specific
classes;
2. In this model, three spatial feature classes are
extracted by a pre-trained CNN model (Inception
V3) instead of hand-created features. This
decision was taken to avoid system complexity and
improve performance. The aim of this work is to
prove that CNN features can outperform
handcrafted ones;
3. In addition, we combine spatial information
from the CNN model (Inception V3) with
temporal information from the LSTM model to
improve model performance and decision-making.</p>
        <p>This method is efective because it provides known as gates. These gates are responsible for
managbetter-quality information by combining diferent ing the cell state, thereby controlling the long-term and
sources rather than using them separately. short-term memory of the network. They play a crucial
Trdoeeuhlsraectrrpeeidabspuewseltrosth.crekoSenpocsrntioisoptssnopsoo5efrddtfsoimsucveruitdssheseoeocdstoiccolloaonnsgssc.yil.SficuaSedteciitcnoitgonion.rne2Sm4edcapitrsrikceoussn.essn3etss 1lroao0rylAedneiredis,nudcaraidlotsriendootseenr.dreamT.fllehiyrne,riewondugettwpoeumhataspftlraoiondymfeeondtrhsmoeisnaltaeliayoyfenuerlr,slywhisohcutoihclndehnnbceepocrnrtoeesctdiasetis(nsFseeCoddf)
through a softmax activation function for the final
pre2. Related Work diction. This entire approach was implemented using the
Keras API. Indeed, deep learning methods have proven
Researchers are currently looking to develop more accu- to be efective and perform well on tasks involving the
rate and reliable algorithms to improve the classification comparison of video classifications.
of sports videos. The classification of sports videos is a
new field that is developing rapidly with the use of deep 3.1. Features Extraction Process
learning techniques. Features can be extracted from text,
audio, and visual information [7]. In this work, we fo- Classifying actions from videos requires extracting a set
cus solely on visual information, as vision is the primary of features that are anticipated to contain the data
necesmeans by which humans perceive information [8]. sary to distinguish between various actions. There are</p>
        <p>Deep learning methods have recently been shown to three types of features: the first is low-level features like
be able to automatically extract complex features from corners, edges, and simple textures; the second is
middleimages. Building on this, many researchers have worked level features like complex textures and shapes; and the
on the two latest deep learning techniques, convolutional third is high-level features like objects or parts of objects.
neural networks (CNNs) and recurrent neural networks The feature extraction technique is explained below.
(RNNs), to extract spatial and temporal features from
sports videos [4]. In [9], the authors presented a transfer 3.1.1. Transfer learning via feature extraction
learning algorithm for DNN-based video classification. [Low, Middle, high]:
The algorithm is designed as a video classification system
based on a convolutional neural network. Podgorelec et Conventional approaches to video classification have
al, in [10] used a transfer learning approach to fine-tune used frame-based features to generate a representation
a pre-trained CNN model for a sports image classifica- for the videos. In this work, various types of features,
tion task. They developed a classification method using such as low, middle, and high, have been used to capture
CNN transfer learning with Hyper Parameter Optimiza- various types of information. These features are obtained
tion (HPO). Russo et al, in [11] proposed a model that with the help of a transfer learning strategy. Transfer
combines deep learning and transfer learning. They com- learning is an efective strategy for training networks on
bined VGGNET16, RNN, and GRU functions with transfer a small data set. The network is pre-trained on a large
learning for the final classification. Chen et al, in [ 12] dataset, such as ImageNet, and then reused to train a new
investigated simple sports classification in sports videos task. This ofers a significant advantage over training the
by detecting motion poses in video frames. For example, network from scratch, as it requires less time and less
they can detect running, jumping, translation, and zoom- data. Transfer learning can save both time and
computing. Qiu et al, in [13] showed that principal components ing costs [14]. There are many pre-trained models on
can be used to reduce the dimensions of visual and au- the ImageNet dataset, such as AlexNet, VGG16, ResNet
dio video features. A time series of motion features can and InceptionV3. These models can be used to extract
be used to characterize motion event classifications in features from the data or fine-tune them to perform a
soccer sports videos. new task.</p>
        <p>In this work, we used transfer learning to extract
features. Convolutional Neural Networks (CNN)
automatically learn features from images at a hierarchical level.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Proposed Methodology</title>
      <p>In this work, we aim to develop a model for sport video
classification tasks that starts with two basic steps: the
feature extraction process generated by the pre-trained
model InceptionV3 and the training process by LSTM
to predict action as presented in Figure 1. We utilized
an LSTM network, which is equipped with mechanisms</p>
      <sec id="sec-2-1">
        <title>3.1.2. Extracting features generated by</title>
      </sec>
      <sec id="sec-2-2">
        <title>InceptionV3:</title>
        <sec id="sec-2-2-1">
          <title>Inception V3, a widely recognized convolutional neural network, is frequently utilized in tasks involving image classification and feature extraction. It demonstrates</title>
          <p>Pre-Trained InceptionV3 model as Feature</p>
          <p>Extraction</p>
          <p>A sequence of 20 frames
Input(20, 224, 2224, 3)
Block-00
high accuracy, achieving over 75% - 78% on the ImageNet responsible for identifying low-level features; the later
dataset, and is known for its speed and precision. The convolutional layers identify middle-level features; and
architecture of InceptionV3, equipped with multiple ‘In- the last convolutional layers identify high-level features,
ception’ modules, is designed to capture intricate patterns which are usually very specific to the task they are trained
and hierarchies in images, making it particularly suitable for. In the InceptionV3 architecture, lower layers (such
for sports video classification, where data often exhibits as block-02) are usually responsible for capturing
lowcomplex spatial hierarchies and patterns. However, the level features such as edges and textures. Mid-level layers
selection of a model can be influenced by the specific (such as block-05) can capture more complex features like
requirements of the task and the characteristics of the shapes, and higher layers (such as block-10) can capture
data. Therefore, it could be advantageous to compare high-level semantic information, such as the presence of
the performance of InceptionV3 with other pre-trained specific objects in the image. (see Figure 2)
models in our future work. 1. The low-level features: Were extracted from</p>
          <p>In this work, we propose an approach based on transfer the block02, with 288 dimensions.
learning that uses InceptionV3 to extract features from 2. The middle-level features: Were extracted
frames. The selection of specific blocks from the Incep- from the block05, with 768 dimensions.
tionV3 model for feature extraction is typically based 3. The high-level features: Were extracted from
on the type of information these blocks can capture and the last block10, with 2048 dimensions.
their impact on the overall performance of the model.</p>
          <p>The first layers of any neural network are basically</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>The rationale behind selecting these specific blocks could be that the combination of low, mid, and high-level fea</title>
          <p>(20, 8, 8, 2048)
Global Averrage</p>
          <p>Pooling2
Output (20, 2048)
High-level features</p>
          <p>LSTM
Output (128)
s
s
e
c
o
r
P
n
o
it
c
a
r
t
x
E
s
e
r
u
t
a
e
F
s
s
e
c
o
r
P
g
n
i
n
n
i
a
r
T
tures provides a comprehensive representation of the
video content, which is beneficial for the task of sports
video classification.
• CE : Loss function (categorical Cross-Entropy);
• M : The number of classes;
• Y : The ground truth;
• P : The expected probabilistic observation of class
c.</p>
        </sec>
        <sec id="sec-2-2-3">
          <title>Model parameters are updated using gradient descent</title>
          <p>and backpropagation error. During the learning process,
the joint weights of the pre-trained model are frozen, in
particular from block 0 to block 10. Finally, the LSTM
output passes through a fully connected layer with softmax
activation functions.</p>
          <p>Experimental results show that our method is efective
in classifying sports videos, with the model achieving
89.5% accuracy, 90% precision, 88% recall, and 88% f1
score.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Experimental results and discussion</title>
      <p>4.1. UCF Sports Action Data Set</p>
      <sec id="sec-3-1">
        <title>We evaluated the performance of our proposed model</title>
        <p>by testing it on the UCF sports action dataset, which
contains videos of various actions from diferent sports
Figure 3. The dataset includes human-annotated action
boundaries, which are shown in yellow. UCF is
considered one of the best datasets for applications that require
action localization and recognition.
3.2. Training process The dataset comprises 150 videos divided into 10
categories, filmed in diferent environments. Since each
We train the LSTM and Fully Connected (FC) layer to category has a diferent number of videos. Figure 4
perform the classification process. Whereas CNN blocks shows the distribution of the number of videos for each
cannot be trained to extract features. This saves us less category. The video resolution is 720 x 480 and the frame
consumption of resources, parameters and computation rate is 10 fps. The total duration of the dataset is 958
time. After passing the data through the model, we grad- seconds, and the average sequence length is 6.39 seconds.
ually decrease the computed error using the categorical Table 1 shows the characteristics of the data set [15] [16].
cross-entropy formula provided by the following loss The 20 frames of each video were used for video
clasfunction. sification, as shown in Table 2.</p>
        <p>(1)
4.2. Evaluation Metrics</p>
      </sec>
      <sec id="sec-3-2">
        <title>The quality of video classification models is measured using various performance metrics. Precision, accuracy,</title>
        <p>recall and F1 score are among the most common
measures. We uses these metrics to evaluate ours model.
These metrics are defined as follows:
1. Accuracy: It is a measure of how well a model
performs across all categories. Accuracy is
useful when all categories are equally important. It
is calculated by dividing the number of correct
predictions by the total number of predictions.
4. F1 score: It is a measure that combines accuracy
and recall into a single score. F1 score ranges
from 0 to 1, with a score of 1 indicating the best
model performance.</p>
      </sec>
      <sec id="sec-3-3">
        <title>The goal of a learning algorithm is to find a model that</title>
        <p>has a good fit, meaning that the model is not overfitting
or underfitting. Our model has a good fit, as evidenced
by the decrease in the training and validation losses to a
point of stability with a minimum gap between the final
loss values.</p>
        <p>In Figure 5, it can be observed that the maximum
training accuracy for the model was reached at epoch 100,
where the validation accuracy also reached its maximum,
which is 89.5%. The learning curve plot shows good
agreement, which confirms the good fit of our model.
 
  +  
  +  
  +   +   +  
 = (2) Figure 6 shows a confusion matrix for ten actions
in the UCF Sport action recognition experiment using a
2. Precision: It is the ratio of the number of positive batch normalization layer. The diagonal elements
indisamples correctly classified to the total number cate the accuracy of recognizing each action type. Each
of samples classified as positive, including incor- row of the confusion matrix represents the true action,
rectly classified positive samples. and each column represents the predicted action. Our
model performs well on some actions, achieving a
vali  =   (3) dation accuracy of 100% for seven actions.</p>
        <p>+   The classification report is shown in Figure 7
includ</p>
        <p>ing three criteria: precision, recall, and F1-score. Our
3. tsnRhauememcmpablleoelrrs:eoIttfhppiaostostsiwthiiteevivrereeascstaaiomomrrpopelfcleettsslhy.aeTrdenheudetemehctitbegecehdtreetodrof.thtpheoersetitociatvalell, tpmerneeTtcahaicbsotildioeonsn3u,sc8c,oc8aem%ssptsohfafuerrlelmesycooapduleler,rlfraoyenrsidemull8dst8sst%oh8ne9o.t5cfhl%fae1soUssficCficaoaFcrtceisou.pnroarbctyes,tdw9a0et%aesnoetf
with some state-of-the-art methods. Our model
signifi = (4) cantly outperformed all studies that used the handcrafted
features method, but for studies that used our same
approach to extracting features, that is, the learned features,
our model achieves competitive results compared to [21],
[22] and [23] in terms of accuracy.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusion and future work</title>
      <p>This study proposes a new approach to identifying
human action recognition in sports. The approach is based
on feature extraction and uses two models: Inception V3
to extract spatial features (low, middle, and high-level)
and LSTM to extract temporal features. The FC layer
receives the output of the final LSTM layer, and the
softmax layer predicts the type of action. One of the key
benefits of the proposed method is its ability to
aggregate spatial features from Inception V3 and temporal
features from LSTM at each time step and in each video
frame. This approach improves the model’s performance
and decision-making by combining information from
two sources rather than using them independently.
Experimental results on the UCF sports dataset show that
our proposed method achieves higher classification
accuracy 89.5% than state-of-the-art sports video classification
methods. Our proposed method has shown promising
results, but it has its limitations. One potential
limitation could be the reliance on the InceptionV3 model for
feature extraction, which may not be ideal for all sports
videos. Also, the LSTM model, used for capturing
timerelated changes, might struggle with long sequences.</p>
      <p>For future work, we suggest looking into other
pretrained models for feature extraction, exploring more
advanced versions of LSTM like Gated Recurrent Units
(GRU) or Transformer models, and combine more data
like player stats or game context. We believe these
improvements could make our method even better at
classifying sports videos.
Dynamic scene understanding: The role of orien- engineering and applications 56 (2020) 43–48.
tation features in space and time in scene clas- [14] F. Zhuang, Z. Qi, K. Duan, D. Xi, Y. Zhu, H. Zhu,
sification, in: 2012 IEEE Conference on Com- H. Xiong, Q. He, A comprehensive survey on
transputer Vision and Pattern Recognition, IEEE, 2012. fer learning, Proceedings of the IEEE 109 (2020)
doi:10.1109/cvpr.2012.6247815. 43–76.
[3] J. Zheng, X. Cao, B. Zhang, X. Zhen, X. Su, Deep [15] M. D. Rodriguez, J. Ahmed, M. Shah, Action
ensemble machine for video classification, IEEE MACH a spatio-temporal maximum average
corTransactions on Neural Networks and Learning relation height filter for action recognition, in:
Systems 30 (2019) 553–565. doi:10.1109/tnnls. 2008 IEEE Conference on Computer Vision and
Pat2018.2844464. tern Recognition, IEEE, 2008. doi:10.1109/cvpr.
[4] M. S. Sarma, K. Deb, P. K. Dhar, T. Koshiba, Tradi- 2008.4587727.</p>
      <p>tional bangladeshi sports video classification using [16] K. Soomro, A. R. Zamir, Action recognition in
realdeep learning method, Applied Sciences 11 (2021) istic sports videos, in: Computer vision in sports,
2149. doi:10.3390/app11052149. Springer, 2014, pp. 181–208.
[5] M. A. Russo, A. Filonenko, K.-H. Jo, Sports clas- [17] L. Yefet, L. Wolf, Local trinary patterns for human
sification in sequential frames using cnn and rnn, action recognition, in: 2009 IEEE 12th International
in: 2018 International Conference on Information Conference on Computer Vision, IEEE, 2009. doi:10.
and Communication Technology Robotics (ICT- 1109/iccv.2009.5459201.</p>
      <p>ROBOT), 2018, pp. 1–3. doi:10.1109/ICT-ROBOT. [18] A. Klaser, M. Marszalek, I. Laptev, C. Schmid, Will
2018.8549884. person detection help bag-of-features action
recog[6] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Suk- nition?, Research Report RR-7373, INRIA, 2010.
thankar, L. Fei-Fei, Large-scale video classification URL: https://hal.inria.fr/inria-00514828.
with convolutional neural networks, in: Proceed- [19] H. Wang, A. Klaser, C. Schmid, C.-L. Liu, Action
ings of the IEEE Conference on Computer Vision recognition by dense trajectories, in: CVPR 2011,
and Pattern Recognition (CVPR), 2014. IEEE, 2011. doi:10.1109/cvpr.2011.5995407.
[7] D. Brezeale, D. Cook, Automatic video classifi- [20] J. Yu, M. Jeon, W. Pedrycz, Weighted feature
trajeccation: A survey of the literature, IEEE Trans- tories and concatenated bag-of-features for action
actions on Systems, Man, and Cybernetics, Part recognition, Neurocomputing 131 (2014) 200–207.
C (Applications and Reviews) 38 (2008) 416–430. doi:10.1016/j.neucom.2013.10.024.
doi:10.1109/tsmcc.2008.919173. [21] V. de Oliveira Silva, F. de Barros Vidal, A. R. S.
Ro[8] M. C. Darji, D. Mathpal, A review of video classifi- mariz, Human action recognition based on a
twocation techniques, IRJET Journal 4 (2017). stream convolutional network classifier, in: 2017
[9] H. Guangyu, Analysis of sports video intelligent 16th IEEE International Conference on Machine
classification technology based on neural network Learning and Applications (ICMLA), IEEE, 2017.
algorithm and transfer learning, Computational doi:10.1109/icmla.2017.00-64.
Intelligence and Neuroscience 2022 (2022) 1–10. [22] A. Zare, H. A. Moghaddam, A. Sharifi, Video
spadoi:10.1155/2022/7474581. tiotemporal mapping for human action recognition
[10] V. Podgorelec, Š. Pečnik, G. Vrbančič, Classifica- by convolutional neural network, Pattern Analysis
tion of similar sports images using convolutional and Applications 23 (2019) 265–279. doi:10.1007/
neural network with hyper-parameter optimiza- s10044-019-00788-1.
tion, Applied Sciences 10 (2020) 8494. doi:10. [23] N. Jaouedi, N. Boujnah, M. S. Bouhlel, A new
hy3390/app10238494. brid deep learning model for human action
recog[11] M. A. Russo, L. Kurnianggoro, K.-H. Jo, Classifi- nition, Journal of King Saud University -
Comcation of sports videos with combination of deep puter and Information Sciences 32 (2020) 447–453.
learning models and transfer learning, in: 2019 In- doi:10.1016/j.jksuci.2019.09.004.
ternational Conference on Electrical, Computer and
Communication Engineering (ECCE), IEEE, 2019.</p>
      <p>doi:10.1109/ecace.2019.8679371.
[12] C. Lifu, W. Hong, C. Xianliang, G. Zhenghua, J.
Zhiwei, Convolutional neural network sar image
target recognition based on transfer learning, Chinese</p>
      <p>Space Science Technology 38 (2018) 45–51.
[13] N. Qiu, X. Wang, P. Wang, S. Zhou, Y. Wang,
Research on convolutional neural network algorithm
combined with transfer learning model, Computer</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Tang</surname>
          </string-name>
          , W. Zhang,
          <article-title>Computational model based on neural network of visual cortex for human action recognition</article-title>
          ,
          <source>IEEE Transactions on Neural Networks and Learning Systems</source>
          <volume>29</volume>
          (
          <year>2018</year>
          )
          <fpage>1427</fpage>
          -
          <lpage>1440</lpage>
          . doi:
          <volume>10</volume>
          .1109/tnnls.
          <year>2017</year>
          .
          <volume>2669522</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K. G.</given-names>
            <surname>Derpanis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lecce</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Daniilidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. P.</given-names>
            <surname>Wildes</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>