<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ConvLSTM for Table Tennis Stroke Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jansi Rani Sella Veluswami</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ananth Narayanan P</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bhuvan S</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shobith Kumar R</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Sri Sivasubramaniya Nadar College of Engineering</institution>
          ,
          <addr-line>Tamil Nadu</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Our study concentrates on sports video analytics, particularly stroke classification. We use a model that combines Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM), trained on the MediaEval Fine-Grained Action Classification of the Table Tennis Strokes dataset. With an accuracy of 81.4%, our model effectively classifies table tennis moves, providing insights for post-match commentary and playstyle analysis. This effectiveness is demonstrated in the context of the MediaEval 2023 benchmark.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        The provided baseline methodology [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] proposes two 3D-CNN architectures for
the subtask, both built on spatio-temporal convolutions
and attention mechanisms. Other prior work has centered on
CNN- and LSTM-based methods. For example, Kaustubh Milind Kulkarni
et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] presented an LSTM model, a TCN model, and a combined TCN + LSTM model,
using pose estimation and a Savitzky-Golay filter for feature extraction. Kadir Aktas et al.
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] take another approach, in which RGB images are used as the input data without any prior
feature extraction. They used an LSTM model to achieve about 79.8% accuracy on validation data.
Inspired by this idea, we also use the RGB images directly, without any feature extraction,
in our work.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Approach</title>
      <p>A Convolutional Neural Network (CNN or ConvNet) is a type of deep neural network specifically
designed for processing image data. It applies kernels, also known as filters, across the image to
generate feature maps, which represent the presence of specific features at various locations
within the image. Early layers produce a small number of feature maps; subsequent layers
increase and refine them, while pooling operations reduce their spatial size and retain
the most salient information.</p>
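      <p>As a minimal illustration of how a kernel produces a feature map and how pooling shrinks it, the following pure-Python sketch slides a small kernel over a toy image. The image, the kernel values, and the function names are hypothetical and chosen only for this example; they are not taken from our model.</p>
      <preformat>
```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum of the patch under the kernel (cross-correlation,
            # which deep learning libraries refer to as convolution).
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        feature_map.append(row)
    return feature_map


def max_pool_2x2(feature_map):
    """Keep the largest value in each non-overlapping 2x2 window."""
    pooled = []
    for i in range(0, len(feature_map) - 1, 2):
        row = []
        for j in range(0, len(feature_map[0]) - 1, 2):
            row.append(max(feature_map[i][j], feature_map[i][j + 1],
                           feature_map[i + 1][j], feature_map[i + 1][j + 1]))
        pooled.append(row)
    return pooled


# Toy 5x5 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hypothetical 2x2 kernel that responds to a dark-to-bright vertical edge.
kernel = [[-1, 1],
          [-1, 1]]

feature_map = conv2d_valid(image, kernel)  # 4x4 map, non-zero only at the edge
pooled = max_pool_2x2(feature_map)         # 2x2 map, edge response preserved
```
      </preformat>
      <p>The pooled map is four times smaller than the feature map but still records where the edge was detected, which is the trade-off pooling is meant to provide.</p>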
      <p>On the other hand, a Long Short-Term Memory (LSTM) network is specifically designed to
handle sequential data, taking into account all previous inputs to generate an output. LSTMs
are a type of Recurrent Neural Network (RNN) that addresses the vanishing gradient problem,
a limitation of traditional RNNs in handling long-term dependencies in input sequences. This
enables LSTM cells to maintain context for extended periods, making them better suited for tasks
such as time series prediction, speech recognition, language translation, and music composition.</p>
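      <p>The gating behaviour described above can be sketched with a single scalar LSTM cell. This is a simplified illustration using the standard LSTM update equations, not the layer used in our model; the parameter names and the bias values are assumptions made for the example.</p>
      <preformat>
```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def pre_activation(name, x, h_prev, params):
    """Input weight * x + recurrent weight * h_prev + bias for one gate."""
    return params["w_" + name] * x + params["u_" + name] * h_prev + params["b_" + name]


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a scalar LSTM cell; returns the new (h, c)."""
    f = sigmoid(pre_activation("f", x, h_prev, params))    # forget gate
    i = sigmoid(pre_activation("i", x, h_prev, params))    # input gate
    o = sigmoid(pre_activation("o", x, h_prev, params))    # output gate
    g = math.tanh(pre_activation("g", x, h_prev, params))  # candidate value
    c = f * c_prev + i * g   # keep part of the old memory, add part of the new
    h = o * math.tanh(c)     # expose part of the memory as the output
    return h, c


# Hypothetical parameters: all weights zero, biases chosen so that the cell
# holds on to its memory and ignores new input.
params = {prefix + gate: 0.0 for prefix in ("w_", "u_", "b_") for gate in "fiog"}
params["b_f"] = 10.0    # forget gate stays close to 1
params["b_i"] = -10.0   # input gate stays close to 0

h, c = 0.0, 1.0
for _ in range(50):
    h, c = lstm_step(0.0, h, c, params)
```
      </preformat>
      <p>After 50 steps the cell state is still close to its initial value, because it is carried forward through the additive update rather than being repeatedly squashed. This is the mechanism that lets LSTMs retain context over long sequences.</p>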
      <p>For action recognition, we employ a CNN + LSTM network to leverage
the spatio-temporal structure of videos. This combination enables the network to effectively
analyze and recognize actions within video sequences.</p>
      <sec id="sec-3-1">
        <title>3.1. Data Preprocessing</title>
        <p>Data preparation involves identifying the classes and two functions: one for
frame extraction, which resizes and normalizes each frame, and another for dataset construction,
which collects features, labels, and video paths. Dataset creation keeps only the
videos that match the specified sequence length. Running the dataset creation function
on a directory returns three objects. The features are the frames extracted from each video.
The labels identify the class of each video for subsequent model training. The
paths reference the
physical location of each video file in the dataset.</p>
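        <p>The two functions described above could be sketched as follows. This is a simplified stand-in under stated assumptions: videos are represented as in-memory lists of already decoded frames, resizing is omitted, frames are sampled at evenly spaced positions, and all function names, class names, and file names are hypothetical.</p>
        <preformat>
```python
def sample_frame_indices(total_frames, sequence_length):
    """Pick up to `sequence_length` evenly spaced frame indices from a video."""
    step = max(total_frames // sequence_length, 1)
    indices = [k * step for k in range(sequence_length)]
    # Drop indices that fall past the end of a short video.
    return [idx for idx in indices if total_frames > idx]


def normalize_frame(frame):
    """Scale 8-bit pixel values (0..255) to the range 0..1."""
    return [[pixel / 255.0 for pixel in row] for row in frame]


def build_dataset(videos, class_names, sequence_length):
    """Return (features, labels, paths), skipping videos that are too short."""
    features, labels, paths = [], [], []
    for path, class_name, frames in videos:
        indices = sample_frame_indices(len(frames), sequence_length)
        if len(indices) != sequence_length:
            continue  # the video cannot supply a full sequence
        features.append([normalize_frame(frames[idx]) for idx in indices])
        labels.append(class_names.index(class_name))
        paths.append(path)
    return features, labels, paths


# Hypothetical toy data: 1x1 "frames" and made-up class and file names.
class_names = ["class_a", "class_b"]
videos = [
    ("a.mp4", "class_a", [[[255]] for _ in range(8)]),  # long enough
    ("b.mp4", "class_b", [[[255]] for _ in range(2)]),  # too short, filtered out
]
features, labels, paths = build_dataset(videos, class_names, sequence_length=4)
```
        </preformat>
        <p>Only the first video survives the length filter, so the three returned lists stay aligned: one sequence of four normalized frames, one label, and one path.</p>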
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Proposed Model</title>
        <p>The model is implemented in TensorFlow and incorporates both Convolutional
Neural Network (CNN) and Long Short-Term Memory (LSTM) architectures. This choice is
informed by the well-established effectiveness of these architectures in video content analysis
tasks.</p>
        <p>To construct our model, we use Keras ConvLSTM recurrent layers, a critical
architectural decision for video classification tasks. These layers excel at processing
spatiotemporal information within video sequences. We configure each layer with parameters
such as the number of filters, kernel size, and activation function for its convolutional
operations. The resulting sequences are then passed through further layers that reduce frame
dimensions to lower the computational load, and through Dropout layers that
mitigate the risk of overfitting. The architecture is intentionally kept simple, with a limited number
of trainable parameters, commensurate with the scale of the dataset. A final Dense layer
with softmax activation yields a probability distribution
across the action categories.</p>
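        <p>A model of this general shape could be assembled in Keras roughly as follows. This is a sketch under assumptions, not our exact configuration: the number of ConvLSTM layers, the filter counts, kernel sizes, dropout rate, the use of MaxPooling3D to reduce frame dimensions, and the placeholder number of classes are all illustrative choices. The input size of 60 frames at 64 by 64 pixels corresponds to one of the settings reported in Section 4.</p>
        <preformat>
```python
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 5  # placeholder; set this to the number of stroke classes


def build_convlstm_model(sequence_length, height, width, num_classes):
    """Small ConvLSTM classifier over RGB frame sequences (illustrative sizes)."""
    return keras.Sequential([
        keras.Input(shape=(sequence_length, height, width, 3)),
        # ConvLSTM layer: convolutions inside the recurrent cell.
        layers.ConvLSTM2D(filters=4, kernel_size=(3, 3), activation="tanh",
                          return_sequences=True),
        # Pool only over the spatial axes to shrink each frame.
        layers.MaxPooling3D(pool_size=(1, 2, 2), padding="same"),
        layers.TimeDistributed(layers.Dropout(0.2)),
        layers.ConvLSTM2D(filters=8, kernel_size=(3, 3), activation="tanh",
                          return_sequences=True),
        layers.MaxPooling3D(pool_size=(1, 2, 2), padding="same"),
        layers.Flatten(),
        # Softmax output: one probability per action category.
        layers.Dense(num_classes, activation="softmax"),
    ])


model = build_convlstm_model(sequence_length=60, height=64, width=64,
                             num_classes=NUM_CLASSES)
```
        </preformat>
        <p>Calling <monospace>model.summary()</monospace> on such a model lists the output shape and parameter count of each layer, which is a convenient way to check that the network stays small relative to the dataset.</p>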
        <p>The constructed model is then compiled with categorical cross-entropy as the loss function,
the Adam optimizer, and accuracy as the evaluation metric. Training uses
an early stopping callback to prevent overfitting. Together, these choices form a cohesive and
efficient framework for analyzing spatiotemporal patterns in table tennis
stroke videos. Following established practices in architectural design and training
should also help the model adapt to other action recognition
tasks.</p>
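        <p>The logic behind early stopping can be sketched in a few lines of plain Python. The function name, the patience value, and the loss values below are hypothetical and serve only to show the stopping rule; in Keras the same behaviour is available through the <monospace>keras.callbacks.EarlyStopping</monospace> callback.</p>
        <preformat>
```python
def early_stopping_epoch(val_losses, patience):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs. If that never happens,
    the index of the last epoch is returned.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if best_loss > loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1


# Made-up validation losses: the best value is reached at epoch 2.
val_losses = [0.9, 0.7, 0.6, 0.65, 0.62, 0.61, 0.64]
stop_epoch = early_stopping_epoch(val_losses, patience=3)
```
        </preformat>
        <p>With a patience of 3, training halts three epochs after the best validation loss, before further epochs can fit the training data more closely without improving validation performance.</p>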
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Analysis</title>
      <p>Accuracy was recorded after every training epoch, and the results demonstrate
a high level of accuracy. Training accuracy peaked at 97%, while validation
accuracy reached 98.8%. Several factors contributed to these results. First, the volume and
distribution of the data had a significant impact on accuracy. Second, the image height,
width, and sequence length all had a significant effect on the results, with accuracy ranging from
0.7408 for an image of dimensions 90×80 and a sequence length of 60 to 0.9876 for an image of
dimensions 64×64 and a sequence length of 60. It is worth noting that the data distribution is
highly skewed toward certain classes, which leads to biased learning. Over the course
of five runs, the highest global accuracy achieved by the model was 81.4%.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion and Outlook</title>
      <p>Throughout the training and validation phase, we have attained encouraging outcomes that
lead us to conclude that overfitting is not present.</p>
      <p>Nonetheless, the model’s performance on the test data reveals that it has not learned effectively
and is unable to generalize. In our opinion, this challenge can be remedied by increasing the
quantity of labeled data used in training.</p>
      <p>Moreover, we posit that the low variability between the classes and the nature of the task
contribute to this issue. Since a single class can be performed in various ways by
different players, for example right- or left-handed and more or less experienced ones, we suggest that the dataset
could be enhanced by increasing the coverage of the classes and reducing bias among them.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.-E.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <article-title>Baseline method for the sport task of mediaeval 2023 3d cnns using attention mechanisms for table tennis stoke detection and classification</article-title>
          ,
          <source>MediaEval Workshop</source>
          <year>2023</year>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Erades</surname>
          </string-name>
          , P.-E. Martin,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vuillemot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mansencal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Péteri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Morlier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Duffner</surname>
          </string-name>
          , J. Benois-Pineau,
          <article-title>SportsVideo: A Multimedia Dataset for Event and Position Detection in Table Tennis and Swimming</article-title>
          ,
          <source>MediaEval Workshop</source>
          <year>2023</year>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K. M.</given-names>
            <surname>Kulkarni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shenoy</surname>
          </string-name>
          ,
          <article-title>Table tennis stroke recognition using two-dimensional human pose estimation</article-title>
          ,
          <source>CVPR Sports Workshop</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>K.</given-names>
            <surname>Aktas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Demirel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Moor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Olesk</surname>
          </string-name>
          , G. Anbarjafari,
          <article-title>Spatio-temporal based table tennis hit assessment using lstm algorithm</article-title>
          ,
          <source>MediaEval</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Zahra</surname>
          </string-name>
          , P.-E. Martin,
          <article-title>Two stream network for stroke detection in table tennis</article-title>
          ,
          <source>MediaEval</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] HCMUS at MediaEval'20:
          <article-title>Ensembles of Temporal Deep Neural Networks for Table Tennis Strokes Classification Task</article-title>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P.-E.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Benois-Pineau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mansencal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Péteri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Morlier</surname>
          </string-name>
          ,
          <article-title>Siamese spatio-temporal convolutional neural network for stroke classification in table tennis games</article-title>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>