<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Spatio-Temporal Based Table Tennis Hit Assessment Using LSTM Algorithm</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kadir Aktas</string-name>
          <email>kadir.aktas@ut.ee</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mehmet Demirel</string-name>
          <email>mehmet.demirel@student.manchester.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marilin Moor</string-name>
          <email>marilinm@ut.ee</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Johanna Olesk</string-name>
          <email>johanna.olesk@ut.ee</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gholamreza Anbarjafari</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>PwC Advisory Finland</institution>
          ,
          <addr-line>Itämerentori 2, 00180 Helsinki</addr-line>
          ,
          <country country="FI">Finland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Manchester</institution>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Tartu</institution>
          ,
          <country country="EE">Estonia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
<p>In these working notes, we present our approach and results for the MediaEval 2020 Sports Video Classification Task [6]. We implemented a multi-stage pipeline with an LSTM-based network. In the developed approach, the frames are first extracted, sampled and resized. Then, considering that the stroke type has three different parts, each part is labelled and predicted separately. In order to obtain the predicted stroke type, the predictions for each part are fused together.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Sports action recognition is a well-studied research topic due to
its wide application area and commercial value. Although many
methods have been developed for different sports tasks [
        <xref ref-type="bibr" rid="ref2 ref9">2, 9</xref>
        ], the challenge
of performing more precise analysis remains open, especially
for low-variance classification tasks such as table tennis stroke
type classification. To address this challenge, Martin et al. collected the
TTSTROKE-21 dataset [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], and the MediaEval 2019 and 2020 Sports
Video Classification Tasks were created [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ].
      </p>
      <p>
        In this paper, we present a multi-stage spatio-temporal
recognition method using long short-term memory (LSTM) [
        <xref ref-type="bibr" rid="ref13 ref15 ref4">4, 13, 15</xref>
        ]
based network. Our architecture predicts the final label in three
stages. In the first stage, the position (serve, offensive, defensive)
is classified. In the second stage, the hand orientation (forehand,
backhand) is classified. Finally, in the third stage, the hit technique
(flip, hit, push, block, loop, topspin, backspin, sidespin) is predicted
using one of three different models: the first model classifies serve
techniques, while the second and third models classify offensive and
defensive techniques, respectively. Lastly, in order to obtain the final
stroke type, a fusion of the labels predicted in the three stages is applied.
      </p>
    </sec>
    <sec id="sec-2">
      <title>RELATED WORK</title>
      <p>
        Recently there has been an increase in the number of studies on table
tennis stroke type recognition from videos. Martin et al.
collected the TTSTROKE-21 dataset and proposed a Twin Spatio-Temporal
Convolutional Neural Network (TSTCNN). Their network uses an
RGB image sequence and optical flow calculations as input. They
extract and resize the frames to (320 × 180) for each stroke in
order to use them as input data [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Instead, we resize the frames to
(120 × 80) to increase the processing speed.
      </p>
      <p>
        Sriraman et al. present another approach which extracts features
using Convolutional Neural Network (CNN) and applies them to
a spatio-temporal model [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. They use VGG16 network [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] as
the feature extractor and apply Long Short Term Memory (LSTM)
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] layer on the extracted features. They use 25 frames per move,
sampled at a varying rate. In our work, we use 21 frames, selected
based on centroids computed with the k-nearest neighbour method.
Also, we extract spatio-temporal features only and do not use a
CNN-based feature extractor.
      </p>
    </sec>
    <sec id="sec-3">
      <title>APPROACH</title>
      <p>To face the challenge of a high number of classes and low variance
between them, we designed a multi-stage approach. We divided the
initial 20 labels into 5 groups (see Figure 1). In stages 1 and 2, the
first and second parts of the final label are predicted. In stage 3, the
third part of the final label is predicted; however, the prediction is
done based on the stage 1 result. For example, if stage 1 predicts
‘Serve’, then in stage 3 the model which is trained to predict
one of ‘Topspin’, ‘Sidespin’, ‘Backspin’, ‘Loop’ is used. We used the
same input and model structure for each stage, meaning that we
trained the same model architecture for each label subset, 5 times in total.</p>
    </sec>
    <sec id="sec-4">
      <title>Data pre-processing</title>
      <p>
        The dataset contains videos at 120 fps with a resolution of (1920 ×
1080). Considering that a single stroke spans a minimum of 100 frames
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], processing the data without resizing causes memory- and
time-related issues. So, to speed up the process and reduce memory
restrictions, we resize each frame to (120 × 80).
      </p>
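      <p>As a rough illustration of this downscaling step, a naive stride-based subsampling from (1920 × 1080) down to roughly (120 × 80) could look like the sketch below. We use plain NumPy for brevity; in practice an interpolating resize (e.g. OpenCV's cv2.resize) would be the usual choice, so treat this as a shape-level sketch only.</p>

```python
import numpy as np

# A full-resolution RGB frame: height 1080, width 1920, 3 channels.
frame = np.zeros((1080, 1920, 3), dtype=np.uint8)

# Naive subsampling to (height 80, width 120) by striding:
# 1920 / 16 = 120 columns; vertically, crop to 1040 rows and take
# every 13th row, giving 80 rows.
small = frame[:1040:13, ::16, :]
print(small.shape)  # (80, 120, 3)
```

      <p>At (120 × 80 × 3) a frame holds 28,800 values instead of over 6 million, which is where the memory and speed gains come from.</p>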
      <p>
        Each move in the dataset spans a varying number of frames. This means
that we need to sample them at a fixed size, as our model requires
a fixed input size. We sample 21 frames per move. This number was
picked heuristically; however, we have verified that if the sample size
is too low, e.g. 7 frames per move, the accuracy decreases
significantly. This way we boost the processing performance and
provide a fixed input size to our model. Such approaches are
well known in video indexing [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        To decide which frames to sample, we use the centroids of a
k-nearest neighbour (KNN) method. This method reflects the data
distribution, as the centroids are calculated using nearest
neighbours [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. We use the RGB values of the resized images to compute
the KNN. We calculate 21 centroids and then sample the 21 frames
closest to the calculated centroids. Additionally,
we flatten each frame in order to fit it into our model.
      </p>
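      <p>A compact sketch of this centroid-based frame selection is shown below, using a few iterations of plain k-means over the flattened RGB frames and then picking, for each centroid, the closest frame. This is an illustration under our own choices (iteration count, random initialization), not the exact centroid procedure of [16].</p>

```python
import numpy as np

def pairwise_sq_dists(X, C):
    # Squared Euclidean distances between rows of X and rows of C.
    return ((X ** 2).sum(1)[:, None]
            - 2.0 * X @ C.T
            + (C ** 2).sum(1)[None, :])

def sample_frames_by_centroids(frames, k=21, iters=10, seed=0):
    """Pick the frames closest to k centroids of the flattened frames."""
    X = frames.reshape(len(frames), -1).astype(np.float64)
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign every frame to its nearest centroid, then update
        # each centroid as the mean of its assigned frames.
        assign = pairwise_sq_dists(X, C).argmin(axis=1)
        for j in range(k):
            members = X[assign == j]
            if len(members) > 0:
                C[j] = members.mean(axis=0)
    # For each centroid, keep the index of its single closest frame.
    return sorted(set(pairwise_sq_dists(X, C).argmin(axis=0)))

# 60 tiny random "frames" standing in for resized (120 x 80) RGB frames.
frames = np.random.default_rng(1).random((60, 8, 12, 3))
idx = sample_frames_by_centroids(frames, k=21)
print(len(idx))  # number of distinct sampled frame indices (at most 21)
```

      <p>Selecting the frame nearest to each centroid, rather than the centroids themselves, guarantees that only real frames from the move enter the model input.</p>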
    </sec>
    <sec id="sec-5">
      <title>Model</title>
      <p>
        Our model uses RGB images as input data without any prior
feature extraction. In our model, a batch normalization
layer is first used to normalize the input data. This step processes the
input batch by batch, subtracting the mean and dividing by the standard
deviation [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Then, two LSTM [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] layers are included in order to capture
spatio-temporal features. These layers are constructed with unit
numbers 128 and 32 respectively. Afterwards, a fully connected
layer with 64 units is included to model the relation between
features and the output. Each of these 3 layers is followed by a dropout
layer at the rate of 0.2 in order to prevent overfitting [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Finally,
an output layer with softmax activation is added to perform the
classification.
The fully connected layers are initialized using Glorot
initialization [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. We use categorical cross-entropy as the loss function and
RMSprop as the optimizer [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] with a learning rate of 0.0001. The
training is done with a batch size of 8 for 30 epochs. 10-fold cross-validation
is applied to prevent a biased data split.
      </p>
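      <p>The input normalization step described above can be written out directly. The sketch below applies batch-wise standardization (subtract the batch mean, divide by the batch standard deviation, per feature) to a toy batch in plain NumPy; it illustrates the operation of [5] at training time, not our actual training code.</p>

```python
import numpy as np

def batch_standardize(batch, eps=1e-5):
    """Standardize a batch per feature: subtract the batch mean and
    divide by the batch standard deviation (plus a small epsilon for
    numerical stability), as batch normalization does at training time."""
    mean = batch.mean(axis=0)
    std = batch.std(axis=0)
    return (batch - mean) / (std + eps)

# Toy batch: 8 samples, 21 time steps, 10 features per step.
batch = np.random.default_rng(0).normal(5.0, 2.0, size=(8, 21, 10))
normed = batch_standardize(batch)
print(abs(normed.mean()))  # close to 0 after standardization
```

      <p>After this step, every feature has approximately zero mean and unit variance across the batch, which stabilizes the scale of the inputs seen by the LSTM layers.</p>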
      <p>We use the same model architecture and hyperparameters to
train 5 different models. Each model has its own purpose, so each
is trained with a different subset of the training data for a different
set of labels (see Table ??).</p>
      <p>We split our data into train, validation and test splits in 0.6, 0.2,
0.2 proportions. The train and validation splits are used during model
training; the test split is only used to evaluate the trained model.</p>
    </sec>
    <sec id="sec-6">
      <title>RESULTS AND ANALYSIS</title>
      <p>Training results are shown in Table 1. We obtained 94.7% test
accuracy for stage 1, i.e. the classifier for the ‘Serve’, ‘Defensive’ and
‘Offensive’ labels. An accuracy of 98% was achieved for stage 2, i.e.
‘Forehand’ vs. ‘Backhand’ classification. However, for stage 3 we obtained
80.1% accuracy, which is much lower than the others. When combining the
predicted labels into the final label, we obtained 78.1% stroke type
prediction accuracy.</p>
      <p>These results can be explained by a couple of factors. Firstly, the
volume and distribution of the data affect the results. In stage 3, each
label has considerably less data compared to the labels in the other
stages. Also, especially among the stage 3 labels, the data
distribution is highly biased towards some classes, causing biased
learning. Additionally, due to the nature of the task, the stage 3 labels
have less variance between each other compared to the other stages.
Lastly, since stage 3 is conditioned on the outcome of stage 1, some
of the errors are caused by this outcome.</p>
      <p>[Table 1: per-stage accuracies for stages 1, 2, 3 and the final fused label.]</p>
      <p>Our method obtained 9.32% accuracy on the run processed by
MediaEval on a different test set. It was able to correctly predict the classes
of 33 samples out of 354. Our run results show that the method
achieved 50.85% for stage 1 prediction and 66.67% for stage 2
prediction. Although the accuracy for stage 3 is not published, it is
evident that the model had by far the lowest accuracy on stage 3
(see Table 2).</p>
    </sec>
    <sec id="sec-7">
      <title>DISCUSSION AND OUTLOOK</title>
      <p>We obtained promising results during training and validation,
which indicates that no overfitting occurred. However, as the
test results show, the model failed to learn properly, i.e. it was not
able to generalise. We expect this can be addressed
by having more labelled data for training.</p>
      <p>We also argue that the low variance between the classes and the
nature of the task cause the aforementioned challenge. Considering
that a single class can be performed in many ways by different players,
e.g. right- or left-handed and more or less experienced, we suggest that the
dataset can be improved to increase the coverage of the classes as
well as to reduce the bias among the classes.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Maylis</given-names>
            <surname>Delest</surname>
          </string-name>
          , Anthony Don, and
          <string-name>
            <given-names>Jenny</given-names>
            <surname>Benois-Pineau</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>DAG-based visual interfaces for navigation in indexed video content</article-title>
          .
          <source>Multimedia Tools and Applications</source>
          <volume>31</volume>
          (10
          <year>2006</year>
          ),
          <fpage>51</fpage>
          -
          <lpage>72</lpage>
          . https://doi.org/10. 1007/s11042-006-0032-4
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Mehrnaz</given-names>
            <surname>Fani</surname>
          </string-name>
          , Kanav Vats, Christopher Dulhanty,
          <string-name>
            <given-names>David A.</given-names>
            <surname>Clausi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and John S.</given-names>
            <surname>Zelek</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Pose-Projected Action Recognition Hourglass Network (PARHN) in Soccer</article-title>
          .
          <source>In 16th Conference on Computer and Robot Vision</source>
          , CRV 2019,
          <article-title>Kingston</article-title>
          ,
          <string-name>
            <surname>ON</surname>
          </string-name>
          , Canada, May
          <volume>29</volume>
          -31,
          <year>2019</year>
          .
          <fpage>201</fpage>
          -
          <lpage>208</lpage>
          . https://doi.org/10.1109/CRV.
          <year>2019</year>
          .00035
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Xavier</given-names>
            <surname>Glorot</surname>
          </string-name>
          and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Understanding the difficulty of training deep feedforward neural networks</article-title>
          .
          <source>In Proceedings of the thirteenth international conference on artificial intelligence and statistics</source>
          . 249-
          <fpage>256</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Sepp</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jürgen</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          .
          <year>1997</year>
          .
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural computation 9</source>
          ,
          <issue>8</issue>
          (
          <year>1997</year>
          ),
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Sergey</given-names>
            <surname>Ioffe</surname>
          </string-name>
          and
          <string-name>
            <given-names>Christian</given-names>
            <surname>Szegedy</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Batch normalization: Accelerating deep network training by reducing internal covariate shift</article-title>
          .
          <source>arXiv preprint arXiv:1502.03167</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Pierre-Etienne</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <surname>Jenny</surname>
          </string-name>
          Benois-Pineau, Boris Mansencal, Renaud Péteri, Laurent Mascarilla,
          <string-name>
            <surname>Jordan Calandre</surname>
            , and
            <given-names>Julien</given-names>
          </string-name>
          <string-name>
            <surname>Morlier</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Sports Video Classification: Classification of Strokes in Table Tennis for MediaEval 2020</article-title>
          .
          <source>In Proc. of the MediaEval 2020 Workshop</source>
          , Online,
          <fpage>14</fpage>
          -
          <lpage>15</lpage>
          December
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Pierre-Etienne</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <surname>Jenny</surname>
          </string-name>
          Benois-Pineau, Boris Mansencal, Renaud Péteri, Laurent Mascarilla,
          <string-name>
            <surname>Jordan Calandre</surname>
            , and
            <given-names>Julien</given-names>
          </string-name>
          <string-name>
            <surname>Morlier</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Sports Video Annotation: Detection of Strokes in Table Tennis task for MediaEval 2019</article-title>
          . In MediaEval 2019 Workshop.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Pierre-Etienne</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <surname>Jenny</surname>
            Benois-Pineau,
            <given-names>Renaud</given-names>
          </string-name>
          <string-name>
            <surname>Péteri</surname>
            , and
            <given-names>Julien</given-names>
          </string-name>
          <string-name>
            <surname>Morlier</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Fine grained sport action recognition with Twin spatiotemporal convolutional neural networks: Application to table tennis</article-title>
          .
          <source>Multimedia Tools and Applications</source>
          (
          <year>2020</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>19</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Dian</given-names>
            <surname>Shao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yue</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Bo</given-names>
            <surname>Dai</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Dahua</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>FineGym: A Hierarchical Video Dataset for Fine-Grained Action Understanding</article-title>
          . (
          <year>2020</year>
          ),
          <fpage>2613</fpage>
          -
          <lpage>2622</lpage>
          . https://doi.org/10.1109/CVPR42600.
          <year>2020</year>
          .00269
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Vishnu</surname>
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Krishnan Bhuvana J Siddharth Sriraman</surname>
            , Srinath Srinivasan and
            <given-names>T. T.</given-names>
          </string-name>
          <string-name>
            <surname>Mirnalinee</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>MediaEval 2019: LRCNs for Stroke Detection in Table Tennis</article-title>
          . (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Karen</given-names>
            <surname>Simonyan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          .
          <source>arXiv preprint arXiv:1409.1556</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Nitish</given-names>
            <surname>Srivastava</surname>
          </string-name>
          , Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and
          <string-name>
            <given-names>Ruslan</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Dropout: a simple way to prevent neural networks from overfitting</article-title>
          .
          <source>The journal of machine learning research 15</source>
          ,
          <issue>1</issue>
          (
          <year>2014</year>
          ),
          <fpage>1929</fpage>
          -
          <lpage>1958</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Martin</given-names>
            <surname>Tammvee</surname>
          </string-name>
          and
          <string-name>
            <given-names>Gholamreza</given-names>
            <surname>Anbarjafari</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Human activity recognition-based path planning for autonomous vehicles</article-title>
          .
          <source>Signal, Image and Video Processing</source>
          (
          <year>2020</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Tijmen</given-names>
            <surname>Tieleman</surname>
          </string-name>
          and
          <string-name>
            <given-names>Geoffrey</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <source>2012. Lecture 6</source>
          .5
          <article-title>-rmsprop: Divide the gradient by a running average of its recent magnitude</article-title>
          .
          <source>COURSERA: Neural networks for machine learning 4</source>
          ,
          <issue>2</issue>
          (
          <year>2012</year>
          ),
          <fpage>26</fpage>
          -
          <lpage>31</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Jun</surname>
            <given-names>Wan</given-names>
          </string-name>
          , Chi Lin, Longyin Wen,
          <string-name>
            <given-names>Yunan</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Qiguang</given-names>
            <surname>Miao</surname>
          </string-name>
          , Sergio Escalera, Gholamreza Anbarjafari, Isabelle Guyon,
          <source>Guodong Guo, and Stan Z Li</source>
          .
          <year>2020</year>
          . ChaLearn Looking at People: IsoGD and
          <string-name>
            <surname>ConGD LargeScale RGB-D Gesture</surname>
          </string-name>
          <article-title>Recognition</article-title>
          .
          <source>IEEE Transactions on Cybernetics</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Qingjiu</given-names>
            <surname>Zhang</surname>
          </string-name>
          and
          <string-name>
            <given-names>Shiliang</given-names>
            <surname>Sun</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>A centroid k-nearest neighbor method</article-title>
          .
          <source>In International Conference on Advanced Data Mining and Applications</source>
          . Springer,
          <fpage>278</fpage>
          -
          <lpage>285</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>