<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Leveraging Human Pose Estimation Model for Stroke Classification in Table Tennis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Soichiro Sato</string-name>
          <email>s-sato@kde.cs.tut.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Masaki Aono</string-name>
          <email>masaki.aono.ss@tut.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Toyohashi University of Technology</institution>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
<p>In this paper, we propose a stroke classification method for table tennis, submitted to the MediaEval 2020 Sports Video Classification task: Classification of Strokes in Table Tennis. The main focus of this paper is the exploitation of features extracted from a pose estimation model for stroke classification. Specifically, we first introduce an original method that incorporates PoseNet. Then, we construct a DNN model based on the proposed method. Subsequently, we evaluate our stroke classification on unannotated test data. Finally, we analyze the proposed method based on the classification results. The results demonstrate that the classification accuracy of the proposed method outperforms the baseline by 4.8%.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        In recent years, research on video action recognition using DNN
models has gained popularity. Given this popularity, it is natural
to apply video action recognition to a variety of sports-related
tasks, such as analyzing athletes' actions and creating educational
sports videos. Datasets commonly used in video action recognition
include UCF-101 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and Kinetics [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], whose classes are typically organized by type of sport or
human action. In contrast, the stroke classification task of
MediaEval 2020 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] requires classifying strokes within a single sport. It is
therefore a difficult task, since the classes are far more similar
to one another than in typical general-purpose datasets. RGB and
Optical Flow have often been used as inputs to DNN models for
video action recognition [
        <xref ref-type="bibr" rid="ref2 ref7 ref9">2, 7, 9</xref>
        ]. We speculate that features extracted from a DNN model that
estimates human pose from images and videos could also be used for
stroke classification. Based on these observations, we propose in
this paper a stroke classification method for table tennis built on
features extracted from a pose estimation model.
      </p>
    </sec>
    <sec id="sec-2">
      <title>APPROACH</title>
      <p>
        In this paper, we leverage PoseNet [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], one of the available pose
estimation models. Given an RGB image, PoseNet estimates the
coordinates of seventeen different skeletal keypoints, including a
person's wrists, elbows, shoulders, and knees. By applying PoseNet
frame by frame, we can create time series data of human skeletal
coordinates over a video. The estimated skeletal coordinates also
reveal the position of a person within the video frame, which can
be exploited to determine the crop position. In this paper, we
define Pose Time Series Data as time series data representing the
transition of human skeletal coordinates over a video.
      </p>
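      <p>To make the shape of this data concrete, the following is a minimal
sketch of how per-frame PoseNet outputs can be stacked into the Pose Time
Series Data array; the estimate_pose function is a hypothetical stand-in for
a PoseNet inference call, whose exact API depends on the PoseNet port in
use.</p>
      <preformat>
import numpy as np

def estimate_pose(frame):
    # Hypothetical stand-in for a PoseNet inference call: returns a
    # (17, 2) array of (x, y) coordinates, one row per skeletal keypoint.
    raise NotImplementedError

def extract_pose_sequence(frames):
    # Stack the per-frame keypoints of T frames into a (T, 17, 2) array,
    # i.e. the raw form of Pose Time Series Data before preprocessing.
    return np.stack([estimate_pose(f) for f in frames], axis=0)
      </preformat>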
    </sec>
    <sec id="sec-3">
      <title>Steps to create Pose Time Series Data</title>
      <p>First, we feed T video frames into the pre-trained PoseNet and
extract features of shape (T, 17, 2), representing the estimated
(x, y) coordinates of the seventeen skeletal keypoints. Next, we
pre-process the extracted features before feeding them into the DNN
model. The preprocessing consists of four steps: transformation
from absolute to relative coordinates, computation of a moving
average, normalization, and zero padding. The transformation from
absolute to relative coordinates is anchored at the estimated
skeletal coordinates in the first frame of the video; if the player
is not visible in the first frame, it is anchored at the center
coordinates of the first frame instead. The Pose Time Series Data
created by this procedure is used as input to the DNN model
described in section 2.3.</p>
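      <p>As an illustration, the following is a minimal sketch of the four
preprocessing steps, assuming the conventions above; the moving-average
window, normalization scheme, and padded length are illustrative choices,
since the paper does not state them.</p>
      <preformat>
import numpy as np

def preprocess_pose(seq, ref=None, window=3, target_len=128):
    # seq: raw (T, 17, 2) PoseNet coordinates.
    # window and target_len are illustrative placeholders.
    seq = seq.astype(np.float32)
    # 1) Absolute to relative coordinates, anchored at the first frame's
    #    skeleton (or at a caller-supplied reference such as the frame
    #    center when the player is not visible in the first frame).
    if ref is None:
        ref = seq[0]
    rel = seq - ref
    # 2) Moving average over time to smooth estimation jitter.
    kernel = np.ones(window, dtype=np.float32) / window
    flat = rel.reshape(rel.shape[0], -1)
    smoothed = np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="same"), 0, flat)
    smoothed = smoothed.reshape(rel.shape)
    # 3) Normalization (illustrative: scale by the largest magnitude).
    scale = max(np.abs(smoothed).max(), 1e-8)
    normed = smoothed / scale
    # 4) Zero padding along time so every clip has the same length.
    t = normed.shape[0]
    if t >= target_len:
        return normed[:target_len]
    pad = np.zeros((target_len - t, 17, 2), dtype=np.float32)
    return np.concatenate([normed, pad], axis=0)
      </preformat>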
    </sec>
    <sec id="sec-4">
      <title>Crop based on skeletal coordinates</title>
      <p>When a video frame is fed into the DNN model, it is pre-processed
by cropping it to a size of 120 × 120. For video action recognition
with a DNN model, common cropping methods such as Center Crop and
Random Crop could be employed. However, with these methods the
person actually performing the table tennis action may fall outside
the cropped area, increasing the risk of misclassifying the stroke.
Therefore, we take advantage of PoseNet's ability to estimate
seventeen different skeletal keypoints and compute crops based on
the estimated coordinates. Specifically, after feeding a video
frame into PoseNet and obtaining the seventeen skeletal
coordinates, we calculate the average of their x-coordinates and
y-coordinates. We then crop the frame to 120 × 120, using the
calculated coordinates as the center position of the crop. The
video frame cropped by this procedure is used as input to the DNN
model described in section 2.3.</p>
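      <p>A minimal sketch of this keypoint-centered crop follows; clamping
the window at the frame borders is our assumption, since the paper does not
describe how edge cases are handled.</p>
      <preformat>
import numpy as np

def pose_centered_crop(frame, keypoints, size=120):
    # frame: (H, W, 3) image; keypoints: (17, 2) estimated (x, y)
    # coordinates from PoseNet.
    h, w = frame.shape[:2]
    # Average the seventeen x- and y-coordinates to find the crop center.
    cx, cy = keypoints.mean(axis=0)
    x0 = int(round(cx)) - size // 2
    y0 = int(round(cy)) - size // 2
    # Keep the window inside the frame (assumed edge handling).
    x0 = min(max(x0, 0), w - size)
    y0 = min(max(y0, 0), h - size)
    return frame[y0:y0 + size, x0:x0 + size]
      </preformat>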
    </sec>
    <sec id="sec-5">
      <title>Model</title>
      <p>
        For this task, we have implemented five different DNN models,
since up to five runs may be submitted. First, we reproduced
SSTCNN [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which serves as the baseline model for performance
comparison with the proposed method. Next, we developed a DNN model
that extends SSTCNN with a 1D convolution branch taking Pose Time
Series Data as input. The inputs to this model are three types of
data: RGB, Optical Flow, and Pose Time Series Data. The RGB input
is cropped according to the method described in section 2.2. The
Optical Flow input combines the optical flow between two temporally
consecutive video frames, computed by DeepFlow [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], with the background subtraction proposed by Zivkovic et
al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. This allows us to retain only the locations where change is
presumed to have occurred between the two consecutive frames. In
addition, we enhanced the DNN model with a Depthwise Separable
Convolution [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] in the convolution layers for Pose Time Series Data, omitting
Optical Flow from the input. This variant aims to reduce the number
of parameters in the model, since the parameter count grows with
the number of input data types. The differences between the five
DNN models for our submitted runs are denoted by ①, ②, and ③,
as delineated in Table 2: ① include Optical Flow in the input;
② include Pose Time Series Data in the input; ③ introduce
Depthwise Separable Convolution. RGB is used as input for all
DNN models.
      </p>
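      <p>As a sketch of the pose branch described above, assuming PyTorch and
illustrative layer sizes (the fusion with the SSTCNN streams is not
reproduced here):</p>
      <preformat>
import torch.nn as nn

class PoseBranch(nn.Module):
    # Illustrative 1D-convolution branch over Pose Time Series Data of
    # shape (batch, 34, T): 17 keypoints x 2 coordinates as channels.
    # Channel widths and kernel size are assumptions, not the paper's.
    def __init__(self, in_ch=34, mid_ch=64, separable=False):
        super().__init__()
        if separable:
            # Depthwise separable 1D convolution: per-channel temporal
            # filtering (depthwise) followed by a 1x1 channel mix.
            conv = nn.Sequential(
                nn.Conv1d(in_ch, in_ch, kernel_size=5, padding=2,
                          groups=in_ch),
                nn.Conv1d(in_ch, mid_ch, kernel_size=1))
        else:
            conv = nn.Conv1d(in_ch, mid_ch, kernel_size=5, padding=2)
        self.net = nn.Sequential(conv, nn.ReLU(), nn.AdaptiveAvgPool1d(1))

    def forward(self, x):
        # Returns a (batch, mid_ch) pose feature for fusion.
        return self.net(x).squeeze(-1)
      </preformat>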
    </sec>
    <sec id="sec-6">
      <title>Training and Submission Runs</title>
      <p>
        The models are trained with the hyperparameters shown in Table 1
for the five different runs of our DNN models. The dataset [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
consists of short video clips of table tennis stroke practice. The
training dataset includes 755 actions and the test dataset 354
actions. During training, we did not set aside validation data;
instead, we used all 755 training samples solely for training the
model. After training, we fed the test data into the trained model
and performed stroke classification.
      </p>
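      <p>A minimal training sketch under these conditions follows; the
optimizer, learning rate, and epoch count are placeholders standing in for
the hyperparameters of Table 1, which is not reproduced here, and the
three-stream model signature follows the inputs described in section 2.3.</p>
      <preformat>
import torch
import torch.nn as nn

def train(model, loader, epochs=100, lr=1e-3, device="cuda"):
    # All 755 training clips are used for training, with no validation
    # split. Hyperparameter values here are placeholders (see Table 1).
    model.to(device).train()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()  # 20 stroke classes
    for _ in range(epochs):
        for rgb, flow, pose, label in loader:
            opt.zero_grad()
            logits = model(rgb.to(device), flow.to(device),
                           pose.to(device))
            loss = loss_fn(logits, label.to(device))
            loss.backward()
            opt.step()
      </preformat>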
    </sec>
    <sec id="sec-7">
      <title>RESULTS AND ANALYSIS</title>
      <p>The classification results of the submitted runs are shown in
Table 2. The left column lists the names of the submitted runs, the
middle column indicates the differences between the corresponding
models by means of a checklist, and the right column reports the
classification results. In addition to the results over the 20
table tennis stroke classes, the results of a coarser stroke
classification are shown in the columns 'Hand', 'Serve', and
'H&amp;S' (Hand and Serve). Table 2 demonstrates that Run 5 achieved
the highest Global Accuracy, while Run 2 achieved the highest
accuracy for the coarse stroke classification. The confusion matrix
of Run 5 is shown in Figure 1. It shows that several classes in the
test data, such as 'Offensive Forehand Flip', could be categorized
accurately. On the other hand, the fine-grained stroke
classification is not entirely accurate. In particular, fine
details of table tennis strokes, such as the difference in spin
imparted on the ball when a player performs a stroke, are often
misclassified. It is also possible that the lack of data
augmentation when training the model led to inaccurate stroke
classification due to overfitting.</p>
    </sec>
    <sec id="sec-8">
      <title>DISCUSSION AND OUTLOOK</title>
      <p>In this paper, we proposed a stroke classification method based
on PoseNet. We implemented five different DNN models, trained them,
and classified table tennis strokes on the test data. The results
showed that the classification accuracy of the proposed method was
up to 4.8% higher than the baseline. However, the strokes could not
be classified fully accurately, and we found that there is still
room for improvement. In the future, we would like to verify the
effect of data augmentation and to explore methods for improving
the accuracy of table tennis stroke classification from various
perspectives, such as data preprocessing methods and DNN model
architecture.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] João Carreira and Andrew Zisserman. 2017. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, July 21-26, 2017. IEEE Computer Society, 4724-4733. https://doi.org/10.1109/CVPR.2017.502
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. 2016. Convolutional Two-Stream Network Fusion for Video Action Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, June 27-30, 2016. IEEE Computer Society, 1933-1941. https://doi.org/10.1109/CVPR.2016.213
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] Pierre-Etienne Martin, Jenny Benois-Pineau, Boris Mansencal, Renaud Péteri, Laurent Mascarilla, Jordan Calandre, and Julien Morlier. 2020. Sports Video Classification: Classification of Strokes in Table Tennis for MediaEval 2020. In Proc. of the MediaEval 2020 Workshop, Online, 14-15 December 2020.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] Pierre-Etienne Martin, Jenny Benois-Pineau, Renaud Péteri, and Julien Morlier. 2020. Fine grained sport action recognition with Twin spatio-temporal convolutional neural networks. Multim. Tools Appl. 79, 27-28 (2020), 20429-20447. https://doi.org/10.1007/s11042-020-08917-3
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] Dan Oved, Irene Alvarado, and Alexis Gallo. 2018. Real-time Human Pose Estimation in the Browser with TensorFlow.js. (2018). https://blog.tensorflow.org/2018/05/real-time-human-pose-estimation-in.html
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] Laurent Sifre and Stéphane Mallat. 2014. Rigid-motion scattering for image classification. Ph.D. thesis (2014).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] Karen Simonyan and Andrew Zisserman. 2014. Two-Stream Convolutional Networks for Action Recognition in Videos. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13, 2014, Montreal, Quebec, Canada, Zoubin Ghahramani, Max Welling, Corinna Cortes, Neil D. Lawrence, and Kilian Q. Weinberger (Eds.). 568-576.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild. CoRR abs/1212.0402 (2012). arXiv:1212.0402 http://arxiv.org/abs/1212.0402
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. 2016. Temporal Segment Networks: Towards Good Practices for Deep Action Recognition. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VIII (Lecture Notes in Computer Science), Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling (Eds.), Vol. 9912. Springer, 20-36. https://doi.org/10.1007/978-3-319-46484-8_2
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10] Philippe Weinzaepfel, Jérôme Revaud, Zaïd Harchaoui, and Cordelia Schmid. 2013. DeepFlow: Large Displacement Optical Flow with Deep Matching. In IEEE International Conference on Computer Vision, ICCV 2013, Sydney, Australia, December 1-8, 2013. IEEE Computer Society, 1385-1392. https://doi.org/10.1109/ICCV.2013.175
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11] Zoran Zivkovic and Ferdinand van der Heijden. 2006. Efficient adaptive density estimation per image pixel for the task of background subtraction. Pattern Recognit. Lett. 27, 7 (2006), 773-780. https://doi.org/10.1016/j.patrec.2005.11.005
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>