<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Bergen, Norway and Online</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Fine-Grained Action Detection with RGB and Pose Information using Two Stream Convolutional Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Leonard Hacker</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Finn Bartels</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pierre-Etienne Martin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CCP Department, Max Planck Institute for Evolutionary Anthropology</institution>
          ,
          <addr-line>D-04103 Leipzig</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Computer Science Institute, University of Leipzig</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <abstract>
        <p>As participants in the MediaEval 2022 Sport Task, we propose a two-stream network approach for the classification and detection of table tennis strokes. Each stream is a succession of 3D Convolutional Neural Network (CNN) blocks using attention mechanisms, and each stream processes different 4D inputs. Our method utilizes raw RGB data and pose information computed with the MMPose toolbox. The pose information is treated as an image by drawing the pose either on a black background or on the original RGB frame it was computed from. The best performance is obtained by feeding raw RGB data to one stream and Pose + RGB (PRGB) information to the other, and applying late fusion on the features. The approaches were evaluated on the provided TTStroke-21 data set. We report an improvement in stroke classification, reaching an accuracy of 87.3%, while the detection does not outperform the baseline but still reaches an IoU of 0.349 and an mAP of 0.110.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        While there have been great advances in the detection of coarse-grained actions in videos (e.g. the
type of sport being performed), fine-grained action detection is inherently more difficult due to
its low inter-class variability [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. The goal of this benchmark task is to provide viable tools to
enable analyzing athletes’ performance [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Table tennis entails many interesting challenges
for fine-grained video detection, e.g. ball trajectory prediction [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] or real-time score and game
analysis [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        In the field of image recognition, a CNN consisting of convolutional layers, ReLU layers
and max-pooling layers is conventional practice [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The provided baseline model in the
competition originates from this [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ]. While video data includes an additional dimension (time
or frame number), several approaches adapt a plain CNN to determine movement or changes in
a range of frames [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. A popular approach introduced first by Simonyan and Zisserman [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] is the
Two-Stream Neural Network. They implemented an architecture consisting of a spatial and a
temporal CNN. The spatial stream processes a single frame at a time, while the temporal
stream operates on multi-frame Optical Flow (OF), capturing the motion across multiple
RGB frames. Hence, the model benefits from the complementary information provided by
the second stream; for instance, areas of the video containing no movement can easily be
discarded to reduce noise [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. A successful implementation built upon
this is the Inflated 3D Convolutional Neural Network (I3D) model [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], where multiple images
are pushed into a 3D CNN instead of a single image at a time. Feichtenhofer et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] proposed
<!-- Figure 1 panels: (a) RGB, (b) Pose, (c) PRGB -->
networks called SlowFast, based on a two-stream architecture. The first stream processes the
frames at a low frame rate and the second at a high frame rate, so that both semantic and
motion information are captured.
      </p>
      <p>
        The fusion step is an essential part of multi-stream networks. Approaches differ in
where the fusion is performed. Late fusion is the simplest
and one of the most efficient options [
        <xref ref-type="bibr" rid="ref1 ref11">11, 1</xref>
        ]. Fusion can be performed after a fully
connected ReLU layer before a final Softmax function [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ]. Early fusion is usually performed
at an earlier stage of the network, meaning that the information of the second stream is pushed
into the first stream [
        <xref ref-type="bibr" rid="ref10 ref14">10, 14</xref>
        ]. The following sections describe the two-stream architecture,
the results for stroke classification and stroke detection, and a discussion of contributing factors
and approaches.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Method</title>
      <p>
        As mentioned, the two-stream architecture is one of the state-of-the-art methods. For this
contribution, the baseline provided by MediaEval is extended into a two-stream network utilising
raw RGB images and pose information. The baseline itself is a single stream 3D CNN with
an attention mechanism. Recent papers suggest that the utilization of pose information can
achieve better results for fine-grained action detection than OF [
        <xref ref-type="bibr" rid="ref15 ref16 ref17">15, 16, 17</xref>
        ]. When considering
actions performed by people, the pose holds significant information. We superimpose
this information by drawing the pose on top of RGB images and feeding the result into a
second stream. The implementation of our method is available online (https://github.com/fidsinn/SportTaskME22).
      </p>
      <sec id="sec-2-1">
        <title>2.1. Pose Estimation</title>
        <p>
          The pose information is the traced human pose in each frame as depicted in Figure 1. Since OF
is already well researched and the use of pose information yields promising results, this work
focuses on adding pose information to the two-stream CNN framework. The pose information
is extracted using the MMPose [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] toolbox from OpenMMLab: an open-source package for
detailed video understanding. Each frame of the input video is analyzed. First, a person detector
is used to draw bounding boxes around every person in the frame, then a pose estimator is
deployed to extract the poses from the bounding boxes. For person detection we utilized
Faster R-CNN [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] and for pose estimation deep high-resolution representation
learning [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. A top-down approach is utilized for keypoint extraction, as it performs better
than bottom-up approaches [
          <xref ref-type="bibr" rid="ref15 ref21">15, 21</xref>
          ]. Both models were pre-trained on the COCO data set [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]
by OpenMMLab. The individual keypoints are then connected by differently colored segments
representing body parts, superimposed over an image. Since different body parts contribute
to the strokes in different ways, we assume that the model can distinguish them based on the
coloring of each body part. In this work, two methods utilizing pose information are investigated:
Pose and PRGB. The Pose variant contains the computed pose over a black background, while
the PRGB variant uses the original RGB frame as a background. By comparing Pose, PRGB and RGB
performance, it is possible to determine how well the network utilizes the pose information.
        </p>
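<p>The two renderings can be sketched as follows; the keypoints, segments and colors below are illustrative assumptions, not MMPose output:</p>

```python
import numpy as np

# Hypothetical 2D keypoints (x, y); in the paper these come from MMPose
# (Faster R-CNN person detector followed by an HRNet pose estimator).
KEYPOINTS = {"shoulder": (20, 10), "elbow": (35, 25), "wrist": (50, 15)}
# Body-part segments, each in a distinct color so the network can tell
# body parts apart (mirroring the paper's color coding; colors assumed).
SEGMENTS = [("shoulder", "elbow", (255, 0, 0)), ("elbow", "wrist", (0, 255, 0))]

def draw_pose(background: np.ndarray) -> np.ndarray:
    """Rasterize the pose segments onto a copy of `background` (H x W x 3)."""
    img = background.copy()
    for a, b, color in SEGMENTS:
        (x0, y0), (x1, y1) = KEYPOINTS[a], KEYPOINTS[b]
        n = max(abs(x1 - x0), abs(y1 - y0)) + 1       # pixels along the segment
        xs = np.linspace(x0, x1, n).round().astype(int)
        ys = np.linspace(y0, y1, n).round().astype(int)
        img[ys, xs] = color                            # naive line rasterization
    return img

frame = np.full((64, 64, 3), 127, dtype=np.uint8)  # stand-in RGB frame
pose = draw_pose(np.zeros_like(frame))             # "Pose": black background
prgb = draw_pose(frame)                            # "PRGB": original frame behind
```

<p>The same drawing routine serves both variants; only the background changes.</p>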
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Architecture</title>
        <p>
          Our model is a variant of the two-stream architecture [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] and an extension of the provided
baseline [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], which also uses 3D convolutional blocks and an attention mechanism. As depicted
in Figure 2, the Two Stream Pose Convolutional Neural Network (TSPCNN) consists of two
identical streams, each with five convolutional layers and pooling layers with an increasing
number of filters, leading to a linear layer with ReLU activation. The latter feeds a second
linear layer followed by a Softmax function that converts the output into a 21-dimensional
probability vector for classification (21 different stroke types, including a non-stroke class) or a
two-dimensional vector for detection (stroke and non-stroke classes). The output of the two
branches is then summed and processed by the final Softmax function to obtain a probabilistic
output for classification and detection. The first Softmax function normalizes the output of each
individual stream before fusion to minimize vanishing gradients.
        </p>
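<p>The head of the architecture can be sketched as follows; the stream outputs are random stand-ins for the features the convolutional stacks would produce:</p>

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

N_CLASSES = 21  # stroke types including the non-stroke class (classification)

rng = np.random.default_rng(0)
# Stand-ins for the outputs of each stream's second linear layer.
rgb_out = rng.normal(size=N_CLASSES)
pose_out = rng.normal(size=N_CLASSES)

# Each stream is first normalized by its own Softmax (motivated in the paper
# as limiting vanishing gradients), then the branches are summed and passed
# through a final Softmax to obtain the probabilistic output.
fused = softmax(softmax(rgb_out) + softmax(pose_out))
```

<p>For detection, the same head would use a 2-dimensional output instead of 21.</p>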
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Fusion</title>
        <p>
          Literature suggests that employing early fusion combined with late fusion boosts the
performance of a two-stream model considerably [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. Adding multiple fusion methods to the
TSPCNN showed limited gain in performance. The best performance was achieved using a late
fusion approach, i.e. fusing before the last layer. The different fusion styles tried were i) weighted fusion,
where the resulting feature is the weighted sum of the two fused features, ii) summed
fusion, where the resulting feature is the sum of both features, and iii) concatenated fusion,
where the resulting feature is the concatenation of both features. Summed fusion performed best,
and therefore only its results are reported in the remainder of this paper.
        </p>
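<p>The three fusion styles amount to simple tensor operations; a minimal sketch with illustrative feature values (the 0.6/0.4 weights are assumptions):</p>

```python
import numpy as np

f1 = np.array([0.2, 0.5, 0.3])  # feature from stream 1 (illustrative values)
f2 = np.array([0.1, 0.7, 0.2])  # feature from stream 2

# i) weighted fusion: weighted sum of the two features
weighted = 0.6 * f1 + 0.4 * f2
# ii) summed fusion: element-wise sum (the variant that performed best here)
summed = f1 + f2
# iii) concatenated fusion: stacks the features, doubling the dimension
concatenated = np.concatenate([f1, f2])
```

<p>Note that concatenated fusion changes the feature dimension, so the following layer must be sized accordingly.</p>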
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Training</title>
        <p>All experiments were performed on Tesla V100 GPUs provided by the University of Leipzig.
The training took 7 to 8 minutes per epoch for the detection task and 1 to 2 minutes for the
classification task over 2000 epochs, with a learning rate of 0.0001 and a momentum of 0.5.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results and Discussion</title>
      <p>
        We evaluated our approach using the TTStroke-21 data set [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] provided by the Sport task
organizers of MediaEval [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The results are compared with the provided baseline [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. For this comparison, we evaluated our runs using accuracy for the classification
task and IoU and mAP for the detection task. In addition, we report the training
accuracy and validation accuracy for each model in Table 1. The test results were
selected depending on which of several decision methods produced the best results [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The
baseline and the other single-stream approaches already achieve quite promising results for
the classification task. Basic RGB combined with PRGB in a two-stream approach shows the
best accuracy in testing. The two-stream approaches slightly improve the classification results
compared to the single-stream method by up to 0.009. In contrast to the improvement regarding
the classification task, for detection the two-stream methods using the pose stream lead to
lower quality than the single-stream baseline. The poor detection performance
is likely due to the missing ball and racket information in the pose data. An arm movement
without the racket may also look similar to a stroke, which can confuse the model. Therefore, we
can say that pose information did not improve stroke detection performance.
      </p>
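<p>The IoU metric used for detection compares predicted and ground-truth temporal segments; a minimal sketch, assuming segments are given as (start, end) frame pairs:</p>

```python
def temporal_iou(pred, gt):
    """Intersection over Union of two temporal segments (start, end)."""
    inter = max(0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# A prediction half-overlapping the ground-truth segment:
score = temporal_iou((0, 10), (5, 15))  # 5 frames shared out of 15 covered
```

<p>mAP additionally averages precision over recall levels and classes, with IoU thresholds deciding whether a detection counts as correct.</p>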
      <p>We have shown that the pose information can be suited for fine-grained action classification
but seems to fail to capture discriminative features for detection. While the TSPCNN
outperformed the baseline in the classification task, we could not improve detection performance. A
contributing factor to the only slight classification improvement might be the limited training data,
especially since some classes have only a few labelled videos in the training set.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Summary and Outlook</title>
      <p>
        As wearable systems can be intrusive and cumbersome to set up while also not being widely
available, fine-grained action detection from video is of high interest for athletes and coaches to
be able to classify different actions in their game and to improve training efficiency [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. We
have built a TSPCNN on state-of-the-art research. Our main contribution is to use RGB data
overlaid with human poses in a two-stream network. Our approach slightly outperforms the
baseline in terms of classification accuracy but produces poor performances for stroke detection.
To improve the TSPCNN further, more experiments with different qualities of pose data
are needed. Moreover, different representations of pose data can be evaluated, such as thicker
lines to emphasise poses. The approach should also be validated on different data sets, such
as the FineGym data set [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] since it also has low variability between classes but a more even
class distribution in the training data.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zolfaghari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tighe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Manmatha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>A comprehensive study of deep video action recognition</article-title>
          , arXiv preprint arXiv:2012.06567 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Calandre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mansencal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Benois-Pineau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Péteri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Mascarilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Morlier</surname>
          </string-name>
          ,
          <article-title>Sport task: Fine grained action detection and classification of table tennis strokes from videos for mediaeval 2022</article-title>
          , in: MediaEval, CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>Ball tracking and trajectory prediction for table-tennis robots</article-title>
          ,
          <source>Sensors</source>
          <volume>20</volume>
          (
          <year>2020</year>
          )
          <fpage>333</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Voeikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Falaleev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Baikulov</surname>
          </string-name>
          ,
          <article-title>TTNet: Real-time temporal and spatial video analysis of table tennis</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF CVPR Workshops</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>884</fpage>
          -
          <lpage>885</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Albawi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. A.</given-names>
            <surname>Mohammed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Al-Zawi</surname>
          </string-name>
          ,
          <article-title>Understanding of a convolutional neural network</article-title>
          ,
          <source>in: 2017 International Conference on Engineering and Technology (ICET)</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <article-title>Spatio-temporal cnn baseline method for the sports video task of mediaeval 2021 benchmark</article-title>
          , in: MediaEval, CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <article-title>Baseline method for the sport task of mediaeval 2022 benchmark with 3d cnn using attention mechanism</article-title>
          , in: MediaEval, CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <article-title>Two-stream convolutional networks for action recognition in videos</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>27</volume>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Carreira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <article-title>Quo vadis, action recognition? a new model and the kinetics dataset</article-title>
          ,
          <source>in: IEEE CVPR</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>6299</fpage>
          -
          <lpage>6308</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>C.</given-names>
            <surname>Feichtenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Malik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <article-title>Slowfast networks for video recognition</article-title>
          ,
          <source>in: IEEE/CVF International Conference on Computer Vision</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>6202</fpage>
          -
          <lpage>6211</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>P.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Benois-Pineau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mansencal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Péteri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Morlier</surname>
          </string-name>
          ,
          <article-title>Classification of strokes in table tennis with a three stream spatio-temporal cnn for mediaeval 2020</article-title>
          ,
          <source>in: Proc. of the MediaEval 2020 Workshop</source>
          , Online, 14-15 December 2020,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>P.-E.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <article-title>Fine-Grained Action Detection and Classification from Videos with Spatio-Temporal Convolutional Neural Networks. Application to Table Tennis</article-title>
          , Theses, Université de Bordeaux ; Université de La Rochelle,
          <year>2020</year>
          . URL: https://hal.archives-ouvertes.fr/tel-03099907.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Benois-Pineau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Péteri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Morlier</surname>
          </string-name>
          ,
          <article-title>Sport action recognition with siamese spatiotemporal cnns: Application to table tennis</article-title>
          , in: CBMI, IEEE,
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>C.</given-names>
            <surname>Feichtenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pinz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <article-title>Convolutional two-stream network fusion for video action recognition</article-title>
          ,
          <source>in: IEEE CVPR</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>1933</fpage>
          -
          <lpage>1941</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>H.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Shao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <article-title>Revisiting skeleton-based action recognition</article-title>
          ,
          <source>arXiv preprint arXiv:2104.13586</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>P.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Benois-Pineau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Péteri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Morlier</surname>
          </string-name>
          ,
          <article-title>Optimal choice of motion estimation methods for fine-grained action classification with 3d convolutional networks</article-title>
          ,
          <source>in: ICIP</source>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>554</fpage>
          -
          <lpage>558</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S.</given-names>
            <surname>Sato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Aono</surname>
          </string-name>
          ,
          <article-title>Leveraging human pose estimation model for stroke classification in table tennis</article-title>
          ,
          <source>in: MediaEval</source>
          , volume
          <volume>2882</volume>
          <source>of CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>M.</given-names>
            <surname>Contributors</surname>
          </string-name>
          ,
          <article-title>Openmmlab pose estimation toolbox and benchmark</article-title>
          , https://github.com/open-mmlab/mmpose,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          Faster R-CNN:
          <article-title>Towards real-time object detection with region proposal networks</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>28</volume>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>K.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Deep high-resolution representation learning for human pose estimation</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>5693</fpage>
          -
          <lpage>5703</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Maire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Belongie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hays</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Perona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ramanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dollár</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Zitnick</surname>
          </string-name>
          ,
          <article-title>Microsoft COCO: Common objects in context</article-title>
          , in: ECCV, Springer,
          <year>2014</year>
          , pp.
          <fpage>740</fpage>
          -
          <lpage>755</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>P.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Benois-Pineau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Péteri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Morlier</surname>
          </string-name>
          ,
          <article-title>3D attention mechanisms in twin spatio-temporal convolutional neural networks. Application to action classification in videos of table tennis games</article-title>
          , in: ICPR, IEEE Computer Society,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>P.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Benois-Pineau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Péteri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Morlier</surname>
          </string-name>
          ,
          <article-title>Fine grained sport action recognition with twin spatio-temporal convolutional neural networks</article-title>
          ,
          <source>Multim. Tools Appl.</source>
          <volume>79</volume>
          (
          <year>2020</year>
          )
          <fpage>20429</fpage>
          -
          <lpage>20447</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>D.</given-names>
            <surname>Shao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>FineGym: A hierarchical video dataset for fine-grained action understanding</article-title>
          , in: CVPR, IEEE,
          <year>2020</year>
          , pp.
          <fpage>2613</fpage>
          -
          <lpage>2622</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>