<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>MediaEval 2022 Workshop, Bergen, Norway and Online</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Baseline Method for the Sport Task of MediaEval 2022 with 3D CNNs using Attention Mechanisms</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pierre-Etienne Martin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CCP Department, Max Planck Institute for Evolutionary Anthropology</institution>
          ,
          <addr-line>D-04103 Leipzig</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <abstract>
        <p>This paper presents the baseline method proposed for the Sports Video task, part of the MediaEval 2022 benchmark. The task comprises two subtasks: stroke classification from trimmed videos, and stroke detection from untrimmed videos. This baseline addresses both subtasks. We propose two types of 3D-CNN architectures to solve the two subtasks. Both 3D-CNNs use spatio-temporal convolutions and attention mechanisms. The architectures and the training process are tailored to the addressed subtask. The baseline method is shared publicly online to help the participants in their investigation and to alleviate some aspects of the task such as video processing, training method, evaluation and submission routine. The baseline method reaches 86.4% accuracy with our v2 model on the classification subtask. For the detection subtask, the baseline reaches a mAP of 0.131 and an IoU of 0.515 with our v1 model.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Action classification from videos is a popular topic in the computer vision field [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4">1, 2, 3, 4</xref>
        ].
In order to solve such a task, 2D CNNs were first introduced [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]. Then, to better capture
the temporal information from videos, 3D convolution methods emerged [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. Optical flow
computed from the RGB stream was also investigated in order to boost performance and
translate RGB changes into movement information [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ]. Recently, multi-modal methods
have been re-investigated, this time combining the RGB and audio streams [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] leading to
state-of-the-art results on common benchmark datasets such as Kinetics-600 [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Multi-view
methods combined with Transformers [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] are also the current state-of-the-art on many action
classification datasets [
        <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
        ].
      </p>
      <p>
        In the Sport Task of MediaEval 2022, the focus is on the classification and detection of
table tennis strokes from videos. As described in [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], the task focuses on actions with low visual
inter-class variability: classifying them from trimmed videos (subtask 1) and detecting them from
untrimmed videos (subtask 2). The task is based on the TTStroke-21 dataset [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] and is similar to
other datasets with low inter-class variability [
        <xref ref-type="bibr" rid="ref14 ref18 ref19 ref20">18, 19, 14, 20</xref>
        ].
      </p>
      <p>This baseline, publicly available on GitHub1, tackles the two subtasks and aims to help
participants with their submission, covering aspects such as the processing of the videos, the annotation files and
the deep learning methods.</p>
    </sec>
    <sec id="sec-method">
      <title>2. Method</title>
      <p>[Figure 1: V1 and V2 architectures. The RGB video stream (width 320, 96 frames) is processed by 3D convolution (e.g., 3x3x3), pooling (e.g., 2x2x2), attention and ReLU blocks, followed by fully connected layers and a SoftMax giving a probabilistic output over 21 classes for classification or 2 classes for detection.]</p>
      <p>
        The method has been kept simple and uses only the RGB information from the provided videos.
The implementation is inspired by [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. The main divergence is the absence of the Region Of
Interest (ROI), which was computed from optical flow values. The data processing is straightforward:
the RGB frames are resized to a width of 320 and stacked together to form tensors of length
96, either from the trimmed videos or following the annotation boundaries available in the
XML files. Data are augmented to increase variability: starting at different time points and applying spatial
transformations (flip and rotation).
      </p>
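      <p>As an illustration of this preprocessing, a minimal sketch follows, assuming PyTorch and torchvision; the tensor layout, helper names, temporal-offset range and rotation range are illustrative assumptions, not the released baseline code.</p>
      <preformat><![CDATA[
# Illustrative sketch of the described preprocessing: resize RGB frames to a
# width of 320 px, stack 96 consecutive frames, and augment with a random
# temporal offset, a horizontal flip and a small rotation.
import random
import torch
import torchvision.transforms.functional as TF


def build_clip(frames, start, length=96, width=320, train=True):
    """frames: list of [3, H, W] float tensors for one video; start: first frame index."""
    if train:                                          # random temporal offset (assumed range)
        start = max(0, start + random.randint(-10, 10))
    clip = list(frames[start:start + length])
    clip += [clip[-1]] * (length - len(clip))          # pad clips that are too short

    h, w = clip[0].shape[1:]
    new_h = int(round(h * width / w))                  # resize to width 320, keep aspect ratio
    clip = [TF.resize(f, [new_h, width]) for f in clip]

    if train:                                          # spatial augmentation: flip and rotation
        if random.random() < 0.5:
            clip = [TF.hflip(f) for f in clip]
        angle = random.uniform(-10.0, 10.0)            # assumed rotation range
        clip = [TF.rotate(f, angle) for f in clip]

    return torch.stack(clip, dim=1)                    # [3, 96, H, W] tensor for 3D convolutions
]]></preformat>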
      <p>Two versions, V1 and V2, are introduced and depicted in Figure 1. V1 is a sequence of four
conv+pool+attention layers and two conv+pool layers. All convolutional layers use 3x3x3 filters.
The first layers use 2x2x1 pooling filters (no pooling on the temporal domain), and the other layers use 2x2x2 pooling
filters. V2 is a sequence of five conv+pool+attention layers. Conv. filters are
of size 7x5x3 and pooling filters of size 4x3x2 for the first two blocks. The remaining blocks use
3x3x3 and 2x2x2 for conv. and pooling filters respectively. V2 leads to almost square feature
maps after the second block, so that horizontality, verticality and temporality can be better
combined before the fully connected layers.</p>
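      <p>The sketch below illustrates, under stated assumptions, how one conv+pool+attention block and the V1 stack could be written in PyTorch; the channel widths, the number of temporally unpooled blocks, the attention module (a simple sigmoid gating placeholder rather than the mechanism of [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]) and the classifier size are assumptions for illustration only.</p>
      <preformat><![CDATA[
# Minimal sketch of a conv+pool(+attention) block and the V1 layout described above.
import torch
import torch.nn as nn


class Block(nn.Module):
    def __init__(self, c_in, c_out, pool, attention=True):
        super().__init__()
        self.conv = nn.Conv3d(c_in, c_out, kernel_size=3, padding=1)   # 3x3x3 filters
        self.pool = nn.MaxPool3d(pool)                                 # (1,2,2) keeps the temporal length
        self.att = nn.Sequential(nn.Conv3d(c_out, c_out, kernel_size=1),
                                 nn.Sigmoid()) if attention else nn.Identity()
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.act(self.pool(self.conv(x)))
        return x * self.att(x)                                         # gate features by an attention map


class V1(nn.Module):
    def __init__(self, n_classes=21):                                  # 2 classes for the detection subtask
        super().__init__()
        # four conv+pool+attention blocks followed by two conv+pool blocks (widths are assumed)
        widths = [3, 16, 32, 64, 96, 128, 128]
        pools = [(1, 2, 2), (1, 2, 2), (2, 2, 2), (2, 2, 2), (2, 2, 2), (2, 2, 2)]
        self.blocks = nn.Sequential(*[Block(widths[i], widths[i + 1], pools[i],
                                            attention=(i < 4)) for i in range(6)])
        self.head = nn.Sequential(nn.Flatten(), nn.LazyLinear(500),
                                  nn.ReLU(inplace=True), nn.Linear(500, n_classes))

    def forward(self, x):                                              # x: [B, 3, 96, H, W]
        return self.head(self.blocks(x))                               # logits; softmax applied in the loss
]]></preformat>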
      <p>
        The training method uses Nesterov momentum over a fixed number of epochs. The learning
rate is modified according to the loss evolution [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. The model with the best performance on
the validation loss is saved. The training method is the same for both subtasks. The objective
function is the cross-entropy loss of the softmax of the network output, summed
over the batch:
        <disp-formula id="eq1">
          <label>(1)</label>
          <tex-math><![CDATA[L(y, \mathit{class}) = -\log\left(\frac{\exp(y_{\mathit{class}})}{\sum_{i}\exp(y_{i})}\right)]]></tex-math>
        </disp-formula>
      </p>
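      <p>In code, Eq. (1) corresponds to the standard cross-entropy of the softmax output; a minimal sketch, assuming PyTorch, with the batch reduced by summation to match Eq. (1):</p>
      <preformat><![CDATA[
# Sketch of the objective in Eq. (1): softmax over the class scores, then the
# negative log-probability of the target class, summed over the batch.
import torch
import torch.nn.functional as F


def objective(logits, targets):
    """logits: [B, n_classes] raw scores; targets: [B] class indices."""
    log_probs = F.log_softmax(logits, dim=1)            # log(exp(y_c) / sum_i exp(y_i))
    return -log_probs[torch.arange(len(targets)), targets].sum()

# Equivalent built-in form: F.cross_entropy(logits, targets, reduction="sum")
]]></preformat>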
      <p>
        We consider 21 classes for the classification task and two classes for the detection task, as
previously done in [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. Negative samples are extracted for the detection task and negative
proposals are built on its test set. Testing is performed either on the trimmed proposals (with one
centered window, or with a sliding window and several post-processing approaches) or by
running a sliding window on the whole video for the detection task. The latter output is
processed to segment the strokes frame-wise. Strokes shorter than 30 frames
are not considered. The model trained on the classification task is also tested on the detection
task without further training on the detection data. Two approaches are considered: 1) negative
class score vs. each of the others for decision, and 2) negative class score vs. the sum of all the others. Several
decision methods are also tested: No Window, Vote, Mean, and Gaussian, according to a temporal
window. See [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] for further details.
      </p>
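      <p>The sketch below illustrates one possible way to aggregate stride-one sliding-window scores into a frame-wise segmentation with the Vote, Mean and Gaussian window decisions and the 30-frame minimum length; the Gaussian width and the grouping details are assumptions, see [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] for the actual procedure.</p>
      <preformat><![CDATA[
# Illustrative aggregation of sliding-window class probabilities into per-frame
# labels, then grouping frames into strokes and dropping segments < 30 frames.
import numpy as np


def framewise_decision(window_probs, window_len=96, method="gaussian"):
    """window_probs: [n_windows, n_classes]; window t covers frames [t, t + window_len)."""
    n_windows, n_classes = window_probs.shape
    n_frames = n_windows + window_len - 1
    scores = np.zeros((n_frames, n_classes))
    offsets = np.arange(window_len)
    centre = (window_len - 1) / 2.0
    if method == "gaussian":                       # weight frames near the window centre more
        weights = np.exp(-0.5 * ((offsets - centre) / (window_len / 6.0)) ** 2)
    else:                                          # "mean" / "vote": uniform weights
        weights = np.ones(window_len)
    for t in range(n_windows):
        contrib = (np.eye(n_classes)[window_probs[t].argmax()]   # one-hot vote
                   if method == "vote" else window_probs[t])
        scores[t:t + window_len] += weights[:, None] * contrib
    return scores.argmax(axis=1)                   # per-frame class labels


def to_segments(labels, negative_class=0, min_len=30):
    """Group consecutive non-negative frames into strokes, dropping short ones."""
    segments, start = [], None
    for i, lab in enumerate(np.append(labels, negative_class)):
        if lab != negative_class and start is None:
            start = i
        elif lab == negative_class and start is not None:
            if i - start >= min_len:
                segments.append((start, i))
            start = None
    return segments
]]></preformat>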
    </sec>
    <sec id="sec-2">
      <title>3. Results</title>
      <p>
        This section presents the results per subtask according to the metrics presented in [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. For the
two subtasks, we trained the models for 2000 epochs using a learning rate of 0.0001, a momentum
of 0.5 and a weight decay of 0.005.
      </p>
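      <p>For reference, these hyper-parameters correspond to a standard SGD configuration with Nesterov momentum; a minimal sketch, assuming PyTorch and reusing the hypothetical V1 model and objective from the sketches above (the loss-driven learning-rate adaptation of [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] is omitted):</p>
      <preformat><![CDATA[
# Sketch of the reported training configuration: SGD with Nesterov momentum,
# learning rate 0.0001, momentum 0.5, weight decay 0.005, 2000 epochs.
import torch

model = V1(n_classes=21)                     # hypothetical model from the earlier sketch
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.5,
                            weight_decay=5e-3, nesterov=True)

# a single dummy batch stands in for the real clip/label DataLoader
train_loader = [(torch.randn(2, 3, 96, 90, 160), torch.randint(0, 21, (2,)))]

for epoch in range(2000):
    for clips, labels in train_loader:
        optimizer.zero_grad()
        loss = objective(model(clips), labels)   # objective from the earlier sketch
        loss.backward()
        optimizer.step()
]]></preformat>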
      <sec id="sec-2-1">
        <title>3.1. Subtask 1 - Stroke Classification</title>
        <p>As reported in Table 1, V1 and V2 perform similarly on the stroke classification subtask, but V2
using the Gaussian window decision performs the best, with 86.4% accuracy on the test set.
This model finished converging at epoch 815, with train and validation accuracies of 0.989 and
0.813 respectively. The confusion matrix of this run is depicted in Figure 2.</p>
        <p>As the confusion matrix shows, the model has a tendency to classify some
strokes as non-strokes (negative class). This is likely due to the variability of the negative
class, which enlarges its dedicated latent space and makes unseen samples more likely to fall
into it. This could be addressed by increasing the variability of the stroke samples via data augmentation
or by recording more of these strokes.</p>
      </sec>
      <sec id="sec-2-2">
        <title>3.2. Subtask 2 - Stroke Detection</title>
        <p>Table 2 reports the results using video candidates from the test set. Video candidates are simply
non-overlapping successive samples of length 150 frames from the test videos. The main metric
for evaluation is the mAP, and according to it the model V2 using the Vote decision performs best.
However, extracting video candidates in such a way is not efficient for detecting the strokes. That is
why Table 3 reports results using another segmentation method.</p>
        <p>To perform a better segmentation, a sliding window with a step of one frame is used on the test videos.
The outputs are combined in order to make a decision following the previously presented
window methods. The models from subtask 1 are also tested.</p>
        <p>As we can see, this segmentation method allows the model V1 to reach the best performance
in terms of mAP and IoU. However, this is not the case for the V2 models.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Conclusion</title>
      <p>
        This baseline intends to help the participants in solving the Sports Video Task. This work is in the
continuity of last year’s baseline [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], and more tools were implemented to help the participants,
as suggested at the last edition. Improvements can be made by transferring knowledge from
subtask 1 to solve subtask 2. Also, the data augmentation and the loss can be improved to
compensate for the imbalanced distribution of the samples. Finally, the segmentation method for stroke
detection can still be improved to boost the performance on this subtask. These possible
improvements may be implemented in next year’s baseline.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>K.</given-names>
            <surname>Soomro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Zamir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <article-title>UCF101: A dataset of 101 human actions classes from videos in the wild</article-title>
          ,
          <source>CoRR abs/1212</source>
          .0402 (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Ross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Vondrick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pantofaru</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vijayanarasimhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Toderici</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ricco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sukthankar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schmid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Malik</surname>
          </string-name>
          ,
          <article-title>AVA: A video dataset of spatio-temporally localized atomic visual actions</article-title>
          (
          <year>2018</year>
          )
          <fpage>6047</fpage>
          -
          <lpage>6056</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Thotakuri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Ross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carreira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vostrikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <article-title>The ava-kinetics localized human actions video dataset</article-title>
          , CoRR abs/
          <year>2005</year>
          .00214 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Piergiovanni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Ryoo</surname>
          </string-name>
          ,
          <article-title>Avid dataset: Anonymized videos from diverse countries</article-title>
          , in:
          <string-name>
            <given-names>H.</given-names>
            <surname>Larochelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ranzato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hadsell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Balcan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lin</surname>
          </string-name>
          (Eds.),
          <source>Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems</source>
          <year>2020</year>
          ,
          <article-title>NeurIPS 2020</article-title>
          , December 6-
          <issue>12</issue>
          ,
          <year>2020</year>
          , virtual,
          <year>2020</year>
          . URL: https://proceedings.neurips.cc/paper/2020/hash/c28e5b0c9841b5ef396f9f519bf6c217-Abstract.html.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Bilen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Fernando</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Gavves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vedaldi</surname>
          </string-name>
          ,
          <article-title>Action recognition with dynamic image networks</article-title>
          ,
          <source>IEEE Trans. Pattern Anal. Mach. Intell</source>
          .
          <volume>40</volume>
          (
          <year>2018</year>
          )
          <fpage>2799</fpage>
          -
          <lpage>2813</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <article-title>Two-stream convolutional networks for action recognition in videos</article-title>
          , in: NIPS,
          <year>2014</year>
          , pp.
          <fpage>568</fpage>
          -
          <lpage>576</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H. J.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. S.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Human action recognition using a modified convolutional neural network</article-title>
          ,
          <source>in: ISNN (2)</source>
          , volume
          <volume>4492</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2007</year>
          , pp.
          <fpage>715</fpage>
          -
          <lpage>723</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Lima</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. J. T.</given-names>
            <surname>Fernandes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. V. A.</given-names>
            <surname>Barros</surname>
          </string-name>
          ,
          <article-title>Human action recognition with 3d convolutional neural network, in: LA-CCI</article-title>
          , IEEE,
          <year>2017</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Carreira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <article-title>Quo vadis, action recognition? A new model and the kinetics dataset</article-title>
          , in: CVPR, IEEE Computer Society,
          <year>2017</year>
          , pp.
          <fpage>4724</fpage>
          -
          <lpage>4733</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Benois-Pineau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Péteri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Morlier</surname>
          </string-name>
          ,
          <article-title>Sport action recognition with siamese spatio-temporal cnns: Application to table tennis</article-title>
          , in: CBMI, IEEE,
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>R.</given-names>
            <surname>Zellers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Salehi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kusupati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hessel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          , Merlot reserve:
          <article-title>Multimodal neural script knowledge through vision and language and sound</article-title>
          , in: CVPR,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Carreira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Noland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Banki-Horvath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hillier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <source>A short note about kinetics-600</source>
          , CoRR abs/
          <year>1808</year>
          .01340 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gritsenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Arnab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tay</surname>
          </string-name>
          ,
          <article-title>Scenic: A jax library for computer vision research and beyond</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>21393</fpage>
          -
          <lpage>21398</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D.</given-names>
            <surname>Damen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Doughty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Farinella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Fidler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Furnari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kazakos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Moltisanti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Munro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Perrett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Price</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wray</surname>
          </string-name>
          ,
          <article-title>Scaling egocentric vision: The EPIC-KITCHENS dataset</article-title>
          , CoRR abs/
          <year>1804</year>
          .02748 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Monfort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Vondrick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Oliva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Andonian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ramakrishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Bargal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. M.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gutfreund</surname>
          </string-name>
          ,
          <article-title>Moments in time dataset: One million videos for event understanding</article-title>
          ,
          <source>IEEE Trans. Pattern Anal. Mach. Intell</source>
          .
          <volume>42</volume>
          (
          <year>2020</year>
          )
          <fpage>502</fpage>
          -
          <lpage>508</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>P.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Calandre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mansencal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Benois-Pineau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Péteri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Mascarilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Morlier</surname>
          </string-name>
          ,
          <article-title>Sport task: Fine grained action detection and classification of table tennis strokes from videos for MediaEval 2022</article-title>
          , in: MediaEval, CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>P.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Benois-Pineau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Péteri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Morlier</surname>
          </string-name>
          ,
          <article-title>Fine grained sport action recognition with twin spatio-temporal convolutional neural networks</article-title>
          ,
          <source>Multim. Tools Appl</source>
          .
          <volume>79</volume>
          (
          <year>2020</year>
          )
          <fpage>20429</fpage>
          -
          <lpage>20447</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>D.</given-names>
            <surname>Shao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Finegym: A hierarchical video dataset for fine-grained action understanding</article-title>
          , in: CVPR, IEEE,
          <year>2020</year>
          , pp.
          <fpage>2613</fpage>
          -
          <lpage>2622</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Vasconcelos</surname>
          </string-name>
          ,
          <article-title>RESOUND: towards action recognition without representation bias</article-title>
          ,
          <source>in: ECCV (6)</source>
          , volume
          <volume>11210</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2018</year>
          , pp.
          <fpage>520</fpage>
          -
          <lpage>535</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>S.</given-names>
            <surname>Noiumkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tirakoat</surname>
          </string-name>
          ,
          <article-title>Use of optical motion capture in sports science: A case study of golf swing</article-title>
          , in: ICICM,
          <year>2013</year>
          , pp.
          <fpage>310</fpage>
          -
          <lpage>313</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>P.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Benois-Pineau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Péteri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Morlier</surname>
          </string-name>
          ,
          <article-title>3D attention mechanisms in twin spatio-temporal convolutional neural networks. Application to action classification in videos of table tennis games</article-title>
          , in: ICPR, IEEE Computer Society,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>P.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <article-title>Spatio-temporal cnn baseline method for the sports video task of mediaeval 2021 benchmark</article-title>
          , in: MediaEval, CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>P.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <article-title>Fine-Grained Action Detection and Classification from Videos with Spatio-Temporal Convolutional Neural Networks. Application to Table Tennis (Détection et classification fines d'actions à partir de vidéos par réseaux de neurones à convolutions spatio-temporelles. Application au tennis de table)</article-title>
          ,
          <source>Ph.D. thesis</source>
          , University of La Rochelle, France,
          <year>2020</year>
          . URL: https://tel.archives-ouvertes.fr/tel-03128769.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>