<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Working Notes Proceedings of the MediaEval 2023 Workshop, Amsterdam, The Netherlands and Online</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Baseline Method for the Sport Task of MediaEval 2023 using 3D CNNs with Attention Mechanisms for Table Tennis Stroke Detection and Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pierre-Etienne Martin</string-name>
          <email>pierre_etienne_martin@eva.mpg.de</email>
          <uri>www.eva.mpg.de/ccp/staf/pierre-etienne-martin</uri>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CCP Department, Max Planck Institute for Evolutionary Anthropology</institution>
          ,
          <addr-line>D-04103 Leipzig</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <abstract>
        <p>This paper presents the baseline method proposed for the Sports Video task, part of the MediaEval 2023 benchmark. The task comprises six sports-related multimedia tasks, each divided into subtasks for table tennis and swimming. In this baseline, we focus only on table tennis stroke detection from untrimmed videos (subtask 2.1) and stroke classification from trimmed videos (subtask 3.1). We propose two types of 3D-CNN architectures to solve these two subtasks. Both 3D-CNNs use spatio-temporal convolutions and attention mechanisms, and the architectures and training process are tailored to the addressed subtask. The baseline method is shared publicly online to help the participants in their investigation and to ease some aspects of the task, such as video processing, the training method, evaluation, and the submission routine. The baseline reaches a mAP of 0.131 and an IoU of 0.515 with our V1 model on the detection subtask. On the classification subtask, it reaches an accuracy of 86.4% with our V2 model. The same baseline was used in the 2022 edition; additional results are included in this paper to encourage comparison and discussion with the participants.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The field of computer vision has shown considerable interest in the classification of actions
from videos [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4">1, 2, 3, 4</xref>
        ]. Initially, 2D CNNs were utilized for this task [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ], which later evolved
into 3D convolution methods to better encapsulate temporal information from videos [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The
use of optical flow, computed from the RGB stream, was explored to enhance performance and
convert RGB variations into movement data [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ]. More recently, multimodal methods have
been revisited, this time integrating the RGB and audio streams [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], leading to breakthroughs
on standard benchmark datasets like Kinetics-600 [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        The MediaEval 2023 Sport Task focuses on the classification and detection of table tennis
strokes from videos, as detailed in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. The task emphasizes actions with low visual inter-class
variability and involves detecting them from untrimmed videos (subtask 2.1) and classifying
them from trimmed videos (subtask 3.1). These subtasks are built on the TTStroke-21 dataset [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]
and bear similarities to other datasets with low inter-class variability [
        <xref ref-type="bibr" rid="ref14 ref15 ref16 ref17">14, 15, 16, 17</xref>
        ].
      </p>
      <p>
        This baseline is the same as the one presented in the 2022 MediaEval edition [
        <xref ref-type="bibr" rid="ref18 ref19">18, 19</xref>
        ] and
used in [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. Its implementation is publicly available on GitHub.
      </p>
      <fig id="fig1">
        <caption>
          <p>Figure 1: V1 and V2 architectures. The 320-pixel-wide video stream is processed by stacked 3D convolution (e.g., 3x3x3), pooling (e.g., 2x2x2), and attention blocks with ReLU activations, flattened to a 2688-dimensional feature vector, and fed to a fully connected layer (21 outputs for classification, 2 for detection) followed by a SoftMax producing the probabilistic output.</p>
        </caption>
      </fig>
    </sec>
    <sec id="sec-2">
      <title>2. Method</title>
      <p>
        The proposed method leverages solely the RGB data from the given videos, drawing inspiration
from [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. A key difference is the lack of Region Of Interest (ROI) computation from Optical
Flow values. The RGB frames are resized to a width of 320 and stacked to form 96-length
tensors, either from the trimmed videos or according to the annotation boundaries. Data
augmentation techniques, such as starting at different time points and spatial transformations
(flip and rotation), are employed to enhance variability.
      </p>
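      <p>
        As an illustration, the following minimal sketch (assuming PyTorch; the function name, padding strategy, and exact augmentation parameters are ours, not the released implementation) shows how such 96-length input tensors could be built with the augmentations described above:
      </p>
      <preformat>
import torch
import torchvision.transforms.functional as TF

def build_clip(frames, length=96, train=True):
    """frames: (T, C, H, W) RGB video resized to a width of 320;
    returns a (C, length, H, W) tensor for 3D convolutions."""
    t = frames.shape[0]
    # data augmentation: random temporal starting point
    start = torch.randint(0, max(1, t - length + 1), (1,)).item() if train else 0
    clip = frames[start:start + length]
    if length > clip.shape[0]:                    # pad short clips
        pad = clip[-1:].repeat(length - clip.shape[0], 1, 1, 1)
        clip = torch.cat([clip, pad], dim=0)
    if train:
        if torch.rand(1).item() > 0.5:            # random horizontal flip
            clip = TF.hflip(clip)
        angle = float(torch.empty(1).uniform_(-10.0, 10.0))
        clip = TF.rotate(clip, angle)             # small random rotation
    return clip.permute(1, 0, 2, 3)               # channels first for Conv3d
      </preformat>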
      <p>Two versions of the method, V1 and V2, illustrated in Figure 1, are used. V1 consists
of a sequence of four conv+pool+attention layers followed by two conv+pool layers. All
convolutional layers employ 3x3x3 filters. The initial layers use 2x2x1 pooling filters, while the
subsequent layers use 2x2x2 pooling filters. V2, on the other hand, comprises a sequence of five
conv+pool+attention layers. The convolution filters for the first two blocks are 7x5x3 in size,
with 4x3x2 pooling filters. The remaining blocks use 3x3x3 and 2x2x2 filters for convolution
and pooling, respectively.</p>
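      <p>
        A minimal sketch of V2 follows, assuming PyTorch. The block count and kernel/pooling sizes are taken from the text; the channel widths, the (T, H, W) ordering of the size triplets, and the sigmoid-gated attention are our assumptions (the paper's attention mechanism follows [21]):
      </p>
      <preformat>
import torch
import torch.nn as nn

class ConvPoolAttention(nn.Module):
    """One block: 3D convolution + ReLU + max pooling + attention gate."""
    def __init__(self, c_in, c_out, conv_k, pool_k):
        super().__init__()
        self.conv = nn.Conv3d(c_in, c_out, conv_k,
                              padding=tuple(k // 2 for k in conv_k))
        self.pool = nn.MaxPool3d(pool_k)
        self.att = nn.Conv3d(c_out, c_out, kernel_size=1)  # assumed gating

    def forward(self, x):
        x = self.pool(torch.relu(self.conv(x)))
        return x * torch.sigmoid(self.att(x))     # element-wise attention

def make_v2(n_classes=21):
    widths = [3, 16, 32, 64, 128, 256]             # assumed channel widths
    # first two blocks: 7x5x3 convolutions with 4x3x2 pooling (written
    # width-first in the text, (T, H, W) here); then 3x3x3 with 2x2x2
    specs = [((3, 5, 7), (2, 3, 4))] * 2 + [((3, 3, 3), (2, 2, 2))] * 3
    blocks = [ConvPoolAttention(widths[i], widths[i + 1], *specs[i])
              for i in range(5)]
    return nn.Sequential(*blocks, nn.Flatten(), nn.LazyLinear(n_classes))

# e.g. a 96-frame clip of 180x320 RGB frames:
# make_v2()(torch.randn(1, 3, 96, 180, 320)).shape == (1, 21)
      </preformat>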
      <p>
        The training process employs Nesterov momentum over a set number of epochs, with the
learning rate adjusted based on loss evolution [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. The model that performs best on the
validation loss is retained. The same training methods are used for both subtasks. The objective
function is the cross-entropy loss of the softmax-processed output, summed over the batch:
      </p>
      <disp-formula id="eq1">
        <label>(1)</label>
        <tex-math>\mathcal{L}(x, c) = - \sum_{b=1}^{B} \log\left( \frac{e^{x_{b, c_b}}}{\sum_{k} e^{x_{b, k}}} \right)</tex-math>
      </disp-formula>
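      <p>
        In PyTorch terms, Eq. (1) corresponds to the standard cross-entropy loss with sum reduction; a short check with illustrative tensor sizes:
      </p>
      <preformat>
import torch
import torch.nn.functional as F

logits = torch.randn(8, 21)             # e.g. a batch of 8, 21 classes
targets = torch.randint(0, 21, (8,))    # ground-truth class indices
loss = F.cross_entropy(logits, targets, reduction="sum")

# explicit form of Eq. (1), summed over the batch:
manual = -(F.log_softmax(logits, dim=1)[torch.arange(8), targets]).sum()
assert torch.allclose(loss, manual)
      </preformat>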
      <p>
        For classification, we consider 21 classes, and for detection, two classes, as in [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. Negative
samples are used for detection, and testing involves trimmed proposals or a sliding window
across the entire video. Strokes shorter than 30 frames are ignored. The classification-trained
model is also tested on detection without additional training. Two approaches are considered:
comparing the negative class score against all others, and comparing the negative class score
against the sum of all others. Various decision methods are tested. For more details, see [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ].
      </p>
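      <p>
        The two detection decisions can be sketched as follows (our illustration, assuming per-window softmax scores with the negative class at index 0):
      </p>
      <preformat>
import torch

def stroke_mask(scores, mode="max"):
    """scores: (N, C) softmax outputs for N sliding windows.
    mode="max": negative class score vs. the best other class score;
    mode="sum": negative class score vs. the sum of all other scores."""
    neg = scores[:, 0]
    if mode == "max":
        pos = scores[:, 1:].max(dim=1).values
    else:
        pos = scores[:, 1:].sum(dim=1)
    return pos > neg                     # True where a stroke is detected
      </preformat>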
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <p>
        This section outlines the results for each subtask based on the metrics detailed in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Both
subtasks involved training the models for 2000 epochs with a learning rate of 0.0001, a momentum
of 0.5, and a weight decay of 0.005.
      </p>
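      <p>
        The stated configuration maps directly onto a standard SGD optimizer; a sketch (the placeholder module stands in for the V1/V2 networks, and the loss-based learning-rate adjustment of [21] is omitted):
      </p>
      <preformat>
import torch
import torch.nn as nn

model = nn.Conv3d(3, 16, 3)              # placeholder for the V1/V2 network
optimizer = torch.optim.SGD(model.parameters(), lr=0.0001, momentum=0.5,
                            weight_decay=0.005, nesterov=True)
# training then runs for 2000 epochs, keeping the model with the best
# validation loss (Section 2)
      </preformat>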
      <sec id="sec-3-1">
        <title>3.1. Subtask 2.1 - Table Tennis Stroke Detection</title>
        <p>Table 1’s first section presents results using video candidates from the test set, which are
non-overlapping, successive 150-frame samples cut from the test videos. The primary evaluation metric
is mAP, with the V2 model using the Vote decision performing best. However, this method of
extracting video candidates is not efficient for stroke detection. For better segmentation, a
sliding window with a step of one frame is applied to the test videos, and the outputs are combined using
the previously mentioned window methods. Models trained on the classification subtask (marked with †)
are also tested. The second part of Table 1 shows some improvement, with the V1 model achieving
the best mAP and IoU scores using this segmentation method, while the V2 models do not perform
as well. (†† marks the decision comparing the negative class score against the sum of all others.)</p>
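        <p>
          A sketch of how the sliding-window outputs can be combined into per-frame decisions (our illustration; the Gaussian weighting and its sigma are assumptions, with a flat weight corresponding to the Vote decision):
        </p>
        <preformat>
import torch

def frame_decisions(window_probs, L=96, sigma=24.0):
    """window_probs: (N, C) softmax outputs for N windows of length L,
    extracted with a stride of one frame. Returns per-frame class labels."""
    n, c = window_probs.shape
    t = n + L - 1                                 # frames covered
    centre = (L - 1) / 2
    w = torch.exp(-0.5 * ((torch.arange(L) - centre) / sigma) ** 2)
    frame_scores = torch.zeros(t, c)
    for i in range(n):                            # accumulate weighted votes
        frame_scores[i:i + L] += w[:, None] * window_probs[i]
    return frame_scores.argmax(dim=1)
        </preformat>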
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Subtask 3.1 - Table Tennis Stroke Classification</title>
        <p>As reported in Table 2, V1 and V2 perform similarly on the stroke classification subtask, but V2
using the Gaussian window decision performs best, with an accuracy of 86.4% on the test set.
This model converged at epoch 815, with training and validation accuracies of 0.989 and
0.813, respectively. The confusion matrices of this run are depicted in Figure 2.</p>
        <p>As the confusion matrices show, the model tends to classify some
strokes as non-strokes (negative class). This is most likely due to the high variability of the negative
class, which enlarges its dedicated latent space and makes unseen samples more likely to
fall into it. This could be addressed by increasing the variability of the stroke samples via data
augmentation or by recording more of these strokes.</p>
        <fig id="fig2">
          <caption>
            <p>Figure 2: Confusion matrices of the best run: (a) stroke type, (b) hand-side, (c) hand-side and type.</p>
          </caption>
        </fig>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>This baseline aims to assist participants in the Sports Video task, building on last year’s
baseline [
        <xref ref-type="bibr" rid="ref18 ref19">18, 19</xref>
        ]. This paper provides more results than the previous year to foster discussion and
comparison. Improvements could be made by integrating insights across subtasks 2.1 and 3.1, and by
refining the training process with more complex data augmentation or a weighted loss. For the
next edition, we plan to provide a baseline for the entire Sport task, covering both table tennis and
swimming.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>K.</given-names>
            <surname>Soomro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Zamir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <article-title>UCF101: A dataset of 101 human actions classes from videos in the wild</article-title>
          ,
          <source>CoRR abs/1212.0402</source>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Ross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Vondrick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pantofaru</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vijayanarasimhan</surname>
          </string-name>
          , G. Toderici,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ricco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sukthankar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schmid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Malik</surname>
          </string-name>
          ,
          <article-title>AVA: A video dataset of spatio-temporally localized atomic visual actions</article-title>
          (
          <year>2018</year>
          )
          <fpage>6047</fpage>
          -
          <lpage>6056</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Thotakuri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Ross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carreira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vostrikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <article-title>The ava-kinetics localized human actions video dataset</article-title>
          ,
          <source>CoRR abs/2005.00214</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Piergiovanni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Ryoo</surname>
          </string-name>
          ,
          <article-title>Avid dataset: Anonymized videos from diverse countries</article-title>
          , in: H.
          <string-name>
            <surname>Larochelle</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Ranzato</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Hadsell</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Balcan</surname>
          </string-name>
          , H. Lin (Eds.),
          <source>Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems</source>
          <year>2020</year>
          ,
          <article-title>NeurIPS 2020</article-title>
          , December 6-
          <issue>12</issue>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Bilen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Fernando</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Gavves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vedaldi</surname>
          </string-name>
          ,
          <article-title>Action recognition with dynamic image networks</article-title>
          ,
          <source>IEEE Trans. Pattern Anal. Mach. Intell</source>
          .
          <volume>40</volume>
          (
          <year>2018</year>
          )
          <fpage>2799</fpage>
          -
          <lpage>2813</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <article-title>Two-stream convolutional networks for action recognition in videos</article-title>
          , in: NIPS,
          <year>2014</year>
          , pp.
          <fpage>568</fpage>
          -
          <lpage>576</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Lima</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. J. T.</given-names>
            <surname>Fernandes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. V. A.</given-names>
            <surname>Barros</surname>
          </string-name>
          ,
          <article-title>Human action recognition with 3d convolutional neural network</article-title>
          , in: LA-CCI, IEEE,
          <year>2017</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Carreira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <article-title>Quo vadis, action recognition? A new model and the kinetics dataset</article-title>
          , in: CVPR, IEEE Computer Society,
          <year>2017</year>
          , pp.
          <fpage>4724</fpage>
          -
          <lpage>4733</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Benois-Pineau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Péteri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Morlier</surname>
          </string-name>
          ,
          <article-title>Sport action recognition with siamese spatio-temporal cnns: Application to table tennis</article-title>
          , in: CBMI, IEEE,
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R.</given-names>
            <surname>Zellers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Salehi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kusupati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hessel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <article-title>Merlot reserve: Multimodal neural script knowledge through vision and language and sound</article-title>
          , in: CVPR,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Carreira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Noland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Banki-Horvath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hillier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <article-title>A short note about kinetics-600</article-title>
          ,
          <source>CoRR abs/1808.01340</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Erades</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. V. B.</given-names>
            <surname>Mansencal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Péteri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Morlier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dufner</surname>
          </string-name>
          , J. Benois-Pineau,
          <article-title>SportsVideo: A multimedia dataset for event and position detection in table tennis and swimming</article-title>
          ,
          <source>in: Working Notes Proceedings of the MediaEval 2023 Workshop</source>
          , Amsterdam,
          <source>The Netherlands and Online, 1-2 February</source>
          <year>2024</year>
          , CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Benois-Pineau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Péteri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Morlier</surname>
          </string-name>
          ,
          <article-title>Fine grained sport action recognition with twin spatio-temporal convolutional neural networks</article-title>
          ,
          <source>Multim. Tools Appl</source>
          .
          <volume>79</volume>
          (
          <year>2020</year>
          )
          <fpage>20429</fpage>
          -
          <lpage>20447</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D.</given-names>
            <surname>Shao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Finegym: A hierarchical video dataset for fine-grained action understanding</article-title>
          , in: CVPR, IEEE,
          <year>2020</year>
          , pp.
          <fpage>2613</fpage>
          -
          <lpage>2622</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Vasconcelos</surname>
          </string-name>
          ,
          <article-title>RESOUND: towards action recognition without representation bias</article-title>
          ,
          <source>in: ECCV (6)</source>
          , volume
          <volume>11210</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2018</year>
          , pp.
          <fpage>520</fpage>
          -
          <lpage>535</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>D.</given-names>
            <surname>Damen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Doughty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Farinella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Fidler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Furnari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kazakos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Moltisanti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Munro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Perrett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Price</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wray</surname>
          </string-name>
          ,
          <article-title>Scaling egocentric vision: The EPIC-KITCHENS dataset</article-title>
          ,
          <source>CoRR abs/1804.02748</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S.</given-names>
            <surname>Noiumkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tirakoat</surname>
          </string-name>
          ,
          <article-title>Use of optical motion capture in sports science: A case study of golf swing</article-title>
          , in: ICICM,
          <year>2013</year>
          , pp.
          <fpage>310</fpage>
          -
          <lpage>313</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hicks</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. G. S. de Herrera</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Langguth</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Lommatzsch</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Andreadis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Dao</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Martin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Hürriyetoglu</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Thambawita</surname>
            ,
            <given-names>T. S.</given-names>
          </string-name>
          <string-name>
            <surname>Nordmo</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Vuillemot</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          <string-name>
            <surname>Larson</surname>
          </string-name>
          (Eds.),
          <source>Working Notes Proceedings of the MediaEval 2022 Workshop</source>
          , Bergen, Norway and Online,
          <volume>12</volume>
          -
          <fpage>13</fpage>
          January
          <year>2023</year>
          , volume
          <volume>3583</volume>
          <source>of CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2023</year>
          . URL: https://ceur-ws.org/Vol-3583.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>P.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Calandre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mansencal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Benois-Pineau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Péteri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Mascarilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Morlier</surname>
          </string-name>
          ,
          <article-title>Sport task: Fine grained action detection and classification of table tennis strokes from videos for mediaeval 2022</article-title>
          , in: [18],
          <year>2022</year>
          . URL: https://ceur-ws.org/Vol-3583/paper26.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>L.</given-names>
            <surname>Hacker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bartels</surname>
          </string-name>
          , P. Martin,
          <article-title>Fine-grained action detection with RGB and pose information using two stream convolutional networks</article-title>
          ,
          <source>in: [18]</source>
          ,
          <year>2022</year>
          . URL: https://ceur-ws.org/Vol-3583/paper21.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>P.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Benois-Pineau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Péteri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Morlier</surname>
          </string-name>
          ,
          <article-title>3d attention mechanisms in twin spatio-temporal convolutional neural networks. Application to action classification in videos of table tennis games</article-title>
          , in: ICPR, IEEE Computer Society,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>P.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <article-title>Spatio-temporal cnn baseline method for the sports video task of mediaeval 2021 benchmark</article-title>
          , in: MediaEval, CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>P.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <article-title>Fine-Grained Action Detection and Classification from Videos with Spatio-Temporal Convolutional Neural Networks. Application to Table Tennis</article-title>
          , Ph.D. thesis, University of La Rochelle, France,
          <year>2020</year>
          . URL: https://tel.archives-ouvertes.fr/tel-03128769.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>