<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Sports Video Annotation: Detection of Strokes in Table Tennis task for MediaEval 2019</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pierre-Etienne Martin</string-name>
          <email>pierre-etienne.martin@u-bordeaux.fr</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jenny Benois-Pineau</string-name>
          <email>jenny.benois-pineau@u-bordeaux.fr</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Boris Mansencal</string-name>
          <email>boris.mansencal@labri.fr</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Renaud Péteri</string-name>
          <email>renaud.peteri@univ-lr.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laurent Mascarilla</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jordan Calandre</string-name>
          <email>jordan.calandre1@univ-lr.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Julien Morlier</string-name>
          <email>julien.morlier@u-bordeaux.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IMS, University of Bordeaux</institution>
          ,
          <addr-line>Talence</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>MIA, La Rochelle University</institution>
          ,
          <addr-line>La Rochelle</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Univ. Bordeaux</institution>
          ,
          <addr-line>CNRS, Bordeaux INP, LaBRI, UMR 5800, F-33400, Talence</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>27</fpage>
      <lpage>29</lpage>
      <abstract>
        <p>Action detection and classification is one of the main challenges in visual content analysis and mining. Sport video analysis has been a very popular research topic, due to the variety of application areas, ranging from multimedia intelligent devices with user-tailored digests, up to analysis of athletes' performances. Datasets with sport activities are now available for benchmarking methods. A large amount of work is also devoted to the analysis of sport gestures using motion capture systems. However, body-worn sensors and markers could disturb the natural behaviour of sports players. Furthermore, motion capture devices are not always available to potential users, be it a university faculty or a local sport team. Coming years will build upon the basic "Sports Video Annotation: Detection of Strokes in Table Tennis" task offered in 2019. The ultimate goal of this research is to produce automatic annotation tools for sport faculties, local clubs and associations to help coaches better assess and advise athletes during training.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Action detection and classification is one of the main challenges in
visual content analysis and mining [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Sport video analysis has
been a very popular research topic, due to the variety of
application areas, ranging from multimedia intelligent devices with
user-tailored digests, up to analysis of athletes’ performances [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The
Sport Video Annotation project was initiated between the Faculty
of Sports STAPS of the University of Bordeaux, the LaBRI -
Université de Bordeaux and the MIA lab. - La Rochelle University. It is
supported by the CNRS federation MIRES and the New Aquitaine
Region in the framework of an "APP Recherche". The goal of this
project is to develop artificial intelligence and multimedia indexing
methods for the recognition of table tennis sports activities. The
aim is to evaluate the performance of athletes, with a particular
focus on students, in order to develop optimal training strategies.
To this end, a video corpus named TTStroke-21 was recorded with
volunteer players. These data are of great scientific
interest for the multimedia community participating in the MediaEval
campaign.
      </p>
      <p>
        Other datasets such as UCF-101 [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], HMDB [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], Kinetics [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and AVA [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
are used as benchmarks for action classification methods. Others,
such as the Olympic Sports dataset [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], focus on sport actions only.
However, none of them is dedicated to a specific sport and its
associated rules. Furthermore, TTStroke-21 is annotated manually by
professional players or teachers of Table Tennis, making the
annotation process longer but more temporally and qualitatively accurate.
Classification methods such as the I3D model [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] or the LTC model [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ],
which perform well on the UCF-101 dataset, inspired the work done in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
through an SSTCNN (Siamese Spatio-Temporal Convolutional
Neural Network). Here, the video stream and the optical
flow computed from it are passed through the branches of the SSTCNN. The
similarity of the actions (strokes) in TTStroke-21 makes the classification
task challenging, and the multi-modal method seemed to improve
performance. In [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], spatio-temporal dependencies are learned
from the video using only RGB images; the scores are promising
but still below those of the multi-modal I3D methods.
      </p>
    </sec>
    <sec id="sec-2">
      <title>PARTICULAR CONDITIONS</title>
      <p>Because TTStroke-21 is constituted of videos with identifiable
players of Table Tennis, this dataset is subject to particular
conditions in order to respect the personal data and privacy of the players.
These Special Conditions apply to the use of Images generated in
the framework of the program Sports video annotations:
classification of strokes in table tennis, for the implementation of the
MediaEval program. They constitute the specific usage agreement
referred to in the Usage agreement for the MediaEval 2019 Research
Collections, signed between the User and the University of Delft.
The full and complete acceptance, without any reservation, of these
Special Conditions is a mandatory prerequisite for the provision
of the Images as part of the MediaEval programme. A complete
reading of these conditions is necessary; they commit the user, for
example, to obscure the faces (blurring, black banner, etc.) before
any publication and to destroy the data by October 1st 2020.</p>
    </sec>
    <sec id="sec-3">
      <title>DATASET DESCRIPTION</title>
      <p>In MediaEval 2019, we deliver a subset of the TTStroke-21 dataset,
which has been specifically recorded in a sport faculty facility using
lightweight equipment such as GoPro cameras. It consists
of player-centred videos recorded in natural conditions without
markers or sensors, see Fig 1. It comprises 20 table tennis stroke
classes, i.e. 8 services, 6 offensive strokes and 6 defensive strokes.
This taxonomy was designed with professional table tennis
teachers.</p>
      <p>[Fig 1 panels: a. Video acquisition; b. Annotation platform; c. Dataset samples]</p>
      <p>All videos are recorded in MP4 format.</p>
      <p>The organisation of the delivered data is as follows:
• The provided dataset is split into two subsets: i) training
set and ii) test set;
• In each directory, there are several videos (in MP4 format)
and each video may contain several actions;
• Each video file is accompanied by an XML file describing
the actions present in the video;
• For each action there are 3 attributes: the starting frame,
the ending frame, and the stroke class;
• In the training set XML files, all the attributes are specified,
but in the test set XML files, only the starting and ending
frames are specified, while the stroke class attribute is
purposely set to an invalid value ("Unknown") and should be
updated by the participants to one of the 20 valid classes.</p>
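      <p>The handling of the annotation files described above can be illustrated with a minimal Python sketch. The XML layout used here is an assumption made purely for illustration: the element name (action), the attribute names (begin, end, move), the placeholder class names and the threshold-based dummy_predict function are hypothetical and are not taken from the delivered files. The sketch reads the three attributes of each action and replaces the "Unknown" stroke class with a predicted one.</p>
      <preformat><![CDATA[
```python
import xml.etree.ElementTree as ET

# Hypothetical annotation layout: the element and attribute names below
# (video, action, begin, end, move) are assumptions, not the official schema.
SAMPLE_TEST_XML = """
<video>
  <action begin="120" end="245" move="Unknown" />
  <action begin="410" end="530" move="Unknown" />
</video>
"""


def read_actions(root):
    """Return (start_frame, end_frame, stroke_class) for every action."""
    return [
        (int(a.get("begin")), int(a.get("end")), a.get("move"))
        for a in root.iter("action")
    ]


def label_actions(root, predict):
    """Replace each stroke class with the output of predict(start, end)."""
    for a in root.iter("action"):
        a.set("move", predict(int(a.get("begin")), int(a.get("end"))))
    return root


def dummy_predict(start, end):
    """Placeholder classifier; the class names are invented for illustration."""
    if end - start == 125:
        return "placeholder_class_a"
    return "placeholder_class_b"


root = label_actions(ET.fromstring(SAMPLE_TEST_XML), dummy_predict)
labelled = read_actions(root)
print(labelled)
```
]]></preformat>
      <p>In an actual run, dummy_predict would be replaced by the participant's classifier applied to the frames between the starting and ending frame, and the updated tree would be written back with ElementTree's write method.</p>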
    </sec>
    <sec id="sec-4">
      <title>TASK DESCRIPTION</title>
      <p>The Sport Video Annotation task consists in assigning a label from
a given taxonomy of 20 classes of Table Tennis strokes to each
action delimited by starting frame and ending frame in each test
video file.</p>
      <p>Participants may submit up to four runs. For each run, they must
provide one XML file per video file, with the actions associated with
the recognised stroke class. Runs may be submitted as an archive
(zip or tar.gz file) with each run in a different directory. Participants
should also indicate whether any external data (other datasets, pretrained
networks, ...) was used to compute their runs. The task is considered
fully automatic: once the videos are provided to the system, results
should be produced without any human intervention.</p>
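      <p>The submission layout described above (up to four runs, one directory per run, one XML file per video, packed as a zip archive) could be produced along the following lines. This is only a sketch under stated assumptions: the function name pack_runs, the run directory names, the XML file names and the archive name are all hypothetical, since the task description does not prescribe them.</p>
      <preformat><![CDATA[
```python
import tempfile
import zipfile
from pathlib import Path


def pack_runs(runs, archive_path):
    """Write one directory per run into a zip archive.

    runs maps a run name to {xml_file_name: xml_text}. The run and file
    names are hypothetical; they are not prescribed by the task.
    """
    if len(runs) > 4:
        raise ValueError("at most four runs may be submitted")
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for run_name, files in sorted(runs.items()):
            for file_name, xml_text in sorted(files.items()):
                # Each run ends up in its own top-level directory.
                archive.writestr(f"{run_name}/{file_name}", xml_text)
    return archive_path


# Toy content: two runs, each with one (empty) annotation file.
example_runs = {
    "run1": {"video_001.xml": "<video />"},
    "run2": {"video_001.xml": "<video />"},
}
archive_file = Path(tempfile.mkdtemp()) / "submission.zip"
pack_runs(example_runs, archive_file)
with zipfile.ZipFile(archive_file) as archive:
    packed_names = sorted(archive.namelist())
print(packed_names)
```
]]></preformat>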
    </sec>
    <sec id="sec-5">
      <title>EVALUATION</title>
      <p>In MediaEval 2019, we propose a lightweight classification task.
It consists in classifying table tennis strokes whose temporal
borders are supplied in the XML files accompanying each video file.
Hence, for each test video, the participants are invited to produce an
XML file in which each stroke is labelled according to the given
taxonomy. This means that the label "Unknown" has to be replaced
by the label of the stroke class that the participant’s system has
assigned. All submissions will be evaluated in terms of per-class
accuracy (PCA) and global accuracy (GA). The PCA is computed
for each i-th class as:</p>
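      <p><disp-formula id="eq-1"><label>(1)</label><tex-math>\mathrm{PCA}_i = \frac{\mathrm{TP}_i}{N_{gt,i}}</tex-math></disp-formula>
Here TP<sub>i</sub> is the number of True Positives, i.e. strokes of the given
i-th class correctly labelled by the participant’s system, and N<sub>gt,i</sub> is the
number of recorded strokes of the i-th class in the test dataset.</p>
      <p><disp-formula id="eq-2"><label>(2)</label><tex-math>\mathrm{GA} = \frac{\mathrm{TP}}{N_{gt}}</tex-math></disp-formula>
Here TP = Σ<sub>i</sub> TP<sub>i</sub> is the number of correctly labelled strokes for
the whole dataset, and N<sub>gt</sub> is the number of strokes in the ground
truth, i.e. the whole test set.</p>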
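      <p>The two measures defined in this section can be computed as in the following minimal Python sketch. The function names, the list-based inputs and the toy labels are assumptions made for illustration; this is not the official evaluation script.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def per_class_accuracy(ground_truth, predicted):
    """PCA_i = TP_i / N_gt_i for every class i present in the ground truth."""
    if len(ground_truth) != len(predicted):
        raise ValueError("one prediction is required per annotated stroke")
    n_gt = Counter(ground_truth)
    # Count, per class, the strokes whose predicted label matches the truth.
    tp = Counter(g for g, p in zip(ground_truth, predicted) if g == p)
    return {cls: tp[cls] / n for cls, n in n_gt.items()}


def global_accuracy(ground_truth, predicted):
    """GA = TP / N_gt, with TP the sum of the per-class true positives."""
    if len(ground_truth) != len(predicted):
        raise ValueError("one prediction is required per annotated stroke")
    tp = sum(1 for g, p in zip(ground_truth, predicted) if g == p)
    return tp / len(ground_truth)


# Toy labels, invented for illustration only.
gt = ["serve", "serve", "offensive", "defensive"]
pred = ["serve", "offensive", "offensive", "serve"]

pca = per_class_accuracy(gt, pred)
ga = global_accuracy(gt, pred)
print(pca)  # {'serve': 0.5, 'offensive': 1.0, 'defensive': 0.0}
print(ga)   # 0.5
```
]]></preformat>
      <p>Only classes that occur in the ground truth receive a PCA value in this sketch, so a class with no recorded stroke in the test set never causes a division by zero.</p>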
    </sec>
    <sec id="sec-6">
      <title>DISCUSSION</title>
      <p>Participants are welcome to share their difficulties and their results,
even if they are negative. A better understanding of automatic
classification methods is easier to reach when all aspects of the methods are
shared.</p>
      <p>Thank you for participating in MediaEval 2019, and more
specifically in our task: "Sports Video Annotation: Detection of Strokes
in Table Tennis". We look forward to seeing you at the MediaEval
Workshop and to discussing your results further.</p>
    </sec>
    <sec id="sec-7">
      <title>ACKNOWLEDGMENTS</title>
      <p>We would like to thank all the players and annotators who
contributed to TTStroke-21, Alain Coupet for his dedication to the
project, and Xavier Daverat and Chantal Durand for their help in
formulating the Particular Conditions.</p>
      <p>This work was supported by Region of Nouvelle Aquitaine grant
CRISP and Bordeaux Idex Initiative.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Joao</given-names>
            <surname>Carreira</surname>
          </string-name>
          and
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset</article-title>
          .
          <source>CoRR</source>
          abs/1705.07750 (
          <year>2017</year>
          ). arXiv:1705.07750
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Moritz</given-names>
            <surname>Einfalt</surname>
          </string-name>
          , Dan Zecha, and
          <string-name>
            <given-names>Rainer</given-names>
            <surname>Lienhart</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Activity-Conditioned Continuous Human Pose Estimation for Performance Analysis of Athletes Using the Example of Swimming</article-title>
          . In
          <source>WACV</source>
          .
          <fpage>446</fpage>
          -
          <lpage>455</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Chunhui</given-names>
            <surname>Gu</surname>
          </string-name>
          , Chen Sun, Sudheendra Vijayanarasimhan, Caroline Pantofaru,
          <string-name>
            <given-names>David A.</given-names>
            <surname>Ross</surname>
          </string-name>
          , George Toderici,
          <string-name>
            <given-names>Yeqing</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Susanna</given-names>
            <surname>Ricco</surname>
          </string-name>
          , Rahul Sukthankar, Cordelia Schmid, and
          <string-name>
            <given-names>Jitendra</given-names>
            <surname>Malik</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions</article-title>
          .
          <source>CoRR</source>
          abs/1705.08421 (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Will</given-names>
            <surname>Kay</surname>
          </string-name>
          , Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>The Kinetics Human Action Video Dataset</article-title>
          .
          <source>CoRR</source>
          abs/1705.06950 (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Hildegard</given-names>
            <surname>Kuehne</surname>
          </string-name>
          , Hueihan Jhuang, Estíbaliz Garrote,
          <string-name>
            <given-names>Tomaso A.</given-names>
            <surname>Poggio</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Serre</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>HMDB: A large video database for human motion recognition</article-title>
          .
          In
          <source>ICCV</source>
          . IEEE Computer Society
          ,
          <fpage>2556</fpage>
          -
          <lpage>2563</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Zheng</given-names>
            <surname>Liu</surname>
          </string-name>
          and
          <string-name>
            <given-names>Haifeng</given-names>
            <surname>Hu</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Spatiotemporal Relation Networks for Video Action Recognition</article-title>
          .
          <source>IEEE Access</source>
          <volume>7</volume>
          (
          <year>2019</year>
          ),
          <fpage>14969</fpage>
          -
          <lpage>14976</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Pierre-Etienne</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jenny</given-names>
            <surname>Benois-Pineau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Renaud</given-names>
            <surname>Péteri</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Julien</given-names>
            <surname>Morlier</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Sport Action Recognition with Siamese Spatio-Temporal CNNs: Application to Table Tennis</article-title>
          . In
          <source>CBMI 2018</source>
          . IEEE,
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Pierre-Etienne</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jenny</given-names>
            <surname>Benois-Pineau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Renaud</given-names>
            <surname>Péteri</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Julien</given-names>
            <surname>Morlier</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Optimal choice of motion estimation methods for fine-grained action classification with 3D convolutional networks</article-title>
          . Submitted to ICIP
          <year>2019</year>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Juan Carlos</given-names>
            <surname>Niebles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Chih-Wei</given-names>
            <surname>Chen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Fei-Fei</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Modeling Temporal Structure of Decomposable Motion Segments for Activity Classification</article-title>
          . In
          <source>ECCV 2010</source>
          .
          <fpage>392</fpage>
          -
          <lpage>405</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Khurram</given-names>
            <surname>Soomro</surname>
          </string-name>
          , Amir Roshan Zamir, and
          <string-name>
            <given-names>Mubarak</given-names>
            <surname>Shah</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild</article-title>
          .
          <source>CoRR</source>
          abs/1212.0402 (
          <year>2012</year>
          ). arXiv:1212.0402
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Andrei</given-names>
            <surname>Stoian</surname>
          </string-name>
          , Marin Ferecatu, Jenny Benois-Pineau, and
          <string-name>
            <given-names>Michel</given-names>
            <surname>Crucianu</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Fast Action Localization in Large-Scale Video Archives</article-title>
          .
          <source>IEEE Trans. Circuits Syst. Video Techn</source>
          .
          <volume>26</volume>
          ,
          <issue>10</issue>
          (
          <year>2016</year>
          ),
          <fpage>1917</fpage>
          -
          <lpage>1930</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Gül</given-names>
            <surname>Varol</surname>
          </string-name>
          , Ivan Laptev, and
          <string-name>
            <given-names>Cordelia</given-names>
            <surname>Schmid</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Long-Term Temporal Convolutions for Action Recognition</article-title>
          .
          <source>IEEE Trans. Pattern Anal. Mach. Intell</source>
          .
          <volume>40</volume>
          ,
          <issue>6</issue>
          (
          <year>2018</year>
          ),
          <fpage>1510</fpage>
          -
          <lpage>1517</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>