<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Optical Flow Singularities for Sports Video Annotation: Detection of Strokes in Table Tennis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jordan Calandre</string-name>
          <email>jordan.calandre1@univ-lr.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Renaud Péteri</string-name>
          <email>renaud.peteri@univ-lr.fr</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laurent Mascarilla</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>MIA Laboratory, La Rochelle University</institution>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>27</fpage>
      <lpage>29</lpage>
      <abstract>
        <p>Over the past few years, the action recognition task has drawn considerable interest, leading to intensive research. This is mainly due to the variety of related applications, from autonomous cars to human behavior analysis. Up to now, most research has aimed at identifying various sport actions, as in the UCF-101 dataset [11], but, given the exponentially growing number of online videos and the need for ever greater accuracy, a need for finer analysis arises. In this working note, results for the MediaEval 2019 Sports Video Annotation "Detection of Strokes in Table Tennis" task [9] are presented. Since in sport videos the displacement flow appears to be one of the most useful sources of information for stroke identification, especially for differentiating quite similar strokes, this proposal relies on a combination of spatial information and the identification of optical flow singularities. As a result, the regions of the video frames most relevant to the classification task are detected.</p>
      </abstract>
      <kwd-group>
        <kwd>Figure 1</kwd>
        <kwd>Extracted Optical Flow using PWC-Net</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>The selected task requires analyzing a single sport, which means
that the analysis has to be even more precise than on datasets with high
inter-class variance. The dataset, which aims to represent real-life
training situations for athletes, is made up of videos recorded using
standard cameras, with an unbalanced number of training samples for
each stroke. No depth maps or data from motion capture
suits are available.</p>
      <p>
        This working note describes the methods
proposed by the MIA team for this task. Only handcrafted features
extracted from video frames and optical flow are used: Histogram
of Oriented Gradients (HoG) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] features and the coefficients of dense optical flow
singularities projected on a Legendre basis. These features
are represented by a Bag-of-Words model, and the final classification
is obtained by means of a linear SVM.
      </p>
    </sec>
    <sec id="sec-2">
      <title>OUR APPROACH</title>
      <p>
        The great success and popularity of deep learning methods for
2D image recognition tasks has led many researchers to adapt these
architectures to video analysis by using 3D filters instead of 2D filters,
an approach commonly known as 3D CNN [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>
        For both handcrafted and deep learning methods, optical flow
has also proved relevant, with the arrival of two-stream network
architectures [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] or Siamese networks [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Because the
filters learned automatically by deep learning methods may have no clear
human interpretation, unlike handcrafted approaches, we decided
to extract regions of interest around the player based only on the
optical flow singularities [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1–3</xref>
        ] and performed complementary analysis
on these areas.
      </p>
      <p>
        As already stated, the proposed approach relies on accurate dense
optical flow. Nowadays, one of the most popular methods is
probably the Farnebäck [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] method, which starts by generating an image
pyramid of different resolutions and uses polynomial expansion to
match pixels from one resolution to another. The main issue with
this method is that when an object of uniform color is moving, only
the borders of that object are detected: Farnebäck provides
good edges, but empty object interiors.
      </p>
      <p>
        More recent methods try to overcome this drawback,
in particular PWC-Net [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], which uses CNN pyramidal feature
extraction, warping layers, and cost volume layers to match features
of the first image with warped features of the second one. Our
method uses such a network pre-trained on the Sintel dataset
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], derived from an open source animated short film, to give clean boundaries,
as in Figure 1. Compared to the Sintel dataset, the task dataset
presents many compression artifacts; consequently, a Gaussian blur
is applied before optical flow extraction, and frames are resized to
speed up subsequent processing.
      </p>
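      <p>As a minimal sketch of the preprocessing described above, the following Python code blurs a grayscale frame with a separable Gaussian kernel and then subsamples it. The function names, the value of sigma, and the subsampling step are illustrative assumptions: the blur parameters and the target resolution are not stated here, and an actual pipeline would more likely rely on a standard image-processing library.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def gaussian_kernel(sigma, radius):
    """Return a normalised 1-D Gaussian kernel of length 2 * radius + 1."""
    x = np.arange(-radius, radius + 1, dtype=float)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    return kernel / kernel.sum()


def blur_and_downscale(frame, sigma=1.0, step=2):
    """Blur a 2-D grayscale frame, then keep one pixel out of `step`.

    `sigma` and `step` are assumed values chosen for illustration only.
    """
    radius = int(3 * sigma)
    kernel = gaussian_kernel(sigma, radius)
    # Replicate border pixels so the output keeps the input size.
    padded = np.pad(np.asarray(frame, dtype=float), radius, mode="edge")
    # Separable filtering: horizontal pass, then vertical pass.
    rows = np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    blurred = np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)
    return blurred[::step, ::step]
```
]]></preformat>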
    </sec>
    <sec id="sec-3">
      <title>Optical Flow Singularities</title>
      <p>Given the horizontal and vertical components U and V of the optical
flow, regions of high rotation or divergence are detected by the
following stage. For each frame, using a sliding window, the optical
flow is locally approximated using a Legendre polynomial basis.
The polynomial basis P is defined as:
<disp-formula><tex-math><![CDATA[P_{K,L}(x_1, x_2) = \sum_{k=0}^{K} \sum_{l=0}^{L} x_1^{k} x_2^{l}]]></tex-math></disp-formula></p>
      <p>To obtain precise results, a small sliding window of 50 pixels is
chosen. The resulting computational cost is therefore limited, as a
one-dimensional polynomial basis is precise enough in such a case.</p>
      <p>After the projection, the two components are efficiently
calculated on a canonical basis by approximating the U and V flows as
follows:</p>
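      <p><disp-formula><tex-math><![CDATA[\begin{pmatrix} U \\ V \end{pmatrix} \simeq A \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} + b = \begin{pmatrix} a_{11} x_1 + a_{12} x_2 + b_1 \\ a_{21} x_1 + a_{22} x_2 + b_2 \end{pmatrix}]]></tex-math></disp-formula></p>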
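      <p>A minimal sketch of this local approximation is given below, assuming a plain least-squares fit over one window. A Legendre basis restricted to degree one spans the same space as 1, x1 and x2, so such a fit directly returns the canonical coefficients A and b. The function name and the normalisation of the window coordinates to [-1, 1] are assumptions, not details taken from the method.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def fit_affine_flow(u, v):
    """Least-squares fit of (U, V) ~ A @ (x1, x2) + b on one window.

    u, v: 2-D arrays of equal shape holding the horizontal and vertical
    flow inside the window. Returns A (2x2) and b (length 2).
    """
    h, w = u.shape
    # Assumed convention: x1 is horizontal, x2 is vertical, both in [-1, 1].
    x2, x1 = np.meshgrid(np.linspace(-1, 1, h), np.linspace(-1, 1, w),
                         indexing="ij")
    design = np.column_stack([x1.ravel(), x2.ravel(), np.ones(h * w)])
    targets = np.column_stack([u.ravel(), v.ravel()])
    coef, *_ = np.linalg.lstsq(design, targets, rcond=None)
    A = coef[:2].T  # rows are (a11, a12) and (a21, a22)
    b = coef[2]     # (b1, b2)
    return A, b
```
]]></preformat>
      <p>For a synthetic rotating flow such as U = -x2 and V = x1, this fit would return an antisymmetric matrix A and a zero offset b.</p>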
      <p>Each pixel region is then represented by a 2×2 matrix made of
the canonical projection coefficients of the flow. Significant regions are
selected by a simple threshold:
<disp-formula><tex-math><![CDATA[\Delta(A) = \operatorname{tr}(A)^2 - 4 \det(A), \qquad \Delta(A) < 0.05]]></tex-math></disp-formula></p>
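      <p>The selection criterion can be illustrated with a short sketch. The quantity Δ(A) is the discriminant of the characteristic polynomial of A, so a negative value corresponds to complex eigenvalues, that is, a locally rotating flow. The function names are hypothetical, and the default threshold simply reuses the 0.05 value quoted above.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def discriminant(A):
    """Delta(A) = tr(A)^2 - 4 det(A) for a 2x2 matrix A."""
    A = np.asarray(A, dtype=float)
    return np.trace(A) ** 2 - 4.0 * np.linalg.det(A)


def is_selected(A, threshold=0.05):
    """Apply the threshold test Delta(A) < threshold."""
    return bool(discriminant(A) < threshold)


rotation = [[0.0, -1.0], [1.0, 0.0]]    # pure rotation
divergence = [[1.0, 0.0], [0.0, 1.0]]   # isotropic divergence
saddle = [[1.0, 0.0], [0.0, -1.0]]      # stretching along one axis only
```
]]></preformat>
      <p>With these toy matrices, the rotation and the isotropic divergence pass the test, while the saddle does not.</p>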
    </sec>
    <sec id="sec-4">
      <title>BoW and SVM for Action Recognition</title>
      <p>The classification task follows the Bag-of-Words (BoW) approach:
K-means is used to cluster the singularities (each singularity
being originally represented by its four projection coefficients) into
six clusters.</p>
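      <p>The clustering and histogram steps can be sketched as follows. This is a plain NumPy version of Lloyd's algorithm standing in for whichever K-means implementation was actually used; the function names, the number of iterations, and the random seed are assumptions.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def kmeans(points, k, n_iter=50, seed=0):
    """Very small Lloyd's algorithm; returns the k cluster centres."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialise the centres with k distinct input points.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members):  # keep the old centre if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers


def bow_histogram(descriptors, centers):
    """Count how many descriptors fall into each cluster (visual word)."""
    descriptors = np.asarray(descriptors, dtype=float)
    dists = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    return np.bincount(words, minlength=len(centers))
```
]]></preformat>
      <p>Here each descriptor would be the four coefficients of A for one selected singularity, and k would be set to six.</p>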
      <p>Except for the first run, the relative spatial positions of the
singularities in the frames are also used. Each frame is divided into
a grid of four cells, and the number of singularities in each of these
four regions is counted.</p>
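      <p>A possible way to compute these four counts is sketched below, assuming that the grid is a 2×2 split at the frame centre and that each singularity is given by a (row, column) position. Both the function name and this position convention are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def quadrant_counts(positions, frame_height, frame_width):
    """Count singularities per quadrant.

    Output order: top-left, top-right, bottom-left, bottom-right.
    """
    positions = np.asarray(positions, dtype=float).reshape(-1, 2)
    rows = (positions[:, 0] >= frame_height / 2).astype(int)
    cols = (positions[:, 1] >= frame_width / 2).astype(int)
    return np.bincount(2 * rows + cols, minlength=4)


example = [(10, 10), (10, 200), (150, 10), (150, 200), (160, 300)]
counts = quadrant_counts(example, frame_height=240, frame_width=320)
```
]]></preformat>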
      <p>For the last two runs, HoG features, represented by an eight-bin
BoW, are also used, but only on regions where significant
singularities have been selected. This aims at quantifying the relative
importance of optical-flow-based and gradient-based features.</p>
      <p>As a result, each stroke is represented by a histogram with at
most 18 bins (6 for singularities, 8 for HoG, and 4 for spatial regions).</p>
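      <p>The way the per-run descriptors could be assembled is sketched below. The function name and the order of concatenation are assumptions; only the bin counts (6, 8 and 4) come from the description above.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def stroke_descriptor(singularity_hist, spatial_hist=None, hog_hist=None):
    """Concatenate the available histograms into one feature vector."""
    parts = [singularity_hist]
    if hog_hist is not None:
        parts.append(hog_hist)
    if spatial_hist is not None:
        parts.append(spatial_hist)
    return np.concatenate([np.asarray(p, dtype=float) for p in parts])


# Toy histograms with the bin counts quoted in the text.
sing, hog, spatial = np.ones(6), np.ones(8), np.ones(4)
singularities_only = stroke_descriptor(sing)
with_positions = stroke_descriptor(sing, spatial_hist=spatial)
with_positions_and_hog = stroke_descriptor(sing, spatial_hist=spatial, hog_hist=hog)
```
]]></preformat>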
      <p>
        Classification is done by a cross-validated linear SVM[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which
helps limit overfitting.
      </p>
      <p>Since the given dataset is severely unbalanced, a balanced SVM is
used for the last run, giving penalties for the most common classes
in order to increase the retention rate of rare strokes.</p>
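      <p>One common way to balance an SVM is to weight each class inversely to its frequency. The sketch below uses the heuristic n_samples / (n_classes × class count), which is the rule behind the "balanced" class-weight option of scikit-learn; whether this exact rule was used for the run is an assumption, and the function name and toy labels are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(labels, return_counts=True)
    weights = len(labels) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))


# Toy labels: one frequent class and two rarer ones.
toy_labels = ["a"] * 6 + ["b"] * 2 + ["c"] * 1
toy_weights = balanced_class_weights(toy_labels)
```
]]></preformat>
      <p>Rare classes receive larger weights, so misclassifying them costs more during training.</p>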
    </sec>
    <sec id="sec-5">
      <title>RESULTS AND ANALYSIS</title>
      <p>The proposed method leads to four runs: the first uses only
singularities, and the others add information such as HoG or the
positions of the singularity regions. The accuracies of
the four runs are presented in Table 1 for both the training and test
sets.</p>
      <p>The last three runs, which combine singularities with spatial/pixel
information, have fairly similar results on the test set, but the run
using only the projection coefficients gives a lower global accuracy.
This suggests that motion-based analysis alone, without
other data, is not sufficient to interpret a stroke well,
and that focusing only on flow information results in a high
loss of information.</p>
      <p>The second and third runs, with singularity positions and
unbalanced SVM, have similar results both in terms of overall accuracy
and predicted classes. This behavior is unexpected, as one of the runs
uses HoG features while the other does not. Possibly, because only
one sport is present in the dataset, the players' edges are not
sufficient to differentiate strokes. We computed HoG on each frame, knowing
that one frame alone is not enough to determine which stroke class it
belongs to. We stacked them over the whole sequence without taking
temporal information into account, which is probably why HoG
has no overall impact on the results.</p>
      <p>On the other hand, the only run with balanced SVM provides
a better overall accuracy. As noted in the introduction, the dataset
is unbalanced. A standard unbalanced SVM predicts
classes so as to maximize the overall score; on this dataset, it
over-predicts the most frequent classes. By using class weights, the balanced SVM
increases its accuracy on the rare classes, resulting in a worse overall
result, but in better results on rare classes.</p>
    </sec>
    <sec id="sec-6">
      <title>DISCUSSION AND OUTLOOK</title>
      <p>This paper presents an approach for the Sports Video Annotation
task on a single-sport dataset. Due to the difficulty of the task, the scarcity
of samples for rare classes, the missing metadata about right- or left-handed players,
and the different camera viewpoints, the approach did not achieve high performance
scores, but it gives insight into what is missing in the proposed
optical flow singularity features.</p>
      <p>There is still room for improvement, mostly due to the lack of
long-term temporal information and to the variations between two
optical flows of the same stroke class when recorded by cameras
from different viewpoints.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Cyrille</given-names>
            <surname>Beaudry</surname>
          </string-name>
          , Renaud Péteri, and
          <string-name>
            <given-names>Laurent</given-names>
            <surname>Mascarilla</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Action recognition in videos using frequency analysis of critical point trajectories</article-title>
          .
          <source>2014 IEEE International Conference on Image Processing, ICIP 2014</source>
          . https://doi.org/10.1109/ICIP.2014.7025289
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Cyrille</given-names>
            <surname>Beaudry</surname>
          </string-name>
          , Renaud Péteri, and
          <string-name>
            <given-names>Laurent</given-names>
            <surname>Mascarilla</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>An efficient and sparse approach for large scale human action recognition in videos</article-title>
          .
          <source>Machine Vision and Applications</source>
          <volume>27</volume>
          ,
          <issue>4</issue>
          (
          <year>2016</year>
          ),
          <fpage>529</fpage>
          -
          <lpage>543</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Katy</given-names>
            <surname>Blanc</surname>
          </string-name>
          , Diane Lingrand, and
          <string-name>
            <given-names>Frédéric</given-names>
            <surname>Precioso</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>SINGLETS: Multi-Resolution Motion Singularities for Soccer Video Abstraction</article-title>
          . In
          <source>Proceedings of the Workshop CVsports (in conjunction with CVPR)</source>
          . Honolulu (Hawaii), United States. https://hal.archives-ouvertes.fr/hal-01540342
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Butler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wulff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. B.</given-names>
            <surname>Stanley</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Black</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>A naturalistic open source movie for optical flow evaluation</article-title>
          .
          <source>In European Conf. on Computer Vision (ECCV)</source>
          (Part IV, LNCS 7577), A. Fitzgibbon et al. (Eds.). Springer-Verlag,
          <fpage>611</fpage>
          -
          <lpage>625</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Chih-Chung</given-names>
            <surname>Chang</surname>
          </string-name>
          and
          <string-name>
            <given-names>Chih-Jen</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>LIBSVM: A Library for Support Vector Machines</article-title>
          .
          <source>ACM Transactions on Intelligent Systems and Technology</source>
          <volume>2</volume>
          ,
          <issue>3</issue>
          (
          <year>2011</year>
          ),
          <fpage>27:1</fpage>
          -
          <lpage>27:27</lpage>
          . Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>N.</given-names>
            <surname>Dalal</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Triggs</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Histograms of oriented gradients for human detection</article-title>
          .
          <source>In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05)</source>
          , Vol.
          <volume>1</volume>
          .
          <fpage>886</fpage>
          -
          <lpage>893</lpage>
          . https://doi.org/10.1109/CVPR.2005.177
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Gunnar</given-names>
            <surname>Farnebäck</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Two-frame Motion Estimation Based on Polynomial Expansion</article-title>
          .
          <source>In Proceedings of the 13th Scandinavian Conference on Image Analysis (SCIA'03)</source>
          . Springer-Verlag, Berlin, Heidelberg,
          <fpage>363</fpage>
          -
          <lpage>370</lpage>
          . http://dl.acm.org/citation.cfm?id=1763974.1764031
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Benois-Pineau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Péteri</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Morlier</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Sport Action Recognition with Siamese Spatio-Temporal CNNs: Application to Table Tennis</article-title>
          .
          <source>In 2018 International Conference on Content-Based Multimedia Indexing (CBMI 2018)</source>
          .
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          . https://doi.org/10.1109/CBMI.2018.8516488
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Pierre-Etienne</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jenny</given-names>
            <surname>Benois-Pineau</surname>
          </string-name>
          , Boris Mansencal, Renaud Péteri, Laurent Mascarilla,
          <string-name>
            <given-names>Jordan</given-names>
            <surname>Calandre</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Julien</given-names>
            <surname>Morlier</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Sports Video Annotation: Detection of Strokes in Table Tennis task for MediaEval 2019</article-title>
          .
          <source>Proc. of the MediaEval 2019 Workshop</source>
          , Sophia Antipolis, France, 27-29 October
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Karen</given-names>
            <surname>Simonyan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Two-Stream Convolutional Networks for Action Recognition in Videos</article-title>
          .
          <source>CoRR</source>
          abs/1406.2199 (
          <year>2014</year>
          ). arXiv:1406.2199 http://arxiv.org/abs/1406.2199
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Khurram</given-names>
            <surname>Soomro</surname>
          </string-name>
          , Amir Roshan Zamir, and
          <string-name>
            <given-names>Mubarak</given-names>
            <surname>Shah</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild</article-title>
          .
          <source>CoRR</source>
          abs/1212.0402 (
          <year>2012</year>
          ). arXiv:1212.0402 http://arxiv.org/abs/1212.0402
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Deqing</given-names>
            <surname>Sun</surname>
          </string-name>
          , Xiaodong Yang,
          <string-name>
            <given-names>Ming-Yu</given-names>
            <surname>Liu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jan</given-names>
            <surname>Kautz</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume</article-title>
          .
          <source>CoRR</source>
          abs/1709.02371 (
          <year>2017</year>
          ). arXiv:1709.02371 http://arxiv.org/abs/1709.02371
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Shuiwang</given-names>
            <surname>Ji</surname>
          </string-name>
          , Wei Xu, Ming Yang, and Kai Yu
          .
          <year>2013</year>
          .
          <article-title>3D Convolutional Neural Networks for Human Action Recognition</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>35</volume>
          (Jan
          <year>2013</year>
          ),
          <fpage>221</fpage>
          -
          <lpage>231</lpage>
          . https://doi.org/10.1109/TPAMI.2012.59
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>