=Paper=
{{Paper
|id=Vol-2670/MediaEval_19_paper_37
|storemode=property
|title=Optical Flow Singularities for Sports Video Annotation: Detection of Strokes in Table Tennis
|pdfUrl=https://ceur-ws.org/Vol-2670/MediaEval_19_paper_37.pdf
|volume=Vol-2670
|authors=Jordan Calandre,Renaud Péteri,Laurent Mascarilla
|dblpUrl=https://dblp.org/rec/conf/mediaeval/CalandrePM19
}}
==Optical Flow Singularities for Sports Video Annotation: Detection of Strokes in Table Tennis==
Jordan Calandre, Renaud Péteri, Laurent Mascarilla
MIA Laboratory, La Rochelle University, France
{jordan.calandre1,renaud.peteri,lmascari}@univ-lr.fr
Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

MediaEval’19, 27-29 October 2019, Sophia Antipolis, France

ABSTRACT

Over the past few years, the Action Recognition task has drawn considerable interest, leading to intensive research. This is mainly due to the variety of related applications, from autonomous cars to human behavior analysis.

Up to now, most research has aimed at identifying various sport actions, as in the UCF-101 dataset [11]. However, with the rapidly growing number of online videos and the demand for ever higher accuracy, the need for finer analysis arises.

In this working note, results for the MediaEval 2019 Sports Video Annotation "Detection of Strokes in Table Tennis" task [9] are presented. In sport videos, displacement flow appears to be one of the most useful sources of information for stroke identification, especially for differentiating very similar strokes. This proposal therefore relies on a combination of spatial information and the identification of Optical Flow singularities. As a result, the regions of the video frames most relevant to the classification task are detected.

1 INTRODUCTION

The selected task requires analyzing a single sport, which means that the analysis has to be even more precise than for datasets with high inter-class variance. The dataset, which aims at representing real-life training situations, is made up of videos recorded with standard cameras, with an unbalanced number of training samples for each stroke. No depth maps or data from motion capture suits are available.

This working note describes the methods proposed by the MIA team for this task. Only handcrafted features extracted from video frames and optical flow are used: Histogram of Oriented Gradients (HoG) [6] features and the coefficients of dense Optical Flow singularities projected on a Legendre basis. These features are represented by a Bag-of-Words model, and the final classification is obtained by means of a linear SVM.

2 OUR APPROACH

The great success and popularity of Deep Learning methods for 2D image recognition tasks has led many researchers to adapt these architectures to video analysis, using 3D filters instead of 2D filters, an approach commonly known as 3D CNN [13].

For both handcrafted and deep learning methods, Optical Flow has also proved relevant, notably with the arrival of two-stream network architectures [10] and Siamese networks [8]. Because the automatically learned filters of deep-learning methods may have no clear human meaning compared to handcrafted approaches, we decided to extract interesting regions around the player based only on the optical flow singularities [1–3] and to carry out complementary analysis on these areas.

As already said, the proposed approach relies on an accurate dense Optical Flow. Nowadays, one of the most popular methods is probably the Farnebäck method [7], which starts by generating an image pyramid of different resolutions and uses polynomial expansion to match pixels from one resolution to another. The main issue with this method is that when an object of uniform color is moving, only the borders of that object are detected: Farnebäck provides good edges, but empty objects.

More recent methods try to overcome this drawback, in particular PWC-Net [12], which uses CNN pyramidal feature extraction, warping layers, and cost volume layers to match features of the first image with warped features of the second one. Our method uses such a network, pre-trained on the Sintel dataset [4] (built from an open source animated short film), to obtain clean boundaries as in Figure 1. Compared to the Sintel dataset, the task dataset presents many compression artifacts; consequently, a Gaussian blur is applied before Optical Flow extraction, and frames are resized to speed up subsequent processing.

Figure 1: Extracted Optical Flow using PWC-Net

2.1 Optical Flow Singularities

Given the horizontal and vertical components U and V of the optical flow, regions of high rotation or divergence are detected as follows. For each frame, using a sliding window, the optical flow is locally approximated on a Legendre polynomial basis. The polynomial basis P is defined as:

<math>P_{K,L}(x_1, x_2) = \sum_{k=0}^{K} \sum_{l=0}^{L} x_1^k x_2^l</math>

To obtain precise results, a small sliding window of 50 pixels is chosen. The resulting computational cost is therefore limited, as a first-degree polynomial basis is precise enough in such a case.
{| class="wikitable"
|+ Table 1: Global accuracy
! Method !! Train set !! Test set
|-
| Unbalanced SVM || 153/754 || 25/354
|-
| Position + Unbalanced SVM || 426/754 || 46/354
|-
| Position + HoG + Unbalanced SVM || 524/754 || 46/354
|-
| Position + HoG + Balanced SVM || 485/754 || 50/354
|}
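To compare the runs more easily, the counts in Table 1 can be converted to percentages. The sketch below assumes that each entry is the number of correctly classified samples over the total number of samples in that set; the names `TABLE_1` and `accuracy_table` are illustrative only.

```python
# Entries copied from Table 1 as (train, test) pairs of "correct/total" strings.
# Assumption: each fraction is correctly classified samples over total samples.
TABLE_1 = {
    "Unbalanced SVM": ("153/754", "25/354"),
    "Position + Unbalanced SVM": ("426/754", "46/354"),
    "Position + HoG + Unbalanced SVM": ("524/754", "46/354"),
    "Position + HoG + Balanced SVM": ("485/754", "50/354"),
}


def ratio_to_percent(ratio: str) -> float:
    """Convert a 'correct/total' string into a percentage."""
    correct, total = (int(part) for part in ratio.split("/"))
    return 100.0 * correct / total


def accuracy_table(table):
    """Return {method: (train %, test %)}, rounded to one decimal place."""
    return {
        method: tuple(round(ratio_to_percent(r), 1) for r in ratios)
        for method, ratios in table.items()
    }


if __name__ == "__main__":
    for method, (train, test) in accuracy_table(TABLE_1).items():
        print(f"{method:35s} train {train:5.1f}%  test {test:5.1f}%")
```

Under that assumption, the first run reaches about 7.1% on the test set and the last one about 14.1%.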
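Figure 2: Accuracy of the predicted classes

With this basis, the two flow components are written as:

<math>U = u_{0,0} P_{0,0} + u_{0,1} P_{0,1} + u_{1,0} P_{1,0}</math>

<math>V = v_{0,0} P_{0,0} + v_{0,1} P_{0,1} + v_{1,0} P_{1,0}</math>

After the projection, the two components are efficiently computed on a canonical basis by approximating the U and V flows as follows:

<math>\begin{pmatrix} U \\ V \end{pmatrix} \simeq A \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} + b = \begin{pmatrix} a_{11} x_1 + a_{12} x_2 + b_1 \\ a_{21} x_1 + a_{22} x_2 + b_2 \end{pmatrix}</math>

Each pixel region is then represented by a 2x2 matrix A made of the canonical projection coefficients of the flow. Significant regions are selected by a simple threshold:

<math>\Delta(A) = \operatorname{tr}(A)^2 - 4 \det(A), \quad \Delta(A) < 0.05</math>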
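As an illustration of the affine model and of the threshold on Δ(A), the sketch below fits A and b to one window of flow and evaluates the discriminant. It is a minimal sketch, not the authors' implementation: an ordinary least-squares fit stands in for the Legendre projection, coordinates are assumed to be centred on the window, and all function names are hypothetical.

```python
import numpy as np


def fit_affine_flow(u, v):
    """Least-squares fit of (U, V) ~= A @ (x1, x2) + b over one window.

    u, v: 2D arrays holding the horizontal and vertical flow in the window.
    Returns the 2x2 matrix A and the offset vector b.
    """
    h, w = u.shape
    # Pixel coordinates centred on the window; x1 is horizontal, x2 vertical.
    x2, x1 = np.mgrid[0:h, 0:w].astype(float)
    x1 -= (w - 1) / 2.0
    x2 -= (h - 1) / 2.0
    design = np.column_stack([x1.ravel(), x2.ravel(), np.ones(h * w)])
    coef_u, *_ = np.linalg.lstsq(design, u.ravel(), rcond=None)
    coef_v, *_ = np.linalg.lstsq(design, v.ravel(), rcond=None)
    A = np.array([coef_u[:2], coef_v[:2]])
    b = np.array([coef_u[2], coef_v[2]])
    return A, b


def discriminant(A):
    """Delta(A) = tr(A)^2 - 4 det(A)."""
    return np.trace(A) ** 2 - 4.0 * np.linalg.det(A)


def is_significant(A, threshold=0.05):
    """Apply the threshold from the text: keep the window if Delta(A) < 0.05."""
    return discriminant(A) < threshold
```

For a pure rotation U = -x_2, V = x_1 the fit gives tr(A) = 0 and det(A) = 1, so Δ(A) = -4 and the window passes the threshold; for a saddle U = x_1, V = -x_2, Δ(A) = 4 and the window is rejected.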
2.2 BoW and SVM for Action Recognition

The classification task follows the Bag-of-Words (BoW) approach: K-Means is used to group the singularities (each singularity being originally represented by its four projection coefficients) into six clusters.

Except for the first run, the relative spatial positions of the singularities in the frames are also used. Each frame is divided into a grid of four regions, and the number of singularities in each of these four regions is counted.

For the last two runs, HoG features, represented by an eight-bin BoW, are also used, but only on regions where significant singularities have been selected. This aims at quantifying the relative importance of optical-flow-based and gradient-based features.

As a result, each stroke is represented by a histogram with at most 18 bins (6 for singularities, 8 for HoG, and 4 for spatial regions).

Classification is done by a cross-validated linear SVM [5] in order to limit overfitting.

Since the given dataset is seriously unbalanced, a balanced SVM is used for the last run: it penalizes the most common classes in order to increase the retention rate of rare strokes.

3 RESULTS AND ANALYSIS

The proposed method leads to four runs: the first one uses only the singularities, and the others add information such as HoG or the position of the singularity regions. The accuracy of the four runs is presented in Table 1 for both the training and test sets.

The last three runs, which use the singularities together with spatial or pixel information, give fairly similar results on the test set, whereas the run using only the projection coefficients gives a lower global accuracy. This indicates that movement-based analysis alone, without other data, is not sufficient to interpret a stroke well enough, and that focusing only on the flow information results in a high information loss.

The second and third runs, with singularity positions and unbalanced SVM, give similar results both in terms of overall accuracy and predicted classes. This behavior is unexpected, as one of the runs uses HoG features while the other does not. Possibly, because only one sport is present in the dataset, the players' edges are not sufficient to differentiate strokes. HoG was computed on each frame, knowing that one frame alone is not enough to tell which stroke class it belongs to. The HoG features were then stacked over the whole sequence without taking temporal information into account, which is probably why they have no overall impact on the results.

On the other hand, the only run with a balanced SVM provides a better overall accuracy on the test set. As said in the introduction, the dataset is unevenly balanced. A standard unbalanced SVM predicts classes so as to maximize the overall result; on this dataset, it overpredicts the most frequent classes. By using weights, the balanced SVM increases its accuracy on the rare classes, resulting in a worse overall result on the training set but in better results on rare classes.

4 DISCUSSION AND OUTLOOK

This paper presents an approach for the Sports Video Annotation task on a single-sport dataset. Due to the difficulty of the task, the few samples of rare classes, the missing metadata about right- or left-handed players, and the different camera viewpoints, the approach did not achieve high performance scores, but it gives an insight into what is missing in the proposed Optical Flow singularity features.

There is still room for improvement, mostly due to the lack of long-term temporal information and to the variations between two optical flows of the same stroke class when recorded by cameras from different viewpoints.
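The paper does not state how the balanced SVM of the last run weights its classes. One common convention, assumed here, sets each class weight to n_samples / (n_classes * class_count), so that errors on rare strokes cost more than errors on frequent ones; this is the heuristic scikit-learn applies for `class_weight="balanced"`. The function name and the stroke labels below are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Assumption: this is one common 'balanced' convention, not necessarily
    the weighting used in the paper.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical toy label set: one frequent stroke and one rare stroke.
    toy_labels = ["frequent_stroke"] * 8 + ["rare_stroke"] * 2
    print(balanced_class_weights(toy_labels))
```

With eight samples of one class and two of another, the frequent class gets a weight of 0.625 and the rare one 2.5, which matches the behaviour described in the text: lower accuracy on common classes in exchange for better retention of rare strokes.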
REFERENCES
[1] Cyrille Beaudry, Renaud Péteri, and Laurent Mascarilla. 2014. Action
recognition in videos using frequency analysis of critical point
trajectories. 2014 IEEE International Conference on Image Processing, ICIP
2014. https://doi.org/10.1109/ICIP.2014.7025289
[2] Cyrille Beaudry, Renaud Péteri, and Laurent Mascarilla. 2016. An
efficient and sparse approach for large scale human action recognition
in videos. Machine Vision and Applications 27, 4 (2016), 529–543.
[3] Katy Blanc, Diane Lingrand, and Frédéric Precioso. 2017. SINGLETS:
Multi-Resolution Motion Singularities for Soccer Video Abstraction.
In Workshop CVsports (in conjunction with CVPR). Honolulu (Hawaii),
United States. https://hal.archives-ouvertes.fr/hal-01540342
[4] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. 2012. A naturalistic
open source movie for optical flow evaluation. In European Conf. on
Computer Vision (ECCV) (Part IV, LNCS 7577), A. Fitzgibbon et al. (Eds.).
Springer-Verlag, 611–625.
[5] Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A Library for
Support Vector Machines. ACM Transactions on Intelligent Systems
and Technology 2 (2011), 27:1–27:27. Issue 3. Software available at
http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[6] N. Dalal and B. Triggs. 2005. Histograms of oriented gradients for
human detection. In 2005 IEEE Computer Society Conference on Computer
Vision and Pattern Recognition (CVPR’05), Vol. 1. 886–893.
https://doi.org/10.1109/CVPR.2005.177
[7] Gunnar Farnebäck. 2003. Two-frame Motion Estimation Based on
Polynomial Expansion. In Proceedings of the 13th Scandinavian Conference
on Image Analysis (SCIA’03). Springer-Verlag, Berlin, Heidelberg,
363–370. http://dl.acm.org/citation.cfm?id=1763974.1764031
[8] P. Martin, J. Benois-Pineau, R. Péteri, and J. Morlier. 2018. Sport
Action Recognition with Siamese Spatio-Temporal CNNs: Application
to Table Tennis. In 2018 International Conference on Content-Based
Multimedia Indexing (CBMI 2018). 1–6.
https://doi.org/10.1109/CBMI.2018.8516488
[9] Pierre-Etienne Martin, Jenny Benois-Pineau, Boris Mansencal, Renaud
Péteri, Laurent Mascarilla, Jordan Calandre, and Julien Morlier. 2019.
Sports Video Annotation: Detection of Strokes in Table Tennis task
for MediaEval 2019. Proc. of the MediaEval 2019 Workshop, Sophia
Antipolis, France, 27-29 October 2019.
[10] Karen Simonyan and Andrew Zisserman. 2014. Two-Stream Convolutional
Networks for Action Recognition in Videos. CoRR abs/1406.2199
(2014). arXiv:1406.2199 http://arxiv.org/abs/1406.2199
[11] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012.
UCF101: A Dataset of 101 Human Actions Classes From Videos
in The Wild. CoRR abs/1212.0402 (2012). arXiv:1212.0402
http://arxiv.org/abs/1212.0402
[12] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. 2017. PWC-
Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume.
CoRR abs/1709.02371 (2017). arXiv:1709.02371
http://arxiv.org/abs/1709.02371
[13] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 2013. 3D Convolutional
Neural Networks for Human Action Recognition. IEEE Transactions
on Pattern Analysis and Machine Intelligence 35 (Jan 2013), 221–231.
https://doi.org/10.1109/TPAMI.2012.59