Optical Flow Singularities for Sports Video Annotation: Detection of Strokes in Table Tennis Jordan Calandre1 , Renaud Péteri 2 , Laurent Mascarilla3 1 MIA Laboratory, La Rochelle University, France {jordan.calandre1,renaud.peteri,lmascari}@univ-lr.fr ABSTRACT Over the past few years, Action Recognition task has drawn con- siderable interests, leading to intensive researches. This is mainly due to the variety of related applications, from autonomous car to human behavior analysis. Up to now, most of researches aim to identify various sport actions such as UCF-101 dataset[11], but, due to the exponential number of online videos and the necessity to be more and more accurate, the need of finer analysis arises. In this working note, results for the MediaEval 2019 Sports Video Annotation "Detection of Strokes in Table Tennis" task [9] are pre- sented. As in sport videos displacement flow appears to be one of the most useful information for stroke identification, especially to differentiate quite similar strokes, this proposal relies on a com- Figure 1: Extracted Optical Flow using PWC-Net bination of spatial information and Optical Flow’s singularities identification. As a result, most relevant regions of video frames for the classification task are detected. to extract interesting regions around the player based only on the optical flow’s singularities [1–3] and did complementary analysis on this areas. 1 INTRODUCTION As already said, the proposed approach relies on dense accurate The selected task requires to analyze a single sport, which means Optical Flow. Nowadays, one of the most popular method is proba- that the analysis has to be even more precise than high inter- bly the Farneback [7] method which starts by generating an image class variance datasets. The dataset, aiming at representing real-life pyramid of different resolutions, and uses polynomial expansion to sportsman training situations, is made up of videos recorded using match the pixel from one resolution to another. The main issue with standard cameras with unbalanced number of training samples for this method is that when an object of uniform color is moving, only each stroke. No depth maps or data issued from motion capture the borders of that object are detected. Using Farneback provides suits are available. good edges, but empty objects. This working note provides a description of the methods pro- More recent methods are trying to overcome this drawback, posed by the team MIA on this task. Only handcrafted features especially, the PWC-Network [12] that use CNN pyramidal feature extracted from video frames and optical flow are used: Histogram extraction, warping layers, and cost volume layers to match features of oriented Gradients (HoG)[6] features and dense Optical Flow sin- of the first image and warped features of the second one. Our gularities’s coefficients projected on Legendre basis. These features method uses such a network pre-trained using the Sintel dataset are represented by a Bag-of-Words model and the final classification [4], an open source animated short film, to give clean boundaries is obtained by mean of a linear SVM. like in Figure 1. Compared to the Sintel dataset, the task dataset presents a lot of compression artifacts, consequently, Gaussian blur 2 OUR APPROACH is applied before Optical Flow extraction, and frames are resized to The great success and popularity of Deep Learning methods for speed up consequent processing. 2D images recognition tasks, led many researchers to adapt these architectures to video analysis using 3D filters instead of 2D filters 2.1 Optical Flow Singularities commonly known as 3DCNN[13]. Given the horizontal and vertical components U and V of the optical For both manual and deep learning methods, the Optical Flow flow, regions of high rotation or divergence are detected by the was also proved relevant, with the arrival of two-stream network following stage. For each frame, using a sliding window, the optical architectures[10] or Siamese Network[8]. Because the automati- flow is locally approximated using a Legendre polynomial basis. cally calculated filters of deep-learning methods could have no real The polynomial basis P is defined as: human meaning compared to handcrafted approaches, we decided P K, L (x 1 , x 2 ) = kK=0 lL=0 x 1k x 2l Í Í Copyright 2019 for this paper by its authors. Use To obtain precise results, a small sliding window of 50 pixels is permitted under Creative Commons License Attribution chosen. The resulting computational cost is therefore limited as a 4.0 International (CC BY 4.0). one-dimensional polynomial basis is precise enough in such a case MediaEval’19, 27-29 October 2019, Sophia Antipolis, France MediaEval’19, 27-29 October 2019, Sophia Antipolis, France J. Calandre et al. Table 1: Global accuracy Method Train set Test set Unbalanced SVM 153/754 25/354 Position + Unbalanced SVM 426/754 46/354 Position + HoG + Unbalanced SVM 524/754 46/354 Position + Hog + Balanced SVM 485/754 50/354 : U = u 0,0 P 0,0 + u 0,1 P0,1 + u 1,0 P1,0 V = v 0,0 P0,0 + v 0,1 P 0,1 + v 1,0 P1,0 After the projection, the two components are efficiently cal- culated on a canonical basis by approximating U and V flows as follows  :  U x1 a 11x 1 + a 12x 2 + b1     ≃A +b = V x2 a 21x 1 + a 22x 2 + b2 Each pixel region is then represented by a 2x2 matrix made of canonical projection coefficients of the flow. Significant region are Figure 2: Accuracy of the predicted classes selected by a simple threshold: ∆(A) = tr (A)2 − 4 ∗ det(A), ∆(A) < 0.05 2.2 BoW and SVM for Action Recognition a stroke, and focusing only on the flow information results in high The classification task follow the Bag of Word (BoW) approach: K- information loss. Means are used to classify the various singularities (each singularity The second and third run, with singularity positions and unbal- being originally represented by the four projection coefficients) into anced SVM have similar results both in terms of overall accuracy six clusters. and predicted classes. This behavior is unexpected as one of the run Except for the first run, the relative spatial positions of the sin- uses Hog features, while the others does not. Maybe, because only gularities in the frames are also used. The frames are divided in one sport is present in the dataset, the players edges are not suffi- four-squared grids and the number of singularities on each of these cient to differentiate strokes. We used HoG on each frame, knowing four regions are analysed. that one frame alone isn’t enough to know what stroke class it be- For the last two runs, HoG Features, as represented by a height longs to. We stacked them over the whole sequence without taking bins BoW, are also used but only on regions where significant sin- into account the temporal data, and that’s probably why the HoG gularities have been selected. This aims at quantifying the relative have no impact on the results overall. importance of optical flow-based and gradient based features. On the other hand, the only run with balanced SVM provides As a result, each stroke is represented by an histogram with at a better overall accuracy. As said in the introduction, the dataset most 18 bins (6 singularities, 8 HoG, and 4 spatial regions). is heterogeneously balanced. Standard unbalanced SVM predicts Classification is done by a cross-validated linear SVM[5], thus the classes to increase the overall result. On this dataset, it overpre- avoiding overfitting. dicts the most frequent classes. By using weights, balanced SVM The given dataset being seriously unbalanced, a balanced SVM is increases its accuracy on the rare classes, resulting in a worst overall used on the last run, giving penalties for the most common classes, result, but in better results on rare classes. to increate the retention rate of rare strokes. 3 RESULTS AND ANALYSIS 4 DISCUSSION AND OUTLOOK The proposed method leads to four runs, using only singularities This paper presents an approach for the Sports Video Annotation for the first one, and adding additional information like HoG or the on single-sport dataset task. Due to the difficulty of the task, the rare position of the singularities region for the others. The accuracy of classes samples, missing metadata about right or left handed players, the four runs are presented in Table 1 for both training and testing and different camera viewpoints, didn’t achieved high performance set. scores, but it gives an insight of what is missing in the proposed The last three runs with the singularities and spatial/pixel in- Optical Flow’s Singularities features. formation have pretty similar results for the test set, but the run There is a still rooms for improvement, mostly due to the lack of using only the projection coefficients gives a lower global accuracy. long term temporal information and the variations between two That proves that using movement-based analyze, without using optical flows of the same stroke class when recorded by cameras other data is not sufficient to have a good enough interpretation of on different viewpoints. Optical Flow Singularities for Sport Video Annotation MediaEval’19, 27-29 October 2019, Sophia Antipolis, France REFERENCES [1] Cyrille Beaudry, Renaud Péteri, and Laurent Mascarilla. 2014. Action recognition in videos using frequency analysis of critical point trajec- tories. 2014 IEEE International Conference on Image Processing, ICIP 2014. https://doi.org/10.1109/ICIP.2014.7025289 [2] Cyrille Beaudry, Renaud Péteri, and Laurent Mascarilla. 2016. An efficient and sparse approach for large scale human action recognition in videos. Machine Vision and Applications 27, 4 (2016), 529–543. [3] Katy Blanc, Diane Lingrand, and Frédéric Precioso. 2017. SINGLETS: Multi-Resolution Motion Singularities for Soccer Video Abstraction. In Workshop CVsports (in conjunction with CVPR) (Proceedings of the Workshop CVsports (in conjunction with CVPR)). Honolulu (Hawaii), United States. https://hal.archives-ouvertes.fr/hal-01540342 [4] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. 2012. A naturalistic open source movie for optical flow evaluation. In European Conf. on Computer Vision (ECCV) (Part IV, LNCS 7577), A. Fitzgibbon et al. (Eds.) (Ed.). Springer-Verlag, 611–625. [5] Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology 2 (2011), 27:1–27:27. Issue 3. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm. [6] N. Dalal and B. Triggs. 2005. Histograms of oriented gradients for human detection. In 2005 IEEE Computer Society Conference on Com- puter Vision and Pattern Recognition (CVPR’05), Vol. 1. 886–893 vol. 1. https://doi.org/10.1109/CVPR.2005.177 [7] Gunnar Farnebäck. 2003. Two-frame Motion Estimation Based on Polynomial Expansion. In Proceedings of the 13th Scandinavian Confer- ence on Image Analysis (SCIA’03). Springer-Verlag, Berlin, Heidelberg, 363–370. http://dl.acm.org/citation.cfm?id=1763974.1764031 [8] P. Martin, J. Benois-Pineau, R. Péteri, and J. Morlier. 2018. Sport Action Recognition with Siamese Spatio-Temporal CNNs: Application to Table Tennis. In 2018 International Conference on Content-Based Multimedia Indexing (CBMI 2018). 1–6. https://doi.org/10.1109/CBMI. 2018.8516488 [9] Pierre-Etienne Martin, Jenny Benois-Pineau, Boris Mansencal, Renaud Péteri, Laurent Mascarilla, Jordan Calandre, and Julien Morlier. 2019. Sports Video Annotation: Detection of Strokes in Table Tennis task for MediaEval 2019. Proc. of the MediaEval 2019 Workshop, Sophia Antipolis, France, 27-29 October 2019. [10] Karen Simonyan and Andrew Zisserman. 2014. Two-Stream Convolu- tional Networks for Action Recognition in Videos. CoRR abs/1406.2199 (2014). arXiv:1406.2199 http://arxiv.org/abs/1406.2199 [11] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild. CoRR abs/1212.0402 (2012). arXiv:1212.0402 http: //arxiv.org/abs/1212.0402 [12] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. 2017. PWC- Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume. CoRR abs/1709.02371 (2017). arXiv:1709.02371 http://arxiv.org/abs/ 1709.02371 [13] Shuiwang Ji ; Wei Xu ; Ming Yang ; Kai Yu. 2013. 3D Convolutional Neural Networks for Human Action Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (Jan 2013), 221–231. https://doi.org/10.1109/TPAMI.2012.59