Siamese Spatio-temporal convolutional neural network for stroke classification in Table Tennis games

Pierre-Etienne Martin, pierre-etienne.martin@u-bordeaux.fr
Bordeaux INP, UMR 5800, Univ. Bordeaux, CNRS, LaBRI, F-33400 Talence, France

Jenny Benois-Pineau, jenny.benois-pineau@u-bordeaux.fr
Bordeaux INP, UMR 5800, Univ. Bordeaux, CNRS, LaBRI, F-33400 Talence, France

Boris Mansencal, boris.mansencal@labri.fr
Bordeaux INP, UMR 5800, Univ. Bordeaux, CNRS, LaBRI, F-33400 Talence, France

Renaud Péteri, renaud.peteri@univ-lr.fr
MIA, La Rochelle University, La Rochelle, France

Julien Morlier, julien.morlier@u-bordeaux.fr
IMS, University of Bordeaux, Talence, France


This work presents a table tennis stroke classification approach based on a Siamese spatio-temporal convolutional neural network (SSTCNN). The videos are recorded at 120 frames per second with players performing in natural conditions. The frames are extracted, resized and processed to compute the optical flow. From the optical flow, a region of interest (ROI) is inferred. The SSTCNN is then fed with the RGB and optical flow ROI streams to give a probabilistic classification over all the table tennis strokes.

INTRODUCTION

In the scope of video processing, action recognition and classification is one of the main challenges. The Sport task of MediaEval 2019 [4] highlights this aspect by providing a dataset of table tennis recordings, TTStroke-21 [6], in which strokes have to be extracted and classified with the aim of improving athletes' performance. As a first step, videos are provided with their temporal segmentation and the task is to classify those segments. However, contrary to the datasets widely used in image and video processing, such as UCF-101 [8], HMDB [3] or Kinetics [1], this task focuses on fine-grained classification of highly similar strokes. The difficulty of this task is to find the characteristics of each kind of stroke using a limited dataset without over-fitting it. In this paper, we present an approach aiming at providing data with enough inter-dissimilarity and focusing on intra-similarity to feed a neural network able to classify without over-fitting on a limited dataset.

APPROACH

To deal with the low inter-class variability in TTStroke-21 and avoid over-fitting on this sample of the dataset, we use cuboids of optical flow in addition to cuboids of RGB images. Both are processed simultaneously with spatio-temporal convolutions through a Siamese architecture, as presented in [6].

Optical Flow estimator

As shown in [7], flow estimators can have a strong impact on the classification, so we tested classification using two different flow estimators: DeepFlow [9] and Dense Inverse Search (DIS) [2].

Because of the strong motion artefacts observed in the DIS flow, it is smoothed with a Gaussian blur using a kernel of size 3 × 3 and then multiplied by the computed foreground mask [10] to keep only foreground motion.
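The DIS post-processing described above can be illustrated with a minimal NumPy sketch. It assumes the flow is an array of shape (height, width, 2) and that the foreground mask has already been computed by a background subtraction method and is given as a binary array. The function names and the value of sigma are illustrative assumptions; the text only specifies the 3 × 3 kernel size.

```python
import numpy as np

def gaussian_kernel_1d(size=3, sigma=1.0):
    """Normalized 1D Gaussian kernel (sigma is an assumed value)."""
    offsets = np.arange(size) - (size - 1) / 2.0
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    return kernel / kernel.sum()

def blur_component(channel, kernel):
    """Separable Gaussian blur of one 2D flow component, edges replicated."""
    pad = len(kernel) // 2
    padded = np.pad(channel, pad, mode="edge")
    # Filter along rows, then along columns; "valid" removes the padding.
    rows = np.apply_along_axis(np.convolve, 1, padded, kernel, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="valid")

def clean_dis_flow(flow, foreground_mask, sigma=1.0):
    """Smooth a (height, width, 2) flow field and keep foreground motion only."""
    kernel = gaussian_kernel_1d(3, sigma)
    smoothed = np.stack(
        [blur_component(flow[..., c], kernel) for c in range(2)], axis=-1
    )
    # The mask is 1 on foreground pixels and 0 elsewhere.
    return smoothed * foreground_mask[..., None]
```

Blurring before masking lets isolated flow spikes be attenuated by their neighbours, while the mask then zeroes whatever motion remains in the background.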

Spatial segmentation

RGB and optical flow are spatially segmented using a region of interest (ROI) of center $C_{roi} = (x_{roi}, y_{roi})$, estimated from the position of the maximum of the optical flow norm and the center of gravity of all pixels with non-zero flow [6], as follows:

$$
\begin{aligned}
C_{max} &= (x_{max}, y_{max}) = \arg\max_{x,y} \left( \|D\|_1 \right) \\
C_g &= (x_g, y_g) = \frac{1}{\sum_{C \in \Omega} \delta(C)} \sum_{C \in \Omega} C\,\delta(C)
\quad \text{with } \delta(C) = \begin{cases} 1 & \text{if } \|D\|_1(C) \neq 0 \\ 0 & \text{otherwise} \end{cases} \\
x_{roi} &= \alpha f_{\omega_x}(x_{max}, W) + (1 - \alpha) f_{\omega_x}(x_g, W) \\
y_{roi} &= \alpha f_{\omega_y}(y_{max}, H) + (1 - \alpha) f_{\omega_y}(y_g, H)
\end{aligned}
\tag{1}
$$

with parameters $\alpha = 0.6$, $\Omega = (\omega_x, \omega_y) = (320 \times 180)$ the size of the resized video frames, and $(W, H)$ the size of the data input to our network. The function $f_\omega(u, V) = \max\left(\min\left(u, V - \frac{\omega}{2}\right), \frac{\omega}{2}\right)$ ensures that the input data are extracted within the boundaries of our data. To avoid jittering, we apply a Gaussian blur along the time dimension to average the center position, using a kernel of size 40 and scale parameter $\sigma_{blur} = 4.44$.
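The ROI estimation can be sketched as follows. This is only an illustration under stated assumptions: $D$ is taken to be the flow field of one frame, stored as an array of shape (height, width, 2); the clamp is read as keeping a $W \times H$ window inside the frame, so a center is restricted to [window/2, frame - window/2]; and the frame center is used when no pixel moves. The function names, the fallback and the edge padding of the temporal blur are hypothetical choices, not taken from the paper.

```python
import numpy as np

def clamp_center(u, frame_size, window_size):
    """Keep a window of `window_size` centered at u inside the frame."""
    half = window_size / 2.0
    return max(min(u, frame_size - half), half)

def roi_center(flow, window=(120, 120), alpha=0.6):
    """ROI center (x, y) for one frame from a (height, width, 2) flow field."""
    height, width = flow.shape[:2]
    norm = np.abs(flow).sum(axis=-1)              # L1 norm of each flow vector
    y_max, x_max = np.unravel_index(np.argmax(norm), norm.shape)
    ys, xs = np.nonzero(norm)                     # pixels with non-zero motion
    if len(xs) == 0:                              # no motion: frame center (assumption)
        x_g, y_g = width / 2.0, height / 2.0
    else:
        x_g, y_g = xs.mean(), ys.mean()           # center of gravity
    w, h = window
    x_roi = alpha * clamp_center(x_max, width, w) + (1 - alpha) * clamp_center(x_g, width, w)
    y_roi = alpha * clamp_center(y_max, height, h) + (1 - alpha) * clamp_center(y_g, height, h)
    return x_roi, y_roi

def smooth_centers(centers, size=40, sigma=4.44):
    """Gaussian smoothing along time of an (n_frames, 2) array of centers."""
    offsets = np.arange(size) - (size - 1) / 2.0
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()
    pad_left, pad_right = size // 2, size - 1 - size // 2
    padded = np.pad(np.asarray(centers, dtype=float),
                    ((pad_left, pad_right), (0, 0)), mode="edge")
    # "valid" convolution on the padded track returns one value per frame.
    return np.stack(
        [np.convolve(padded[:, k], kernel, mode="valid") for k in range(2)], axis=1
    )
```

Because the kernel size is even, the smoothed track is offset by half a frame relative to the input, which is negligible at 120 frames per second.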

Data normalization

The RGB image channels are normalized by their theoretical maximum value, 255 in our case, to map them into the interval [0, 1]. As done in [7], which compares different normalization methods, we normalize the optical flow $V = (v_x, v_y)$ using the mean $\mu$ and standard deviation $\sigma$ of the distribution of the maximum absolute values of each optical flow component over the whole dataset. In the following equation, $v$ and $v^N$ represent respectively one component of the optical flow $V$ and its normalization.

$$
v' = \frac{v}{\mu + 3\sigma}, \qquad
v^N(i, j) = \begin{cases} v'(i, j) & \text{if } |v'(i, j)| < 1 \\ \mathrm{sign}(v'(i, j)) & \text{otherwise} \end{cases}
\tag{2}
$$

This normalization method maps the values into the interval [-1, 1] and increases the magnitude of most vectors, making the optical flow easier to process for the classification of very similar actions such as table tennis strokes.
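A minimal sketch of this normalization is given below. It assumes that the maximum absolute values are collected once per flow field and that the standard deviation is the population one; neither granularity nor estimator is stated in the text, and the function names are illustrative.

```python
import numpy as np

def normalization_scale(flow_components):
    """mu + 3*sigma of the maximum absolute values of one flow component.

    `flow_components` is an iterable of 2D arrays, one per flow field
    (assumed granularity).
    """
    max_abs = np.array([np.abs(v).max() for v in flow_components])
    return max_abs.mean() + 3.0 * max_abs.std()

def normalize_component(v, scale):
    """Divide by the scale, then saturate at -1 and 1 as in Eq. (2)."""
    return np.clip(v / scale, -1.0, 1.0)

# Toy usage with two tiny "flow fields" for one component.
fields = [np.array([[1.0, -2.0]]), np.array([[4.0, 0.5]])]
scale = normalization_scale(fields)          # maxima 2 and 4: mean 3, std 1
normalized = normalize_component(np.array([3.0, -12.0, 6.0]), scale)
```

Clipping is equivalent to the piecewise definition: values with magnitude below 1 are unchanged and all others are replaced by their sign.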

SSTCNN

Our Siamese Spatio-Temporal Convolutional Neural Network (SSTCNN), see Fig. 1, consists of two branches, each with three 3D convolutional layers with 30, 60 and 80 filter response maps, followed by a fully connected layer of size 500. The branches take respectively cuboids of RGB values and cuboids of the optical flow computed from them, of size (W × H × T) = (120 × 120 × 100). The 3D convolutional layers use 3 × 3 × 3 space-time filters with a dense stride and a padding of 1 in each direction. The two branches are fused through a final fully connected layer of size 21 followed by a Softmax function to output a probabilistic classification. We also spatially augment the data by applying a random rotation in the range ±10°, a random translation in the range ±0.1 in the x and y directions, a random homothety in the range 1 ± 0.1, a horizontal flip with probability 0.5, and random channel swaps on the RGB data. We take extra care when applying these transformations to the optical flow, updating its values according to each transformation. Transformations are applied and centered on the region of interest, avoiding crops outside of the camera range.
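Updating the optical flow values under a geometric transformation can be illustrated as follows. This is a sketch under assumptions, not the authors' implementation: the flow is an array of shape (height, width, 2) holding (v_x, v_y), only the vector values are handled for rotation and homothety (the resampling of the pixel grid is the same warp as for the RGB frame and is omitted), and the sign convention of the rotation must match the one used for that warp.

```python
import numpy as np

def flip_flow_horizontal(flow):
    """Mirror the field left-right; horizontal displacements change sign."""
    flipped = np.array(flow[:, ::-1], dtype=float)
    flipped[..., 0] *= -1.0
    return flipped

def scale_flow_values(flow, factor):
    """After a homothety of ratio `factor`, displacements scale by `factor`."""
    return np.asarray(flow, dtype=float) * factor

def rotate_flow_values(flow, angle_deg):
    """Rotate every flow vector by `angle_deg` (grid resampling not included)."""
    theta = np.deg2rad(angle_deg)
    c, s = np.cos(theta), np.sin(theta)
    vx, vy = flow[..., 0], flow[..., 1]
    return np.stack([c * vx - s * vy, s * vx + c * vy], axis=-1)
```

Translations leave the vector values unchanged, so only the pixel positions move in that case.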

Training and submitted runs

All models were trained from scratch. We first trained for 250 epochs with the data samples split randomly across all strokes, and then with a split using only two videos for validation. However, the results obtained by splitting the dataset between videos were not satisfactory. A closer look at the dataset shows that most of the videos contain only one kind of stroke performed by the same player, so the model easily over-fits to the player's appearance rather than to the characteristics of the stroke itself. With such a limited dataset and a limited time window, we preferred to focus on the random distribution of the strokes among our training and validation sets. The first two runs are the classifications obtained with the models trained on the split dataset and saved at the minimum loss obtained on the validation set, one for each of the two flows presented in Section 2.1 (Optical Flow estimator). The other two runs use the same models retrained from scratch on all data samples, with the number of epochs that gave the best performance on the first validation set.
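The two-stage protocol can be summarized by the sketch below. The callables `make_model`, `train_one_epoch` and `validation_loss` are hypothetical placeholders for the actual training code, which is not described in this paper.

```python
import copy

def train_with_selection(make_model, train_one_epoch, validation_loss,
                         train_set, val_set, max_epochs=250):
    """Train on the split data and keep the model with minimum validation loss."""
    model = make_model()
    best_loss, best_epoch, best_model = float("inf"), 0, None
    for epoch in range(1, max_epochs + 1):
        train_one_epoch(model, train_set)
        loss = validation_loss(model, val_set)
        if loss < best_loss:
            # Snapshot the model at the best epoch seen so far.
            best_loss, best_epoch, best_model = loss, epoch, copy.deepcopy(model)
    return best_model, best_epoch

def retrain_on_all(make_model, train_one_epoch, all_samples, n_epochs):
    """Retrain from scratch on all samples for the selected number of epochs."""
    model = make_model()
    for _ in range(n_epochs):
        train_one_epoch(model, all_samples)
    return model
```

The first function corresponds to the first two runs, and calling the second with the returned epoch count corresponds to the other two runs.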

RESULTS

The left side of Table 1 shows the results of the first two runs, from the models trained on the split dataset with 250 epochs; the right side shows the two other runs, from the models trained with all the data. Compared to what has been obtained in previous work [6], the results are very low. The main differences are i) the lack of a negative class and ii) the split of the dataset into train and test sets between videos. This directly leads to over-fitting on the dataset and makes the model much less able to classify properly. The best results were obtained with the DeepFlow estimator.

Figure 2: Confusion Matrix of our best run

Furthermore, if we consider the confusion matrix of our best run, Fig. 2, and group strokes into larger classes, namely 'Forehand' or 'Backhand'; 'Service', 'Offensive' or 'Defensive'; or their intersection (6 classes), we get accuracies of 76.8%, 65.8% and 54.8% respectively.
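Such grouped accuracies can be derived from a confusion matrix by merging rows and columns, as in the sketch below. The four-class matrix and its counts are made up for illustration only and are not results from this paper; the function name is hypothetical.

```python
import numpy as np

def grouped_accuracy(confusion, groups):
    """Accuracy after merging fine classes into coarser groups.

    confusion[i, j] counts samples of true class i predicted as class j;
    groups[i] is the group label of fine class i.
    """
    confusion = np.asarray(confusion, dtype=float)
    labels = sorted(set(groups))
    index = np.array([labels.index(g) for g in groups])
    merged = np.zeros((len(labels), len(labels)))
    # Accumulate every fine cell into the cell of its (true, predicted) groups.
    np.add.at(merged, (index[:, None], index[None, :]), confusion)
    return np.trace(merged) / merged.sum()

# Toy example: two forehand and two backhand strokes (illustrative counts).
toy_confusion = [[5, 3, 0, 0],
                 [2, 6, 0, 0],
                 [0, 1, 4, 3],
                 [1, 0, 2, 5]]
toy_groups = ["Forehand", "Forehand", "Backhand", "Backhand"]
coarse = grouped_accuracy(toy_confusion, toy_groups)
fine = grouped_accuracy(toy_confusion, ["a", "b", "c", "d"])
```

Confusions between strokes of the same group no longer count as errors, which is why the coarse accuracy is higher than the fine one.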

CONCLUSION

Despite strong over-fitting, grouping strokes into larger classes shows that some characteristics useful for recognizing strokes are still learned. Furthermore, the work on TTStroke-21 [5] is still in progress, and the enrichment of the dataset will be a significant contribution to the domain of action detection and classification, especially for very similar actions.

Figure 1: SSTCNN architecture

MediaEval'19, 27-29 October 2019, Sophia Antipolis, France. P-E. Martin et al.

Table 1: Runs results (accuracy in %)

                    | Trained on split dataset | Trained on all data
Flow      | Epochs  | Train    Val     Test    | Train    Test
DIS       | 249     | 70.4     52.6    19.2    | 61.2     17.8
DeepFlow  | 229     | 74.7     56.1    17.2    | 70.2     22.9

ACKNOWLEDGMENTS

This work was supported by Region of Nouvelle Aquitaine grant CRISP and Bordeaux Idex Initiative.

REFERENCES

[1] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. 2017. The Kinetics Human Action Video Dataset. CoRR abs/1705.06950 (2017).

[2] Till Kroeger, Radu Timofte, Dengxin Dai, and Luc Van Gool. 2016. Fast Optical Flow Using Dense Inverse Search. In ECCV (LNCS), Vol. 9908. Springer.

[3] Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso A. Poggio, and Thomas Serre. 2011. HMDB: A large video database for human motion recognition. In ICCV. IEEE Computer Society.

[4] Pierre-Etienne Martin, Jenny Benois-Pineau, Boris Mansencal, Renaud Péteri, Laurent Mascarilla, Jordan Calandre, and Julien Morlier. 2019. Sports Video Annotation: Detection of Strokes in Table Tennis task for MediaEval 2019. In Proc. of the MediaEval 2019 Workshop, Sophia Antipolis, France, 27-29 October 2019.

[5] Pierre-Etienne Martin, Jenny Benois-Pineau, and Renaud Péteri. 2019. Fine-Grained Action Detection and Classification in Table Tennis with Siamese Spatio-Temporal Convolutional Neural Network. In ICIP 2019. IEEE.

[6] Pierre-Etienne Martin, Jenny Benois-Pineau, Renaud Péteri, and Julien Morlier. 2018. Sport Action Recognition with Siamese Spatio-Temporal CNNs: Application to Table Tennis. In CBMI 2018. IEEE.

[7] Pierre-Etienne Martin, Jenny Benois-Pineau, Renaud Péteri, and Julien Morlier. 2019. Optimal choice of motion estimation methods for fine-grained action classification with 3D convolutional networks. In ICIP 2019. IEEE.

[8] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild. CoRR abs/1212.0402 (2012). arXiv:1212.0402.

[9] Philippe Weinzaepfel, Jérôme Revaud, Zaïd Harchaoui, and Cordelia Schmid. 2013. DeepFlow: Large Displacement Optical Flow with Deep Matching. In IEEE ICCV.

[10] Zoran Zivkovic and Ferdinand van der Heijden. 2006. Efficient adaptive density estimation per image pixel for the task of background subtraction. Pattern Recognition Letters 27, 7 (2006).