=Paper= {{Paper |id=Vol-2670/MediaEval_19_paper_58 |storemode=property |title=Siamese Spatio-Temporal Convolutional Neural Network for Stroke Classification in Table Tennis Games |pdfUrl=https://ceur-ws.org/Vol-2670/MediaEval_19_paper_58.pdf |volume=Vol-2670 |authors=Pierre-Etienne Martin,Jenny Benois-Pineau,Boris Mansencal,Renaud Péteri,Julien Morlier |dblpUrl=https://dblp.org/rec/conf/mediaeval/MartinBMPM19 }} ==Siamese Spatio-Temporal Convolutional Neural Network for Stroke Classification in Table Tennis Games== https://ceur-ws.org/Vol-2670/MediaEval_19_paper_58.pdf

Siamese Spatio-temporal convolutional neural network for
stroke classification in Table Tennis games
Pierre-Etienne Martin¹, Jenny Benois-Pineau¹, Boris Mansencal¹,
Renaud Péteri², Julien Morlier³
¹ Univ. Bordeaux, CNRS, Bordeaux INP, LaBRI, UMR 5800, F-33400, Talence, France
² MIA, La Rochelle University, La Rochelle, France
³ IMS, University of Bordeaux, Talence, France

pierre-etienne.martin@u-bordeaux.fr,jenny.benois-pineau@u-bordeaux.fr,boris.mansencal@labri.fr
renaud.peteri@univ-lr.fr,julien.morlier@u-bordeaux.fr

Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
This work presents a table tennis stroke classification approach based on a Siamese spatio-temporal convolutional neural network (SSTCNN). The videos are recorded at 120 frames per second with players performing in natural conditions. The frames are extracted, resized and processed to compute the optical flow. From the optical flow, a region of interest (ROI) is inferred. The SSTCNN is then fed with the RGB and optical flow ROI streams to give a probabilistic classification over all the table tennis strokes.

1 INTRODUCTION
In video processing, action recognition and classification is one of the main challenges. The Sport task of MediaEval 2019 [4] underlines this aspect by providing a dataset of table tennis recordings, TTStroke-21 [6], in which strokes have to be extracted and classified with the aim of improving athletes' performance. As a first step, videos are provided with temporal segmentation and the task is to classify those segments. However, contrary to the common datasets widely used in image and video processing, such as UCF-101 [8], HMDB [3] or Kinetics [1], this task focuses on fine-grained classification of highly similar strokes. The difficulty of the task is to find the characteristics of each kind of stroke from a limited dataset without over-fitting it. In this paper, we present an approach that aims at providing data with enough inter-dissimilarity, while focusing on intra-similarity, to feed a neural network able to classify without over-fitting on a limited dataset.

2 APPROACH
To deal with the low inter-variability of the classes in TTStroke-21 and to avoid over-fitting on this sample of the dataset, we decided to use cuboids of optical flow in addition to cuboids of RGB images, with spatio-temporal convolutions processed simultaneously through a Siamese architecture as presented in [6].

2.1 Optical Flow estimator
As shown in [7], flow estimators can have a strong impact on the classification, so we tested classification using two different flow estimators: DeepFlow [9] and Dense Inverse Search (DIS) [2]. Because of the strong motion artefacts observed on the DIS flow, it is smoothed with a Gaussian blur using a kernel of size 3 × 3 and then multiplied by the computed foreground [10] to keep only foreground motion.

2.2 Spatial segmentation
RGB and optical flow are spatially segmented using a region of interest (ROI) of center $C_{roi} = (x_{roi}, y_{roi})$, estimated from the maximum of the optical flow norm and the center of gravity of all pixels [6] as follows:

$$
\begin{aligned}
C_{max} &= (x_{max}, y_{max}) = \underset{x,y}{\arg\max}\,(\|D\|_1) \\
C_g &= (x_g, y_g) = \frac{1}{\sum_{C \in \Omega} \delta(C)} \sum_{C \in \Omega} C\,\delta(C)
\quad \text{with } \delta(C) =
\begin{cases}
1 & \text{if } \|D\|_1(C) \neq 0 \\
0 & \text{otherwise}
\end{cases} \\
x_{roi} &= \alpha\, f_{\omega_x}(x_{max}, W) + (1 - \alpha)\, f_{\omega_x}(x_g, W) \\
y_{roi} &= \alpha\, f_{\omega_y}(y_{max}, H) + (1 - \alpha)\, f_{\omega_y}(y_g, H)
\end{aligned}
\tag{1}
$$

with parameters $\alpha = 0.6$, $\Omega = (\omega_x, \omega_y) = (320 \times 180)$ the size of the resized video frames, and $(W, H)$ the size of the data input to our network. The function $f_{\omega}(u, V) = \max(\min(u, V - \frac{\omega}{2}), \frac{\omega}{2})$ ensures that the input data are extracted within the boundaries of our data. To avoid jittering, we apply a Gaussian blur along the time dimension to average the center position, using a kernel of size 40 and scale parameter $\sigma_{blur} = 4.44$.

2.3 Data normalization
The RGB image channels are normalized by their theoretical maximum value, 255 in our case, to map them into the interval [0, 1]. As done in [7], which compares different normalization methods, we normalize the optical flow $V = (v_x, v_y)$ using the mean $\mu$ and standard deviation $\sigma$ of the distribution of the maximum absolute values of each optical flow component over the whole dataset. In the following equation, $v$ and $v^N$ represent respectively one component of the optical flow $V$ and its normalization.

$$
\begin{aligned}
v' &= \frac{v}{\mu + 3 \times \sigma} \\
v^N(i, j) &=
\begin{cases}
v'(i, j) & \text{if } |v'(i, j)| < 1 \\
\mathrm{SIGN}(v'(i, j)) & \text{otherwise}
\end{cases}
\end{aligned}
\tag{2}
$$

This normalization method maps the values into the interval [-1, 1] and increases the magnitude of most vectors, making the optical flow easier to process for the classification of very similar actions such as table tennis strokes.
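The optical flow normalization of Section 2.3 can be illustrated with a short sketch. It is not the code used for the submitted runs: the function name, the list-of-lists representation of a flow component and the example statistics (`mu=2.0`, `sigma=2.0`) are assumptions chosen only to make the scaling and saturation steps of Eq. (2) concrete.

```python
def normalize_flow_component(v, mu, sigma):
    """Normalize one optical-flow component following Eq. (2).

    v     -- 2D list of floats (one flow component, e.g. v_x)
    mu    -- mean of the distribution of maximum absolute values (dataset statistic)
    sigma -- standard deviation of that same distribution
    """
    scale = mu + 3.0 * sigma
    if scale <= 0:
        raise ValueError("mu + 3 * sigma must be positive")
    normalized = []
    for row in v:
        out_row = []
        for value in row:
            scaled = value / scale
            # Keep values strictly inside (-1, 1); saturate the rest to their sign.
            if abs(scaled) < 1.0:
                out_row.append(scaled)
            else:
                out_row.append(1.0 if scaled > 0 else -1.0)
        normalized.append(out_row)
    return normalized


# Hypothetical statistics and a tiny 2 x 3 flow component, for illustration only.
flow_x = [[0.5, -2.0, 10.0], [-12.0, 0.0, 4.0]]
result = normalize_flow_component(flow_x, mu=2.0, sigma=2.0)  # scale = 8
```

With these made-up numbers, the two displacements larger than 8 in magnitude saturate to ±1, while the others are simply divided by 8.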
MediaEval’19, 27-29 October 2019, Sophia Antipolis, France

2.4 SSTCNN
Our Siamese Spatio-Temporal Convolutional Neural Network (SSTCNN), see Fig. 1, consists of two branches, each with three 3D convolutional layers with 30, 60 and 80 filter response maps, followed by a fully connected layer of size 500. The branches take respectively cuboids of RGB values and cuboids of the optical flow computed from them, of size $(W \times H \times T) = (120 \times 120 \times 100)$. The 3D convolutional layers use 3 × 3 × 3 space-time filters with a dense stride and padding of 1 in each direction. The two branches are fused through a final fully connected layer of size 21 followed by a Softmax function to output a probabilistic classification.

Figure 1: SSTCNN architecture

2.5 Data augmentation
Data augmentation is performed online and is different for each epoch. Each stroke is fed to our SSTCNN once per epoch. For each stroke, we extract one video sample of size $(W \times H \times T)$. The $T$ successive frames from the RGB and optical flow are extracted following a normal distribution around the center of the stroke with standard deviation $\sigma = \frac{\Delta t - T}{6}$. We also spatially augment the data by applying a random rotation in the range ±10°, a random translation in the range ±0.1 in the x and y directions, a random homothety in the range 1 ± 0.1, a horizontal flip with probability 0.5, and random channel swaps on the RGB data. We take extra care to apply those changes to the optical flow by updating its values according to the transformations. Transformations are applied and centered on the region of interest, avoiding crops outside of the camera range.

2.6 Training and submitted runs
All models were trained from scratch. We first trained for 250 epochs with the data samples split randomly among all strokes, and then with a split using only two videos for validation. However, we noticed that the results obtained by splitting the dataset between videos were not satisfying. A detailed look at the dataset shows that this is because most of the videos contain only one kind of stroke performed by the same player, so the model easily over-fits to the player's appearance rather than to the characteristics of the stroke itself. With such a limited dataset and a limited time window, we preferred to focus on the random distribution of the strokes among our training and validation sets. The first two runs are the classifications obtained with the model trained on the split dataset and saved at the minimum loss obtained on the validation set, with the two different flows presented in Section 2.1. The other two runs are the same models retrained from scratch using all data samples, with the number of epochs that gave the best performance on the first validation set.

3 RESULTS
On the left side of Table 1 are the results of the first two runs, from the models trained on the split dataset with 250 epochs; on the right side are the two other runs, obtained from the models trained with all the data.

Table 1: Runs results

                      Split dataset          All data
Flow      Epochs   Train   Val    Test    Train   Test
DIS       249      70.4    52.6   19.2    61.2    17.8
DeepFlow  229      74.7    56.1   17.2    70.2    22.9

Compared to what has been obtained in previous work [6], the results are very low. The main differences are i) the lack of a negative class and ii) the split of the dataset into train and test sets between videos. This directly leads to over-fitting of the dataset and makes the model much less able to classify properly. The best results were obtained using the DeepFlow estimator.

Figure 2: Confusion Matrix of our best run

Furthermore, if we consider the confusion matrix of our best run, Fig. 2, and group strokes in larger classes as: 'Forehand', 'Backhand' or 'Service', 'Offensive', 'Defensive' or their intersection (6 classes), we respectively get accuracies of 76.8%, 65.8% and 54.8%.

4 CONCLUSION
Despite strong over-fitting, grouping strokes together in larger classes shows that some characteristics needed to recognize strokes are still learned. Furthermore, the work on TTStroke-21 [5] is still in progress, and the enrichment of the dataset will be a big contribution to the domain of action detection and classification, especially for very similar actions.

ACKNOWLEDGMENTS
This work was supported by Region of Nouvelle Aquitaine grant CRISP and Bordeaux Idex Initiative.
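The temporal sampling of Section 2.5 can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation used for the runs: it reads the standard deviation as σ = (Δt − T)/6, takes Δt to be the stroke duration in frames, draws the window center from the normal distribution, and clamps rare out-of-range draws. The function name and its arguments are hypothetical.

```python
import random


def sample_window_start(stroke_start, stroke_end, T, rng=random):
    """Pick the first frame of a T-frame window inside [stroke_start, stroke_end).

    Assumptions (not stated in the text): delta_t is the stroke length in
    frames, the sampled value is the window center, and out-of-range draws
    are clamped so the window never leaves the stroke.
    """
    delta_t = stroke_end - stroke_start
    if delta_t < T:
        raise ValueError("stroke is shorter than the requested window")
    sigma = (delta_t - T) / 6.0
    stroke_center = stroke_start + delta_t / 2.0
    window_center = rng.gauss(stroke_center, sigma)
    start = int(round(window_center - T / 2.0))
    # The admissible center offset is +/- (delta_t - T) / 2 = +/- 3 sigma,
    # so most draws already fit; the clamp only handles the tails.
    return max(stroke_start, min(start, stroke_end - T))
```

For example, `sample_window_start(1000, 1300, 100, random.Random(0))` returns a start frame between 1000 and 1200, so the 100-frame window stays inside the 300-frame stroke; a different window is drawn at every call, which matches the once-per-epoch sampling described in the text.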

REFERENCES
[1] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier,
Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back,
Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. 2017. The
Kinetics Human Action Video Dataset. CoRR abs/1705.06950 (2017).
[2] Till Kroeger, Radu Timofte, Dengxin Dai, and Luc Van Gool. 2016. Fast
Optical Flow Using Dense Inverse Search. In ECCV (LNCS), Vol. 9908.
Springer, 471–488.
[3] Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso A.
Poggio, and Thomas Serre. 2011. HMDB: A large video database for
human motion recognition. In ICCV. IEEE Computer Society, 2556–
2563.
[4] Pierre-Etienne Martin, Jenny Benois-Pineau, Boris Mansencal, Renaud
Péteri, Laurent Mascarilla, Jordan Calandre, and Julien Morlier. 2019.
Sports Video Annotation: Detection of Strokes in Table Tennis task
for MediaEval 2019. In Proc. of the MediaEval 2019 Workshop, Sophia
Antipolis, France, 27-29 October 2019.
[5] Pierre-Etienne Martin, Jenny Benois-Pineau, and Renaud Péteri. 2019.
Fine-Grained Action Detection and Classification in Table Tennis with
Siamese Spatio-Temporal Convolutional Neural Network. In ICIP 2019.
IEEE, 3027–3028.
[6] Pierre-Etienne Martin, Jenny Benois-Pineau, Renaud Péteri, and Julien
Morlier. 2018. Sport Action Recognition with Siamese Spatio-Temporal
CNNs: Application to Table Tennis. In CBMI 2018. IEEE, 1–6.
[7] Pierre-Etienne Martin, Jenny Benois-Pineau, Renaud Péteri, and Julien
Morlier. 2019. Optimal choice of motion estimation methods for fine-
grained action classification with 3D convolutional networks. In ICIP
2019. IEEE, 554–558.
[8] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012.
UCF101: A Dataset of 101 Human Actions Classes From Videos in The
Wild. CoRR 1212.0402 (2012). arXiv:1212.0402
[9] Philippe Weinzaepfel, Jérôme Revaud, Zaïd Harchaoui, and Cordelia
Schmid. 2013. DeepFlow: Large Displacement Optical Flow with Deep
Matching. In IEEE ICCV. 1385–1392.
[10] Zoran Zivkovic and Ferdinand van der Heijden. 2006. Efficient adaptive
density estimation per image pixel for the task of background
subtraction. Pattern Recognition Letters 27, 7 (2006), 773–780.