<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Siamese Spatio-temporal convolutional neural network for stroke classification in Table Tennis games</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Pierre-Etienne</forename><surname>Martin</surname></persName>
							<email>pierre-etienne.martin@u-bordeaux.fr</email>
							<affiliation key="aff0">
								<orgName type="laboratory" key="lab1">Bordeaux INP</orgName>
								<orgName type="laboratory" key="lab2">UMR 5800</orgName>
								<orgName type="institution" key="instit1">Univ. Bordeaux</orgName>
								<orgName type="institution" key="instit2">CNRS</orgName>
								<orgName type="institution" key="instit3">LaBRI</orgName>
								<address>
									<postCode>F-33400</postCode>
									<settlement>Talence</settlement>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Jenny</forename><surname>Benois-Pineau</surname></persName>
							<email>jenny.benois-pineau@u-bordeaux.fr</email>
							<affiliation key="aff0">
								<orgName type="laboratory" key="lab1">Bordeaux INP</orgName>
								<orgName type="laboratory" key="lab2">UMR 5800</orgName>
								<orgName type="institution" key="instit1">Univ. Bordeaux</orgName>
								<orgName type="institution" key="instit2">CNRS</orgName>
								<orgName type="institution" key="instit3">LaBRI</orgName>
								<address>
									<postCode>F-33400</postCode>
									<settlement>Talence</settlement>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Boris</forename><surname>Mansencal</surname></persName>
							<email>boris.mansencal@labri.fr</email>
							<affiliation key="aff0">
								<orgName type="laboratory" key="lab1">Bordeaux INP</orgName>
								<orgName type="laboratory" key="lab2">UMR 5800</orgName>
								<orgName type="institution" key="instit1">Univ. Bordeaux</orgName>
								<orgName type="institution" key="instit2">CNRS</orgName>
								<orgName type="institution" key="instit3">LaBRI</orgName>
								<address>
									<postCode>F-33400</postCode>
									<settlement>Talence</settlement>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Renaud</forename><surname>Péteri</surname></persName>
							<email>renaud.peteri@univ-lr.fr</email>
							<affiliation key="aff1">
								<orgName type="laboratory">MIA</orgName>
								<orgName type="institution">La Rochelle University</orgName>
								<address>
									<settlement>La Rochelle</settlement>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Julien</forename><surname>Morlier</surname></persName>
							<email>julien.morlier@u-bordeaux.fr</email>
							<affiliation key="aff2">
								<orgName type="department">IMS</orgName>
								<orgName type="institution">University of Bordeaux</orgName>
								<address>
									<settlement>Talence</settlement>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Siamese Spatio-temporal convolutional neural network for stroke classification in Table Tennis games</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">6FDA199E8ADC7325DAE89ADEA9E2AF6B</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-23T20:14+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This work presents a table tennis stroke classification approach based on a Siamese spatio-temporal convolutional neural network (SSTCNN). The videos are recorded at 120 frames per second with players performing in natural conditions. The frames are extracted, resized and processed to compute the optical flow. From the optical flow, a region of interest (ROI) is inferred. The SSTCNN is then fed with the RGB and optical flow ROI streams to produce a probabilistic classification over all the table tennis strokes.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>In video processing, action recognition and classification is one of the main challenges. The Sport task of MediaEval 2019 <ref type="bibr" target="#b3">[4]</ref> underlines this aspect by providing a dataset of table tennis recordings, TTStroke-21 <ref type="bibr" target="#b5">[6]</ref>, in which strokes have to be extracted and classified with the aim of improving athletes' performance. As a first step, videos are provided with temporal segmentation and the task is to classify those segments. However, contrary to the common datasets widely used in image and video processing such as UCF-101 <ref type="bibr" target="#b7">[8]</ref>, HMDB <ref type="bibr" target="#b2">[3]</ref> or Kinetics <ref type="bibr" target="#b0">[1]</ref>, this task focuses on the fine-grained classification of highly similar strokes. The difficulty is to find the characteristics of each kind of stroke using a limited dataset without over-fitting it. In this paper, we present an approach that aims to provide data with enough inter-dissimilarity, while focusing on intra-similarity, to feed a neural network able to classify without over-fitting on a limited dataset.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">APPROACH</head><p>To deal with the low inter-variability of the classes in TTStroke-21 and avoid over-fitting on this sample of the dataset, we decided to use cuboids of optical flow in addition to cuboids of RGB images with spatio-temporal convolutions processed simultaneously through a Siamese architecture as presented in <ref type="bibr" target="#b5">[6]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Optical Flow estimator</head><p>As shown in <ref type="bibr" target="#b6">[7]</ref>, flow estimators can have a strong impact on the classification, so we tested classification using two different flow estimators: DeepFlow <ref type="bibr" target="#b8">[9]</ref> and Dense Inverse Search (DIS) <ref type="bibr" target="#b1">[2]</ref>.</p><p>Because of the strong motion artefacts observed on the DIS flow, it is smoothed with a Gaussian blur using a kernel of size 3 × 3 and then multiplied by the computed foreground mask <ref type="bibr" target="#b9">[10]</ref> to keep only foreground motion.</p></div>
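<div xmlns="http://www.tei-c.org/ns/1.0"><p>The post-processing of the DIS flow can be illustrated with a minimal pure-Python sketch. It is not the implementation used for the runs: the kernel weights (a 3 × 3 binomial approximation of a Gaussian), the border handling (edge replication) and the helper names blur3x3 and mask_flow are assumptions, and the foreground mask is taken as already computed.</p><p>
```python
# Binomial 3x3 kernel, a common approximation of a Gaussian (assumed weights).
KERNEL = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]
KERNEL_SUM = 16.0


def blur3x3(channel):
    """Smooth one flow component (list of rows) with replicated borders."""
    h, w = len(channel), len(channel[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)  # replicate border rows
                    xx = min(max(x + dx, 0), w - 1)  # replicate border columns
                    acc += KERNEL[dy + 1][dx + 1] * channel[yy][xx]
            out[y][x] = acc / KERNEL_SUM
    return out


def mask_flow(channel, foreground):
    """Blur a flow component, then keep only foreground motion (mask in {0, 1})."""
    blurred = blur3x3(channel)
    return [[b * m for b, m in zip(brow, mrow)]
            for brow, mrow in zip(blurred, foreground)]


# Toy example: a single moving pixel and a two-pixel foreground mask.
flow_x = [[0.0, 0.0, 0.0], [0.0, 16.0, 0.0], [0.0, 0.0, 0.0]]
mask = [[0, 0, 0], [0, 1, 1], [0, 0, 0]]
masked = mask_flow(flow_x, mask)
```
</p><p>The same function would be applied to each of the two flow components. Motion that the blur spreads outside the foreground is set to zero by the mask.</p></div>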
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Spatial segmentation</head><p>RGB and optical flow are spatially segmented using a region of interest (ROI) of center C_roi = (x_roi, y_roi), estimated from the maximum of the optical flow norm and the center of gravity of all pixels with non-zero flow <ref type="bibr" target="#b5">[6]</ref> as follows:</p><formula xml:id="formula_0">\begin{gathered} C_{max} = (x_{max}, y_{max}) = \operatorname*{arg\,max}_{x, y} \left( \|D\|_1 \right) \\ C_g = (x_g, y_g) = \frac{1}{\sum_{C \in \Omega} \delta(C)} \sum_{C \in \Omega} C \, \delta(C) \quad \text{with} \quad \delta(C) = 1 \text{ if } \|D\|_1(C) \neq 0, \; 0 \text{ otherwise} \\ x_{roi} = \alpha f_{\omega_x}(x_{max}, W) + (1 - \alpha) f_{\omega_x}(x_g, W) \\ y_{roi} = \alpha f_{\omega_y}(y_{max}, H) + (1 - \alpha) f_{\omega_y}(y_g, H) \end{gathered}<label>(1)</label></formula><p>with parameters α = 0.6, Ω = (ω_x, ω_y) = (320 × 180) the size of the resized video frames, and (W, H) the size of the data fed to our network. The function</p><formula xml:id="formula_1">f_{\omega}(u, V) = \max\left( \min\left( u, V - \frac{\omega}{2} \right), \frac{\omega}{2} \right)</formula><p>keeps the extracted input data within the boundaries of the frame. To avoid jittering, we apply a Gaussian blur along the time dimension to average the center position, using a kernel of size 40 and scale parameter σ_blur = 4.44.</p></div>
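<div xmlns="http://www.tei-c.org/ns/1.0"><p>The ROI estimation can be sketched in a few lines of pure Python. This is a sketch under stated assumptions, not the code used for the runs: the clamp is read here as keeping a window of the network input size inside the resized frame (an interpretation of the roles of the two size arguments), the fallback to the frame center when no pixel moves and the edge replication in the temporal smoothing are assumptions, and the names clamp_center, roi_center and smooth_over_time are hypothetical.</p><p>
```python
import math

ALPHA = 0.6  # weight of the flow maximum, as given in the text


def clamp_center(u, frame, crop):
    """Keep a window of size `crop` centred on u inside [0, frame]."""
    return max(min(u, frame - crop / 2), crop / 2)


def roi_center(norm, crop_w, crop_h):
    """Blend the argmax of the flow norm with the centre of gravity of moving pixels."""
    h, w = len(norm), len(norm[0])
    pixels = [(x, y) for y in range(h) for x in range(w)]
    x_max, y_max = max(pixels, key=lambda p: norm[p[1]][p[0]])
    moving = [p for p in pixels if norm[p[1]][p[0]] != 0]
    if not moving:  # no motion at all: fall back to the frame centre (assumption)
        return w / 2, h / 2
    x_g = sum(p[0] for p in moving) / len(moving)
    y_g = sum(p[1] for p in moving) / len(moving)
    x_roi = ALPHA * clamp_center(x_max, w, crop_w) + (1 - ALPHA) * clamp_center(x_g, w, crop_w)
    y_roi = ALPHA * clamp_center(y_max, h, crop_h) + (1 - ALPHA) * clamp_center(y_g, h, crop_h)
    return x_roi, y_roi


def smooth_over_time(values, size=40, sigma=4.44):
    """Gaussian-weighted average of a 1D sequence, replicating the end values.

    With an even size the window is slightly asymmetric (offsets -20 to 19).
    """
    half = size // 2
    offsets = range(-half, size - half)
    weights = [math.exp(-(k * k) / (2 * sigma * sigma)) for k in offsets]
    total = sum(weights)
    last = len(values) - 1
    return [sum(wt * values[min(max(t + k, 0), last)] for wt, k in zip(weights, offsets)) / total
            for t in range(len(values))]


# Toy 8 x 6 frame with two moving pixels and a 4 x 4 crop.
norm = [[0.0] * 8 for _ in range(6)]
norm[1][6] = 3.0  # strongest motion
norm[4][2] = 1.0  # weaker motion elsewhere
center = roi_center(norm, crop_w=4, crop_h=4)
```
</p><p>In a full pipeline, smooth_over_time would be applied separately to the sequence of x coordinates and to the sequence of y coordinates of the per-frame centers.</p></div>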
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Data normalization</head><p>The RGB image channels are normalized by their theoretical maximum value, 255 in our case, to map them into the interval [0, 1]. As done in <ref type="bibr" target="#b6">[7]</ref>, which compares different normalization methods, we normalize the optical flow V = (v_x, v_y) using the mean µ and standard deviation σ of the distribution of the maximum absolute values of each optical flow component over the whole dataset. In the following equation, v and v^N represent respectively one component of the optical flow V and its normalization.</p><formula xml:id="formula_2">v' = \frac{v}{\mu + 3 \sigma}, \qquad v^N(i, j) = v'(i, j) \text{ if } |v'(i, j)| &lt; 1, \; \operatorname{SIGN}(v'(i, j)) \text{ otherwise}<label>(2)</label></formula><p>This normalization method maps the values into the interval [-1, 1] and increases the magnitude of most vectors, making the optical flow easier to process for the classification of very similar actions such as table tennis strokes.</p></div>
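<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a worked example of the flow normalization, the following sketch scales one flow component by µ + 3σ and clips it to [-1, 1]. The statistics MU and SIGMA below are made-up values chosen for readability, not the dataset statistics, and the function name normalize_component is hypothetical.</p><p>
```python
def normalize_component(v, mu, sigma):
    """Scale one flow component by mu + 3 * sigma and clip it to [-1, 1]."""
    scaled = v / (mu + 3 * sigma)
    if abs(scaled) >= 1:
        return 1.0 if scaled > 0 else -1.0  # SIGN of the scaled value
    return scaled


# Hypothetical dataset statistics, for illustration only.
MU, SIGMA = 2.0, 1.0
values = [0.5, -2.5, 10.0, -40.0]
normalized = [normalize_component(v, MU, SIGMA) for v in values]
```
</p><p>With these values the divisor is 5, so the first two components are only rescaled while the last two saturate at 1 and -1.</p></div>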
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4">SSTCNN</head><p>Our Siamese Spatio-Temporal Convolutional Neural Network (SSTCNN), see Fig. <ref type="figure" target="#fig_0">1</ref>, consists of two branches, each with three 3D convolutional layers with 30, 60 and 80 filter response maps, followed by a fully connected layer of size 500. The branches take respectively cuboids of RGB values and cuboids of the optical flow computed from them, both of size (W × H × T) = (120 × 120 × 100). The 3D convolutional layers use 3 × 3 × 3 space-time filters with a dense stride and a padding of 1 in each direction. The two branches are fused through a final fully connected layer of size 21 followed by a Softmax function to output a probabilistic classification.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.5">Data augmentation</head><p>We also spatially augment the data by applying a random rotation in the range ±10°, a random translation in the range ±0.1 in the x and y directions, a random homothety in the range 1 ± 0.1, a horizontal flip with probability 0.5, and random channel swaps on the RGB data. We take extra care to apply those changes to the optical flow by updating its values according to the transformations. Transformations are applied and centered on the region of interest, avoiding crops outside of the camera range.</p></div>
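<div xmlns="http://www.tei-c.org/ns/1.0"><p>Updating the optical flow values under a spatial transformation can be illustrated as follows. The flow field is resampled like the RGB frame, and each displacement vector is additionally passed through the linear part of the transformation. The exact update used for the runs is not detailed here, so the order of operations (flip, then rotation, then scaling), the rotation sign convention and the function name transform_flow_vector are assumptions.</p><p>
```python
import math


def transform_flow_vector(vx, vy, angle_deg=0.0, scale=1.0, flip=False):
    """Apply the linear part of an augmentation to one displacement vector.

    Translation is ignored on purpose: it moves pixels but does not change
    displacement vectors. Order assumed here: flip, then rotation, then scaling.
    """
    if flip:
        vx = -vx  # a horizontal flip reverses the x component
    theta = math.radians(angle_deg)
    rx = math.cos(theta) * vx - math.sin(theta) * vy
    ry = math.sin(theta) * vx + math.cos(theta) * vy
    return scale * rx, scale * ry


# Parameters drawn in the ranges given in the text would be:
# angle in [-10, 10] degrees, scale in [0.9, 1.1], flip with probability 0.5.
flipped = transform_flow_vector(1.0, 0.0, flip=True)
rotated = transform_flow_vector(1.0, 0.0, angle_deg=10.0, scale=1.1)
```
</p><p>Channel swaps on the RGB data have no counterpart on the flow, since they do not move any pixel.</p></div>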
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.6">Training and submitted runs</head><p>All models were trained from scratch. We first trained for 250 epochs with the data samples split randomly across all strokes, and then with a split that keeps only two videos for validation. However, the results obtained when splitting the dataset between videos were not satisfying. A detailed look at the dataset shows that most of the videos contain only one kind of stroke performed by the same player, so the model easily over-fits to the player's appearance rather than to the characteristics of the stroke itself. With such a limited dataset and a limited time window, we preferred to focus on the random distribution of the strokes among our training and validation sets. The first two runs are the classifications obtained with the models trained on the split dataset and saved at the minimum loss on the validation set, one for each of the two flow estimators presented in Section 2.1. The other two runs use the same models, retrained from scratch on all data samples for the number of epochs that gave the best performance on the first validation set.</p></div>
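<div xmlns="http://www.tei-c.org/ns/1.0"><p>The difference between the two splitting strategies can be made concrete with a small sketch. The sample list, the stroke labels, the validation ratio and the function names split_random and split_by_video are all hypothetical and only serve the illustration.</p><p>
```python
import random


def split_random(samples, val_ratio, seed=0):
    """Shuffle all stroke samples, then cut: a video may end up in both sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_ratio)
    return shuffled[n_val:], shuffled[:n_val]


def split_by_video(samples, val_videos):
    """Hold out whole videos: no video is shared between the two sets."""
    train = [s for s in samples if s[0] not in val_videos]
    val = [s for s in samples if s[0] in val_videos]
    return train, val


# Hypothetical samples: (video id, stroke label), one kind of stroke per video.
samples = [("v1", "serve"), ("v1", "serve"), ("v2", "push"),
           ("v2", "push"), ("v3", "loop"), ("v3", "loop")]
train_r, val_r = split_random(samples, val_ratio=0.5)
train_v, val_v = split_by_video(samples, val_videos={"v3"})
```
</p><p>When each video contains a single kind of stroke, holding out a video removes that stroke from the training set entirely, which is the situation described above for the split between videos.</p></div>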
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">RESULTS</head><p>The left side of Table <ref type="table" target="#tab_1">1</ref> shows the results of the first two runs, from the models trained on the split dataset with 250 epochs; the right side shows the two other runs, from the models trained with all the data. Compared to what has been obtained in previous work <ref type="bibr" target="#b5">[6]</ref>, the results are very low. The main differences are i) the lack of a negative class and ii) the split of the dataset into train and test sets between videos. This directly leads to over-fitting and makes the model much less able to classify properly. The best results were obtained with the DeepFlow estimator.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Figure 2: Confusion Matrix of our best run</head><p>Furthermore, if we consider the confusion matrix of our best run (Fig. <ref type="figure">2</ref>) and group strokes into larger classes, namely 'Forehand' and 'Backhand'; 'Service', 'Offensive' and 'Defensive'; or their intersection (6 classes), we get accuracies of 76.8%, 65.8% and 54.8%, respectively.</p></div>
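<div xmlns="http://www.tei-c.org/ns/1.0"><p>The grouped accuracies can be derived from a confusion matrix by counting a prediction as correct whenever the true and predicted classes fall in the same group. The sketch below shows the computation on a made-up 4-class matrix; the matrix values and the function name grouped_accuracy are hypothetical and unrelated to the results reported here.</p><p>
```python
def grouped_accuracy(confusion, groups):
    """Accuracy after merging classes that share the same group index."""
    correct = total = 0
    for true_cls, row in enumerate(confusion):
        for pred_cls, count in enumerate(row):
            total += count
            if groups[true_cls] == groups[pred_cls]:
                correct += count
    return correct / total


# Hypothetical 4-class confusion matrix (rows: ground truth, columns: prediction).
confusion = [[5, 3, 1, 1],
             [2, 6, 1, 1],
             [1, 1, 4, 4],
             [0, 2, 3, 5]]
fine = grouped_accuracy(confusion, [0, 1, 2, 3])    # every class on its own
coarse = grouped_accuracy(confusion, [0, 0, 1, 1])  # classes merged two by two
```
</p><p>Merging classes can only keep or increase the accuracy, since confusions inside a group are no longer counted as errors.</p></div>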
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">CONCLUSION</head><p>Despite a strong over-fitting, by grouping strokes together in larger classes, we can notice that some characteristics to recognize strokes are still learned. Furthermore, the work on TTStroke-21 <ref type="bibr" target="#b4">[5]</ref> is still in progress and the enrichment of the dataset will be a big contribution in the domain of action detection and classification especially for very similar actions.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: SSTCNN architecture</figDesc><graphic coords="2,65.81,220.68,216.21,78.42" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head></head><label></label><figDesc>Table Tennis strokes.</figDesc><table><row><cell>MediaEval'19, 27-29 October 2019, Sophia Antipolis, France</cell><cell>P-e Martin et al.</cell></row></table></figure>
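<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1 :</head><label>1</label><figDesc>Runs results</figDesc><table><row><cell>Flow</cell><cell>Epochs</cell><cell>Train (split dataset)</cell><cell>Val (split dataset)</cell><cell>Test (split dataset)</cell><cell>Train (all data)</cell><cell>Test (all data)</cell></row><row><cell>DIS</cell><cell>249</cell><cell>70.4</cell><cell>52.6</cell><cell>19.2</cell><cell>61.2</cell><cell>17.8</cell></row><row><cell>DeepFlow</cell><cell>229</cell><cell>74.7</cell><cell>56.1</cell><cell>17.2</cell><cell>70.2</cell><cell>22.9</cell></row></table></figure>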
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>ACKNOWLEDGMENTS</head><p>This work was supported by Region of Nouvelle Aquitaine grant CRISP and Bordeaux Idex Initiative.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">The Kinetics Human Action Video Dataset</title>
		<author>
			<persName><forename type="first">Will</forename><surname>Kay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Joao</forename><surname>Carreira</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Karen</forename><surname>Simonyan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Brian</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Chloe</forename><surname>Hillier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sudheendra</forename><surname>Vijayanarasimhan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Fabio</forename><surname>Viola</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tim</forename><surname>Green</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Trevor</forename><surname>Back</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Paul</forename><surname>Natsev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mustafa</forename><surname>Suleyman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Andrew</forename><surname>Zisserman</surname></persName>
		</author>
		<idno>CoRR abs/1705.06950</idno>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Fast Optical Flow Using Dense Inverse Search</title>
		<author>
			<persName><forename type="first">Till</forename><surname>Kroeger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Radu</forename><surname>Timofte</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dengxin</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Luc</forename><surname>Van Gool</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ECCV (LNCS)</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2016">2016</date>
			<biblScope unit="volume">9908</biblScope>
			<biblScope unit="page" from="471" to="488" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">HMDB: A large video database for human motion recognition</title>
		<author>
			<persName><forename type="first">Hildegard</forename><surname>Kuehne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hueihan</forename><surname>Jhuang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Estíbaliz</forename><surname>Garrote</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tomaso</forename><forename type="middle">A</forename><surname>Poggio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Thomas</forename><surname>Serre</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICCV. IEEE Computer Society</title>
				<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="2556" to="2563" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Sports Video Annotation: Detection of Strokes in Table Tennis task for MediaEval</title>
		<author>
			<persName><forename type="first">Pierre-Etienne</forename><surname>Martin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jenny</forename><surname>Benois-Pineau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Boris</forename><surname>Mansencal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Renaud</forename><surname>Péteri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Laurent</forename><surname>Mascarilla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jordan</forename><surname>Calandre</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Julien</forename><surname>Morlier</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the MediaEval 2019 Workshop</title>
				<meeting>of the MediaEval 2019 Workshop<address><addrLine>Sophia Antipolis, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019-10-29">27-29 October 2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Fine-Grained Action Detection and Classification in Table Tennis with Siamese Spatio-Temporal Convolutional Neural Network</title>
		<author>
			<persName><forename type="first">Pierre-Etienne</forename><surname>Martin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jenny</forename><surname>Benois-Pineau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Renaud</forename><surname>Péteri</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICIP 2019</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="3027" to="3028" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Sport Action Recognition with Siamese Spatio-Temporal CNNs: Application to Table Tennis</title>
		<author>
			<persName><forename type="first">Pierre-Etienne</forename><surname>Martin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jenny</forename><surname>Benois-Pineau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Renaud</forename><surname>Péteri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Julien</forename><surname>Morlier</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CBMI 2018. IEEE</title>
				<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="1" to="6" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Optimal choice of motion estimation methods for finegrained action classification with 3D convolutional networks</title>
		<author>
			<persName><forename type="first">Pierre-Etienne</forename><surname>Martin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jenny</forename><surname>Benois-Pineau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Renaud</forename><surname>Péteri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Julien</forename><surname>Morlier</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICIP 2019. IEEE</title>
				<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="554" to="558" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild</title>
		<author>
			<persName><forename type="first">Khurram</forename><surname>Soomro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Amir</forename><forename type="middle">Roshan</forename><surname>Zamir</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mubarak</forename><surname>Shah</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1212.0402</idno>
	</analytic>
	<monogr>
		<title level="j">CoRR</title>
		<imprint>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">DeepFlow: Large Displacement Optical Flow with Deep Matching</title>
		<author>
			<persName><forename type="first">Philippe</forename><surname>Weinzaepfel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jérôme</forename><surname>Revaud</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Zaïd</forename><surname>Harchaoui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Cordelia</forename><surname>Schmid</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE ICCV</title>
				<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="1385" to="1392" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Efficient adaptive density estimation per image pixel for the task of background subtraction</title>
		<author>
			<persName><forename type="first">Zoran</forename><surname>Zivkovic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ferdinand</forename><surname>Van Der Heijden</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Pattern Recognition Letters</title>
		<imprint>
			<biblScope unit="volume">27</biblScope>
			<biblScope unit="issue">7</biblScope>
			<biblScope unit="page" from="773" to="780" />
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
