=Paper=
{{Paper
|id=Vol-3181/paper36
|storemode=property
|title=HCMUS at MediaEval 2021: Ensembles of Action Recognition Networks with Prior
Knowledge for Table Tennis Strokes Classification Task
|pdfUrl=https://ceur-ws.org/Vol-3181/paper36.pdf
|volume=Vol-3181
|authors=Trong-Tung Nguyen,Thanh-Son Nguyen,Gia-Bao Dinh Ho,Hai-Dang Nguyen,Minh-Triet
Tran
|dblpUrl=https://dblp.org/rec/conf/mediaeval/NguyenNHNT21
}}
==HCMUS at MediaEval 2021: Ensembles of Action Recognition Networks with Prior
Knowledge for Table Tennis Strokes Classification Task==
HCMUS at MediaEval 2021: Ensembles of Action Recognition Networks with Prior Knowledge for Table Tennis Strokes Classification Task
Trong-Tung Nguyen 1,3, Thanh-Son Nguyen 1,3, Gia-Bao Dinh Ho 1,3, Hai-Dang Nguyen 1,3, Minh-Triet Tran 1,2,3
1 University of Science, VNU-HCM, 2 John von Neumann Institute, VNU-HCM
3 Vietnam National University, Ho Chi Minh City, Vietnam
{ntrtung17,dhgbao}@apcs.fitus.edu.vn,{nthanhson,nhdang}@selab.hcmus.edu.vn,tmtriet@fit.hcmus.edu.vn
ABSTRACT
The SportsVideo task of the MediaEval 2021 benchmark is made up of two subtasks: stroke detection and stroke classification. For the detection task, participants are required to find the specific frame intervals containing the strokes of interest; this can then be used as a preliminary step for classifying the stroke that was performed. This year, our HCMUS team entered the challenge with the main contribution of improving classification, aiming to strengthen our previous method from 2020. Of our five runs, three followed distinct approaches, and the two remaining runs were ensembles of them. Our best run ranked second in the Sports Video Task with an accuracy of 68.78%.

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
MediaEval'21, December 13-15, 2021, Online

1 INTRODUCTION
In the Multimedia Evaluation Challenge 2021, there are two main sub-tasks: detection and classification. The latter takes video stroke boundaries as inputs and classifies the stroke categories. Regarding the dataset, strokes are categorized into the same 20 classes as last year, with the addition of new and more diverse samples [6].
We conducted three experiments with different model architectures and submitted five runs in total. The first, second, and fifth runs were independent methods; the third run is an ensemble of the first and fifth runs, while the fourth run is an ensemble of the first and second runs. For the first run, we employed a simple method for video classification: spatially stacking the images of a video sequence to form a super image, an idea shown to be effective in [8]. The second run took a more systematic approach: we decomposed the problem into three classification branches with the help of multi-task learning, aiming to inject relevant features and human biases into each branch independently. For the fifth run, we continued to employ our previous approach [7] with some modifications: the post-processing stage was generalized using conditional probabilities and prior knowledge to eliminate implausible outcomes of the classification models.

2 METHOD

2.1 Run 01
In this run, we stacked the images of each sub-clip spatially to create a super image of size $S \times S$ as a representation of the full clip, treating video classification as an image classification problem. A classification head was then used to predict the stroke category.
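To make the super-image idea concrete, here is a minimal sketch (our illustration, not the exact competition code): the 16 frames of a sub-clip are tiled into a 4 × 4 grid so that a standard 2D image classifier can consume the whole sub-clip at once. The grid size, frame resolution, and ResNet-50 head are assumptions for the example.

```python
import torch
import torchvision

def make_super_image(frames: torch.Tensor, grid: int = 4) -> torch.Tensor:
    """frames: (T, C, H, W) with T == grid * grid -> super image (C, grid*H, grid*W)."""
    t, c, h, w = frames.shape
    assert t == grid * grid, "expects exactly grid*grid frames"
    rows = [torch.cat(list(frames[r * grid:(r + 1) * grid]), dim=2)  # tile along width
            for r in range(grid)]
    return torch.cat(rows, dim=1)                                    # stack rows along height

# Hypothetical usage: 16 frames of 56x56 give a 224x224 super image,
# classified by a 2D backbone with a 20-way stroke head.
frames = torch.rand(16, 3, 56, 56)
model = torchvision.models.resnet50(num_classes=20)
logits = model(make_super_image(frames).unsqueeze(0))  # (1, 20)
```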
2.2 Run 02
In this run, we decomposed the original classification problem into three sub-classification branches, with the sub-categories assigned to each classifier as described in Table 1. This mechanism was motivated by the desire to disentangle the ambiguity of the raw labels: it is more meaningful to discriminate among serve, offensive, and defensive strokes than among serve, forehand, and backhand types. Moreover, breaking the raw labels into many sub-classes supplies more training samples per category in each classifier, since samples of some strokes in table tennis are still limited. Eventually, each classifier utilized both shared features and exclusive features useful for its corresponding task.

| Classifier type | Categories | # Prediction heads |
| First Component | Serve, Offensive, Defensive | 3 |
| Second Component | Forehand, Backhand | 2 |
| Third Component | Backspin, Loop, Sidespin, Topspin, Hit, Flip, Push, Block | 8 |
Table 1: The sub-category split for the three classifier types

The first and third components utilized the shared features $f_{shared\_13}$, constructed by concatenating temporal visual features with temporal pose features. A 3D-CNN architecture implemented by [5] was employed to extract the temporal visual features $f_{visual\_3DCNN_1}$ from images of shape $H \times W \times C$. The temporal pose features $f_{temporal\_pose}$ were obtained by feeding the 17 human key points of successive frames to an LSTM architecture. We first sampled $F$ frames with a strategy that ensures consistent keypoint extraction across the video sequence; each key point is represented by two coordinate values, giving 34 values per pose. The first and third components were paired to use the same features because of their similarity in visual appearance, and they likely rely on the same sources of information for predicting their sub-categories.
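As a rough sketch of how such a shared feature could be assembled (our own simplification under assumed dimensions, not the authors' exact network), a small 3D-CNN summarizes the RGB frames while an LSTM summarizes the 34 keypoint coordinates per frame, and the two temporal features are concatenated into $f_{shared\_13}$:

```python
import torch
import torch.nn as nn

class SharedFeatureExtractor(nn.Module):
    def __init__(self, visual_dim=256, pose_dim=128):
        super().__init__()
        # stand-in for the Twin-style 3D-CNN of [5]
        self.visual = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(32, visual_dim))
        self.pose = nn.LSTM(input_size=34, hidden_size=pose_dim, batch_first=True)

    def forward(self, clips, keypoints):
        # clips: (B, 3, F, H, W); keypoints: (B, F, 34) flattened (x, y) pairs
        f_visual = self.visual(clips)
        _, (h_n, _) = self.pose(keypoints)
        f_pose = h_n[-1]                                  # last hidden state, (B, pose_dim)
        return torch.cat([f_visual, f_pose], dim=1)       # f_shared_13

feats = SharedFeatureExtractor()(torch.rand(2, 3, 8, 120, 120), torch.rand(2, 8, 34))
```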
However, an additional feature should be incorporated for the second classifier (Forehand, Backhand). We first cropped the original image using the boundaries of the hands' region, which can be extracted by selecting the coordinates of the key points that satisfy plausible positions for human hands. The concatenation of the two hand images, of shape $H_1 \times W_1 \times C$, was then supplied to a separate 3D-CNN branch to produce an additional temporal visual hand feature for this classification branch.
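A hedged sketch of the hand-region cropping follows; the fixed box size, zero padding, and COCO-style keypoint layout with wrists at indices 9 and 10 are our assumptions for illustration, not details from the paper:

```python
import numpy as np

def crop_hand(frame, keypoints, wrist_idx, size=120):
    """frame: (H, W, C) array; keypoints: (17, 2) array of (x, y) pixels."""
    h, w = frame.shape[:2]
    x, y = keypoints[wrist_idx].astype(int)
    x0, y0 = max(0, x - size // 2), max(0, y - size // 2)
    crop = frame[y0:y0 + size, x0:x0 + size]
    # pad with zeros when the wrist lies near the image border
    pad_y, pad_x = size - crop.shape[0], size - crop.shape[1]
    return np.pad(crop, ((0, pad_y), (0, pad_x), (0, 0)))

def hand_pair(frame, keypoints):
    # left and right wrist crops concatenated along the width: 120 x 240
    return np.concatenate([crop_hand(frame, keypoints, 9),
                           crop_hand(frame, keypoints, 10)], axis=1)
```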
Three multi-layer perceptrons $\mathrm{MLP}_i$ (1) were then designed, one for each classification branch, with the number of prediction heads shown in Table 1. The loss function of each branch was then aggregated to form the final multi-task learning loss $\mathcal{L}$.

$\hat{y}_i = \mathrm{softmax}(\mathrm{MLP}_i(f_i))$   (1)
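The multi-task setup of Eq. (1) can be sketched as follows; the shared 384-dimensional branch inputs and the plain sum of cross-entropy losses (cross-entropy internally applies the softmax of Eq. (1)) are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

heads = nn.ModuleList([
    nn.Sequential(nn.Linear(384, 128), nn.ReLU(), nn.Linear(128, n))
    for n in (3, 2, 8)  # first / second / third component heads (Table 1)
])

def multitask_loss(features, labels):
    """features: list of three branch inputs f_i, each (B, 384);
    labels: list of three (B,) target tensors."""
    total = torch.zeros(())
    for head, f_i, y_i in zip(heads, features, labels):
        # cross-entropy applies the softmax of Eq. (1) internally
        total = total + F.cross_entropy(head(f_i), y_i)
    return total

# Hypothetical usage with random data
feats = [torch.rand(4, 384) for _ in range(3)]
targets = [torch.randint(0, n, (4,)) for n in (3, 2, 8)]
loss = multitask_loss(feats, targets)
```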
Finally, we formulated the joint probability $P(s_1, s_2, s_3)$ (2) of the three sub-category predictions using prior knowledge. A thorough analysis of the co-existence of the three sub-categories showed that the second component label is independent of the first and third component labels. On the other hand, the plausible labels of the third component can be narrowed down given the category of the first component. Table 2 summarizes the co-existence relations between the first and third components that we have investigated.

| First Component | Possible Third Component labels |
| Serve | Backspin, Loop, Sidespin, Topspin |
| Offensive | Hit, Loop, Flip |
| Defensive | Push, Block, Backspin |
Table 2: Prior knowledge about possible sets of labels

$P(s_1, s_2, s_3) = P(s_3, s_1 \mid s_2) \cdot P(s_2) = P(s_3, s_1) \cdot P(s_2) = P(s_3, s_1) \cdot \hat{y}_{2, s_2}$   (2)

The second factor is the $s_2$-th value of $\hat{y}_2$ from (1) for the second classifier. Meanwhile, the first factor $P(s_3, s_1)$ (3) is itself factorized into two terms:

$P(s_3, s_1) = P(s_3 \mid s_1) \cdot P(s_1) = \hat{y}_{refined\_3, s_3} \cdot \hat{y}_{1, s_1}$   (3)
Given the prior knowledge table, we first construct a binary reference matrix $M \in \mathbb{R}^{3 \times 8}$ that encodes the co-existence of labels between the first and third components. We then take the Hadamard product of the two vectors $M_{\sigma(s_1)} \in \mathbb{R}^{1 \times 8}$, where $\sigma(s_1) \in \{0, 1, 2\}$ is the index of $s_1$, and $\hat{y}_3 \in \mathbb{R}^{1 \times 8}$ to produce the refined probability $\hat{y}_{refined\_3} \in \mathbb{R}^{1 \times 8}$ (4). Finally, this vector is normalized before being multiplied with the $s_1$-th value of $\hat{y}_1$ from (1):

$\hat{y}_{refined\_3} = M_{\sigma(s_1)} \odot \hat{y}_3$   (4)

$P(s_3, s_1) = P(s_3 \mid s_1) \cdot P(s_1) = \dfrac{\hat{y}_{refined\_3, s_3}}{\sum_{i=1}^{8} \hat{y}_{refined\_3, i}} \cdot \hat{y}_{1, s_1}$   (5)
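The decoding of Eqs. (2)-(5) lends itself to a direct implementation. Below is a sketch in which the binary matrix $M$ is read off Table 2, with third-component columns ordered (Backspin, Loop, Sidespin, Topspin, Hit, Flip, Push, Block); the exhaustive search over label triples is our illustrative decoding choice, not necessarily the authors' exact procedure:

```python
import numpy as np

# Rows: Serve, Offensive, Defensive; columns in the order listed above.
M = np.array([
    [1, 1, 1, 1, 0, 0, 0, 0],   # Serve     -> Backspin, Loop, Sidespin, Topspin
    [0, 1, 0, 0, 1, 1, 0, 0],   # Offensive -> Hit, Loop, Flip
    [1, 0, 0, 0, 0, 0, 1, 1],   # Defensive -> Push, Block, Backspin
], dtype=float)

def joint_probability(y1, y2, y3, s1, s2, s3):
    """y_i: softmax outputs of the three classifiers; s_i: candidate labels."""
    y3_refined = M[s1] * y3                            # Eq. (4): mask by the row of M
    # y3 comes from a softmax, so the masked sum is positive for every row of M
    p_s3_given_s1 = y3_refined[s3] / y3_refined.sum()  # normalization of Eq. (5)
    return p_s3_given_s1 * y1[s1] * y2[s2]             # Eqs. (2) and (5) combined

# Hypothetical decoding: pick the (s1, s2, s3) triple maximizing the joint.
y1, y2, y3 = np.full(3, 1 / 3), np.full(2, 1 / 2), np.full(8, 1 / 8)
best = max(((s1, s2, s3) for s1 in range(3) for s2 in range(2) for s3 in range(8)),
           key=lambda s: joint_probability(y1, y2, y3, *s))
```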
2.3 Run 05
We made a small modification to the second run by replacing our designed classifiers with a more powerful model architecture for action recognition, which we had utilized last year [1, 7]. As in the second run, the three classifiers produced their outputs independently, and these were combined into the final result using the conditional probability and prior knowledge mechanism demonstrated above.

3 EXPERIMENTS AND RESULTS
In the first run, the final score is the average score of the two sub-clips of a video. All images were resized to 224 × 224. We passed the super image to a ResNet-50 [4] backbone followed by a global average pooling layer to obtain a 2048-dimensional vector. For each video, we sampled two sub-clips separated by five frames, with 16 images per sub-clip. Random flipping, color jittering, and RandAugment [3] were also used with the default settings of MMAction2 [2]. We trained the model in this run with the focal loss [9] to handle the data imbalance problem.
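For reference, a minimal sketch of the focal loss [9] adapted to multi-class classification; the gamma and alpha values are the commonly used defaults, not necessarily those used in our training:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma: float = 2.0, alpha: float = 0.25):
    """Down-weights well-classified examples so rare classes contribute more."""
    ce = F.cross_entropy(logits, targets, reduction="none")  # per-sample -log p_t
    p_t = torch.exp(-ce)                                     # probability of true class
    return (alpha * (1.0 - p_t) ** gamma * ce).mean()
```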
In the second run, we passed video sequences of 30 sampled frames, resized to 120 × 120, to the shared network (the first and third classification branches). Meanwhile, the two hand images were cropped and concatenated to a shape of 120 × 240 before being fed to the second classifier. In the fifth run, we used parameters similar to those of our previous method [7] for each classifier. For the ensemble versions, the predictions with the highest confidence scores were returned as the final results.
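As we read it, this ensembling can be sketched as follows (the max-confidence interpretation and the function below are ours, not official task code): each member run produces a per-class confidence vector, and the prediction of the most confident member is kept.

```python
import numpy as np

def ensemble_max_confidence(run_scores):
    """run_scores: list of (num_classes,) confidence arrays, one per member run."""
    best_run = max(run_scores, key=lambda s: s.max())  # most confident member
    return int(best_run.argmax())

pred = ensemble_max_confidence([np.array([0.1, 0.7, 0.2]),
                                np.array([0.05, 0.05, 0.9])])  # -> class 2
```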
| Run ID | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 |
| Accuracy | 61.99% | 44.80% | 68.78% | 60.63% | 67.87% |
Table 3: HCMUS team submission results for the Table Tennis Stroke Classification Task

4 CONCLUSION AND FUTURE WORKS
We benchmarked several different approaches on the table tennis video classification task at the MediaEval 2021 benchmark, and one of our submissions achieved the second-best result in terms of global accuracy, at 68.78%. Future work should focus on feature selection and on the semantics of the raw labels for modeling actions in the table tennis domain, with the help of human pose and prior knowledge information.

ACKNOWLEDGMENTS
This work was funded by Gia Lam Urban Development and Investment Company Limited, Vingroup and supported by Vingroup Innovation Foundation (VINIF) under project code VINIF.2019.DA19.
REFERENCES
[1] MMPose Contributors. 2020. OpenMMLab Pose Estimation Toolbox and Benchmark. https://github.com/open-mmlab/mmpose.
[2] MMAction2 Contributors. 2020. OpenMMLab's Next Generation Video Understanding Toolbox and Benchmark. https://github.com/open-mmlab/mmaction2.
[3] Ekin D. Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V. Le. 2020. RandAugment: Practical automated data augmentation with a reduced search space. In Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. 702–703.
[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 770–778. https://doi.org/10.1109/CVPR.2016.90
[5] Pierre-Etienne Martin, Jenny Benois-Pineau, Renaud Péteri, and Julien Morlier. 2020. Fine grained sport action recognition with Twin spatio-temporal convolutional neural networks: Application to table tennis. Multimedia Tools and Applications 79 (07 2020). https://doi.org/10.1007/s11042-020-08917-3
[6] Pierre-Etienne Martin, Jordan Calandre, Boris Mansencal, Jenny Benois-Pineau, Renaud Péteri, Laurent Mascarilla, and Julien Morlier. 2021. Sports Video: Fine-Grained Action Detection and Classification of Table Tennis Strokes from videos for MediaEval 2021. (2021).
[7] Hai Nguyen-Truong, San Cao, N. A. Khoa Nguyen, Bang-Dang Pham, Hieu Dao, Minh-Quan Le, Hoang-Phuc Nguyen-Dinh, Hai-Dang Nguyen, and Minh-Triet Tran. 2020. HCMUS at MediaEval 2020: Ensembles of Temporal Deep Neural Networks for Table Tennis Strokes Classification Task. In Working Notes Proceedings of the MediaEval 2020 Workshop, Online, 14-15 December 2020 (CEUR Workshop Proceedings), Steven Hicks, Debesh Jha, Konstantin Pogorelov, Alba García Seco de Herrera, Dmitry Bogdanov, Pierre-Etienne Martin, Stelios Andreadis, Minh-Son Dao, Zhuoran Liu, José Vargas Quiros, Benjamin Kille, and Martha A. Larson (Eds.), Vol. 2882. CEUR-WS.org. http://ceur-ws.org/Vol-2882/paper50.pdf
[8] Quanfu Fan, Chun-Fu (Richard) Chen, and Rameswar Panda. 2021. An Image Classifier Can Suffice For Video Understanding. (06 2021).
[9] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2017. Focal Loss for Dense Object Detection. In ICCV.