HCMUS at MediaEval 2021: Ensembles of Action Recognition Networks with Prior Knowledge for Table Tennis Strokes Classification Task

Trong-Tung Nguyen1,3, Thanh-Son Nguyen1,3, Gia-Bao Dinh Ho1,3, Hai-Dang Nguyen1,3, Minh-Triet Tran1,2,3
1 University of Science, VNU-HCM, 2 John von Neumann Institute, VNU-HCM, 3 Vietnam National University, Ho Chi Minh City, Vietnam
{ntrtung17,dhgbao}@apcs.fitus.edu.vn, {nthanhson,nhdang}@selab.hcmus.edu.vn, tmtriet@fit.hcmus.edu.vn

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'21, December 13-15, 2021, Online

ABSTRACT
The SportsVideo task of the MediaEval 2021 benchmark is made up of two subtasks: stroke detection and stroke classification. For the detection task, participants are required to find the frame intervals containing the strokes of interest; this can then serve as a preliminary step for classifying the stroke that has been performed. This year, our HCMUS team engaged in the challenge with the main contribution of improving classification, aiming to strengthen our previous method from 2020. Across our five runs, we proposed three different approaches, followed by an ensemble stage for the two remaining runs. Our best run ranked second in the Sports Video Task with 68.78% accuracy.

1 INTRODUCTION
In the Multimedia Evaluation Challenge 2021, there are two main sub-tasks: detection and classification. The latter takes video boundaries as inputs for classifying stroke categories. In the dataset, strokes are categorized into the same 20 classes as last year, with the addition of new and more diverse samples [6].

We conducted three experiments with different model architectures and submitted five runs in total. The first, second, and fifth runs were independent methods. The third run is the ensemble of the first and fifth runs, while the fourth run is the ensemble of the first and second runs. For the first run, we employed a simple method that handles video classification by spatially stacking the images of a video sequence into a super image, an idea shown to be effective in [8]. The second run took a more systematic approach: we decomposed the problem into three classification branches with the help of multi-task learning, which aims to inject relevant features and human biases into each branch independently. For the fifth run, we continued to employ our previous approach [7] with some modifications: the post-processing stage was generalized using conditional probabilities and prior knowledge to eliminate sensitive outcomes of the classification models.

2 METHOD
2.1 Run 01
In this run, we spatially stacked the images of a sub-clip to create a super image of size $N \times N$ as a representation of the full clip, treating the video classification task as an image classification problem. A classification head was then used to predict the stroke category.
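To make the Run 01 pipeline concrete, the sketch below tiles a 16-frame sub-clip into a 4 Γ— 4 super image and feeds it to a ResNet-50 with a 20-way head. This is a minimal illustration: the 4 Γ— 4 grid layout, the per-frame 224 Γ— 224 size, and the use of torchvision's ResNet-50 are our assumptions, not the exact training configuration.

```python
# Minimal sketch of Run 01's super-image idea (assumptions noted above).
import numpy as np
import torch
import torchvision.models as models

def make_super_image(frames: np.ndarray, grid: int = 4) -> np.ndarray:
    """Tile grid*grid frames (each H x W x C) into one (grid*H) x (grid*W) x C image."""
    assert len(frames) == grid * grid
    rows = [np.concatenate(frames[r * grid:(r + 1) * grid], axis=1) for r in range(grid)]
    return np.concatenate(rows, axis=0)

# A 16-frame sub-clip of 224 x 224 RGB frames -> one 896 x 896 super image.
clip = np.random.rand(16, 224, 224, 3).astype(np.float32)
super_img = make_super_image(clip)  # (896, 896, 3)

# ResNet-50 already ends with global average pooling; we only swap the final
# fully connected layer for the 20 stroke classes.
model = models.resnet50(weights=None)
model.fc = torch.nn.Linear(model.fc.in_features, 20)
x = torch.from_numpy(super_img).permute(2, 0, 1).unsqueeze(0)  # NCHW
logits = model(x)  # shape (1, 20)
```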
2.2 Run 02
In this run, we decomposed the original classification problem into three sub-classification branches, with the sub-categories for each classifier described in Table 1. This design was motivated by our wish to disentangle the ambiguity of the raw labels: it is more natural to discriminate among serve, offensive, and defensive strokes than among serve, forehand, and backhand types. Moreover, breaking the raw labels into sub-classes supplies more training samples per category in each classifier, as the collection of samples for some table tennis strokes is still limited. Eventually, each classifier utilized both shared features and features exclusive to its task.

Table 1: The three sub-category splits for the three classifier types

Classifier type | Categories | # Prediction heads
First Component | Serve, Offensive, Defensive | 3
Second Component | Forehand, Backhand | 2
Third Component | Backspin, Loop, Sidespin, Topspin, Hit, Flip, Push, Block | 8

The first and third components utilized the shared features $f_{shared\_13}$, constructed by concatenating the temporal visual features with the temporal pose features. A 3D-CNN architecture implemented by [5] was employed to extract the temporal visual features $f_{visual\_3DCNN_1}$, given input frames of shape $H \times W \times C$. The temporal pose features $f_{temporal\_pose}$ were obtained by feeding 17 human key points from successive frames into an LSTM architecture. We first sampled $F$ frames with a strategy that ensures the consistency of the key points extracted across the video sequence. Each key point is represented by two coordinate values, yielding 34 values for a specific pose. The first and third components were paired to use similar features because of their similarity in visual appearance; they plausibly draw on the same sources of information when predicting their sub-categories.

However, an additional exclusive feature should be incorporated for the second classifier (Forehand, Backhand). We first cropped the original image around the hand regions, obtained by selecting the key points whose coordinates correspond to plausible positions of the human hands. The concatenation of the two hand images, of shape $H_1 \times W_1 \times C$, was then supplied to a separate 3D-CNN branch to produce a temporal visual hand feature for this classification branch.

Finally, three multi-layer perceptrons $MLP_i$ were designed, one per classification branch, with the numbers of prediction heads shown in Table 1 (Eq. 1). The losses of the branches were aggregated into the final multi-task learning loss $\mathcal{L}$.

$$\hat{p}_i = \mathrm{Softmax}(MLP_i(f_i)) \quad (1)$$
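A minimal sketch of the three-branch prediction heads of Eq. (1) follows. The hidden sizes, the input feature dimensions, and the plain sum used to aggregate the per-branch losses are illustrative assumptions rather than the paper's exact choices.

```python
# Minimal sketch of the three MLP heads (Eq. 1) and the aggregated loss L.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskHeads(nn.Module):
    def __init__(self, feat_dims=(512, 512, 512), num_heads=(3, 2, 8)):
        super().__init__()
        # Per Table 1: 3 heads (Serve/Offensive/Defensive), 2 heads
        # (Forehand/Backhand), 8 heads (fine-grained stroke categories).
        self.mlps = nn.ModuleList(
            nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, h))
            for d, h in zip(feat_dims, num_heads)
        )

    def forward(self, feats):
        # p_hat_i = Softmax(MLP_i(f_i)) for each branch i.
        return [torch.softmax(mlp(f), dim=-1) for mlp, f in zip(self.mlps, feats)]

def multitask_loss(probs, targets):
    # Aggregate (here: sum) the per-branch negative log-likelihoods.
    return sum(F.nll_loss(p.clamp_min(1e-8).log(), t)
               for p, t in zip(probs, targets))
```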
2.3 Run 05
We made a small modification to the second run by replacing our designed classifiers with a more powerful action recognition architecture, which we had utilized last year [1, 7]. As before, the three classifiers produced their outputs independently; the outputs were then combined into the final result using the conditional-probability and prior-knowledge mechanism described next.

We formulated the joint probability $P(c_1, c_2, c_3)$ of the three independent sub-category predictions using prior knowledge (Eq. 2). A thorough analysis of the co-existence of the three sub-categories led us to conclude that the second component label is independent of the first and third component labels. On the other hand, it is possible to narrow down the plausible labels of the third component given prior knowledge about the category of the first component. Table 2 summarizes the co-existence relation between the first and third components that we have investigated.

Table 2: Prior knowledge about possible sets of labels

First Component | Possible labels for Third Component
Serve | Backspin, Loop, Sidespin, Topspin
Offensive | Hit, Loop, Flip
Defensive | Push, Block, Backspin

$$P(c_1, c_2, c_3) = P(c_3, c_1 \mid c_2) \cdot P(c_2) = P(c_3, c_1) \cdot P(c_2) = P(c_3, c_1) \cdot \hat{p}_{2,c_2} \quad (2)$$

The last factor is the $c_2$-th value of $\hat{p}_2$ from Eq. (1) for the second classifier. Meanwhile, the first factor $P(c_3, c_1)$ is further factorized into two terms (Eq. 3):

$$P(c_3, c_1) = P(c_3 \mid c_1) \cdot P(c_1) = \hat{p}_{refined3,c_3} \cdot \hat{p}_{1,c_1} \quad (3)$$

Given the prior knowledge table, we first construct a binary reference matrix $M \in \mathbb{R}^{3 \times 8}$ which encodes the co-existence of labels between the first and third components. We then take the Hadamard product of the two vectors $M_{g(c_1)} \in \mathbb{R}^{1 \times 8}$ (where $g(c_1) \in \{0, 1, 2\}$ is the index of the true label $c_1$) and $\hat{p}_3 \in \mathbb{R}^{1 \times 8}$ to produce the refined probability $\hat{p}_{refined3} \in \mathbb{R}^{1 \times 8}$ (Eq. 4). Finally, it is normalized before being multiplied by the $c_1$-th value of $\hat{p}_1$ from Eq. (1):

$$\hat{p}_{refined3} = M_{g(c_1)} \odot \hat{p}_3 \quad (4)$$

$$P(c_3, c_1) = P(c_3 \mid c_1) \cdot P(c_1) = \frac{\hat{p}_{refined3,c_3}}{\sum_{i=1}^{8} \hat{p}_{refined3,i}} \cdot \hat{p}_{1,c_1} \quad (5)$$
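The refinement of Eqs. (2)-(5) reduces to a masked renormalization of $\hat{p}_3$. The sketch below encodes Table 2 as the binary matrix $M$; the label orderings and the exhaustive argmax over label triples are our assumptions for illustration.

```python
# Minimal sketch of the prior-knowledge refinement (Eqs. 2-5); label orderings
# follow our reading of Tables 1 and 2.
import numpy as np

# Rows: Serve, Offensive, Defensive. Columns: Backspin, Loop, Sidespin,
# Topspin, Hit, Flip, Push, Block. M[g(c1), j] = 1 iff the pair can co-exist.
M = np.array([
    [1, 1, 1, 1, 0, 0, 0, 0],  # Serve     -> Backspin, Loop, Sidespin, Topspin
    [0, 1, 0, 0, 1, 1, 0, 0],  # Offensive -> Hit, Loop, Flip
    [1, 0, 0, 0, 0, 0, 1, 1],  # Defensive -> Push, Block, Backspin
], dtype=np.float64)

def joint_probability(p1, p2, p3, c1, c2, c3):
    """P(c1, c2, c3) = p_refined3[c3] * p1[c1] * p2[c2], per Eqs. (2)-(5)."""
    refined = M[c1] * p3               # Hadamard product, Eq. (4)
    refined = refined / refined.sum()  # normalization in Eq. (5)
    return refined[c3] * p1[c1] * p2[c2]

def predict(p1, p2, p3):
    """Pick the label triple maximizing the joint probability."""
    triples = ((c1, c2, c3) for c1 in range(3) for c2 in range(2) for c3 in range(8))
    return max(triples, key=lambda c: joint_probability(p1, p2, p3, *c))
```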
3 EXPERIMENTS AND RESULTS
In the first run, the final score of a video is the average score of its two sub-clips. All images were resized to 224 Γ— 224. We passed the super image through a ResNet-50 [4] backbone followed by a global average pooling layer to obtain a 2048-dimensional vector. For each video, we sampled two sub-clips separated by five frames, with 16 images per sub-clip. Random flipping, color jittering, and RandAugment [3] were used with the default settings of MMAction2 [2]. We trained the model in this run with the focal loss [9] to handle the data imbalance problem. In the second run, we passed video sequences of 30 sampled frames per stroke interval, with frames of shape 120 Γ— 120, to the shared network (the first and third classification branches); the two hand images were cropped and concatenated into a 120 Γ— 240 input before being fed to the second classifier. In the fifth run, we used parameters similar to our previous method [7] for each classifier. For the ensemble runs, the predictions with the highest confidence scores were returned as the final results.

Table 3: HCMUS team submission results for the Table Tennis Stroke Classification Task

Run ID | Run 1 | Run 2 | Run 3 | Run 4 | Run 5
Accuracy | 61.99% | 44.80% | 68.78% | 60.63% | 67.87%

4 CONCLUSION AND FUTURE WORKS
We benchmarked several different approaches to the video classification task for table tennis at the MediaEval 2021 benchmark. One of our submissions achieved the second-best result in terms of global accuracy, at 68.78%. Future work should analyze feature selection and the semantics of the raw labels for modeling actions in the table tennis domain, with the help of human pose and prior knowledge information.

ACKNOWLEDGMENTS
This work was funded by Gia Lam Urban Development and Investment Company Limited, Vingroup, and supported by the Vingroup Innovation Foundation (VINIF) under project code VINIF.2019.DA19.

REFERENCES
[1] MMPose Contributors. 2020. OpenMMLab Pose Estimation Toolbox and Benchmark. https://github.com/open-mmlab/mmpose.
[2] MMAction2 Contributors. 2020. OpenMMLab's Next Generation Video Understanding Toolbox and Benchmark. https://github.com/open-mmlab/mmaction2.
[3] Ekin D. Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V. Le. 2020. RandAugment: Practical automated data augmentation with a reduced search space. In Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. 702-703.
[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 770-778. https://doi.org/10.1109/CVPR.2016.90
[5] Pierre-Etienne Martin, Jenny Benois-Pineau, Renaud PΓ©teri, and Julien Morlier. 2020. Fine grained sport action recognition with Twin spatio-temporal convolutional neural networks: Application to table tennis. Multimedia Tools and Applications 79 (2020). https://doi.org/10.1007/s11042-020-08917-3
[6] Pierre-Etienne Martin, Jordan Calandre, Boris Mansencal, Jenny Benois-Pineau, Renaud PΓ©teri, Laurent Mascarilla, and Julien Morlier. 2021. Sports Video: Fine-Grained Action Detection and Classification of Table Tennis Strokes from Videos for MediaEval 2021. (2021).
[7] Hai Nguyen-Truong, San Cao, N. A. Khoa Nguyen, Bang-Dang Pham, Hieu Dao, Minh-Quan Le, Hoang-Phuc Nguyen-Dinh, Hai-Dang Nguyen, and Minh-Triet Tran. 2020. HCMUS at MediaEval 2020: Ensembles of Temporal Deep Neural Networks for Table Tennis Strokes Classification Task. In Working Notes Proceedings of the MediaEval 2020 Workshop, Online, 14-15 December 2020 (CEUR Workshop Proceedings), Steven Hicks, Debesh Jha, Konstantin Pogorelov, Alba GarcΓ­a Seco de Herrera, Dmitry Bogdanov, Pierre-Etienne Martin, Stelios Andreadis, Minh-Son Dao, Zhuoran Liu, JosΓ© Vargas Quiros, Benjamin Kille, and Martha A. Larson (Eds.), Vol. 2882. CEUR-WS.org. http://ceur-ws.org/Vol-2882/paper50.pdf
[8] Quanfu Fan, Chun-Fu (Richard) Chen, and Rameswar Panda. 2021. An Image Classifier Can Suffice For Video Understanding. (2021).
[9] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr DollΓ‘r. 2017. Focal Loss for Dense Object Detection. In ICCV.