HCMUS at MediaEval 2021: Ensembles of Action Recognition Networks with Prior Knowledge for Table Tennis Strokes Classification Task

Trong-Tung Nguyen 1,3, Thanh-Son Nguyen 1,3, Gia-Bao Dinh Ho 1,3, Hai-Dang Nguyen 1,3, Minh-Triet Tran 1,2,3
1 University of Science, VNU-HCM; 2 John von Neumann Institute, VNU-HCM; 3 Vietnam National University, Ho Chi Minh City, Vietnam

{ntrtung17,dhgbao}@apcs.fitus.edu.vn, {nthanhson,nhdang}@selab.hcmus.edu.vn, tmtriet@fit.hcmus.edu.vn

ABSTRACT
The SportsVideo task of the MediaEval 2021 benchmark comprises two subtasks: stroke detection and stroke classification. For the detection subtask, participants are required to find the frame intervals in which the strokes of interest are performed; these intervals can then serve as a preliminary step for classifying the performed strokes. This year, our HCMUS team entered the challenge with the main contribution of improving classification, aiming to strengthen the effectiveness of our 2020 method. Of our five runs, three are distinct approaches and the remaining two are ensembles of them. Our best run ranked second in the Sports Video Task with 68.8% accuracy.
1 INTRODUCTION
In the Multimedia Evaluation Challenge 2021, there are two main subtasks: detection and classification. The latter takes video boundaries as input and classifies the stroke categories they contain. Regarding the dataset, strokes are categorized into the same 20 classes as last year, with new and more diverse samples added [6].
We conducted three experiments with different model architectures and submitted five runs in total. The first, second, and fifth runs are independent methods; the third run is an ensemble of the first and fifth runs, while the fourth run is an ensemble of the first and second runs. For the first run, we employed a simple method for video classification: spatially stacking the images of a video sequence to form a super image, an idea shown to be efficient in [8]. The second run took a more systematic approach: we decomposed the problem into three classification branches with the help of multi-task learning, which aims to inject relevant features and human biases into each branch independently. For the fifth run, we continued our previous approach [7] with some modifications: the post-processing stage was generalized with conditional probabilities and prior knowledge to suppress sensitive outcomes of the classification models.
Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
MediaEval’21, December 13-15, 2021, Online
2 METHOD
2.1 Run 01
In this run, we spatially stacked the images of a sub-clip to create a super image of size $N \times N$ as a representation of the full clip, treating video classification as an image classification problem. A classification head was then used to predict the stroke category.
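Below is a minimal sketch of this super-image idea, assuming 16 frames per sub-clip tiled on a 4 × 4 grid and a torchvision ResNet-50 backbone with a linear classification head; the grid size, backbone, and class count are illustrative choices rather than our exact training configuration.

```python
# Minimal sketch of the super-image idea (Run 01), assuming 16 frames
# per sub-clip arranged on a 4x4 grid; hyperparameters are illustrative.
import torch
import torch.nn as nn
from torchvision.models import resnet50

class SuperImageClassifier(nn.Module):
    def __init__(self, num_classes: int = 20, grid: int = 4):
        super().__init__()
        self.grid = grid
        backbone = resnet50()
        backbone.fc = nn.Identity()          # keep the 2048-d pooled feature
        self.backbone = backbone
        self.head = nn.Linear(2048, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (B, T, C, H, W) with T == grid * grid
        rows = [torch.cat(list(clip[:, r * self.grid:(r + 1) * self.grid].unbind(1)), dim=3)
                for r in range(self.grid)]    # each row: (B, C, H, grid*W)
        super_image = torch.cat(rows, dim=2)  # (B, C, grid*H, grid*W)
        return self.head(self.backbone(super_image))

logits = SuperImageClassifier()(torch.randn(2, 16, 3, 224, 224))
```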
2.2 Run 02
In this run, we decomposed the original classification problem into three sub-classification branches, with the sub-categories assigned to each classifier described in Table 1. This mechanism is motivated by disentangling the ambiguity of the raw labels: it is more natural to discriminate among serve, offensive, and defensive strokes than among serve, forehand, and backhand types. Moreover, breaking the raw labels into many sub-classes supplies more training samples per category in each classifier, as samples of some table tennis strokes are still limited. Eventually, each classifier utilizes both shared and exclusive features useful for its task.

Classifier type      Categories                                                  # Prediction heads
First component      Serve, Offensive, Defensive                                 3
Second component     Forehand, Backhand                                          2
Third component      Backspin, Loop, Sidespin, Topspin, Hit, Flip, Push, Block   8

Table 1: The three sets of split sub-categories for the three classifier types
The first and third components utilize the shared features $f_{shared\_13}$, constructed by concatenating temporal visual features with temporal pose features. A 3D-CNN architecture implemented by [5] was employed to extract the temporal visual features $f_{visual\_3DCNN_1}$ from images of shape $H \times W \times C$. The temporal pose features $f_{temporal\_pose}$ were obtained by feeding the 17 human key points of successive frames into an LSTM architecture. To this end, we sampled $F$ frames with a strategy that ensures consistent key-point extraction across the video sequence; each key point is represented by two coordinates, giving 34 values per pose. The first and third components were paired on the same features because their categories are similar in visual appearance and plausibly draw on the same sources of information for predicting sub-categories.
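A minimal sketch of this pose stream and the feature fusion follows, assuming COCO-style 17-keypoint poses flattened to 34 values per frame; the class name `PoseLSTM`, the hidden size, and the visual feature size are illustrative assumptions.

```python
# Illustrative sketch of the shared feature construction: an LSTM over
# per-frame pose vectors, concatenated with a clip-level visual feature.
# Dimensions and module names are assumptions, not the paper's exact ones.
import torch
import torch.nn as nn

class PoseLSTM(nn.Module):
    def __init__(self, num_keypoints: int = 17, hidden: int = 128):
        super().__init__()
        # Each frame contributes 2 coordinates per key point -> 34 values.
        self.lstm = nn.LSTM(input_size=2 * num_keypoints,
                            hidden_size=hidden, batch_first=True)

    def forward(self, poses: torch.Tensor) -> torch.Tensor:
        # poses: (B, F, 34) -- F sampled frames of flattened key points
        _, (h_n, _) = self.lstm(poses)
        return h_n[-1]                       # (B, hidden) temporal pose feature

# f_visual would come from the 3D-CNN of [5]; here it is a placeholder.
B, F = 2, 30
f_pose = PoseLSTM()(torch.randn(B, F, 34))           # (B, 128)
f_visual = torch.randn(B, 512)                       # stand-in for f_visual_3DCNN_1
f_shared_13 = torch.cat([f_pose, f_visual], dim=1)   # shared by branches 1 and 3
```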
However, another significant feature should be incorporated when handling the third classifier (Forehand, Backhand). We first cropped the original image to the boundaries of the hands' region, which can be extracted by selecting the coordinates of the key points that correspond to plausible positions of human hands. The concatenation of the two hand images, of shape $H_1 \times W_1 \times C$, was then supplied to a separate 3D-CNN branch to produce another temporal visual hand feature for this classification branch.
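For illustration, a sketch of the hand-region cropping under the assumption that wrist key points (COCO indices 9 and 10) anchor the hand regions; the margin and crop size are illustrative, with the 120 × 240 concatenated shape matching what is reported in Section 3.

```python
# Illustrative hand-region cropping from pose key points. The wrist
# indices (COCO: 9 = left wrist, 10 = right wrist), margin, and output
# size are assumptions for the sketch, not the paper's exact values.
import numpy as np
import cv2

def crop_hands(frame: np.ndarray, keypoints: np.ndarray,
               margin: int = 40, size: int = 120) -> np.ndarray:
    """frame: (H, W, 3); keypoints: (17, 2) as (x, y). Returns (size, 2*size, 3)."""
    h, w = frame.shape[:2]
    crops = []
    for idx in (9, 10):                      # left and right wrist
        x, y = keypoints[idx].astype(int)
        x0, y0 = max(x - margin, 0), max(y - margin, 0)
        x1, y1 = min(x + margin, w), min(y + margin, h)
        crop = cv2.resize(frame[y0:y1, x0:x1], (size, size))
        crops.append(crop)
    # Concatenate the two hand crops side by side, e.g. 120 x 240.
    return np.concatenate(crops, axis=1)
```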
After that, three multi-layer perceptrons $\mathrm{MLP}_i$ were designed, one for each classification branch, with the numbers of prediction heads shown in Table 1 (Eq. (1)). The losses of the three branches were aggregated to form the final multi-task learning loss $\mathcal{L}$.

$\hat{p}_i = \mathrm{Softmax}(\mathrm{MLP}_i(f_i))$   (1)
Finally, we formulated the joint probability $P(c_1, c_2, c_3)$ of the three sub-category predictions using prior knowledge (Eq. (2)). From a thorough analysis of the co-existence of the three sub-categories, we concluded that the second component's label is independent of the first and third components' labels. On the other hand, the plausible labels of the third component can be narrowed down given prior knowledge of the first component's category. Table 2 summarizes the co-existence relations between the first and third components that we investigated.

Prior knowledge about first component    Possible labels for third component
Serve                                    Backspin, Loop, Sidespin, Topspin
Offensive                                Hit, Loop, Flip
Defensive                                Push, Block, Backspin

Table 2: Prior knowledge relating the first and third components
$P(c_1, c_2, c_3) = P(c_3, c_1 \mid c_2) \cdot P(c_2) = P(c_3, c_1) \cdot P(c_2) = P(c_3, c_1) \cdot \hat{p}_2^{c_2}$   (2)
The second term is the $c_2$-th entry of $\hat{p}_2$ from Eq. (1), produced by the second classifier. Meanwhile, the first term $P(c_3, c_1)$ is factorized into two terms (Eq. (3)).

$P(c_3, c_1) = P(c_3 \mid c_1) \cdot P(c_1) = \hat{p}_{\mathrm{refined3}}^{c_3} \cdot \hat{p}_1^{c_1}$   (3)
Given the prior knowledge table, we first construct a binary reference matrix $M \in R^{3 \times 8}$ that encodes the co-existence of labels between the first and third components. We then take the Hadamard product of the two vectors $M_{g(c_1)} \in R^{1 \times 8}$ (where $g(c_1) \in \{0, 1, 2\}$ is the index of $c_1$) and $\hat{p}_3 \in R^{1 \times 8}$ to produce the refined probability $\hat{p}_{\mathrm{refined3}} \in R^{1 \times 8}$ (Eq. (4)). Finally, it is normalized before being multiplied by the $c_1$-th entry of $\hat{p}_1$ from Eq. (1):

$\hat{p}_{\mathrm{refined3}} = M_{g(c_1)} \odot \hat{p}_3$   (4)

$P(c_3, c_1) = P(c_3 \mid c_1) \cdot P(c_1) = \dfrac{\hat{p}_{\mathrm{refined3}}^{c_3}}{\sum_{i=1}^{8} \hat{p}_{\mathrm{refined3}}^{i}} \cdot \hat{p}_1^{c_1}$   (5)
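Putting Eqs. (2)-(5) together, the refinement can be sketched numerically as follows, with the binary matrix $M$ filled from Table 2 under an assumed label ordering (rows: Serve, Offensive, Defensive; columns: Backspin, Loop, Sidespin, Topspin, Hit, Flip, Push, Block); the probabilities are made-up toy values.

```python
# Numerical sketch of Eqs. (2)-(5): refine the third classifier's
# probabilities with the Table 2 co-existence mask. The label ordering
# is an assumption for illustration.
import numpy as np

# Columns: Backspin, Loop, Sidespin, Topspin, Hit, Flip, Push, Block
M = np.array([[1, 1, 1, 1, 0, 0, 0, 0],   # Serve
              [0, 1, 0, 0, 1, 1, 0, 0],   # Offensive
              [1, 0, 0, 0, 0, 0, 1, 1]])  # Defensive

def joint_probability(p1, p2, p3, c1, c2, c3):
    """P(c1, c2, c3) = P(c3, c1) * p2[c2], with P(c3, c1) from Eq. (5)."""
    p_refined3 = M[c1] * p3                       # Eq. (4): Hadamard mask
    p_refined3 = p_refined3 / p_refined3.sum()    # normalization in Eq. (5)
    return p_refined3[c3] * p1[c1] * p2[c2]       # Eqs. (3) and (2)

p1 = np.array([0.6, 0.3, 0.1])                    # Serve / Offensive / Defensive
p2 = np.array([0.7, 0.3])                         # Forehand / Backhand
p3 = np.full(8, 0.125)                            # uniform third-branch output
# P(Serve, Forehand, Topspin): the mask keeps only the 4 serve-compatible labels
print(joint_probability(p1, p2, p3, c1=0, c2=0, c3=3))  # 0.25 * 0.6 * 0.7 = 0.105
```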
2.3 Run 05
We made a small modification to the second run by replacing our designed classifier with a more powerful model architecture for action recognition, which we had utilized last year [1, 7]. As before, the three classifiers produced their outputs independently, and these were combined into the final results with the conditional-probability and prior-knowledge mechanism demonstrated above.
                                   π‘Λ†π‘Ÿπ‘’ 𝑓 𝑖𝑛𝑒𝑑3𝑐 3                     (5)    ment Company Limited, Vingroup and supported by Vingroup In-
Run ID      Run 1     Run 2     Run 3     Run 4     Run 5
Accuracy    61.99%    44.80%    68.78%    60.63%    67.87%

Table 3: HCMUS team submission results for the Table Tennis Stroke Classification Task
4 CONCLUSION AND FUTURE WORKS
In conclusion, we benchmarked several approaches to video classification for table tennis at the MediaEval 2021 benchmark, and one of our submissions achieved the second-best result in terms of global accuracy, at 68.78%. Future work should analyze feature selection and the semantics of the raw labels for modeling actions in the table tennis domain, with the help of human pose and prior knowledge information.

ACKNOWLEDGMENTS
This work was funded by Gia Lam Urban Development and Investment Company Limited, Vingroup and supported by Vingroup Innovation Foundation (VINIF) under project code VINIF.2019.DA19.
REFERENCES
[1] MMPose Contributors. 2020. OpenMMLab Pose Estimation Toolbox
    and Benchmark. https://github.com/open-mmlab/mmpose. (2020).
[2] MMAction2 Contributors. 2020. OpenMMLab’s Next Generation
    Video Understanding Toolbox and Benchmark. https://github.com/
    open-mmlab/mmaction2. (2020).
[3] Ekin D. Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V. Le. 2020. RandAugment: Practical automated data augmentation with a reduced search space. In Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. 702-703.
[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 770-778. https://doi.org/10.1109/CVPR.2016.90
[5] Pierre-Etienne Martin, Jenny Benois-Pineau, Renaud Peteri, and Julien
    Morlier. 2020. Fine grained sport action recognition with Twin spatio-
    temporal convolutional neural networks: Application to table tennis.
    Multimedia Tools and Applications 79 (07 2020). https://doi.org/10.
    1007/s11042-020-08917-3
[6] Pierre-Etienne Martin, Jordan Calandre, Boris Mansencal, Jenny
    Benois-Pineau, Renaud PΓ©teri, Laurent Mascarilla, and Julien Morlier.
    2021. Sports Video: Fine-Grained Action Detection and Classification
    of Table Tennis Strokes from videos for MediaEval 2021. (2021).
[7] Hai Nguyen-Truong, San Cao, N. A. Khoa Nguyen, Bang-Dang Pham,
    Hieu Dao, Minh-Quan Le, Hoang-Phuc Nguyen-Dinh, Hai-Dang
    Nguyen, and Minh-Triet Tran. 2020. HCMUS at MediaEval 2020: En-
    sembles of Temporal Deep Neural Networks for Table Tennis Strokes
    Classification Task. In Working Notes Proceedings of the MediaEval
    2020 Workshop, Online, 14-15 December 2020 (CEUR Workshop Pro-
    ceedings), Steven Hicks, Debesh Jha, Konstantin Pogorelov, Alba Gar-
    cΓ­a Seco de Herrera, Dmitry Bogdanov, Pierre-Etienne Martin, Stelios
    Andreadis, Minh-Son Dao, Zhuoran Liu, JosΓ© Vargas Quiros, Ben-
    jamin Kille, and Martha A. Larson (Eds.), Vol. 2882. CEUR-WS.org.
    http://ceur-ws.org/Vol-2882/paper50.pdf
[8] Quanfu Fan, Chun-Fu (Richard) Chen, and Rameswar Panda. 2021. An Image Classifier Can Suffice For Video Understanding. (06 2021).
[9] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2017. Focal Loss for Dense Object Detection. In ICCV.