<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>HCMUS at MediaEval 2021: Ensembles of Action Recognition Networks with Prior Knowledge for Table Tennis Strokes Classification Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Trong-Tung Nguyen</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thanh-Son Nguyen</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gia-Bao Dinh Ho</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hai-Dang Nguyen</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Minh-Triet Tran</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>John von Neumann Institute</institution>
          ,
          <addr-line>VNU-HCM</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Science</institution>
          ,
          <addr-line>VNU-HCM</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Vietnam National University</institution>
          ,
          <addr-line>Ho Chi Minh city</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <fpage>13</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>The Sports Video task of the MediaEval 2021 benchmark is made up of two subtasks: stroke detection and stroke classification. For the detection subtask, participants are required to find the frame intervals in which the strokes of interest are performed. This can then be used as a preliminary step for classifying the stroke that has been performed. This year, our HCMUS team engaged in the challenge with the main contribution of improving the classification subtask, aiming to improve on the effectiveness of our 2020 method. We proposed three different approaches for three of our five runs, followed by an ensemble stage for the two remaining runs. Eventually, our best run ranked second in the Sports Video task with 68.8% accuracy.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        In the Multimedia Evaluation Challenge 2021, there are two main
sub-tasks: detection and classification. Specifically, the latter
specifies video boundaries as inputs for classifying stroke
categories. Regarding the dataset, strokes are categorized into the same
20 classes as last year, with the addition of new and more
diverse samples [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        We conducted three experiments with different model
architectures and submitted five runs in total. The first, second,
and fifth runs were independent methods. As for the other runs,
the third run is the ensemble of the first and fifth
runs, while the first and second runs were ensembled to form the
fourth run. For the first run, we employed a rudimentary method to
handle video classification by spatially stacking the images of a video
sequence into a super image, a simple idea shown to
be efficient in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The second run took a more
systematic approach: we decomposed the problem into three
classification branches with the help of multi-task learning, aiming
to inject relevant features and human biases into each branch
independently. For the fifth run, we continued to employ our
previous approach [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] with some modifications. Our post-processing
stage was generalized with the help
of conditional probabilities and prior knowledge to reduce the
sensitivity of the classification models' outcomes.
      </p>
    </sec>
    <sec id="sec-2">
      <title>METHOD</title>
      <sec id="sec-run-01">
        <title>Run 01</title>
        <p>In this run, we stacked the images of each sub-clip spatially into a
super image, a spatial grid of the frames, as a representation of the full clip, and
treated the video classification task as an image classification problem.
After that, a classification head was used to predict the stroke category.</p>
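        <p>As an illustration only, the following sketch shows how a 16-frame sub-clip
could be tiled into a super image and classified with a 2D backbone. The 4 × 4 grid
layout, the resizing step, and the head size are assumptions, not our exact implementation.</p>
        <preformat>
# Minimal sketch of the super-image idea in Run 01 (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

def make_super_image(frames: torch.Tensor, grid: int = 4) -> torch.Tensor:
    """frames: (grid*grid, 3, H, W) -> a single (3, grid*H, grid*W) super image."""
    rows = [torch.cat(list(frames[r * grid:(r + 1) * grid]), dim=-1)  # one grid row
            for r in range(grid)]
    return torch.cat(rows, dim=-2)                                    # stack rows vertically

class SuperImageClassifier(nn.Module):
    def __init__(self, num_classes: int = 20):
        super().__init__()
        self.backbone = torchvision.models.resnet50(pretrained=True)
        self.backbone.fc = nn.Linear(2048, num_classes)               # stroke-category head

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (B, 16, 3, H, W), one 16-frame sub-clip per sample
        supers = torch.stack([make_super_image(c) for c in clips])
        supers = F.interpolate(supers, size=(224, 224), mode="bilinear",
                               align_corners=False)                   # resize to 224 x 224
        return self.backbone(supers)                                  # (B, num_classes) logits
        </preformat>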
      </sec>
      <sec id="sec-run-02">
        <title>Run 02</title>
        <p>In this run, we decomposed the original classification problem into
three sub-classification branches, with the sub-categories assigned to
each classifier described in Table 1. This mechanism was motivated by
our wish to disentangle the ambiguity of the raw labels: it is more
relevant to discriminate among serve, offensive, and defensive strokes
than among serve, forehand, and backhand types. Moreover, breaking the
raw labels into sub-classes supplies more training samples for each
category of each classifier, as the collections of some strokes in
table tennis are still limited. Eventually, each classifier utilized
both shared and exclusive features useful for its corresponding task.</p>
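        <p>As an illustration (the exact raw label strings of the dataset are an
assumption here), a raw stroke label could be decomposed into the three
sub-labels of Table 1 as follows:</p>
        <preformat>
# Illustrative sketch only: splitting a raw label into the three sub-labels of Table 1.
FIRST = ["Serve", "Offensive", "Defensive"]
SECOND = ["Forehand", "Backhand"]
THIRD = ["Backspin", "Loop", "Sidespin", "Topspin", "Hit", "Flip", "Push", "Block"]

def decompose(raw_label: str):
    """e.g. 'Offensive Forehand Loop' -> indices (1, 0, 1) for the three branches."""
    parts = raw_label.split()
    first = next(i for i, name in enumerate(FIRST) if name in parts)
    second = next(i for i, name in enumerate(SECOND) if name in parts)
    third = next(i for i, name in enumerate(THIRD) if name in parts)
    return first, second, third
        </preformat>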
      <p>
        The first and third components utilized shared features
h<sub>13</sub>, which were constructed by concatenating the temporal visual
features with the temporal pose features. A
3D-CNN architecture implemented by [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] was employed for
extracting the temporal visual features from the RGB frames of a sub-clip. The
temporal pose features, in turn, were the result of feeding the 17 human key points
of multiple successive frames to an LSTM architecture. Initially,
we sampled frames with a strategy that ensures the
consistency of the key points extracted across the video sequence. Each key point
is represented by two coordinate values, which results in 34
values for a specific pose. The first and third components
were paired to use the same features because of their similarity in visual
appearance; they are likely to rely on the same sources of information for
predicting their sub-categories.
      </p>
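      <p>A minimal sketch of this shared-feature construction is given below; the
3D-CNN of [5] is abstracted as a generic encoder, and the feature sizes are
assumptions rather than the values we actually used.</p>
      <preformat>
# Sketch of the shared features h_13 used by the first and third branches.
import torch
import torch.nn as nn

class SharedFeatures(nn.Module):
    def __init__(self, visual_encoder: nn.Module, pose_dim: int = 128):
        super().__init__()
        self.visual_encoder = visual_encoder            # 3D-CNN over the RGB frames
        self.pose_lstm = nn.LSTM(input_size=34,         # 17 key points x 2 coordinates
                                 hidden_size=pose_dim, batch_first=True)

    def forward(self, clip: torch.Tensor, keypoints: torch.Tensor) -> torch.Tensor:
        # clip: (B, 3, T, H, W); keypoints: (B, T, 34) per-frame poses
        f_visual = self.visual_encoder(clip)            # (B, D) temporal visual features
        _, (h_n, _) = self.pose_lstm(keypoints)
        f_pose = h_n[-1]                                # (B, pose_dim) temporal pose features
        return torch.cat([f_visual, f_pose], dim=1)     # shared features h_13
      </preformat>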
      <p>However, another significant feature should be incorporated
when handling the Forehand/Backhand classifier. We
first cropped the original image based on the boundaries of the
hands' region, which can be extracted by selecting the
coordinates of the key points that correspond to a plausible position of the
human hands. After that, the concatenation of the two hand images
was supplied to a separate 3D-CNN branch to produce another temporal
visual hand feature for that classification branch.</p>
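      <p>A hedged sketch of this cropping step follows; COCO-style key-point
ordering (left/right wrist at indices 9 and 10) and the fixed 120-pixel box are
assumptions, not details taken from our implementation.</p>
      <preformat>
# Sketch: cut a fixed-size box around each wrist key point and place the crops side by side.
import numpy as np

def crop_hands(frame: np.ndarray, keypoints: np.ndarray, size: int = 120) -> np.ndarray:
    """frame: (H, W, 3); keypoints: (17, 2) pixel coordinates -> (size, 2*size, 3) crop."""
    h, w = frame.shape[:2]
    crops = []
    for idx in (9, 10):                                  # left wrist, right wrist (assumed indices)
        x, y = keypoints[idx]
        x0 = int(np.clip(x - size // 2, 0, w - size))    # keep the box inside the frame
        y0 = int(np.clip(y - size // 2, 0, h - size))
        crops.append(frame[y0:y0 + size, x0:x0 + size])
    return np.concatenate(crops, axis=1)                 # e.g. 120 x 240 x 3
      </preformat>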
      <sec id="sec-2-1">
        <title>Classifier type Categories # Prediction Heads</title>
        <p>First Serve, Ofensive, 3
Component Defensive</p>
        <p>Second Forehand, 2
Component Backhand</p>
        <p>Backspin, Loop,</p>
        <p>Third Sidespin, Topspin, 8
Component Hit, Flip,</p>
        <p>Push, Block
Table 1: Three splitted sub-categories for three classifier types
to produce another temporal visual hand feature for the third
classification branch</p>
        <p>After that, three multi-layer perceptrons (Eq. 1) were designed,
one for each classification branch, with the different numbers of prediction
heads shown in Table 1. The loss functions of the branches were then
aggregated to form the final multi-task learning loss L.</p>
        <disp-formula id="eq1">
          <label>(1)</label>
          <tex-math>\hat{y}_k = \mathrm{softmax}\big(\mathrm{MLP}_k(h_k)\big), \quad k \in \{1, 2, 3\}</tex-math>
        </disp-formula>
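        <p>A minimal sketch of this aggregated objective, assuming equal weighting of
the three branch losses (the weighting scheme is not stated above), is:</p>
        <preformat>
# Sketch of the multi-task loss: one cross-entropy term per branch, summed.
import torch.nn.functional as F

def multitask_loss(logits1, logits2, logits3, y1, y2, y3):
    return (F.cross_entropy(logits1, y1)      # first component: 3 classes
            + F.cross_entropy(logits2, y2)    # second component: 2 classes
            + F.cross_entropy(logits3, y3))   # third component: 8 classes
        </preformat>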
        <p>Finally, we formulated the joint probability P(y<sub>1</sub>, y<sub>2</sub>, y<sub>3</sub>) of
predicting the three sub-categories using prior knowledge (Eq. 2).
By conducting a thorough analysis of the co-existence of the three
sub-categories, we concluded that the existence of the second
component label is independent of the first and third component
labels. On the other hand, it is possible to narrow down the plausible
labels of the third component given prior knowledge about the
category of the first component. In Table 2, we summarize the
relations of existence between the first and third components that
we have investigated so far.</p>
        <disp-formula id="eq2">
          <label>(2)</label>
          <tex-math>P(y_1, y_2, y_3) = P(y_3, y_1 \mid y_2) \cdot P(y_2) = P(y_3, y_1) \cdot P(y_2) = P(y_3, y_1) \cdot \hat{y}_2[i_2]</tex-math>
        </disp-formula>
        <p>The second term is the i<sub>2</sub>-th value of the output ŷ<sub>2</sub> (Eq. 1) of
the second classifier. Meanwhile, the first term P(y<sub>3</sub>, y<sub>1</sub>) is
factorized into two terms (Eq. 3).</p>
        <disp-formula id="eq3">
          <label>(3)</label>
          <tex-math>P(y_3, y_1) = P(y_3 \mid y_1) \cdot P(y_1) = \hat{y}_3[i_3] \cdot \hat{y}_1[i_1]</tex-math>
        </disp-formula>
        <p>Given the prior knowledge tables, we first construct a binary
reference matrix M ∈ {0, 1}<sup>3×8</sup>, which encodes the co-existence of
labels between the first and third components. Then, we take the
Hadamard product of the row M[i<sub>1</sub>] ∈ R<sup>1×8</sup> (where i<sub>1</sub> ∈ {0, 1, 2}
is the index of the first-component label) and ŷ<sub>3</sub> ∈ R<sup>1×8</sup> to produce
the refined probability vector ỹ<sub>3</sub> ∈ R<sup>1×8</sup> (Eq. 4). Finally, ỹ<sub>3</sub> is normalized
before being multiplied with the i<sub>1</sub>-th value of ŷ<sub>1</sub> (Eq. 5):</p>
        <disp-formula id="eq4">
          <label>(4)</label>
          <tex-math>\tilde{y}_3 = M[i_1] \odot \hat{y}_3</tex-math>
        </disp-formula>
        <disp-formula id="eq5">
          <label>(5)</label>
          <tex-math>P(y_3, y_1) = P(y_3 \mid y_1) \cdot P(y_1) = \frac{\tilde{y}_3[i_3]}{\sum_{j=1}^{8} \tilde{y}_3[j]} \cdot \hat{y}_1[i_1]</tex-math>
        </disp-formula>
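        <p>As an illustration, a sketch of this inference step (Eqs. 2-5) is given
below; the binary matrix mirrors Table 2, and the row and column orderings are
assumptions about the exact label indexing.</p>
        <preformat>
# Sketch of the prior-knowledge refinement and joint probability (Eqs. 2-5).
import numpy as np

# rows: Serve, Offensive, Defensive
# cols: Backspin, Loop, Sidespin, Topspin, Hit, Flip, Push, Block
M = np.array([[1, 1, 1, 1, 0, 0, 0, 0],   # Serve
              [0, 1, 0, 0, 1, 1, 0, 0],   # Offensive
              [1, 0, 0, 0, 0, 0, 1, 1]])  # Defensive

def joint_probability(p1, p2, p3, i1, i2, i3):
    """p1, p2, p3: branch probability vectors; i1, i2, i3: candidate label indices."""
    refined = M[i1] * p3                   # Eq. 4: mask out implausible third-component labels
    refined = refined / refined.sum()      # normalise the refined vector
    p31 = refined[i3] * p1[i1]             # Eq. 5: P(y3, y1) = P(y3 | y1) * P(y1)
    return p31 * p2[i2]                    # Eq. 2: the second branch is independent
        </preformat>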
        <table-wrap id="tbl2">
          <label>Table 2</label>
          <caption>
            <p>Prior knowledge tables: possible sets of third-component labels given the first-component category.</p>
          </caption>
          <table>
            <thead>
              <tr>
                <th>Prior knowledge about First Component</th>
                <th>Possible sets of labels for Third Component</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>Serve</td>
                <td>Backspin, Loop, Sidespin, Topspin</td>
              </tr>
              <tr>
                <td>Offensive</td>
                <td>Hit, Loop, Flip</td>
              </tr>
              <tr>
                <td>Defensive</td>
                <td>Push, Block, Backspin</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-run-05">
        <title>Run 05</title>
        <p>
We made a small modification to the second run by replacing our
designed classifiers with a more powerful model architecture for the
action recognition problem, which we had utilized last year [
          <xref ref-type="bibr" rid="ref1 ref7">1, 7</xref>
          ].
Similarly, the three different classifiers produced their outputs
independently, and these outputs were then combined into the final
results with the conditional probability and prior knowledge
mechanism demonstrated above.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>EXPERIMENTS AND RESULTS</title>
      <p>
        In the first run, the final score is the average of the scores of the two
sub-clips of a video. All images were resized to
224×224. We passed the super image to a ResNet-50 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] backbone,
followed by a global average pooling layer, to get a 2048-dimensional
vector. For each video, we sampled two sub-clips separated by five
frames, with 16 images per sub-clip. Random flipping, color jittering,
and RandAugment [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] were also used with the default settings
in MMAction2 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. We trained our model in this run using the focal
loss [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] to handle the data imbalance problem. In the second run,
we passed video sequences of 30 sampled frames, each with
a shape of 120 × 120, to the shared network (the first and third
classification branches). Meanwhile, the two hand images were cropped
and concatenated to a shape of 120 × 240 before being fed to the third
classifier. In the fifth run, we used parameters similar to our
previous method [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] for each classifier. For the ensemble runs, the prediction
with the highest confidence score was returned as the final result.
      </p>
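      <p>As an illustration of this ensembling rule (a sketch only, assuming the runs'
outputs are stored as per-video probability matrices):</p>
      <preformat>
# Max-confidence ensemble: keep the prediction of whichever member run is more confident.
import numpy as np

def ensemble_max_confidence(prob_a: np.ndarray, prob_b: np.ndarray) -> np.ndarray:
    """prob_a, prob_b: (num_videos, num_classes) class probabilities of two runs."""
    pick_a = prob_a.max(axis=1) >= prob_b.max(axis=1)    # which run is more confident per video
    return np.where(pick_a, prob_a.argmax(axis=1), prob_b.argmax(axis=1))
      </preformat>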
      <table-wrap id="tbl3">
        <label>Table 3</label>
        <caption>
          <p>Classification accuracy of our five submitted runs.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Run ID</th>
              <th>Accuracy</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>Run 1</td>
              <td>61.99%</td>
            </tr>
            <tr>
              <td>Run 2</td>
              <td>44.80%</td>
            </tr>
            <tr>
              <td>Run 3</td>
              <td>68.78%</td>
            </tr>
            <tr>
              <td>Run 4</td>
              <td>60.63%</td>
            </tr>
            <tr>
              <td>Run 5</td>
              <td>67.87%</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-4">
      <title>ACKNOWLEDGMENTS</title>
      <p>This work was funded by Gia Lam Urban Development and
Investment Company Limited, Vingroup and supported by Vingroup
Innovation Foundation (VINIF) under project code VINIF.2019.DA19.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>MMPose</given-names>
            <surname>Contributors</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>OpenMMLab Pose Estimation Toolbox and Benchmark</article-title>
          . https://github.com/open-mmlab/mmpose. (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] MMAction2 Contributors. 2020. OpenMMLab's Next Generation Video Understanding Toolbox and Benchmark. https://github.com/open-mmlab/mmaction2. (2020).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] Ekin D. Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V. Le. 2020. RandAugment: Practical automated data augmentation with a reduced search space. In Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. 702-703.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 770-778. https://doi.org/10.1109/CVPR.2016.90
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] Pierre-Etienne Martin, Jenny Benois-Pineau, Renaud Péteri, and Julien Morlier. 2020. Fine grained sport action recognition with Twin spatio-temporal convolutional neural networks: Application to table tennis. Multimedia Tools and Applications 79 (07 2020). https://doi.org/10.1007/s11042-020-08917-3
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] Pierre-Etienne Martin, Jordan Calandre, Boris Mansencal, Jenny Benois-Pineau, Renaud Péteri, Laurent Mascarilla, and Julien Morlier. 2021. Sports Video: Fine-Grained Action Detection and Classification of Table Tennis Strokes from videos for MediaEval 2021. (2021).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] Hai Nguyen-Truong, San Cao, N. A. Khoa Nguyen, Bang-Dang Pham, Hieu Dao, Minh-Quan Le, Hoang-Phuc Nguyen-Dinh, Hai-Dang Nguyen, and Minh-Triet Tran. 2020. HCMUS at MediaEval 2020: Ensembles of Temporal Deep Neural Networks for Table Tennis Strokes Classification Task. In Working Notes Proceedings of the MediaEval 2020 Workshop, Online, 14-15 December 2020 (CEUR Workshop Proceedings), Steven Hicks, Debesh Jha, Konstantin Pogorelov, Alba García Seco de Herrera, Dmitry Bogdanov, Pierre-Etienne Martin, Stelios Andreadis, Minh-Son Dao, Zhuoran Liu, José Vargas Quiros, Benjamin Kille, and Martha A. Larson (Eds.), Vol. 2882. CEUR-WS.org. http://ceur-ws.org/Vol-2882/paper50.pdf
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8] Quanfu Fan, Chun-Fu (Richard) Chen, and Rameswar Panda. 2021. An Image Classifier Can Suffice For Video Understanding. (06 2021).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2017. Focal Loss for Dense Object Detection. In ICCV.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>