<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>HCMUS at MediaEval 2020: Ensembles of Temporal Deep Neural Networks for Table Tennis Strokes Classification Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hai Nguyen-Truong</string-name>
          <email>nthai18@apcs.vn</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>San Cao</string-name>
          <email>ctsan18@apcs.vn</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Khoa N. A. Nguyen</string-name>
          <email>nnakhoa18@apcs.vn</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bang-Dang Pham</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hieu Dao</string-name>
          <email>dhieu@apcs.vn</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Minh-Quan Le</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hoang-Phuc Nguyen-Dinh</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hai-Dang Nguyen</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Minh-Triet Tran</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>John von Neumann Institute</institution>
          ,
          <addr-line>VNU-HCM</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Science</institution>
          ,
          <addr-line>VNU-HCM</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Vietnam National University</institution>
          ,
          <addr-line>Ho Chi Minh city</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
<p>The Sports Video Classification task in the Multimedia Evaluation 2020 Challenge focuses on classifying different types of table tennis strokes in video segments. In this task, we, the HCMUS team, perform multiple experiments with a combination of models, including SlowFast, Optical Flow, DensePose, R(2+1)D, and Channel-Separated Convolutional Networks, to classify 21 types of table tennis strokes from video segments. In total, we submit eight runs corresponding to five different models, each with different sets of hyperparameters. In addition, we apply pre-processing techniques to the dataset so that our models can learn and classify more accurately. According to the evaluation results, one of our methods outperforms those of all other teams; in particular, our best run achieves 31.35% global accuracy.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        In the Multimedia Evaluation Challenge 2020 (MediaEval 2020), one
of the tasks is the classification of table tennis strokes in video segments
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The task conducts experiments on the
TTStroke-21 dataset [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The dataset consists of 20 table tennis stroke classes,
combining 8 kinds of services, 6 offensive strokes, and 6 defensive
strokes. In addition, there is a class named "Unknown" for
identifying video segments without any activity or stroke.
      </p>
      <p>We implement five runs independently in order to benchmark
different methods, conducting experiments on distinct versions of our
augmented and pre-processed dataset. We describe the methods and the
five runs in Sections 2 and 3.</p>
    </sec>
    <sec id="sec-2">
      <title>APPROACH</title>
      <p>By examining the videos of the training and test sets, we realize that the
context around the table tennis player in each video is not important,
and we want our models to concentrate solely on learning the
action of the player. Therefore, we propose the following technique
to remove the background around the player.</p>
    </sec>
    <sec id="sec-3">
      <title>Data pre-processing with DensePose for</title>
    </sec>
    <sec id="sec-4">
      <title>Background Removal</title>
      <p>
        For both the training and test sets, we utilize the DensePose
model [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] to extract a mask of the person in each frame of the
video sequence. We then extend the mask to its local region to
capture minor context around the player using binary dilation, and we blur
the mask inside-out with a Gaussian filter with suitable parameters.
After that, we multiply the resulting mask with the original frame
to get a new frame showing just the "biggest" player. In case the
mask obtained from DensePose for a frame is too small in area
(smaller than a pre-defined threshold, 5 percent of the image area
in our experiments), we do not modify that frame. After this
step, we have videos that concentrate on showing only the players.
We are still unable to handle the case when DensePose detects
more than one player in a frame. Besides, we also employ simple
data augmentation methods on the video segments, such as rotation,
translation, and flipping, to obtain more relevant data. The background
removal process is shown in Figure 1.
      </p>
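      <p>The following is a minimal sketch of this step, assuming the binary person mask has already been extracted with DensePose; the helper name and the parameter values (dilation iterations, blur sigma) are illustrative assumptions, not the exact ones we used.</p>
      <preformat>
# Background removal sketch: dilate the DensePose mask, blur it
# inside-out, and multiply it with the frame. Parameters are illustrative.
import numpy as np
from scipy import ndimage

AREA_THRESHOLD = 0.05  # skip frames whose mask covers under 5% of the image
DILATION_ITERS = 10    # grow the mask to keep minor context around the player
BLUR_SIGMA = 15        # Gaussian smoothing of the mask border

def remove_background(frame, mask):
    # frame: H x W x 3 uint8 image; mask: H x W binary person mask.
    h, w = mask.shape
    if mask.sum() / (h * w) >= AREA_THRESHOLD:
        # Extend the mask with binary dilation to capture minor context.
        dilated = ndimage.binary_dilation(mask, iterations=DILATION_ITERS)
        # Blur the mask inside-out so the cut-out fades smoothly.
        soft = ndimage.gaussian_filter(dilated.astype(np.float32), BLUR_SIGMA)
        # Multiply the blurred mask with the original frame.
        frame = (frame.astype(np.float32) * soft[..., None]).astype(np.uint8)
    return frame  # frames with too-small masks pass through unchanged
      </preformat>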
    </sec>
    <sec id="sec-5">
      <title>Late Temporal Modeling in 3D CNN</title>
    </sec>
    <sec id="sec-6">
      <title>Architectures with BERT</title>
      <p>
        Late Temporal Modeling in 3D CNN Architectures
(LateTemporalModeling3DCNN) with BERT for Action Recognition [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] is a method
combining 3D convolution with late temporal modeling for action
recognition. The paper replaces the conventional Temporal Global
Average Pooling (TGAP) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] layer at the end of the 3D convolutional
architecture with a Bidirectional Encoder Representations from
Transformers (BERT) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] layer in order to better utilize the temporal
information through BERT's attention mechanism.
      </p>
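      <p>A minimal PyTorch sketch of the idea follows: TGAP is replaced with a one-layer BERT-style attention head over the temporal features. Layer sizes and the single-layer depth are illustrative, not the exact configuration of [6].</p>
      <preformat>
# Temporal features from the 3D CNN backbone are encoded with a
# Transformer (BERT-style) layer and classified from a learned token.
import torch
import torch.nn as nn

class TemporalBERTHead(nn.Module):
    def __init__(self, feat_dim=512, n_frames=8, n_classes=20):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, feat_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_frames + 1, feat_dim))
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.fc = nn.Linear(feat_dim, n_classes)

    def forward(self, x):
        # x: (batch, n_frames, feat_dim) temporal features, with the
        # spatial dimensions already average-pooled away.
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.fc(x[:, 0])  # classify from the classification token
      </preformat>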
    </sec>
    <sec id="sec-7">
      <title>2.3 Channel-Separated Convolutional</title>
    </sec>
    <sec id="sec-8">
      <title>Networks (CSN)</title>
      <p>
        Channel-Separated Convolutional Networks (CSN) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] were first
introduced by Facebook AI at ICCV 2019. The paper emphasizes the
important role of the amount of channel interaction in the
accuracy of 3D group convolutional networks. All convolutional
operations are separated into either pointwise 1 × 1 × 1 or
depthwise 3 × 3 × 3 convolutions. This change not only reduces the
computational cost but also improves the accuracy significantly.
      </p>
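      <p>A sketch of the channel-separated factorization, under our reading of [11]: channel interaction happens only in the pointwise convolution, while the depthwise convolution handles the spatiotemporal structure.</p>
      <preformat>
# One channel-separated block: a 1x1x1 pointwise Conv3d for channel
# interaction followed by a 3x3x3 depthwise Conv3d (groups = channels).
import torch.nn as nn

def csn_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=1, bias=False),   # pointwise 1x1x1
        nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1,
                  groups=out_ch, bias=False),                  # depthwise 3x3x3
        nn.BatchNorm3d(out_ch),
        nn.ReLU(inplace=True),
    )
      </preformat>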
    </sec>
    <sec id="sec-9">
      <title>Twin Spatio-Temporal Convolutional</title>
    </sec>
    <sec id="sec-10">
      <title>Neural Networks (TSTCNN)</title>
      <p>
        In this task, we also use the Twin Spatio-Temporal Convolutional
Neural Networks (TSTCNN) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and conduct experiments on them
with our minor adjustments to classify the fine-grained sports
actions. To extract useful motion information, we compute the optical
flow values of each video frame with the Lucas-Kanade method and feed
the RGB frames and the optical flow into the two branches of the network.
      </p>
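      <p>A minimal OpenCV sketch of the flow-extraction step; since Lucas-Kanade flow is computed on tracked points, the corner-sampling scheme below is an assumption, not necessarily the one used in [10].</p>
      <preformat>
# Sparse Lucas-Kanade optical flow between two consecutive grayscale
# frames, tracked on detected corner points.
import cv2
import numpy as np

def lucas_kanade_flow(prev_gray, next_gray):
    # Pick corner points to track in the previous frame.
    p0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                                 qualityLevel=0.01, minDistance=7)
    if p0 is None:
        return np.zeros((0, 2), dtype=np.float32)
    # Pyramidal Lucas-Kanade: estimate where each point moved to.
    p1, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, p0, None)
    good = status.reshape(-1) == 1
    return (p1 - p0).reshape(-1, 2)[good]  # per-point (dx, dy) flow vectors
      </preformat>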
    </sec>
    <sec id="sec-10a">
      <title>3.1 First run - Run 03</title>
      <p>
        For this run, we use the CSN method (mentioned in Section 2.3)
without modifications as a baseline to demonstrate the method. We
use a ResNet3D architecture as our backbone, with I3D heads [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] as the
classification part. We also disable batch normalization operations because
this leads to a higher overall accuracy. As a result, we achieve
86.9% accuracy on our validation set and 28.81% on the test set.
      </p>
    </sec>
    <sec id="sec-11">
      <title>3.2 Second run - Run 04</title>
      <p>
        In this run, we use the LateTemporalModeling3DCNN method
(mentioned in Section 2.2) combined with several models to inspect the
effectiveness of the method. The models used are RGB ResNeXt101
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and RGB ResNeXt101 with BERT, RGB SlowFast50 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] (derived
from ResNet50) and RGB SlowFast50 with BERT, and RGB R(2+1)D
[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. All models use a 64-frame input length, except RGB R(2+1)D, which
uses a 32-frame length, because we want to keep the configuration from [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
Initially, we accidentally set the number of classes to 51 while trying
to configure the dataset in the same way as HMDB51, and
RGB ResNeXt101 gives the best result of this run (we achieve 87.9%
on our validation set and 25.42% on the test set). However,
when we fix the number of classes to 20, the actual number, and
use the more complex backbones (even with the BERT architecture),
the results are not as good as the initial one.
      </p>
    </sec>
    <sec id="sec-12">
      <title>3.3 Third run - Run 06</title>
      <p>
        In run 06, we use multiple video classification models based on the
SlowFast network [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] applied to the background-removed video frames.
Particularly, for the training phase, we train six different classifiers
on six different sets of videos, as shown in Table 1.
      </p>
      <p>All six classifiers are SlowFast networks with the ResNet50
backbone. In the inference phase, we first predict whether the person plays
a Forehand or Backhand stroke, then a Serve, Offensive, or Defensive
stroke. After that, based on these predictions, we choose the model
to infer the remaining part of the stroke, as sketched below. Experiments
show that with this method our models can recognize forehand, backhand,
and serve with high precision.</p>
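      <p>A sketch of this hierarchical inference, assuming hypothetical classifier callables (hand_clf, kind_clf, and one fine-grained classifier per branch in detail_clfs); the actual split of the six training sets follows Table 1.</p>
      <preformat>
# Hierarchical inference: two coarse predictions select the specialised
# classifier that infers the remaining part of the stroke label.
def predict_stroke(video, hand_clf, kind_clf, detail_clfs):
    hand = hand_clf(video)    # "Forehand" or "Backhand"
    kind = kind_clf(video)    # "Serve", "Offensive" or "Defensive"
    detail = detail_clfs[(hand, kind)](video)
    return f"{kind} {hand} {detail}"   # e.g. "Offensive Forehand Loop"
      </preformat>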
    </sec>
    <sec id="sec-13">
      <title>3.4 Fourth run - Run 07</title>
      <p>Building on the impressive performance of the CSN method (run 03),
we modify the model to solve a multi-label classification task. By our
observation, the 20 classes can be split into three separate label groups:
Offensive/Defensive/Serve, Forehand/Backhand,
and Loop/Backspin/Sidespin/Topspin/Hit/Push/Flip/Block.
This idea makes our model learn the partial labels and reduces the
confusion caused by the similarity among the 20 classes. Instead of Cross
Entropy Loss, we use Binary Cross Entropy with Logits Loss to
score each label component independently.</p>
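      <p>A small sketch of this multi-label setup; the 3 + 2 + 8 = 13 component labels follow the split above, and the tensors stand in for actual model outputs and ground truth.</p>
      <preformat>
# Each video gets a multi-hot target over 13 component labels, scored
# independently with binary cross-entropy on the logits.
import torch
import torch.nn as nn

N_LABELS = 13                          # 3 + 2 + 8 component labels
criterion = nn.BCEWithLogitsLoss()

logits = torch.randn(4, N_LABELS)      # stand-in for CSN model outputs
targets = torch.zeros(4, N_LABELS)     # multi-hot ground truth
targets[0, [0, 3, 5]] = 1.0            # e.g. Offensive + Forehand + Loop
loss = criterion(logits, targets)
      </preformat>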
      <p>After the training phase, we post-process the predictions for each
video. Among the predictions with positive scores, if the top 3
highest-scoring labels form a combination that exists among the 20 classes,
we use those 3 as a reliable result. In case the predicted
combination is not among the 20 labels, we take the top 5 components
(either positive or negative) and choose the combination with the highest
total score; such a result is unreliable and needs to be taken into
consideration in the ensemble process. Using multi-label classification,
we achieve 97% mAP ( ≈ 89.51% top-1 score) on the validation set. Another
key idea of run 07 is that we try to ensemble it with run 03 on the
Serve activities.</p>
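      <p>A sketch of this post-processing rule under our reading of it; valid_classes is a hypothetical mapping from component-label triples to the 20 original classes.</p>
      <preformat>
# Accept the top-3 components if they form a valid class; otherwise search
# the top-5 components for the valid combination with the highest total.
from itertools import combinations

def postprocess(scores, labels, valid_classes):
    # scores: per-component scores; labels: component-label names.
    ranked = sorted(zip(scores, labels), reverse=True)
    top3 = frozenset(lab for _, lab in ranked[:3])
    if top3 in valid_classes:
        return valid_classes[top3], True          # reliable prediction
    best = None
    for combo in combinations(ranked[:5], 3):
        key = frozenset(lab for _, lab in combo)
        total = sum(s for s, _ in combo)
        if key in valid_classes and (best is None or total > best[0]):
            best = (total, valid_classes[key])
    return (best[1], False) if best else (None, False)  # unreliable
      </preformat>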
    </sec>
    <sec id="sec-13a">
      <title>3.5 Fifth run - Run 08</title>
      <p>In this run, we also consider the task as a multi-label
classification problem and design our pipeline to assign multiple labels
to each video. We split the combined original label into multiple labels
as in run 07, with a minor difference: for instance, Defensive
Backhand Backspin is split into Defensive Backhand and Backspin.
Our pipeline consists of three modified TSTCNN models (mentioned
in Section 2.4) with the same architecture, whose outputs are the two
split labels and the original label, respectively.</p>
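      <p>A minimal sketch of this pipeline; the three model callables are hypothetical, and since the text does not specify how the three outputs are fused, the sketch simply returns all of them.</p>
      <preformat>
# Three TSTCNN models with the same architecture predict, respectively,
# the two split labels and the original 20-class label.
def run08_predict(video, split1_model, split2_model, full_model):
    return (
        split1_model(video),  # e.g. "Defensive Backhand"
        split2_model(video),  # e.g. "Backspin"
        full_model(video),    # one of the 20 original stroke classes
    )
      </preformat>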
    </sec>
    <sec id="sec-14">
      <title>3.6 Results</title>
    </sec>
    <sec id="sec-15">
      <title>CONCLUSION AND FUTURE WORKS</title>
      <p>In conclusion, we benchmark many different approaches on the
pre-processed TTStroke-21 dataset during the MediaEval Challenge
2020. One of our submissions achieves the best result in terms of
global accuracy, 31.35%, compared to the submissions of
all other teams. For future work, we aim to extract 3D human
meshes from each frame of the videos in order to obtain better
classification results. The mesh could be rotated to different angles,
which would help our model learn more efficiently.</p>
    </sec>
    <sec id="sec-16">
      <title>ACKNOWLEDGMENTS</title>
      <p>This research is supported by the Vingroup Innovation Foundation (VINIF)
under project code VINIF.2019.DA19. We would like to give special
thanks to Mr. Huu-Quoc Hoang (Ho Chi Minh City University of
Technology), who supported us in examining the dataset and gave
us advice.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Rıza</given-names>
            <surname>Alp</surname>
          </string-name>
          <string-name>
            <surname>Güler</surname>
          </string-name>
          , Natalia Neverova, and
          <string-name>
            <given-names>Iasonas</given-names>
            <surname>Kokkinos</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Densepose: Dense human pose estimation in the wild</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          .
          <fpage>7297</fpage>
          -
          <lpage>7306</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Joao</given-names>
            <surname>Carreira</surname>
          </string-name>
          and
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset</article-title>
          . (
          <year>2018</year>
          ).
          <source>arXiv:cs.CV/1705.07750</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <surname>Ming-Wei</surname>
            <given-names>Chang</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Bert: Pre-training of deep bidirectional transformers for language understanding</article-title>
          . arXiv preprint arXiv:
          <year>1810</year>
          .
          <volume>04805</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Christoph</given-names>
            <surname>Feichtenhofer</surname>
          </string-name>
          , Haoqi Fan, Jitendra Malik, and
          <string-name>
            <given-names>Kaiming</given-names>
            <surname>He</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Slowfast networks for video recognition</article-title>
          .
          <source>In Proceedings of the IEEE international conference on computer vision</source>
          . 6202-
          <fpage>6211</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Kensho</given-names>
            <surname>Hara</surname>
          </string-name>
          , Hirokatsu Kataoka, and
          <string-name>
            <given-names>Yutaka</given-names>
            <surname>Satoh</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet?</article-title>
          .
          <source>In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition</source>
          .
          <fpage>6546</fpage>
          -
          <lpage>6555</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M</given-names>
            <surname>Kalfaoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Sinan</given-names>
            <surname>Kalkan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A Aydin</given-names>
            <surname>Alatan</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Late Temporal Modeling in 3D CNN Architectures with BERT for Action Recognition</article-title>
          . arXiv preprint arXiv:
          <year>2008</year>
          .
          <volume>01232</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Min</given-names>
            <surname>Lin</surname>
          </string-name>
          , Qiang
          <string-name>
            <surname>Chen</surname>
            , and
            <given-names>Shuicheng</given-names>
          </string-name>
          <string-name>
            <surname>Yan</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Network in network</article-title>
          .
          <source>arXiv preprint arXiv:1312.4400</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Pierre-Etienne</surname>
            <given-names>Martin</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jenny</surname>
          </string-name>
          Benois-Pineau, Boris Mansencal, Renaud Péteri, Laurent Mascarilla,
          <string-name>
            <surname>Jordan Calandre</surname>
            , and
            <given-names>Julien</given-names>
          </string-name>
          <string-name>
            <surname>Morlier</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Sports Video Classification: Classification of Strokes in Table Tennis for MediaEval 2020</article-title>
          .
          <source>In Proc. of the MediaEval 2020 Workshop</source>
          , Online,
          <fpage>14</fpage>
          -
          <lpage>15</lpage>
          December
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Pierre-Etienne</surname>
            <given-names>Martin</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jenny</surname>
          </string-name>
          Benois-Pineau, Boris Mansencal, Renaud Péteri, Laurent Mascarilla,
          <string-name>
            <surname>Jordan Calandre</surname>
            , and
            <given-names>Julien</given-names>
          </string-name>
          <string-name>
            <surname>Morlier</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Sports Video Annotation: Detection of Strokes in Table Tennis task for MediaEval 2019</article-title>
          . In MediaEval 2019 Workshop.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Renaud</given-names>
            <surname>Péteri Julien Morlier Pierre-Etienne</surname>
          </string-name>
          <string-name>
            <surname>Martin</surname>
          </string-name>
          ,
          <source>Jenny BenoisPineau</source>
          .
          <year>2020</year>
          .
          <article-title>Fine grained sport action recognition with Twin spatiotemporal convolutional neural networks</article-title>
          . (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Du</surname>
            <given-names>Tran</given-names>
          </string-name>
          , Heng Wang,
          <string-name>
            <surname>Lorenzo Torresani</surname>
            , and
            <given-names>Matt</given-names>
          </string-name>
          <string-name>
            <surname>Feiszli</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Video classification with channel-separated convolutional networks</article-title>
          .
          <source>In Proceedings of the IEEE International Conference on Computer Vision</source>
          . 5552-
          <fpage>5561</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Du</surname>
            <given-names>Tran</given-names>
          </string-name>
          , Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and
          <string-name>
            <given-names>Manohar</given-names>
            <surname>Paluri</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>A closer look at spatiotemporal convolutions for action recognition</article-title>
          .
          <source>In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition</source>
          .
          <fpage>6450</fpage>
          -
          <lpage>6459</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>