<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Learning Unbiased Transformer for Long-Tail Sports Action Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yijun Qian</string-name>
          <email>yijunqian@cmu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lijun Yu</string-name>
          <email>lijun@cmu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wenhe Liu</string-name>
          <email>wenhel@cs.cmu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexander G. Hauptmann</string-name>
          <email>alex@cs.cmu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Language Technologies Institute, Carnegie Mellon University</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <fpage>13</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>The Sports Video Task in the MediaEval 2021 Challenge contains two subtasks, detection and classification. The classification subtask aims to classify different strokes in table tennis segments. These strokes are fine-grained actions and difficult to distinguish. To solve this challenge, we, the INF Team, proposed a fine-grained action classification pipeline with a SWIN-Transformer and a combination of optimization techniques. According to the evaluation results, our best submission ranks first with 74.21% accuracy and significantly outperforms the runner-up (74.21% vs. 68.78%).</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Action classification has been a popular topic in computer vision and
can be widely applied in real-world applications. Recent years
have witnessed many successful works on action classification [
        <xref ref-type="bibr" rid="ref12 ref6 ref9">6,
9, 12</xref>
        ]. The recent improvements of these methods can largely be
attributed to advances in temporal modeling capacity.
Different from previous 2D-stream CNN or 3D-CNN
methods, [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] factorizes the 3D spatial-temporal convolution into a
2D spatial convolution and a 1D temporal convolution. TRM [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]
directly replaces the convolution operation with a temporal relocation
operation to give 2D CNNs spatial-temporal modeling
capability with a temporal receptive field equivalent to the whole
input video clip. Given the recent success of
transformer [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] based methods in image-level computer vision tasks (e.g.,
ViT [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] for image classification), Video SWIN-Transformer (VST) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
proposed a transformer-based video feature extractor model and
surpassed previous CNN-based SOTAs by noticeable margins on
multiple action recognition benchmarks. However, directly
applying the VST model to the dataset of the sports video classification
task in the 2021 MediaEval Challenge is not the optimal solution.
Different from other action classification benchmarks [
        <xref ref-type="bibr" rid="ref11 ref4 ref7">4, 7, 11</xref>
        ],
the Sports Video Classification Task [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] of the 2021 MediaEval
Challenge focuses specifically on strokes within table tennis segments.
These strokes are fine-grained actions that are visually similar and
take place in a limited set of scenes. Meanwhile, the training samples
are quite limited, and the dataset is severely long-tail distributed.
Without specially designed techniques, the model easily overfits
and becomes biased toward strokes of head classes. To address this, we
implemented Background Erasing [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] which prevents the model
from overfitting to background regions. We also proposed a
sample-balanced cross-entropy loss for model optimization on the
long-tail distributed dataset.
      </p>
    </sec>
    <sec id="sec-2">
      <title>APPROACH</title>
    </sec>
    <sec id="sec-3">
      <title>Implementation of VST Model</title>
      <p>
        Unless otherwise mentioned, all our reported results use VST-B [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
as the backbone extractor. Specifically, the channel number of the
hidden layers in the first stage is 128. The window size is set to P = 8
and M = 7. The query dimension of each head is d = 32, and the
expansion ratio of each MLP is set to α = 4. The layer numbers of
the four stages are {2, 2, 18, 2}. The model is initialized with weights
pretrained on Kinetics-600 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. We employ an SGD optimizer with
a plateau scheduler and train the model for 30 epochs. We use rank-1
accuracy as the monitoring metric of the plateau scheduler, and the patience is
set to 1. During the training stage, the input frames are first resized to
256×256, then randomly cropped to 224×224 for data augmentation.
In the evaluation stage, the input frames are first resized to 256×256,
then center cropped to 224×224. For each segment, 32 frames are
evenly sampled as the input instance. Therefore, for each segment,
the size of the input sample X is 32 × 224 × 224.
      </p>
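      <p>For reference, the following is a minimal sketch of the training setup described above, written in PyTorch. The backbone constructor (torchvision's swin3d_b as a stand-in for VST-B), the learning rate, and the momentum are assumptions not specified in the text; the frame sampling, crop sizes, plateau scheduler, and patience follow the description.</p>
      <preformat>
# Hedged sketch of the input pipeline and optimizer setup; assumed values are noted.
import torch
from torchvision import transforms
from torchvision.models.video import swin3d_b  # stand-in for the VST-B backbone

def sample_frames(video, num_frames=32):
    """Evenly sample num_frames frames from a (T, C, H, W) video tensor."""
    idx = torch.linspace(0, video.shape[0] - 1, num_frames).round().long()
    return video[idx]

train_transform = transforms.Compose([
    transforms.Resize((256, 256)),   # resize each frame to 256x256
    transforms.RandomCrop(224),      # random 224x224 crop for augmentation
])
eval_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.CenterCrop(224),      # deterministic center crop at evaluation
])

model = swin3d_b(num_classes=20)     # the paper initializes from Kinetics-600 weights
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # lr/momentum assumed
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", patience=1)  # monitored on validation rank-1 accuracy

for epoch in range(30):
    # ... train for one epoch, then evaluate rank-1 accuracy on the validation set ...
    val_rank1 = 0.0                   # placeholder value
    scheduler.step(val_rank1)
      </preformat>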
    </sec>
    <sec id="sec-3-1">
      <title>Implementation of Background Erasing</title>
      <p>
        After analyzing the training set videos, we find that the scenes are quite
similar, e.g., many videos are recorded in the same scene. As a result,
the model may easily become background-biased, as reported in
[
        <xref ref-type="bibr" rid="ref16 ref17 ref18 ref5">5, 16–18</xref>
        ] and in the experiments of [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. To solve this issue, we followed
[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] to apply a background erasing algorithm in training. To be
specific, one static frame is randomly sampled from each input
segment and added to every other frames within the segment to
construct a distracting sample. Then, an MSE loss is implemented
to force the features extracted from the original clip to be similar
to those extracted from the distracting sample.
      </p>
      <p>L_BE = ∥N(x) − N(x̃)∥²
where N represents the backbone VST extractor, x represents
the original input clip, and x̃ represents the background-erased
(distracting) clip.</p>
    </sec>
    <sec id="sec-4">
      <title>Implementation of Balanced Loss</title>
      <p>As shown in Figure 1, the training dataset is severely long-tail
distributed. If all samples are evenly weighted, the model may
easily become biased toward the head classes (i.e., the classes with far
more samples than others in the training set). Thus, we use a
class-wise weight w = {w_1, w_2, ..., w_C} to balance samples of different
strokes.</p>
      <p>w_i = 1 / n_i
where n_i represents the i-th stroke's number of training samples,
and C represents the number of strokes (20 here). The overall loss
function for optimization becomes:</p>
      <p>L_CE(x) = −w_y log( exp(f(N(x))_y) / Σ_j exp(f(N(x))_j) )</p>
      <p>L_CE = Σ_x L_CE(x)</p>
      <p>L = α L_CE + β L_BE
where f represents the MLP classifier with dropout layers that
projects the extracted video feature to a vector of class probabilities,
and y is the ground-truth stroke label of clip x. Unless
otherwise mentioned, we set α = 1 and β = 1 for all results in
this report.</p>
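      <p>The following sketch illustrates the balanced objective: inverse-frequency class weights, a weighted cross-entropy over the classifier output, and the combination with the background-erasing loss. The classifier layer sizes, dropout rate, and feature dimension are assumptions; the weighting scheme and α = β = 1 follow the text.</p>
      <preformat>
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 20  # number of stroke classes

def class_weights(samples_per_class):
    """Inverse-frequency weight w_i = 1 / n_i for each stroke class."""
    n = torch.tensor(samples_per_class, dtype=torch.float)
    return 1.0 / n.clamp(min=1.0)

# "f": MLP classifier with dropout (hidden size and dropout rate are assumptions).
classifier = nn.Sequential(
    nn.Dropout(0.5),
    nn.Linear(1024, 512),
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(512, NUM_CLASSES),
)

def total_loss(feat_orig, feat_dist, labels, weights, alpha=1.0, beta=1.0):
    """feat_*: (B, D) backbone features; labels: (B,) ground-truth stroke indices."""
    logits = classifier(feat_orig)
    l_ce = F.cross_entropy(logits, labels, weight=weights)  # sample-balanced CE
    l_be = F.mse_loss(feat_orig, feat_dist)                 # background-erasing loss
    return alpha * l_ce + beta * l_be
      </preformat>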
    </sec>
    <sec id="sec-5">
      <title>RESULTS AND ANALYSIS</title>
      <p>As shown in Table 1, we report the performance of our three
submissions on both the self-evaluated validation set and the official
hidden test set. Comparing Run1 and Run3, we find that
the implementation of the balanced loss brings a 3.41% improvement
on the validation set and a 2.71% improvement on the test set. This shows
that the balanced loss can improve the final performance by
forcing the model to pay more attention to tail classes and less
attention to head classes. It may also help in similar tasks [8, 10, 15].
Comparing Run2 and Run3, we find that the usage
of background erasing significantly improves the performance on
both the validation set (7.44%) and the test set (8.15%).
Further analysis suggests the model distinguishes serve vs. offensive
and serve vs. defensive strokes well. However, it does not perform
as well on offensive vs. defensive. We suggest that the
0-1 classification of such sub-group attributes be included in next
year's challenge as an extra metric. Meanwhile, we find that several strokes
(e.g., Serve Backhand Loop and Serve Backhand Sidespin) never appear
in the training or validation sets. Although the balanced loss can alleviate
the classifier's bias toward head classes to some extent, the number of
samples for several strokes (e.g., Serve Forehand Loop) is still too small
to train a robust model. Thus, we hope the dataset can be re-split
or augmented for next year's challenge. Finally, we did not use both the
training and validation samples for the final submission; we will try that next
year to see whether the performance improves. Meanwhile, we also
expect that initializing with weights pretrained on large fine-grained
action recognition datasets may bring further improvements.</p>
    </sec>
    <sec id="sec-6">
      <title>ACKNOWLEDGMENTS</title>
      <p>This research is supported in part by the Intelligence Advanced
Research Projects Activity (IARPA) via Department of Interior/Interior
Business Center (DOI/IBC) contract number D17PC00340. This
research is supported in part through the financial assistance award
60NANB17D156 from U.S. Department of Commerce, National
Institute of Standards and Technology. This project is funded in part
by Carnegie Mellon University’s Mobility21 National University
Transportation Center, which is sponsored by the US Department
of Transportation.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Joao</given-names>
            <surname>Carreira</surname>
          </string-name>
          and
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Quo vadis, action recognition? a new model and the kinetics dataset</article-title>
          .
          <source>In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          .
          <fpage>6299</fpage>
          -
          <lpage>6308</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Jinwoo</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Chen</given-names>
            <surname>Gao</surname>
          </string-name>
          , Joseph CE Messou, and
          <string-name>
            <surname>Jia-Bin Huang</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Why Can't I Dance in the Mall? Learning to Mitigate Scene Bias in Action Recognition</article-title>
          . arXiv preprint arXiv:
          <year>1912</year>
          .
          <volume>05534</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Alexey</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          , Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and
          <string-name>
            <given-names>Neil</given-names>
            <surname>Houlsby</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale</article-title>
          .
          <source>ICLR</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Kuehne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jhuang</surname>
          </string-name>
          , E. Garrote,
          <string-name>
            <given-names>T.</given-names>
            <surname>Poggio</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Serre</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>HMDB: a large video database for human motion recognition</article-title>
          .
          <source>In Proceedings of the International Conference on Computer Vision</source>
          (ICCV).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Wenhe</given-names>
            <surname>Liu</surname>
          </string-name>
          , Guoliang Kang,
          <string-name>
            <surname>Po-Yao</surname>
            <given-names>Huang</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Xiaojun</given-names>
            <surname>Chang</surname>
          </string-name>
          , Yijun Qian, Junwei Liang, Liangke Gui, Jing Wen, and
          <string-name>
            <given-names>Peng</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Argus: Efficient activity detection system for extended video analysis</article-title>
          .
          <source>In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision Workshops</source>
          .
          <fpage>126</fpage>
          -
          <lpage>133</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Ze</given-names>
            <surname>Liu</surname>
          </string-name>
          , Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and
          <string-name>
            <given-names>Han</given-names>
            <surname>Hu</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>Video swin transformer</article-title>
          .
          <source>arXiv preprint arXiv:2106.13230</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Pierre-Etienne</surname>
            <given-names>Martin</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jordan Calandre</surname>
          </string-name>
          , Boris Mansencal, Jenny Benois-Pineau, Renaud Péteri, Laurent Mascarilla, and
          <string-name>
            <given-names>Julien</given-names>
            <surname>Morlier</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>Sports Video: Fine-Grained Action Detection and Classification of Table Tennis Strokes from videos for MediaEval 2021</article-title>
          . (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Yijun</given-names>
            <surname>Qian</surname>
          </string-name>
          , Lijun Yu, Wenhe Liu, and Alexander G Hauptmann.
          <year>2020</year>
          .
          <article-title>Electricity: An efficient multi-camera vehicle tracking system for intelligent city</article-title>
          .
          <source>In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops</source>
          .
          <fpage>588</fpage>
          -
          <lpage>589</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Yijun</given-names>
            <surname>Qian</surname>
          </string-name>
          , Lijun Yu, Wenhe Liu, and
          <string-name>
            <surname>Alexander</surname>
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Hauptmann</surname>
          </string-name>
          .
          <year>2022</year>
          .
          <article-title>TRM: Temporal Relocation Module for Video Recognition</article-title>
          .
          <source>In Proceedings of the IEEE Winter Conference on Applications of Computer Vision</source>
          Workshops.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Yijun</surname>
            <given-names>Qian</given-names>
          </string-name>
          , Lijun Yu, Wenhe Liu, Guoliang Kang, and Alexander G Hauptmann.
          <year>2020</year>
          .
          <article-title>Adaptive feature aggregation for video object detection</article-title>
          .
          <source>In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision Workshops</source>
          .
          <fpage>143</fpage>
          -
          <lpage>147</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Khurram</surname>
            <given-names>Soomro</given-names>
          </string-name>
          , Amir Roshan Zamir, and
          <string-name>
            <given-names>Mubarak</given-names>
            <surname>Shah</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>UCF101: A dataset of 101 human actions classes from videos in the wild</article-title>
          .
          <source>arXiv preprint arXiv:1212.0402</source>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Du</surname>
            <given-names>Tran</given-names>
          </string-name>
          , Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and
          <string-name>
            <given-names>Manohar</given-names>
            <surname>Paluri</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>A closer look at spatiotemporal convolutions for action recognition</article-title>
          .
          <source>In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition</source>
          .
          <fpage>6450</fpage>
          -
          <lpage>6459</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Ashish</surname>
            <given-names>Vaswani</given-names>
          </string-name>
          , Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
          <string-name>
            <surname>Łukasz Kaiser</surname>
            , and
            <given-names>Illia</given-names>
          </string-name>
          <string-name>
            <surname>Polosukhin</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Attention is all you need</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          .
          <volume>5998</volume>
          -
          <fpage>6008</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Jinpeng</surname>
            <given-names>Wang</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yuting Gao</surname>
            ,
            <given-names>Ke</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Yiqi</given-names>
          </string-name>
          <string-name>
            <surname>Lin</surname>
            , Andy J Ma, Hao Cheng, Pai Peng, Feiyue Huang,
            <given-names>Rongrong</given-names>
          </string-name>
          <string-name>
            <surname>Ji</surname>
            , and
            <given-names>Xing</given-names>
          </string-name>
          <string-name>
            <surname>Sun</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>Removing the Background by Adding the Background: Towards Background Robust Self-supervised Video Representation Learning</article-title>
          .
          <source>In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          .
          <fpage>11804</fpage>
          -
          <lpage>11813</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Lijun</surname>
            <given-names>Yu</given-names>
          </string-name>
          , Qianyu Feng, Yijun Qian, Wenhe Liu, and
          <string-name>
            <surname>Alexander</surname>
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Hauptmann</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Zero-VIRUS: Zero-Shot VehIcle Route Understanding System for Intelligent Transportation</article-title>
          .
          <fpage>594</fpage>
          -
          <lpage>595</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Lijun</surname>
            <given-names>Yu</given-names>
          </string-name>
          , Yijun Qian, Wenhe Liu, and Alexander G Hauptmann.
          <source>CMU Informedia at TRECVID</source>
          <year>2020</year>
          :
          <article-title>Activity Detection with Dense Spatiotemporal Proposals</article-title>
          .
          <source>In TRECVID</source>
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Lijun</surname>
            <given-names>Yu</given-names>
          </string-name>
          , Yijun Qian, Wenhe Liu, and Alexander G Hauptmann.
          <source>CMU Informedia at TRECVID</source>
          <year>2021</year>
          :
          <article-title>Activity Detection with Argus++</article-title>
          .
          <source>In TRECVID</source>
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Lijun</surname>
            <given-names>Yu</given-names>
          </string-name>
          , Yijun Qian, Wenhe Liu, and
          <string-name>
            <surname>Alexander</surname>
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Hauptmann</surname>
          </string-name>
          .
          <year>2022</year>
          .
          <article-title>Argus++: Robust Real-time Activity Detection for Unconstrained Video Streams with Overlapping Cube Proposals</article-title>
          .
          <source>In Proceedings of the IEEE Winter Conference on Applications of Computer Vision</source>
          Workshops.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>