Learning Unbiased Transformer for Long-Tail Sports Action Classification

Yijun Qian, Lijun Yu, Wenhe Liu, Alexander G. Hauptmann
Language Technologies Institute, Carnegie Mellon University
{yijunqian,lijun}@cmu.edu, {wenhel,alex}@cs.cmu.edu

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'21, December 13-15 2021, Online.

ABSTRACT
The Sports Video Task in the MediaEval 2021 Challenge contains two subtasks, detection and classification. The classification subtask aims to classify different strokes in table tennis segments. These strokes are fine-grained actions and difficult to distinguish. To solve this challenge, we, the INF Team, propose a fine-grained action classification pipeline based on the Swin Transformer together with a combination of optimization techniques. According to the evaluation results, our best submission ranks first with 74.21% accuracy and significantly outperforms the runner-up (74.21% vs. 68.78%).

1 INTRODUCTION
Action classification has been an active topic in computer vision and is widely applicable in real-world systems. Recent years have witnessed many successful works on action classification [6, 9, 12], whose improvements can largely be attributed to stronger temporal modeling capacity. Different from earlier 2D two-stream CNNs and 3D-CNN methods, [12] factorizes the 3D spatial-temporal convolution into a 2D spatial convolution and a 1D temporal convolution. TRM [9] directly replaces the convolution operation with a temporal relocation operation, giving 2D CNNs spatial-temporal modeling with a temporal receptive field equivalent to the whole input video clip. Given the recent success of transformer-based [13] methods in image-level computer vision tasks (e.g., ViT [3] for image classification), the Video Swin Transformer (VST) [6] introduced a transformer-based video feature extractor and surpassed previous CNN-based state-of-the-art models by noticeable margins on multiple action recognition benchmarks. However, directly applying the VST model to the sports video classification task of the 2021 MediaEval Challenge is not optimal. Different from other action classification benchmarks [4, 7, 11], the Sports Video Classification Task [7] of the 2021 MediaEval Challenge focuses specifically on strokes within table tennis segments. These strokes are fine-grained actions that are visually similar and take place in a limited set of scenes. Meanwhile, the training samples are scarce, and the dataset is severely long-tail distributed. Without specially designed techniques, the model easily overfits and becomes biased toward the head-class strokes. To address this, we implement Background Erasing [14], which prevents the model from overfitting to background regions, and we propose a sample-balanced cross-entropy loss for model optimization on the long-tail distributed dataset.

2 APPROACH
2.1 Implementation of VST Model
Unless otherwise mentioned, all reported results use VST-B [6] as the backbone extractor. Specifically, the channel number of the hidden layers in the first stage is 128. The window size is set to P = 8 and M = 7. The query dimension of each head is d = 32, and the expansion ratio of each MLP is set to α = 4. The layer numbers of the four stages are {2, 2, 18, 2}. The model is initialized with weights pretrained on Kinetics-600 [1]. We employ an SGD optimizer with a plateau scheduler and train the model for 30 epochs; the scheduler monitors rank-1 accuracy with a patience of 1. During training, the input frames are first resized to 256×256 and then randomly cropped to 224×224 for data augmentation. During evaluation, the input frames are first resized to 256×256 and then center cropped to 224×224. For each segment, 32 frames are evenly sampled as the input instance, so the size of each input sample is 32×224×224.
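For concreteness, the following is a minimal PyTorch sketch of the clip sampling, cropping, and optimizer/scheduler setup described above. It is not the authors' released implementation: the backbone and dataloader are omitted, the transforms are applied per frame, and values not stated in this report (e.g., learning rate and momentum) are placeholders.

```python
# Sketch of Section 2.1: uniform 32-frame sampling, resize/crop transforms,
# and SGD with a rank-1-accuracy-monitored plateau scheduler (patience 1).
import numpy as np
import torch
from torchvision import transforms

NUM_FRAMES = 32  # frames evenly sampled per segment

def sample_frame_indices(num_total_frames: int, num_out: int = NUM_FRAMES) -> np.ndarray:
    """Evenly sample `num_out` frame indices from a segment."""
    return np.linspace(0, num_total_frames - 1, num_out).round().astype(int)

# Training: resize each frame to 256x256, then take a random 224x224 crop.
train_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomCrop(224),
])

# Evaluation: resize each frame to 256x256, then take a center 224x224 crop.
eval_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.CenterCrop(224),
])

def build_optimizer(model: torch.nn.Module):
    # Learning rate and momentum are placeholders; the report does not state them.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    # Plateau scheduler stepped with rank-1 validation accuracy (mode="max"), patience 1.
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max", patience=1)
    return optimizer, scheduler
```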
2.2 Implementation of Background Erasing
After analyzing the training videos, we find that the scenes are quite similar; for example, many videos are recorded in the same scene. As a result, the model may easily become background biased, as reported in [5, 16-18] and in the experiments of [2]. To address this issue, we follow [14] and apply a background erasing algorithm during training. Specifically, one static frame is randomly sampled from each input segment and added to every other frame within the segment to construct a distracting sample. An MSE loss then forces the features extracted from the original clip to be similar to those extracted from the distracting sample:

    \mathcal{L}_{mse} = \|\mathcal{N}(V_{org}) - \mathcal{N}(V_{be})\|^2    (1)

where \mathcal{N} represents the backbone VST extractor, V_{org} the original input clip, and V_{be} the background-erased clip.
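The sketch below illustrates one way to build the distracting clip and compute the feature-consistency term of Eq. (1). The mixing coefficient `lam` and any re-normalization of pixel values are assumptions of ours, not details given in this report or guaranteed to match [14]; `extractor` stands for any module mapping a clip tensor to a feature vector.

```python
# Sketch of Section 2.2: add a randomly chosen static frame to every frame of the
# clip to build a distracting sample, then penalize the MSE between the features
# of the original and distracted clips (Eq. 1).
import torch
import torch.nn.functional as F

def build_distracting_clip(clip: torch.Tensor, lam: float = 0.5) -> torch.Tensor:
    """clip: (B, C, T, H, W). Add one random static frame to all frames.
    `lam` and any clipping/normalization of the result are our assumptions."""
    b, c, t, h, w = clip.shape
    idx = torch.randint(0, t, (1,)).item()        # index of the static frame
    static = clip[:, :, idx:idx + 1]              # (B, C, 1, H, W), broadcast over time
    return clip + lam * static

def background_erasing_loss(extractor: torch.nn.Module, clip: torch.Tensor) -> torch.Tensor:
    feat_org = extractor(clip)                          # N(V_org)
    feat_be = extractor(build_distracting_clip(clip))   # N(V_be)
    return F.mse_loss(feat_org, feat_be)                # L_mse in Eq. (1)
```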
2.3 Implementation of Balanced Loss
As shown in Figure 1, the training dataset is severely long-tail distributed. If all samples are weighted evenly, the model may easily become biased toward the head classes (i.e., the classes with many more training samples than others). We therefore use class-wise weights W_s = {w_s^1, w_s^2, ..., w_s^n} to balance samples of different strokes:

    \hat{w}_s^i = 1 / N^i    (2)

    w_s^i = n \cdot \hat{w}_s^i / \sum_i \hat{w}_s^i    (3)

where N^i is the number of training samples of the i-th stroke and n is the number of strokes (20 here). The overall loss function for optimization becomes:

    \mathcal{L}_{xe}^i = -w_s^i \log\left( \exp(\phi(\mathcal{N}(x_n^i))) / \sum_j \exp(\phi(\mathcal{N}(x_n^j))) \right)    (4)

    \mathcal{L}_{xe} = \frac{1}{n} \sum_i \mathcal{L}_{xe}^i    (5)

    \mathcal{L} = \alpha \mathcal{L}_{mse} + \beta \mathcal{L}_{xe}    (6)

where \phi represents the MLP classifier with dropout layers that projects the extracted video feature to a vector of class probabilities. Unless otherwise mentioned, we set α = 1 and β = 1 for all results in this report.

Figure 1: The number of segments for training varies among different strokes. In particular, there are no training samples of Serve Backhand Loop and Serve Backhand Sidespin.
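A minimal sketch of the class-balanced weights of Eqs. (2)-(3) and the combined objective of Eq. (6) is given below. PyTorch's weighted cross entropy is used as a close stand-in for Eqs. (4)-(5); the clamp guarding empty classes and the example class counts are our own additions, not values from the dataset.

```python
# Sketch of Section 2.3: inverse-frequency class weights rescaled to sum to n,
# plugged into a weighted cross entropy and combined with the MSE term.
import torch
import torch.nn.functional as F

def balanced_class_weights(samples_per_class: torch.Tensor) -> torch.Tensor:
    """Eqs. (2)-(3). `samples_per_class` holds N^i for each of the n strokes."""
    w_hat = 1.0 / samples_per_class.clamp(min=1).float()   # Eq. (2); clamp guards classes with 0 samples
    return len(samples_per_class) * w_hat / w_hat.sum()    # Eq. (3)

def total_loss(logits: torch.Tensor, labels: torch.Tensor, weights: torch.Tensor,
               mse_loss: torch.Tensor, alpha: float = 1.0, beta: float = 1.0) -> torch.Tensor:
    # Weighted cross entropy approximating Eqs. (4)-(5) (PyTorch normalizes by the weight sum).
    xe = F.cross_entropy(logits, labels, weight=weights)
    # Eq. (6): combine with the background-erasing MSE term.
    return alpha * mse_loss + beta * xe

# Usage with 20 stroke classes (hypothetical per-class counts):
weights = balanced_class_weights(torch.tensor([120, 80, 5, 0, 33] + [10] * 15))
```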
3 RESULTS AND ANALYSIS
As shown in Table 1, we report the performance of our three submissions on both the self-evaluated validation set and the official hidden test set. Comparing Run1 and Run3, the balanced loss brings a 3.41% improvement on the validation set and a 2.71% improvement on the test set. This shows that class-balanced weighting improves the final performance by forcing the model to pay more attention to tail classes and less attention to head classes; it may also benefit similar tasks [8, 10, 15]. Comparing Run2 and Run3, background erasing significantly improves the performance on both the validation set (7.44%) and the test set (8.15%).

Table 1: Results of the CMU INF Team in the Sports Classification Task of the 2021 MediaEval Challenge

Run ID  System Spec                  Val Acc %  Test Acc %
Run1    swin-transformer             63.40      63.35
Run3    Run1 + balanced loss         67.81      66.06
Run2    Run3 + background erasing    75.25      74.21

Figure 2: Confusion matrix among sub-group attributes of the Run2 submission.

4 DISCUSSION AND OUTLOOK
The strokes in the sports classification task form several sub-group attribute pairs (e.g., Defensive vs. Offensive and Forehand vs. Backhand). So besides the global accuracy, we also analyze the confusion matrix of these sub-group attributes. As shown in Figure 2, our system successfully distinguishes similar attribute pairs such as forehand vs. backhand, serve vs. offensive, and serve vs. defensive. However, it does not perform as well on offensive vs. defensive. We suggest that binary classification of sub-group attributes be included as an extra metric in next year's challenge. Meanwhile, we find that several strokes (e.g., Serve Backhand Loop and Serve Backhand Sidespin) never appear in the training or validation sets. Although the balanced loss can relieve the classifier's bias toward head classes to some extent, the number of samples for several strokes (e.g., Serve Forehand Loop) is still too small to train a robust model. Thus, we hope the dataset can be re-split or augmented for next year's challenge. Finally, we did not train on the combined train and validation sets for the final submission; we will try this next year to see whether performance improves. We also expect that initializing with weights pretrained on large fine-grained action recognition datasets may bring further improvements.

ACKNOWLEDGMENTS
This research is supported in part by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/Interior Business Center (DOI/IBC) contract number D17PC00340. This research is supported in part through the financial assistance award 60NANB17D156 from the U.S. Department of Commerce, National Institute of Standards and Technology. This project is funded in part by Carnegie Mellon University's Mobility21 National University Transportation Center, which is sponsored by the US Department of Transportation.

REFERENCES
[1] Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6299-6308.
[2] Jinwoo Choi, Chen Gao, Joseph C. E. Messou, and Jia-Bin Huang. 2019. Why Can't I Dance in the Mall? Learning to Mitigate Scene Bias in Action Recognition. arXiv preprint arXiv:1912.05534 (2019).
[3] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR (2021).
[4] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. 2011. HMDB: A large video database for human motion recognition. In Proceedings of the International Conference on Computer Vision (ICCV).
[5] Wenhe Liu, Guoliang Kang, Po-Yao Huang, Xiaojun Chang, Yijun Qian, Junwei Liang, Liangke Gui, Jing Wen, and Peng Chen. 2020. Argus: Efficient activity detection system for extended video analysis. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision Workshops. 126-133.
[6] Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. 2021. Video Swin Transformer. arXiv preprint arXiv:2106.13230 (2021).
[7] Pierre-Etienne Martin, Jordan Calandre, Boris Mansencal, Jenny Benois-Pineau, Renaud Péteri, Laurent Mascarilla, and Julien Morlier. 2021. Sports Video: Fine-Grained Action Detection and Classification of Table Tennis Strokes from Videos for MediaEval 2021. (2021).
[8] Yijun Qian, Lijun Yu, Wenhe Liu, and Alexander G. Hauptmann. 2020. Electricity: An efficient multi-camera vehicle tracking system for intelligent city. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 588-589.
[9] Yijun Qian, Lijun Yu, Wenhe Liu, and Alexander G. Hauptmann. 2022. TRM: Temporal Relocation Module for Video Recognition. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision Workshops.
[10] Yijun Qian, Lijun Yu, Wenhe Liu, Guoliang Kang, and Alexander G. Hauptmann. 2020. Adaptive feature aggregation for video object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision Workshops. 143-147.
[11] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012).
[12] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. 2018. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6450-6459.
[13] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998-6008.
[14] Jinpeng Wang, Yuting Gao, Ke Li, Yiqi Lin, Andy J. Ma, Hao Cheng, Pai Peng, Feiyue Huang, Rongrong Ji, and Xing Sun. 2021. Removing the Background by Adding the Background: Towards Background Robust Self-supervised Video Representation Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11804-11813.
[15] Lijun Yu, Qianyu Feng, Yijun Qian, Wenhe Liu, and Alexander G. Hauptmann. 2020. Zero-VIRUS: Zero-Shot VehIcle Route Understanding System for Intelligent Transportation. 594-595.
[16] Lijun Yu, Yijun Qian, Wenhe Liu, and Alexander G. Hauptmann. CMU Informedia at TRECVID 2020: Activity Detection with Dense Spatio-temporal Proposals. In TRECVID 2020.
[17] Lijun Yu, Yijun Qian, Wenhe Liu, and Alexander G. Hauptmann. CMU Informedia at TRECVID 2021: Activity Detection with Argus++. In TRECVID 2021.
[18] Lijun Yu, Yijun Qian, Wenhe Liu, and Alexander G. Hauptmann. 2022. Argus++: Robust Real-time Activity Detection for Unconstrained Video Streams with Overlapping Cube Proposals. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision Workshops.