Learning Unbiased Transformer for Long-Tail Sports Action Classification

Yijun Qian, Lijun Yu, Wenhe Liu, Alexander G. Hauptmann
Language Technologies Institute, Carnegie Mellon University
{yijunqian,lijun}@cmu.edu, {wenhel,alex}@cs.cmu.edu

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'21, December 13-15 2021, Online.

ABSTRACT
The Sports Video Task in the MediaEval 2021 Challenge contains two subtasks, detection and classification. The classification subtask aims to classify different strokes in table tennis segments. These strokes are fine-grained actions and difficult to distinguish. To solve this challenge, we, the INF Team, propose a fine-grained action classification pipeline based on the Swin Transformer together with a combination of optimization techniques. According to the evaluation results, our best submission ranks first with 74.21% accuracy and significantly outperforms the runner-up (74.21% vs. 68.78%).

1 INTRODUCTION
Action classification has been an active topic in computer vision and is widely applicable in real-world systems. Recent years have witnessed many successful works on action classification [6, 9, 12], whose improvements can largely be attributed to stronger temporal modeling capacity. Different from earlier 2D two-stream CNNs and 3D-CNN methods, [12] factorizes the 3D spatial-temporal convolution into a 2D spatial convolution and a 1D temporal convolution. TRM [9] directly replaces the convolution operation with a temporal relocation operation, giving 2D CNNs spatial-temporal modeling with a temporal receptive field equivalent to the whole input video clip. Given the recent success of transformer-based [13] methods in image-level computer vision tasks (e.g., ViT [3] for image classification), the Video Swin Transformer (VST) [6] introduced a transformer-based video feature extractor and surpassed previous CNN-based state-of-the-art models by noticeable margins on multiple action recognition benchmarks. However, directly applying the VST model to the sports video classification task of the 2021 MediaEval Challenge is not optimal. Different from other action classification benchmarks [4, 7, 11], the Sports Video Classification Task [7] of the 2021 MediaEval Challenge focuses specifically on strokes within table tennis segments. These strokes are fine-grained actions that are visually similar and take place in a limited set of scenes. Meanwhile, the training samples are scarce, and the dataset is severely long-tail distributed. Without specially designed techniques, the model easily overfits and becomes biased toward the head-class strokes. To address this, we implement Background Erasing [14], which prevents the model from overfitting to background regions, and we propose a sample-balanced cross-entropy loss for model optimization on the long-tail distributed dataset.

2 APPROACH
2.1 Implementation of VST Model
Unless otherwise mentioned, all reported results use VST-B [6] as the backbone extractor. Specifically, the channel number of the hidden layers in the first stage is 128. The window size is set to P = 8 and M = 7. The query dimension of each head is d = 32, and the expansion ratio of each MLP is set to α = 4. The layer numbers of the four stages are {2, 2, 18, 2}. The model is initialized with weights pretrained on Kinetics-600 [1]. We employ an SGD optimizer with a plateau scheduler and train the model for 30 epochs; the scheduler monitors rank-1 accuracy with a patience of 1. During training, the input frames are first resized to 256×256 and then randomly cropped to 224×224 for data augmentation. During evaluation, the input frames are first resized to 256×256 and then center cropped to 224×224. For each segment, 32 frames are evenly sampled as the input instance, so the size of each input sample is 32×224×224.
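For concreteness, the following is a minimal PyTorch sketch of the clip sampling, cropping, and optimizer/scheduler setup described above. It is not the authors' released implementation: the backbone and dataloader are omitted, the transforms are applied per frame, and values not stated in this report (e.g., learning rate and momentum) are placeholders.

```python
# Sketch of Section 2.1: uniform 32-frame sampling, resize/crop transforms,
# and SGD with a rank-1-accuracy-monitored plateau scheduler (patience 1).
import numpy as np
import torch
from torchvision import transforms

NUM_FRAMES = 32  # frames evenly sampled per segment

def sample_frame_indices(num_total_frames: int, num_out: int = NUM_FRAMES) -> np.ndarray:
    """Evenly sample `num_out` frame indices from a segment."""
    return np.linspace(0, num_total_frames - 1, num_out).round().astype(int)

# Training: resize each frame to 256x256, then take a random 224x224 crop.
train_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomCrop(224),
])

# Evaluation: resize each frame to 256x256, then take a center 224x224 crop.
eval_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.CenterCrop(224),
])

def build_optimizer(model: torch.nn.Module):
    # Learning rate and momentum are placeholders; the report does not state them.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    # Plateau scheduler stepped with rank-1 validation accuracy (mode="max"), patience 1.
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max", patience=1)
    return optimizer, scheduler
```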
2.2 Implementation of Background Erasing
After analyzing the training videos, we find that the scenes are quite similar; for example, many videos are recorded in the same scene. As a result, the model may easily become background biased, as reported in [5, 16-18] and in the experiments of [2]. To address this issue, we follow [14] and apply a background erasing algorithm during training. Specifically, one static frame is randomly sampled from each input segment and added to every other frame within the segment to construct a distracting sample. An MSE loss then forces the features extracted from the original clip to be similar to those extracted from the distracting sample:

    \mathcal{L}_{mse} = \|\mathcal{N}(V_{org}) - \mathcal{N}(V_{be})\|^2    (1)

where \mathcal{N} represents the backbone VST extractor, V_{org} the original input clip, and V_{be} the background-erased clip.
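The sketch below illustrates one way to build the distracting clip and compute the feature-consistency term of Eq. (1). The mixing coefficient `lam` and any re-normalization of pixel values are assumptions of ours, not details given in this report or guaranteed to match [14]; `extractor` stands for any module mapping a clip tensor to a feature vector.

```python
# Sketch of Section 2.2: add a randomly chosen static frame to every frame of the
# clip to build a distracting sample, then penalize the MSE between the features
# of the original and distracted clips (Eq. 1).
import torch
import torch.nn.functional as F

def build_distracting_clip(clip: torch.Tensor, lam: float = 0.5) -> torch.Tensor:
    """clip: (B, C, T, H, W). Add one random static frame to all frames.
    `lam` and any clipping/normalization of the result are our assumptions."""
    b, c, t, h, w = clip.shape
    idx = torch.randint(0, t, (1,)).item()        # index of the static frame
    static = clip[:, :, idx:idx + 1]              # (B, C, 1, H, W), broadcast over time
    return clip + lam * static

def background_erasing_loss(extractor: torch.nn.Module, clip: torch.Tensor) -> torch.Tensor:
    feat_org = extractor(clip)                          # N(V_org)
    feat_be = extractor(build_distracting_clip(clip))   # N(V_be)
    return F.mse_loss(feat_org, feat_be)                # L_mse in Eq. (1)
```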
2.3 Implementation of Balanced Loss
As shown in Figure 1, the training dataset is severely long-tail distributed. If all samples are weighted evenly, the model may easily become biased toward the head classes (i.e., the classes with many more training samples than others). We therefore use class-wise weights W_s = {w_s^1, w_s^2, ..., w_s^n} to balance samples of different strokes:

    \hat{w}_s^i = 1 / N^i    (2)

    w_s^i = n \cdot \hat{w}_s^i / \sum_i \hat{w}_s^i    (3)

where N^i is the number of training samples of the i-th stroke and n is the number of strokes (20 here). The overall loss function for optimization becomes:

    \mathcal{L}_{xe}^i = -w_s^i \log\left( \exp(\phi(\mathcal{N}(x_n^i))) / \sum_j \exp(\phi(\mathcal{N}(x_n^j))) \right)    (4)

    \mathcal{L}_{xe} = \frac{1}{n} \sum_i \mathcal{L}_{xe}^i    (5)

    \mathcal{L} = \alpha \mathcal{L}_{mse} + \beta \mathcal{L}_{xe}    (6)

where \phi represents the MLP classifier with dropout layers that projects the extracted video feature to a vector of class probabilities. Unless otherwise mentioned, we set α = 1 and β = 1 for all results in this report.

Figure 1: The number of segments for training varies among different strokes. In particular, there are no training samples of Serve Backhand Loop and Serve Backhand Sidespin.
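A minimal sketch of the class-balanced weights of Eqs. (2)-(3) and the combined objective of Eq. (6) is given below. PyTorch's weighted cross entropy is used as a close stand-in for Eqs. (4)-(5); the clamp guarding empty classes and the example class counts are our own additions, not values from the dataset.

```python
# Sketch of Section 2.3: inverse-frequency class weights rescaled to sum to n,
# plugged into a weighted cross entropy and combined with the MSE term.
import torch
import torch.nn.functional as F

def balanced_class_weights(samples_per_class: torch.Tensor) -> torch.Tensor:
    """Eqs. (2)-(3). `samples_per_class` holds N^i for each of the n strokes."""
    w_hat = 1.0 / samples_per_class.clamp(min=1).float()   # Eq. (2); clamp guards classes with 0 samples
    return len(samples_per_class) * w_hat / w_hat.sum()    # Eq. (3)

def total_loss(logits: torch.Tensor, labels: torch.Tensor, weights: torch.Tensor,
               mse_loss: torch.Tensor, alpha: float = 1.0, beta: float = 1.0) -> torch.Tensor:
    # Weighted cross entropy approximating Eqs. (4)-(5) (PyTorch normalizes by the weight sum).
    xe = F.cross_entropy(logits, labels, weight=weights)
    # Eq. (6): combine with the background-erasing MSE term.
    return alpha * mse_loss + beta * xe

# Usage with 20 stroke classes (hypothetical per-class counts):
weights = balanced_class_weights(torch.tensor([120, 80, 5, 0, 33] + [10] * 15))
```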
3 RESULTS AND ANALYSIS
As shown in Table 1, we report the performance of our three submissions on both the self-evaluated validation set and the official hidden test set. Comparing Run1 and Run3, the balanced loss brings a 3.41% improvement on the validation set and a 2.71% improvement on the test set. This shows that class-balanced weighting improves the final performance by forcing the model to pay more attention to tail classes and less attention to head classes; it may also benefit similar tasks [8, 10, 15]. Comparing Run2 and Run3, background erasing significantly improves the performance on both the validation set (7.44%) and the test set (8.15%).

Table 1: Results of the CMU INF Team in the Sports Classification Task of the 2021 MediaEval Challenge

Run ID  System Spec                  Val Acc %  Test Acc %
Run1    swin-transformer             63.40      63.35
Run3    Run1 + balanced loss         67.81      66.06
Run2    Run3 + background erasing    75.25      74.21

Figure 2: Confusion matrix among sub-group attributes of the Run2 submission.

4 DISCUSSION AND OUTLOOK
The strokes in the sports classification task form several sub-group attribute pairs (e.g., Defensive vs. Offensive and Forehand vs. Backhand). So besides the global accuracy, we also analyze the confusion matrix of these sub-group attributes. As shown in Figure 2, our system successfully distinguishes similar attribute pairs such as forehand vs. backhand, serve vs. offensive, and serve vs. defensive. However, it does not perform as well on offensive vs. defensive. We suggest that binary classification of sub-group attributes be included as an extra metric in next year's challenge. Meanwhile, we find that several strokes (e.g., Serve Backhand Loop and Serve Backhand Sidespin) never appear in the training or validation sets. Although the balanced loss can relieve the classifier's bias toward head classes to some extent, the number of samples for several strokes (e.g., Serve Forehand Loop) is still too small to train a robust model. Thus, we hope the dataset can be re-split or augmented for next year's challenge. Finally, we did not train on the combined train and validation sets for the final submission; we will try this next year to see whether performance improves. We also expect that initializing with weights pretrained on large fine-grained action recognition datasets may bring further improvements.

ACKNOWLEDGMENTS
This research is supported in part by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/Interior Business Center (DOI/IBC) contract number D17PC00340. This research is supported in part through the financial assistance award 60NANB17D156 from the U.S. Department of Commerce, National Institute of Standards and Technology. This project is funded in part by Carnegie Mellon University's Mobility21 National University Transportation Center, which is sponsored by the US Department of Transportation.

REFERENCES
[1] Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6299-6308.
[2] Jinwoo Choi, Chen Gao, Joseph C. E. Messou, and Jia-Bin Huang. 2019. Why Can't I Dance in the Mall? Learning to Mitigate Scene Bias in Action Recognition. arXiv preprint arXiv:1912.05534 (2019).
[3] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR (2021).
[4] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. 2011. HMDB: A large video database for human motion recognition. In Proceedings of the International Conference on Computer Vision (ICCV).
[5] Wenhe Liu, Guoliang Kang, Po-Yao Huang, Xiaojun Chang, Yijun Qian, Junwei Liang, Liangke Gui, Jing Wen, and Peng Chen. 2020. Argus: Efficient activity detection system for extended video analysis. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision Workshops. 126-133.
[6] Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. 2021. Video Swin Transformer. arXiv preprint arXiv:2106.13230 (2021).
[7] Pierre-Etienne Martin, Jordan Calandre, Boris Mansencal, Jenny Benois-Pineau, Renaud Péteri, Laurent Mascarilla, and Julien Morlier. 2021. Sports Video: Fine-Grained Action Detection and Classification of Table Tennis Strokes from Videos for MediaEval 2021. (2021).
[8] Yijun Qian, Lijun Yu, Wenhe Liu, and Alexander G. Hauptmann. 2020. Electricity: An efficient multi-camera vehicle tracking system for intelligent city. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 588-589.
[9] Yijun Qian, Lijun Yu, Wenhe Liu, and Alexander G. Hauptmann. 2022. TRM: Temporal Relocation Module for Video Recognition. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision Workshops.
[10] Yijun Qian, Lijun Yu, Wenhe Liu, Guoliang Kang, and Alexander G. Hauptmann. 2020. Adaptive feature aggregation for video object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision Workshops. 143-147.
[11] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012).
[12] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. 2018. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6450-6459.
[13] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998-6008.
[14] Jinpeng Wang, Yuting Gao, Ke Li, Yiqi Lin, Andy J. Ma, Hao Cheng, Pai Peng, Feiyue Huang, Rongrong Ji, and Xing Sun. 2021. Removing the Background by Adding the Background: Towards Background Robust Self-supervised Video Representation Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11804-11813.
[15] Lijun Yu, Qianyu Feng, Yijun Qian, Wenhe Liu, and Alexander G. Hauptmann. 2020. Zero-VIRUS: Zero-Shot VehIcle Route Understanding System for Intelligent Transportation. 594-595.
[16] Lijun Yu, Yijun Qian, Wenhe Liu, and Alexander G. Hauptmann. CMU Informedia at TRECVID 2020: Activity Detection with Dense Spatio-temporal Proposals. In TRECVID 2020.
[17] Lijun Yu, Yijun Qian, Wenhe Liu, and Alexander G. Hauptmann. CMU Informedia at TRECVID 2021: Activity Detection with Argus++. In TRECVID 2021.
[18] Lijun Yu, Yijun Qian, Wenhe Liu, and Alexander G. Hauptmann. 2022. Argus++: Robust Real-time Activity Detection for Unconstrained Video Streams with Overlapping Cube Proposals. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision Workshops.