SELAB-HCMUS at MediaEval 2021: Music Theme and Emotion Classification with Co-teaching Training Strategy

Phu-Thinh Pham (1,3), Minh-Hieu Huynh (1,3), Hai-Dang Nguyen (1,3), Minh-Triet Tran (1,2,3)
(1) University of Science, VNU-HCM
(2) John von Neumann Institute, VNU-HCM
(3) Vietnam National University, Ho Chi Minh City, Vietnam
{phpthinh18,hmhieu18}@apcs.fitus.edu.vn, nhdang@selab.hcmus.edu.vn, tmtriet@fit.hcmus.edu.vn

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'21, December 13-15 2021, Online.

ABSTRACT
MediaEval 2021 offers the third edition of the challenge on automatically recognizing the emotions and themes conveyed in a music recording. Team SELAB-HCMUS proposes several methods for this problem. We apply an efficient training strategy, and by working on short segments of the input representation the models achieve better results while reducing training time. In the official evaluation, our best submission achieves 0.1435 PR-AUC and 0.7599 ROC-AUC.

1 INTRODUCTION
The third edition of the Emotions and Themes in Music task [12] is introduced at the MediaEval 2021 workshop. The goal is to predict the moods and themes of raw audio from a vocabulary of 56 tags, where each recording may carry multiple tags; the task is therefore a multi-label classification problem.

Recognizing themes and emotions in music tracks is important and has a wide range of applications, such as music recommendation systems and music analysis. However, the task is considered extremely difficult. Determining the emotional character of a song can be ambiguous, since a song's perceived emotion or mood depends heavily on the listener. Moreover, each audio file is long, which can sharply inflate the cost of training deep learning models, even though the emotions and themes of a recording can often be determined from short audio segments.

To tackle these problems, we explored alternative ways to preprocess the data instead of training on the whole audio. Furthermore, we used several pre-trained CNN (Convolutional Neural Network) models to build an ensemble model, which achieves 0.1435 PR-AUC and 0.7599 ROC-AUC. Along with these methods, we also applied a training approach called co-teaching to improve the efficiency of our models.

2 APPROACH
We approached this problem in several ways, described in three subsections: data pre-processing, model architecture, and co-teaching.

2.1 Data pre-processing
2.1.1 Data shortening. At first, we attempted to train on the whole length of each track; however, this takes too much time, approximately 20 minutes per epoch. Data analysis showed that each music track can be viewed as a sequence of repeated rhythms. Thus, instead of training on the whole track, we take a random crop of each track: during training, each mel-spectrogram instance is randomly cropped to a dimension of 96 × 960. In the validation and testing stages, each sample is divided into 16 segments, and the final prediction is the average of the segment scores. This method reduces the training time from 20 to 15 minutes per epoch while preserving the models' accuracy.

2.1.2 Mixup. We adopt the Mixup learning principle [13], training on convex combinations of pairs of examples and their labels. Previous work [6, 7] has shown that this method is essential for improving the performance and generalization of the models.

2.1.3 SpecAugment. We use SpecAugment [8], a simple data augmentation method originally proposed for speech recognition, to train the models more effectively. The technique masks blocks of frequency channels and of consecutive time steps in the mel-spectrogram features. SpecAugment is not applied during validation and testing.

2.1.4 Data balancing. Following [1], we attempt to reduce ambiguity in the data by changing the labels from multi-tag to single-tag. This proved effective, giving better results than the default labels.
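To make the pre-processing concrete, the following is a minimal PyTorch-style sketch of the shortening and mixup steps described above. It assumes (96, T) mel-spectrogram tensors with T ≥ 960 and a model that outputs 56 tag logits; the helper names and the mixup parameter MIXUP_ALPHA are ours for illustration, as the paper does not report that value.

```python
import torch

CROP_FRAMES = 960   # time frames kept per training example (96 x 960, Sec. 2.1.1)
N_SEGMENTS = 16     # segments averaged at validation/test time
MIXUP_ALPHA = 0.2   # hypothetical Beta parameter; not reported in the paper


def random_crop(mel: torch.Tensor) -> torch.Tensor:
    """Randomly crop a (96, T) mel-spectrogram to (96, 960); assumes T >= 960."""
    start = torch.randint(0, mel.shape[1] - CROP_FRAMES + 1, (1,)).item()
    return mel[:, start:start + CROP_FRAMES]


def mixup(x: torch.Tensor, y: torch.Tensor):
    """Mixup [13]: mix a batch (and its tag vectors) with a shuffled copy of itself."""
    lam = torch.distributions.Beta(MIXUP_ALPHA, MIXUP_ALPHA).sample().item()
    perm = torch.randperm(x.size(0))
    return lam * x + (1 - lam) * x[perm], lam * y + (1 - lam) * y[perm]


@torch.no_grad()
def predict_track(model, mel: torch.Tensor) -> torch.Tensor:
    """Split a full track into 16 equal segments and average the tag scores."""
    seg = mel.shape[1] // N_SEGMENTS
    batch = torch.stack([mel[:, i * seg:(i + 1) * seg] for i in range(N_SEGMENTS)])
    return torch.sigmoid(model(batch.unsqueeze(1))).mean(dim=0)
```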
2.2 Model Architecture
Since mel-spectrogram features can be treated as images, we can apply CNN models, which are widely used in image and video processing, to this audio-based problem. The models used in our experiments are EfficientNet-B0 [10], ReXNet-100 [3], MixNet-S [11], ResNet [4], and RegNet [9]. After several experiments and evaluations, EfficientNet and ReXNet proved to be the most suitable architectures for the task. We then tried to improve these two models further by applying the co-teaching paradigm, explained in Section 2.3.

2.3 Co-teaching
Co-teaching [2] is an effective learning paradigm that trains two networks simultaneously, allowing them to teach each other. In each mini-batch, each network ignores a small portion of the data, keeping only what it considers useful knowledge. In each epoch, each network thus selects the small-loss instances and uses them to optimize its peer network, letting the two networks communicate about which data are helpful.

Observing that this approach is promising, we applied it to the task. The system consists of two models: EfficientNet [10] and ReXNet [3], the combination discussed in Section 2.2. The system flow is illustrated in Figure 1. In our experiments, we set the forget rate α = 0.05, which determines the amount of data to be ignored.

Figure 1: Overview of the system trained with the co-teaching strategy. At each epoch, mel-spectrograms are fed to both EfficientNet-B0 and ReXNet, which optimize each other.
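The following is a minimal sketch of one co-teaching update in the style of [2]: each network ranks the mini-batch by its own per-sample BCE loss, discards the highest-loss fraction α = 0.05, and its small-loss selection is used to update the peer. Function and variable names are ours for illustration.

```python
import torch
import torch.nn.functional as F

FORGET_RATE = 0.05  # alpha in Section 2.3: fraction of the batch each net discards


def coteach_step(net_a, net_b, opt_a, opt_b, x, y):
    """One co-teaching mini-batch: each network keeps its small-loss
    samples and uses them to update the *other* network."""
    n_keep = int((1.0 - FORGET_RATE) * x.size(0))

    # Per-sample multi-label BCE losses; no gradients needed for selection.
    with torch.no_grad():
        loss_a = F.binary_cross_entropy_with_logits(net_a(x), y, reduction="none").mean(1)
        loss_b = F.binary_cross_entropy_with_logits(net_b(x), y, reduction="none").mean(1)
    keep_for_b = torch.argsort(loss_a)[:n_keep]  # A's small-loss picks train B
    keep_for_a = torch.argsort(loss_b)[:n_keep]  # B's small-loss picks train A

    opt_a.zero_grad()
    F.binary_cross_entropy_with_logits(net_a(x[keep_for_a]), y[keep_for_a]).backward()
    opt_a.step()

    opt_b.zero_grad()
    F.binary_cross_entropy_with_logits(net_b(x[keep_for_b]), y[keep_for_b]).backward()
    opt_b.step()
```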
3 SUBMISSIONS AND RESULTS
3.1 Experimental setup
We train with the Adam optimizer [5] for 100 epochs, starting from a learning rate of 1 × 10⁻³. The learning rate is decreased by a factor of 10 after 5 consecutive epochs without improvement. The loss function is binary cross entropy (BCE). The experiments are carried out on Google Colab Pro with an NVIDIA Tesla P100 GPU.

3.2 Submissions
We submitted 4 runs, corresponding to the 4 models with the highest PR-AUC, to the MediaEval 2021 organizers:
• Run 1: EfficientNet-B0 trained with the co-teaching strategy, with ReXNet as the peer network.
• Run 2: ReXNet trained with the co-teaching strategy, with EfficientNet-B0 as the peer network.
• Run 3: ReXNet using the proposed data processing.
• Run 4: Ensemble of Run 1 and Run 2, with factors of 0.65 and 0.35, respectively.

3.3 Results
The results of the selected models are shown in Table 1. The EfficientNet and ReXNet models trained with co-teaching are more accurate than their conventionally trained counterparts. To increase the overall accuracy, we built an ensemble from these two models; to maximize the result, we applied linear optimization to find the optimal coefficient for each model based on the validation set. This yields our best result of 0.1435 PR-AUC, slightly higher than either individual model.

Table 1: Model performance evaluation on the test set.

Model                     Description                      PR-AUC-macro
Ensemble (Run 4)          Run 1 + Run 2                    0.1435
EfficientNet-B0 (Run 1)   Co-teaching w/ ReXNet            0.1415
ReXNet (Run 2)            Co-teaching w/ EfficientNet-B0   0.1343
ReXNet (Run 3)            Proposed data processing         0.1262
EfficientNet-B0 [1]       Data augmentation                0.139
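As an illustration of the ensembling step, the sketch below combines the two runs' tag probabilities with a convex weight and picks that weight on the validation set by macro PR-AUC. A simple grid search stands in here for the linear optimization mentioned above, and the array names are hypothetical.

```python
import numpy as np
from sklearn.metrics import average_precision_score


def ensemble(scores_a: np.ndarray, scores_b: np.ndarray, w: float) -> np.ndarray:
    """Convex combination of two models' per-tag probabilities."""
    return w * scores_a + (1.0 - w) * scores_b


def best_weight(y_val, val_a, val_b, grid=np.linspace(0.0, 1.0, 101)) -> float:
    """Pick the mixing coefficient maximizing macro PR-AUC on validation data."""
    aps = [average_precision_score(y_val, ensemble(val_a, val_b, w), average="macro")
           for w in grid]
    return float(grid[int(np.argmax(aps))])
```

With validation scores in hand, `w = best_weight(y_val, val_a, val_b)` followed by `ensemble(test_a, test_b, w)` reproduces the kind of weighting behind Run 4, where the search settled near 0.65/0.35.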
4 CONCLUSION AND FUTURE WORKS
This paper introduces a data-processing method, data shortening, applied before training. We bring CNNs, an approach common in image-based classification, to this audio-based classification problem. Furthermore, recognizing the difficulty posed by the large number of tags, we make use of the co-teaching paradigm, which is designed to cope with noisy labels. With these efforts, we achieve our highest PR-AUC of 0.1435.

For future work, we first aim to investigate the factors that could further improve the models' accuracy. We then want to design more efficient models specialized for the music emotion recognition task.

ACKNOWLEDGMENTS
This work was funded by Gia Lam Urban Development and Investment Company Limited, Vingroup, and supported by the Vingroup Innovation Foundation (VINIF) under project code VINIF.2019.DA19.

REFERENCES
[1] Tri-Nhan Do, Minh-Tri Nguyen, Hai-Dang Nguyen, Minh-Triet Tran, and Xuan-Nam Cao. 2020. HCMUS at MediaEval 2020: Emotion Classification Using Wavenet Features with SpecAugment and EfficientNet. In MediaEval 2020 Workshop.
[2] Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, and Masashi Sugiyama. 2018. Co-teaching: Robust training of deep neural networks with extremely noisy labels. arXiv preprint arXiv:1804.06872 (2018).
[3] Dongyoon Han, Sangdoo Yun, Byeongho Heo, and YoungJoon Yoo. 2021. Rethinking Channel Dimensions for Efficient Model Design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 732–741.
[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[5] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[6] Khaled Koutini, Shreyan Chowdhury, Verena Haunschmid, Hamid Eghbal-Zadeh, and Gerhard Widmer. 2019. Emotion and theme recognition in music with Frequency-Aware RF-Regularized CNNs. In MediaEval 2019 Workshop. arXiv preprint arXiv:1911.05833.
[7] Khaled Koutini, Hamid Eghbal-Zadeh, Matthias Dorfer, and Gerhard Widmer. 2019. The receptive field as a regularizer in deep convolutional neural networks for acoustic scene classification. In 2019 27th European Signal Processing Conference (EUSIPCO). IEEE, 1–5.
[8] Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le. 2019. SpecAugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779 (2019).
[9] Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and Piotr Dollár. 2020. Designing network design spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10428–10436.
[10] Mingxing Tan and Quoc Le. 2019. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning. PMLR, 6105–6114.
[11] Mingxing Tan and Quoc V Le. 2019. MixConv: Mixed depthwise convolutional kernels. arXiv preprint arXiv:1907.09595 (2019).
[12] Philip Tovstogan, Dmitry Bogdanov, and Alastair Porter. 2021. MediaEval 2021: Emotion and Theme Recognition in Music Using Jamendo. In MediaEval 2021 Workshop.
[13] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. 2017. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412 (2017).