SELAB-HCMUS at MediaEval 2021: Music Theme and Emotion Classification with Co-teaching Training Strategy

Phu-Thinh Pham (1,3), Minh-Hieu Huynh (1,3), Hai-Dang Nguyen (1,3), Minh-Triet Tran (1,2,3)
(1) University of Science, VNU-HCM
(2) John von Neumann Institute, VNU-HCM
(3) Vietnam National University, Ho Chi Minh City, Vietnam
{phpthinh18,hmhieu18}@apcs.fitus.edu.vn, nhdang@selab.hcmus.edu.vn, tmtriet@fit.hcmus.edu.vn

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'21, December 13-15 2021, Online.

ABSTRACT
MediaEval 2021 offers the third edition of the challenge on automatically recognizing the emotions and themes conveyed in a music recording. Team SELAB-HCMUS proposes several methods for this problem. We apply an efficient training strategy, and by working on short segments of the input representation the models achieve better results while reducing training time. In the official evaluation, our best submission achieves 0.1435 PR-AUC and 0.7599 ROC-AUC.

1 INTRODUCTION
The third edition of the Emotions and Themes in Music task [12] is introduced at the MediaEval 2021 workshop. The goal is to predict the moods and themes of raw audio from a vocabulary of 56 tags, where each recording may carry multiple tags; the task is therefore a multi-label classification problem.

Recognizing themes and emotions in music tracks is important and has a wide range of applications, such as music recommendation systems and music analysis. However, the task is considered extremely difficult. Determining the emotional character of a song can be ambiguous, since a song's perceived emotion or mood depends heavily on the listener. Moreover, each audio file is long, which can sharply inflate the cost of training deep learning models, even though the emotions and themes of a recording can often be determined from short audio segments.

To tackle these problems, we explored alternative ways to preprocess the data instead of training on the whole audio. Furthermore, we used several pre-trained CNN (Convolutional Neural Network) models to build an ensemble model, which achieves 0.1435 PR-AUC and 0.7599 ROC-AUC. Along with these methods, we also applied a training approach called co-teaching to improve the efficiency of our models.

2 APPROACH
We approached this problem in several ways, described in three subsections: data pre-processing, model architecture, and co-teaching.

2.1 Data pre-processing
2.1.1 Data shortening. At first, we attempted to train on the whole length of each track; however, this takes too much time, approximately 20 minutes per epoch. Data analysis showed that each music track can be viewed as a sequence of repeated rhythms. Thus, instead of training on the whole track, we take a random crop of each track: during training, each mel-spectrogram instance is randomly cropped to a dimension of 96 × 960. In the validation and testing stages, each sample is divided into 16 segments, and the final prediction is the average of the segment scores. This method reduces the training time from 20 to 15 minutes per epoch while preserving the models' accuracy.

2.1.2 Mixup. We adopt the Mixup learning principle [13], training on convex combinations of pairs of examples and their labels. Previous work [6, 7] has shown that this method is essential for improving the performance and generalization of the models.

2.1.3 SpecAugment. We use SpecAugment [8], a simple data augmentation method originally proposed for speech recognition, to train the models more effectively. The technique masks blocks of frequency channels and of consecutive time steps in the mel-spectrogram features. SpecAugment is not applied during validation and testing.

2.1.4 Data balancing. Following [1], we attempt to reduce ambiguity in the data by changing the labels from multi-tag to single-tag. This proved effective, giving better results than the default labels.
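To make the pre-processing concrete, the following is a minimal PyTorch-style sketch of the shortening and mixup steps described above. It assumes (96, T) mel-spectrogram tensors with T ≥ 960 and a model that outputs 56 tag logits; the helper names and the mixup parameter MIXUP_ALPHA are ours for illustration, as the paper does not report that value.

```python
import torch

CROP_FRAMES = 960   # time frames kept per training example (96 x 960, Sec. 2.1.1)
N_SEGMENTS = 16     # segments averaged at validation/test time
MIXUP_ALPHA = 0.2   # hypothetical Beta parameter; not reported in the paper


def random_crop(mel: torch.Tensor) -> torch.Tensor:
    """Randomly crop a (96, T) mel-spectrogram to (96, 960); assumes T >= 960."""
    start = torch.randint(0, mel.shape[1] - CROP_FRAMES + 1, (1,)).item()
    return mel[:, start:start + CROP_FRAMES]


def mixup(x: torch.Tensor, y: torch.Tensor):
    """Mixup [13]: mix a batch (and its tag vectors) with a shuffled copy of itself."""
    lam = torch.distributions.Beta(MIXUP_ALPHA, MIXUP_ALPHA).sample().item()
    perm = torch.randperm(x.size(0))
    return lam * x + (1 - lam) * x[perm], lam * y + (1 - lam) * y[perm]


@torch.no_grad()
def predict_track(model, mel: torch.Tensor) -> torch.Tensor:
    """Split a full track into 16 equal segments and average the tag scores."""
    seg = mel.shape[1] // N_SEGMENTS
    batch = torch.stack([mel[:, i * seg:(i + 1) * seg] for i in range(N_SEGMENTS)])
    return torch.sigmoid(model(batch.unsqueeze(1))).mean(dim=0)
```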
2.2 Model Architecture
Since mel-spectrogram features can be treated as images, we can apply CNN models, which are widely used in image and video processing, to this audio-based problem. The models used in our experiments are EfficientNet-B0 [10], ReXNet-100 [3], MixNet-S [11], ResNet [4], and RegNet [9]. After several experiments and evaluations, EfficientNet and ReXNet proved to be the most suitable architectures for the task. We then tried to improve these two models further by applying the co-teaching paradigm, explained in Section 2.3.

2.3 Co-teaching
Co-teaching [2] is an effective learning paradigm that trains two networks simultaneously, allowing them to teach each other. In each mini-batch, each network ignores a small portion of the data, keeping only what it considers useful knowledge. In each epoch, each network thus selects the small-loss instances and uses them to optimize its peer network, letting the two networks communicate about which data are helpful.

Observing that this approach is promising, we applied it to the task. The system consists of two models: EfficientNet [10] and ReXNet [3], the combination discussed in Section 2.2. The system flow is illustrated in Figure 1. In our experiments, we set the forget rate α = 0.05, which determines the amount of data to be ignored.

Figure 1: Overview of the system trained with the co-teaching strategy. At each epoch, mel-spectrograms are fed to both EfficientNet-B0 and ReXNet, which optimize each other.
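The following is a minimal sketch of one co-teaching update in the style of [2]: each network ranks the mini-batch by its own per-sample BCE loss, discards the highest-loss fraction α = 0.05, and its small-loss selection is used to update the peer. Function and variable names are ours for illustration.

```python
import torch
import torch.nn.functional as F

FORGET_RATE = 0.05  # alpha in Section 2.3: fraction of the batch each net discards


def coteach_step(net_a, net_b, opt_a, opt_b, x, y):
    """One co-teaching mini-batch: each network keeps its small-loss
    samples and uses them to update the *other* network."""
    n_keep = int((1.0 - FORGET_RATE) * x.size(0))

    # Per-sample multi-label BCE losses; no gradients needed for selection.
    with torch.no_grad():
        loss_a = F.binary_cross_entropy_with_logits(net_a(x), y, reduction="none").mean(1)
        loss_b = F.binary_cross_entropy_with_logits(net_b(x), y, reduction="none").mean(1)
    keep_for_b = torch.argsort(loss_a)[:n_keep]  # A's small-loss picks train B
    keep_for_a = torch.argsort(loss_b)[:n_keep]  # B's small-loss picks train A

    opt_a.zero_grad()
    F.binary_cross_entropy_with_logits(net_a(x[keep_for_a]), y[keep_for_a]).backward()
    opt_a.step()

    opt_b.zero_grad()
    F.binary_cross_entropy_with_logits(net_b(x[keep_for_b]), y[keep_for_b]).backward()
    opt_b.step()
```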
3 SUBMISSIONS AND RESULTS
3.1 Experimental setup
We train with the Adam optimizer [5] for 100 epochs, starting from a learning rate of 1 × 10⁻³. The learning rate is decreased by a factor of 10 after 5 consecutive epochs without improvement. The loss function is binary cross entropy (BCE). The experiments are carried out on Google Colab Pro with an NVIDIA Tesla P100 GPU.

3.2 Submissions
We submitted 4 runs, corresponding to the 4 models with the highest PR-AUC, to the MediaEval 2021 organizers:
• Run 1: EfficientNet-B0 trained with the co-teaching strategy, with ReXNet as the peer network.
• Run 2: ReXNet trained with the co-teaching strategy, with EfficientNet-B0 as the peer network.
• Run 3: ReXNet using the proposed data processing.
• Run 4: Ensemble of Run 1 and Run 2, with factors of 0.65 and 0.35, respectively.

3.3 Results
The results of the selected models are shown in Table 1. The EfficientNet and ReXNet models trained with co-teaching are more accurate than their conventionally trained counterparts. To increase the overall accuracy, we built an ensemble from these two models; to maximize the result, we applied linear optimization to find the optimal coefficient for each model based on the validation set. This yields our best result of 0.1435 PR-AUC, slightly higher than either individual model.

Table 1: Model performance evaluation on the test set.

Model                     Description                      PR-AUC-macro
Ensemble (Run 4)          Run 1 + Run 2                    0.1435
EfficientNet-B0 (Run 1)   Co-teaching w/ ReXNet            0.1415
ReXNet (Run 2)            Co-teaching w/ EfficientNet-B0   0.1343
ReXNet (Run 3)            Proposed data processing         0.1262
EfficientNet-B0 [1]       Data augmentation                0.139
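As an illustration of the ensembling step, the sketch below combines the two runs' tag probabilities with a convex weight and picks that weight on the validation set by macro PR-AUC. A simple grid search stands in here for the linear optimization mentioned above, and the array names are hypothetical.

```python
import numpy as np
from sklearn.metrics import average_precision_score


def ensemble(scores_a: np.ndarray, scores_b: np.ndarray, w: float) -> np.ndarray:
    """Convex combination of two models' per-tag probabilities."""
    return w * scores_a + (1.0 - w) * scores_b


def best_weight(y_val, val_a, val_b, grid=np.linspace(0.0, 1.0, 101)) -> float:
    """Pick the mixing coefficient maximizing macro PR-AUC on validation data."""
    aps = [average_precision_score(y_val, ensemble(val_a, val_b, w), average="macro")
           for w in grid]
    return float(grid[int(np.argmax(aps))])
```

With validation scores in hand, `w = best_weight(y_val, val_a, val_b)` followed by `ensemble(test_a, test_b, w)` reproduces the kind of weighting behind Run 4, where the search settled near 0.65/0.35.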
4 CONCLUSION AND FUTURE WORKS
This paper introduces a data-processing method, data shortening, applied before training. We bring CNNs, an approach common in image-based classification, to this audio-based classification problem. Furthermore, recognizing the difficulty posed by the large number of tags, we make use of the co-teaching paradigm, which is designed to cope with noisy labels. With these efforts, we achieve our highest PR-AUC of 0.1435.

For future work, we first aim to investigate the factors that could further improve the models' accuracy. We then want to design more efficient models specialized for the music emotion recognition task.

ACKNOWLEDGMENTS
This work was funded by Gia Lam Urban Development and Investment Company Limited, Vingroup, and supported by the Vingroup Innovation Foundation (VINIF) under project code VINIF.2019.DA19.

REFERENCES
[1] Tri-Nhan Do, Minh-Tri Nguyen, Hai-Dang Nguyen, Minh-Triet Tran, and Xuan-Nam Cao. 2020. HCMUS at MediaEval 2020: Emotion Classification Using Wavenet Features with SpecAugment and EfficientNet. In MediaEval 2020 Workshop.
[2] Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, and Masashi Sugiyama. 2018. Co-teaching: Robust training of deep neural networks with extremely noisy labels. arXiv preprint arXiv:1804.06872 (2018).
[3] Dongyoon Han, Sangdoo Yun, Byeongho Heo, and YoungJoon Yoo. 2021. Rethinking Channel Dimensions for Efficient Model Design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 732–741.
[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[5] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[6] Khaled Koutini, Shreyan Chowdhury, Verena Haunschmid, Hamid Eghbal-Zadeh, and Gerhard Widmer. 2019. Emotion and theme recognition in music with Frequency-Aware RF-Regularized CNNs. In MediaEval 2019 Workshop. arXiv preprint arXiv:1911.05833.
[7] Khaled Koutini, Hamid Eghbal-Zadeh, Matthias Dorfer, and Gerhard Widmer. 2019. The receptive field as a regularizer in deep convolutional neural networks for acoustic scene classification. In 2019 27th European Signal Processing Conference (EUSIPCO). IEEE, 1–5.
[8] Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le. 2019. SpecAugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779 (2019).
[9] Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and Piotr Dollár. 2020. Designing network design spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10428–10436.
[10] Mingxing Tan and Quoc Le. 2019. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning. PMLR, 6105–6114.
[11] Mingxing Tan and Quoc V Le. 2019. MixConv: Mixed depthwise convolutional kernels. arXiv preprint arXiv:1907.09595 (2019).
[12] Philip Tovstogan, Dmitry Bogdanov, and Alastair Porter. 2021. MediaEval 2021: Emotion and Theme Recognition in Music Using Jamendo. In MediaEval 2021 Workshop.
[13] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. 2017. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412 (2017).