Semi-Supervised Music Emotion Recognition using Noisy Student Training and Harmonic Pitch Class Profiles

Hao Hao Tan
helloharry66@gmail.com

MediaEval’21, December 13-15 2021, Online
Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
We present Mirable’s submission to the 2021 Emotions and Themes in Music challenge. In this work, we address the question: can we leverage semi-supervised learning techniques for music emotion recognition? To that end, we experiment with noisy student training, which has improved model performance in the image classification domain. As the noisy student method requires a strong teacher model, we further investigate two factors to further boost the performance of the teacher model: (i) the input training length, and (ii) complementary music representations. For (i), we find that models trained with a short input length perform better in PR-AUC, whereas those trained with a long input length perform better in ROC-AUC. For (ii), we find that using harmonic pitch class profiles (HPCP) consistently improves tagging performance, which suggests that harmonic representations are useful for music emotion tagging. Finally, we find that the noisy student method only improves tagging results in the long training length case. Additionally, we find that ensembling representations trained with different training lengths improves tagging results significantly, which suggests that incorporating multiple temporal resolutions into the network architecture is a promising direction for future work.

1 INTRODUCTION
Emotions and themes are high-level musical attributes that are abstract and highly subjective. Obtaining emotion labels typically requires human annotation, which can be time consuming and potentially costly. Is it possible to use semi-supervised learning techniques, so that we can leverage unlabelled music tracks to learn emotion tags while using only a small amount of labelled data? Following this question, we explore the use of noisy student training [9] for music emotion recognition. Recently, [8] proposed the music tagging transformer, which also uses noisy student training, but it targets general music tagging and does not focus on emotion and theme related tags. Additionally, we explore two other factors to improve the tagging performance of the teacher model: (i) the input training length; and (ii) adding music representations that complement the learning of music emotion.

2 APPROACH

2.1 Pre-Processing and Augmentation
We extract Mel-spectrograms with 128 bins from raw audio at a sampling rate of 44.1 kHz, and downsample the Mel-spectrograms by an averaging factor of 10 along the temporal dimension. The number of time steps in each Mel-spectrogram varies according to the training strategy, which is discussed in Section 2.3. For data augmentation, we perform time masking and frequency masking, similar to SpecAugment [5]. The maximum possible length of both masks varies between 20 and 60, and the value is sampled randomly for each training batch.

2.2 Model Training
As shown in Figure 1, our base model architecture is similar to CRNN [2], with some revisions: we add residual connections to our ConvBlock, and we use GeMPool [6] instead of MaxPool. We train all of our models for a maximum of 100 epochs with the Adam optimizer and a learning rate of 0.0001. Early stopping is performed when the validation ROC-AUC does not improve for 5 epochs, and we store the model weights from the epoch with the best validation ROC-AUC.

[Figure 1: Overview of our model.]
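The paper does not include a reference implementation; the following is a minimal PyTorch sketch of the two architectural revisions described in Section 2.2: a ConvBlock with a residual connection, and generalised-mean (GeM) pooling [6] in place of max pooling. The channel sizes, kernel sizes, and the initial value of the GeM exponent p are illustrative assumptions rather than values taken from the paper.

# Minimal PyTorch sketch (not the authors' code) of a residual ConvBlock and GeM pooling.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeMPool2d(nn.Module):
    """Generalised-mean pooling: (mean(x^p))^(1/p) over the freq/time axes, with learnable p."""
    def __init__(self, p: float = 3.0, eps: float = 1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.tensor(p))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq, time) -> (batch, channels)
        x = x.clamp(min=self.eps).pow(self.p)
        return F.adaptive_avg_pool2d(x, 1).pow(1.0 / self.p).flatten(1)

class ResidualConvBlock(nn.Module):
    """Two Conv-BN layers with a projection shortcut when the channel count changes."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        self.shortcut = nn.Conv2d(in_ch, out_ch, kernel_size=1) if in_ch != out_ch else nn.Identity()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.relu(self.body(x) + self.shortcut(x))

In the full model, a stack of such blocks feeds a recurrent layer as in CRNN [2]; those details follow the cited architecture and Figure 1 rather than anything specific to this sketch.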
2.3 Long vs. Short Training Length
For the long training length mode, we use the first ≈185 seconds of each track, which corresponds to 1600 time steps in the Mel-spectrogram after average pooling. For the short training length mode, we chunk each track into samples of ≈9.25 seconds, which corresponds to 80 time steps in the Mel-spectrogram after average pooling. During evaluation, we average the logits of all chunks to obtain the final output for each track.

2.4 Harmonic Pitch Class Profiles (HPCP)
HPCP [3] is a type of chroma feature that describes the tonality and harmonic content of a music track. We extract HPCP with 12 pitch classes from raw audio at a sampling rate of 44.1 kHz. We do not apply average pooling along the temporal dimension of the HPCP, so the corresponding numbers of time steps are 4000 and 200 for the long and short training length modes respectively. We concatenate the learnt latent features from the Mel-spectrogram block and the HPCP block, each with dimension d = 256, and pass the result through two linear layers to obtain the fused output.

2.5 Noisy Student Training
Noisy student training [9] is an extension of self-training that uses an equal-or-larger student model and added noise to improve the representation learnt from the teacher model. To add noise, we strengthen the data augmentation by increasing the maximum possible masking length to between 30 and 90 for both time and frequency masking, and by adding standard Gaussian noise with a weight of 0.01. To implement stochastic depth [4], we use 3 StochasticConvBlocks, which are ConvBlocks that can each be randomly bypassed with a probability of 0.1; during evaluation, all layers are passed through. A StochasticConvBlock also applies an additional dropout with probability 0.1 after the ReLU layer.

In this work, we use the corresponding HPCP model for each of the long and short training length modes as the teacher model. We only use predictions > 0.1 as positive pseudo-labels, and those < 1e−6 as negative pseudo-labels. Both decision thresholds are determined by an empirical evaluation of the predicted value distribution of the teacher model on the training and validation sets: we take the leftmost 5th percentile of the negative label distribution, and the rightmost 5th percentile of the positive label distribution, to ensure higher confidence.
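To make the threshold-selection step above concrete, here is a minimal NumPy sketch of one way to implement it. The array names, the use of NumPy, and the masking of predictions that fall between the two thresholds are our assumptions; the paper only specifies the percentile rule and the resulting thresholds of roughly 0.1 and 1e−6.

# Minimal NumPy sketch (not the authors' code) of percentile-based pseudo-label thresholds.
import numpy as np

def select_thresholds(probs: np.ndarray, labels: np.ndarray):
    """probs, labels: (num_tracks, num_tags) teacher predictions and 0/1 tags on the labelled set."""
    neg_thr = np.percentile(probs[labels == 0], 5)   # leftmost 5th percentile of the negative distribution
    pos_thr = np.percentile(probs[labels == 1], 95)  # rightmost 5th percentile of the positive distribution
    return neg_thr, pos_thr                          # roughly 1e-6 and 0.1 in the paper

def make_pseudo_labels(unlabelled_probs: np.ndarray, neg_thr: float, pos_thr: float):
    """Keep only confident entries: > pos_thr become positives, < neg_thr become negatives."""
    pseudo = (unlabelled_probs > pos_thr).astype(np.float32)
    mask = (unlabelled_probs > pos_thr) | (unlabelled_probs < neg_thr)
    return pseudo, mask  # entries outside the mask would be ignored in the student loss (our assumption)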
2.6 Model Ensemble
Finally, we investigate combining the outputs of the long and short training length models by taking a weighted sum of their best models: l_final = α · l_short + (1 − α) · l_long. We use the validation set to find the ratio α that gives the best results.

3 RESULTS AND ANALYSIS

Table 1: Test-set performance of our models.

Model              ROC-AUC   PR-AUC   F-Score
baseline           0.7258    0.1077   0.1656
long-normal        0.7256    0.1024   0.1578
long-hpcp          0.7587    0.1220   0.1854
long-hpcp-noisy    0.7614    0.1235   0.1833
short-normal       0.7477    0.1234   0.1855
short-hpcp         0.7541    0.1275   0.1864
short-hpcp-noisy   0.7488    0.1226   0.1804
ensemble           0.7687    0.1356   0.1978

Table 2: Average true positive rate (TPR) and true negative rate (TNR) for each model across all labels.

Model              Avg TPR   Avg TNR
long-hpcp-noisy    0.3645    0.8851
short-normal       0.3842    0.8737
ensemble           0.4099    0.8671

For the training length factor, we find that models trained with a long input length perform better in ROC-AUC, but models trained with a short input length perform significantly better in PR-AUC (Table 1). According to Table 2, this is because the former have a higher TNR while the latter have a higher TPR. Since PR-AUC focuses more on the minority class (in this case the positive class) and ROC-AUC weighs both classes, the short input length models score better in PR-AUC. We also find that adding HPCP improves tagging results consistently in both cases, which suggests that harmonic representations are important for music emotion recognition.

For noisy student training, the results are rather inconclusive. We find slight improvements in the long training length case, but the results degrade in the short training length case. We also run noisy student training for only one iteration, as we find the results consistently degrade in subsequent iterations. Additionally, we try adding more unlabelled tracks from the Lakh MP3 dataset (≈45,000 30-second tracks) to increase the training set size, but do not observe any performance improvement. We infer that the noisy student method might not necessarily work well for music emotion recognition tasks, due to the abstract nature and subjectivity of emotion and theme labels; hence, a small subset of emotion labels might not be sufficient to represent the full dataset.

For model ensembling, we choose to ensemble the ‘long-hpcp-noisy’ model and the ‘short-normal’ model. We find that α = 0.7 is optimal on our validation set, suggesting that the final output gives more weight to the short training length model. The test-set results also show that this ensemble improves tagging performance significantly, which suggests that combining different views of the audio in terms of temporal resolution can produce better learnt representations.
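As a concrete illustration of the α search described in Section 2.6, the sketch below scans a grid of candidate weights on the validation set. The grid, the use of macro-averaged PR-AUC as the selection criterion, and all names are assumptions; the paper only states that the validation set is used to pick the best α, which turned out to be 0.7.

# Minimal sketch (not the authors' code) of the validation search for the ensemble weight alpha.
import numpy as np
from sklearn.metrics import average_precision_score

def search_alpha(logits_short, logits_long, val_labels, grid=np.linspace(0.0, 1.0, 11)):
    """Pick alpha maximising macro PR-AUC of alpha * l_short + (1 - alpha) * l_long."""
    best_alpha, best_score = None, -np.inf
    for alpha in grid:
        fused = alpha * logits_short + (1.0 - alpha) * logits_long
        score = average_precision_score(val_labels, fused, average="macro")
        if score > best_score:
            best_alpha, best_score = alpha, score
    return best_alpha, best_score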
4 DISCUSSION AND OUTLOOK
Compared with related work, this work still uses a relatively long training length (even the short mode uses ≈9 seconds, whereas previous works use ≈2 to 5 seconds) and a low temporal resolution, which we intend to change in future work. In particular, we are interested in tweaking the network architecture to capture views of different temporal resolutions within an audio sample. We would also like to explore noisy student training with different model architectures and with datasets of a much larger scale.

REFERENCES
[1] Dmitry Bogdanov, Minz Won, Philip Tovstogan, Alastair Porter, and Xavier Serra. 2019. The MTG-Jamendo Dataset for Automatic Music Tagging. In Machine Learning for Music Discovery Workshop, International Conference on Machine Learning (ICML 2019).
[2] Keunwoo Choi, György Fazekas, Mark Sandler, and Kyunghyun Cho. 2017. Convolutional recurrent neural networks for music classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2392–2396.
[3] Emilia Gómez. 2006. Tonal description of polyphonic audio for music content processing. INFORMS Journal on Computing 18, 3 (2006), 294–304.
[4] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. 2016. Deep networks with stochastic depth. In European Conference on Computer Vision. Springer, 646–661.
[5] Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le. 2019. SpecAugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779 (2019).
[6] Filip Radenović, Giorgos Tolias, and Ondřej Chum. 2018. Fine-tuning CNN image retrieval with no human annotation. IEEE Transactions on Pattern Analysis and Machine Intelligence 41, 7 (2018), 1655–1668.
[7] Philip Tovstogan, Dmitry Bogdanov, and Alastair Porter. 2021. MediaEval 2021: Emotion and Theme Recognition in Music Using Jamendo. In Proc. of the MediaEval 2021 Workshop, Online, 13-15 December 2021.
[8] Minz Won, Keunwoo Choi, and Xavier Serra. 2021. Semi-supervised music tagging transformer. In Proc. of the International Society for Music Information Retrieval Conference.
[9] Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V Le. 2020. Self-training with noisy student improves ImageNet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10687–10698.