Semi-Supervised Music Emotion Recognition using Noisy Student Training and Harmonic Pitch Class Profiles

Hao Hao Tan
helloharry66@gmail.com

MediaEval’21, December 13-15 2021, Online
Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
We present Mirable’s submission to the 2021 Emotions and Themes in Music challenge. In this work, we address the question: can we leverage semi-supervised learning techniques for music emotion recognition? To that end, we experiment with noisy student training, which has improved model performance in the image classification domain. As the noisy student method requires a strong teacher model, we further investigate two factors to further boost the performance of the teacher model: (i) the input training length, and (ii) complementary music representations. For (i), we find that models trained with a short input length perform better in PR-AUC, whereas those trained with a long input length perform better in ROC-AUC. For (ii), we find that using harmonic pitch class profiles (HPCP) consistently improves tagging performance, which suggests that harmonic representations are useful for music emotion tagging. Finally, we find that the noisy student method only improves tagging results in the long training length case. Additionally, we find that ensembling representations trained with different training lengths improves tagging results significantly, which suggests that incorporating multiple temporal resolutions into the network architecture is a promising direction for future work.

1 INTRODUCTION
Emotions and themes are high-level musical attributes that are abstract and highly subjective. Obtaining emotion labels typically requires human annotation, which can be time consuming and potentially costly. Is it possible to use semi-supervised learning techniques, so that we can leverage unlabelled music tracks to learn emotion tags while using only a small amount of labelled data? Following this question, we explore the use of noisy student training [9] for music emotion recognition. Recently, [8] proposed the music tagging transformer, which also uses noisy student training, but it targets general music tagging and does not focus on emotion and theme related tags. Additionally, we explore two other factors to improve the tagging performance of the teacher model: (i) the input training length; and (ii) adding music representations that complement the learning of music emotion.

2 APPROACH

2.1 Pre-Processing and Augmentation
We extract Mel-spectrograms with 128 bins from raw audio at a sampling rate of 44.1 kHz, and downsample the Mel-spectrograms by an averaging factor of 10 along the temporal dimension. The number of time steps in each Mel-spectrogram varies according to the training strategy, which is discussed in Section 2.3. For data augmentation, we perform time masking and frequency masking, similar to SpecAugment [5]. The maximum possible length of both masks varies between 20 and 60, and the value is sampled randomly for each training batch.

2.2 Model Training
As shown in Figure 1, our base model architecture is similar to CRNN [2], with some revisions: we add residual connections to our ConvBlock, and we use GeMPool [6] instead of MaxPool. We train all of our models for a maximum of 100 epochs with the Adam optimizer and a learning rate of 0.0001. Early stopping is performed when the validation ROC-AUC does not improve for 5 epochs, and we store the model weights from the epoch with the best validation ROC-AUC.

[Figure 1: Overview of our model.]
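The paper does not include a reference implementation; the following is a minimal PyTorch sketch of the two architectural revisions described in Section 2.2: a ConvBlock with a residual connection, and generalised-mean (GeM) pooling [6] in place of max pooling. The channel sizes, kernel sizes, and the initial value of the GeM exponent p are illustrative assumptions rather than values taken from the paper.

# Minimal PyTorch sketch (not the authors' code) of a residual ConvBlock and GeM pooling.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeMPool2d(nn.Module):
    """Generalised-mean pooling: (mean(x^p))^(1/p) over the freq/time axes, with learnable p."""
    def __init__(self, p: float = 3.0, eps: float = 1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.tensor(p))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq, time) -> (batch, channels)
        x = x.clamp(min=self.eps).pow(self.p)
        return F.adaptive_avg_pool2d(x, 1).pow(1.0 / self.p).flatten(1)

class ResidualConvBlock(nn.Module):
    """Two Conv-BN layers with a projection shortcut when the channel count changes."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        self.shortcut = nn.Conv2d(in_ch, out_ch, kernel_size=1) if in_ch != out_ch else nn.Identity()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.relu(self.body(x) + self.shortcut(x))

In the full model, a stack of such blocks feeds a recurrent layer as in CRNN [2]; those details follow the cited architecture and Figure 1 rather than anything specific to this sketch.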
2.3 Long vs. Short Training Length
For the long training length mode, we use the first ≈185 seconds of each track, which corresponds to 1600 time steps in the Mel-spectrogram after average pooling. For the short training length mode, we chunk each track into samples of ≈9.25 seconds, which corresponds to 80 time steps in the Mel-spectrogram after average pooling. During evaluation, we average the logits of all chunks to obtain the final output for each track.

2.4 Harmonic Pitch Class Profiles (HPCP)
HPCP [3] is a type of chroma feature that describes the tonality and harmonic content of a music track. We extract HPCP with 12 pitch classes from raw audio at a sampling rate of 44.1 kHz. We do not apply average pooling along the temporal dimension of the HPCP, so the corresponding numbers of time steps are 4000 and 200 for the long and short training length modes respectively. We concatenate the learnt latent features from the Mel-spectrogram block and the HPCP block, each with dimension d = 256, and pass the result through two linear layers to obtain the fused output.

2.5 Noisy Student Training
Noisy student training [9] is an extension of self-training that uses an equal-or-larger student model and added noise to improve the representation learnt from the teacher model. To add noise, we strengthen the data augmentation by increasing the maximum possible masking length to between 30 and 90 for both time and frequency masking, and by adding standard Gaussian noise with a weight of 0.01. To implement stochastic depth [4], we use 3 StochasticConvBlocks, which are ConvBlocks that can each be randomly bypassed with a probability of 0.1; during evaluation, all layers are passed through. A StochasticConvBlock also applies an additional dropout with probability 0.1 after the ReLU layer.

In this work, we use the corresponding HPCP model for each of the long and short training length modes as the teacher model. We only use predictions > 0.1 as positive pseudo-labels, and those < 1e−6 as negative pseudo-labels. Both decision thresholds are determined by an empirical evaluation of the predicted value distribution of the teacher model on the training and validation sets: we take the leftmost 5th percentile of the negative label distribution, and the rightmost 5th percentile of the positive label distribution, to ensure higher confidence.
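To make the threshold-selection step above concrete, here is a minimal NumPy sketch of one way to implement it. The array names, the use of NumPy, and the masking of predictions that fall between the two thresholds are our assumptions; the paper only specifies the percentile rule and the resulting thresholds of roughly 0.1 and 1e−6.

# Minimal NumPy sketch (not the authors' code) of percentile-based pseudo-label thresholds.
import numpy as np

def select_thresholds(probs: np.ndarray, labels: np.ndarray):
    """probs, labels: (num_tracks, num_tags) teacher predictions and 0/1 tags on the labelled set."""
    neg_thr = np.percentile(probs[labels == 0], 5)   # leftmost 5th percentile of the negative distribution
    pos_thr = np.percentile(probs[labels == 1], 95)  # rightmost 5th percentile of the positive distribution
    return neg_thr, pos_thr                          # roughly 1e-6 and 0.1 in the paper

def make_pseudo_labels(unlabelled_probs: np.ndarray, neg_thr: float, pos_thr: float):
    """Keep only confident entries: > pos_thr become positives, < neg_thr become negatives."""
    pseudo = (unlabelled_probs > pos_thr).astype(np.float32)
    mask = (unlabelled_probs > pos_thr) | (unlabelled_probs < neg_thr)
    return pseudo, mask  # entries outside the mask would be ignored in the student loss (our assumption)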
2.6 Model Ensemble
Finally, we investigate combining the outputs of the long and short training length models by taking a weighted sum of their best models: l_final = α · l_short + (1 − α) · l_long. We use the validation set to find the ratio α that gives the best results.

3 RESULTS AND ANALYSIS

Table 1: Test-set performance of our models.

Model              ROC-AUC   PR-AUC   F-Score
baseline           0.7258    0.1077   0.1656
long-normal        0.7256    0.1024   0.1578
long-hpcp          0.7587    0.1220   0.1854
long-hpcp-noisy    0.7614    0.1235   0.1833
short-normal       0.7477    0.1234   0.1855
short-hpcp         0.7541    0.1275   0.1864
short-hpcp-noisy   0.7488    0.1226   0.1804
ensemble           0.7687    0.1356   0.1978

Table 2: Average true positive rate (TPR) and true negative rate (TNR) for each model across all labels.

Model              Avg TPR   Avg TNR
long-hpcp-noisy    0.3645    0.8851
short-normal       0.3842    0.8737
ensemble           0.4099    0.8671

For the training length factor, we find that models trained with a long input length perform better in ROC-AUC, but models trained with a short input length perform significantly better in PR-AUC (Table 1). According to Table 2, this is because the former have a higher TNR while the latter have a higher TPR. Since PR-AUC focuses more on the minority class (in this case the positive class) and ROC-AUC weighs both classes, the short input length models score better in PR-AUC. We also find that adding HPCP improves tagging results consistently in both cases, which suggests that harmonic representations are important for music emotion recognition.

For noisy student training, the results are rather inconclusive. We find slight improvements in the long training length case, but the results degrade in the short training length case. We also run noisy student training for only one iteration, as we find the results consistently degrade in subsequent iterations. Additionally, we try adding more unlabelled tracks from the Lakh MP3 dataset (≈45,000 30-second tracks) to increase the training set size, but do not observe any performance improvement. We infer that the noisy student method might not necessarily work well for music emotion recognition tasks, due to the abstract nature and subjectivity of emotion and theme labels; hence, a small subset of emotion labels might not be sufficient to represent the full dataset.

For model ensembling, we choose to ensemble the ‘long-hpcp-noisy’ model and the ‘short-normal’ model. We find that α = 0.7 is optimal on our validation set, suggesting that the final output gives more weight to the short training length model. The test-set results also show that this ensemble improves tagging performance significantly, which suggests that combining different views of the audio in terms of temporal resolution can produce better learnt representations.
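As a concrete illustration of the α search described in Section 2.6, the sketch below scans a grid of candidate weights on the validation set. The grid, the use of macro-averaged PR-AUC as the selection criterion, and all names are assumptions; the paper only states that the validation set is used to pick the best α, which turned out to be 0.7.

# Minimal sketch (not the authors' code) of the validation search for the ensemble weight alpha.
import numpy as np
from sklearn.metrics import average_precision_score

def search_alpha(logits_short, logits_long, val_labels, grid=np.linspace(0.0, 1.0, 11)):
    """Pick alpha maximising macro PR-AUC of alpha * l_short + (1 - alpha) * l_long."""
    best_alpha, best_score = None, -np.inf
    for alpha in grid:
        fused = alpha * logits_short + (1.0 - alpha) * logits_long
        score = average_precision_score(val_labels, fused, average="macro")
        if score > best_score:
            best_alpha, best_score = alpha, score
    return best_alpha, best_score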
4 DISCUSSION AND OUTLOOK
Compared with related work, this work still uses a relatively long training length (even the short mode uses ≈9 seconds, whereas previous works use ≈2 to 5 seconds) and a low temporal resolution, which we intend to change in future work. In particular, we are interested in tweaking the network architecture to capture views of different temporal resolutions within an audio sample. We would also like to explore noisy student training with different model architectures and with datasets of a much larger scale.

REFERENCES
[1] Dmitry Bogdanov, Minz Won, Philip Tovstogan, Alastair Porter, and Xavier Serra. 2019. The MTG-Jamendo Dataset for Automatic Music Tagging. In Machine Learning for Music Discovery Workshop, International Conference on Machine Learning (ICML 2019).
[2] Keunwoo Choi, György Fazekas, Mark Sandler, and Kyunghyun Cho. 2017. Convolutional recurrent neural networks for music classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2392–2396.
[3] Emilia Gómez. 2006. Tonal description of polyphonic audio for music content processing. INFORMS Journal on Computing 18, 3 (2006), 294–304.
[4] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. 2016. Deep networks with stochastic depth. In European Conference on Computer Vision. Springer, 646–661.
[5] Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le. 2019. SpecAugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779 (2019).
[6] Filip Radenović, Giorgos Tolias, and Ondřej Chum. 2018. Fine-tuning CNN image retrieval with no human annotation. IEEE Transactions on Pattern Analysis and Machine Intelligence 41, 7 (2018), 1655–1668.
[7] Philip Tovstogan, Dmitry Bogdanov, and Alastair Porter. 2021. MediaEval 2021: Emotion and Theme Recognition in Music Using Jamendo. In Proc. of the MediaEval 2021 Workshop, Online, 13-15 December 2021.
[8] Minz Won, Keunwoo Choi, and Xavier Serra. 2021. Semi-supervised music tagging transformer. In Proc. of the International Society for Music Information Retrieval Conference.
[9] Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V Le. 2020. Self-training with noisy student improves ImageNet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10687–10698.