Frequency Dependent Convolutions for Music Tagging

Vincent Bour
lileonardo, Paris, France
vincent.bour@lileonardo.com

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'21, December 13-15 2021, Online.

ABSTRACT
We present a deep convolutional neural network approach to emotion and theme recognition in music using the MTG-Jamendo dataset. The model takes mel spectrograms as input and tries to leverage translation invariance in the time dimension while allowing the convolution filters to depend on the frequency. This approach led the lileonardo team to achieve the highest score on this task of the 2021 MediaEval Multimedia Evaluation benchmark (https://multimediaeval.github.io/2021-Emotion-and-Theme-Recognition-in-Music-Task/).

1 INTRODUCTION
Emotion and Theme Recognition in Music Using Jamendo is a multi-label classification task of the 2021 MediaEval Multimedia Evaluation benchmark. Its goal is to automatically recognize the emotions and themes conveyed in a music recording by means of audio analysis. We refer to [9] for more details on the task.

2 RELATED WORK
In the wake of its successes in computer vision, the use of convolutional neural networks has become a very common approach to audio and music tagging and often leads to state-of-the-art results (see [10] for a comparison of different CNN approaches to music tagging). Among other things, [10] shows that a convolutional neural network trained on small excerpts of music constitutes a simple but efficient method for music tagging.

A number of previous works highlight the importance of the choice of filters with respect to the frequency dimension. In [7], the authors use domain knowledge to design vertical filters aimed at capturing different spectral features. In [4], the authors add a channel in order to make the convolution filters frequency aware. They also study the influence of the receptive field and show that models with a limited receptive field in the frequency dimension perform better.

As is often the case with multi-label tagging tasks, there is a pronounced imbalance between positive and negative classes: the proportion of positive examples for each label in the training set ranges from 0.6% to 9.3%. To address this issue, the authors of [3] adapted the loss function and used focal loss [6], which increases the weight of wrongly classified examples.

3 APPROACH
According to the results of [5], deep convolutional neural networks seem to perform comparably well on music tagging tasks whether they take raw waveforms or mel spectrograms as input. After experimenting on a smaller dataset, and in order to save computation time, we decided to restrict ourselves to mel spectrograms as input.

The task of recognizing emotions and themes seems particularly well adapted to training on small excerpts of music. Indeed, while an instrument tag can be attributed to a song even when the instrument is only present in a small part of it, emotions and themes can often be recognized on most parts of the song. Moreover, when reducing the input time to lengths as small as 1 second, we observed that the ability of the network to perform its task did not radically diminish.

We computed 128-band mel spectrograms from the original audio files by keeping the sample rate of 44100 Hz, using an FFT window length of 2048 with 50% hop length and a low-pass filter with maximum frequency set to 16 kHz (see Figure 1 for a comparison between these mel spectrograms and the original 96-band mel spectrograms provided by the MTG-Jamendo dataset).

Figure 1: An example of 96 and 128 bands mel spectrograms.
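The following sketch illustrates this preprocessing with torchaudio; the choice of torchaudio, the mono mixdown and the log-amplitude compression are assumptions made for this example rather than a description of our exact pipeline.

```python
import torch
import torchaudio

# Parameters described above: 44100 Hz audio, 2048-sample FFT window,
# 50% hop length (1024 samples), 128 mel bands, frequencies above 16 kHz discarded.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=44100,
    n_fft=2048,
    hop_length=1024,
    f_max=16000.0,
    n_mels=128,
)
to_db = torchaudio.transforms.AmplitudeToDB()  # log compression (assumption of this sketch)

def compute_mel(path: str) -> torch.Tensor:
    waveform, sr = torchaudio.load(path)           # keeps the native 44.1 kHz rate
    waveform = waveform.mean(dim=0, keepdim=True)  # mix down to mono (assumption of this sketch)
    return to_db(mel(waveform))                    # shape: (1, 128, n_frames)
```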
We found that a simple CNN model overfitted rather quickly when trained on small random chunks of the tracks, unless a carefully engineered learning rate scheduling scheme was used. When we trained it on the 128-band spectrograms, it performed significantly worse and overfitted even more quickly.

In order to overcome the class imbalance issue, we used a weighted loss in which, for each label, the weight of the positive class is inversely proportional to its frequency in the training set:

\[
\ell(x, y) = -\frac{1}{c} \sum_{i=1}^{c} \left( \frac{2}{1 + p_i}\, y_i \log(x_i) + \frac{2 p_i}{1 + p_i}\, (1 - y_i) \log(1 - x_i) \right),
\]

where p_i is the frequency of the positive class in the training set for label i and c = 56 is the number of labels.
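This loss translates directly into PyTorch as in the following sketch, assuming the network outputs probabilities x after a sigmoid and that pos_freq holds the per-label frequencies p_i; the helper name and tensor shapes are illustrative.

```python
import torch

def weighted_bce(x: torch.Tensor, y: torch.Tensor, pos_freq: torch.Tensor) -> torch.Tensor:
    """Per-label weighted binary cross-entropy as defined above.

    x        : predicted probabilities, shape (batch, 56)
    y        : binary targets, shape (batch, 56)
    pos_freq : frequency p_i of the positive class for each label, shape (56,)
    """
    eps = 1e-7                                  # numerical safety, not part of the formula
    x = x.clamp(eps, 1.0 - eps)
    w_pos = 2.0 / (1.0 + pos_freq)              # weight of the positive term
    w_neg = 2.0 * pos_freq / (1.0 + pos_freq)   # weight of the negative term
    loss = -(w_pos * y * torch.log(x) + w_neg * (1.0 - y) * torch.log(1.0 - x))
    return loss.mean(dim=1).mean()              # average over the c labels, then over the batch
```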
We also adopted an input stem consisting of three convolutional layers followed by a max pooling layer, inspired by [8] and now widely used in ResNets [1]. It significantly improved the results, probably because it leads to a better low-level analysis in the initial part of the model. It divides the time and frequency dimensions by a factor of four while outputting 128 channels. Four more convolutional layers are aimed at performing a higher-level analysis of the spectrogram. The dimension is then reduced to 1 × 1 by using a dense layer on the remaining frequency dimensions and by averaging along the time dimension. The resulting architecture is described in Table 1.

Table 1: Description of the layers for 128 frequency input dimension and 224 time input dimension

Layer     Input      Kernel / stride   Channels   Dropout
conv1     128 × 224  3×3 / 2           64         0
conv2     64 × 112   3×3 / 1           64         0
conv3     64 × 112   3×3 / 1           128        0
maxpool   64 × 112   3×3 / 2           -          -
filter1   32 × 56    3×3 / 2           128        0.2
filter2   16 × 28    3×3 / 2           256        0.2
filter3   16 × 28    3×3 / 1           256        0.2
filter4   8 × 14     3×3 / 2           256        0.2
collapse  4 × 7      4×1 / 1           512        0.2
avgpool   1 × 7      1×7 / 1           -          -
fc        1 × 1      1×1               1024       0.5
output    1 × 1      1×1               56         0.5

In order to give the model the possibility to detect different features at different frequencies, we let the convolution filters of the higher-level layers vary with the frequency (a sketch of such a layer is given below). Despite the six-fold increase in parameters that this approach entails, it gave better results on the 128-band mel spectrograms.

The choice of this architecture comes from the hypothesis that the problem is invariant by translation in the time dimension, that the first layers compute low-level visual features, and that at a higher level, the characteristic features depend on the frequency and are not similar in the low, middle or high frequency parts of the spectrogram.
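One simple way to obtain frequency dependent filters, shown here as an illustration rather than as our exact implementation, is to split the frequency axis into a few bands and give each band its own set of convolution weights; the number of bands and the equal-width split are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class FreqDependentConv2d(nn.Module):
    """Convolution whose filters are shared along time but vary across frequency bands.

    Illustrative sketch: the frequency axis is cut into n_bands equal bands and each
    band gets its own Conv2d, so the weights are tied along time only. The value of
    n_bands and the hard band boundaries are assumptions made for this example.
    """

    def __init__(self, in_ch: int, out_ch: int, n_bands: int = 4,
                 kernel_size: int = 3, stride: int = 1):
        super().__init__()
        self.n_bands = n_bands
        self.convs = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size, stride=stride,
                      padding=kernel_size // 2)
            for _ in range(n_bands)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, channels, frequency, time)
        bands = torch.chunk(x, self.n_bands, dim=2)            # split along frequency
        out = [conv(b) for conv, b in zip(self.convs, bands)]  # one filter set per band
        return torch.cat(out, dim=2)                           # reassemble the frequency axis
```

In the frequency dependent variant evaluated in Section 4, layers filter1–4 use convolutions of this kind in place of ordinary 3×3 convolutions, which is where the additional parameters come from.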
We trained the network on randomly chosen small excerpts of each track. Final predictions are obtained by averaging the outputs over all the small segments of the track.

4 EXPERIMENTAL RESULTS
We evaluated the model described in Table 1, first with ordinary 3×3 convolutions, then with frequency dependent convolutions in layers filter1–4.

We trained these two models both on the original 96-band mel spectrograms provided by the MTG-Jamendo dataset and on the 128-band ones described in the previous section. We also tried input time lengths of 128 and 224 frames, corresponding to excerpts of approximately 3 s and 5.2 s. Finally, we trained the models with a classic binary cross-entropy loss, with the weighted loss described in Section 3, and with the focal loss with parameters α = 0.25 and γ = 2. We then averaged the predictions of the trained models for any given choice of these hyperparameters to obtain the results in Table 2.

All models were trained for 200 epochs using SGD with Nesterov momentum of 0.9, a learning rate of 0.5 and a weight decay of 2e-5. We used a cosine learning rate decay, mixup with α = 0.2 [11] and stochastic weight averaging on the last 40 epochs [2]. A PyTorch implementation is available online (https://github.com/vibour/emotion-theme-recognition).

Table 2: Experimental results

Model       Validation PR   Validation ROC   Test PR   Test ROC
convs       0.1169          0.7434           0.1483    0.7715
freq-dep    0.1155          0.7452           0.1504    0.7744
mels-96     0.1173          0.7471           0.1479    0.7723
mels-128    0.1140          0.7389           0.1483    0.7710
input-128   0.1176          0.7448           0.1500    0.7743
input-224   0.1180          0.7468           0.1507    0.7753
bce         0.1184          0.7411           0.1472    0.7702
weighted    0.1143          0.7453           0.1488    0.7744
focal       0.1145          0.7420           0.1469    0.7678
ensemble    0.1179          0.7461           0.1506    0.7752

5 DISCUSSION AND OUTLOOK
The same training process was used for all models, both for the sake of simplicity and in order to provide better comparison possibilities. However, the models with frequency dependent convolutions have a much higher number of parameters and may benefit from more regularization. It seems that, with appropriate regularization, no overfitting occurs and the network could be trained for a larger number of epochs.

The results obtained on the test set, when broken down by label, are very different from those obtained on the validation set, in a way that is very consistent across all the models we have trained. On average, the score is significantly higher on the test set than on the validation set. Some labels (deep, summer, powerful...) are always better predicted on the test set whereas other labels (movie, action, groovy...) are always better predicted on the validation set. The fact that this is consistent across the models seems to indicate an inherent difference in the data rather than a high variance.

An increase in the input time length results in a small improvement. It is not clear whether this improvement is due to the additional input data seen by the network, resulting in a virtual increase in the number of epochs, whether it comes from a better management of the side effects introduced by padding, or from better averages seen by the fully connected layer. Since the receptive field at the average pooling layer is independent of the input length, the model cannot actually use features present at a longer time scale.

The actual influence of the number of bands in the spectrogram and of the frequency of the low-pass filter applied before the FFT is not clear and would need further study.

The introduction of residual blocks in place of the four filter layers seemed to provide a small but limited improvement. A ResNet version of the model would be a good candidate to further improve the results.

ACKNOWLEDGMENTS
The author is grateful to François Malabre for very helpful discussions.

REFERENCES
[1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Identity Mappings in Deep Residual Networks. In Computer Vision – ECCV 2016. 630–645.
[2] Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. 2018. Averaging Weights Leads to Wider Optima and Better Generalization. In 34th Conference on Uncertainty in Artificial Intelligence 2018, UAI 2018. 876–885.
[3] Dillon Knox, Timothy Greer, Benjamin Ma, Emily Kuo, Krishna Somandepalli, and Shrikanth Narayanan. 2020. MediaEval 2020 Emotion and Theme Recognition in Music Task: Loss Function Approaches for Multi-label Music Tagging. In Proc. of the MediaEval 2020 Workshop, Online, 13–15 December 2020.
[4] Khaled Koutini, Hamid Eghbal-zadeh, and Gerhard Widmer. 2019. Receptive-Field-Regularized CNN Variants for Acoustic Scene Classification. In Acoustic Scenes and Events 2019 Workshop (DCASE2019). 124–128.
[5] Jongpil Lee, Jiyoung Park, Luke Kim, and Juhan Nam. 2017. Sample-level Deep Convolutional Neural Networks for Music Auto-tagging Using Raw Waveforms. In Proceedings of the 14th Sound and Music Computing Conference, July 5–8, Espoo, Finland.
[6] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2017. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision. 2980–2988.
[7] Jordi Pons, Oriol Nieto, Matthew Prockup, Erik M. Schmidt, Andreas F. Ehmann, and Xavier Serra. 2018. End-to-end Learning for Music Audio Tagging at Scale. In Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR 2018, Paris, France, September 23–27, 2018. 637–644.
[8] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[9] Philip Tovstogan, Dmitry Bogdanov, and Alastair Porter. 2021. MediaEval 2021: Emotion and Theme Recognition in Music Using Jamendo. In Proc. of the MediaEval 2021 Workshop, Online, 13–15 December 2021.
[10] Minz Won, Andres Ferraro, Dmitry Bogdanov, and Xavier Serra. 2020. Evaluation of CNN-based Automatic Music Tagging Models. In 17th Sound and Music Computing Conference (SMC2020).
[11] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. 2018. mixup: Beyond Empirical Risk Minimization. In International Conference on Learning Representations.