=Paper=
{{Paper
|id=Vol-3181/paper21
|storemode=property
|title=Frequency Dependent Convolutions for Music Tagging
|pdfUrl=https://ceur-ws.org/Vol-3181/paper21.pdf
|volume=Vol-3181
|authors=Vincent Bour
|dblpUrl=https://dblp.org/rec/conf/mediaeval/Bour21
}}
==Frequency Dependent Convolutions for Music Tagging==
Vincent Bour
lileonardo, Paris, France
vincent.bour@lileonardo.com
ABSTRACT
We present a deep convolutional neural network approach for emotions and themes recognition in music using the MTG-Jamendo dataset. The model takes mel spectrograms as input and tries to leverage translation invariance in the time dimension while allowing convolution filters to depend on the frequency. This approach led the lileonardo team to achieve the highest score of the 2021 MediaEval Multimedia Evaluation benchmark for this task (https://multimediaeval.github.io/2021-Emotion-and-Theme-Recognition-in-Music-Task/results).

1 INTRODUCTION
Emotions and Themes Recognition in Music using Jamendo is a multi-label classification task of the 2021 MediaEval Multimedia Evaluation benchmark. Its goal is to automatically recognize the emotions and themes conveyed in a music recording by means of audio analysis. We refer to [9] for more details on the task.
2 RELATED WORK
In the wake of its successes in computer vision, the use of convolutional neural networks has become a very common approach for audio and music tagging and often leads to state-of-the-art results (see [10] for a comparison of different CNN approaches in music tagging). Among other things, [10] shows that a convolutional neural network trained on small excerpts of music constitutes a simple but efficient method for music tagging.
A number of previous works highlight the importance of the choice of filters with respect to the frequency dimension. In [7], the authors use domain knowledge to design vertical filters aimed at capturing different spectral features. In [4], the authors add a channel in order to make the convolution filters frequency aware. They also study the influence of the receptive field and show that models with a limited receptive field in the frequency dimension perform better.
As is often the case with multi-label tagging tasks, there is a pronounced imbalance between positive and negative classes. The proportion of positive examples for each label in the training set ranges from 0.6% to 9.3%. To address this issue, the authors of [3] tried to adapt the loss function and used focal loss [6], which increases the weight of the wrongly classified examples.

3 APPROACH
According to the results of [5], deep convolutional neural networks seem to be able to perform comparably well for music tagging tasks when used on raw waveforms or on mel spectrograms. After having experimented on a smaller dataset, and in order to save computation time, we decided to restrict ourselves to using mel spectrograms as input.
The task of recognizing emotions and themes seems particularly well adapted to training on small excerpts of music. Indeed, while an instrument tag can be attributed to a song even when the instrument is present in only a small part of it, emotions and themes can often be recognized in most parts of the song. Moreover, by reducing the input length to durations as small as 1 second, we observed that the ability of the network to perform its task did not radically diminish.
We computed 128-band mel spectrograms from the original audio files, keeping the sample rate of 44100 Hz and using an FFT window length of 2048 with 50% hop length and a low-pass filter with the maximum frequency set to 16 kHz (see Figure 1 for a comparison between these mel spectrograms and the original 96-band mel spectrograms provided by the MTG-Jamendo dataset).

Figure 1: An example of 96 and 128 bands mel spectrograms
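For illustration, these spectrogram settings could be reproduced with torchaudio roughly as follows (a minimal sketch; the window type, power, log compression and the input file name are assumptions, not taken from the paper):

```python
import torch
import torchaudio

# Settings described above: 44100 Hz audio, 2048-sample FFT window,
# 50% hop (1024 samples), 128 mel bands, low-pass at 16 kHz.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=44100,
    n_fft=2048,
    hop_length=1024,
    n_mels=128,
    f_max=16000,
)

waveform, sr = torchaudio.load("track.mp3")        # hypothetical input file
assert sr == 44100
spec = mel(waveform.mean(dim=0, keepdim=True))     # mono, shape (1, 128, T)
log_spec = torch.log1p(spec)                       # log compression (assumed)
```

At this hop length, 128 and 224 time frames correspond to roughly 3 s and 5.2 s of audio, consistent with the input lengths used in Section 4.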
We found that a simple CNN model overfitted rather quickly when trained on small random chunks of the tracks, unless a carefully engineered learning rate scheduling scheme was used. When we tried to train it on the 128-band spectrograms, it performed significantly worse and overfitted even more quickly.
In order to overcome the class imbalance issue, we used a weighted loss where, for each label, the weight of the positive class is inversely proportional to its frequency in the training set:

\ell(x, y) = -\frac{1}{n} \sum_{i=1}^{n} \left[ \frac{2}{1+p_i} \, y_i \log(x_i) + \frac{2 p_i}{1+p_i} \, (1-y_i) \log(1-x_i) \right],

where p_i is the frequency of the positive class in the training set for label i and n = 56 is the number of labels.
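A direct PyTorch formulation of this weighted loss could look like the sketch below, assuming x holds per-label probabilities after a sigmoid and pos_freq holds the 56 training-set frequencies p_i (the epsilon and the batch averaging are assumptions):

```python
import torch

def weighted_bce(x: torch.Tensor, y: torch.Tensor, pos_freq: torch.Tensor) -> torch.Tensor:
    """Per-label weighted binary cross-entropy as in the equation above.

    x: predicted probabilities, shape (batch, 56)
    y: binary targets, shape (batch, 56)
    pos_freq: frequency p_i of the positive class per label, shape (56,)
    """
    eps = 1e-7                                   # numerical safety (assumption)
    w_pos = 2.0 / (1.0 + pos_freq)               # weight of the positive class
    w_neg = 2.0 * pos_freq / (1.0 + pos_freq)    # weight of the negative class
    loss = -(w_pos * y * torch.log(x + eps)
             + w_neg * (1 - y) * torch.log(1 - x + eps))
    return loss.mean(dim=1).mean()               # average over labels, then batch
```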
We also adopted an input stem consisting of three convolutional layers followed by a max pooling layer, inspired by [8] and now widely used in ResNets [1]. It significantly improved the results, probably because it resulted in a better low-level analysis by the initial part of the model. It divides the time and frequency dimensions by a factor of four while outputting 128 channels. Four more convolutional layers are aimed at performing a higher-level analysis of the spectrogram. The dimension is then reduced to 1 × 1 by using a dense layer on the remaining frequency dimensions and by averaging along the time dimension.
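A minimal sketch of such an input stem, with shapes following the conv1-3 and maxpool rows of Table 1 below (the padding values and the BatchNorm/ReLU placement are assumptions):

```python
import torch.nn as nn

# Input stem: three 3x3 convolutions followed by max pooling, reducing the
# 1-channel spectrogram by a factor of four in time and frequency and
# producing 128 channels.
stem = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=3, stride=2, padding=1),    # conv1
    nn.BatchNorm2d(64), nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1),   # conv2
    nn.BatchNorm2d(64), nn.ReLU(inplace=True),
    nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1),  # conv3
    nn.BatchNorm2d(128), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),        # maxpool
)
# A (1, 1, 128, 224) spectrogram comes out as (1, 128, 32, 56).
```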
Table 1: Description of the layers for 128 frequency input dimension and 224 time input dimension

Layer     Input      Kernel / stride   Channels   Dropout
conv1     128 × 224  3×3 / 2           64         0
conv2     64 × 112   3×3 / 1           64         0
conv3     64 × 112   3×3 / 1           128        0
maxpool   64 × 112   3×3 / 2           –          –
filter1   32 × 56    3×3 / 2           128        0.2
filter2   16 × 28    3×3 / 2           256        0.2
filter3   16 × 28    3×3 / 1           256        0.2
filter4   8 × 14     3×3 / 2           256        0.2
collapse  4 × 7      4×1 / 1           512        0.2
avgpool   1 × 7      1×7 / 1           –          –
fc        1 × 1      1×1               1024       0.5
output    1 × 1      1×1               56         0.5

Table 2: Experimental results

             Validation         Test
             PR       ROC       PR       ROC
convs        0.1169   0.7434    0.1483   0.7715
freq-dep     0.1155   0.7452    0.1504   0.7744
mels-96      0.1173   0.7471    0.1479   0.7723
mels-128     0.1140   0.7389    0.1483   0.7710
input-128    0.1176   0.7448    0.1500   0.7743
input-224    0.1180   0.7468    0.1507   0.7753
bce          0.1184   0.7411    0.1472   0.7702
weighted     0.1143   0.7453    0.1488   0.7744
focal        0.1145   0.7420    0.1469   0.7678
ensemble     0.1179   0.7461    0.1506   0.7752
In order to give the model the possibility to detect different features at different frequencies, we let the convolution filters of the higher-level layers vary with the frequency. Despite the sixfold increase in parameters with this approach, it gave better results on the 128-band mel spectrograms.
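One possible realization of such a frequency dependent convolution is sketched below: it keeps an ordinary 3×3 receptive field but gives every output frequency position its own filter bank, so that weights are shared along time only. The unfold-based formulation, initialization and class name are our assumptions; the published implementation may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FreqDependentConv2d(nn.Module):
    """3x3 convolution whose weights vary with the frequency position.

    Weights are shared along the time axis (translation invariance in time),
    but each output frequency index has its own filter bank, which multiplies
    the parameter count by the number of frequency positions.
    """

    def __init__(self, in_ch: int, out_ch: int, freq_out: int, stride: int = 1):
        super().__init__()
        self.stride = stride
        self.freq_out = freq_out
        k = in_ch * 3 * 3
        self.weight = nn.Parameter(torch.randn(freq_out, out_ch, k) * (k ** -0.5))
        self.bias = nn.Parameter(torch.zeros(freq_out, out_ch))

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, C, F, T)
        b, c, f, t = x.shape
        f_out = (f + 2 - 3) // self.stride + 1
        t_out = (t + 2 - 3) // self.stride + 1
        assert f_out == self.freq_out
        # Extract 3x3 patches exactly as a normal padded convolution would.
        patches = F.unfold(x, kernel_size=3, padding=1, stride=self.stride)
        patches = patches.view(b, c * 9, f_out, t_out)
        # Apply a different filter bank at each output frequency position.
        out = torch.einsum('bkft,fok->boft', patches, self.weight)
        return out + self.bias.t().unsqueeze(0).unsqueeze(-1)
```

With a matching freq_out, a layer of this kind can stand in for the standard convolutions in the filter1-4 positions of Table 1, including the strided ones.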
The choice of this architecture comes from the hypothesis that the problem is invariant under translation in the time dimension, that the first layers compute low-level visual features, and that at a higher level the characteristic features depend on the frequency and are not similar in the low, middle or high frequency parts of the spectrogram.
We trained the network on randomly chosen small excerpts of each track. Final predictions are obtained by averaging the outputs over every small segment of the track.
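At inference time this amounts to something like the following sketch, where `model` is one of the networks described above and the segment length, the non-overlapping chunking and the sigmoid placement are assumptions:

```python
import torch

@torch.no_grad()
def predict_track(model, spec, seg_len=224):
    """Average the model outputs over consecutive fixed-size segments.

    spec: full-track mel spectrogram of shape (1, n_mels, n_frames).
    Returns a (56,) tensor of per-label probabilities.
    """
    model.eval()
    segments = spec.unfold(-1, seg_len, seg_len)    # non-overlapping chunks (assumed)
    segments = segments.permute(2, 0, 1, 3)         # (n_segments, 1, n_mels, seg_len)
    logits = model(segments)                        # (n_segments, 56)
    return torch.sigmoid(logits).mean(dim=0)
```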
4 EXPERIMENTAL RESULTS
We evaluated the model described in Table 1, first with normal 3 by 3 convolutions, then with frequency dependent convolutions in layers filter1-4.
We trained these two models both on the original 96-band mel spectrograms provided by the MTG-Jamendo dataset and on the 128-band ones described in the previous section. We also tried input time lengths of 128 and 224, corresponding to excerpts of approximately 3 s and 5.2 s. Finally, we trained the models with a classic binary cross-entropy loss, with the weighted loss described in Section 3 and with the focal loss with parameters α = 0.25 and γ = 2. We then averaged the predictions of the trained models for any given choice of these hyperparameters to obtain the results in Table 2.
All models were trained for 200 epochs using SGD with Nesterov momentum of 0.9, a learning rate of 0.5 and a weight decay of 2e-5. We used a cosine learning rate decay, mixup with α = 0.2 [11] and stochastic weight averaging on the last 40 epochs [2].
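The optimizer and schedule could be set up along the following lines. This is only a sketch: `model`, `train_one_epoch` and `train_loader` are assumed to exist, the mixup step is left inside the hypothetical training function, and the SWA learning rate and cosine horizon are assumptions rather than values from the paper.

```python
import torch
from torch.optim.swa_utils import AveragedModel, SWALR, update_bn

epochs, swa_start = 200, 160                   # SWA on the last 40 epochs

optimizer = torch.optim.SGD(model.parameters(), lr=0.5, momentum=0.9,
                            nesterov=True, weight_decay=2e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=swa_start)
swa_model = AveragedModel(model)
swa_scheduler = SWALR(optimizer, swa_lr=0.05)  # SWA learning rate is an assumption

for epoch in range(epochs):
    train_one_epoch(model, optimizer)          # hypothetical step, including mixup [11]
    if epoch < swa_start:
        scheduler.step()                       # cosine learning rate decay
    else:
        swa_model.update_parameters(model)     # stochastic weight averaging [2]
        swa_scheduler.step()

update_bn(train_loader, swa_model)             # recompute BatchNorm statistics
```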
A PyTorch implementation is available online at https://github.com/vibour/emotion-theme-recognition.

5 DISCUSSION AND OUTLOOK
The same training process was used for all models, both for the sake of simplicity and in order to provide better comparison possibilities. However, the models with frequency dependent convolutions have a much higher number of parameters and may benefit from more regularization. It seems that with appropriate regularization no overfitting occurs and the network could be trained for a larger number of epochs.
The results obtained on the test set, when broken down by label, are very different from those obtained on the validation set, in a way that is very consistent across all the models we have trained. On average, the score is significantly higher on the test set than on the validation set. Some labels (deep, summer, powerful...) are always better predicted on the test set, whereas other labels (movie, action, groovy...) are always better predicted on the validation set. The fact that this is consistent across the models seems to indicate an inherent difference in the data rather than a high variance.
An increase in the input time length results in a small improvement. It is not clear whether this improvement is due to the additional input data seen by the network resulting in a virtual increase in the number of epochs, whether it comes from a better handling of the side effects introduced by padding, or whether it comes from better averages seen by the fully connected layer. Since the receptive field at the average pooling layer is independent of the input length, the model cannot actually use features present at a longer time scale.
The actual influence of the number of bands in the spectrogram and of the frequency of the low-pass filter applied before the FFT is not clear and would need further study.
The introduction of residual blocks in place of the four filter layers seemed to provide a small but limited improvement. A ResNet version of the model would be a good candidate to further improve the results.

ACKNOWLEDGMENTS
The author is grateful to François Malabre for very helpful discussions.
REFERENCES
[1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Identity Mappings in Deep Residual Networks. In Computer Vision – ECCV 2016. 630–645.
[2] Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. 2018. Averaging Weights Leads to Wider Optima and Better Generalization. In 34th Conference on Uncertainty in Artificial Intelligence 2018, UAI 2018. 876–885.
[3] Dillon Knox, Timothy Greer, Benjamin Ma, Emily Kuo, Krishna Somandepalli, and Shrikanth Narayanan. 2020. MediaEval 2020 Emotion and Theme Recognition in Music Task: Loss Function Approaches for Multi-label Music Tagging. In Proc. of the MediaEval 2020 Workshop, Online, 13–15 December 2020.
[4] Khaled Koutini, Hamid Eghbal-zadeh, and Gerhard Widmer. 2019. Receptive-Field-Regularized CNN Variants for Acoustic Scene Classification. In Acoustic Scenes and Events 2019 Workshop (DCASE2019). 124–128.
[5] Jongpil Lee, Jiyoung Park, Luke Kim, and Juhan Nam. 2017. Sample-level Deep Convolutional Neural Networks for Music Auto-tagging Using Raw Waveforms. In Proceedings of the 14th Sound and Music Computing Conference, July 5–8, Espoo, Finland.
[6] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2017. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision. 2980–2988.
[7] Jordi Pons, Oriol Nieto, Matthew Prockup, Erik M. Schmidt, Andreas F. Ehmann, and Xavier Serra. 2018. End-to-end Learning for Music Audio Tagging at Scale. In Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR 2018, Paris, France, September 23–27, 2018. 637–644.
[8] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[9] Philip Tovstogan, Dmitry Bogdanov, and Alastair Porter. 2021. MediaEval 2021: Emotion and Theme Recognition in Music Using Jamendo. In Proc. of the MediaEval 2021 Workshop, Online, 13–15 December 2021.
[10] Minz Won, Andres Ferraro, Dmitry Bogdanov, and Xavier Serra. 2020. Evaluation of CNN-based Automatic Music Tagging Models. In 17th Sound and Music Computing Conference (SMC2020).
[11] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. 2018. mixup: Beyond Empirical Risk Minimization. In International Conference on Learning Representations.