            Frequency Dependent Convolutions for Music Tagging
                                                                           Vincent Bour
                                                                    lileonardo, Paris, France
                                                                 vincent.bour@lileonardo.com
Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
MediaEval’21, December 13-15 2021, Online

ABSTRACT
We present a deep convolutional neural network approach for emotions and themes recognition in music using the MTG-Jamendo dataset. The model takes mel spectrograms as input and tries to leverage translation invariance in the time dimension while allowing convolution filters to depend on the frequency. It has led the lileonardo team to achieve the highest score of the 2021 MediaEval Multimedia Evaluation benchmark for this task.¹

¹ https://multimediaeval.github.io/2021-Emotion-and-Theme-Recognition-in-Music-Task/results

1 INTRODUCTION
Emotions and Themes Recognition in Music using Jamendo is a multi-label classification task of the 2021 MediaEval Multimedia Evaluation benchmark. Its goal is to automatically recognize the emotions and themes conveyed in a music recording by means of audio analysis. We refer to [9] for more details on the task.
2 RELATED WORK
In the wake of its successes in computer vision, the use of convolutional neural networks has become a very common approach for audio and music tagging and often leads to state-of-the-art results (see [10] for a comparison of different CNN approaches in music tagging). Among other things, [10] shows that a convolutional neural network trained on small excerpts of music constitutes a simple but efficient method for music tagging.

A number of previous works highlight the importance of the choice of filters with respect to the frequency dimension. In [7], the authors use domain knowledge to design vertical filters aimed at capturing different spectral features. In [4], the authors add a channel in order to make the convolution filters frequency aware. They also study the influence of the receptive field and show that models with a limited receptive field in the frequency dimension perform better.

As is often the case with multi-label tagging tasks, there is a pronounced imbalance between positive and negative classes. The proportion of positive examples for each label in the training set ranges from 0.6% to 9.3%. To address this issue, the authors of [3] tried to adapt the loss function and used focal loss [6], which increases the weight of the wrongly classified examples.
3 APPROACH
According to the results of [5], deep convolutional neural networks seem to be able to perform comparably well for music tagging tasks when used on raw waveforms or on mel spectrograms. After having experimented on a smaller dataset, and in order to save computation time, we have decided to restrict ourselves to using mel spectrograms as input.

The task of recognizing emotions and themes seems particularly well adapted to training on small excerpts of music. Indeed, while an instrument tag can be attributed to a song when it is only present on a small part of it, emotions and themes can often be recognized on most parts of the song. Moreover, by reducing the input time to lengths as small as 1 second, we observed that the ability of the network to perform its task did not radically diminish.

We have computed 128 frequency band mel spectrograms from the original audio files by keeping the sample rate of 44100 Hz, using an FFT window length of 2048 with 50% hop length and a low-pass filter with maximum frequency set to 16 kHz (see Figure 1 for a comparison between these mel spectrograms and the original 96 band mel spectrograms provided by the MTG-Jamendo dataset).

Figure 1: An example of 96 and 128 band mel spectrograms
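For concreteness, a minimal preprocessing sketch along these lines using torchaudio; the log scaling, the mono mix-down and the placeholder file path are assumptions, not details taken from the paper:

```python
import torchaudio

# Parameters follow the text: 44.1 kHz sample rate, FFT window of 2048 samples,
# 50% hop (1024 samples), 128 mel bands, content above 16 kHz discarded.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=44100,
    n_fft=2048,
    hop_length=1024,
    n_mels=128,
    f_max=16000.0,
)
to_db = torchaudio.transforms.AmplitudeToDB()   # log scaling (assumed)

waveform, sr = torchaudio.load("track.mp3")     # placeholder path
spec = to_db(mel(waveform.mean(dim=0)))         # mono mix-down; shape (128, time)
```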
We found that a simple CNN model overfitted rather quickly when trained on small random chunks of the tracks, unless a carefully engineered learning rate scheduling scheme was used. When we tried to train it on the 128 band spectrograms, it performed significantly worse and overfitted even more quickly.

In order to try and overcome the class imbalance issue, we used a weighted loss, where for each label, the weight of the positive class is inversely proportional to its frequency in the training set:

\[
l(x, y) = -\frac{1}{c} \sum_{i=1}^{c} \left( \frac{2}{1 + p_i}\, y_i \log(x_i) + \frac{2 p_i}{1 + p_i}\, (1 - y_i) \log(1 - x_i) \right),
\]

where 𝑝𝑖 is the frequency of the positive class in the train set for label 𝑖 and 𝑐 = 56 is the number of labels.
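A minimal PyTorch sketch of this loss, assuming the model outputs per-label probabilities x after a sigmoid (the clamping constant is an implementation detail, not taken from the paper):

```python
import torch

def weighted_bce(x, y, p, eps=1e-7):
    # x: predicted probabilities, shape (batch, c); y: binary targets, same shape
    # p: positive-class frequency of each label in the training set, shape (c,)
    w_pos = 2.0 / (1.0 + p)       # larger weight for the rare positive class
    w_neg = 2.0 * p / (1.0 + p)   # smaller weight for the abundant negative class
    x = x.clamp(eps, 1.0 - eps)
    loss = -(w_pos * y * torch.log(x) + w_neg * (1.0 - y) * torch.log(1.0 - x))
    return loss.mean()            # mean over the c labels and over the batch
```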
We also adopted an input stem consisting of three convolutional layers followed by a max pooling layer, inspired by [8] and now widely used in ResNets [1]. It significantly improved the results, probably because it resulted in a better low level analysis from the initial part of the model. It divides the time and frequency dimensions by a factor of four while outputting 128 channels. Four more convolutional layers are aimed at performing a higher level analysis of the spectrogram. The dimension is then reduced to 1 Γ— 1 by using a dense layer on the remaining frequency dimensions and by averaging along the time dimension.
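As an illustration, a minimal sketch of such a stem, matching the conv1–conv3 and maxpool rows of Table 1 below; the BatchNorm/ReLU placement and the single input channel are assumptions:

```python
import torch.nn as nn

# Input: (batch, 1, 128, 224) mel spectrogram; output: (batch, 128, 32, 56),
# i.e. time and frequency divided by four with 128 output channels.
stem = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=3, stride=2, padding=1),    # conv1
    nn.BatchNorm2d(64), nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1),   # conv2
    nn.BatchNorm2d(64), nn.ReLU(inplace=True),
    nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1),  # conv3
    nn.BatchNorm2d(128), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),        # maxpool
)
```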
Table 1: Description of the layers for 128 frequency input dimension and 224 time input dimension

Layer      Input       Kernel / stride   Channels   Dropout
conv1      128 Γ— 224   3 Γ— 3 / 2         64         0
conv2      64 Γ— 112    3 Γ— 3 / 1         64         0
conv3      64 Γ— 112    3 Γ— 3 / 1         128        0
maxpool    64 Γ— 112    3 Γ— 3 / 2
filter1    32 Γ— 56     3 Γ— 3 / 2         128        0.2
filter2    16 Γ— 28     3 Γ— 3 / 2         256        0.2
filter3    16 Γ— 28     3 Γ— 3 / 1         256        0.2
filter4    8 Γ— 14      3 Γ— 3 / 2         256        0.2
collapse   4 Γ— 7       4 Γ— 1 / 1         512        0.2
avgpool    1 Γ— 7       1 Γ— 7 / 1
fc         1 Γ— 1       1 Γ— 1             1024       0.5
output     1 Γ— 1       1 Γ— 1             56         0.5
In order to give the model the possibility to detect different features at different frequencies, we tried letting the convolution filters of the higher level layers vary with the frequency. Despite the sixfold increase in parameters with this approach, it gave better results on the 128 band mel spectrograms.
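One possible way to realize such frequency dependent convolutions, given purely as a hedged sketch; the number of frequency bands and the exact weight layout are assumptions and need not match the released implementation:

```python
import torch
import torch.nn as nn

class FreqDependentConv(nn.Module):
    # Convolution whose filters vary with frequency: the frequency axis is
    # split into bands and each band gets its own 3x3 convolution, so weights
    # are shared along time but not across frequency bands.
    def __init__(self, in_ch, out_ch, n_bands=4, stride=1):
        super().__init__()
        self.n_bands = n_bands
        self.convs = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1)
            for _ in range(n_bands)
        )

    def forward(self, x):                 # x: (batch, channels, freq, time)
        bands = torch.chunk(x, self.n_bands, dim=2)
        out = [conv(b) for conv, b in zip(self.convs, bands)]
        return torch.cat(out, dim=2)      # reassemble along the frequency axis
```

In such a layer, weights remain shared along the time axis within each band, preserving time-translation invariance, while the filters are free to differ between low, middle and high frequency bands.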
The choice of this architecture comes from the hypothesis that the problem is invariant under translation in the time dimension, that the first layers compute low level visual features, and that at a higher level, the characteristic features depend on the frequency and are not similar on the low, middle or high frequency parts of the spectrogram.

We trained the network on randomly chosen small excerpts of each track. Final predictions are obtained by averaging the outputs over every small segment of the track.
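A minimal sketch of this track-level inference, assuming the model outputs per-label logits; the segment length, hop and the sigmoid are placeholders rather than values taken from the paper:

```python
import torch

def predict_track(model, spec, seg_len=224, hop=224):
    # spec: full mel spectrogram of one track, shape (1, n_mels, time).
    # Slice it into fixed-length segments along time, run the model on each
    # segment and average the per-label probabilities.
    model.eval()
    segments = [spec[..., t:t + seg_len]
                for t in range(0, spec.shape[-1] - seg_len + 1, hop)]
    with torch.no_grad():
        probs = [torch.sigmoid(model(s.unsqueeze(0))) for s in segments]
    return torch.stack(probs).mean(dim=0)  # averaged predictions for the 56 tags
```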
4 EXPERIMENTAL RESULTS
We evaluated the model described in Table 1, first with normal 3 by 3 convolutions, then with frequency dependent convolutions in layers filter1–4.

We trained these two models both on the original 96 band mel spectrograms provided by the MTG-Jamendo dataset and on the 128 band ones described in the previous section. We also tried input time lengths of 128 and 224, corresponding to excerpts of approximately 3 s and 5.2 s. Finally, we trained the models with a classic binary cross-entropy loss, with the weighted loss described in Section 3, and with the focal loss with parameters 𝛼 = 0.25 and 𝛾 = 2. We then averaged the predictions of the trained models for any given choice of these hyperparameters to obtain the results in Table 2.

Table 2: Experimental results

             Validation           Test
             PR        ROC        PR        ROC
convs        0.1169    0.7434     0.1483    0.7715
freq-dep     0.1155    0.7452     0.1504    0.7744

mels-96      0.1173    0.7471     0.1479    0.7723
mels-128     0.1140    0.7389     0.1483    0.7710
input-128    0.1176    0.7448     0.1500    0.7743
input-224    0.1180    0.7468     0.1507    0.7753

bce          0.1184    0.7411     0.1472    0.7702
weighted     0.1143    0.7453     0.1488    0.7744
focal        0.1145    0.7420     0.1469    0.7678

ensemble     0.1179    0.7461     0.1506    0.7752
All models were trained for 200 epochs using SGD with Nesterov momentum of 0.9, a learning rate of 0.5 and weight decay of 2e-5. We used a cosine learning rate decay, mixup with 𝛼 = 0.2 [11] and stochastic weight averaging on the last 40 epochs [2].
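A condensed sketch of this optimization setup, with the hyperparameter values as stated above; `model`, `train_one_epoch` and `train_loader` are hypothetical placeholders for the actual training pipeline, and the mixup step is only indicated in a comment:

```python
import torch
from torch.optim.swa_utils import AveragedModel, update_bn

optimizer = torch.optim.SGD(model.parameters(), lr=0.5, momentum=0.9,
                            nesterov=True, weight_decay=2e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)
swa_model = AveragedModel(model)

for epoch in range(200):
    train_one_epoch(model, optimizer)      # assumed helper; applies mixup with alpha = 0.2 [11]
    scheduler.step()
    if epoch >= 160:                       # stochastic weight averaging on the last 40 epochs [2]
        swa_model.update_parameters(model)

update_bn(train_loader, swa_model)         # refresh BatchNorm statistics of the averaged model
```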
A PyTorch implementation is available online.²

² https://github.com/vibour/emotion-theme-recognition

5 DISCUSSION AND OUTLOOK
The same training process was used for all models, both for the sake of simplicity and in order to provide better comparison possibilities. However, the models with frequency dependent convolutions have a much higher number of parameters and may benefit from more regularization. It seems that with appropriate regularization, no overfitting occurs and the network could be trained for a larger number of epochs.

The results obtained on the test set, when broken down by label, are very different from those obtained on the validation set, in a way that is very consistent across all the models we have trained. On average, the score is significantly higher on the test set than on the validation set. Some labels (deep, summer, powerful...) are always better predicted on the test set whereas other labels (movie, action, groovy...) are always better predicted on the validation set. The fact that this is consistent across the models seems to show an inherent difference in the data rather than to be indicative of a high variance.

An increase in the input time length results in a small improvement. It is not clear whether this improvement is due to the additional input data seen by the network resulting in a virtual increase in the number of epochs, whether it comes from a better management of side effects introduced by padding, or whether it comes from better averages seen by the fully connected layer. Since the receptive field at the average pooling layer is independent of the input length, the model cannot actually use features present at a longer time scale.

The actual influence of the number of bands in the spectrogram and of the frequency of the low-pass filter applied before the FFT is not clear and would need further study.

The introduction of residual blocks in place of the four filter layers seemed to provide a small but limited improvement. A ResNet version of the model would be a good candidate to further improve the results.

ACKNOWLEDGMENTS
The author is grateful to FranΓ§ois Malabre for very helpful discussions.
REFERENCES
 [1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Iden-
     tity Mappings in Deep Residual Networks. In Computer Vision – ECCV
     2016. 630–645.
 [2] Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov,
     and Andrew Gordon Wilson. 2018. Averaging weights leads to wider
     optima and better generalization. In 34th Conference on Uncertainty in
     Artificial Intelligence 2018, UAI 2018. 876–885.
 [3] Dillon Knox, Timothy Greer, Benjamin Ma, Emily Kuo, Krishna So-
     mandepalli, and Shrikanth Narayanan. 2020. MediaEval 2020 Emotion
     and Theme Recognition in Music Task: Loss Function Approaches for
     Multi-label Music Tagging. In Proc. of the MediaEval 2020 Workshop,
     Online, 13–15 December 2020.
 [4] Khaled Koutini, Hamid Eghbal-zadeh, and Gerhard Widmer. 2019.
     Receptive-Field-Regularized CNN Variants for Acoustic Scene Clas-
     sification. In Acoustic Scenes and Events 2019 Workshop (DCASE2019).
     124–128.
 [5] Jongpil Lee, Jiyoung Park, Luke Kim, and Juhan Nam. 2017. Sample-
     level Deep Convolutional Neural Networks for Music auto-tagging
     Using Raw Waveforms. In Proceedings of the 14th Sound and Music
     Computing Conference, July 5-8, Espoo, Finland.
 [6] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr DollΓ‘r.
     2017. Focal loss for dense object detection. In Proceedings of the IEEE
     international conference on computer vision. 2980–2988.
 [7] Jordi Pons, Oriol Nieto, Matthew Prockup, Erik M. Schmidt, Andreas F.
     Ehmann, and Xavier Serra. 2018. End-to-end Learning for Music
     Audio Tagging at Scale. In Proceedings of the 19th International Society
     for Music Information Retrieval Conference, ISMIR 2018, Paris, France,
     September 23–27, 2018. 637–644.
 [8] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and
     Zbigniew Wojna. 2016. Rethinking the Inception Architecture for
     Computer Vision. In Proceedings of the IEEE Conference on Computer
     Vision and Pattern Recognition (CVPR).
 [9] Philip Tovstogan, Dmitry Bogdanov, and Alastair Porter. 2021. Media-
     Eval 2021: Emotion and Theme Recognition in Music Using Jamendo.
     In Proc. of the MediaEval 2021 Workshop, Online, 13–15 December 2021.
[10] Minz Won, Andres Ferraro, Dmitry Bogdanov, and Xavier Serra. 2020.
     Evaluation of CNN-based Automatic Music Tagging Models. In 17th
     Sound and Music Computing Conference (SMC2020).
[11] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-
     Paz. 2018. mixup: Beyond Empirical Risk Minimization. In Interna-
     tional Conference on Learning Representations.