<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Frequency Dependent Convolutions for Music Tagging</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vincent Bour</string-name>
          <aff>lileonardo, Paris, France</aff>
          <email>vincent.bour@lileonardo.com</email>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <conference>
        <conf-date>13-15 December 2021</conf-date>
      </conference>
      <abstract>
        <p>We present a deep convolutional neural network approach to emotion and theme recognition in music using the MTG-Jamendo dataset. The model takes mel spectrograms as input and aims to leverage translation invariance in the time dimension while allowing the convolution filters to depend on the frequency. With this model, the lileonardo team achieved the highest score on this task in the 2021 MediaEval Multimedia Evaluation benchmark<sup>1</sup>.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Emotions and Themes Recognition in Music using Jamendo is a
multi-label classification task of the 2021 MediaEval Multimedia
Evaluation benchmark. Its goal is to automatically recognize the
emotions and themes conveyed in a music recording by means of
audio analysis. We refer to [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] for more details on the task.
      </p>
    </sec>
    <sec id="sec-2">
      <title>RELATED WORK</title>
      <p>
        In the wake of their successes in computer vision,
convolutional neural networks have become a very common approach to
audio and music tagging and often lead to state-of-the-art results
(see [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] for a comparison of different CNN approaches to music
tagging). Among other things, [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] shows that a convolutional
neural network trained on small excerpts of music constitutes a simple
but efficient method for music tagging.
      </p>
      <p>
        A number of previous works highlight the importance of the
choice of filters with respect to the frequency dimension. In [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], the
authors use domain knowledge to design vertical filters intended to
capture different spectral features. In [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], the authors add a channel encoding the frequency position
in order to make the convolution filters frequency-aware. They also
study the influence of the receptive field and show that models with
a limited receptive field in the frequency dimension perform better.
      </p>
      <p>
        As is often the case with multi-label tagging tasks, there is
a pronounced imbalance between the positive and negative classes.
The proportion of positive examples for each label in the training
set ranges from 0.6% to 9.3%. To address this issue, the authors of
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] tried to adapt the loss function and used focal loss [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which
increases the relative weight of the misclassified examples.
      </p>
    </sec>
    <sec id="sec-3">
      <title>APPROACH</title>
      <p>
        According to the results of [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], deep convolutional neural networks
seem to perform comparably well on music tagging
tasks whether they are applied to raw waveforms or to mel spectrograms. After
experimenting on a smaller dataset, and in order to save
computation time, we decided to restrict ourselves to
mel spectrograms as input.
        <fn>
          <label>1</label>
          <p><ext-link ext-link-type="uri" xlink:href="https://multimediaeval.github.io/2021-Emotion-and-Theme-Recognition-in-Music-Task/results">https://multimediaeval.github.io/2021-Emotion-and-Theme-Recognition-in-Music-Task/results</ext-link></p>
        </fn>
      </p>
      <p>The task of recognizing emotions and themes seems particularly
well suited to training on small excerpts of music. Indeed, while
an instrument tag can be attributed to a song even when the instrument is only present
in a small part of it, emotions and themes can often be recognized
in most parts of the song. Moreover, when we reduced the input
length to as little as 1 second, the ability of the
network to perform its task did not diminish drastically.</p>
      <p>We computed mel spectrograms with 128 frequency bands from
the original audio files, keeping the sample rate of 44100 Hz and
using an FFT window length of 2048 with a 50% hop length and a
low-pass filter with the maximum frequency set to 16 kHz (see Figure 1 for
a comparison between these mel spectrograms and the original 96-band
mel spectrograms provided by the MTG-Jamendo dataset).</p>
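      <p>As a quick sanity check on these parameters, the following Python sketch converts a number of spectrogram frames into an approximate duration. It assumes that the 50% hop length corresponds to 1024 samples (half of the 2048-sample window) and ignores edge padding; the function name is illustrative and is not taken from the author's code.</p>
      <preformat>
```python
SAMPLE_RATE = 44100      # Hz, as stated in the text
N_FFT = 2048             # FFT window length in samples
HOP_LENGTH = N_FFT // 2  # assumed: 50% hop length, i.e. 1024 samples


def frames_to_seconds(n_frames: int) -> float:
    """Approximate duration covered by n_frames hops (edge padding ignored)."""
    return n_frames * HOP_LENGTH / SAMPLE_RATE


if __name__ == "__main__":
    for n in (43, 128, 224):
        print(n, "frames ->", round(frames_to_seconds(n), 2), "s")
```
      </preformat>
      <p>Under these assumptions, one frame covers about 23 ms, so an excerpt of roughly 1 second is about 43 frames long.</p>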
      <p>We found that a simple CNN model overfitted rather quickly
when trained on small random chunks of the tracks, unless a
carefully engineered learning rate schedule was used. When
we trained it on the 128-band spectrograms, it performed
significantly worse and overfitted even more quickly.</p>
      <p>To mitigate the class imbalance, we used
a weighted loss in which, for each label, the weight of the positive
class is inversely proportional to its frequency in the training set:
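        <disp-formula>
          <tex-math>\mathcal{L}(y, \hat{y}) = -\frac{1}{N} \sum_{i=1}^{N} \left[ \frac{2}{1 + p_i} \, y_i \log(\hat{y}_i) + \frac{2 p_i}{1 + p_i} \, (1 - y_i) \log(1 - \hat{y}_i) \right],</tex-math>
        </disp-formula>
where <italic>p</italic><sub><italic>i</italic></sub> is the frequency of the positive class in the training set for
label <italic>i</italic> and <italic>N</italic> = 56 is the number of labels.</p>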
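      <p>To make the weighting concrete, here is a minimal pure-Python sketch of such a loss for a single example. It assumes a positive-class weight of 2 / (1 + p_i) and a negative-class weight of 2 p_i / (1 + p_i), which is our reading of the formula; the function name, the clipping constant eps and the list-based interface are illustrative choices, not the author's implementation.</p>
      <preformat>
```python
import math


def weighted_bce(y_true, y_pred, pos_freq, eps=1e-7):
    """Weighted binary cross-entropy averaged over labels.

    y_true:   list of 0/1 targets, one per label
    y_pred:   list of predicted probabilities in (0, 1)
    pos_freq: list of positive-class frequencies p_i in the training set
    """
    total = 0.0
    for y, y_hat, p in zip(y_true, y_pred, pos_freq):
        y_hat = min(max(y_hat, eps), 1.0 - eps)  # avoid log(0)
        w_pos = 2.0 / (1.0 + p)        # assumed weight; large when the label is rare
        w_neg = 2.0 * p / (1.0 + p)    # assumed weight; w_pos / w_neg == 1 / p
        total += w_pos * y * math.log(y_hat)
        total += w_neg * (1 - y) * math.log(1.0 - y_hat)
    return -total / len(y_true)
```
      </preformat>
      <p>With these weights, a label whose positive frequency is 0.6% gives a positive example about 167 times the weight of a negative one, while a label with p_i = 1 reduces to the ordinary binary cross-entropy.</p>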
      <p>
        We also adopted an input stem consisting of three convolutional
layers followed by a max pooling layer, inspired by [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and now
widely used in ResNets [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. It significantly improved the results,
probably because it provides a better low-level analysis in
the initial part of the model. It divides the time and frequency
dimensions by a factor of four while outputting 128 channels. Four
more convolutional layers are meant to perform a higher-level
analysis of the spectrogram. The dimension is then reduced to 1 × 1
by applying a dense layer along the remaining frequency dimension and
by averaging along the time dimension.
      </p>
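      <p>The reduction by a factor of four can be checked with simple shape arithmetic. The sketch below assumes a ResNet-style stem in which the first convolution and the max pooling layer both use a stride of 2, with 3 × 3 kernels and a padding of 1 throughout; these kernel sizes, strides and paddings are hypothetical, since the text only states the overall reduction factor and the 128 output channels.</p>
      <preformat>
```python
def out_size(n: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a convolution or pooling layer along one axis."""
    return (n + 2 * padding - kernel) // stride + 1


# (kernel, stride, padding) per layer; hypothetical values chosen so that
# the stem reduces each axis by a factor of four, as stated in the text.
STEM = [
    (3, 2, 1),  # conv 1 (stride 2)
    (3, 1, 1),  # conv 2
    (3, 1, 1),  # conv 3
    (3, 2, 1),  # max pooling (stride 2)
]


def stem_output_shape(n_mels: int, n_frames: int):
    """Frequency and time sizes after the assumed input stem."""
    freq, time = n_mels, n_frames
    for kernel, stride, padding in STEM:
        freq = out_size(freq, kernel, stride, padding)
        time = out_size(time, kernel, stride, padding)
    return freq, time


if __name__ == "__main__":
    print(stem_output_shape(128, 224))
    print(stem_output_shape(96, 128))
```
      </preformat>
      <p>Under these assumptions, a 128-band input of 224 frames leaves the stem with 32 frequency rows and 56 time steps.</p>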
      <p>To give the model the possibility of detecting different
features at different frequencies, we let the convolution
filters of the higher-level layers vary with the frequency. Despite
the sixfold increase in the number of parameters, this approach gave better
results on the 128-band mel spectrograms.</p>
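      <p>One way to picture a frequency-dependent convolution is as a filter that is shared along the time axis but has its own weights for each frequency row. The single-channel, pure-Python sketch below illustrates this idea under that assumption; it does not claim to reproduce the author's PyTorch implementation, and the function name and the one-kernel-per-row layout are hypothetical.</p>
      <preformat>
```python
def freq_dependent_conv(x, kernels):
    """Single-channel 3x3 convolution whose kernel depends on the frequency row.

    x:       2-D list indexed as x[freq][time]
    kernels: one 3x3 kernel per frequency row, kernels[freq][df][dt]
    Zero padding keeps the output the same size as the input.
    """
    n_freq, n_time = len(x), len(x[0])
    out = [[0.0] * n_time for _ in range(n_freq)]
    for f in range(n_freq):
        k = kernels[f]  # weights are shared along time, not along frequency
        for t in range(n_time):
            acc = 0.0
            for df in (-1, 0, 1):
                for dt in (-1, 0, 1):
                    ff, tt = f + df, t + dt
                    if ff in range(n_freq) and tt in range(n_time):
                        acc += k[df + 1][dt + 1] * x[ff][tt]
            out[f][t] = acc
    return out
```
      </preformat>
      <p>Shifting the input along the time axis shifts the output by the same amount, whereas the same pattern placed at a different frequency can produce a different response. If every row is given the same kernel, the function reduces to an ordinary 3 × 3 convolution.</p>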
      <p>This architecture stems from the hypothesis that
the problem is invariant under translation in the time dimension, that
the first layers compute low-level visual features, and that, at a
higher level, the characteristic features depend on the frequency
and differ between the low, middle and high frequency parts of
the spectrogram.</p>
      <p>We trained the network on randomly chosen small excerpts of
each track. Final predictions are obtained by averaging the outputs
over all the small segments of the track.</p>
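      <p>The inference step can be sketched as follows. The sketch assumes consecutive, non-overlapping segments and drops any trailing remainder shorter than a segment; the text does not specify either choice, and both function names are illustrative.</p>
      <preformat>
```python
def split_into_segments(n_frames: int, segment_len: int):
    """Start/end frame indices of consecutive non-overlapping segments."""
    return [(start, start + segment_len)
            for start in range(0, n_frames - segment_len + 1, segment_len)]


def average_predictions(segment_outputs):
    """Average per-segment label scores into one score vector per track."""
    n_segments = len(segment_outputs)
    n_labels = len(segment_outputs[0])
    return [sum(seg[i] for seg in segment_outputs) / n_segments
            for i in range(n_labels)]
```
      </preformat>
      <p>In practice, each segment would be passed through the trained network, and the resulting score vectors (one value per label) would be given to average_predictions.</p>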
    </sec>
    <sec id="sec-4">
      <title>EXPERIMENTAL RESULTS</title>
      <p>We evaluated the model described in Table 1, first with standard
3 × 3 convolutions, then with frequency-dependent convolutions
in layers filter1–4.</p>
      <p>We trained these two models both on the original 96-band mel
spectrograms provided by the MTG-Jamendo dataset and on the
128-band ones described in the previous section. We also tried
input lengths of 128 and 224 time frames, corresponding to excerpts of
approximately 3 s and 5.2 s. Finally, we trained the models with a
classic binary cross-entropy loss, with the weighted loss described
in Section 3, and with the focal loss with parameters α = 0.25 and
γ = 2. For each choice of these hyperparameters, we then averaged the
predictions of the corresponding trained models to obtain the results in
Table 2.</p>
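      <p>For reference, a binary focal loss with α = 0.25 and γ = 2 can be sketched as below, following the usual formulation of Lin et al. in which α weights the positive class and 1 - α the negative class. Whether the experiments applied α in exactly this way is an assumption; the function name and the clipping constant eps are illustrative.</p>
      <preformat>
```python
import math


def binary_focal_loss(y_true, y_pred, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss averaged over labels (Lin et al. formulation)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)       # avoid log(0)
        p_t = p if y == 1 else 1.0 - p        # probability of the true class
        alpha_t = alpha if y == 1 else 1.0 - alpha
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(y_true)
```
      </preformat>
      <p>The factor (1 - p_t)^γ shrinks the contribution of well-classified examples: a positive example predicted at 0.9 is scaled by 0.01 relative to its cross-entropy term.</p>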
      <p>
        All models were trained for 200 epochs using SGD with a Nesterov
momentum of 0.9, a learning rate of 0.5 and a weight decay of 2e-5.
We used a cosine learning rate decay, mixup with α = 0.2 [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and
stochastic weight averaging on the last 40 epochs [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
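      <p>Two ingredients of this training recipe are easy to illustrate in isolation. The sketch below assumes that the cosine decay is applied once per epoch and goes from 0.5 down to 0, and that mixup draws its blending coefficient from a Beta(α, α) distribution as in the mixup paper; how the schedule interacts with stochastic weight averaging is not specified in the text and is not modelled here. All names are illustrative.</p>
      <preformat>
```python
import math
import random

BASE_LR = 0.5     # initial learning rate, as stated in the text
N_EPOCHS = 200    # total number of epochs, as stated in the text


def cosine_lr(epoch: int, base_lr: float = BASE_LR, n_epochs: int = N_EPOCHS) -> float:
    """Cosine decay from base_lr at epoch 0 down to 0 at n_epochs (assumed)."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / n_epochs))


def mixup(x1, y1, x2, y2, alpha=0.2, rng=random):
    """Blend two examples (flat lists) with a Beta(alpha, alpha) coefficient."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam
```
      </preformat>
      <p>With α = 0.2, the Beta distribution concentrates its mass near 0 and 1, so most mixed examples stay close to one of the two originals.</p>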
      <p>A PyTorch implementation is available online at
<ext-link ext-link-type="uri" xlink:href="https://github.com/vibour/emotion-theme-recognition">https://github.com/vibour/emotion-theme-recognition</ext-link>.
The same training process was used for all models, both for the sake
of simplicity and to make the results easier to compare.
However, the models with frequency-dependent convolutions have
a much higher number of parameters and may benefit from more
regularization. It seems that, with appropriate regularization, no
overfitting occurs and the network could be trained for a larger
number of epochs.</p>
      <p>The results obtained on the test set, when broken down by label,
are very different from those obtained on the validation set, in a
way that is very consistent across all the models we have trained.
On average, the score is significantly higher on the test set than
on the validation set. Some labels (deep, summer, powerful...) are
always better predicted on the test set, whereas other labels (movie,
action, groovy...) are always better predicted on the validation set.
The fact that this is consistent across the models seems to indicate an
inherent difference in the data rather than a high
variance.</p>
      <p>An increase in the input time length results in a small
improvement. It is not clear whether this improvement is due to the
additional input data seen by the network, which amounts to a virtual increase
in the number of epochs, to a better handling of the
side effects introduced by padding, or to better averages seen by
the fully connected layer. Since the receptive field at the average
pooling layer is independent of the input length, the model cannot
actually use features present at a longer time scale.</p>
      <p>The actual influence of the number of bands in the spectrogram
and of the cutoff frequency of the low-pass filter applied before the FFT is
not clear and would need further study.</p>
      <p>Introducing residual blocks in place of the four filter
layers seemed to provide a small improvement. A ResNet
version of the model would be a good candidate to further improve
the results.</p>
    </sec>
    <sec id="sec-5">
      <title>ACKNOWLEDGMENTS</title>
      <p>The author is grateful to François Malabre for very helpful
discussions.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Kaiming</given-names>
            <surname>He</surname>
          </string-name>
          , Xiangyu Zhang, Shaoqing Ren, and
          <string-name>
            <given-names>Jian</given-names>
            <surname>Sun</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Identity Mappings in Deep Residual Networks</article-title>
          .
          <source>In Computer Vision - ECCV 2016</source>
          .
          <fpage>630</fpage>
          -
          <lpage>645</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Pavel</given-names>
            <surname>Izmailov</surname>
          </string-name>
          , Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson.
          <year>2018</year>
          .
          <article-title>Averaging weights leads to wider optima and better generalization</article-title>
          .
          <source>In 34th Conference on Uncertainty in Artificial Intelligence 2018, UAI 2018</source>
          .
          <fpage>876</fpage>
          -
          <lpage>885</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Dillon</given-names>
            <surname>Knox</surname>
          </string-name>
          , Timothy Greer, Benjamin Ma, Emily Kuo, Krishna Somandepalli, and
          <string-name>
            <given-names>Shrikanth</given-names>
            <surname>Narayanan</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>MediaEval 2020 Emotion and Theme Recognition in Music Task: Loss Function Approaches for Multi-label Music Tagging</article-title>
          .
          <source>In Proc. of the MediaEval 2020 Workshop</source>
          , Online, 13-15 December
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Khaled</given-names>
            <surname>Koutini</surname>
          </string-name>
          , Hamid Eghbal-zadeh, and
          <string-name>
            <given-names>Gerhard</given-names>
            <surname>Widmer</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Receptive-Field-Regularized CNN Variants for Acoustic Scene Classification</article-title>
          .
          <source>In Acoustic Scenes and Events 2019 Workshop (DCASE2019)</source>
          .
          <fpage>124</fpage>
          -
          <lpage>128</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Jongpil</given-names>
            <surname>Lee</surname>
          </string-name>
          , Jiyoung Park, Luke Kim, and
          <string-name>
            <given-names>Juhan</given-names>
            <surname>Nam</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Sample-level Deep Convolutional Neural Networks for Music Auto-tagging Using Raw Waveforms</article-title>
          .
          <source>In Proceedings of the 14th Sound and Music Computing Conference, July 5-8</source>
          , Espoo, Finland.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Tsung-Yi</given-names>
            <surname>Lin</surname>
          </string-name>
          , Priya Goyal, Ross Girshick, Kaiming He, and
          <string-name>
            <given-names>Piotr</given-names>
            <surname>Dollár</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Focal loss for dense object detection</article-title>
          .
          <source>In Proceedings of the IEEE international conference on computer vision</source>
          .
          <fpage>2980</fpage>
          -
          <lpage>2988</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Jordi</given-names>
            <surname>Pons</surname>
          </string-name>
          , Oriol Nieto, Matthew Prockup, Erik M. Schmidt, Andreas F. Ehmann, and
          <string-name>
            <given-names>Xavier</given-names>
            <surname>Serra</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>End-to-end Learning for Music Audio Tagging at Scale</article-title>
          .
          <source>In Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR 2018</source>
          , Paris, France, September 23-27,
          <year>2018</year>
          .
          <fpage>637</fpage>
          -
          <lpage>644</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Christian</given-names>
            <surname>Szegedy</surname>
          </string-name>
          , Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and
          <string-name>
            <given-names>Zbigniew</given-names>
            <surname>Wojna</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Rethinking the Inception Architecture for Computer Vision</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).</source>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Philip</given-names>
            <surname>Tovstogan</surname>
          </string-name>
          , Dmitry Bogdanov, and
          <string-name>
            <given-names>Alastair</given-names>
            <surname>Porter</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>MediaEval 2021: Emotion and Theme Recognition in Music Using Jamendo</article-title>
          .
          <source>In Proc. of the MediaEval 2021 Workshop</source>
          , Online, 13-15 December
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Minz</given-names>
            <surname>Won</surname>
          </string-name>
          , Andres Ferraro, Dmitry Bogdanov, and
          <string-name>
            <given-names>Xavier</given-names>
            <surname>Serra</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Evaluation of CNN-based Automatic Music Tagging Models</article-title>
          .
          <source>In 17th Sound and Music Computing Conference (SMC2020).</source>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Hongyi</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Moustapha Cisse,
          <string-name>
            <given-names>Yann N.</given-names>
            <surname>Dauphin</surname>
          </string-name>
          , and David Lopez-Paz.
          <year>2018</year>
          .
          <article-title>mixup: Beyond Empirical Risk Minimization</article-title>
          .
          <source>In International Conference on Learning Representations</source>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>