Music theme recognition using CNN and self-attention
Manoj Sukhavasi, Sainath Adapa
manoj.sukhavasi1@gmail.com, adapasainath@gmail.com

MediaEval'19, 27-29 October 2019, Sophia Antipolis, France
Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
We present an efficient architecture to detect moods and themes in music tracks on the autotagging-moodtheme subset of the MTG-Jamendo dataset. Our approach consists of two blocks: a CNN block based on the MobileNetV2 architecture, and a self-attention block from the Transformer architecture to capture long-term temporal characteristics. We show that our proposed model produces a significant improvement over the baseline model. Our model (team name: AMLAG) achieves 4th place on the PR-AUC-macro leaderboard in MediaEval 2019: Emotion and Theme Recognition in Music Using Jamendo.

1 INTRODUCTION
Automatic music tagging is a multi-label classification task to predict the tags corresponding to the audio content of a track. Tagging music with themes (action, documentary) and moods (sad, upbeat) can be useful in music discovery and recommendation. MediaEval 2019: Emotion and Theme Recognition in Music Using Jamendo aims to improve machine learning algorithms that automatically recognize the emotions and themes conveyed in a music recording [3]. The task involves predicting the moods and themes conveyed by a music track, given the raw audio, on the autotagging-moodtheme subset of the MTG-Jamendo dataset [4]. The overview paper [3] describes the task in more detail and also introduces a baseline solution based on VGG-ish features. In this paper, we describe our fourth-place submission on the PR-AUC-macro leaderboard (https://multimediaeval.github.io/2019-Emotion-and-Theme-Recognition-in-Music-Task/), which improves significantly on the baseline solution.

2 RELATED WORK
Conventionally, feature extraction from audio relied on signal processing to compute relevant features from time- or frequency-domain representations. As an alternative, architectures based on Convolutional Neural Networks (CNNs) [6] have become more popular recently, following their success in computer vision and speech processing. Extensions to CNNs, such as CRNNs [7], have also been proposed to capture long-term temporal information. Recently, [20] showed that self-attention applied to music tagging captures temporal information; that model was based on the Transformer architecture, which has been very successful in Natural Language Processing (NLP) [19]. In this paper, we propose two methods, MobileNetV2 and MobileNetV2 with self-attention, which are based mainly on these two previous works [1, 20].

3 APPROACH
We used the pre-computed Mel-spectrograms made available by the organizers of the challenge (https://github.com/MTG/mtg-jamendo-dataset). No additional pre-processing steps were undertaken other than normalization of the input Mel-spectrogram features.

As image-based data augmentation techniques have been shown to be effective in audio tagging [1, 2], we used transformations such as random crop and random scale. Additionally, we employed SpecAugment and mixup. SpecAugment [14], proposed initially for speech recognition, masks blocks of frequency channels or time steps of a log Mel-spectrogram. Mixup [22] samples two training examples randomly and linearly mixes them (both the features and the labels).

We propose two methods: a MobileNetV2 architecture, and a MobileNetV2 architecture combined with a self-attention block to capture long-term temporal characteristics. We describe both methods in detail below.

3.1 MobileNetV2
It has been shown previously that using pre-trained ImageNet models helps in audio tagging [1, 13]. Hence, we employed MobileNetV2 [17] for the current task. Since Mel-spectrograms are single-channel, the input data is transformed into a three-channel tensor by passing it through two convolution layers. This tensor is then sent to the MobileNetV2 unit. As the number of labels differs here, the linear layer at the very end is replaced. No other modifications were made to the original MobileNetV2 architecture.
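A minimal PyTorch sketch of this setup is given below. The kernel sizes of the channel-expansion convolutions and the number of output tags (NUM_TAGS) are illustrative placeholders rather than the exact configuration; the published code should be consulted for the precise details.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

NUM_TAGS = 56  # placeholder: set to the number of mood/theme tags in the subset

class MobileNetV2Tagger(nn.Module):
    def __init__(self, num_tags: int = NUM_TAGS):
        super().__init__()
        # Two convolution layers lift the single-channel Mel-spectrogram to
        # three channels (kernel sizes here are illustrative assumptions).
        self.to_three_channels = nn.Sequential(
            nn.Conv2d(1, 3, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(3, 3, kernel_size=3, padding=1),
        )
        # ImageNet-pretrained MobileNetV2 with its final linear layer replaced
        # to match the number of tags; no other modifications.
        self.backbone = mobilenet_v2(pretrained=True)
        in_features = self.backbone.classifier[1].in_features
        self.backbone.classifier[1] = nn.Linear(in_features, num_tags)

    def forward(self, mel):  # mel: (batch, 1, n_mels, n_frames)
        return self.backbone(self.to_three_channels(mel))

# Example: a batch of two 96-band Mel-spectrogram excerpts of length 256.
logits = MobileNetV2Tagger()(torch.randn(2, 1, 96, 256))  # -> (2, NUM_TAGS)
```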
3.2 MobileNetV2 with Self-attention
The architecture described in Section 3.1 might not be able to capture long-term temporal characteristics. The dataset consists of tracks of varying lengths, with a majority longer than 200 s. Self-attention has been shown to capture long-range temporal characteristics in the context of music tagging [20], and hence a self-attention mechanism can be helpful for the current task. In this section, we describe our extended MobileNetV2 architecture with self-attention.

The architecture consists of two main blocks: a modified MobileNetV2 (identical to the architecture described in [1]) to capture frequency-time characteristics, and a self-attention block to capture long-term temporal characteristics.

Similar to the Transformer model [19], multi-head self-attention with positional encoding was implemented for the current architecture. Since our task consists only of classification, we use only the encoder part, similar to BERT [9]. Our implementation is based on the architecture described in [20]. We use 4 attention heads and 2 attention layers. The input sequence length is 16, and the embedding size is 256.

The control flow within this architecture is as follows:
• The input is a Mel-spectrogram tensor of length 4096 (with 96 bands). This input is divided length-wise into 16 segments, each of length 256.
• Each of the 16 segments is sent through the modified MobileNetV2 block to extract features.
• The feature maps are then fed into the self-attention block. At the end of this block, two dense layers generate the predictions.
• Additionally, the feature maps from the MobileNetV2 block are used to generate per-segment predictions. All sixteen predictions are averaged to obtain the final prediction.

As described above, the architecture generates two predictions: one solely using the MobileNetV2 block, and the other using the MobileNetV2 and self-attention blocks. While training, the combined loss from both predictions is used for back-propagation.
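A minimal PyTorch sketch of this control flow is shown below. The segment encoder interface, the learned positional encoding, the prediction heads, and the mean-pooling over encoder outputs are simplifying assumptions; the paper fixes only the 16 segments of length 256, the 4 attention heads, the 2 attention layers, and the embedding size of 256, and the published code is the authoritative reference.

```python
import torch
import torch.nn as nn

class MobileNetV2SelfAttention(nn.Module):
    def __init__(self, segment_encoder: nn.Module, num_tags: int,
                 num_segments: int = 16, embed_dim: int = 256):
        super().__init__()
        # segment_encoder: maps one (batch, 1, 96, 256) segment to a
        # (batch, embed_dim) feature vector, e.g. a modified MobileNetV2.
        self.segment_encoder = segment_encoder
        # Learned positional encoding over the 16 segment positions (assumption).
        self.pos_encoding = nn.Parameter(torch.zeros(1, num_segments, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Two dense layers after the self-attention block.
        self.attention_head = nn.Sequential(
            nn.Linear(embed_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, num_tags))
        # Auxiliary head applied to each segment's CNN features.
        self.segment_head = nn.Linear(embed_dim, num_tags)

    def forward(self, mel):                       # mel: (batch, 1, 96, 4096)
        segments = mel.chunk(16, dim=-1)          # 16 segments of length 256
        feats = torch.stack([self.segment_encoder(s) for s in segments], dim=1)
        encoded = self.encoder(feats + self.pos_encoding)         # (batch, 16, 256)
        pred_attention = self.attention_head(encoded.mean(dim=1))
        pred_segments = self.segment_head(feats).mean(dim=1)      # average of 16 predictions
        # During training, the losses from both predictions are combined.
        return pred_attention, pred_segments
```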
4 TRAINING AND RESULTS
We made two submissions under the team name AMLAG (code: https://github.com/sainathadapa/mediaeval-2019-moodtheme-detection), one for each of the two architectures described in Sections 3.1 and 3.2. Both submissions use the same Mel-spectrogram inputs and binary cross-entropy loss as the optimization objective. PyTorch [15] was used for training the model in both cases.

For submission 1, the AMSGrad variant of the Adam algorithm [12, 16] with a learning rate of 1e-3 was used for optimization. Whenever the overall loss on the validation set stopped improving for five epochs, the learning rate was reduced by a factor of 10. For this training, we used input Mel-spectrograms of length 6590, with padding to make all inputs a constant length. We observed that not all classes benefited from being trained together (see Figure 1). Hence, following the approach taken in [5], early stopping was done separately for each class, based on the loss value for that particular class. Additionally, an attempt was made to find subsets of classes that train well together, but the overall performance was lower than when all classes were trained jointly. This remains an avenue for future research with this dataset.

Figure 1: Trend in loss values (loss vs. epoch) for three sample classes (calm, documentary, epic) while training the MobileNetV2 model. The plot illustrates that not all classes benefit from joint training: the loss for the epic class decreases, while the loss for calm increases and the documentary loss is almost stagnant.

To prepare submission 2, we used input Mel-spectrograms of length 4096, again padded to a constant length. We trained the model for 120 epochs, using Adam as the initial optimizer. We then employed an optimization technique proposed in [10, 20]: the optimizer is switched from Adam to stochastic gradient descent (with Nesterov momentum [18]) after 60 epochs, for better generalization of the model. Early stopping was done jointly for all classes, based on the macro-averaged ROC-AUC on the validation set.

We present the results for both submissions in Table 1. Results from the baseline approach, which uses a VGG-ish architecture, are shown for comparison. On all metrics, MobileNetV2 with a self-attention block improves over MobileNetV2 alone. With respect to the baseline model, submission 2 is an improvement on all but the micro-averaged F-score and precision metrics. On the task leaderboard, our model achieved 4th position on PR-AUC-macro and 5th position on F-score-macro.

Metric             Baseline (VGG-ish)   Submission 1   Submission 2
PR-AUC-macro       0.107734             0.118306       0.125896
ROC-AUC-macro      0.725821             0.732416       0.752886
F-score-macro      0.165694             0.151891       0.182957
Precision-macro    0.138216             0.135673       0.145545
Recall-macro       0.30865              0.306015       0.39164
PR-AUC-micro       0.140913             0.150605       0.151706
ROC-AUC-micro      0.775029             0.784128       0.797624
F-score-micro      0.177133             0.152349       0.164375
Precision-micro    0.116097             0.098133       0.10135
Recall-micro       0.37348              0.340428       0.434691
Table 1: Performance on the test dataset
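The sketch below illustrates the submission-2 optimization schedule described above, assuming a model with the two prediction branches from Section 3.2. The learning rates, momentum value, and the choice of branch used for validation scoring are illustrative assumptions, not the exact values used.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.metrics import roc_auc_score

def train_submission2(model, train_loader, val_loader, device="cuda"):
    # Binary cross-entropy over the tag logits (BCEWithLogitsLoss folds the
    # sigmoid into the loss for numerical stability).
    criterion = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # lr is a placeholder
    best_auc = 0.0
    for epoch in range(120):
        if epoch == 60:
            # Switch from Adam to SGD with Nesterov momentum for better
            # generalization [10, 20]; lr and momentum are placeholders.
            optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                                        momentum=0.9, nesterov=True)
        model.train()
        for mel, labels in train_loader:          # labels: multi-hot tag vectors
            mel, labels = mel.to(device), labels.to(device)
            pred_attention, pred_segments = model(mel)
            # Combined loss from both prediction branches.
            loss = criterion(pred_attention, labels) + criterion(pred_segments, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # Early stopping signal: macro-averaged ROC-AUC on the validation set,
        # computed here from the self-attention branch (an assumption).
        model.eval()
        targets, scores = [], []
        with torch.no_grad():
            for mel, labels in val_loader:
                pred_attention, _ = model(mel.to(device))
                targets.append(labels.numpy())
                scores.append(torch.sigmoid(pred_attention).cpu().numpy())
        val_auc = roc_auc_score(np.concatenate(targets), np.concatenate(scores),
                                average="macro")
        if val_auc > best_auc:
            best_auc = val_auc
            torch.save(model.state_dict(), "best_model.pt")
```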
5 OTHER APPROACHES
Some approaches that we tried, but that did not yield better performance, are listed below:
• A dense-layer architecture that uses OpenL3 embeddings [8].
• A dense-layer architecture that uses the pre-computed statistical features from Essentia, obtained with the feature extractor for AcousticBrainz. This data was made available by the organizers, along with the raw audio and Mel-spectrogram data.
• A CNN architecture that directly uses the raw audio representation, as described in [11].
• Similar to the use of MobileNetV2 in Section 3.1, we tested another ImageNet-pretrained architecture, the ResNeXt model [21].

REFERENCES
[1] Sainath Adapa. 2019. Urban Sound Tagging using Convolutional Neural Networks. arXiv preprint arXiv:1909.12699 (2019).
[2] Ruslan Baikulov. 2019. Argus solution Freesound Audio Tagging 2019. https://github.com/lRomul/argus-freesound Accessed: 2019-10-01.
[3] Dmitry Bogdanov, Alastair Porter, Philip Tovstogan, and Minz Won. 2019. MediaEval 2019: Emotion and Theme Recognition in Music Using Jamendo. In 2019 Working Notes Proceedings of the MediaEval Workshop, MediaEval 2019. 1-3.
[4] Dmitry Bogdanov, Minz Won, Philip Tovstogan, Alastair Porter, and Xavier Serra. 2019. The MTG-Jamendo Dataset for Automatic Music Tagging. In Machine Learning for Music Discovery Workshop, International Conference on Machine Learning (ICML 2019). Long Beach, CA, United States. http://hdl.handle.net/10230/42015
[5] Rich Caruana. 1998. A dozen tricks with multitask learning. In Neural Networks: Tricks of the Trade. Springer, 165-191.
[6] Keunwoo Choi, George Fazekas, and Mark Sandler. 2016. Automatic tagging using deep convolutional neural networks. arXiv preprint arXiv:1606.00298 (2016).
[7] Keunwoo Choi, György Fazekas, Mark B. Sandler, and Kyunghyun Cho. 2016. Convolutional Recurrent Neural Networks for Music Classification. arXiv preprint arXiv:1609.04243 (2016).
[8] Jason Cramer, Ho-Hsiang Wu, Justin Salamon, and Juan Pablo Bello. 2019. Look, Listen, and Learn More: Design Choices for Deep Audio Embeddings. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 3852-3856.
[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805 (2018).
[10] Nitish Shirish Keskar and Richard Socher. 2017. Improving Generalization Performance by Switching from Adam to SGD. arXiv preprint arXiv:1712.07628 (2017).
[11] Taejun Kim, Jongpil Lee, and Juhan Nam. 2019. Comparison and Analysis of SampleCNN Architectures for Audio Classification. IEEE Journal of Selected Topics in Signal Processing 13, 2 (2019), 285-297.
[12] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980 (2014).
[13] Mario Lasseck. 2018. Acoustic Bird Detection with Deep Convolutional Neural Networks. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop. Tampere University of Technology.
[14] Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le. 2019. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. arXiv preprint arXiv:1904.08779 (2019).
[15] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. (2017).
[16] Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. 2019. On the Convergence of Adam and Beyond. arXiv preprint arXiv:1904.09237 (2019).
[17] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4510-4520.
[18] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. 2013. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning. 1139-1147.
[19] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. arXiv preprint arXiv:1706.03762 (2017).
[20] Minz Won, Sanghyuk Chun, and Xavier Serra. 2019. Toward Interpretable Music Tagging with Self-Attention. arXiv preprint arXiv:1906.04972 (2019).
[21] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. 2017. Aggregated Residual Transformations for Deep Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1492-1500.
[22] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. 2017. mixup: Beyond Empirical Risk Minimization. arXiv preprint arXiv:1710.09412 (2017).