Music theme recognition using CNN and self-attention
Manoj Sukhavasi, Sainath Adapa
manoj.sukhavasi1@gmail.com, adapasainath@gmail.com

MediaEval'19, 27-29 October 2019, Sophia Antipolis, France
Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
We present an efficient architecture to detect moods and themes in music tracks on the autotagging-moodtheme subset of the MTG-Jamendo dataset. Our approach consists of two blocks: a CNN block based on the MobileNetV2 architecture, and a self-attention block from the Transformer architecture to capture long-term temporal characteristics. We show that our proposed model produces a significant improvement over the baseline model. Our model (team name: AMLAG) achieves 4th place on the PR-AUC-macro leaderboard in MediaEval 2019: Emotion and Theme Recognition in Music Using Jamendo.

1 INTRODUCTION
Automatic music tagging is a multi-label classification task to predict the tags corresponding to the audio content of a track. Tagging music with themes (action, documentary) and moods (sad, upbeat) can be useful in music discovery and recommendation. MediaEval 2019: Emotion and Theme Recognition in Music Using Jamendo aims to improve machine learning algorithms that automatically recognize the emotions and themes conveyed in a music recording [3]. The task involves predicting the moods and themes conveyed by a music track, given the raw audio, on the autotagging-moodtheme subset of the MTG-Jamendo dataset [4]. The overview paper [3] describes the task in more detail and also introduces a baseline solution based on VGG-ish features. In this paper, we describe our fourth-place submission on the PR-AUC-macro leaderboard (https://multimediaeval.github.io/2019-Emotion-and-Theme-Recognition-in-Music-Task/), which improves significantly on the baseline solution.

2 RELATED WORK
Conventionally, feature extraction from audio relied on signal processing to compute relevant features from time- or frequency-domain representations. As an alternative, architectures based on Convolutional Neural Networks (CNNs) [6] have become more popular recently, following their success in computer vision and speech processing. Extensions to CNNs, such as CRNNs [7], have also been proposed to capture long-term temporal information. Recently, [20] showed that self-attention applied to music tagging captures temporal information; that model was based on the Transformer architecture, which has been very successful in Natural Language Processing (NLP) [19]. In this paper, we propose two methods, MobileNetV2 and MobileNetV2 with self-attention, which are based mainly on these two previous works [1, 20].

3 APPROACH
We used the pre-computed Mel-spectrograms made available by the organizers of the challenge (https://github.com/MTG/mtg-jamendo-dataset). No additional pre-processing steps were undertaken other than normalization of the input Mel-spectrogram features.

As image-based data augmentation techniques have been shown to be effective in audio tagging [1, 2], we used transformations such as random crop and random scale. Additionally, we employed SpecAugment and mixup. SpecAugment [14], proposed initially for speech recognition, masks blocks of frequency channels or time steps of a log Mel-spectrogram. Mixup [22] samples two training examples randomly and linearly mixes them (both the features and the labels).

We propose two methods: a MobileNetV2 architecture, and a MobileNetV2 architecture combined with a self-attention block to capture long-term temporal characteristics. We describe both methods in detail below.

3.1 MobileNetV2
It has been shown previously that using pre-trained ImageNet models helps in audio tagging [1, 13]. Hence, we employed MobileNetV2 [17] for the current task. Since Mel-spectrograms are single-channel, the input data is transformed into a three-channel tensor by passing it through two convolution layers. This tensor is then sent to the MobileNetV2 unit. As the number of labels differs here, the linear layer at the very end is replaced. No other modifications were made to the original MobileNetV2 architecture.
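A minimal PyTorch sketch of this setup is given below. The kernel sizes of the channel-expansion convolutions and the number of output tags (NUM_TAGS) are illustrative placeholders rather than the exact configuration; the published code should be consulted for the precise details.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

NUM_TAGS = 56  # placeholder: set to the number of mood/theme tags in the subset

class MobileNetV2Tagger(nn.Module):
    def __init__(self, num_tags: int = NUM_TAGS):
        super().__init__()
        # Two convolution layers lift the single-channel Mel-spectrogram to
        # three channels (kernel sizes here are illustrative assumptions).
        self.to_three_channels = nn.Sequential(
            nn.Conv2d(1, 3, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(3, 3, kernel_size=3, padding=1),
        )
        # ImageNet-pretrained MobileNetV2 with its final linear layer replaced
        # to match the number of tags; no other modifications.
        self.backbone = mobilenet_v2(pretrained=True)
        in_features = self.backbone.classifier[1].in_features
        self.backbone.classifier[1] = nn.Linear(in_features, num_tags)

    def forward(self, mel):  # mel: (batch, 1, n_mels, n_frames)
        return self.backbone(self.to_three_channels(mel))

# Example: a batch of two 96-band Mel-spectrogram excerpts of length 256.
logits = MobileNetV2Tagger()(torch.randn(2, 1, 96, 256))  # -> (2, NUM_TAGS)
```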
3.2 MobileNetV2 with Self-attention
The architecture described in Section 3.1 might not be able to capture long-term temporal characteristics. The dataset consists of tracks of varying lengths, with a majority longer than 200 s. Self-attention has been shown to capture long-range temporal characteristics in the context of music tagging [20], and hence a self-attention mechanism can be helpful for the current task. In this section, we describe our extended MobileNetV2 architecture with self-attention.

The architecture consists of two main blocks: a modified MobileNetV2 (identical to the architecture described in [1]) to capture frequency-time characteristics, and a self-attention block to capture long-term temporal characteristics.

Similar to the Transformer model [19], multi-head self-attention with positional encoding was implemented for the current architecture. Since our task consists only of classification, we use only the encoder part, similar to BERT [9]. Our implementation is based on the architecture described in [20]. We use 4 attention heads and 2 attention layers. The input sequence length is 16, and the embedding size is 256.

The control flow within this architecture is as follows:
• The input is a Mel-spectrogram tensor of length 4096 (with 96 bands). This input is divided length-wise into 16 segments, each of length 256.
• Each of the 16 segments is sent through the modified MobileNetV2 block to extract features.
• The feature maps are then fed into the self-attention block. At the end of this block, two dense layers generate the predictions.
• Additionally, the feature maps from the MobileNetV2 block are used to generate per-segment predictions. All sixteen predictions are averaged to obtain the final prediction.

As described above, the architecture generates two predictions: one solely using the MobileNetV2 block, and the other using the MobileNetV2 and self-attention blocks. While training, the combined loss from both predictions is used for back-propagation.
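A minimal PyTorch sketch of this control flow is shown below. The segment encoder interface, the learned positional encoding, the prediction heads, and the mean-pooling over encoder outputs are simplifying assumptions; the paper fixes only the 16 segments of length 256, the 4 attention heads, the 2 attention layers, and the embedding size of 256, and the published code is the authoritative reference.

```python
import torch
import torch.nn as nn

class MobileNetV2SelfAttention(nn.Module):
    def __init__(self, segment_encoder: nn.Module, num_tags: int,
                 num_segments: int = 16, embed_dim: int = 256):
        super().__init__()
        # segment_encoder: maps one (batch, 1, 96, 256) segment to a
        # (batch, embed_dim) feature vector, e.g. a modified MobileNetV2.
        self.segment_encoder = segment_encoder
        # Learned positional encoding over the 16 segment positions (assumption).
        self.pos_encoding = nn.Parameter(torch.zeros(1, num_segments, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Two dense layers after the self-attention block.
        self.attention_head = nn.Sequential(
            nn.Linear(embed_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, num_tags))
        # Auxiliary head applied to each segment's CNN features.
        self.segment_head = nn.Linear(embed_dim, num_tags)

    def forward(self, mel):                       # mel: (batch, 1, 96, 4096)
        segments = mel.chunk(16, dim=-1)          # 16 segments of length 256
        feats = torch.stack([self.segment_encoder(s) for s in segments], dim=1)
        encoded = self.encoder(feats + self.pos_encoding)         # (batch, 16, 256)
        pred_attention = self.attention_head(encoded.mean(dim=1))
        pred_segments = self.segment_head(feats).mean(dim=1)      # average of 16 predictions
        # During training, the losses from both predictions are combined.
        return pred_attention, pred_segments
```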
4 TRAINING AND RESULTS
We made two submissions under the team name AMLAG (code: https://github.com/sainathadapa/mediaeval-2019-moodtheme-detection), one for each of the two architectures described in Sections 3.1 and 3.2. Both submissions use the same Mel-spectrogram inputs and binary cross-entropy loss as the optimization objective. PyTorch [15] was used for training the model in both cases.

For submission 1, the AMSGrad variant of the Adam algorithm [12, 16] with a learning rate of 1e-3 was used for optimization. Whenever the overall loss on the validation set stopped improving for five epochs, the learning rate was reduced by a factor of 10. For this training, we used input Mel-spectrograms of length 6590, with padding to make all inputs a constant length. We observed that not all classes benefited from being trained together (see Figure 1). Hence, following the approach taken in [5], early stopping was done separately for each class, based on the loss value for that particular class. Additionally, an attempt was made to find subsets of classes that train well together, but the overall performance was lower than when all classes were trained jointly. This remains an avenue for future research with this dataset.

Figure 1: Trend in loss values (loss vs. epoch) for three sample classes (calm, documentary, epic) while training the MobileNetV2 model. The plot illustrates that not all classes benefit from joint training: the loss for the epic class decreases, while the loss for calm increases and the documentary loss is almost stagnant.

To prepare submission 2, we used input Mel-spectrograms of length 4096, again padded to a constant length. We trained the model for 120 epochs, using Adam as the initial optimizer. We then employed an optimization technique proposed in [10, 20]: the optimizer is switched from Adam to stochastic gradient descent (with Nesterov momentum [18]) after 60 epochs, for better generalization of the model. Early stopping was done jointly for all classes, based on the macro-averaged ROC-AUC on the validation set.

We present the results for both submissions in Table 1. Results from the baseline approach, which uses a VGG-ish architecture, are shown for comparison. On all metrics, MobileNetV2 with a self-attention block improves over MobileNetV2 alone. With respect to the baseline model, submission 2 is an improvement on all but the micro-averaged F-score and precision metrics. On the task leaderboard, our model achieved 4th position on PR-AUC-macro and 5th position on F-score-macro.

Metric             Baseline (VGG-ish)   Submission 1   Submission 2
PR-AUC-macro       0.107734             0.118306       0.125896
ROC-AUC-macro      0.725821             0.732416       0.752886
F-score-macro      0.165694             0.151891       0.182957
Precision-macro    0.138216             0.135673       0.145545
Recall-macro       0.30865              0.306015       0.39164
PR-AUC-micro       0.140913             0.150605       0.151706
ROC-AUC-micro      0.775029             0.784128       0.797624
F-score-micro      0.177133             0.152349       0.164375
Precision-micro    0.116097             0.098133       0.10135
Recall-micro       0.37348              0.340428       0.434691
Table 1: Performance on the test dataset
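The sketch below illustrates the submission-2 optimization schedule described above, assuming a model with the two prediction branches from Section 3.2. The learning rates, momentum value, and the choice of branch used for validation scoring are illustrative assumptions, not the exact values used.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.metrics import roc_auc_score

def train_submission2(model, train_loader, val_loader, device="cuda"):
    # Binary cross-entropy over the tag logits (BCEWithLogitsLoss folds the
    # sigmoid into the loss for numerical stability).
    criterion = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # lr is a placeholder
    best_auc = 0.0
    for epoch in range(120):
        if epoch == 60:
            # Switch from Adam to SGD with Nesterov momentum for better
            # generalization [10, 20]; lr and momentum are placeholders.
            optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                                        momentum=0.9, nesterov=True)
        model.train()
        for mel, labels in train_loader:          # labels: multi-hot tag vectors
            mel, labels = mel.to(device), labels.to(device)
            pred_attention, pred_segments = model(mel)
            # Combined loss from both prediction branches.
            loss = criterion(pred_attention, labels) + criterion(pred_segments, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # Early stopping signal: macro-averaged ROC-AUC on the validation set,
        # computed here from the self-attention branch (an assumption).
        model.eval()
        targets, scores = [], []
        with torch.no_grad():
            for mel, labels in val_loader:
                pred_attention, _ = model(mel.to(device))
                targets.append(labels.numpy())
                scores.append(torch.sigmoid(pred_attention).cpu().numpy())
        val_auc = roc_auc_score(np.concatenate(targets), np.concatenate(scores),
                                average="macro")
        if val_auc > best_auc:
            best_auc = val_auc
            torch.save(model.state_dict(), "best_model.pt")
```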
5 OTHER APPROACHES
Some approaches that we tried, but that did not yield better performance, are listed below:
• A dense-layer architecture that uses OpenL3 embeddings [8].
• A dense-layer architecture that uses the pre-computed statistical features from Essentia, obtained with the feature extractor for AcousticBrainz. This data was made available by the organizers, along with the raw audio and Mel-spectrogram data.
• A CNN architecture that directly uses the raw audio representation, as described in [11].
• Similar to the use of MobileNetV2 in Section 3.1, we tested another ImageNet-pretrained architecture, the ResNeXt model [21].

REFERENCES
[1] Sainath Adapa. 2019. Urban Sound Tagging using Convolutional Neural Networks. arXiv preprint arXiv:1909.12699 (2019).
[2] Ruslan Baikulov. 2019. Argus solution Freesound Audio Tagging 2019. https://github.com/lRomul/argus-freesound Accessed: 2019-10-01.
[3] Dmitry Bogdanov, Alastair Porter, Philip Tovstogan, and Minz Won. 2019. MediaEval 2019: Emotion and Theme Recognition in Music Using Jamendo. In 2019 Working Notes Proceedings of the MediaEval Workshop, MediaEval 2019. 1-3.
[4] Dmitry Bogdanov, Minz Won, Philip Tovstogan, Alastair Porter, and Xavier Serra. 2019. The MTG-Jamendo Dataset for Automatic Music Tagging. In Machine Learning for Music Discovery Workshop, International Conference on Machine Learning (ICML 2019). Long Beach, CA, United States. http://hdl.handle.net/10230/42015
[5] Rich Caruana. 1998. A dozen tricks with multitask learning. In Neural Networks: Tricks of the Trade. Springer, 165-191.
[6] Keunwoo Choi, George Fazekas, and Mark Sandler. 2016. Automatic tagging using deep convolutional neural networks. arXiv preprint arXiv:1606.00298 (2016).
[7] Keunwoo Choi, György Fazekas, Mark B. Sandler, and Kyunghyun Cho. 2016. Convolutional Recurrent Neural Networks for Music Classification. arXiv preprint arXiv:1609.04243 (2016).
[8] Jason Cramer, Ho-Hsiang Wu, Justin Salamon, and Juan Pablo Bello. 2019. Look, Listen, and Learn More: Design Choices for Deep Audio Embeddings. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 3852-3856.
[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805 (2018).
[10] Nitish Shirish Keskar and Richard Socher. 2017. Improving Generalization Performance by Switching from Adam to SGD. arXiv preprint arXiv:1712.07628 (2017).
[11] Taejun Kim, Jongpil Lee, and Juhan Nam. 2019. Comparison and Analysis of SampleCNN Architectures for Audio Classification. IEEE Journal of Selected Topics in Signal Processing 13, 2 (2019), 285-297.
[12] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980 (2014).
[13] Mario Lasseck. 2018. Acoustic Bird Detection with Deep Convolutional Neural Networks. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop. Tampere University of Technology.
[14] Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le. 2019. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. arXiv preprint arXiv:1904.08779 (2019).
[15] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. (2017).
[16] Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. 2019. On the Convergence of Adam and Beyond. arXiv preprint arXiv:1904.09237 (2019).
[17] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4510-4520.
[18] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. 2013. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning. 1139-1147.
[19] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. arXiv preprint arXiv:1706.03762 (2017).
[20] Minz Won, Sanghyuk Chun, and Xavier Serra. 2019. Toward Interpretable Music Tagging with Self-Attention. arXiv preprint arXiv:1906.04972 (2019).
[21] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. 2017. Aggregated Residual Transformations for Deep Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1492-1500.
[22] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. 2017. mixup: Beyond Empirical Risk Minimization. arXiv preprint arXiv:1710.09412 (2017).