Emotion and Theme Recognition in Music with Frequency-Aware RF-Regularized CNNs

Khaled Koutini, Shreyan Chowdhury, Verena Haunschmid, Hamid Eghbal-zadeh, Gerhard Widmer
Johannes Kepler University Linz
firstname.lastname@jku.at

Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'19, 27-29 October 2019, Sophia Antipolis, France.

ABSTRACT
We present the CP-JKU submission to MediaEval 2019: a Receptive-Field-(RF-)regularized and Frequency-Aware CNN approach for tagging music with emotion/mood labels. We investigate the impact of the RF of the CNNs on their performance on this dataset. We observe that ResNets with smaller receptive fields, originally adapted for acoustic scene classification, also perform well in the emotion tagging task. We further improve the performance of such architectures using techniques such as Frequency Awareness and Shake-Shake regularization, which were used in previous work on general acoustic recognition tasks.

1 INTRODUCTION
Content-based emotion recognition in music is a challenging task, in part because of noisy datasets and the unavailability of royalty-free audio of consistent quality. The recently released MTG-Jamendo dataset [2] aims to address these issues.

The Emotion and Theme Recognition Task of MediaEval 2019 uses a subset of this dataset containing relevant emotion tags; the task objective is to predict scores and decisions for these tags from audio (or spectrograms). The details of this data subset, the task description, data splits, and evaluation strategy can be found in the overview paper [1].

Convolutional Neural Networks (CNNs) achieve state-of-the-art results in many tasks such as image classification [8, 10], acoustic scene classification [4, 16] and audio tagging [5]. These models learn their own features and classifiers in an end-to-end fashion, which reduces the need for task-specific feature engineering. Although CNNs are capable of learning high-level concepts from simple, low-level input, the careful design of the network architecture remains a crucial step in achieving good results.

In a recent study [14, 16], Koutini et al. showed that the receptive field (RF) of a CNN architecture is a very important factor when processing audio signals. Based on these findings, a regularization technique was proposed that can significantly boost the performance of CNNs when used with spectrogram features. Further, [17] highlights a drawback of CNNs in the audio domain caused by the lack of spatial ordering in convolutional layers; as a solution, Frequency-Aware (FA) convolutional layers were introduced, to be used in CNNs with the commonly used spectrogram input.

The proposed RF-regularization and FA-CNNs have shown great promise in several tasks in the field of Computational Auditory Scene Analysis (CASA), and achieved top ranks in international challenges [16]. In this report, we extend the previous work to Music Information Retrieval (MIR) and demonstrate that these models can be used to recognize emotion in music, achieving new state-of-the-art results.

2 SETUP

2.1 Data Preparation
We use a sampling rate of 44.1 kHz to extract the input features. We apply a Short-Time Fourier Transform (STFT) with a window size of 2048 samples and an overlap between windows of 75% for submissions 1, 2 and 3, and 25% for submissions 4 and 5. We use perceptually weighted Mel-scaled spectrograms similar to [4, 14, 16], resulting in an input with 256 Mel bins in the frequency dimension.
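As a rough illustration of this feature extraction, the following is a minimal sketch assuming a librosa-based pipeline; the exact perceptual weighting and normalization of our submissions follow [4, 14, 16], and the function and parameter names below are illustrative rather than taken from the actual submission code.

```python
# Sketch of the input feature extraction described above (librosa-based, assumed).
import librosa
import numpy as np

def extract_features(audio_path, hop_ratio=0.25):
    """Perceptually weighted Mel spectrogram with 256 Mel bins.

    hop_ratio = 0.25 gives 75% window overlap (submissions 1-3),
    hop_ratio = 0.75 gives 25% window overlap (submissions 4 and 5).
    """
    n_fft = 2048
    hop_length = int(n_fft * hop_ratio)
    y, sr = librosa.load(audio_path, sr=44100)                 # 44.1 kHz input
    spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length)) ** 2
    mel = librosa.feature.melspectrogram(S=spec, sr=sr, n_mels=256)
    # Perceptual weighting on the dB scale, relative to the maximum power
    mel_db = librosa.perceptual_weighting(
        mel, librosa.mel_frequencies(n_mels=256, fmax=sr / 2), ref=np.max)
    return mel_db.astype(np.float32)
```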
2.2 Optimization
In a setup similar to [14, 16, 17], we use Adam [13] for 200 epochs. We start with a 10-epoch learning-rate warm-up, then train with a constant learning rate of 1 × 10^-4 for 60 epochs. After that, we use a linear learning rate scheduler for 50 epochs, dropping the learning rate to 1 × 10^-6. We finally train for 80 more epochs using this final learning rate.
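For clarity, the following sketch expresses this schedule as a per-epoch multiplier for a PyTorch LambdaLR scheduler; this is an assumed formulation (including a linear ramp during warm-up), not the actual training code, which is available in the repository linked in Section 3.

```python
# Sketch of the learning-rate schedule described above (assumed PyTorch setup).
import torch

def lr_lambda(epoch, warmup=10, constant=60, decay=50, base_lr=1e-4, final_lr=1e-6):
    """Factor applied to base_lr at a given epoch (200 epochs in total)."""
    if epoch < warmup:                                    # warm-up (assumed linear ramp)
        return (epoch + 1) / warmup
    if epoch < warmup + constant:                         # constant phase at base_lr
        return 1.0
    if epoch < warmup + constant + decay:                 # linear drop to final_lr
        t = (epoch - warmup - constant) / decay
        return 1.0 + t * (final_lr / base_lr - 1.0)
    return final_lr / base_lr                             # remaining 80 epochs at final_lr

model = torch.nn.Linear(10, 2)                            # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
for epoch in range(200):
    # ... one training epoch ...
    scheduler.step()
```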
2.3 Data Augmentation
Mix-up [21] proved essential in our experiments for boosting the performance and generalization of our models. These results are consistent with experience from our previous work [14, 16, 17].
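For reference, below is a minimal sketch of mix-up as it is typically applied to a batch of spectrograms with multi-hot targets; the mixing hyperparameter alpha is illustrative and not a value reported in this paper.

```python
# Minimal mix-up sketch in the sense of [21] (assumed PyTorch formulation).
import numpy as np
import torch

def mixup(x, y, alpha=0.3):
    """Blend a batch of spectrograms x and multi-hot targets y with a random partner."""
    lam = np.random.beta(alpha, alpha)       # mixing coefficient from a Beta distribution
    perm = torch.randperm(x.size(0))         # random pairing within the batch
    x_mixed = lam * x + (1.0 - lam) * x[perm]
    y_mixed = lam * y + (1.0 - lam) * y[perm]
    return x_mixed, y_mixed
```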
3 ADAPTING CNNS
Convolutional Neural Networks (CNNs) have shown great success in many acoustic tasks [4-6, 9, 11, 14-20]. In our submissions, we build on this success and investigate their performance on tasks more specific to music. We mainly use adapted versions of ResNet [8]. We adapt the architectures to the task using the guidelines proposed by Koutini et al. [14]; the source code is published at https://github.com/kkoutini/cpjku_dcase19. We use the CNN variants introduced in [17].

3.1 Receptive Field Regularization
Limiting the receptive field (RF) has been shown to have a great impact on the performance of a CNN in a number of acoustic recognition and detection tasks [14, 16]. We investigated the influence of the receptive field on this task in a setup similar to [14].

Figure 1 shows the PR-AUC on both the validation (val) and testing (test) sets for ResNet models with different receptive fields and their SWA variants (see Section 3.4 below). The results show that a larger receptive field causes performance drops, in accordance with the findings of [14]. Moreover, further experiments showed that the size of the receptive field over the time dimension has less influence on performance.

[Figure 1: PR-AUC for ResNets with different RFs]

3.2 Frequency-Awareness and FA-ResNet
Figure 1 shows that smaller-RF ResNets perform better. As shown in [17], Frequency-Awareness can compensate for the loss of frequency information caused by the smaller RF. We use the Frequency-Aware ResNet (FA-ResNet) introduced in [17].
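The following sketch illustrates the idea of a frequency-aware convolution in the spirit of [17]: a channel encoding the normalized frequency (row) position is concatenated to the input before a standard 2D convolution. The class and parameter names are illustrative; the exact formulation used in our submissions follows [17].

```python
# Sketch of a Frequency-Aware 2D convolution (assumed formulation, after [17]).
import torch
import torch.nn as nn

class FreqAwareConv2d(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, **kwargs):
        super().__init__()
        # one extra input channel carries the frequency-coordinate map
        self.conv = nn.Conv2d(in_channels + 1, out_channels, kernel_size, **kwargs)

    def forward(self, x):                        # x: (batch, channels, freq, time)
        b, _, f, t = x.shape
        freq_coord = torch.linspace(-1.0, 1.0, f, device=x.device)
        freq_coord = freq_coord.view(1, 1, f, 1).expand(b, 1, f, t)
        return self.conv(torch.cat([x, freq_coord], dim=1))
```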
3.3 Shake-Shake Regularization
Shake-Shake regularization [7] was proposed to improve stability and robustness. As shown in [16] and [17], although Shake-Shake ResNets do not perform well on the original acoustic scene classification problem, they performed very well on this task.

3.4 Model Averaging
Stochastic Weight Averaging: Similar to [16, 17], we use Stochastic Weight Averaging (SWA) [12]. We add the network weights to the average every 3 epochs. The averaged networks turned out to outperform each of the single networks.

Snapshot Averaging: When computing the final prediction, we also average the predictions of 5 snapshots of the networks taken during training. Specifically, we average the predictions of the model with the highest PR-AUC on the validation set with those of the last 4 SWA models obtained during training.

Multi-model Averaging: We average different models that have different architectures, initializations and/or receptive fields over time.

4 SUBMISSIONS AND RESULTS

4.1 Submitted Models
Overall, we submitted five models to the challenge: the first three are variations of the approach described above; the other two were models tested during our experiments and were submitted as additional baselines against which to compare our modified CNNs.

ShakeFAResNet: We average the predictions of 5 Shake-Shake-regularized FA-ResNets with different initializations. Their frequency RF is regularized as explained in Section 3.1; however, they have different RFs over the time dimension.

FAResNet: Similar to ShakeFAResNet, but without Shake-Shake regularization.

Avg_ensemble: We average the predictions of all models included in both ShakeFAResNet and FAResNet. In addition, we add an RF-regularized ResNet and DenseNet as introduced in [14].

ResNet34: In our preliminary experiments, a vanilla ResNet-34 outperformed ResNet-18 and ResNet-50 on the validation set, so we picked this architecture as an additional baseline.

CRNN: The CRNN was motivated by the notion that the global structure of musical features could affect the perception of certain aspects of music (such as mood), as noted by Choi et al. [3]. We use an architecture similar to theirs, in which the CNN part acts as the feature extractor and the RNN part acts as a temporal aggregator. This approach improved performance over the baseline CNN and the ResNet-34.

CP_ResNet (not submitted to the challenge): We also report the results of a single RF-regularized ResNet model.

4.2 Results
Table 1 shows the results of our submitted systems and compares them with the baselines. Our RF-regularized and Frequency-Aware CNNs outperform the baselines by a significant margin, making them the top 3 submissions in the challenge. The systems marked with a star (*) are ensembles of multiple models and snapshots (Section 3.4). Table 1 also shows that a single RF-regularized ResNet (CP_ResNet) performs very well compared to the baselines.

Table 1: PR-AUC results

Submission           Validation PR-AUC    Testing PR-AUC
ShakeFAResNet*       .1132                .1480
FAResNet*            .1149                .1463
Avg_ensemble*        .1189                .1546
ResNet34             .0924                .1021
CRNN                 .0924                .1172
CP_ResNet            .1097                .1325
VGG-ish baseline     -                    .1077
popular baseline     -                    .0319

*: indicates an ensemble.

ACKNOWLEDGMENTS
This work has been supported by the LCM – K2 Center within the framework of the Austrian COMET-K2 program, and by the European Research Council (ERC) under the EU's Horizon 2020 research and innovation programme, grant agreement No. 670035 (project "Con Espressione").

REFERENCES
[1] Dmitry Bogdanov, Alastair Porter, Philip Tovstogan, and Minz Won. 2019. MediaEval 2019: Emotion and Theme Recognition in Music Using Jamendo. In MediaEval Benchmark Workshop.
[2] Dmitry Bogdanov, Minz Won, Philip Tovstogan, Alastair Porter, and Xavier Serra. 2019. The MTG-Jamendo Dataset for Automatic Music Tagging.
[3] Keunwoo Choi, György Fazekas, Mark Sandler, and Kyunghyun Cho. 2017. Convolutional Recurrent Neural Networks for Music Classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2392-2396.
[4] Matthias Dorfer, Bernhard Lehner, Hamid Eghbal-zadeh, Christoph Heindl, Fabian Paischer, and Gerhard Widmer. 2018. Acoustic Scene Classification with Fully Convolutional Neural Networks and I-Vectors. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Challenge (DCASE2018).
[5] Matthias Dorfer and Gerhard Widmer. 2018. Training General-Purpose Audio Tagging Networks with Noisy Labels and Iterative Self-Verification. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018). 178-182.
[6] Hamid Eghbal-zadeh, Bernhard Lehner, Matthias Dorfer, and Gerhard Widmer. 2016. CP-JKU Submissions for DCASE-2016: A Hybrid Approach Using Binaural I-Vectors and Deep Convolutional Neural Networks. Technical Report. DCASE2016 Challenge.
[7] Xavier Gastaldi. 2017. Shake-Shake Regularization. arXiv preprint arXiv:1705.07485.
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770-778.
[9] S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney, R. J. Weiss, and K. Wilson. 2017. CNN Architectures for Large-Scale Audio Classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 131-135. https://doi.org/10.1109/ICASSP.2017.7952132
[10] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. 2017. Densely Connected Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4700-4708.
[11] Turab Iqbal, Qiuqiang Kong, Mark Plumbley, and Wenwu Wang. 2018. Stacked Convolutional Neural Networks for General-Purpose Audio Tagging. Technical Report. DCASE2018 Challenge.
[12] Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. 2018. Averaging Weights Leads to Wider Optima and Better Generalization. arXiv preprint arXiv:1803.05407.
[13] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
[14] Khaled Koutini, Hamid Eghbal-zadeh, Matthias Dorfer, and Gerhard Widmer. 2019. The Receptive Field as a Regularizer in Deep Convolutional Neural Networks for Acoustic Scene Classification. In Proceedings of the European Signal Processing Conference (EUSIPCO). A Coruña, Spain.
[15] Khaled Koutini, Hamid Eghbal-zadeh, and Gerhard Widmer. 2018. Iterative Knowledge Distillation in R-CNNs for Weakly-Labeled Semi-Supervised Sound Event Detection. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018). 173-177.
[16] Khaled Koutini, Hamid Eghbal-zadeh, and Gerhard Widmer. 2019. CP-JKU Submissions to DCASE'19: Acoustic Scene Classification and Audio Tagging with Receptive-Field-Regularized CNNs. Technical Report. DCASE2019 Challenge.
[17] Khaled Koutini, Hamid Eghbal-zadeh, and Gerhard Widmer. 2019. Receptive-Field-Regularized CNN Variants for Acoustic Scene Classification. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019).
[18] Donmoon Lee, Subin Lee, Yoonchang Han, and Kyogu Lee. 2017. Ensemble of Convolutional Neural Networks for Weakly-Supervised Sound Event Detection Using Multiple Scale Input. Technical Report. DCASE2017 Challenge.
[19] Bernhard Lehner, Hamid Eghbal-zadeh, Matthias Dorfer, Filip Korzeniowski, Khaled Koutini, and Gerhard Widmer. 2017. Classifying Short Acoustic Scenes with I-Vectors and CNNs: Challenges and Optimisations for the 2017 DCASE ASC Task. Technical Report. DCASE2017 Challenge.
[20] Yuma Sakashita and Masaki Aono. 2018. Acoustic Scene Classification by Ensemble of Spectrograms Based on Adaptive Temporal Divisions. Technical Report. DCASE2018 Challenge.
[21] Hongyi Zhang, Moustapha Cissé, Yann N. Dauphin, and David Lopez-Paz. 2018. mixup: Beyond Empirical Risk Minimization. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings.