Emotion and Theme Recognition in Music with Frequency-Aware RF-Regularized CNNs

Khaled Koutini, Shreyan Chowdhury, Verena Haunschmid, Hamid Eghbal-zadeh, Gerhard Widmer
Johannes Kepler University Linz
firstname.lastname@jku.at

Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'19, 27-29 October 2019, Sophia Antipolis, France.

ABSTRACT
We present the CP-JKU submission to MediaEval 2019: a Receptive-Field-(RF-)regularized and Frequency-Aware CNN approach for tagging music with emotion/mood labels. We investigate the impact of the RF of the CNNs on their performance on this dataset. We observe that ResNets with smaller receptive fields, originally adapted for acoustic scene classification, also perform well in the emotion tagging task. We further improve the performance of such architectures using techniques such as Frequency Awareness and Shake-Shake regularization, which were used in previous work on general acoustic recognition tasks.

1 INTRODUCTION
Content-based emotion recognition in music is a challenging task, in part because of noisy datasets and the unavailability of royalty-free audio of consistent quality. The recently released MTG-Jamendo dataset [2] aims to address these issues.

The Emotion and Theme Recognition Task of MediaEval 2019 uses a subset of this dataset containing relevant emotion tags; the task objective is to predict scores and decisions for these tags from audio (or spectrograms). The details of this data subset, the task description, data splits, and evaluation strategy can be found in the overview paper [1].

Convolutional Neural Networks (CNNs) achieve state-of-the-art results in many tasks such as image classification [8, 10], acoustic scene classification [4, 16] and audio tagging [5]. These models learn their own features and classifiers in an end-to-end fashion, which reduces the need for task-specific feature engineering. Although CNNs are capable of learning high-level concepts from simple, low-level input, the careful design of the network architecture remains a crucial step in achieving good results.

In a recent study [14, 16], Koutini et al. showed that the receptive field (RF) of a CNN architecture is a very important factor when processing audio signals. Based on these findings, a regularization technique was proposed that can significantly boost the performance of CNNs when used with spectrogram features. Further, [17] highlights a drawback of CNNs in the audio domain caused by the lack of spatial ordering in convolutional layers; as a solution, Frequency-Aware (FA) convolutional layers were introduced, to be used in CNNs with the commonly used spectrogram input.

The proposed RF-regularization and FA-CNNs have shown great promise in several tasks in the field of Computational Auditory Scene Analysis (CASA), and achieved top ranks in international challenges [16]. In this report, we extend the previous work to Music Information Retrieval (MIR) and demonstrate that these models can be used to recognize emotion in music, achieving new state-of-the-art results.

2 SETUP

2.1 Data Preparation
We use a sampling rate of 44.1 kHz to extract the input features. We apply a Short-Time Fourier Transform (STFT) with a window size of 2048 samples and an overlap between windows of 75% for submissions 1, 2 and 3, and 25% for submissions 4 and 5. We use perceptually weighted Mel-scaled spectrograms similar to [4, 14, 16], resulting in an input with 256 Mel bins in the frequency dimension.
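As a rough illustration of this feature extraction, the following is a minimal sketch assuming a librosa-based pipeline; the exact perceptual weighting and normalization of our submissions follow [4, 14, 16], and the function and parameter names below are illustrative rather than taken from the actual submission code.

```python
# Sketch of the input feature extraction described above (librosa-based, assumed).
import librosa
import numpy as np

def extract_features(audio_path, hop_ratio=0.25):
    """Perceptually weighted Mel spectrogram with 256 Mel bins.

    hop_ratio = 0.25 gives 75% window overlap (submissions 1-3),
    hop_ratio = 0.75 gives 25% window overlap (submissions 4 and 5).
    """
    n_fft = 2048
    hop_length = int(n_fft * hop_ratio)
    y, sr = librosa.load(audio_path, sr=44100)                 # 44.1 kHz input
    spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length)) ** 2
    mel = librosa.feature.melspectrogram(S=spec, sr=sr, n_mels=256)
    # Perceptual weighting on the dB scale, relative to the maximum power
    mel_db = librosa.perceptual_weighting(
        mel, librosa.mel_frequencies(n_mels=256, fmax=sr / 2), ref=np.max)
    return mel_db.astype(np.float32)
```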
2.2 Optimization
In a setup similar to [14, 16, 17], we use Adam [13] for 200 epochs. We start with a 10-epoch learning-rate warm-up, then train with a constant learning rate of 1 × 10^-4 for 60 epochs. After that, we use a linear learning rate scheduler for 50 epochs, dropping the learning rate to 1 × 10^-6. We finally train for 80 more epochs using this final learning rate.
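For clarity, the following sketch expresses this schedule as a per-epoch multiplier for a PyTorch LambdaLR scheduler; this is an assumed formulation (including a linear ramp during warm-up), not the actual training code, which is available in the repository linked in Section 3.

```python
# Sketch of the learning-rate schedule described above (assumed PyTorch setup).
import torch

def lr_lambda(epoch, warmup=10, constant=60, decay=50, base_lr=1e-4, final_lr=1e-6):
    """Factor applied to base_lr at a given epoch (200 epochs in total)."""
    if epoch < warmup:                                    # warm-up (assumed linear ramp)
        return (epoch + 1) / warmup
    if epoch < warmup + constant:                         # constant phase at base_lr
        return 1.0
    if epoch < warmup + constant + decay:                 # linear drop to final_lr
        t = (epoch - warmup - constant) / decay
        return 1.0 + t * (final_lr / base_lr - 1.0)
    return final_lr / base_lr                             # remaining 80 epochs at final_lr

model = torch.nn.Linear(10, 2)                            # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
for epoch in range(200):
    # ... one training epoch ...
    scheduler.step()
```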
2.3 Data Augmentation
Mix-up [21] proved essential in our experiments for boosting the performance and generalization of our models. These results are consistent with experience from our previous work [14, 16, 17].
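For reference, below is a minimal sketch of mix-up as it is typically applied to a batch of spectrograms with multi-hot targets; the mixing hyperparameter alpha is illustrative and not a value reported in this paper.

```python
# Minimal mix-up sketch in the sense of [21] (assumed PyTorch formulation).
import numpy as np
import torch

def mixup(x, y, alpha=0.3):
    """Blend a batch of spectrograms x and multi-hot targets y with a random partner."""
    lam = np.random.beta(alpha, alpha)       # mixing coefficient from a Beta distribution
    perm = torch.randperm(x.size(0))         # random pairing within the batch
    x_mixed = lam * x + (1.0 - lam) * x[perm]
    y_mixed = lam * y + (1.0 - lam) * y[perm]
    return x_mixed, y_mixed
```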
3 ADAPTING CNNS
Convolutional Neural Networks (CNNs) have shown great success in many acoustic tasks [4-6, 9, 11, 14-20]. In our submissions, we build on this success and investigate their performance on tasks more specific to music. We mainly use adapted versions of ResNet [8]. We adapt the architectures to the task using the guidelines proposed by Koutini et al. [14]; the source code is published at https://github.com/kkoutini/cpjku_dcase19. We use the CNN variants introduced in [17].

3.1 Receptive Field Regularization
Limiting the receptive field (RF) has been shown to have a great impact on the performance of a CNN in a number of acoustic recognition and detection tasks [14, 16]. We investigated the influence of the receptive field on this task in a setup similar to [14].

Figure 1 shows the PR-AUC on both the validation (val) and testing (test) sets for ResNet models with different receptive fields and their SWA variants (see Section 3.4 below). The results show that a larger receptive field causes performance drops, in accordance with the findings of [14]. Moreover, further experiments showed that the size of the receptive field over the time dimension has less influence on performance.

[Figure 1: PR-AUC for ResNets with different RFs]

3.2 Frequency-Awareness and FA-ResNet
Figure 1 shows that smaller-RF ResNets perform better. As shown in [17], Frequency-Awareness can compensate for the loss of frequency information caused by the smaller RF. We use the Frequency-Aware ResNet (FA-ResNet) introduced in [17].
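The following sketch illustrates the idea of a frequency-aware convolution in the spirit of [17]: a channel encoding the normalized frequency (row) position is concatenated to the input before a standard 2D convolution. The class and parameter names are illustrative; the exact formulation used in our submissions follows [17].

```python
# Sketch of a Frequency-Aware 2D convolution (assumed formulation, after [17]).
import torch
import torch.nn as nn

class FreqAwareConv2d(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, **kwargs):
        super().__init__()
        # one extra input channel carries the frequency-coordinate map
        self.conv = nn.Conv2d(in_channels + 1, out_channels, kernel_size, **kwargs)

    def forward(self, x):                        # x: (batch, channels, freq, time)
        b, _, f, t = x.shape
        freq_coord = torch.linspace(-1.0, 1.0, f, device=x.device)
        freq_coord = freq_coord.view(1, 1, f, 1).expand(b, 1, f, t)
        return self.conv(torch.cat([x, freq_coord], dim=1))
```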
3.3 Shake-Shake Regularization
Shake-Shake regularization [7] was proposed to improve stability and robustness. As shown in [16] and [17], although Shake-Shake ResNets do not perform well on the original acoustic scene classification problem, they performed very well on this task.

3.4 Model Averaging
Stochastic Weight Averaging: Similar to [16, 17], we use Stochastic Weight Averaging (SWA) [12]. We add the network weights to the average every 3 epochs. The averaged networks turned out to outperform each of the single networks.

Snapshot Averaging: When computing the final prediction, we also average the predictions of 5 snapshots of the networks taken during training. Specifically, we average the predictions of the model with the highest PR-AUC on the validation set with those of the last 4 SWA models obtained during training.

Multi-model Averaging: We average different models that have different architectures, initializations and/or receptive fields over time.

4 SUBMISSIONS AND RESULTS

4.1 Submitted Models
Overall, we submitted five models to the challenge: the first three are variations of the approach described above; the other two were models tested during our experiments and were submitted as additional baselines against which to compare our modified CNNs.

ShakeFAResNet: We average the predictions of 5 Shake-Shake-regularized FA-ResNets with different initializations. Their frequency RF is regularized as explained in Section 3.1; however, they have different RFs over the time dimension.

FAResNet: Similar to ShakeFAResNet, but without Shake-Shake regularization.

Avg_ensemble: We average the predictions of all models included in both ShakeFAResNet and FAResNet. In addition, we add an RF-regularized ResNet and DenseNet as introduced in [14].

ResNet34: In our preliminary experiments, a vanilla ResNet-34 outperformed ResNet-18 and ResNet-50 on the validation set, so we picked this architecture as an additional baseline.

CRNN: The CRNN was motivated by the notion that the global structure of musical features could affect the perception of certain aspects of music (such as mood), as noted by Choi et al. [3]. We use an architecture similar to theirs, in which the CNN part acts as the feature extractor and the RNN part acts as a temporal aggregator. This approach improved performance over the baseline CNN and the ResNet-34.

CP_ResNet (not submitted to the challenge): We also report the results of a single RF-regularized ResNet model.

4.2 Results
Table 1 shows the results of our submitted systems and compares them with the baselines. Our RF-regularized and Frequency-Aware CNNs outperform the baselines by a significant margin, making them the top 3 submissions in the challenge. The systems marked with a star (*) are ensembles of multiple models and snapshots (Section 3.4). Table 1 also shows that a single RF-regularized ResNet (CP_ResNet) performs very well compared to the baselines.

Table 1: PR-AUC results

Submission           Validation PR-AUC    Testing PR-AUC
ShakeFAResNet*       .1132                .1480
FAResNet*            .1149                .1463
Avg_ensemble*        .1189                .1546
ResNet34             .0924                .1021
CRNN                 .0924                .1172
CP_ResNet            .1097                .1325
VGG-ish baseline     -                    .1077
popular baseline     -                    .0319

*: indicates an ensemble.

ACKNOWLEDGMENTS
This work has been supported by the LCM – K2 Center within the framework of the Austrian COMET-K2 program, and by the European Research Council (ERC) under the EU's Horizon 2020 research and innovation programme, grant agreement No. 670035 (project "Con Espressione").

REFERENCES
[1] Dmitry Bogdanov, Alastair Porter, Philip Tovstogan, and Minz Won. 2019. MediaEval 2019: Emotion and Theme Recognition in Music Using Jamendo. In MediaEval Benchmark Workshop.
[2] Dmitry Bogdanov, Minz Won, Philip Tovstogan, Alastair Porter, and Xavier Serra. 2019. The MTG-Jamendo Dataset for Automatic Music Tagging.
[3] Keunwoo Choi, György Fazekas, Mark Sandler, and Kyunghyun Cho. 2017. Convolutional Recurrent Neural Networks for Music Classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2392-2396.
[4] Matthias Dorfer, Bernhard Lehner, Hamid Eghbal-zadeh, Christoph Heindl, Fabian Paischer, and Gerhard Widmer. 2018. Acoustic Scene Classification with Fully Convolutional Neural Networks and I-Vectors. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Challenge (DCASE2018).
[5] Matthias Dorfer and Gerhard Widmer. 2018. Training General-Purpose Audio Tagging Networks with Noisy Labels and Iterative Self-Verification. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018). 178-182.
[6] Hamid Eghbal-zadeh, Bernhard Lehner, Matthias Dorfer, and Gerhard Widmer. 2016. CP-JKU Submissions for DCASE-2016: A Hybrid Approach Using Binaural I-Vectors and Deep Convolutional Neural Networks. Technical Report. DCASE2016 Challenge.
[7] Xavier Gastaldi. 2017. Shake-Shake Regularization. arXiv preprint arXiv:1705.07485.
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770-778.
[9] S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney, R. J. Weiss, and K. Wilson. 2017. CNN Architectures for Large-Scale Audio Classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 131-135. https://doi.org/10.1109/ICASSP.2017.7952132
[10] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. 2017. Densely Connected Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4700-4708.
[11] Turab Iqbal, Qiuqiang Kong, Mark Plumbley, and Wenwu Wang. 2018. Stacked Convolutional Neural Networks for General-Purpose Audio Tagging. Technical Report. DCASE2018 Challenge.
[12] Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. 2018. Averaging Weights Leads to Wider Optima and Better Generalization. arXiv preprint arXiv:1803.05407.
[13] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
[14] Khaled Koutini, Hamid Eghbal-zadeh, Matthias Dorfer, and Gerhard Widmer. 2019. The Receptive Field as a Regularizer in Deep Convolutional Neural Networks for Acoustic Scene Classification. In Proceedings of the European Signal Processing Conference (EUSIPCO). A Coruña, Spain.
[15] Khaled Koutini, Hamid Eghbal-zadeh, and Gerhard Widmer. 2018. Iterative Knowledge Distillation in R-CNNs for Weakly-Labeled Semi-Supervised Sound Event Detection. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018). 173-177.
[16] Khaled Koutini, Hamid Eghbal-zadeh, and Gerhard Widmer. 2019. CP-JKU Submissions to DCASE'19: Acoustic Scene Classification and Audio Tagging with Receptive-Field-Regularized CNNs. Technical Report. DCASE2019 Challenge.
[17] Khaled Koutini, Hamid Eghbal-zadeh, and Gerhard Widmer. 2019. Receptive-Field-Regularized CNN Variants for Acoustic Scene Classification. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019).
[18] Donmoon Lee, Subin Lee, Yoonchang Han, and Kyogu Lee. 2017. Ensemble of Convolutional Neural Networks for Weakly-Supervised Sound Event Detection Using Multiple Scale Input. Technical Report. DCASE2017 Challenge.
[19] Bernhard Lehner, Hamid Eghbal-zadeh, Matthias Dorfer, Filip Korzeniowski, Khaled Koutini, and Gerhard Widmer. 2017. Classifying Short Acoustic Scenes with I-Vectors and CNNs: Challenges and Optimisations for the 2017 DCASE ASC Task. Technical Report. DCASE2017 Challenge.
[20] Yuma Sakashita and Masaki Aono. 2018. Acoustic Scene Classification by Ensemble of Spectrograms Based on Adaptive Temporal Divisions. Technical Report. DCASE2018 Challenge.
[21] Hongyi Zhang, Moustapha Cissé, Yann N. Dauphin, and David Lopez-Paz. 2018. mixup: Beyond Empirical Risk Minimization. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings.