Emotion and Themes Recognition in Music Utilising Convolutional and Recurrent Neural Networks

Shahin Amiriparian1, Maurice Gerczuk1, Eduardo Coutinho2, Alice Baird1, Sandra Ottl1, Manuel Milling1, Björn Schuller1,3
1 ZD.B. Chair of Embedded Intelligence for Health Care & Wellbeing, University of Augsburg, Germany
2 Applied Music Research Lab, Department of Music, University of Liverpool, U. K.
3 GLAM – Group on Language, Audio & Music, Imperial College London, U. K.
amiriparian@ieee.org

Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
Emotion is an inherent aspect of music, and associations to music can be made via both life experience and specific musical techniques applied by the composer. Computational approaches for music recognition have been well-established in the research community; however, deep approaches have been limited and not yet comparable to conventional approaches. In this study, we present our fusion system of end-to-end convolutional recurrent neural networks (CRNN) and pre-trained convolutional feature extractors for music emotion and theme recognition (source code: https://github.com/amirip/AugLi-MediaEval). We train 9 models and conduct various late fusion experiments. Our best performing model (team name: AugLi) achieves 74.2 % ROC-AUC on the test partition, which is 1.6 percentage points above the baseline system of the MediaEval 2019 Emotion & Themes in Music task.

1 INTRODUCTION
The ability of music to express and induce emotions is a well-known and demonstrable fact [21]. It communicates and induces similar emotional states in all listeners because musical parameters (e. g., rhythm, melody, timbre, dynamics) encode affective information that is implicitly decoded by listeners [14, 18]. Furthermore, both music psychologists and computer scientists have provided plenty of evidence that listeners construe emotional meaning by attending to structural aspects of the acoustic signal at various levels [10, 13, 22]. Recent deep learning solutions demonstrate the suitability of recurrent neural networks (RNNs), autoencoders, and convolutional neural networks (CNNs) for the task of audio-based music emotion recognition (MER) [17, 23, 25]. In [12], we have utilised denoising autoencoders and a transfer learning approach for time-continuous predictions of emotion in music and speech. Furthermore, we have conducted both psychological and computational experiments aimed at clarifying the role of music structure in the expression and induction of musical emotions [11, 15]. In this paper, we introduce our end-to-end architecture for the task of emotion and theme recognition in music at MediaEval 2019 [7].

2 APPROACH
Our framework – which is motivated by our previous works with CRNNs [1, 5] – is depicted in Figure 1. It consists of two models whose predictions are fused to obtain the final predictions. These models capture both shift-invariant, high-level features (convolutional block) and long(er)-term temporal context (recurrent block) from the musical inputs [7, 8]. The MTG-Jamendo dataset [8] includes 18 486 audio tracks with 56 distinct mood and theme annotations/tags. All audio files have at least one tag. The dataset provides 60-20-20 % splits for training, validation, and testing. For a full description of the challenge data, please refer to [8].

[Figure 1: An overview of our system composed of two CRNN blocks. For a detailed account of the framework, refer to Section 2.]

2.1 Convolutional Recurrent Neural Network
The CRNN system (upper part of Figure 1) consists of a vgg-ish model (trained on the AudioSet dataset [19]) with the final global average pooling layer replaced by an RNN. Specifically, we add 2 recurrent layers with 256 units (we tried 128, 256, and 512 units) and a dropout [27] of 0.3 (out of [0.2, 0.3, 0.4]) for each layer, followed by a 1 024-unit dense layer, batch normalisation [20], ReLU activation [24], and a dropout of 0.3. Tagging is performed by a 56-unit dense layer with sigmoid activation. We initialise the convolutional feature extractor with the official SoundNet trained weights [6]. Sequences of log Mel spectrograms are generated using the kapre keras library [9]: the input is resampled to 16 kHz, and 64 Mel filters and an FFT window of 512 samples with a hop size of 256 are used. During training, we sample a random 20 s chunk of every song and apply random Gaussian noise with a maximum power of 0.2. For evaluation, we use the centre 20 s chunk of each song.

We apply the RMSprop optimiser [28] and train the network with a batch size of 32. We first train only the top RNN and tagging layers for 20 epochs with a learning rate of 0.001, keeping the weights of the pre-trained vgg-ish frozen. We then unfreeze the feature extraction layers and resume training from the best checkpoint – measured in validation Area Under the Receiver Operating Characteristic curve (ROC-AUC) – with a reduced learning rate of 0.0001 for another 80 epochs. Finally, the best overall model is restored and evaluated on the test partition.
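For illustration, the CRNN described above can be sketched in Keras roughly as follows. This is a minimal sketch rather than the released implementation (see the repository linked in the abstract): the small convolutional stack merely stands in for the pre-trained vgg-ish extractor, the log Mel spectrogram is assumed to be computed outside the model (the paper computes it on the fly with kapre), and names such as build_crnn, N_MELS, N_FRAMES, and N_TAGS are chosen for illustration.

```python
# Minimal Keras sketch of the CRNN head from Section 2.1 (an assumption-laden
# illustration, not the authors' released code).
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers

N_MELS = 64      # Mel bands (Section 2.1)
N_FRAMES = 1251  # roughly 20 s at 16 kHz with a hop size of 256 samples
N_TAGS = 56      # one sigmoid output per mood/theme tag

def build_crnn(rnn_cell=layers.GRU):  # layers.LSTM for the LSTM variant
    spec = layers.Input(shape=(N_FRAMES, N_MELS, 1), name="log_mel")

    # Placeholder for the vgg-ish feature extractor (conv + pooling blocks);
    # in the paper this part is initialised with pre-trained weights.
    x = spec
    for n_filters in (64, 128, 256):
        x = layers.Conv2D(n_filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(pool_size=(2, 2))(x)

    # Collapse the frequency axis so the time axis can feed the RNN,
    # replacing the global average pooling of the original network.
    x = layers.Reshape((-1, x.shape[2] * x.shape[3]))(x)

    # Two recurrent layers with 256 units and a dropout of 0.3 each.
    x = rnn_cell(256, return_sequences=True, dropout=0.3)(x)
    x = rnn_cell(256, dropout=0.3)(x)

    # 1 024-unit dense layer, batch normalisation, ReLU, and dropout.
    x = layers.Dense(1024)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.Dropout(0.3)(x)

    # Multi-label tagging layer with sigmoid activation.
    out = layers.Dense(N_TAGS, activation="sigmoid", name="tags")(x)

    model = models.Model(spec, out)
    model.compile(optimizer=optimizers.RMSprop(learning_rate=1e-3),
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC(name="roc_auc")])
    return model
```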
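The two-stage training schedule could then be implemented along the following lines, again as a hedged Keras sketch: model is assumed to be a compiled CRNN as described above, n_conv_layers counts its convolutional layers, and train_ds/val_ds are placeholder data pipelines yielding batches of (log Mel chunk, tag vector) pairs.

```python
# Hedged sketch of the two-stage fine-tuning schedule from Section 2.1.
# `train_ds`/`val_ds` are assumed to be batched with a batch size of 32.
import tensorflow as tf

def train_two_stage(model, train_ds, val_ds, n_conv_layers):
    ckpt = tf.keras.callbacks.ModelCheckpoint(
        "best_crnn.weights.h5", monitor="val_roc_auc", mode="max",
        save_best_only=True, save_weights_only=True)

    # Stage 1: freeze the pre-trained feature extractor and train only the
    # recurrent and tagging layers for 20 epochs at a learning rate of 0.001.
    for layer in model.layers[:n_conv_layers]:
        layer.trainable = False
    model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-3),
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC(name="roc_auc")])
    model.fit(train_ds, validation_data=val_ds, epochs=20, callbacks=[ckpt])

    # Stage 2: unfreeze the feature extractor and resume from the best
    # checkpoint with the learning rate reduced to 0.0001 for 80 more epochs.
    model.load_weights("best_crnn.weights.h5")
    for layer in model.layers:
        layer.trainable = True
    model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-4),
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC(name="roc_auc")])
    model.fit(train_ds, validation_data=val_ds, epochs=80, callbacks=[ckpt])

    # Restore the overall best model (by validation ROC-AUC) before testing.
    model.load_weights("best_crnn.weights.h5")
    return model
```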
2.2 Utilising pre-trained CNNs
The second model (see bottom part of Figure 1) uses our Deep Spectrum system (https://github.com/DeepSpectrum/DeepSpectrum) [3] to extract pre-trained CNN features from Mel spectrograms (128 Mel filters) of the songs; such features have been shown to outperform engineered feature sets on a variety of acoustic tasks [2–4]. We use an ImageNet [16] pre-trained VGG16 architecture [26] and forward plots of 1 and 5 second audio chunks through the network. The activations of the penultimate layer then form our feature vectors. We extract these features for the first 30 seconds (the minimum song duration in the dataset [8]) of each song and use them as sequenced input for training RNNs.

For both feature types, three RNN architectures are trained which differ in the choice of recurrent cells, as with the CRNN. We chose an architecture with 2 recurrent layers of 1 024 units each, followed by a dense layer with the same number of units before the final densely connected prediction layer. Batch normalisation is applied after each of the recurrent layers and after the penultimate dense layer, and a dropout of 0.4 is applied to the activations of the hidden layers. We train the models using RMSprop with a learning rate of 0.001 and a batch size of 32 for a maximum of 1 000 epochs, but perform early stopping if the validation ROC-AUC does not increase for over 50 epochs; in practice, none of our models was trained for more than 200 epochs. As for the CRNN, we restore the best model checkpoint before evaluating on the test partition.
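A rough sketch of this feature extraction step, using librosa and matplotlib to approximate what the Deep Spectrum toolkit does, is given below; the 16 kHz sample rate, the viridis colour map, and the helper names spectrogram_plot and chunk_features are illustrative assumptions rather than details taken from the toolkit.

```python
# Deep-Spectrum-style feature extraction sketch: render Mel spectrogram plots
# of short audio chunks and take the penultimate-layer (fc2) activations of an
# ImageNet pre-trained VGG16 as 4 096-dimensional feature vectors.
import numpy as np
import librosa
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input

vgg = VGG16(weights="imagenet", include_top=True)
extractor = tf.keras.Model(vgg.input, vgg.get_layer("fc2").output)

def spectrogram_plot(chunk, sr):
    """Render a 128-band Mel spectrogram of one chunk as a 224x224 RGB image."""
    mel = librosa.feature.melspectrogram(y=chunk, sr=sr, n_mels=128)
    log_mel = librosa.power_to_db(mel)
    fig = plt.figure(figsize=(1, 1), dpi=224)   # 224x224 pixel canvas
    ax = fig.add_axes([0, 0, 1, 1])
    ax.axis("off")
    ax.imshow(log_mel, origin="lower", aspect="auto", cmap="viridis")
    fig.canvas.draw()
    img = np.asarray(fig.canvas.buffer_rgba())[..., :3].astype(np.float32)
    plt.close(fig)
    return img

def chunk_features(path, chunk_s=1.0, sr=16000, duration=30.0):
    """Feature sequence for the first 30 s of a song, one vector per chunk."""
    y, sr = librosa.load(path, sr=sr, duration=duration)
    hop = int(chunk_s * sr)
    plots = [spectrogram_plot(y[i:i + hop], sr)
             for i in range(0, len(y) - hop + 1, hop)]
    batch = preprocess_input(np.stack(plots))
    return extractor.predict(batch)   # shape: (n_chunks, 4096)
```

Setting chunk_s to 1.0 or 5.0 yields the 1 s and 5 s feature windows used in the experiments.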
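The RNN trained on these feature sequences can be sketched as follows (for the bidirectional variants, the recurrent layers would additionally be wrapped in layers.Bidirectional); FEATURE_DIM, N_TAGS, and build_ds_rnn are again illustrative names rather than identifiers from the released code.

```python
# Minimal Keras sketch of the RNN on Deep Spectrum feature sequences
# (Section 2.2): two 1 024-unit recurrent layers, batch normalisation and
# dropout of 0.4, a 1 024-unit dense layer, and a sigmoid tagging layer.
import tensorflow as tf
from tensorflow.keras import layers, models

FEATURE_DIM = 4096   # penultimate-layer activations of VGG16
N_TAGS = 56

def build_ds_rnn(rnn_cell=layers.LSTM):
    seq = layers.Input(shape=(None, FEATURE_DIM))   # sequence of chunk features

    x = rnn_cell(1024, return_sequences=True)(seq)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.4)(x)
    x = rnn_cell(1024)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.4)(x)

    # Penultimate dense layer of the same size, then the tagging layer.
    x = layers.Dense(1024, activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.4)(x)
    out = layers.Dense(N_TAGS, activation="sigmoid")(x)

    model = models.Model(seq, out)
    model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-3),
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC(name="roc_auc")])
    return model

# Early stopping on validation ROC-AUC with a patience of 50 epochs,
# restoring the best weights before evaluation on the test partition.
callbacks = [tf.keras.callbacks.EarlyStopping(
    monitor="val_roc_auc", mode="max", patience=50,
    restore_best_weights=True)]
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=1000, batch_size=32, callbacks=callbacks)
```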
2.3 Fusion Experiments
To explore further potential performance improvements, we conduct model fusion experiments by averaging the prediction scores returned by our networks for the test partition. From these scores, we generate the corresponding tag decisions with the official challenge script. In total, we evaluate five different fusion scenarios: fusion of all systems, fusion of all Deep Spectrum systems, fusion of all CRNN systems, and fusion of the Deep Spectrum systems trained on 1 s and on 5 s feature windows, respectively.
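Since the fusion is a plain average of per-tag prediction scores, it amounts to little more than the following sketch; the .npy storage format and the file names are assumptions made for illustration.

```python
# Late fusion by score averaging (Section 2.3): per-tag prediction scores of
# several models are averaged element-wise over the model axis.
import numpy as np

def fuse_scores(score_files):
    # Each file is assumed to hold an array of shape (n_test_tracks, n_tags).
    scores = [np.load(path) for path in score_files]
    return np.mean(scores, axis=0)

# Example: fuse all nine systems (3 CRNN + 6 Deep Spectrum + RNN variants),
# e.g. fuse_scores(["crnn_gru.npy", "crnn_lstm.npy", "ds1_blstm.npy"]);
# tag decisions are then derived with the official challenge script.
```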
Table 1: Performance of our proposed approaches. All results are given in macro ROC-AUC (%). The baseline achieves 72.5 % ROC-AUC on the test set [7].

CRNN
RNN type   validation   test
LSTM       71.4         69.4
GRU        72.6         69.5
BLSTM      71.9         68.2

Deep Spectrum [3] + RNN
spectrogram width (s)   RNN type   validation   test
1                       LSTM       70.1         70.0
1                       GRU        68.4         69.8
1                       BLSTM      69.2         71.0
5                       LSTM       69.0         70.3
5                       GRU        68.8         69.9
5                       BLSTM      68.4         70.8

Fusion
fused models                       validation   test
All CRNN (3 models)                –            70.7
All 1 s Deep Spectrum (3 models)   –            71.5
All 5 s Deep Spectrum (3 models)   –            71.6
All Deep Spectrum (6 models)       –            72.6
All systems (9 models)             –            74.2

3 RESULTS AND ANALYSIS
The results of our experiments are shown in Table 1. Our best CRNN model with GRU layers reaches 69.5 % ROC-AUC on the test set, while a bi-directional LSTM trained on 1 s Deep Spectrum features achieves 71.0 % ROC-AUC. These results can be explained by the fact that we use a fixed-size chunk of each song (20 s for the CRNN and 30 s for Deep Spectrum + RNN) instead of the whole song. We made this choice because training the RNN models on longer sequences quickly becomes computationally infeasible. Nonetheless, we can see that fusion leads to an increase in performance. For each type of system, in-group fusion only leads to marginal performance boosts. We notice a larger positive effect when combining different system types, hinting at complementary information found on different scales. Finally, fusing all 9 systems increases the performance to 74.2 % ROC-AUC on the test set. This shows that the features extracted from spectrograms with an ImageNet pre-trained CNN provide further information not found by training on audio data alone. Our fusion configuration further achieves a macro average F1 of 17.5 % and a macro PR-AUC of 11.7 %.

4 DISCUSSION AND OUTLOOK
We outperformed the competitive challenge baseline of the MediaEval 2019 Emotion & Themes in Music task after fusing the outputs of our two systems (cf. Table 1). We also demonstrated that the Deep Spectrum + RNN approach (which makes use of CNNs pre-trained on ImageNet) yields better results than the CRNN with the vgg-ish model. For future work, a systematic comparison between engineered and data-driven feature sets will be conducted using the same machine learning models. Its aim will be to determine the usefulness of data-driven features for emotion and theme prediction in music. We believe that this research direction can lead to a better understanding of the relevant cues for emotion communication in music and to improvements in automated emotion recognition systems.

REFERENCES
[1] Shahin Amiriparian, Alice Baird, Sahib Julka, Alyssa Alcorn, Sandra Ottl, Suncica Petrović, Eloise Ainger, Nicholas Cummins, and Björn Schuller. 2018. Recognition of Echolalic Autistic Child Vocalisations Utilising Convolutional Recurrent Neural Networks. In Proceedings of INTERSPEECH 2018, 19th Annual Conference of the International Speech Communication Association. ISCA, Hyderabad, India, 2334–2338.
[2] Shahin Amiriparian, Nicholas Cummins, Maurice Gerczuk, Sergey Pugachevskiy, Sandra Ottl, and Björn Schuller. 2018. "Are You Playing a Shooter Again?!" Deep Representation Learning for Audio-based Video Game Genre Recognition. IEEE Transactions on Games 11 (2018).
[3] Shahin Amiriparian, Maurice Gerczuk, Sandra Ottl, Nicholas Cummins, Michael Freitag, Sergey Pugachevskiy, and Björn Schuller. 2017. Snore Sound Classification Using Image-based Deep Spectrum Features. In Proceedings of INTERSPEECH 2017, 18th Annual Conference of the International Speech Communication Association. ISCA, Stockholm, Sweden, 3512–3516.
[4] Shahin Amiriparian, Maurice Gerczuk, Sandra Ottl, Nicholas Cummins, Sergey Pugachevskiy, and Björn Schuller. 2018. Bag-of-Deep-Features: Noise-Robust Deep Feature Representations for Audio Analysis. In Proceedings of the 31st International Joint Conference on Neural Networks (IJCNN). IEEE, Rio de Janeiro, Brazil, 2419–2425.
[5] Shahin Amiriparian, Sahib Julka, Nicholas Cummins, and Björn Schuller. 2018. Deep Convolutional Recurrent Neural Networks for Rare Sound Event Detection. In Proceedings 44. Jahrestagung für Akustik, DAGA 2018. Deutsche Gesellschaft für Akustik e.V. (DEGA), Munich, Germany.
[6] Yusuf Aytar, Carl Vondrick, and Antonio Torralba. 2016. SoundNet: Learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.). Curran Associates, Inc., Barcelona, Spain, 892–900.
[7] Dmitry Bogdanov, Alastair Porter, Philip Tovstogan, and Minz Won. 2019. MediaEval 2019: Emotion and Theme Recognition in Music Using Jamendo. In MediaEval Benchmarking Initiative for Multimedia Evaluation. Sophia Antipolis, France.
[8] Dmitry Bogdanov, Minz Won, Philip Tovstogan, Alastair Porter, and Xavier Serra. 2019. The MTG-Jamendo Dataset for Automatic Music Tagging. In Machine Learning for Music Discovery Workshop, International Conference on Machine Learning (ICML 2019). ICML, Long Beach, CA, United States.
[9] Keunwoo Choi, Deokjin Joo, and Juho Kim. 2017. Kapre: On-GPU Audio Preprocessing Layers for a Quick Implementation of Deep Neural Network Models with Keras. In Machine Learning for Music Discovery Workshop at the 34th International Conference on Machine Learning (ICML). Sydney, Australia.
[10] Eduardo Coutinho and Angelo Cangelosi. 2009. The Use of Spatio-Temporal Connectionist Models in Psychological Studies of Musical Emotions. Music Perception: An Interdisciplinary Journal 27, 1 (Sep 2009), 1–15.
[11] Eduardo Coutinho and Angelo Cangelosi. 2011. Musical Emotions: Predicting Second-by-Second Subjective Feelings of Emotion From Low-Level Psychoacoustic Features and Physiological Measurements. Emotion 11, 4 (Aug 2011), 921–937.
[12] Eduardo Coutinho, Jun Deng, and Björn Schuller. 2014. Transfer learning emotion manifestation across music and speech. In 2014 International Joint Conference on Neural Networks (IJCNN). IEEE, 3592–3598.
[13] Eduardo Coutinho and Nicola Dibben. 2013. Psychoacoustic cues to emotion in speech prosody and music. Cognition & Emotion 27, 4 (Jun 2013), 658–684.
[14] Eduardo Coutinho and Björn Schuller. 2017. Shared acoustic codes underlie emotional communication in music and speech – Evidence from deep transfer learning. PLoS ONE 12, 6 (2017), e0179289.
[15] Eduardo Coutinho, Felix Weninger, Björn Schuller, and Klaus R. Scherer. 2014. The Munich LSTM-RNN approach to the MediaEval 2014 "Emotion in Music" Task. In CEUR Workshop Proceedings, Martha Larson, Bogdan Ionescu, Xavier Anguera, Maria Eskevich, Pavel Korshunov, Markus Schedl, Mohammad Soleymani, Georgios Petkos, Richard Sutcliffe, Jaeyoung Choi, and Gareth J.F. Jones (Eds.), Vol. 1263. CEUR, Barcelona, Spain.
[16] J. Deng, W. Dong, R. Socher, L. J. Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Miami, FL, 248–255.
[17] Yizhuo Dong, Xinyu Yang, Xi Zhao, and Juan Li. 2019. Bidirectional Convolutional Recurrent Sparse Network (BCRSN): An Efficient Model for Music Emotion Recognition. IEEE Transactions on Multimedia (2019).
[18] Alf Gabrielsson and Erik Lindström. 2010. The role of structure in the musical expression of emotions. In Handbook of music and emotion: Theory, research, applications, Patrik N. Juslin and John Sloboda (Eds.). Oxford University Press, Oxford, 367–400.
[19] Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, and others. 2017. CNN architectures for large-scale audio classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 131–135.
[20] Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015).
[21] Patrik N. Juslin and John Sloboda (Eds.). 2011. Handbook of music and emotion: Theory, research, applications. Oxford University Press.
[22] Youngmoo E. Kim, Erik M. Schmidt, Raymond Migneco, Brandon G. Morton, Patrick Richardson, Jeffrey Scott, Jacquelin A. Speck, and Douglas Turnbull. 2010. Music emotion recognition: A state of the art review. In Proceedings of ISMIR, Vol. 86. Utrecht, Holland, 937–952.
[23] Huaping Liu, Yong Fang, and Qinghua Huang. 2019. Music Emotion Recognition Using a Variant of Recurrent Neural Network. In 2018 International Conference on Mathematics, Modeling, Simulation and Statistics Application (MMSSA 2018). Atlantis Press.
[24] Vinod Nair and Geoffrey Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10). Haifa, Israel, 807–814.
[25] Richard Orjesek, Roman Jarina, Michal Chmulik, and Michal Kuba. 2019. DNN Based Music Emotion Recognition from Raw Audio Signal. In 2019 29th International Conference Radioelektronika (RADIOELEKTRONIKA). IEEE, 1–4.
[26] Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR abs/1409.1556 (2014).
[27] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15, 1 (2014), 1929–1958.
[28] Tijmen Tieleman and Geoffrey Hinton. 2012. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning 4, 2 (2012), 26–31.