Emotion and Themes Recognition in Music Utilising Convolutional and Recurrent Neural Networks

Shahin Amiriparian1, Maurice Gerczuk1, Eduardo Coutinho2, Alice Baird1, Sandra Ottl1, Manuel Milling1, Björn Schuller1,3
1 ZD.B. Chair of Embedded Intelligence for Health Care & Wellbeing, University of Augsburg, Germany
2 Applied Music Research Lab, Department of Music, University of Liverpool, U. K.
3 GLAM – Group on Language, Audio & Music, Imperial College London, U. K.
amiriparian@ieee.org

Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
Emotion is an inherent aspect of music, and associations to music can be made via both life experience and specific musical techniques applied by the composer. Computational approaches for music recognition have been well-established in the research community; however, deep approaches have been limited and not yet comparable to conventional approaches. In this study, we present our fusion system of end-to-end convolutional recurrent neural networks (CRNN) and pre-trained convolutional feature extractors for music emotion and theme recognition (source code: https://github.com/amirip/AugLi-MediaEval). We train 9 models and conduct various late fusion experiments. Our best performing model (team name: AugLi) achieves 74.2 % ROC-AUC on the test partition, which is 1.6 percentage points above the baseline system of the MediaEval 2019 Emotion & Themes in Music task.

1 INTRODUCTION
The ability of music to express and induce emotions is a well-known and demonstrable fact [21]. It communicates and induces similar emotional states in all listeners because musical parameters (e. g., rhythm, melody, timbre, dynamics) encode affective information that is implicitly decoded by listeners [14, 18]. Furthermore, both music psychologists and computer scientists have provided plenty of evidence that listeners construe emotional meaning by attending to structural aspects of the acoustic signal at various levels [10, 13, 22]. Recent deep learning solutions demonstrate the suitability of recurrent neural networks (RNNs), autoencoders, and convolutional neural networks (CNNs) for the task of audio-based music emotion recognition (MER) [17, 23, 25]. In [12], we have utilised denoising autoencoders and a transfer learning approach for time-continuous predictions of emotion in music and speech. Furthermore, we have conducted both psychological and computational experiments aimed at clarifying the role of music structure in the expression and induction of musical emotions [11, 15]. In this paper, we introduce our end-to-end architecture for the task of emotion and theme recognition in music at MediaEval 2019 [7].

2 APPROACH
Our framework – which is motivated by our previous works with CRNNs [1, 5] – is depicted in Figure 1. It consists of two models whose predictions are fused to obtain the final predictions. These models capture both shift-invariant, high-level features (convolutional block) and long(er)-term temporal context (recurrent block) from the musical inputs [7, 8]. The MTG-Jamendo dataset [8] includes 18 486 audio tracks with 56 distinct mood and theme annotations/tags. All audio files have at least one tag. The dataset provides 60-20-20 % splits for training, validation, and testing. For a full description of the challenge data, please refer to [8].

[Figure 1: An overview of our system composed of two CRNN blocks. For a detailed account of the framework, refer to Section 2.]

2.1 Convolutional Recurrent Neural Network
The CRNN system (upper part of Figure 1) consists of a vgg-ish model (trained on the AudioSet dataset [19]) with the final global average pooling layer replaced by an RNN. Specifically, we add 2 recurrent layers with 256 units (we tried 128, 256, and 512 units) and a dropout [27] of 0.3 (out of [0.2, 0.3, 0.4]) for each layer, followed by a 1 024-unit dense layer, batch normalisation [20], ReLU activation [24], and a dropout of 0.3. Tagging is performed by a 56-unit dense layer with sigmoid activation. We initialise the convolutional feature extractor with the official SoundNet trained weights [6]. Sequences of log Mel spectrograms are generated using the kapre keras library [9]: the input is resampled to 16 kHz, and 64 Mel filters and an FFT window of 512 samples with a hop size of 256 are used. During training, we sample a random 20 s chunk of every song and apply random Gaussian noise with a maximum power of 0.2. For evaluation, we use the centre 20 s chunk of each song.

We apply the RMSprop optimiser [28] and train the network with a batch size of 32. We first train only the top RNN and tagging layers for 20 epochs with a learning rate of 0.001, keeping the weights of the pre-trained vgg-ish frozen. We then unfreeze the feature extraction layers and resume training from the best checkpoint – measured in validation Area Under the Receiver Operating Characteristic curve (ROC-AUC) – with a reduced learning rate of 0.0001 for another 80 epochs. Finally, the best overall model is restored and evaluated on the test partition.
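For illustration, the CRNN described above can be sketched in Keras roughly as follows. This is a minimal sketch rather than the released implementation (see the repository linked in the abstract): the small convolutional stack merely stands in for the pre-trained vgg-ish extractor, the log Mel spectrogram is assumed to be computed outside the model (the paper computes it on the fly with kapre), and names such as build_crnn, N_MELS, N_FRAMES, and N_TAGS are chosen for illustration.

```python
# Minimal Keras sketch of the CRNN head from Section 2.1 (an assumption-laden
# illustration, not the authors' released code).
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers

N_MELS = 64      # Mel bands (Section 2.1)
N_FRAMES = 1251  # roughly 20 s at 16 kHz with a hop size of 256 samples
N_TAGS = 56      # one sigmoid output per mood/theme tag

def build_crnn(rnn_cell=layers.GRU):  # layers.LSTM for the LSTM variant
    spec = layers.Input(shape=(N_FRAMES, N_MELS, 1), name="log_mel")

    # Placeholder for the vgg-ish feature extractor (conv + pooling blocks);
    # in the paper this part is initialised with pre-trained weights.
    x = spec
    for n_filters in (64, 128, 256):
        x = layers.Conv2D(n_filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(pool_size=(2, 2))(x)

    # Collapse the frequency axis so the time axis can feed the RNN,
    # replacing the global average pooling of the original network.
    x = layers.Reshape((-1, x.shape[2] * x.shape[3]))(x)

    # Two recurrent layers with 256 units and a dropout of 0.3 each.
    x = rnn_cell(256, return_sequences=True, dropout=0.3)(x)
    x = rnn_cell(256, dropout=0.3)(x)

    # 1 024-unit dense layer, batch normalisation, ReLU, and dropout.
    x = layers.Dense(1024)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.Dropout(0.3)(x)

    # Multi-label tagging layer with sigmoid activation.
    out = layers.Dense(N_TAGS, activation="sigmoid", name="tags")(x)

    model = models.Model(spec, out)
    model.compile(optimizer=optimizers.RMSprop(learning_rate=1e-3),
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC(name="roc_auc")])
    return model
```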
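The two-stage training schedule could then be implemented along the following lines, again as a hedged Keras sketch: model is assumed to be a compiled CRNN as described above, n_conv_layers counts its convolutional layers, and train_ds/val_ds are placeholder data pipelines yielding batches of (log Mel chunk, tag vector) pairs.

```python
# Hedged sketch of the two-stage fine-tuning schedule from Section 2.1.
# `train_ds`/`val_ds` are assumed to be batched with a batch size of 32.
import tensorflow as tf

def train_two_stage(model, train_ds, val_ds, n_conv_layers):
    ckpt = tf.keras.callbacks.ModelCheckpoint(
        "best_crnn.weights.h5", monitor="val_roc_auc", mode="max",
        save_best_only=True, save_weights_only=True)

    # Stage 1: freeze the pre-trained feature extractor and train only the
    # recurrent and tagging layers for 20 epochs at a learning rate of 0.001.
    for layer in model.layers[:n_conv_layers]:
        layer.trainable = False
    model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-3),
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC(name="roc_auc")])
    model.fit(train_ds, validation_data=val_ds, epochs=20, callbacks=[ckpt])

    # Stage 2: unfreeze the feature extractor and resume from the best
    # checkpoint with the learning rate reduced to 0.0001 for 80 more epochs.
    model.load_weights("best_crnn.weights.h5")
    for layer in model.layers:
        layer.trainable = True
    model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-4),
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC(name="roc_auc")])
    model.fit(train_ds, validation_data=val_ds, epochs=80, callbacks=[ckpt])

    # Restore the overall best model (by validation ROC-AUC) before testing.
    model.load_weights("best_crnn.weights.h5")
    return model
```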
2.2 Utilising pre-trained CNNs
The second model (see bottom part of Figure 1) uses our Deep Spectrum system (https://github.com/DeepSpectrum/DeepSpectrum) [3] to extract pre-trained CNN features from Mel spectrograms (128 Mel filters) of the songs; such features have been shown to outperform engineered feature sets on a variety of acoustic tasks [2–4]. We use an ImageNet [16] pre-trained VGG16 architecture [26] and forward plots of 1 and 5 second audio chunks through the network. The activations of the penultimate layer then form our feature vectors. We extract these features for the first 30 seconds (the minimum song duration in the dataset [8]) of each song and use them as sequenced input for training RNNs.

For both feature types, three RNN architectures are trained which differ in the choice of recurrent cells, as with the CRNN. We chose an architecture with 2 recurrent layers of 1 024 units each, followed by a dense layer with the same number of units before the final densely connected prediction layer. Batch normalisation is applied after each of the recurrent layers and after the penultimate dense layer, and a dropout of 0.4 is applied to the activations of the hidden layers. We train the models using RMSprop with a learning rate of 0.001 and a batch size of 32 for a maximum of 1 000 epochs, but perform early stopping if the validation ROC-AUC does not increase for over 50 epochs; in practice, none of our models was trained for more than 200 epochs. As for the CRNN, we restore the best model checkpoint before evaluating on the test partition.
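A rough sketch of this feature extraction step, using librosa and matplotlib to approximate what the Deep Spectrum toolkit does, is given below; the 16 kHz sample rate, the viridis colour map, and the helper names spectrogram_plot and chunk_features are illustrative assumptions rather than details taken from the toolkit.

```python
# Deep-Spectrum-style feature extraction sketch: render Mel spectrogram plots
# of short audio chunks and take the penultimate-layer (fc2) activations of an
# ImageNet pre-trained VGG16 as 4 096-dimensional feature vectors.
import numpy as np
import librosa
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input

vgg = VGG16(weights="imagenet", include_top=True)
extractor = tf.keras.Model(vgg.input, vgg.get_layer("fc2").output)

def spectrogram_plot(chunk, sr):
    """Render a 128-band Mel spectrogram of one chunk as a 224x224 RGB image."""
    mel = librosa.feature.melspectrogram(y=chunk, sr=sr, n_mels=128)
    log_mel = librosa.power_to_db(mel)
    fig = plt.figure(figsize=(1, 1), dpi=224)   # 224x224 pixel canvas
    ax = fig.add_axes([0, 0, 1, 1])
    ax.axis("off")
    ax.imshow(log_mel, origin="lower", aspect="auto", cmap="viridis")
    fig.canvas.draw()
    img = np.asarray(fig.canvas.buffer_rgba())[..., :3].astype(np.float32)
    plt.close(fig)
    return img

def chunk_features(path, chunk_s=1.0, sr=16000, duration=30.0):
    """Feature sequence for the first 30 s of a song, one vector per chunk."""
    y, sr = librosa.load(path, sr=sr, duration=duration)
    hop = int(chunk_s * sr)
    plots = [spectrogram_plot(y[i:i + hop], sr)
             for i in range(0, len(y) - hop + 1, hop)]
    batch = preprocess_input(np.stack(plots))
    return extractor.predict(batch)   # shape: (n_chunks, 4096)
```

Setting chunk_s to 1.0 or 5.0 yields the 1 s and 5 s feature windows used in the experiments.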
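The RNN trained on these feature sequences can be sketched as follows (for the bidirectional variants, the recurrent layers would additionally be wrapped in layers.Bidirectional); FEATURE_DIM, N_TAGS, and build_ds_rnn are again illustrative names rather than identifiers from the released code.

```python
# Minimal Keras sketch of the RNN on Deep Spectrum feature sequences
# (Section 2.2): two 1 024-unit recurrent layers, batch normalisation and
# dropout of 0.4, a 1 024-unit dense layer, and a sigmoid tagging layer.
import tensorflow as tf
from tensorflow.keras import layers, models

FEATURE_DIM = 4096   # penultimate-layer activations of VGG16
N_TAGS = 56

def build_ds_rnn(rnn_cell=layers.LSTM):
    seq = layers.Input(shape=(None, FEATURE_DIM))   # sequence of chunk features

    x = rnn_cell(1024, return_sequences=True)(seq)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.4)(x)
    x = rnn_cell(1024)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.4)(x)

    # Penultimate dense layer of the same size, then the tagging layer.
    x = layers.Dense(1024, activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.4)(x)
    out = layers.Dense(N_TAGS, activation="sigmoid")(x)

    model = models.Model(seq, out)
    model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-3),
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC(name="roc_auc")])
    return model

# Early stopping on validation ROC-AUC with a patience of 50 epochs,
# restoring the best weights before evaluation on the test partition.
callbacks = [tf.keras.callbacks.EarlyStopping(
    monitor="val_roc_auc", mode="max", patience=50,
    restore_best_weights=True)]
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=1000, batch_size=32, callbacks=callbacks)
```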
2.3 Fusion Experiments
To explore further potential performance improvements, we conduct model fusion experiments by averaging the prediction scores returned by our networks for the test partition. From these scores, we generate the corresponding tag decisions with the official challenge script. In total, we evaluate five different fusion scenarios: fusion of all systems, fusion of all Deep Spectrum systems, fusion of all CRNN systems, and fusion of the Deep Spectrum systems trained on 1 s and on 5 s feature windows, respectively.
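Since the fusion is a plain average of per-tag prediction scores, it amounts to little more than the following sketch; the .npy storage format and the file names are assumptions made for illustration.

```python
# Late fusion by score averaging (Section 2.3): per-tag prediction scores of
# several models are averaged element-wise over the model axis.
import numpy as np

def fuse_scores(score_files):
    # Each file is assumed to hold an array of shape (n_test_tracks, n_tags).
    scores = [np.load(path) for path in score_files]
    return np.mean(scores, axis=0)

# Example: fuse all nine systems (3 CRNN + 6 Deep Spectrum + RNN variants),
# e.g. fuse_scores(["crnn_gru.npy", "crnn_lstm.npy", "ds1_blstm.npy"]);
# tag decisions are then derived with the official challenge script.
```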
Table 1: Performance of our proposed approaches. All results are given in macro ROC-AUC (%). The baseline achieves 72.5 % ROC-AUC on the test set [7].

CRNN
RNN type   validation   test
LSTM       71.4         69.4
GRU        72.6         69.5
BLSTM      71.9         68.2

Deep Spectrum [3] + RNN
spectrogram width (s)   RNN type   validation   test
1                       LSTM       70.1         70.0
1                       GRU        68.4         69.8
1                       BLSTM      69.2         71.0
5                       LSTM       69.0         70.3
5                       GRU        68.8         69.9
5                       BLSTM      68.4         70.8

Fusion
fused models                       validation   test
All CRNN (3 models)                –            70.7
All 1 s Deep Spectrum (3 models)   –            71.5
All 5 s Deep Spectrum (3 models)   –            71.6
All Deep Spectrum (6 models)       –            72.6
All systems (9 models)             –            74.2

3 RESULTS AND ANALYSIS
The results of our experiments are shown in Table 1. Our best CRNN model with GRU layers reaches 69.5 % ROC-AUC on the test set, while a bi-directional LSTM trained on 1 s Deep Spectrum features achieves 71.0 % ROC-AUC. These results can be explained by the fact that we use a fixed-size chunk of each song (20 s for the CRNN and 30 s for Deep Spectrum + RNN) instead of the whole song. We made this choice because training the RNN models on longer sequences quickly becomes computationally infeasible. Nonetheless, we can see that fusion leads to an increase in performance. For each type of system, in-group fusion only leads to marginal performance boosts. We notice a larger positive effect when combining different system types, hinting at complementary information found on different scales. Finally, fusing all 9 systems increases the performance to 74.2 % ROC-AUC on the test set. This shows that the features extracted from spectrograms with an ImageNet pre-trained CNN provide further information not found by training on audio data alone. Our fusion configuration further achieves a macro average F1 of 17.5 % and a macro PR-AUC of 11.7 %.

4 DISCUSSION AND OUTLOOK
We outperformed the competitive challenge baseline of the MediaEval 2019 Emotion & Themes in Music task after fusing the outputs of our two systems (cf. Table 1). We also demonstrated that the Deep Spectrum + RNN approach (which makes use of CNNs pre-trained on ImageNet) yields better results than the CRNN with the vgg-ish model. For future work, a systematic comparison between engineered and data-driven feature sets will be conducted using the same machine learning models. Its aim will be to determine the usefulness of data-driven features for emotion and theme prediction in music. We believe that this research direction can lead to a better understanding of the relevant cues for emotion communication in music and to improvements in automated emotion recognition systems.

REFERENCES
[1] Shahin Amiriparian, Alice Baird, Sahib Julka, Alyssa Alcorn, Sandra Ottl, Suncica Petrović, Eloise Ainger, Nicholas Cummins, and Björn Schuller. 2018. Recognition of Echolalic Autistic Child Vocalisations Utilising Convolutional Recurrent Neural Networks. In Proceedings of INTERSPEECH 2018, 19th Annual Conference of the International Speech Communication Association. ISCA, Hyderabad, India, 2334–2338.
[2] Shahin Amiriparian, Nicholas Cummins, Maurice Gerczuk, Sergey Pugachevskiy, Sandra Ottl, and Björn Schuller. 2018. "Are You Playing a Shooter Again?!" Deep Representation Learning for Audio-based Video Game Genre Recognition. IEEE Transactions on Games 11 (2018).
[3] Shahin Amiriparian, Maurice Gerczuk, Sandra Ottl, Nicholas Cummins, Michael Freitag, Sergey Pugachevskiy, and Björn Schuller. 2017. Snore Sound Classification Using Image-based Deep Spectrum Features. In Proceedings of INTERSPEECH 2017, 18th Annual Conference of the International Speech Communication Association. ISCA, Stockholm, Sweden, 3512–3516.
[4] Shahin Amiriparian, Maurice Gerczuk, Sandra Ottl, Nicholas Cummins, Sergey Pugachevskiy, and Björn Schuller. 2018. Bag-of-Deep-Features: Noise-Robust Deep Feature Representations for Audio Analysis. In Proceedings of the 31st International Joint Conference on Neural Networks (IJCNN). IEEE, Rio de Janeiro, Brazil, 2419–2425.
[5] Shahin Amiriparian, Sahib Julka, Nicholas Cummins, and Björn Schuller. 2018. Deep Convolutional Recurrent Neural Networks for Rare Sound Event Detection. In Proceedings 44. Jahrestagung für Akustik, DAGA 2018. Deutsche Gesellschaft für Akustik e.V. (DEGA), Munich, Germany.
[6] Yusuf Aytar, Carl Vondrick, and Antonio Torralba. 2016. SoundNet: Learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.). Curran Associates, Inc., Barcelona, Spain, 892–900.
[7] Dmitry Bogdanov, Alastair Porter, Philip Tovstogan, and Minz Won. 2019. MediaEval 2019: Emotion and Theme Recognition in Music Using Jamendo. In MediaEval Benchmarking Initiative for Multimedia Evaluation. Sophia Antipolis, France.
[8] Dmitry Bogdanov, Minz Won, Philip Tovstogan, Alastair Porter, and Xavier Serra. 2019. The MTG-Jamendo Dataset for Automatic Music Tagging. In Machine Learning for Music Discovery Workshop, International Conference on Machine Learning (ICML 2019). ICML, Long Beach, CA, United States.
[9] Keunwoo Choi, Deokjin Joo, and Juho Kim. 2017. Kapre: On-GPU Audio Preprocessing Layers for a Quick Implementation of Deep Neural Network Models with Keras. In Machine Learning for Music Discovery Workshop at the 34th International Conference on Machine Learning (ICML). Sydney, Australia.
[10] Eduardo Coutinho and Angelo Cangelosi. 2009. The Use of Spatio-Temporal Connectionist Models in Psychological Studies of Musical Emotions. Music Perception: An Interdisciplinary Journal 27, 1 (Sep 2009), 1–15.
[11] Eduardo Coutinho and Angelo Cangelosi. 2011. Musical Emotions: Predicting Second-by-Second Subjective Feelings of Emotion From Low-Level Psychoacoustic Features and Physiological Measurements. Emotion 11, 4 (Aug 2011), 921–937.
[12] Eduardo Coutinho, Jun Deng, and Björn Schuller. 2014. Transfer learning emotion manifestation across music and speech. In 2014 International Joint Conference on Neural Networks (IJCNN). IEEE, 3592–3598.
[13] Eduardo Coutinho and Nicola Dibben. 2013. Psychoacoustic cues to emotion in speech prosody and music. Cognition & Emotion 27, 4 (Jun 2013), 658–684.
[14] Eduardo Coutinho and Björn Schuller. 2017. Shared acoustic codes underlie emotional communication in music and speech – Evidence from deep transfer learning. PLoS ONE 12, 6 (2017), e0179289.
[15] Eduardo Coutinho, Felix Weninger, Björn Schuller, and Klaus R. Scherer. 2014. The Munich LSTM-RNN approach to the MediaEval 2014 "Emotion in Music" Task. In CEUR Workshop Proceedings, Martha Larson, Bogdan Ionescu, Xavier Anguera, Maria Eskevich, Pavel Korshunov, Markus Schedl, Mohammad Soleymani, Georgios Petkos, Richard Sutcliffe, Jaeyoung Choi, and Gareth J.F. Jones (Eds.), Vol. 1263. CEUR, Barcelona, Spain.
[16] J. Deng, W. Dong, R. Socher, L. J. Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Miami, FL, 248–255.
[17] Yizhuo Dong, Xinyu Yang, Xi Zhao, and Juan Li. 2019. Bidirectional Convolutional Recurrent Sparse Network (BCRSN): An Efficient Model for Music Emotion Recognition. IEEE Transactions on Multimedia (2019).
[18] Alf Gabrielsson and Erik Lindström. 2010. The role of structure in the musical expression of emotions. In Handbook of music and emotion: Theory, research, applications, Patrik N. Juslin and John Sloboda (Eds.). Oxford University Press, Oxford, 367–400.
[19] Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, and others. 2017. CNN architectures for large-scale audio classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 131–135.
[20] Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015).
[21] Patrik N. Juslin and John Sloboda (Eds.). 2011. Handbook of music and emotion: Theory, research, applications. Oxford University Press.
[22] Youngmoo E. Kim, Erik M. Schmidt, Raymond Migneco, Brandon G. Morton, Patrick Richardson, Jeffrey Scott, Jacquelin A. Speck, and Douglas Turnbull. 2010. Music emotion recognition: A state of the art review. In Proceedings of ISMIR, Vol. 86. Utrecht, Holland, 937–952.
[23] Huaping Liu, Yong Fang, and Qinghua Huang. 2019. Music Emotion Recognition Using a Variant of Recurrent Neural Network. In 2018 International Conference on Mathematics, Modeling, Simulation and Statistics Application (MMSSA 2018). Atlantis Press.
[24] Vinod Nair and Geoffrey Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10). Haifa, Israel, 807–814.
[25] Richard Orjesek, Roman Jarina, Michal Chmulik, and Michal Kuba. 2019. DNN Based Music Emotion Recognition from Raw Audio Signal. In 2019 29th International Conference Radioelektronika (RADIOELEKTRONIKA). IEEE, 1–4.
[26] Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR abs/1409.1556 (2014).
[27] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15, 1 (2014), 1929–1958.
[28] Tijmen Tieleman and Geoffrey Hinton. 2012. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning 4, 2 (2012), 26–31.