Recognizing Music Mood and Theme Using Convolutional Neural Networks and Attention

Alish Dipani 1,2,†, Gaurav Iyer 2,†, Veeky Baths 2
1 Upload AI LLC, USA
2 Cognitive Neuroscience Lab, BITS Pilani, K. K. Birla Goa Campus, India
alish.dipani@uploadai.com, f20170544@goa.bits-pilani.ac.in, veeky@goa.bits-pilani.ac.in

† Authors contributed equally.
§ https://github.com/alishdipani/Multimediaeval2020-emotions-and-themes-in-music

Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'20, December 14-15 2020, Online.

ABSTRACT
We present the UAI-CNRL submission to the MediaEval 2020 task on Emotion and Theme Recognition in Music. We make use of the ResNet34 architecture, coupled with a self-attention module, to detect moods/themes in music tracks. The autotagging-moodtheme subset of the MTG-Jamendo dataset was used to train the model. We show that the proposed model outperforms the provided VGG-ish and popularity baselines.

1 INTRODUCTION
Music has been shown to induce a variety of emotions such as happiness, sadness, and anger [7, 8, 27]. This induction of emotions can be attributed to intrinsic properties such as tempo, rhythm variations, intensity, and mode, as well as to extrinsic properties such as the association of music with personal events and previous experiences [12, 23]. These emotional responses could also be one of the important motivators for humans to listen to music [20-22].

Automatic tagging and detection of emotions in music is a difficult task considering the subjectivity of human emotions. The MTG-Jamendo dataset [4] aims at tackling several such autotagging tasks by providing royalty-free audio of consistent quality with tags for genre, instruments, and mood/theme. The Emotion and Theme Recognition Task of MediaEval 2020 uses the mood/theme subset of the MTG-Jamendo dataset. The task is as follows: given audio, automatically detect one or multiple moods/themes out of 56 given tags, for example, fun, sad, romantic, happy [3].

In this paper, we describe our approach (team name: UAI-CNRL) for this task, which uses convolutional neural networks to extract features from the mel-spectrograms of the audio and multi-head self-attention to predict the mood/theme by processing the extracted features. Our approach achieves better performance than the baselines.

2 RELATED WORK
Convolutional neural networks (CNNs) have been successful in extracting meaningful features for tasks such as image recognition [10, 14] and object detection [10]. In the field of audio processing, CNNs have been used for a variety of tasks, such as automatic tagging [6], source separation [30], music emotion classification [16], and speaker identification [18].

Transformer networks, which use self-attention layers [28], have been successful in tackling language tasks involving long-range dependencies. They have also been used in the field of audio processing for many tasks, such as automatic tagging [29], source separation [5], and speech recognition [2].

A combination of these methods has been demonstrated to achieve state-of-the-art performance [2, 9, 32]. Inspired by these, we use convolution layers to extract features from mel-spectrograms and self-attention layers to process those features to predict the moods/themes.

3 APPROACH
We make use of a popular convolutional neural network architecture, the ResNet [10], as a feature extractor to obtain compact representations of our data. We pair this with self-attention [28] in order to capture long-term temporal attributes of the given data. We also make use of batch normalization [11] and dropout [24] in order to further regularize the model. We describe the model architecture in this section. Our code and trained model are available at this URL§.

3.1 ResNet34
Residual connections make training deep neural networks easier, since they address the problem of vanishing gradients. We make use of a standard ResNet34 architecture to take advantage of this property. It is preceded by two convolutional layers that reshape the data into a form that can be fed into the ResNet, and another convolutional layer is used after the ResNet feature extractor to reduce the number of channels.
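For illustration, the following is a minimal PyTorch sketch of this backbone, not the exact implementation in our repository; the channel widths of the reshaping convolutions and the reduced channel count are illustrative assumptions.

```python
import torch.nn as nn
from torchvision.models import resnet34


class FeatureExtractor(nn.Module):
    def __init__(self, reduced_channels=64):  # reduced_channels is assumed
        super().__init__()
        # Two convolutions reshape the 1-channel mel-spectrogram segment
        # into the 3-channel input that ResNet34 expects (widths assumed).
        self.reshape = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 3, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Standard ResNet34 with its average-pooling and classifier head
        # removed, so it returns 512-channel feature maps.
        self.resnet = nn.Sequential(*list(resnet34().children())[:-2])
        # A final convolution reduces the number of channels.
        self.reduce = nn.Conv2d(512, reduced_channels, kernel_size=1)

    def forward(self, segment):
        # segment: (batch, 1, 96, 256), one length-256 crop of the
        # 96-bin mel-spectrogram (see Sec. 4.2).
        return self.reduce(self.resnet(self.reshape(segment)))
```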
3.2 Self-Attention
The MTG-Jamendo dataset consists of tracks of varying lengths, a majority of which are over 200 seconds long. Using self-attention, we attempt to capture long-range temporal attributes and summarize the sequence of music representations. Our model architecture is inspired by the work in [25], which uses multi-head attention along with positional encoding. Two layers, each consisting of four attention heads, were used. The input sequence length and embedding size were left unchanged.

3.3 Data Augmentation
3.3.1 Mixup. Previous submissions to MediaEval 2019 [25] for this task have shown that Mixup [31] greatly improves the performance of the model being used. Mixup creates a new training example by linearly combining two random, existing training samples, in the feature space as well as in the label space. More formally, Mixup trains a neural network on convex combinations of pairs of examples and their labels. This helps the model alleviate unwanted behaviours, such as memorization, especially since the dataset is relatively small.
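A minimal sketch of Mixup for this multi-label setting is shown below (after [31]); the Beta-distribution parameter alpha is an illustrative choice, not a value we report.

```python
import torch


def mixup(x1, y1, x2, y2, alpha=0.2):
    """Convex combination of two training examples and their
    multi-hot label tensors; alpha is an assumed hyperparameter."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    x = lam * x1 + (1 - lam) * x2
    y = lam * y1 + (1 - lam) * y2
    return x, y
```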
3.3.2 SpecAugment. SpecAugment [19] is an augmentation technique introduced for speech recognition, which augments the spectrogram itself instead of the waveform data. SpecAugment modifies the spectrogram by warping it along the time axis, masking blocks of frequency channels, and masking blocks of time steps. This makes the model more robust to missing information in the input audio as well as to missing frequency information.

3.3.3 Other Augmentations. Other transformation techniques, such as random cropping and random scaling, were used to further augment the given data.

4 TRAINING DETAILS
This section describes the details of data pre-processing, the architecture, and other training details.

4.1 Data Preparation
We use the mel-spectrograms provided in the MTG-Jamendo dataset for training. Random cropping and scaling are used to augment and transform the data into a tensor of length 4096 (approximately 87.4 seconds). Additionally, SpecAugment is used to augment the dataset.

4.2 Architecture and Control Flow
• The input tensor of shape (1, 96, 4096) is divided into 16 segments length-wise, each new segment being of length 256.
• Each segment is then processed through 2 convolutional layers, in order to obtain a representation with 3 channels.
• The obtained representation is then passed into the ResNet34 feature extractor, followed by a convolutional layer to obtain an intermediate representation.
• The feature maps are then passed through the self-attention module, followed by a series of linear layers to obtain the final class scores. Dropout is used to regularize the training process.
• The model returns the outputs of the self-attention module and the feature maps (after passing them through the linear layers). Both outputs are used to compute the loss and perform backpropagation, but only the outputs of the self-attention module are used to make predictions.

4.3 Hyperparameters and Other Details
The model was trained with the Adam [13] optimizer at a learning rate of 1e-4 for 35 epochs. The values of β1 and β2 were set to 0.9 and 0.999, respectively. Binary cross entropy was used as the loss function. A sketch of this training setup follows.
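The sketch below is a minimal illustration, assuming a model that returns the two outputs described in Sec. 4.2; the equal weighting of the two branches in the loss and the use of raw logits with BCEWithLogitsLoss are assumptions made for illustration.

```python
import torch
import torch.nn as nn


def train(model, loader, epochs=35):
    # Adam with the stated hyperparameters: lr = 1e-4, betas = (0.9, 0.999).
    optimizer = torch.optim.Adam(model.parameters(),
                                 lr=1e-4, betas=(0.9, 0.999))
    # Binary cross entropy over the 56 tags, applied to raw logits.
    criterion = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        for x, y in loader:
            attn_logits, feat_logits = model(x)
            # Both outputs contribute to the loss (equal weighting is our
            # assumption); only attn_logits are used for predictions.
            loss = criterion(attn_logits, y) + criterion(feat_logits, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```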
5 RESULTS
The proposed model produces results that improve on those of the given VGG-ish and popularity baselines. We obtain a ROC-AUC-macro of 0.7360 and a PR-AUC-macro of 0.1275. For comparison, the baseline VGG-ish model produces a ROC-AUC-macro of 0.7258 and a PR-AUC-macro of 0.1077. Detailed results can be found in Table 1.

Table 1: Results

Metric            Ours     VGG-ish [3]   popularity [3]
ROC-AUC-macro     0.7360   0.7258        0.5000
PR-AUC-macro      0.1275   0.1077        0.03192
precision-macro   0.1639   0.1382        0.0014
recall-macro      0.3487   0.3086        0.0179
F-score-macro     0.1884   0.1657        0.0026
ROC-AUC-micro     0.7865   0.7750        0.5139
PR-AUC-micro      0.1369   0.1409        0.0341
precision-micro   0.1105   0.1161        0.0799
recall-micro      0.4032   0.3735        0.0447
F-score-micro     0.1735   0.1771        0.0573

6 FUTURE WORK
In this section, we discuss other approaches that we considered for the problem statement. These may serve as pointers for future work on tasks involving this dataset.

Our approach can be broken down into two parts: first, the extraction of features from the audio data, and second, the processing of the extracted features to predict the moods/themes. Both parts could potentially be improved upon, and we mention a few ways to do so below.

With respect to feature extraction:
• Using a wider range of features to aid the classification task instead of only mel-spectrograms. For example, the LEAF frontend proposed by [1] could be used.
• Using a self-supervised approach to extract features, such as wav2vec 2.0 [2]. This would also reduce reliance on labelled data.
• Using temporal convolutional networks [15] to extract features directly from audio instead of mel-spectrograms.

With respect to the processing of extracted features:
• Using dual-path processing inspired by [17] in order to capture long-term dependencies while also reducing computational load.
• Exploring ways of processing the raw audio data with more powerful models, such as WaveNet [26], in order to obtain better insights into the dataset and theme recognition in general.

ACKNOWLEDGMENTS
We thank Shell Xu Hu for helpful discussions.
REFERENCES
[1] Anonymous. 2021. A Universal Learnable Audio Frontend. Submitted to International Conference on Learning Representations. https://openreview.net/forum?id=jM76BCb6F9m. Under review.
[2] Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. arXiv:2006.11477.
[3] Dmitry Bogdanov, Alastair Porter, Philip Tovstogan, and Minz Won. 2020. Emotion and Theme Recognition in Music Using Jamendo. In Working Notes Proceedings of the MediaEval 2020 Workshop.
[4] Dmitry Bogdanov, Minz Won, Philip Tovstogan, Alastair Porter, and Xavier Serra. 2019. The MTG-Jamendo Dataset for Automatic Music Tagging. In Machine Learning for Music Discovery Workshop, International Conference on Machine Learning (ICML 2019). Long Beach, CA, United States. http://hdl.handle.net/10230/42015
[5] Jingjing Chen, Qirong Mao, and Dong Liu. 2020. Dual-path transformer network: Direct context-aware modeling for end-to-end monaural speech separation. arXiv:2007.13975.
[6] Keunwoo Choi, George Fazekas, and Mark Sandler. 2016. Automatic tagging using deep convolutional neural networks. arXiv:1606.00298.
[7] Hauke Egermann, Nathalie Fernando, Lorraine Chuen, and Stephen McAdams. 2015. Music induces universal emotion-related psychophysiological responses: comparing Canadian listeners to Congolese Pygmies. Frontiers in Psychology 5 (2015), 1341.
[8] Thomas Fritz, Sebastian Jentschke, Nathalie Gosselin, Daniela Sammler, Isabelle Peretz, Robert Turner, Angela D. Friederici, and Stefan Koelsch. 2009. Universal recognition of three basic emotions in music. Current Biology 19, 7 (2009), 573-576.
[9] Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang. 2020. Conformer: Convolution-augmented Transformer for Speech Recognition. arXiv:2005.08100.
[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. arXiv:1512.03385.
[11] Sergey Ioffe and Christian Szegedy. 2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv:1502.03167.
[12] Stéphanie Khalfa, Mathieu Roy, Pierre Rainville, Simone Dalla Bella, and Isabelle Peretz. 2008. Role of tempo entrainment in psychophysiological differentiation of happy and sad music? International Journal of Psychophysiology 68, 1 (2008), 17-26.
[13] Diederik P. Kingma and Jimmy Ba. 2017. Adam: A Method for Stochastic Optimization. arXiv:1412.6980.
[14] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2017. ImageNet classification with deep convolutional neural networks. Commun. ACM 60, 6 (2017), 84-90.
[15] Colin Lea, Rene Vidal, Austin Reiter, and Gregory D. Hager. 2016. Temporal convolutional networks: A unified approach to action segmentation. In European Conference on Computer Vision. Springer, 47-54.
[16] Xin Liu, Qingcai Chen, Xiangping Wu, Yan Liu, and Yang Liu. 2017. CNN based music emotion classification. arXiv:1704.05665.
[17] Yi Luo, Zhuo Chen, and Takuya Yoshioka. 2020. Dual-path RNN: efficient long sequence modeling for time-domain single-channel speech separation. arXiv:1910.06379.
[18] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman. 2017. VoxCeleb: A Large-Scale Speaker Identification Dataset. Interspeech 2017 (Aug 2017). https://doi.org/10.21437/interspeech.2017-950
[19] Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le. 2019. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. Interspeech 2019 (Sep 2019). https://doi.org/10.21437/interspeech.2019-2680
[20] Mark Reybrouck and Tuomas Eerola. 2017. Music and its inductive power: a psychobiological and evolutionary approach to musical emotions. Frontiers in Psychology 8 (2017), 494.
[21] Thomas Schäfer, Peter Sedlmeier, Christine Städtler, and David Huron. 2013. The psychological functions of music listening. Frontiers in Psychology 4 (2013), 511.
[22] Roni Shifriss, Ehud Bodner, and Yuval Palgi. 2015. When you're down and troubled: Views on the regulatory power of music. Psychology of Music 43, 6 (2015), 793-807.
[23] John A. Sloboda and Patrik N. Juslin. 2001. Psychological perspectives on music and emotion. Music and Emotion: Theory and Research (2001), 71-104.
[24] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research 15, 56 (2014), 1929-1958. http://jmlr.org/papers/v15/srivastava14a.html
[25] Manoj Sukhavasi and Sainath Adapa. 2019. Music theme recognition using CNN and self-attention. arXiv:1911.07041.
[26] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. WaveNet: A Generative Model for Raw Audio. arXiv:1609.03499.
[27] Daniel Västfjäll. 2001. Emotion induction through music: A review of the musical mood induction procedure. Musicae Scientiae 5, 1_suppl (2001), 173-211.
[28] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. arXiv:1706.03762.
[29] Minz Won, Sanghyuk Chun, and Xavier Serra. 2019. Toward interpretable music tagging with self-attention. arXiv:1906.04972.
[30] Jeroen Zegers and Hugo Van hamme. 2019. CNN-LSTM models for Multi-Speaker Source Separation using Bayesian Hyper Parameter Optimization. arXiv:1912.09254.
[31] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. 2018. mixup: Beyond Empirical Risk Minimization. arXiv:1710.09412.
[32] Yu Zhang, James Qin, Daniel S. Park, Wei Han, Chung-Cheng Chiu, Ruoming Pang, Quoc V. Le, and Yonghui Wu. 2020. Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition. arXiv:2010.10504.