                               Emotion and Theme Recognition of Music
                                Using Convolutional Neural Networks
                                          Shengzhou Yi, Xueting Wang, and Toshihiko Yamasaki
                                                                The University of Tokyo
                                                  {yishengzhou,xt_wang,yamasaki}@hal.t.u-tokyo.ac.jp

ABSTRACT
Our team, "YL-UTokyo", participated in the task: Emotion and
Theme Recognition in Music Using Jamendo. The goal of this task
is to recognize moods and themes conveyed by the audio tracks.
We tried several Convolutional Neural Networks with different
architectures or mechanisms. As a result, we find that a relatively
shallow network achieved better performance on this task.




1    INTRODUCTION
We participated in one of the tasks of MediaEval 2019: Emotion
and Theme Recognition in Music Using Jamendo [2]. This task
involves the prediction of moods and themes conveyed by a music
track. Moods are often defined as feelings conveyed by the music
(e.g., happy, sad, dark, melancholy), and themes are associated with
events or contexts where the music is suited to be played (e.g., epic,
melodic, christmas, love, film, space).
   In this task, three types of audio representations are provided:
traditional handcrafted audio features, mel-spectrograms, and raw
audio. We only used the mel-spectrograms as input to train our
models (Figure 1). We tried several Convolutional Neural Networks
(CNNs) to find a suitable model for this task. The simplest but
effective model we tried is the one provided by the organizers, which
consists of only five convolutional layers followed by one dense layer.
We also tried other models with more layers, but they did not always
achieve better results. In the end, the model that achieved the best
performance in our experiments is a shallow neural network with
only six convolutional layers and one dense layer.

Figure 1: Mel-spectrogram

Table 1: The architecture of the 6-layer model

    Mel-spectrogram        Input: 96x1280x1
    Conv 3x3x32
    MP (2, 2)              Output: 48x640x32
    Conv 3x3x64
    MP (2, 4)              Output: 24x160x64
    Conv 3x3x128
    MP (2, 2)              Output: 12x80x128
    Conv 3x3x256
    MP (2, 4)              Output: 6x20x256
    Conv 3x3x512
    MP (3, 5)              Output: 2x4x512
    Conv 3x3x256
    MP (2, 4)              Output: 1x1x256
    Dense
    Sigmoid                Output: 56x1

2    RELATED WORK
Image classification performance has improved greatly with the
advent of large datasets such as ImageNet [5] and CNN architectures
such as VGG [9], Inception [10], and ResNet [6]. There is also much
research on music emotion recognition and music classification
using CNN architectures [4, 7]. Even though statistical machine
learning methods (e.g., Support Vector Machines [8] and Random
Forests [1]) can still achieve good performance in some tasks, deep
learning, especially CNN-based methods, is more popular and achieves
better performance in most tasks. For large-scale datasets, deep
learning is much more practical than statistical machine learning.

3    APPROACH

3.1    Model
We concentrated on finding the most suitable CNN architecture
for the task. The baseline is a simple but effective model consisting
of five convolutional layers and a final dense layer. We also tried
other models with deeper architectures, using 6, 16, 18, or 25
convolutional layers. In particular, the shallowest of these models is a
fully convolutional neural network with ELU activations, six 3x3
convolutional layers, and 32, 64, 128, 256, 512, and 256 units in each
layer, respectively (Table 1).
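As a concrete illustration, the following is a minimal PyTorch sketch of the 6-layer architecture in Table 1. The 3x3 convolutions, ELU activations, pooling sizes, and the final dense layer with sigmoid outputs for the 56 tags follow the table; the padding and the absence of normalization or dropout are our assumptions, not details confirmed by the paper.

# Minimal sketch of the 6-layer model from Table 1 (assumed details noted above).
import torch
import torch.nn as nn

class ShallowCNN(nn.Module):
    def __init__(self, n_tags=56):
        super().__init__()
        channels = [1, 32, 64, 128, 256, 512, 256]
        pools = [(2, 2), (2, 4), (2, 2), (2, 4), (3, 5), (2, 4)]
        layers = []
        for i, pool in enumerate(pools):
            layers += [
                nn.Conv2d(channels[i], channels[i + 1], kernel_size=3, padding=1),
                nn.ELU(),
                nn.MaxPool2d(pool),       # pooling over (mel bands, frames)
            ]
        self.features = nn.Sequential(*layers)
        self.classifier = nn.Linear(channels[-1], n_tags)

    def forward(self, x):                 # x: (batch, 1, 96, 1280)
        h = self.features(x)              # -> (batch, 256, 1, 1)
        h = h.flatten(1)                  # -> (batch, 256)
        return torch.sigmoid(self.classifier(h))

model = ShallowCNN()
out = model(torch.randn(2, 1, 96, 1280))  # out.shape == (2, 56)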
   We also tried some models with the residual architecture [6].
The convolutional block consists of 1x1, 3x3, and 1x1 convolutional
layers in sequence. This is the architecture for blocks whose inputs
and outputs have the same size and number of units. For blocks that
map inputs to outputs with a smaller size and more units, the stride
of the 3x3 convolutional layer is two and the shortcut is a 1x1
convolutional layer for downsampling (Figure 2).

Figure 2: Residual architecture. (a) Without downsampling; (b) with downsampling.
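A minimal PyTorch sketch of such a residual block follows. The 1x1 -> 3x3 -> 1x1 layout, the stride-2 3x3 convolution, and the 1x1 stride-2 shortcut come from the description above; the channel widths and the activations inside the block are our assumptions.

# Sketch of the residual block of Section 3.1 / Figure 2 (assumptions noted above).
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, in_ch, mid_ch, out_ch, downsample=False):
        super().__init__()
        stride = 2 if downsample else 1
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=1),
            nn.ELU(),
            nn.Conv2d(mid_ch, mid_ch, kernel_size=3, stride=stride, padding=1),
            nn.ELU(),
            nn.Conv2d(mid_ch, out_ch, kernel_size=1),
        )
        # Identity shortcut when the shape is unchanged, 1x1 conv otherwise.
        if downsample or in_ch != out_ch:
            self.shortcut = nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=stride)
        else:
            self.shortcut = nn.Identity()
        self.act = nn.ELU()

    def forward(self, x):
        return self.act(self.body(x) + self.shortcut(x))

block = ResidualBlock(64, 64, 128, downsample=True)
y = block(torch.randn(2, 64, 24, 160))    # y.shape == (2, 128, 12, 80)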

3.2    Dataset
The dataset includes 17,982 music tracks with mood and theme
annotations. The split between training, validation, and test sets is
approximately 2:1:1. In total, there are 56 tags, and tracks may have
more than one tag. There are three types of audio representations:
traditional handcrafted audio features, mel-spectrograms, and raw
audio. The traditional handcrafted audio features are extracted with
Essentia [3] using the feature extractor for AcousticBrainz. These
features were used in the MediaEval genre recognition tasks. The
mel-spectrograms have 96 mel bands. The raw audio is provided in
MP3 format with a 44.1 kHz sampling rate.
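Although we used the pre-computed mel-spectrograms provided by the task, the input representation itself can be illustrated with a short librosa sketch; the file name, FFT size, and hop length below are assumptions and need not match the organizers' extraction settings.

# Illustrative only: computing a 96-band log mel-spectrogram with librosa.
import librosa

y, sr = librosa.load("track.mp3", sr=44100, mono=True)   # hypothetical input file
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                     hop_length=512, n_mels=96)
log_mel = librosa.power_to_db(mel)                        # shape: (96, n_frames)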
3.3    Experiment
We only used the pre-computed mel-spectrograms (Figure 1) as
inputs, and we used different data augmentation methods for the
training, validation, and test datasets. Let T be the length of an input
section in frames. For the training dataset, we randomly cropped a
T-frame section from each audio track in every epoch. For the
validation and test datasets, we cropped 10 and 20 T-frame sections,
respectively, from each audio track at regular intervals. We averaged
the predictions over all sections of each audio track. The length of
the input section T is 1,280 frames. We trained our network using
Adam with a batch size of 64 and a learning rate of 0.001.
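To make the cropping scheme concrete, here is a small sketch (not the authors' code) of random T-frame crops for training and evenly spaced crops with prediction averaging for evaluation; the spectrogram layout (mel bands x frames) and the model interface are assumptions.

# Sketch of the cropping and prediction-averaging scheme of Section 3.3.
import torch

T = 1280  # input section length in frames

def random_crop(spec, t=T):
    """spec: (96, n_frames) mel-spectrogram; returns one random t-frame crop."""
    start = torch.randint(0, spec.shape[1] - t + 1, (1,)).item()
    return spec[:, start:start + t]

def predict_track(model, spec, n_crops=20, t=T):
    """Average the model's predictions over n_crops sections at regular intervals."""
    starts = torch.linspace(0, spec.shape[1] - t, n_crops).long()
    batch = torch.stack([spec[:, s:s + t] for s in starts]).unsqueeze(1)  # (n, 1, 96, T)
    with torch.no_grad():
        return model(batch).mean(dim=0)    # averaged tag probabilities, shape (56,)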

4    RESULTS AND ANALYSIS

Table 2: Experiment results

    Conv Layers     Residual    PR-AUC-macro    ROC-AUC-macro
    5 (baseline)    No          0.1161          0.7475
    6               No          0.1256          0.7532
    16              Yes         0.1125          0.7393
    18              Yes         0.1135          0.7460
    25              No          0.1009          0.7319

Table 3: Top-5 and bottom-5 tag-wise AUCs

    Tag           Rank    PR-AUC    ROC-AUC
    summer        1       0.4698    0.9033
    deep          2       0.4435    0.9137
    corporate     3       0.4017    0.8849
    epic          4       0.3886    0.8384
    film          5       0.3606    0.7709
    retro         52      0.0213    0.7943
    holiday       53      0.0186    0.6856
    cool          54      0.0185    0.6763
    sexy          55      0.0145    0.7327
    travel        56      0.0117    0.5990

We compared the performance of the models with different
architectures or mechanisms in Table 2. Surprisingly, the model that
achieved the best performance in our experiments was a relatively
shallow model consisting of only six convolutional layers, whose
architecture is described in detail in Section 3.1. Moreover, the top-5
and bottom-5 tag-wise AUCs of the 6-layer model are shown in
Table 3. The performance achieved by the best 6-layer model ranked
fifth among all 29 submissions.
   The network with 25 convolutional layers consists of one 7x7 and
24 3x3 convolutional layers, with five max pooling layers for
downsampling. It is commonly believed that deeper models achieve
better performance in image classification tasks. However, the models
with deep architectures did not always achieve better performance in
this task. We also tried the residual architecture, which is commonly
used to improve the performance of neural networks. However, the
models with residual architecture did not show an advantage in
performance.
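For reference, the two reported metrics can be computed with scikit-learn roughly as follows; this is a sketch with dummy data, and the official evaluation script of the task may differ in detail.

# Sketch of macro-averaged PR-AUC and ROC-AUC for multi-label tag prediction.
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

y_true = np.random.randint(0, 2, size=(100, 56))   # dummy ground-truth tag matrix
y_score = np.random.rand(100, 56)                   # dummy sigmoid predictions

pr_auc_macro = average_precision_score(y_true, y_score, average="macro")
roc_auc_macro = roc_auc_score(y_true, y_score, average="macro")
print(pr_auc_macro, roc_auc_macro)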
5    DISCUSSION AND OUTLOOK
The number of samples (18K) in this dataset is relatively small
compared with some image datasets (e.g., CIFAR-10: 60K, MS-COCO:
200K, ImageNet: 517K), and the audio clips (>30s) are relatively long
compared with some sound datasets (e.g., UrbanSound8K: <4s, ESC-50:
5s, AudioSet: 10s). In our experience, the generalization ability of the
models is especially important in this task. Therefore, it is reasonable
that a relatively shallow VGG-based network with strong
generalization ability can achieve better performance.
   In the future, we plan to use all of the audio representations,
because we think it is interesting to treat audio recognition as a
multimodal task. The traditional handcrafted audio features and the
raw audio inputs may bring a great improvement in the performance
of our model.

6    CONCLUSION
In our experiments, we applied several convolutional neural networks
to recognize the emotions and themes of music. A shallow VGG-based
network consisting of six convolutional layers achieved the best
performance, with a PR-AUC-macro of 0.1256 and a ROC-AUC-macro
of 0.7532. We think that the generalization ability of the models is
very important in this task. Our source code is available at
https://github.com/YiShengzhou12330379/Emotion-and-Theme-Recognition-in-Music-Using-Jamendo.


REFERENCES
 [1] Miguel Angel Ferrer Ballester. 2018. A Novel Approach to String
     Instrument Recognition. In Proceedings of Image and Signal Processing:
     8th International Conference, Vol. 10884. 165–175.
 [2] Dmitry Bogdanov, Alastair Porter, Philip Tovstogan, and Minz Won.
     2019. MediaEval 2019: Emotion and Theme Recognition in Music Using
     Jamendo. In MediaEval Benchmark Workshop.
 [3] Dmitry Bogdanov, Nicolas Wack, Emilia Gómez Gutiérrez, Sankalp
     Gulati, Herrera Boyer, and others. 2013. Essentia: An audio analysis
     library for music information retrieval. In Proceedings of the Interna-
     tional Society for Music Information Retrieval. 493–498.
 [4] Keunwoo Choi, George Fazekas, and Mark Sandler. 2016. Automatic
     tagging using deep convolutional neural networks. arXiv preprint
     arXiv:1606.00298 (2016).
 [5] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei.
     2009. Imagenet: A large-scale hierarchical image database. In Proceed-
     ings of the IEEE Conference on Computer Vision and Pattern Recognition.
     248–255.
 [6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep
     residual learning for image recognition. In Proceedings of the IEEE
     Conference on Computer Vision and Pattern Recognition. 770–778.
 [7] Shawn Hershey, Sourish Chaudhuri, Daniel PW Ellis, Jort F Gemmeke,
     Aren Jansen, R Channing Moore, Manoj Plakal, Devin Platt, Rif A
     Saurous, Bryan Seybold, and others. 2017. CNN architectures for
     large-scale audio classification. In Proceedings of the IEEE International
     Conference on Acoustics, Speech and Signal Processing. 131–135.
 [8] Renato Panda, Ricardo Malheiro, and Rui Pedro Paiva. 2018. Musical
     Texture and Expressivity Features for Music Emotion Recognition. In
     Proceedings of the International Society for Music Information Retrieval.
     383–391.
 [9] Karen Simonyan and Andrew Zisserman. 2014. Very deep convo-
     lutional networks for large-scale image recognition. arXiv preprint
     arXiv:1409.1556 (2014).
[10] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed,
     Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew
     Rabinovich. 2015. Going deeper with convolutions. In Proceedings of
     the IEEE Conference on Computer Vision and Pattern Recognition. 1–9.