=Paper=
{{Paper
|id=Vol-2670/MediaEval_19_paper_13
|storemode=property
|title=Emotion and Theme Recognition of Music Using Convolutional Neural Networks
|pdfUrl=https://ceur-ws.org/Vol-2670/MediaEval_19_paper_13.pdf
|volume=Vol-2670
|authors=Shengzhou Yi,Xueting Wang,Toshihiko Yamasaki
|dblpUrl=https://dblp.org/rec/conf/mediaeval/YiWY19
}}
==Emotion and Theme Recognition of Music Using Convolutional Neural Networks==
Shengzhou Yi, Xueting Wang, and Toshihiko Yamasaki
The University of Tokyo
{yishengzhou,xt_wang,yamasaki}@hal.t.u-tokyo.ac.jp

Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'19, 27-29 October 2019, Sophia Antipolis, France.

ABSTRACT

Our team, "YL-UTokyo", participated in the task Emotion and Theme Recognition in Music Using Jamendo. The goal of this task is to recognize the moods and themes conveyed by audio tracks. We tried several convolutional neural networks with different architectures and mechanisms. We found that a relatively shallow network achieved the best performance on this task.

1 INTRODUCTION

We participated in one of the tasks of MediaEval 2019: Emotion and Theme Recognition in Music Using Jamendo [2]. This task involves the prediction of the moods and themes conveyed by a music track. Moods are often defined as feelings conveyed by the music (e.g. happy, sad, dark, melancholy), and themes are associated with events or contexts where the music is suited to be played (e.g. epic, melodic, christmas, love, film, space).

The task provides three types of audio representations: traditional handcrafted audio features, mel-spectrograms, and raw audio. We used only the mel-spectrograms as input to train our models (Figure 1). We tried several convolutional neural networks (CNNs) to find a suitable model for this task. The simplest yet effective model we tried is the one provided by the organizers; it consists of only five convolutional layers followed by one dense layer. We also tried models with more layers, but they did not always achieve better results. The model that achieved the best performance in our experiments is a shallow network with only six convolutional layers and one dense layer.

[Figure 1: Mel-spectrogram]

2 RELATED WORK

Image classification performance has improved greatly with the advent of large datasets such as ImageNet [5] and of CNN architectures such as VGG [9], Inception [10], and ResNet [6]. There is also much research on music emotion recognition and music classification using CNN architectures [4, 7]. Although statistical machine learning methods (e.g. Support Vector Machines [8] and Random Forests [1]) can still achieve good performance on some tasks, deep learning, and CNN-based methods in particular, is more popular and achieves better performance on most tasks. For large-scale datasets, deep learning is also much more practicable than statistical machine learning.

3 APPROACH

3.1 Model

We concentrated on finding the most suitable CNN architecture for the task. The baseline is a simple but effective model consisting of five convolutional layers and a final dense layer. We also tried deeper models, with 6, 16, 18, or 25 convolutional layers. The shallowest model we considered is a fully convolutional neural network with ELU activations, six 3x3 convolutional layers, and 32, 64, 128, 256, 512, and 256 units in the respective layers (Table 1); a sketch of this architecture is given after the table. We also tried some models with a residual architecture [6].

Table 1: The architecture of the 6-layer model

  Layer                        Output
  Mel-spectrogram (input)      96x1280x1
  Conv 3x3x32,  MP (2, 2)      48x640x32
  Conv 3x3x64,  MP (2, 4)      24x160x64
  Conv 3x3x128, MP (2, 2)      12x80x128
  Conv 3x3x256, MP (2, 4)      6x20x256
  Conv 3x3x512, MP (3, 5)      2x4x512
  Conv 3x3x256, MP (2, 4)      1x1x256
  Dense, Sigmoid               56x1
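The following is a minimal Keras sketch of the 6-layer architecture in Table 1, given for illustration only; it is not the authors' code. The layer sizes, ELU activations, pooling factors, sigmoid output, and the Adam optimizer with a learning rate of 0.001 follow the paper, while the 'same' padding, the flattening step, and the binary cross-entropy loss are assumptions.

```python
# Sketch of the 6-layer model from Table 1 (reconstruction, not the authors' code).
import tensorflow as tf
from tensorflow.keras import layers, models

def build_six_layer_cnn(n_tags=56):
    # Mel-spectrogram input: 96 mel bands x 1280 frames x 1 channel.
    inputs = layers.Input(shape=(96, 1280, 1))
    x = inputs
    # (filters, pool size) pairs follow Table 1; 'same' padding is assumed.
    for filters, pool in [(32, (2, 2)), (64, (2, 4)), (128, (2, 2)),
                          (256, (2, 4)), (512, (3, 5)), (256, (2, 4))]:
        x = layers.Conv2D(filters, (3, 3), padding='same', activation='elu')(x)
        x = layers.MaxPooling2D(pool_size=pool)(x)
    x = layers.Flatten()                      \
        (x)                                   # 1x1x256 -> 256
    outputs = layers.Dense(n_tags, activation='sigmoid')(x)  # multi-label tag scores
    return models.Model(inputs, outputs)

model = build_six_layer_cnn()
# Adam with learning rate 0.001 as stated in Section 3.3; the loss is assumed.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss='binary_crossentropy')
```

With a batch size of 64 (Section 3.3), such a model would be trained on randomly cropped 1,280-frame sections.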
The convolutional block consists of 1x1, 3x3, and 1x1 convolutional layers in sequence. This is the architecture used when the input and output of a block have the same size and number of units. For a block that maps its input to an output with a smaller spatial size and more units, the stride of the 3x3 convolutional layer is two and the shortcut is a 1x1 convolutional layer for downsampling (Figure 2). A sketch of both block variants is given below.

[Figure 2: Residual architecture. (a) Block without downsampling: Conv 1x1, Conv 3x3, Conv 1x1 with an identity shortcut, followed by ELU. (b) Block with downsampling: Conv 1x1, Conv 3x3 (stride = 2), Conv 1x1 with a Conv 1x1 (stride = 2) shortcut, followed by ELU.]
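A minimal Keras sketch of the two residual block variants in Figure 2 follows. The 1x1/3x3/1x1 ordering, the stride-2 3x3 convolution, the 1x1 downsampling shortcut, and the ELU activation follow the paper; the bottleneck channel reduction and the exact placement of activations are assumptions.

```python
# Sketch of the residual blocks in Figure 2 (reconstruction, not the authors' code).
from tensorflow.keras import layers

def residual_block(x, units, downsample=False):
    """1x1 -> 3x3 -> 1x1 block with an ELU after the addition."""
    stride = 2 if downsample else 1
    # Bottleneck factor of 4 is an assumption; the paper does not specify it.
    y = layers.Conv2D(units // 4, (1, 1), activation='elu')(x)
    y = layers.Conv2D(units // 4, (3, 3), strides=stride,
                      padding='same', activation='elu')(y)
    y = layers.Conv2D(units, (1, 1))(y)
    if downsample:
        # Shortcut: 1x1 convolution with stride 2 matches the smaller size
        # and larger unit count (Figure 2b).
        shortcut = layers.Conv2D(units, (1, 1), strides=2)(x)
    else:
        # Identity shortcut: input and output share size and unit count (Figure 2a).
        shortcut = x
    return layers.Activation('elu')(layers.Add()([y, shortcut]))
```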
3.2 Dataset

The dataset includes 17,982 music tracks with mood and theme annotations. The split for training, validation, and test is about 2 : 1 : 1. In total, there are 56 tags, and a track can have more than one tag. Three types of audio representations are provided: traditional handcrafted audio features, mel-spectrograms, and raw audio. The traditional handcrafted audio features are extracted with Essentia [3] using the feature extractor for AcousticBrainz; these features were used in the MediaEval genre recognition tasks. The mel-spectrograms have 96 mel bands. The raw audio is in MP3 format with a 44.1 kHz sampling rate.

3.3 Experiment

We used only the pre-computed mel-spectrograms (Figure 1) as inputs, and we used different data augmentation methods for the training, validation, and test sets. Let T be the length of an input section in frames. For the training set, we randomly cropped a T-frame section from each audio track in every epoch. For the validation and test sets, we cropped 10 and 20 T-frame sections, respectively, from each audio track at regular intervals and averaged the predictions over all sections of a track. The length of the input section T is 1,280 frames. We trained our network using Adam with a batch size of 64 and a learning rate of 0.001. A sketch of this cropping and averaging scheme is given below.
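The following sketch illustrates the cropping and prediction-averaging scheme of Section 3.3, assuming mel-spectrograms stored as NumPy arrays of shape (96, n_frames). Function and variable names are hypothetical, not taken from the authors' code.

```python
# Sketch of the section cropping and averaging described in Section 3.3.
import numpy as np

T = 1280  # length of an input section in frames

def random_crop(mel, t=T):
    """Training: randomly crop one T-frame section per track in every epoch."""
    start = np.random.randint(0, mel.shape[1] - t + 1)
    return mel[:, start:start + t]

def regular_crops(mel, n_sections, t=T):
    """Validation/test: crop n_sections T-frame sections at regular intervals."""
    starts = np.linspace(0, mel.shape[1] - t, n_sections).astype(int)
    return np.stack([mel[:, s:s + t] for s in starts])

def predict_track(model, mel, n_sections=20):
    """Average the model's predictions over all sections of one audio track
    (10 sections for validation, 20 for test in the paper)."""
    sections = regular_crops(mel, n_sections)[..., np.newaxis]  # add channel axis
    return model.predict(sections).mean(axis=0)                 # averaged 56 tag scores
```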
4 RESULTS AND ANALYSIS

We compare the performance of the models with different architectures and mechanisms in Table 2. Surprisingly, the model that achieved the best performance in our experiments was a relatively shallow model consisting of only six convolutional layers, whose architecture is introduced in detail in Section 3.1. The top-5 and bottom-5 tag-wise AUCs of this 6-layer model are shown in Table 3. The performance achieved by the best 6-layer model placed fifth among all 29 submissions.

The network with 25 convolutional layers consists of one 7x7 convolutional layer, 24 3x3 convolutional layers, and five max-pooling layers for downsampling. It is commonly believed that deeper models achieve better performance in image classification tasks. However, the models with deeper architectures did not always achieve better performance on this task. We also tried the residual architecture, which is commonly used to improve the performance of neural networks; however, the models with residual architecture did not show an advantage in performance either.

Table 2: Experiment results

  Conv layers    Residual    PR-AUC-macro    ROC-AUC-macro
  5 (baseline)   No          0.1161          0.7475
  6              No          0.1256          0.7532
  16             Yes         0.1125          0.7393
  18             Yes         0.1135          0.7460
  25             No          0.1009          0.7319

Table 3: Top-5 and bottom-5 tag-wise AUCs

  Tag          Rank    PR-AUC    ROC-AUC
  summer       1       0.4698    0.9033
  deep         2       0.4435    0.9137
  corporate    3       0.4017    0.8849
  epic         4       0.3886    0.8384
  film         5       0.3606    0.7709
  retro        52      0.0213    0.7943
  holiday      53      0.0186    0.6856
  cool         54      0.0185    0.6763
  sexy         55      0.0145    0.7327
  travel       56      0.0117    0.5990

5 DISCUSSION AND OUTLOOK

The number of samples (18K) in the dataset is relatively small compared with some image datasets (e.g. CIFAR-10: 60K, MS-COCO: 200K, ImageNet: 517K), and the audio tracks (>30 s) are relatively long compared with some sound datasets (e.g. UrbanSound8K: <4 s, ESC-50: 5 s, AudioSet: 10 s). In our experience, the generalization ability of models is especially important in this task. It is therefore reasonable that a relatively shallow VGG-based network with strong generalization ability achieves better performance.

In the future, we plan to use all of the audio representations, because we think it is interesting to treat audio recognition as a multimodal task. The traditional handcrafted audio features and the raw audio may bring a great improvement in the performance of our model.

6 CONCLUSION

In our experiments, we applied several convolutional neural networks to recognize the emotions and themes of music. A shallow VGG-based network that consists of six convolutional layers achieved the best performance, with a PR-AUC-macro of 0.1256 and a ROC-AUC-macro of 0.7532. We think that the generalization ability of the models is very important in this task. Our source code is available at https://github.com/YiShengzhou12330379/Emotion-and-Theme-Recognition-in-Music-Using-Jamendo.

REFERENCES

[1] Miguel Angel Ferrer Ballester. 2018. A Novel Approach to String Instrument Recognition. In Proceedings of Image and Signal Processing: 8th International Conference, Vol. 10884. 165–175.
[2] Dmitry Bogdanov, Alastair Porter, Philip Tovstogan, and Minz Won. 2019. MediaEval 2019: Emotion and Theme Recognition in Music Using Jamendo. In MediaEval Benchmark Workshop.
[3] Dmitry Bogdanov, Nicolas Wack, Emilia Gómez Gutiérrez, Sankalp Gulati, Herrera Boyer, and others. 2013. Essentia: An audio analysis library for music information retrieval. In Proceedings of the International Society for Music Information Retrieval. 493–498.
[4] Keunwoo Choi, George Fazekas, and Mark Sandler. 2016. Automatic tagging using deep convolutional neural networks. arXiv preprint arXiv:1606.00298 (2016).
[5] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 248–255.
[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[7] Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, and others. 2017. CNN architectures for large-scale audio classification. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. 131–135.
[8] Renato Panda, Ricardo Malheiro, and Rui Pedro Paiva. 2018. Musical Texture and Expressivity Features for Music Emotion Recognition. In Proceedings of the International Society for Music Information Retrieval. 383–391.
[9] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[10] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1–9.