                               Emotion and Theme Recognition of Music
                                Using Convolutional Neural Networks
                                          Shengzhou Yi, Xueting Wang, and Toshihiko Yamasaki
                                                                The University of Tokyo
                                                  {yishengzhou,xt_wang,yamasaki}@hal.t.u-tokyo.ac.jp

ABSTRACT
Our team, "YL-UTokyo", participated in the task: Emotion and
Theme Recognition in Music Using Jamendo. The goal of this task
is to recognize moods and themes conveyed by the audio tracks.
We tried several Convolutional Neural Networks with different
architectures or mechanisms. As a result, we find that a relatively
shallow network achieved better performance on this task.




1    INTRODUCTION
We participated in one of the tasks of MediaEval 2019: Emotion
and Theme Recognition in Music Using Jamendo [2]. This task
involves the prediction of moods and themes conveyed by a music
track. Moods are often defined as feelings conveyed by the music
(e.g., happy, sad, dark, melancholy), and themes are associated with
events or contexts where the music is suited to be played (e.g., epic,
melodic, christmas, love, film, space).
   In this task, three types of audio representations are provided:
traditional handcrafted audio features, mel-spectrograms, and raw
audio. We only used the mel-spectrograms as input to train our
models (Figure 1). We tried several Convolutional Neural Networks
(CNNs) to find a suitable model for this task. The simplest but
effective model we tried is the one provided by the organizers, which
consists of only five convolutional layers followed by one dense layer.
We also tried other models with more layers, but they did not always
achieve better results. In the end, the model that achieved the best
performance in our experiments is a shallow neural network with
only six convolutional layers and one dense layer.

Figure 1: Mel-spectrogram

Table 1: The architecture of the 6-layer model

    Mel-spectrogram        Input: 96x1280x1
    Conv 3x3x32
    MP (2, 2)              Output: 48x640x32
    Conv 3x3x64
    MP (2, 4)              Output: 24x160x64
    Conv 3x3x128
    MP (2, 2)              Output: 12x80x128
    Conv 3x3x256
    MP (2, 4)              Output: 6x20x256
    Conv 3x3x512
    MP (3, 5)              Output: 2x4x512
    Conv 3x3x256
    MP (2, 4)              Output: 1x1x256
    Dense
    Sigmoid                Output: 56x1

2    RELATED WORK
Image classification performance has improved greatly with the
advent of large datasets such as ImageNet [5] and CNN architectures
such as VGG [9], Inception [10], and ResNet [6]. There is also much
research on music emotion recognition and music classification
using CNN architectures [4, 7]. Even though statistical machine
learning methods (e.g., Support Vector Machines [8] and Random
Forests [1]) can still achieve good performance in some tasks, deep
learning, especially CNN-based methods, is more popular and achieves
better performance in most tasks. For large-scale datasets, deep
learning is much more practical than statistical machine learning.

3    APPROACH

3.1    Model
We concentrated on finding the most suitable CNN architecture
for the task. The baseline is a simple but effective model consisting
of five convolutional layers and a final dense layer. We also tried
other models with deeper architectures, using 6, 16, 18, or 25
convolutional layers. In particular, the shallowest of these models is a
fully convolutional neural network with ELU activations, six 3x3
convolutional layers, and 32, 64, 128, 256, 512, and 256 units in each
layer, respectively (Table 1).
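As a concrete illustration, the following is a minimal PyTorch sketch of the 6-layer architecture in Table 1. The 3x3 convolutions, ELU activations, pooling sizes, and the final dense layer with sigmoid outputs for the 56 tags follow the table; the padding and the absence of normalization or dropout are our assumptions, not details confirmed by the paper.

# Minimal sketch of the 6-layer model from Table 1 (assumed details noted above).
import torch
import torch.nn as nn

class ShallowCNN(nn.Module):
    def __init__(self, n_tags=56):
        super().__init__()
        channels = [1, 32, 64, 128, 256, 512, 256]
        pools = [(2, 2), (2, 4), (2, 2), (2, 4), (3, 5), (2, 4)]
        layers = []
        for i, pool in enumerate(pools):
            layers += [
                nn.Conv2d(channels[i], channels[i + 1], kernel_size=3, padding=1),
                nn.ELU(),
                nn.MaxPool2d(pool),       # pooling over (mel bands, frames)
            ]
        self.features = nn.Sequential(*layers)
        self.classifier = nn.Linear(channels[-1], n_tags)

    def forward(self, x):                 # x: (batch, 1, 96, 1280)
        h = self.features(x)              # -> (batch, 256, 1, 1)
        h = h.flatten(1)                  # -> (batch, 256)
        return torch.sigmoid(self.classifier(h))

model = ShallowCNN()
out = model(torch.randn(2, 1, 96, 1280))  # out.shape == (2, 56)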
   We also tried some models with the residual architecture [6].
The convolutional block consists of 1x1, 3x3, and 1x1 convolutional
layers in sequence. This is the architecture for blocks whose inputs
and outputs have the same size and number of units. For blocks that
map inputs to outputs with a smaller size and more units, the stride
of the 3x3 convolutional layer is two and the shortcut is a 1x1
convolutional layer for downsampling (Figure 2).

Figure 2: Residual architecture. (a) Without downsampling; (b) with downsampling.
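A minimal PyTorch sketch of such a residual block follows. The 1x1 -> 3x3 -> 1x1 layout, the stride-2 3x3 convolution, and the 1x1 stride-2 shortcut come from the description above; the channel widths and the activations inside the block are our assumptions.

# Sketch of the residual block of Section 3.1 / Figure 2 (assumptions noted above).
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, in_ch, mid_ch, out_ch, downsample=False):
        super().__init__()
        stride = 2 if downsample else 1
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=1),
            nn.ELU(),
            nn.Conv2d(mid_ch, mid_ch, kernel_size=3, stride=stride, padding=1),
            nn.ELU(),
            nn.Conv2d(mid_ch, out_ch, kernel_size=1),
        )
        # Identity shortcut when the shape is unchanged, 1x1 conv otherwise.
        if downsample or in_ch != out_ch:
            self.shortcut = nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=stride)
        else:
            self.shortcut = nn.Identity()
        self.act = nn.ELU()

    def forward(self, x):
        return self.act(self.body(x) + self.shortcut(x))

block = ResidualBlock(64, 64, 128, downsample=True)
y = block(torch.randn(2, 64, 24, 160))    # y.shape == (2, 128, 12, 80)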

3.2    Dataset
The dataset includes 17,982 music tracks with mood and theme
annotations. The split between training, validation, and test sets is
approximately 2:1:1. In total, there are 56 tags, and tracks may have
more than one tag. There are three types of audio representations:
traditional handcrafted audio features, mel-spectrograms, and raw
audio. The traditional handcrafted audio features are extracted with
Essentia [3] using the feature extractor for AcousticBrainz. These
features were used in the MediaEval genre recognition tasks. The
mel-spectrograms have 96 mel bands. The raw audio is provided in
MP3 format with a 44.1 kHz sampling rate.
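Although we used the pre-computed mel-spectrograms provided by the task, the input representation itself can be illustrated with a short librosa sketch; the file name, FFT size, and hop length below are assumptions and need not match the organizers' extraction settings.

# Illustrative only: computing a 96-band log mel-spectrogram with librosa.
import librosa

y, sr = librosa.load("track.mp3", sr=44100, mono=True)   # hypothetical input file
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                     hop_length=512, n_mels=96)
log_mel = librosa.power_to_db(mel)                        # shape: (96, n_frames)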
3.3    Experiment
We only used the pre-computed mel-spectrograms (Figure 1) as
inputs, and we used different data augmentation methods for the
training, validation, and test datasets. Let T be the length of an input
section in frames. For the training dataset, we randomly cropped a
T-frame section from each audio track in every epoch. For the
validation and test datasets, we cropped 10 and 20 T-frame sections,
respectively, from each audio track at regular intervals. We averaged
the predictions over all sections of each audio track. The length of
the input section T is 1,280 frames. We trained our network using
Adam with a batch size of 64 and a learning rate of 0.001.
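To make the cropping scheme concrete, here is a small sketch (not the authors' code) of random T-frame crops for training and evenly spaced crops with prediction averaging for evaluation; the spectrogram layout (mel bands x frames) and the model interface are assumptions.

# Sketch of the cropping and prediction-averaging scheme of Section 3.3.
import torch

T = 1280  # input section length in frames

def random_crop(spec, t=T):
    """spec: (96, n_frames) mel-spectrogram; returns one random t-frame crop."""
    start = torch.randint(0, spec.shape[1] - t + 1, (1,)).item()
    return spec[:, start:start + t]

def predict_track(model, spec, n_crops=20, t=T):
    """Average the model's predictions over n_crops sections at regular intervals."""
    starts = torch.linspace(0, spec.shape[1] - t, n_crops).long()
    batch = torch.stack([spec[:, s:s + t] for s in starts]).unsqueeze(1)  # (n, 1, 96, T)
    with torch.no_grad():
        return model(batch).mean(dim=0)    # averaged tag probabilities, shape (56,)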

4    RESULTS AND ANALYSIS

Table 2: Experiment results

    Conv Layers     Residual    PR-AUC-macro    ROC-AUC-macro
    5 (baseline)    No          0.1161          0.7475
    6               No          0.1256          0.7532
    16              Yes         0.1125          0.7393
    18              Yes         0.1135          0.7460
    25              No          0.1009          0.7319

Table 3: Top-5 and bottom-5 tag-wise AUCs

    Tag           Rank    PR-AUC    ROC-AUC
    summer        1       0.4698    0.9033
    deep          2       0.4435    0.9137
    corporate     3       0.4017    0.8849
    epic          4       0.3886    0.8384
    film          5       0.3606    0.7709
    retro         52      0.0213    0.7943
    holiday       53      0.0186    0.6856
    cool          54      0.0185    0.6763
    sexy          55      0.0145    0.7327
    travel        56      0.0117    0.5990

We compared the performance of the models with different
architectures or mechanisms in Table 2. Surprisingly, the model that
achieved the best performance in our experiments was a relatively
shallow model consisting of only six convolutional layers, whose
architecture is described in detail in Section 3.1. Moreover, the top-5
and bottom-5 tag-wise AUCs of the 6-layer model are shown in
Table 3. The performance achieved by the best 6-layer model ranked
fifth among all 29 submissions.
   The network with 25 convolutional layers consists of one 7x7 and
24 3x3 convolutional layers, with five max pooling layers for
downsampling. It is commonly believed that deeper models achieve
better performance in image classification tasks. However, the models
with deep architectures did not always achieve better performance in
this task. We also tried the residual architecture, which is commonly
used to improve the performance of neural networks. However, the
models with residual architecture did not show an advantage in
performance.
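For reference, the two reported metrics can be computed with scikit-learn roughly as follows; this is a sketch with dummy data, and the official evaluation script of the task may differ in detail.

# Sketch of macro-averaged PR-AUC and ROC-AUC for multi-label tag prediction.
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

y_true = np.random.randint(0, 2, size=(100, 56))   # dummy ground-truth tag matrix
y_score = np.random.rand(100, 56)                   # dummy sigmoid predictions

pr_auc_macro = average_precision_score(y_true, y_score, average="macro")
roc_auc_macro = roc_auc_score(y_true, y_score, average="macro")
print(pr_auc_macro, roc_auc_macro)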
5    DISCUSSION AND OUTLOOK
The number of samples (18K) in this dataset is relatively small
compared with some image datasets (e.g., CIFAR-10: 60K, MS-COCO:
200K, ImageNet: 517K), and the audio clips (>30s) are relatively long
compared with some sound datasets (e.g., UrbanSound8K: <4s, ESC-50:
5s, AudioSet: 10s). In our experience, the generalization ability of the
models is especially important in this task. Therefore, it is reasonable
that a relatively shallow VGG-based network with strong
generalization ability can achieve better performance.
   In the future, we plan to use all of the audio representations,
because we think it is interesting to treat audio recognition as a
multimodal task. The traditional handcrafted audio features and the
raw audio inputs may bring a great improvement in the performance
of our model.

6    CONCLUSION
In our experiments, we applied several convolutional neural networks
to recognize the emotions and themes of music. A shallow VGG-based
network consisting of six convolutional layers achieved the best
performance, with a PR-AUC-macro of 0.1256 and a ROC-AUC-macro
of 0.7532. We think that the generalization ability of the models is
very important in this task. Our source code is available at
https://github.com/YiShengzhou12330379/Emotion-and-Theme-Recognition-in-Music-Using-Jamendo.


REFERENCES
 [1] Miguel Angel Ferrer Ballester. 2018. A Novel Approach to String
     Instrument Recognition. In Proceedings of Image and Signal Processing:
     8th International Conference, Vol. 10884. 165–175.
 [2] Dmitry Bogdanov, Alastair Porter, Philip Tovstogan, and Minz Won.
     2019. MediaEval 2019: Emotion and Theme Recognition in Music Using
     Jamendo. In MediaEval Benchmark Workshop.
 [3] Dmitry Bogdanov, Nicolas Wack, Emilia Gómez Gutiérrez, Sankalp
     Gulati, Herrera Boyer, and others. 2013. Essentia: An audio analysis
     library for music information retrieval. In Proceedings of the Interna-
     tional Society for Music Information Retrieval. 493–498.
 [4] Keunwoo Choi, George Fazekas, and Mark Sandler. 2016. Automatic
     tagging using deep convolutional neural networks. arXiv preprint
     arXiv:1606.00298 (2016).
 [5] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei.
     2009. Imagenet: A large-scale hierarchical image database. In Proceed-
     ings of the IEEE Conference on Computer Vision and Pattern Recognition.
     248–255.
 [6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep
     residual learning for image recognition. In Proceedings of the IEEE
     Conference on Computer Vision and Pattern Recognition. 770–778.
 [7] Shawn Hershey, Sourish Chaudhuri, Daniel PW Ellis, Jort F Gemmeke,
     Aren Jansen, R Channing Moore, Manoj Plakal, Devin Platt, Rif A
     Saurous, Bryan Seybold, and others. 2017. CNN architectures for
     large-scale audio classification. In Proceedings of the IEEE International
     Conference on Acoustics, Speech and Signal Processing. 131–135.
 [8] Renato Panda, Ricardo Malheiro, and Rui Pedro Paiva. 2018. Musical
     Texture and Expressivity Features for Music Emotion Recognition. In
     Proceedings of the International Society for Music Information Retrieval.
     383–391.
 [9] Karen Simonyan and Andrew Zisserman. 2014. Very deep convo-
     lutional networks for large-scale image recognition. arXiv preprint
     arXiv:1409.1556 (2014).
[10] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed,
     Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew
     Rabinovich. 2015. Going deeper with convolutions. In Proceedings of
     the IEEE Conference on Computer Vision and Pattern Recognition. 1–9.