=Paper=
{{Paper
|id=Vol-1905/recsys2017_poster18
|storemode=property
|title=Music Emotion Recognition via End-to-End Multimodal Neural Networks
|pdfUrl=https://ceur-ws.org/Vol-1905/recsys2017_poster18.pdf
|volume=Vol-1905
|authors=Byungsoo Jeon,Chanju Kim,Adrian Kim,Dongwon Kim,Jangyeon Park,Jungwoo Ha
|dblpUrl=https://dblp.org/rec/conf/recsys/JeonKKKPH17
}}
==Music Emotion Recognition via End-to-End Multimodal Neural Networks==
Byungsoo Jeon (Carnegie Mellon University, Pittsburgh, PA, USA; jbsimdicd@gmail.com)*, Chanju Kim, Adrian Kim, Dongwon Kim, Jangyeon Park, and Jung-Woo Ha† (Clova, NAVER Corp., Seongnam, Korea; chanju.kim@navercorp.com, adrian.kim@navercorp.com, dongwon.kim@navercorp.com, jangyeon.park@navercorp.com, jungwoo.ha@navercorp.com)

* This work was performed in NAVER Corp. † Corresponding author.

ABSTRACT

Music emotion recognition (MER) is a key issue in user context-aware recommendation. Many existing methods require hand-crafted features on audio and lyrics. Here we propose a new end-to-end method for recognizing the emotions of tracks from their acoustic signals and lyrics via multimodal deep neural networks. We evaluate our method on about 7,000 K-pop tracks labeled as positive or negative emotion. The proposed method is compared to end-to-end unimodal models using audio signals or lyrics only. The experimental results show that our multimodal model achieves the best accuracy of 80%, and we discuss the reasons for these results.

KEYWORDS: Music Emotion Recognition, Music Recommendation, Multimodal Neural Network

Figure 1: Model structure for multimodal music emotion classification from acoustic signals and lyrics of tracks

ACM Reference format: Byungsoo Jeon, Chanju Kim, Adrian Kim, Dongwon Kim, Jangyeon Park, and Jung-Woo Ha. 2017. Music Emotion Recognition via End-to-End Multimodal Neural Networks. RecSys '17 Poster Proceedings, Como, Italy, August 27–31, 2017, 2 pages.

1 INTRODUCTION

Music emotion recognition (MER) is a core technology of context-aware music recommendation. Users usually want music to amplify their emotions, for example while partying or driving. Music recommendation using content-based MER allows the emotion of the music to be aligned with that of the user in these scenarios. However, this is challenging because it is still unclear how music causes emotions. Numerous factors, such as tone, pace, and lyrics, are known to be involved in determining the emotion of music.

Existing studies tackle MER in various ways. They mainly formulate MER as either a classification or a regression problem. Laurier et al. use four emotion categories in [5], while Hu et al. use 18 categories in [3]. Both require additional feature engineering, such as rhythmic and tonal feature extraction and psychological feature extraction from words, whereas our model does not. [1] proposes a convolutional recurrent neural network for music tagging, inspiring us to extend it to a multimodal neural network for MER. [2] and [4] also suggest other neural network models for MER, formulating regression based on unsupervised learning. [6] and [7] tackle the sentence-level MER problem, but we formulate the song-level MER problem because it is more reasonable for recommending tracks to users.

Here we simplify MER to a polarity emotion (positive / negative) classification of tracks to reduce the uncertainty arising from many emotion categories, considering an application to a simple music recommendation scenario. We propose an end-to-end multimodal neural network model without an additional feature engineering process. We also create a new dataset of tracks served on a Korean music streaming service to guarantee high-quality data.

2 DATA DESCRIPTION

Our dataset comes from a major Korean music streaming service, NAVER Music (http://music.naver.com). It consists of 3,742 positive and 3,742 negative tracks with their lyrics, and the audio is represented as mel-spectrograms.

How do we separate positive and negative tracks? Tracks in NAVER Music are tagged by editors, and we use a predefined emotion word dictionary to separate positive and negative tags. For instance, positive emotion words include 'happy' and 'cheerful', while negative emotion words include 'sad' and 'lonely'. We then filter out tracks whose tags include both positive and negative words, and reject tracks shorter than one minute or with lyrics of fewer than 30 words. We use the first minute of each mel-spectrogram, and only keep the nouns, verbs, adjectives, and adverbs of the lyrics. Finally, each mel-spectrogram is represented as a 128 by 1024 matrix covering 128 mel bins and 1024 time slots corresponding to one minute of acoustic signal. We also obtain (27,496, 400) word vectors, where the vocabulary size |V| = 27,496 and the maximum length of a word sequence is 400.
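The paper does not include preprocessing code, but the shapes above are specific enough for a rough sketch. The snippet below assumes librosa for the mel-spectrograms and the Keras tokenizer for the lyrics (neither tool is named in the paper) and only aims to reproduce 128 x 1024 spectrograms from the first minute of audio plus padded word-index sequences of length 400; the sample rate and hop length are assumptions, and the Korean part-of-speech filtering step is omitted.

<pre>
# Minimal preprocessing sketch matching the shapes reported in Section 2.
# Audio parameters (sample rate, hop length) are assumptions chosen only so that
# one minute of audio yields roughly 1024 frames.
import librosa
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

def audio_to_mel(path, sr=22050, duration=60.0, n_mels=128, n_frames=1024):
    """Load the first minute of a track and return a (n_mels, n_frames) mel-spectrogram."""
    y, sr = librosa.load(path, sr=sr, duration=duration)
    hop = int(len(y) / n_frames)  # hop length chosen to give ~1024 frames (assumption)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels, hop_length=hop)
    mel = librosa.power_to_db(mel)
    return mel[:, :n_frames]      # 128 x 1024 matrix, as in Section 2

def lyrics_to_sequences(lyrics_texts, num_words=27496, maxlen=400):
    """Tokenize lyrics and pad/truncate them to the maximum sequence length of 400."""
    tok = Tokenizer(num_words=num_words)
    tok.fit_on_texts(lyrics_texts)
    seqs = tok.texts_to_sequences(lyrics_texts)
    return pad_sequences(seqs, maxlen=maxlen), tok
</pre>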
3 MULTIMODAL DEEP NETWORKS FOR MUSIC EMOTION RECOGNITION

Figure 1 illustrates our end-to-end multimodal neural network model that directly predicts a track's emotion from its audio and lyrics. The model has an audio branch and a lyrics branch, which take a (128, 1024) mel-spectrogram and a (27496, 400) padded word vector as input, respectively. At the bottom of the audio branch, five 1D convolution and max pooling layers process the mel-spectrogram as a sequence of 1D vectors x_a of length 128:

u_a = [Maxpooling(Conv(x_a))]^5   (1)

The five 1D convolution layers all have filter size 3 and produce 128, 128, 128, 64, and 64 output filters, respectively. The filter sizes of the five max pooling layers are 3, 3, 3, 2, and 2. We use the exponential linear unit (ELU) as the non-linear activation of the convolution layers. On top of these, we put two RNN (GRU) layers with output dimensionality 64 and one fully connected layer with weight matrix FC_aW to build an audio embedding vector v_a of length 64 before merging with the lyrics branch:

v_a = FC_aW{GRU^2(u_a)}   (2)

At the bottom of the lyrics branch, an embedding layer (weight matrix FC_eW) with output dimensionality 200 is followed by a 1D convolution layer with filter size 3 and 250 output filters, where the input word vector is x_l. On top of that, we put a global 1D max pooling layer because it is more robust to noise words than a non-global one. As in the audio branch, a fully connected layer with weight matrix FC_lW then builds a lyrics embedding vector v_l of length 64:

v_l = FC_lW{GlobalMaxpooling(Conv(FC_eW x_l))}   (3)

We concatenate the two branches because this shows the best performance. Lastly, we produce the final output by feeding the concatenated vector into one fully connected layer (weight matrix FC_mW) with output dimensionality 64, followed by a softmax layer, and compute the binary cross-entropy loss:

o = Softmax(ReLU(FC_mW{Concatenate(v_a, v_l)}))   (4)

The output vector o thus contains two values: the probabilities that a track carries positive or negative emotion.
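Section 4 states that the models were implemented with Keras on TensorFlow. The sketch below is our reading of Figure 1 and Equations (1)-(4), not the authors' code; the convolution padding mode, the activation of the lyrics-branch convolution, the handling of the padding index in the embedding, and the optimizer are assumptions.

<pre>
# Minimal Keras sketch of the multimodal architecture described in Section 3.
from tensorflow.keras import layers, models

# --- Audio branch: (1024 time slots, 128 mel bins) ---
audio_in = layers.Input(shape=(1024, 128), name="mel_spectrogram")
x = audio_in
conv_filters = [128, 128, 128, 64, 64]   # five Conv1D layers, kernel size 3 (Eq. 1)
pool_sizes = [3, 3, 3, 2, 2]
for f, p in zip(conv_filters, pool_sizes):
    x = layers.Conv1D(f, 3, padding="same", activation="elu")(x)  # padding mode is an assumption
    x = layers.MaxPooling1D(p)(x)
x = layers.GRU(64, return_sequences=True)(x)   # two GRU layers (Eq. 2)
x = layers.GRU(64)(x)
audio_emb = layers.Dense(64)(x)                # FC_a, audio embedding v_a

# --- Lyrics branch: padded word-index sequence of length 400 ---
lyrics_in = layers.Input(shape=(400,), name="lyrics")
y = layers.Embedding(input_dim=27496 + 1, output_dim=200)(lyrics_in)  # FC_e; +1 for padding index (assumption)
y = layers.Conv1D(250, 3, activation="relu")(y)  # activation is an assumption (Eq. 3)
y = layers.GlobalMaxPooling1D()(y)
lyrics_emb = layers.Dense(64)(y)                 # FC_l, lyrics embedding v_l

# --- Merge and classify (Eq. 4) ---
z = layers.Concatenate()([audio_emb, lyrics_emb])
z = layers.Dense(64, activation="relu")(z)       # FC_m
out = layers.Dense(2, activation="softmax")(z)   # positive / negative probabilities

model = models.Model([audio_in, lyrics_in], out)
# With a two-way softmax, categorical cross-entropy over one-hot labels plays the role
# of the binary cross-entropy loss mentioned in Section 3; the optimizer is an assumption.
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
</pre>

A call such as model.fit([mels, lyric_seqs], one_hot_labels, validation_split=0.1) would mirror the 90%/10% training/validation split described in Section 4.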
4 EVALUATION

We compare the classification accuracies of models using audio, lyrics, and both. We test five unimodal models and one multimodal model that combines the best unimodal models for audio and lyrics, as in Figure 1. We implemented all models with Keras on TensorFlow and trained them on a Tesla M40 GPU. We randomly split the dataset into 90% for training and the rest for validation, and report in Table 1 the classification accuracies obtained after 5 runs of the test for each model.

Table 1: Classification accuracies using audio, lyrics, and both

  Data     Model         Accuracy
  Audio    CNN           0.6479
  Audio    RNN           0.6303
  Audio    CNN+RNN       0.6619
  Lyrics   CNN           0.7815
  Lyrics   RNN           0.7716
  Both     CNN+RNN, CNN  0.8046

The 1D CNN + RNN model is the best among the models using audio, and the 1D CNN model is the best among the models using lyrics. The reason why the 1D CNN outperforms the RNN on lyrics may be that the word sequences are too long. It is also notable that the model using lyrics works better than the one using audio. Overall, the multimodal model using both audio and lyrics shows the best accuracy, 0.8046. Figure 2 presents the validation accuracy and loss of the best model for each modality (audio, lyrics, both); the models for lyrics and for both modalities converge somewhat more stably than the model for audio.

Figure 2: Validation accuracies and losses of the best model for each modality

5 CONCLUSION

We define MER as a polarity emotion classification task and propose a multimodal neural network model trained in an end-to-end manner without additional feature engineering. We show that lyrics are better features than audio for our problem, and that our multimodal model achieves the best accuracy, 80%, compared to the unimodal models. We will further investigate end-to-end deep learning strategies with more tracks and more emotion categories. Furthermore, we will apply our method to the context-aware music recommendation service of Clova (https://clova.ai), a cloud-based AI assistant platform developed as a collaboration between NAVER and LINE.

REFERENCES

[1] Keunwoo Choi, George Fazekas, Mark Sandler, and Kyunghyun Cho. 2016. Convolutional Recurrent Neural Networks for Music Classification. arXiv preprint arXiv:1609.04243 (2016).
[2] Eduardo Coutinho, George Trigeorgis, Stefanos Zafeiriou, and Björn W. Schuller. 2015. Automatically Estimating Emotion in Music with Deep Long-Short Term Memory Recurrent Neural Networks. In Working Notes Proceedings of the MediaEval 2015 Workshop.
[3] Xiao Hu and J. Stephen Downie. 2010. When Lyrics Outperform Audio for Music Mood Classification: A Feature Analysis. In 11th ISMIR 2010. 619–624.
[4] Moyuan Huang, Wenge Rong, Tom Arjannikov, Nan Jiang, and Zhang Xiong. 2016. Bi-Modal Deep Boltzmann Machine Based Musical Emotion Classification. In ICANN 2016. 199–207.
[5] Cyril Laurier, Jens Grivolla, and Perfecto Herrera. 2008. Multimodal Music Mood Classification Using Audio and Lyrics. In 7th ICMLA 2008. 688–693.
[6] Bin Wu, Erheng Zhong, Andrew Horner, and Qiang Yang. 2014. Music Emotion Recognition by Multi-label Multi-layer Multi-instance Multi-view Learning. In ACM MM 2014. 117–126.
[7] Yi-Hsuan Yang, Yu-Ching Lin, Ya-Fan Su, and Homer H. Chen. 2008. A Regression Approach to Music Emotion Recognition. IEEE Trans. Audio, Speech & Language Processing 16, 2 (2008), 448–457.