    Music Emotion Recognition via End-to-End Multimodal Neural
                            Networks
                  Byungsoo Jeon∗                                   Chanju Kim                                  Adrian Kim
           Carnegie Mellon University                         Clova, NAVER Corp.                           Clova, NAVER Corp.
              Pittsburgh, PA, USA                               Seongnam, Korea                             Seongnam, Korea
             jbsimdicd@gmail.com                           chanju.kim@navercorp.com                    adrian.kim@navercorp.com

                  Dongwon Kim                                    Jangyeon Park                               Jung-Woo Ha†
             Clova, NAVER Corp.                              Clova, NAVER Corp.                            Clova, NAVER Corp.
              Seongnam, Korea                                 Seongnam, Korea                               Seongnam, Korea
         dongwon.kim@navercorp.com                      jangyeon.park@navercorp.com                    jungwoo.ha@navercorp.com

ABSTRACT
Music emotion recognition (MER) is a key issue in user context-aware recommendation. Many existing methods require hand-crafted features from audio and lyrics. Here we propose a new end-to-end method for recognizing the emotions of tracks from their acoustic signals and lyrics via multimodal deep neural networks. We evaluate our method on about 7,000 K-pop tracks labeled with positive or negative emotion. The proposed method is compared to end-to-end unimodal models using audio signals or lyrics only. The experimental results show that our multimodal model achieves the best accuracy of 80%, and we discuss the reasons for these results.

KEYWORDS
Music Emotion Recognition, Music Recommendation, Multimodal
Neural Network

ACM Reference format:
Byungsoo Jeon, Chanju Kim, Adrian Kim, Dongwon Kim, Jangyeon Park, and Jung-Woo Ha. 2017. Music Emotion Recognition via End-to-End Multimodal Neural Networks. RecSys ’17 Poster Proceedings, Como, Italy, August 27–31, 2017, 2 pages.

∗ This work was performed in NAVER Corp.
† Corresponding author

RecSys ’17 Poster Proceedings, Como, Italy, 2017.

1    INTRODUCTION
Music emotion recognition (MER) is a core technology of context-aware music recommendation. Users usually want music that amplifies their emotions while partying or driving, for example. Music recommendation using content-based MER allows the emotion of the music to be aligned with that of the user in these scenarios. However, this is challenging because it is still unclear how music causes emotions. It is known that numerous factors such as tone, pace, and lyrics are involved in determining the emotion of a piece of music.
   Existing studies tackle MER in various ways. They mainly formulate MER as either a classification or a regression problem. Laurier et al. use four emotion categories in [5], while Hu et al. use 18 categories in [3]. Both require additional feature engineering, such as rhythmic and tonal feature extraction and psychological feature extraction from words, whereas our model does not. [1] proposes a convolutional recurrent neural network for music tagging, inspiring us to extend it to a multimodal neural network for MER. [2] and [4] also suggest other neural network models for MER, formulating regression based on unsupervised learning. [6] and [7] tackle the sentence-level MER problem, but we formulate a song-level MER problem because song-level predictions are more suitable for recommendation.
   Here we simplify MER to a polarity emotion (positive / negative) classification of tracks to reduce the uncertainty arising from many emotion categories, considering an application to a simple music recommendation scenario. We propose end-to-end multimodal neural network models that require no additional feature engineering process. We also create a new dataset of tracks served on a Korean music streaming service to guarantee high-quality data.

2    DATA DESCRIPTION
We describe our dataset from a well-known Korean music streaming service, NAVER Music (http://music.naver.com). It consists of 3,742 positive and 3,742 negative tracks with their lyrics, and the audio is represented as mel-spectrograms. How do we separate positive from negative tracks? Tracks in NAVER Music are tagged by editors, and we use a predefined emotion word dictionary to separate positive and negative tags. For instance, positive emotion words are ‘happy’ and ‘cheerful’, while negative emotion words are ‘sad’ and ‘lonely’. Then, we filter out the tracks whose tags include both positive and
negative words. We reject tracks that are shorter than one minute or whose lyrics contain fewer than 30 words. We use the first minute of each mel-spectrogram, and we keep only the nouns, verbs, adjectives, and adverbs among the words of the lyrics. Finally, each mel-spectrogram is represented as a 128 by 1024 matrix, with 128 mel bands and 1024 time slots corresponding to one minute of the acoustic signal. We also obtain (27,496, 400) word vectors, where the vocabulary size is |V| = 27,496 and the maximum length of a word sequence is 400.
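The paper does not include code, but the labeling and preprocessing steps described in this section can be sketched as follows. This is a minimal, hypothetical sketch: the tag dictionaries, sampling rate, and hop length are our assumptions, and librosa is used only for illustration.

# Minimal preprocessing sketch (assumptions: librosa for audio, a small
# hand-made emotion-word dictionary; dictionaries and parameters are hypothetical).
import librosa
import numpy as np

POSITIVE_WORDS = {"happy", "cheerful"}   # example entries mentioned in the text
NEGATIVE_WORDS = {"sad", "lonely"}

def label_from_tags(tags):
    """Return 1 / 0 for purely positive / negative tag sets, None otherwise."""
    pos = any(t in POSITIVE_WORDS for t in tags)
    neg = any(t in NEGATIVE_WORDS for t in tags)
    if pos and not neg:
        return 1
    if neg and not pos:
        return 0
    return None  # tracks with mixed or missing emotion tags are filtered out

def melspectrogram_first_minute(path, sr=22050, n_mels=128, n_frames=1024):
    """Load a track, keep the first minute, and return a (128, 1024) mel-spectrogram."""
    y, sr = librosa.load(path, sr=sr, duration=60.0)
    if len(y) < 60 * sr:                      # reject tracks shorter than one minute
        return None
    hop = int(60 * sr / n_frames)             # hop length chosen to yield about 1024 frames
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels, hop_length=hop)
    return librosa.power_to_db(mel)[:, :n_frames]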
Figure 1: Model structure for multimodal music emotion classification from acoustic signals and lyrics of tracks

3    MULTIMODAL DEEP NETWORKS FOR MUSIC EMOTION RECOGNITION
Figure 1 illustrates our end-to-end multimodal neural network model, which directly predicts a track's emotion from audio and lyrics. Our model has audio and lyrics branches, which take a (128, 1024) mel-spectrogram (audio) and a (27496, 400) padded word vector (lyrics) as inputs, respectively. At the bottom of the audio branch, there are five 1D convolution and max pooling layers that process the mel-spectrogram as a sequence of 1D vectors x_a of length 128.

    u_a = [Maxpooling(Conv(x_a))]^5    (1)

   The five 1D convolution layers all use filter size 3 and have 128, 128, 128, 64, and 64 output filters, respectively. The filter sizes of the five max pooling layers are 3, 3, 3, 2, and 2. We use the exponential linear unit (ELU) as the non-linearity of the convolution layers.
   On top of that, we put two RNN (GRU) layers whose output dimensionality is 64 and one fully connected layer with weight matrix FC_a^W to build an audio embedding vector v_a of length 64 before merging with the lyrics branch.

    v_a = FC_a^W {GRU^2(u_a)}    (2)
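As a rough illustration of the audio branch in Equations (1) and (2), the following Keras sketch stacks five Conv1D/MaxPooling1D blocks with ELU activations, two GRU layers, and a dense projection to a 64-dimensional embedding. Details not stated in the text (padding mode, layer names) are assumptions, not the authors' exact configuration.

# Keras sketch of the audio branch (Eq. (1)-(2)); unstated details are assumptions.
from tensorflow.keras import layers, Input

def build_audio_branch(n_mels=128, n_frames=1024):
    # Treat the mel-spectrogram as a length-1024 sequence of 128-dim vectors.
    audio_in = Input(shape=(n_frames, n_mels), name="audio_mel")
    x = audio_in
    filters = [128, 128, 128, 64, 64]
    pools = [3, 3, 3, 2, 2]
    for f, p in zip(filters, pools):
        x = layers.Conv1D(f, kernel_size=3, activation="elu", padding="same")(x)
        x = layers.MaxPooling1D(pool_size=p)(x)
    x = layers.GRU(64, return_sequences=True)(x)        # first GRU layer
    x = layers.GRU(64)(x)                               # second GRU layer summarizes u_a
    v_a = layers.Dense(64, name="audio_embedding")(x)   # FC_a builds the embedding v_a
    return audio_in, v_a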
   At the bottom of the lyrics branch, there is an embedding layer (weight matrix: FC_e^W) whose output dimensionality is 200, followed by a 1D convolution layer whose filter size and number of output filters are 3 and 250, where the input word vector is x_l. On top of that, we put a global 1D max pooling layer because it is more robust to noise words than a non-global one. As in the audio branch, there is one fully connected layer with weight matrix FC_l^W on top to build a lyrics embedding vector of length 64.

    v_l = FC_l^W {GlobalMaxpooling(Conv(FC_e^W x_l))}    (3)
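A corresponding sketch of the lyrics branch in Equation (3), again with unstated details (for example the convolution activation) treated as assumptions:

# Keras sketch of the lyrics branch (Eq. (3)); vocabulary size and sequence
# length come from Section 2, other choices are assumptions.
from tensorflow.keras import layers, Input

def build_lyrics_branch(vocab_size=27496, max_len=400):
    lyrics_in = Input(shape=(max_len,), name="lyrics_ids")        # padded word indices
    x = layers.Embedding(vocab_size, 200)(lyrics_in)              # FC_e, 200-dim word embedding
    x = layers.Conv1D(250, kernel_size=3, activation="elu")(x)    # 250 filters of size 3 (activation assumed)
    x = layers.GlobalMaxPooling1D()(x)                            # global pooling, robust to noise words
    v_l = layers.Dense(64, name="lyrics_embedding")(x)            # FC_l builds the embedding v_l
    return lyrics_in, v_l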
   We concatenate the two branches because this shows the best performance. Lastly, we produce the final output by feeding this concatenated vector into one fully connected layer (weight matrix: FC_m^W) whose output dimensionality is 64, followed by a softmax layer used to compute the binary cross-entropy loss.

    o = Softmax(ReLU(FC_m^W {Concatenate(v_a, v_l)}))    (4)

   As a result, the output vector o contains two values, the probabilities that a track carries positive or negative emotion.
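Combining the two branches as in Equation (4) might look like the following sketch, which reuses the branch builders sketched above; the optimizer and the use of categorical cross-entropy over the 2-way softmax are assumptions, not details given in the paper.

# Sketch of the fusion head (Eq. (4)): concatenate the two embeddings, apply a
# 64-unit dense layer with ReLU, and a 2-way softmax trained with cross-entropy.
from tensorflow.keras import layers, Model

audio_in, v_a = build_audio_branch()
lyrics_in, v_l = build_lyrics_branch()

merged = layers.Concatenate()([v_a, v_l])                 # merge audio and lyrics embeddings
h = layers.Dense(64, activation="relu")(merged)           # FC_m with ReLU
out = layers.Dense(2, activation="softmax", name="emotion")(h)

model = Model(inputs=[audio_in, lyrics_in], outputs=out)
model.compile(optimizer="adam",                           # optimizer assumed
              loss="categorical_crossentropy",            # binary task, 2-way softmax with one-hot labels
              metrics=["accuracy"])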
4    EVALUATION
We compare the classification accuracies of models using audio, lyrics, and both. We test five unimodal models and one multimodal model that consists of the best unimodal models for audio and lyrics, as in Figure 1. To test these models, we implemented them with Keras on TensorFlow, using a Tesla M40 GPU. Table 1 shows the classification accuracies for each model. We randomly split the dataset into 90% for training and the rest for validation, and obtain the results shown in Table 1 after 5 runs of the test for each model. The 1D CNN + RNN model is the best among the models using audio, and the 1D CNN model is the best among the models using lyrics. The reason why the 1D CNN is better than the RNN at predicting from lyrics may be that the word sequences are too long. It is also notable that the model for lyrics works better than that for audio. Overall, the multimodal model using audio and lyrics shows the best accuracy, 0.8046. Figure 2 presents the validation accuracy and loss of the best model for each modality (audio, lyrics, both). The models for lyrics and for both modalities show slightly more stable convergence than the model for audio.

Table 1: Classification accuracies using audio, lyrics, and both
  Data     Model           Accuracy
  Audio    CNN             0.6479
  Audio    RNN             0.6303
  Audio    CNN+RNN         0.6619
  Lyrics   CNN             0.7815
  Lyrics   RNN             0.7716
  Both     CNN+RNN, CNN    0.8046

Figure 2: Validation accuracies and losses of the best model for each modality
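A hypothetical training and evaluation sketch for the setup in this section (a random 90/10 train/validation split, with accuracy as the metric). Here X_audio, X_lyrics, and y stand for the preprocessed inputs and one-hot labels and, like the epoch and batch-size choices, are placeholders rather than the authors' settings.

# Training/evaluation sketch; data arrays and hyperparameters are placeholders.
import numpy as np
from sklearn.model_selection import train_test_split

idx_train, idx_val = train_test_split(np.arange(len(y)), test_size=0.1, random_state=0)

history = model.fit(
    [X_audio[idx_train], X_lyrics[idx_train]], y[idx_train],
    validation_data=([X_audio[idx_val], X_lyrics[idx_val]], y[idx_val]),
    epochs=20, batch_size=32)

val_loss, val_acc = model.evaluate([X_audio[idx_val], X_lyrics[idx_val]], y[idx_val])
print(f"validation accuracy: {val_acc:.4f}")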
5    CONCLUSION
We define MER as a polarity emotion classification task and propose a multimodal neural network model trained in an end-to-end manner without additional feature engineering. We show that lyrics are better features than audio for our problem, and our multimodal model achieves the best accuracy, 80%, compared to the unimodal models. We will further investigate end-to-end deep learning strategies with more tracks and more emotion categories. Furthermore, we will apply our method to the context-aware music recommendation service of Clova (https://clova.ai), a cloud-based AI assistant platform developed as a collaboration project by NAVER and LINE.

REFERENCES
[1] Keunwoo Choi, George Fazekas, Mark Sandler, and Kyunghyun Cho. 2016. Convolutional Recurrent Neural Networks for Music Classification. arXiv preprint arXiv:1609.04243 (2016).
[2] Eduardo Coutinho, George Trigeorgis, Stefanos Zafeiriou, and Björn W. Schuller. 2015. Automatically Estimating Emotion in Music with Deep Long-Short Term Memory Recurrent Neural Networks. In Working Notes Proceedings of the MediaEval 2015 Workshop.
[3] Xiao Hu and J. Stephen Downie. 2010. When Lyrics Outperform Audio for Music Mood Classification: A Feature Analysis. In 11th ISMIR 2010. 619–624.
[4] Moyuan Huang, Wenge Rong, Tom Arjannikov, Nan Jiang, and Zhang Xiong. 2016. Bi-Modal Deep Boltzmann Machine Based Musical Emotion Classification. In ICANN 2016. 199–207.
[5] Cyril Laurier, Jens Grivolla, and Perfecto Herrera. 2008. Multimodal Music Mood Classification Using Audio and Lyrics. In 7th ICMLA 2008. 688–693.
[6] Bin Wu, Erheng Zhong, Andrew Horner, and Qiang Yang. 2014. Music Emotion Recognition by Multi-label Multi-layer Multi-instance Multi-view Learning. In ACM MM 2014. 117–126.
[7] Yi-Hsuan Yang, Yu-Ching Lin, Ya-Fan Su, and Homer H. Chen. 2008. A Regression Approach to Music Emotion Recognition. IEEE Trans. Audio, Speech & Language Processing 16, 2 (2008), 448–457.