=Paper=
{{Paper
|id=Vol-1436/Paper65
|storemode=property
|title=MediaEval 2015: Recurrent Neural Network Approach to Emotion in Music Task
|pdfUrl=https://ceur-ws.org/Vol-1436/Paper65.pdf
|volume=Vol-1436
|dblpUrl=https://dblp.org/rec/conf/mediaeval/ChinW15
}}
==MediaEval 2015: Recurrent Neural Network Approach to Emotion in Music Task==
MediaEval 2015: Recurrent Neural Network Approach to Emotion in Music Task

Yu-Hao Chin and Jia-Ching Wang
Department of Computer Science and Information Engineering, National Central University, Taiwan, R.O.C.
kio19330@gmail.com, jiacwang@gmail.com

ABSTRACT
This paper describes our work for the "Emotion in Music" task of MediaEval 2015. The goal of the task is to predict the affective content of a song, expressed as time-continuous valence and arousal values. We adopt a deep recurrent neural network (DRNN) to predict the valence and arousal at each moment of a song, and the Limited-Memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) algorithm is used to set the update step size during back-propagation. A DRNN takes the targets of previous time segments into account when predicting the target of the current segment; such time-aware prediction is expected to outperform conventional machine learning models. After comparing it with our own feature set, we finally use the baseline feature set, which was adopted by last year's winning entry. A 10-fold cross-validation is used for the internal experiments. The system achieves r values of -0.5904 for valence and 0.4195 for arousal; the root-mean-squared errors (RMSE) for valence and arousal are 0.4054 and 0.3804, respectively. On the evaluation dataset, the system achieves r values of -0.0103±0.3420 for valence and 0.3417±0.2501 for arousal, with RMSEs of 0.3359±0.1614 and 0.2555±0.1255, respectively.

Copyright is held by the author/owner(s). MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany.

1. INTRODUCTION
The "Emotion in Music" task asks participants to construct a system that automatically predicts valence and arousal values for each 500 ms segment of a song. The development set consists of 431 clips, each 30 seconds long. When annotating a clip, the annotators slide a pointer on the monitor, so the valence and arousal annotations are provided in a time-continuous manner; please refer to [1] for details. Because the time-series annotations are correlated over time, we use a machine learning model that takes temporal context into account, namely a deep recurrent neural network (DRNN). The rest of the paper is organized as follows. Section 2 introduces a music information retrieval feature set. Section 3 introduces the recurrent neural network and the L-BFGS algorithm. Section 4 reports the performance of our system and discusses the experimental results. Section 5 concludes the paper.

2. FEATURE EXTRACTION
This section describes the feature set used in our work. This feature set was ultimately dropped from our submission because the baseline feature set performs better; we still introduce it here so that the experiments in Section 4 can be presented clearly.

We extract 10 kinds of features that are often used in music emotion research. A Matlab toolbox, the MIR toolbox [2], is used to extract features from each music clip. The extracted features are beat spectrum, event density, zero-crossing rate, MFCC, roll-off, brightness, roughness, chromagram, pitch, root-mean-square (RMS) energy, and low energy. These features can be grouped into five categories according to their properties: rhythm, timbre, tonality, pitch, and dynamics. Table 1 lists the class of each feature.

Table 1: Extracted features and the corresponding classes.
  Feature class | Feature name
  Dynamics      | RMS energy, low energy
  Rhythm        | beat spectrum, event density
  Timbre        | zero-crossing rate, roll-off, brightness, MFCC
  Pitch         | pitch
  Tonality      | chromagram

3. APPROACH
We use a deep recurrent neural network to regress the valence and arousal values of a song. In contrast to a feed-forward neural network, a deep recurrent neural network has at least one cyclic path of connections [3]. We set one layer to be recurrent; this layer takes its node values at the previous time step into account when computing the current node values. Such a model is called an L intermediate layer deep recurrent neural network in [4].

The weights of a recurrent neural network can be updated by various methods, such as back-propagation through time, real-time recurrent learning, and Kalman-filtering-based weight estimation. This paper adopts back-propagation through time. Specifically, the step size of each update is estimated by the L-BFGS algorithm, which computes the step size systematically rather than as the product of a constant learning rate and the delta values.

We adopt a multi-task architecture to predict valence and arousal jointly; this architecture has proved effective in various machine learning tasks. In addition, to incorporate contextual information across the segments of a song, we concatenate the features of several consecutive segments into a single input vector. The concatenation size is not analyzed in this paper; we empirically set it to three.

4. RESULTS AND DISCUSSION
This section consists of three subsections: experimental setup, experimental results, and discussion.

4.1 Experimental Setup
We adopt two feature sets: the MIR feature set described in Section 2 and the baseline feature set provided by the organizers. The features are normalized by z-scores (i.e., the mean is subtracted and the result divided by the standard deviation). We train a recurrent neural network to predict the valence and arousal values, implemented with the Matlab tool provided by [4]. The number of hidden layers is set to three, and only the second layer is recurrent. Each hidden layer has 500 nodes. A linear function is applied to each output node, and the sigmoid function is the activation function of each hidden node. The weights are initialized with Xavier's initialization trick [5]. The model is trained in batches, with a batch size of 388 and a learning rate of 2 for back-propagation; training is stopped after 100 iterations. To mitigate over-fitting, we add noise to each target during training. We do not pre-train the model. The experiments on the development set use 10-fold cross-validation. Performance is evaluated in terms of the correlation coefficient (r) for valence, r for arousal, root-mean-squared error (RMSE) for valence, and RMSE for arousal.

4.2 Experimental Results
Table 2 shows the performance on the development set of two approaches. In Approach 1, the MIR feature set is extracted from the clips and an RNN model predicts the VA values. In Approach 2, the baseline feature set provided by the MediaEval 2015 organizers is extracted from the clips, and an RNN model again predicts the VA values. Since the MIR feature set performs worse than the baseline feature set, we submitted only Approach 2; Table 3 shows the official results of our submission.

Table 2: Performance of the two approaches.
             |     Valence     |     Arousal
  Method     |    r      RMSE  |    r      RMSE
  Approach 1 | -0.5810  0.4179 |  0.4079  0.3869
  Approach 2 | -0.5904  0.4054 |  0.4195  0.3804

Table 3: Performance of Approach 2 on the evaluation dataset.
             |          Valence          |          Arousal
  Method     |      r          RMSE      |      r          RMSE
  Approach 2 | -0.0103±0.3420  0.3359±0.1614 | 0.3417±0.2501  0.2555±0.1255

4.3 Discussion
Our system does not obtain satisfactory results in the task. These results may stem from several weaknesses of the RNN: 1) the residual cannot be well back-propagated to the nodes in the first layer; 2) the computation at the current node cannot take its states from many time steps earlier into account; and 3) the hyper-parameters of the model (e.g., batch size, number of layers, activation function, normalization method, and learning rate) are not well tuned.

5. CONCLUSION
This paper presents our work for the 2015 MediaEval Emotion in Music task. Our system uses a recurrent neural network to regress valence and arousal values. The system achieves r values of -0.5904 for valence and 0.4195 for arousal, with RMSEs of 0.4054 and 0.3804, respectively. On the evaluation dataset, the system achieves r values of -0.0103±0.3420 for valence and 0.3417±0.2501 for arousal, with RMSEs of 0.3359±0.1614 and 0.2555±0.1255, respectively. Our system does not perform well in the task; the unsatisfactory results may be due to the lack of model tuning, and a pre-training process should be introduced to improve performance.

6. REFERENCES
[1] A. Aljanaki, Y.-H. Yang, and M. Soleymani. Emotion in Music task at MediaEval 2015. In MediaEval 2015 Workshop, 2015.
[2] O. Lartillot and P. Toiviainen. MIR in Matlab (II): A toolbox for musical feature extraction from audio. In Proc. Int. Conf. Music Information Retrieval, 2007, pages 127–130. [Online] http://users.jyu.fi/lartillo/mirtoolbox/.
[3] H. Jaeger. A tutorial on training recurrent neural networks, covering BPTT, RTRL, EKF and the echo state network approach. International University Bremen, 2013.
[4] P.-S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis. Singing-voice separation from monaural recordings using deep recurrent neural networks. In Proc. Int. Conf. Music Information Retrieval, 2014, pages 477–482.
[5] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
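As an illustrative aside, the per-segment preprocessing described in Section 4.1 and Section 3 (z-score normalization, then concatenating three consecutive segment feature vectors into one input vector) can be sketched in Python. NumPy stands in for the authors' Matlab pipeline; the array shapes and the repeat-first-segment padding are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def zscore(features):
    """Normalize each feature dimension to zero mean, unit variance."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + 1e-8)  # small offset guards constant dims

def concat_context(features, size=3):
    """Concatenate `size` consecutive segment vectors into one input vector.
    The first segment is repeated at the start so the output keeps one
    vector per segment (a padding choice not specified in the paper)."""
    n, _ = features.shape
    padded = np.vstack([np.repeat(features[:1], size - 1, axis=0), features])
    return np.hstack([padded[i:i + n] for i in range(size)])

# Toy example: 60 segments (30 s at 500 ms each), 10-dim features.
segs = np.random.randn(60, 10)
x = concat_context(zscore(segs), size=3)
print(x.shape)  # (60, 30)
```

Row i of the result holds segments i-2, i-1, and i, so each input carries the short-term past context the paper motivates.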
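The network topology of Section 4.1 (three hidden layers, only the second recurrent, sigmoid hidden units, a linear two-unit output for valence and arousal, Xavier initialization [5]) can be sketched as a minimal forward pass. This is an assumption-laden sketch, not the authors' Matlab implementation: biases are omitted, and the hidden size is reduced from 500 for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier(n_in, n_out):
    """Xavier/Glorot initialization: uniform in +-sqrt(6 / (fan_in + fan_out))."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def drnn_forward(x_seq, d_in, d_hid=500):
    """Forward pass: 3 sigmoid hidden layers, layer 2 recurrent,
    linear 2-unit output (valence, arousal) per time step."""
    W1, W2, W3 = xavier(d_in, d_hid), xavier(d_hid, d_hid), xavier(d_hid, d_hid)
    Wr = xavier(d_hid, d_hid)              # recurrent weights of layer 2
    Wo = xavier(d_hid, 2)                  # multi-task output: valence, arousal
    h2_prev = np.zeros(d_hid)
    outputs = []
    for x in x_seq:                        # one 500 ms segment per step
        h1 = sigmoid(x @ W1)
        h2 = sigmoid(h1 @ W2 + h2_prev @ Wr)   # depends on the previous step
        h3 = sigmoid(h2 @ W3)
        outputs.append(h3 @ Wo)            # linear output, no squashing
        h2_prev = h2
    return np.array(outputs)

va = drnn_forward(np.random.randn(60, 30), d_in=30, d_hid=64)
print(va.shape)  # (60, 2)
```

The `h2_prev @ Wr` term is the cyclic connection the paper refers to: the current prediction depends on the hidden state from the previous 500 ms segment.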
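The paper's use of L-BFGS to choose the update step, rather than a fixed learning rate times the deltas, can be illustrated on a toy least-squares problem. SciPy's `minimize` is only a stand-in here for the authors' Matlab step-size estimation inside BPTT; the problem and all names are hypothetical:

```python
import numpy as np
from scipy.optimize import minimize

# Tiny least-squares problem standing in for a network training loss.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
w_true = np.array([0.5, -1.0, 2.0, 0.0, 0.3])
y = X @ w_true

def loss(w):
    r = X @ w - y
    return 0.5 * r @ r

def grad(w):
    # Analytic gradient, playing the role of the back-propagated deltas.
    return X.T @ (X @ w - y)

# L-BFGS uses curvature estimates and a line search to size each step,
# instead of multiplying the gradient by a hand-picked learning rate.
res = minimize(loss, np.zeros(5), jac=grad, method="L-BFGS-B")
print(res.success)  # True
```

On this well-conditioned problem the recovered weights match `w_true` closely without any learning-rate tuning, which is the property the paper appeals to.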
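The evaluation measures used throughout the paper (correlation coefficient r and RMSE, reported on the evaluation set as mean ± standard deviation across songs) can be sketched as follows; the per-song aggregation is our reading of the "r ± std" notation, and the toy data are hypothetical:

```python
import numpy as np

def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between annotation and prediction."""
    yt, yp = y_true - y_true.mean(), y_pred - y_pred.mean()
    return float((yt @ yp) / (np.sqrt((yt @ yt) * (yp @ yp)) + 1e-12))

def rmse(y_true, y_pred):
    """Root-mean-squared error over the segments of one song."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Sanity check: a perfectly anti-correlated prediction gives r = -1.
t = np.linspace(-1, 1, 60)
print(round(pearson_r(t, -t), 4))   # -1.0
print(round(rmse(t, t), 4))         # 0.0

# Per-song scores aggregated as mean ± std, matching the paper's reporting.
songs = [(np.random.randn(60), np.random.randn(60)) for _ in range(5)]
rs = np.array([pearson_r(a, b) for a, b in songs])
print(f"{rs.mean():.4f}+-{rs.std():.4f}")
```

A negative mean r, as the system obtains for valence, means predictions tend to move against the annotations even when their magnitudes (the RMSE) look moderate.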