   MediaEval 2015: Recurrent Neural Network Approach to
                  Emotion in Music Task

                                             Yu-Hao Chin and Jia-Ching Wang
                               Department of Computer Science and Information Engineering
                                       National Central University, Taiwan, R.O.C.
                                      kio19330@gmail.com, jiacwang@gmail.com


ABSTRACT
This paper describes our work for the "Emotion in Music" task of MediaEval 2015. The goal of the task is to predict the affective content of a song, expressed as valence and arousal values given in a time-continuous fashion. We adopt a deep recurrent neural network (DRNN) to predict the valence and arousal at each moment of a song, and the Limited-Memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) algorithm is used to estimate the weight updates during back-propagation. The DRNN takes the targets of previous time segments into account when predicting the target of the current time segment, and such time-aware prediction is expected to outperform common machine learning models. After comparing it with our own feature set, we finally use the baseline feature set, which was adopted by last year's winning entry. A 10-fold cross-validation is used for the internal experiments. On the development set, the system achieves r values of -0.5904 for valence and 0.4195 for arousal, with root-mean-squared errors (RMSE) of 0.4054 and 0.3804, respectively. On the evaluation set, the system achieves r values of -0.0103±0.3420 for valence and 0.3417±0.2501 for arousal, with RMSEs of 0.3359±0.1614 and 0.2555±0.1255, respectively.

Copyright is held by the author/owner(s).
MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany.

1. INTRODUCTION
     The "Emotion in Music" task asks participants to construct a system that automatically predicts valence and arousal values for each 500 ms segment of a song. The development set of the database consists of 431 clips, each 30 seconds long. The annotators were asked to slide a pointer on a monitor while annotating the valence and arousal values of each clip, so the annotations are provided in a time-continuous manner; please refer to [1] for more details. Because successive annotations in such a time series are related to each other, we use a machine learning model that accounts for temporal context, namely a deep recurrent neural network (DRNN). The rest of the paper is organized as follows. Section 2 introduces a music information retrieval feature set. Section 3 introduces the recurrent neural network and the Limited-Memory Broyden–Fletcher–Goldfarb–Shanno algorithm. Section 4 reports the performance of our system and discusses the experimental results. Section 5 concludes our work.
2. FEATURE EXTRACTION
     This section describes the MIR feature set used in our work. This feature set was ultimately dropped from our submission because the baseline feature set obtains better performance; we still introduce it here so that the experiments in Section 4 can be described clearly.
     We extract features that are often utilized in music emotion research. The MIR toolbox [2], a Matlab toolbox, is used to extract them from each music clip. The extracted features are beat spectrum, event density, zero-crossing rate, MFCC, roll-off, brightness, roughness, chromagram, pitch, root-mean-square (RMS) energy, and low energy. These features can be classified into five categories according to their properties: rhythm, timbre, tonality, pitch, and dynamics. Table 1 lists the class of each feature; a rough Python sketch of a comparable extraction pipeline follows the table.

Table 1: Extracted features and the corresponding classes.

  Feature class   Feature name
  Dynamics        RMS energy, low energy
  Rhythm          beat spectrum, event density
  Timbre          zero-crossing rate, roll-off, brightness, MFCC
  Pitch           pitch
  Tonality        chromagram
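
     The features above are extracted with the Matlab MIR toolbox [2]. Purely as an illustration, the following minimal Python sketch computes a roughly comparable subset with librosa, which is a stand-in rather than the toolbox used in the paper: its feature definitions differ from the MIR toolbox's, the averaging of frame-level descriptors over 500 ms segments is our assumption about how the features align with the task's segments, and rhythm features such as the beat spectrum are omitted.

import numpy as np
import librosa

def extract_segment_features(path, segment=0.5, hop=512):
    """Frame-level descriptors averaged over 500 ms segments."""
    y, sr = librosa.load(path, sr=22050)
    feats = np.vstack([
        librosa.feature.zero_crossing_rate(y, hop_length=hop),        # timbre
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop),  # timbre
        librosa.feature.spectral_rolloff(y=y, sr=sr, hop_length=hop), # timbre
        librosa.feature.rms(y=y, hop_length=hop),                     # dynamics
        librosa.feature.chroma_stft(y=y, sr=sr, hop_length=hop),      # tonality
    ]).T                                          # (frames, 28 dimensions)
    frames_per_seg = int(segment * sr / hop)      # about 21 frames per 500 ms
    n_seg = len(feats) // frames_per_seg
    # Average the frame-level features inside each 500 ms segment.
    return feats[:n_seg * frames_per_seg].reshape(
        n_seg, frames_per_seg, -1).mean(axis=1)   # (segments, 28)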
3. APPROACH
     We use a deep recurrent neural network to regress the valence and arousal values of a song. Unlike a feed-forward neural network, a deep recurrent neural network has at least one cyclic path of connections [3]. We make one layer recurrent: when computing the current values of its nodes, this layer also considers the values of those nodes at the previous time step. Such a model is called an L-intermediate-layer deep recurrent neural network in [4].
     The weights of a recurrent neural network can be updated using various methods, such as back-propagation through time, real-time recurrent learning, and Kalman-filtering-based weight estimation. This paper adopts back-propagation through time. Specifically, the step size of each update is estimated by the Limited-Memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) algorithm, which computes the step size systematically rather than multiplying the delta values by a constant learning rate.
     We adopt a multi-task architecture to predict valence and arousal jointly; this architecture has proved effective in various machine learning tasks. Furthermore, to incorporate contextual information across the segments of a song, we concatenate the features of several consecutive segments into a single input vector for the model. The size of this concatenation is not analyzed in this paper; we empirically set it to three. A toy sketch of the resulting pipeline is given below.
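
     The sketch below is a toy Python/numpy reconstruction of this approach, not the authors' Matlab implementation: it uses a single recurrent (Elman) layer instead of the three-hidden-layer network described in Section 4.1, a window of three concatenated segments, and a two-dimensional linear output for joint (multi-task) valence/arousal prediction. As in the paper, L-BFGS chooses the step size, but scipy's implementation here works from finite-difference gradients rather than gradients obtained by back-propagation through time, so it is only practical at this toy scale.

import numpy as np
from scipy.optimize import minimize

def make_context_windows(feats, size=3):
    """Concatenate `size` consecutive segment feature vectors per input."""
    T, d = feats.shape
    return np.stack([feats[t:t + size].ravel() for t in range(T - size + 1)])

def unpack(theta, d_in, d_h):
    """Split the flat parameter vector into the three weight matrices."""
    i = d_h * d_in
    W_in = theta[:i].reshape(d_h, d_in)
    W_rec = theta[i:i + d_h * d_h].reshape(d_h, d_h)
    W_out = theta[i + d_h * d_h:].reshape(2, d_h)
    return W_in, W_rec, W_out

def forward(theta, X, d_h):
    """One recurrent (Elman) layer with sigmoid units and a linear
    two-dimensional output: valence and arousal predicted jointly."""
    W_in, W_rec, W_out = unpack(theta, X.shape[1], d_h)
    h, outputs = np.zeros(d_h), []
    for x in X:                                   # one 500 ms segment at a time
        h = 1.0 / (1.0 + np.exp(-(W_in @ x + W_rec @ h)))
        outputs.append(W_out @ h)
    return np.array(outputs)                      # (T, 2)

def loss(theta, X, Y, d_h):
    return np.mean((forward(theta, X, d_h) - Y) ** 2)

# Toy data standing in for one 30 s clip (60 segments, 20-dim features).
rng = np.random.default_rng(0)
feats = rng.normal(size=(60, 20))
X = make_context_windows(feats, size=3)           # window of three segments
Y = rng.uniform(-1, 1, size=(len(X), 2))          # valence/arousal targets

d_in, d_h = X.shape[1], 8
theta0 = rng.normal(scale=0.1, size=d_h * d_in + d_h * d_h + 2 * d_h)

# L-BFGS determines the step size itself instead of using a fixed
# learning rate; gradients are approximated by finite differences here.
res = minimize(loss, theta0, args=(X, Y, d_h), method="L-BFGS-B",
               options={"maxiter": 25})
print("final MSE:", res.fun)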

4. RESULTS AND DISCUSSION
     This section consists of three subsections: experimental setup, experimental results, and discussion.
4.1 Experimental Setup
     We adopt two feature sets: the MIR feature set described in Section 2 and the baseline feature set provided by the organizers. The features are normalized by z-scores (i.e., the mean is subtracted and the result is divided by the standard deviation). We train a recurrent neural network to predict the valence and arousal values, implemented using the Matlab tool provided by [4]. The number of hidden layers is set to three, and only the second layer is recurrent. The number of hidden nodes in each layer is set to 500. A linear function is applied to each output node, and the sigmoid function is adopted as the activation function of each hidden node. The weights are initialized with Xavier's initialization trick [5]. We train the model in batches, with the batch size set to 388 and the learning rate of back-propagation set to 2, and stop training after 100 iterations. To avoid over-fitting, we add noise to each target during training; we do not pre-train the model. The experiments on the development set use 10-fold cross-validation. A small sketch of the normalization and target-noise steps follows.
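
     A minimal numpy sketch of the z-score normalization and target-noise steps just described; the noise scale is our assumption, since the paper does not state it.

import numpy as np

def zscore(train, test):
    """Normalize by the training-set mean and standard deviation."""
    mu, sigma = train.mean(axis=0), train.std(axis=0)
    return (train - mu) / sigma, (test - mu) / sigma

def noisy_targets(Y, scale=0.01, seed=0):
    """Jitter the valence/arousal targets slightly to reduce over-fitting.
    The scale of 0.01 is a placeholder, not a value from the paper."""
    rng = np.random.default_rng(seed)
    return Y + rng.normal(scale=scale, size=Y.shape)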
     Performance is evaluated in terms of Pearson's correlation coefficient (r) and the root-mean-squared error (RMSE), each computed separately for valence and arousal.
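
     These metrics can be computed as below. The ± values reported for the evaluation set are consistent with per-clip means and standard deviations, which is what this sketch assumes.

import numpy as np

def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between two sequences."""
    return np.corrcoef(y_true, y_pred)[0, 1]

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

# Per-clip evaluation: compute r and RMSE over each clip's 500 ms
# segments, then report the mean and standard deviation across clips.
def evaluate(clips):                      # clips: list of (y_true, y_pred)
    rs = [pearson_r(t, p) for t, p in clips]
    errs = [rmse(t, p) for t, p in clips]
    return (np.mean(rs), np.std(rs)), (np.mean(errs), np.std(errs))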

4.2 Experimental Results
     Table 2 shows the performance on the development set of two approaches. In Approach 1, the MIR feature set is extracted from the clips and an RNN model predicts the valence/arousal (VA) values. In Approach 2, the baseline feature set provided by the MediaEval 2015 organizers is extracted from the clips, and an RNN model again predicts the VA values.

Table 2: Performance of the two approaches on the development set.

                Valence             Arousal
  Method        r         RMSE      r         RMSE
  Approach 1   -0.5810    0.4179    0.4079    0.3869
  Approach 2   -0.5904    0.4054    0.4195    0.3804
Table 3: Performance of Approach 2 on the evaluation dataset.

                Valence                          Arousal
  Method        r                RMSE            r                RMSE
  Approach 2   -0.0103±0.3420    0.3359±0.1614   0.3417±0.2501    0.2555±0.1255

     Since the MIR feature set performs worse than the baseline feature set, we only submitted Approach 2. Table 3 shows the official results of our submission.

4.3 Discussion
     Our system clearly does not obtain satisfying results in this task. These results may stem from several weaknesses of the RNN: 1) the residual cannot be well back-propagated to the nodes in the first layer; 2) the computation at the current node cannot take states from many time steps earlier into account; and 3) the parameters of the model (e.g., batch size, number of layers, activation function, normalization method, and learning rate) are not well chosen.

5. CONCLUSION
     This paper presents our work on the 2015 MediaEval Emotion in Music task. Our system adopts a recurrent neural network to regress the valence and arousal values. The system achieves r values of -0.5904 for valence and 0.4195 for arousal, with root-mean-squared errors (RMSE) of 0.4054 and 0.3804, respectively. On the evaluation dataset, the system achieves r values of -0.0103±0.3420 for valence and 0.3417±0.2501 for arousal, with RMSEs of 0.3359±0.1614 and 0.2555±0.1255, respectively. Our system does not perform well in the task; the unsatisfactory results may be due to the lack of model tuning. A pre-training process should be introduced to improve the performance.

6. REFERENCES
[1] A. Aljanaki, Y.-H. Yang, and M. Soleymani. Emotion in Music task at MediaEval 2015. In MediaEval 2015 Workshop, 2015.
[2] O. Lartillot and P. Toiviainen. MIR in Matlab (II): A toolbox for musical feature extraction from audio. In Proc. Int. Conf. Music Information Retrieval, pages 127-130, 2007. [Online] http://users.jyu.fi/lartillo/mirtoolbox/.
[3] H. Jaeger. A tutorial on training recurrent neural networks, covering BPTT, RTRL, EKF and the echo state network approach. International University Bremen, 2013.
[4] P.-S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis. Singing-voice separation from monaural recordings using deep recurrent neural networks. In Proc. Int. Conf. Music Information Retrieval, pages 477-482, 2014.
[5] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proc. AISTATS, 2010.